Saturday, December 27, 2008
building jaunty using git
Day 1:
Yesterday, I re-visited the task of building the kernel for ubuntu-jaunty. Obtaining the source using git is covered here. Compiling the source is covered here.
Snag1:
I was on Hardy (not Intrepid) and Hardy repositories do not have a needed utility makedumpfile. After a few hours of compiling, the build failed as it couldn't find makedumpfile. You can save yourself the trouble and manually install the package from here. Once you download the makedumpfile package, you can install it like so:
dpkg -i /path/to/file/makedumpfile_1.2.9-0ubuntu3_i386.deb
So I'm now starting a build again, but this time, to save some time, I will build just the generic flavor, like so:
AUTOBUILD=1 NOEXTRAS=1 fakeroot debian/rules binary-generic
I'm also removing the source directory and cloning the repository fresh. This is to
make sure that there are no conflicts with any files left over from the previous
build.
Day 2:
So today I got the build to go all the way. It generated two debs - the linux image
and headers respectively. I installed the image like so:
dpkg -i linux-image-2.6.28-4-generic_2.6.28-4.5_i386.deb
Since I had modified the /boot/grub/menu.1st file, this gave me the option of overwriting
it or keeping the old. I kept the old file as there were entries to all my
kernel versions I wanted to keep. Later I manually added the stanza
for the new kernel to /boot/grub/menu.1st like so:
title Ubuntu 8.04.1, git kernel 2.6.28-4-generic
root (hd0,0)
kernel /vmlinuz-2.6.28-4-generic root=UUID=b79346d4-7799-43ae-a033-99d70b6784be ro quiet splash
initrd /initrd.img-2.6.28-4-generic
quiet
Snag 2:
Then I tried installing the headers like so:
dpkg -i linux-headers-2.6.28-4-generic_2.6.28-4.5_i386.deb
But this failed as there was a dependency on linux-headers-2.6.28-4 which was not
present on the system. I found it here. I installed it like so:
dpkg -i linux-headers-2.6.28-4_2.6.28-4.5_all.deb
After that, I could successfully install the headers.
After which, I could reboot into the new kernel.
Wireless:
However my wireless connection did not work. Looking at the dmesg output I found these:
thushara@gini-sisila:~$ dmesg | grep acx
[ 12.782440] acx: Loaded combined PCI/USB driver, firmware_ver=default
[ 12.782450] acx: compiled to use 32bit I/O access. I/O timing issues might occur, such as non-working firmware upload. Report them
[ 12.782532] acx_pci 0000:00:0b.0: PCI INT A -> Link[LNKD] -> GSI 10 (level, low) -> IRQ 10
[ 12.782758] acx: found ACX111-based wireless network card at 0000:00:0b.0, irq:10, phymem1:0xE2020000, phymem2:0xE2000000, mem1:0xe0edc000, mem1_size:8192, mem2:0xe08c0000, mem2_size:131072
[ 12.783790] acx: loading firmware for acx1111 chipset with radio ID 16
[ 12.783801] acx_pci 0000:00:0b.0: firmware: requesting acx/1.2.1.34/tiacx111c16
[ 20.290162] acx: firmware image 'acx/1.2.1.34/tiacx111c16' was not provided. Check your hotplug scripts
[ 20.290177] acx_pci 0000:00:0b.0: firmware: requesting acx/1.2.1.34/tiacx111
[ 20.772774] acx: firmware image 'acx/1.2.1.34/tiacx111' was not provided. Check your hotplug scripts
[ 20.772786] acx: reset_dev() FAILED
[ 20.772840] acx_pci 0000:00:0b.0: PCI INT A disabled
[ 20.788119] acx_pci: probe of 0000:00:0b.0 failed with error -5
[ 20.788361] usbcore: registered new interface driver acx_usb
So it looked like the firmware files were missing. Doing a quick `locate tiacx` listed
that the firmware files were available on my previous kernel lib path but not on
the current kernel firmware path.
So I created the firmware directory for the acx driver under the directory for
the new kernel and copied just the single file mentioned by the dmesg to that
location, like so:
sudo cp /lib/firmware/2.6.24-19-generic/acx/1.2.1.34/tiacx111c16 /lib/firmware/`uname -r`/acx/1.2.1.34
After this, I could reboot into the new kernel and the wireless worked.
Tuesday, December 16, 2008
Java has no unsigned types
I came across this today as I need to create a hash that maps a tuple to an object. The tuple comprises of two keys. Each key is in reality an unsigned integer (32 bit). I was hoping to construct a key for the hash by shifting one key 32 bits and ORing them together to make a 64bit key.
A good article on this can be found here.
The problem with not being able to use an unsigned type supported by the language is the introduction of subtle bugs.
Here's an example:
int x = 0x865a672b;
long y = (0x12345678<<32) | x
1) you can shift an integer (32 bit value) a maximum of 31 bits. in fact if you were to do:
val << shift, where val is an integer, java internally does val << (shift%32)
Thus in this case, 0x12345678 is not shifted at all.
2) so the 0x12345678 never gets promoted to a 64-bit long. Thus we expected 0x1234567800000000, but we have 0x12345678.
3) 0x865a672b is a negative number (as it is an integer). So the 64-bit representation would be 0xffffffff865a672b
4) so we're in fact performing 0x12345678 | 0xffffffff865a672b = 0xffffffff967e777b
So how about this? (We're using longs now)
long x = 0x865a672b;
long y = (0x12345678L<<32) | x;
This gives us 0xffffffff865a672b which is still not what we want, but close. So what is happening here is that even though we declared x to be a long, 0x865a672b is treated as an integer, and java converts that negative integer to a long and in the process keeps the sign (-ve). So x is really 0xffffffff865a672b. So when we OR this to 0x1234567800000000, the high order 32 bits are lost.
Ok, last try:
long x = 0x865a672bL;
long y = (0x12345678L<<32) | x;
Notice we're coercing 0x865a672bL to be a long (the L at the end). This works, giving us 0x12345678865a672b. Enough already!
So if you want to work with unsigned 32-bit values in java, always work in long data type and convert any literal to long with an L at the end.
Friday, November 21, 2008
apt-get structure of repo listing
http://www.ibiblio.org/gferg/ldp/giles/repository/repository-2.html
Thursday, October 30, 2008
Java string replace is slow
Java String.replace uses regular expressions and can be slow, especially for large strings requiring many replacements.
If you don't need regular expression replacement, you can use this function I wrote that performs better.
static public String replaceString(String orig, String search, String replace) {
int cap = replace.length() > search.length() ? orig.length() + (replace.length() - search.length())*20 : orig.length();
StringBuilder out = new StringBuilder(cap);
int prev = 0;
CharSequence okPart;
for (int i=orig.indexOf(search); i!=-1; i=orig.indexOf(search, prev)) {
okPart = orig.subSequence(prev, i);
out.append(okPart).append(replace);
prev = i + search.length();
}
if (prev < orig.length()) {
okPart = orig.subSequence(prev, orig.length());
out.append(okPart);
}
return out.toString();
}
Here is the performance data with Java String.replace:
Here is the performance data with the new function:
Interpreting the data:
With String.replace, time inside doFlowAction() was broken down 71/29% for enQueueWBImpression/getCPCWidgetCache. The hottest function inside getCPCWidgetCache was String.replace that accounted for 93% of the time.
With the new replacement function,
With String.replace, time inside doFlowAction() was broken down 89/11% for enQueueWBImpression/getCPCWidgetCache. This reduction of time spent inside getCPCWidgetCache (a 62% gain) is due to the replacement of the String.replace with the new string function.
If you don't need regular expression replacement, you can use this function I wrote that performs better.
static public String replaceString(String orig, String search, String replace) {
int cap = replace.length() > search.length() ? orig.length() + (replace.length() - search.length())*20 : orig.length();
StringBuilder out = new StringBuilder(cap);
int prev = 0;
CharSequence okPart;
for (int i=orig.indexOf(search); i!=-1; i=orig.indexOf(search, prev)) {
okPart = orig.subSequence(prev, i);
out.append(okPart).append(replace);
prev = i + search.length();
}
if (prev < orig.length()) {
okPart = orig.subSequence(prev, orig.length());
out.append(okPart);
}
return out.toString();
}
Here is the performance data with Java String.replace:
Here is the performance data with the new function:
Interpreting the data:
With String.replace, time inside doFlowAction() was broken down 71/29% for enQueueWBImpression/getCPCWidgetCache. The hottest function inside getCPCWidgetCache was String.replace that accounted for 93% of the time.
With the new replacement function,
With String.replace, time inside doFlowAction() was broken down 89/11% for enQueueWBImpression/getCPCWidgetCache. This reduction of time spent inside getCPCWidgetCache (a 62% gain) is due to the replacement of the String.replace with the new string function.
Friday, October 24, 2008
Find the libraries your executable needs
LD_DEBUG is an environment variable that can be used to look under the hood of the Linux loader.
For ex: setting LD_DEBUG to libs will show you what the loader is doing to find the shared libraries needed to run an executable.
output:
mpire@sandbox mpire $ LD_DEBUG=libs ls
23994: find library=libacl.so.1; searching
23994: search path=/site/mpire/sys/mysql/lib/mysql/tls/i686/m ....
Thursday, October 16, 2008
mysql - find missing rows in a join
The left join can be used to get all rows from a secondary table matching the rows of the first table, along with those rows from the first table including any rows that doesn't find a match on the secondary table. for rows with no match, NULL would be returned on the secondary table column. this can be useful when you want to find the rows that have no match:
select a.id, b.id from a left join b on (a.secid=b.id) where b.id is null;
And most times, you want to correct this by providing a good value for the NULL column in the secondary table. If you want to set all those rows with a NULL column to the same value, you can do:
update a left join b on (a.secid=b.id) set a.secid=12629002 where b.id is null;
Wednesday, October 15, 2008
Java - you can't return a value through a parameter to a function
In Java all parameters are strictly input. There are no output parameters. Anything you need to return from a function needs to be returned through the function return value.
[Here's a rationale for why Java disallows out parameters.]
Of course what I said is not strictly true that if you pass an object to a function, you can change a member variable of the object within the function and that would be a way to pass a value back to the caller.
But it is often the case, you want two values to be passed back to the caller. With Java, you would need to implement a little class with these two objects and pass that class back. This is the example I'm going through today:
I need to get a resultset from a JDO query. So I would like to move the JDO queries to a Manager class, and code a function like:
Collection getResults(PersistenceManager pm);
But the problem is that inside getResults, I need to instantiate a query object that takes up some database resources on the client side. It is a good idea to clean up the query object so that the database handles could be freed. It is very important if the query is executed repeatedly in a loop, or if the result set is large.
The nice thing about using the above call at the call site is its elegance:
Collection results = mgr.getResults(pm);
for (Object o : results) {
// process the result records
}
Now this would change to something like:
CollectionAndQuery cq = results.getResults(pm);
for (Object o: cq.results) {
// process
}
cq.query.closeAll();
And the CollectionAndQuery class has to be defined at global scope as it is needed in both the call site and the manager.
Now, if Java had a generic Pair object, we could have used that for this purpose. [CollectionAndQuery is really a specialized Pair object] Unfortunately there is no Pair object in Java.
[Here's a rationale for why Java disallows out parameters.]
Of course what I said is not strictly true that if you pass an object to a function, you can change a member variable of the object within the function and that would be a way to pass a value back to the caller.
But it is often the case, you want two values to be passed back to the caller. With Java, you would need to implement a little class with these two objects and pass that class back. This is the example I'm going through today:
I need to get a resultset from a JDO query. So I would like to move the JDO queries to a Manager class, and code a function like:
Collection getResults(PersistenceManager pm);
But the problem is that inside getResults, I need to instantiate a query object that takes up some database resources on the client side. It is a good idea to clean up the query object so that the database handles could be freed. It is very important if the query is executed repeatedly in a loop, or if the result set is large.
The nice thing about using the above call at the call site is its elegance:
Collection results = mgr.getResults(pm);
for (Object o : results) {
// process the result records
}
Now this would change to something like:
CollectionAndQuery cq = results.getResults(pm);
for (Object o: cq.results) {
// process
}
cq.query.closeAll();
And the CollectionAndQuery class has to be defined at global scope as it is needed in both the call site and the manager.
Now, if Java had a generic Pair object, we could have used that for this purpose. [CollectionAndQuery is really a specialized Pair object] Unfortunately there is no Pair object in Java.
Friday, October 10, 2008
Tuesday, October 07, 2008
Java has no Pair object?
Often times we work with pairs of objects vs single objects. We may need to store a list of such pairs. STL provides a pair class. Java does not. It would be a useful addition.
Currently I'm building a TreeMap and I need to use a set of pair objects (int weight -> advertisement) to populate it. Not having a pair object leads me to implement a pair data structure.
Currently I'm building a TreeMap and I need to use a set of pair objects (int weight -> advertisement) to populate it. Not having a pair object leads me to implement a pair data structure.
Java has no upper_bound?
I can't find an upper_bound() equivalent in the Java sorted containers.
Looking at TreeMap, we can use the TreeMap.tailMap.firstKey to get the lower_bound.
No function for upper_bound.
I googled on this for a bit and came across some posts that mis-represented what lower_bound, upper_bound means using the standard definition.
This is a useful function and we can code it up with TreeMap.tailmap, but it is useful enough I feel it should be available in the sorted containers in the JDK.
Today I needed an upper_bound() utility for the following task. I keep track of a set of advertisements to be served to the browser. Each advertisement has a weight associated with it based on its past performance. The weights of all advertisements add up to 100.
So for each web request, I need to return an advertisement based on these weights. If I had 2 advertisements with weights 80,20 then I would return the first advertisement 80% of the time.
I implement this with a TreeMap that maps the cumulative weight to the Advertisement. So for the example above it will look like this:
80 -> Ad 1
100 -> Ad 2
Now, all I need to do is to generate a random number greater >= 0 and < 100, and take its upper_bound. That would point me to the advertisement I should return.
Using lower_bound is technically in-correct. Unless I generate the random numbers >=1 and <=100. This can be done, but ideally if Java had an upper_bound() there will be less such work we have to do as programmers.
Looking at TreeMap, we can use the TreeMap.tailMap.firstKey to get the lower_bound.
No function for upper_bound.
I googled on this for a bit and came across some posts that mis-represented what lower_bound, upper_bound means using the standard definition.
This is a useful function and we can code it up with TreeMap.tailmap, but it is useful enough I feel it should be available in the sorted containers in the JDK.
Today I needed an upper_bound() utility for the following task. I keep track of a set of advertisements to be served to the browser. Each advertisement has a weight associated with it based on its past performance. The weights of all advertisements add up to 100.
So for each web request, I need to return an advertisement based on these weights. If I had 2 advertisements with weights 80,20 then I would return the first advertisement 80% of the time.
I implement this with a TreeMap that maps the cumulative weight to the Advertisement. So for the example above it will look like this:
80 -> Ad 1
100 -> Ad 2
Now, all I need to do is to generate a random number greater >= 0 and < 100, and take its upper_bound. That would point me to the advertisement I should return.
Using lower_bound is technically in-correct. Unless I generate the random numbers >=1 and <=100. This can be done, but ideally if Java had an upper_bound() there will be less such work we have to do as programmers.
Friday, October 03, 2008
What's wrong with this Java code?
//returns a date that is before today
public static Date getBackDate(int numDays) {
Date date = new Date();
date.setTime(date.getTime()-numDays*24*60*60*1000);
return date;
}
public static Date getBackDate(int numDays) {
Date date = new Date();
date.setTime(date.getTime()-numDays*24*60*60*1000);
return date;
}
Wednesday, October 01, 2008
Java PriorityQueue initialCapacity cannot be zero
It is mentioned in the docs, but who reads them, right?
Looking through the PriorityQueue implementation, I couldn't really tell why this restriction is necessary. They use an Object[] array to represent the heap and store items from index 1 vs index 0 to help the math. They also have a - int size - that keeps track of the number of elements in the PriorityQueue.
Seems like they could have made this work even for a PriorityQueue with an initialCapacity of 0. This could avoid some bugs from the call-site where a PriorityQueue is initialized with an initialCapacity obtained from another source (ex: the number of records that match a certain criteria in a table) and this could very well be zero.
Update:
Andrew Hayley who made this patch - http://gcc.gnu.org/ml/java-patches/2006-q3/msg00426.html - did that only to pass the tests. As I hear from other devs involved in the JDK, I will add those details here.
Looking through the PriorityQueue implementation, I couldn't really tell why this restriction is necessary. They use an Object[] array to represent the heap and store items from index 1 vs index 0 to help the math. They also have a - int size - that keeps track of the number of elements in the PriorityQueue.
Seems like they could have made this work even for a PriorityQueue with an initialCapacity of 0. This could avoid some bugs from the call-site where a PriorityQueue is initialized with an initialCapacity obtained from another source (ex: the number of records that match a certain criteria in a table) and this could very well be zero.
Update:
Andrew Hayley who made this patch - http://gcc.gnu.org/ml/java-
Friday, September 19, 2008
perl and mysql - what's wrong with the import data?
mysql> load data infile '/Users/thushara/db/sitecat1.csv' into table sitecategorytoproductsetcategory;
ERROR 1062 (23000): Duplicate entry '0' for key 1
That is what mysql complained on my import file which looks like this:
[~/db] head sitecat1.csv 1000 1 2001 68001
1001 1 2001 98001
1002 1 2001 107001
1003 1 2001 59001
1004 1 2001 94001
Which i generated from a file like this:
[~/db] head sitecat.csv
2001,68001
2001,98001
2001,107001
2001,59001
2001,94001
2001,88001
2001,58001
2001,92001
2001,87001
2072,80001
using this:
[~/db] perl -ne '@x=split(/,/);$n = 1000+$i++;print "$n\t1\t$x[0]\t$x[1]\n"' sitecat.csv > sitecat1.csv
problem identified:
there is an extra new line being generated (not the last \n i add there) which creates a blank line in the output file for every data line. when mysql tries to import the blank line, it gives the "Duplicate entry '0' for key 1" error.
correct perl:
[~/db] perl -ne '@x=split(/,/);$n = 1000+$i++;print "$n\t1\t$x[0]\t$x[1]"' sitecat.csv > sitecat1.csv
which creates:
[~/db] head sitecat1.csv
1000 1 2001 68001
1001 1 2001 98001
1002 1 2001 107001
1003 1 2001 59001
1004 1 2001 94001
1005 1 2001 88001
1006 1 2001 58001
1007 1 2001 92001
1008 1 2001 87001
1009 1 2072 80001
no blank lines which gives:
mysql> load data infile '/Users/thushara/db/sitecat1.csv' into table sitecategorytoproductsetcategory;
Query OK, 229 rows affected (0.01 sec)
Records: 229 Deleted: 0 Skipped: 0 Warnings: 0
finally.
ERROR 1062 (23000): Duplicate entry '0' for key 1
That is what mysql complained on my import file which looks like this:
[~/db] head sitecat1.csv 1000 1 2001 68001
1001 1 2001 98001
1002 1 2001 107001
1003 1 2001 59001
1004 1 2001 94001
Which i generated from a file like this:
[~/db] head sitecat.csv
2001,68001
2001,98001
2001,107001
2001,59001
2001,94001
2001,88001
2001,58001
2001,92001
2001,87001
2072,80001
using this:
[~/db] perl -ne '@x=split(/,/);$n = 1000+$i++;print "$n\t1\t$x[0]\t$x[1]\n"' sitecat.csv > sitecat1.csv
problem identified:
there is an extra new line being generated (not the last \n i add there) which creates a blank line in the output file for every data line. when mysql tries to import the blank line, it gives the "Duplicate entry '0' for key 1" error.
correct perl:
[~/db] perl -ne '@x=split(/,/);$n = 1000+$i++;print "$n\t1\t$x[0]\t$x[1]"' sitecat.csv > sitecat1.csv
which creates:
[~/db] head sitecat1.csv
1000 1 2001 68001
1001 1 2001 98001
1002 1 2001 107001
1003 1 2001 59001
1004 1 2001 94001
1005 1 2001 88001
1006 1 2001 58001
1007 1 2001 92001
1008 1 2001 87001
1009 1 2072 80001
no blank lines which gives:
mysql> load data infile '/Users/thushara/db/sitecat1.csv' into table sitecategorytoproductsetcategory;
Query OK, 229 rows affected (0.01 sec)
Records: 229 Deleted: 0 Skipped: 0 Warnings: 0
finally.
Monday, September 15, 2008
Implementing a Heap in Java
A Heap is a data structure storing a collection of objects, allowing you to pick the object of the highest value and insert objects of arbitrary values efficiently.
Simply looking at the highest value takes O(1) time. Removing the highest value - thus making sure that the next highest value will be available for the next call - takes O(log n). Insertion takes O(log n).
Java has an inbuilt collection class - java.util.PriorityQueue that can be used as a Heap. The caveat is that it is the minimum value, not the maximum that can be retrieved efficiently.
So, there are three ways of dealing with the situation.
1. Implement Comparable interface on the object being stored.
For the object type (class) stored in the PriorityQueue, implement the Comparable interface. Comparable interface has one function - [int CompareTo(T t)] that should return -1 if this object is less than t, +1 if this object is greater than t, 0 if the two objects are equal. So what we could do is "flip" this definition and implement a compareTo function that returns -1 if this object is greater than t, +1 if this object is less than t, 0 if they are equal. Thus when the PriorityQueue insert function tries to compare the value of the object being inserted to the current head of the queue, the greater value will be brought to the top.
2. Pass an object derived from Comparator to the PriorityQueue constructor
It is possible when we construct the PriorityQueue to pass it an object derived from Comparator class. This should implement a [int compare T t1, T t2] similar to above.
3. Negate the value being compared inside the object
It is possible to implement the Comparable interface for the PriorityQueue according the the standard definition, but negate the value being stored in the object that is being compared to. For ex:, imagine there is a guest_count field in an Object WebPage, and we have a PriorityQueue of WebPage objects where we should be able to pop the highest visited (guest_count) Web Page quickly. We store the negation of the number of guests in guest_count, so that the usual comparison ends up pushing the highest visited page to the top of the PriorityQueue.
So which method should you use?
#1 is not recommended as it inherently alters the way any collection of the object is sorted. Imagine, somewhere else in the code, you have a vector of WebPage objects, and you call sort() on it. sort() will use the Comparable interface and sort the WebPage objects in descending order, probably not what you intended.
#3 is quite hacky - i consider it clever though, as it involves the minimum amount of work. But it is much more readable to keep the guest_count field positive vs negative.
#2 is my pick as the "reversed comparison" of the object is only used with respect to the PriorityQueue and is thus isolated from the workings of the object in different contexts.
At work I had to use a Heap data structure to iteratively apply an "ad selection" algorithm to the add with the most number of clicks. I ended up using #2 above.
Simply looking at the highest value takes O(1) time. Removing the highest value - thus making sure that the next highest value will be available for the next call - takes O(log n). Insertion takes O(log n).
Java has an inbuilt collection class - java.util.PriorityQueue that can be used as a Heap. The caveat is that it is the minimum value, not the maximum that can be retrieved efficiently.
So, there are three ways of dealing with the situation.
1. Implement Comparable interface on the object being stored.
For the object type (class) stored in the PriorityQueue, implement the Comparable interface. Comparable interface has one function - [int CompareTo(T t)] that should return -1 if this object is less than t, +1 if this object is greater than t, 0 if the two objects are equal. So what we could do is "flip" this definition and implement a compareTo function that returns -1 if this object is greater than t, +1 if this object is less than t, 0 if they are equal. Thus when the PriorityQueue insert function tries to compare the value of the object being inserted to the current head of the queue, the greater value will be brought to the top.
2. Pass an object derived from Comparator to the PriorityQueue constructor
It is possible when we construct the PriorityQueue to pass it an object derived from Comparator class. This should implement a [int compare T t1, T t2] similar to above.
3. Negate the value being compared inside the object
It is possible to implement the Comparable interface for the PriorityQueue according the the standard definition, but negate the value being stored in the object that is being compared to. For ex:, imagine there is a guest_count field in an Object WebPage, and we have a PriorityQueue of WebPage objects where we should be able to pop the highest visited (guest_count) Web Page quickly. We store the negation of the number of guests in guest_count, so that the usual comparison ends up pushing the highest visited page to the top of the PriorityQueue.
So which method should you use?
#1 is not recommended as it inherently alters the way any collection of the object is sorted. Imagine, somewhere else in the code, you have a vector of WebPage objects, and you call sort() on it. sort() will use the Comparable interface and sort the WebPage objects in descending order, probably not what you intended.
#3 is quite hacky - i consider it clever though, as it involves the minimum amount of work. But it is much more readable to keep the guest_count field positive vs negative.
#2 is my pick as the "reversed comparison" of the object is only used with respect to the PriorityQueue and is thus isolated from the workings of the object in different contexts.
At work I had to use a Heap data structure to iteratively apply an "ad selection" algorithm to the add with the most number of clicks. I ended up using #2 above.
Tuesday, September 09, 2008
unix redirection related pitfall
Today I wrote a little script that fetched around 800 web pages from a blogging site, parsed each page to extract some key information that I then later saved to a text file. This was a work related thing that my employer needed.
So I used curl and redirected everything to a file like so:
curl -b PHPSESSID=6aad03adf83b4c73b36e4d33edccb698 "http://www.site.com/path/to/file.php?start=$COUNTER&type=AdvBlogEntry&SortBy=&Order=" &> z.html
I used cmd &> z.html to get all the output (including anything on stderr)
However, when I ran my parser script on the output file, z.html, sometimes the parsing was off. The problem was the header info that was sent to stderr by curl got mixed in the middle of the z.html file due to the way the O/S wrote the two buffers to the file.
This is the header:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 72883 0 72883 0 0 31259 0 --:--:-- 0:00:02 --:--:-- 34718
When the parser failed, I would examine z.html and see lines like this:
<td align=^M100 46097 0 46097 0 0 21752 0 --:--:-- 0:00:02 --:--:-- 23847"left" class="tabledata">Philosophy & Religion</td>
So the header output got mixed in right after the align= attribute, which confused the parser.
The solution was to simply direct only stdout to the file, and I also used the silent option (-s) to curl so that the parser could print its results directly to stdout. This is the correct cmd:
curl -s -b PHPSESSID=6aad03adf83b4c73b36e4d33edccb698 "http://www.site.com/path/to/file.php?start=$COUNTER&type=AdvBlogEntry&SortBy=&Order=" > z.html
So I used curl and redirected everything to a file like so:
curl -b PHPSESSID=6aad03adf83b4c73b36e4d33edccb698 "http://www.site.com/path/to/file.php?start=$COUNTER&type=AdvBlogEntry&SortBy=&Order=" &> z.html
I used cmd &> z.html to get all the output (including anything on stderr)
However, when I ran my parser script on the output file, z.html, sometimes the parsing was off. The problem was the header info that was sent to stderr by curl got mixed in the middle of the z.html file due to the way the O/S wrote the two buffers to the file.
This is the header:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 72883 0 72883 0 0 31259 0 --:--:-- 0:00:02 --:--:-- 34718
When the parser failed, I would examine z.html and see lines like this:
<td align=^M100 46097 0 46097 0 0 21752 0 --:--:-- 0:00:02 --:--:-- 23847"left" class="tabledata">Philosophy & Religion</td>
So the header output got mixed in right after the align= attribute, which confused the parser.
The solution was to simply direct only stdout to the file, and I also used the silent option (-s) to curl so that the parser could print its results directly to stdout. This is the correct cmd:
curl -s -b PHPSESSID=6aad03adf83b4c73b36e4d33edccb698 "http://www.site.com/path/to/file.php?start=$COUNTER&type=AdvBlogEntry&SortBy=&Order=" > z.html
Monday, September 08, 2008
use ncftp to upload whole directories using ftp
yesterday, i came across the problem of figuring out how to transfer a whole directory to a hosting site which allowed no ssh access. without ssh, it wasn't possible to copy a tar file and untar it. the only protocol available was ftp. ftp doesn't allow you to upload a whole directory. if you use `mput *`, it will upload all the files within the directory, but it will not recurse into sub-directories.
i found a script someone had written that actually allowed downloading a full directory recursively. and there were a few posts that mentioned a `curl -T` way of doing it, but i couldn't get it to work.
then i hit upon ncftp, that worked really well.
simply:
i found a script someone had written that actually allowed downloading a full directory recursively. and there were a few posts that mentioned a `curl -T` way of doing it, but i couldn't get it to work.
then i hit upon ncftp, that worked really well.
simply:
ncftp -u username -p password ftp.somedomain.com
and when inside the ftp shell,put -R somedir
Thursday, August 28, 2008
web 2.0 image clusters
at http://livemocha.com we use a cluster of lighttpd servers to handle the requests for image, audio/video files. the reason to use lighttpd instead of apache is to do with the superior capability of lighttpd to handle large number of concurrent requests with low latency.
the apache server will use one process to serve a request. while the request is being served, that process will not accept any other requests. any other request needs to be served by a different apache process.
for a low volume server, this is not a problem. there are enough apache processes lieing around to serve the concurrent requests.
it does become a problem with high volume. a typical web page may have 20-30 images, a css file, some javascript files. the browser will first make a request to apache (assuming apache is the web server) for the main web page. then as the browser parses the page, it will issue additional requests to the image files etc found on the page.
if these requests go back to apache, they will each have to be served by one apache process one at a time. the KeepAlive option at apache **may** provide some relief by keeping the TCP connection open at apache so that the browser can keep using that for the images. but this doesn't change the fact that one apache process will always block other requests while it is serving the image.
the image request is mostly I/O bound - meaning most of the time to fulfill the request will be spent waiting for the image file to be fetched from a hard drive, perhaps in some network share. in the case of livemocha, most images were served off a NFS server. So it is quite wasteful to have a process sitting around doing nothing but waiting for I/O to complete.
Having a process waiting around is bad because it takes memory. There is only a finite amount of it in the server, and when there are more processes than that can fit memory the O/S will have to swap a processes into disk to make room for another process. When the O/S context switches to the swapped process, it has to be brought back into memory again, possibly resulting in another process being swapped, so on and so forth. this is a vicious cycle that can and will bring the server to its knees - this phenomena is known as the "slashdot effect".
lighttpd can serve multiple requests from one process. it uses asynchronous I/O available in linux 2.6 kernels (ex: epoll) to multiplex multiple requests in one process. this uses very low memory and can perform well under the "slashdot effect".
nginx is another server that performs very well using similar technologies. we decided on lighttpd mainly because some other key sites were known to use it. from benchmarks, nginx seemed to be superior.
the apache server will use one process to serve a request. while the request is being served, that process will not accept any other requests. any other request needs to be served by a different apache process.
for a low volume server, this is not a problem. there are enough apache processes lieing around to serve the concurrent requests.
it does become a problem with high volume. a typical web page may have 20-30 images, a css file, some javascript files. the browser will first make a request to apache (assuming apache is the web server) for the main web page. then as the browser parses the page, it will issue additional requests to the image files etc found on the page.
if these requests go back to apache, they will each have to be served by one apache process one at a time. the KeepAlive option at apache **may** provide some relief by keeping the TCP connection open at apache so that the browser can keep using that for the images. but this doesn't change the fact that one apache process will always block other requests while it is serving the image.
the image request is mostly I/O bound - meaning most of the time to fulfill the request will be spent waiting for the image file to be fetched from a hard drive, perhaps in some network share. in the case of livemocha, most images were served off a NFS server. So it is quite wasteful to have a process sitting around doing nothing but waiting for I/O to complete.
Having a process waiting around is bad because it takes memory. There is only a finite amount of it in the server, and when there are more processes than that can fit memory the O/S will have to swap a processes into disk to make room for another process. When the O/S context switches to the swapped process, it has to be brought back into memory again, possibly resulting in another process being swapped, so on and so forth. this is a vicious cycle that can and will bring the server to its knees - this phenomena is known as the "slashdot effect".
lighttpd can serve multiple requests from one process. it uses asynchronous I/O available in linux 2.6 kernels (ex: epoll) to multiplex multiple requests in one process. this uses very low memory and can perform well under the "slashdot effect".
nginx is another server that performs very well using similar technologies. we decided on lighttpd mainly because some other key sites were known to use it. from benchmarks, nginx seemed to be superior.
Tuesday, August 19, 2008
Ubuntu 8.04 - setup wireless to start on boot
Make sure your /etc/network/interfaces file looks like this:
auto lo
iface lo inet loopback
auto wlan0
iface wlan0 inet dhcp
wireless-essid your-essid
wireless-ap 00:25:43:DE:AC:2B
wireless-mode Managed
How do you find your essid and access point (ap)? In my case, I cheated by using the "Network configuration" icon on the top left. If you click on it, it will show the available interfaces, just pick the correct one. Then issue the iwconfig command on the shell, and you should see the essid and ap.
ex:
thushara@gini-sisila:~$ iwconfig
lo no wireless extensions.
eth0 no wireless extensions.
wlan0 IEEE 802.11b+/g+ ESSID:"my-essid" Nickname:"acx v0.3.36"
Mode:Managed Frequency:2.437 GHz Access Point: 00:25:43:DE:AC:2B
Bit Rate:24 Mb/s Tx-Power=15 dBm Sensitivity=1/3
Retry min limit:7 RTS thr:off
Power Management:off
Link Quality=31/100 Signal level=3/100 Noise level=0/100
Rx invalid nwid:0 Rx invalid crypt:0 Rx invalid frag:0
Tx excessive retries:0 Invalid misc:0 Missed beacon:0
You can also find these parameters by logging into your wireless router's admin interface. You need to wire the computer to the router (or use another computer connected to the router) for this.
auto lo
iface lo inet loopback
auto wlan0
iface wlan0 inet dhcp
wireless-essid your-essid
wireless-ap 00:25:43:DE:AC:2B
wireless-mode Managed
How do you find your essid and access point (ap)? In my case, I cheated by using the "Network configuration" icon on the top left. If you click on it, it will show the available interfaces, just pick the correct one. Then issue the iwconfig command on the shell, and you should see the essid and ap.
ex:
thushara@gini-sisila:~$ iwconfig
lo no wireless extensions.
eth0 no wireless extensions.
wlan0 IEEE 802.11b+/g+ ESSID:"my-essid" Nickname:"acx v0.3.36"
Mode:Managed Frequency:2.437 GHz Access Point: 00:25:43:DE:AC:2B
Bit Rate:24 Mb/s Tx-Power=15 dBm Sensitivity=1/3
Retry min limit:7 RTS thr:off
Power Management:off
Link Quality=31/100 Signal level=3/100 Noise level=0/100
Rx invalid nwid:0 Rx invalid crypt:0 Rx invalid frag:0
Tx excessive retries:0 Invalid misc:0 Missed beacon:0
You can also find these parameters by logging into your wireless router's admin interface. You need to wire the computer to the router (or use another computer connected to the router) for this.
Thursday, April 03, 2008
Sessions in PHP (and cakephp Session abstraction)- how it works
The session in php is supposed to encapsulate the logic of retrieving the cookie values, mapping the cookie to a value at the web server backend, purging this backend data when cookies expire etc.
CakePHP provides a class called Session, that attempts to hide these details a step further. If you want to identify a web user across many web pages, all you need to do is enable sessions and then you can store information of the user in a Session object which will be always available at the web backend.
CakePHP provides a "helper" and a "component" that provides the next layer of abstraction to their basic Session object. If you're coding the "view" of the MVC, you use the helper - if you're coding the controller, you would use the "component".
This is all good for smallish-scale apps. As you start using more than one web server, you can't keep the cookie mappings in the web server, as the web requests are now going to one of many web servers. CakePHP has a mechanism using php session callbacks to allow you to store the cookie mappings in a central database.
This article attempts to walk you through the architecture of cakephp + php that allows this customization.
I would be referring to cake/libs/session.php so you should have it open in an editor as you go through this.
If we look in cake/libs/session.php, Session::__construct() function, we see that there is a call to $this->__initSession() followed by a call to session_start(). The __initSession() function sets some options in php that configures session handling at that level. Follow the switch statement on CAKE_SESSION_SAVE, and if it is set to 'database', you will see that session_set_save_handler() is used to tell php about some callbacks, namely:
__open : an initialization function
__close: a cleanup function, that is generally used to purge unnecessary data (gc)
__read: php passes this function a "session id" and expects the app to return the data that is mapped to this session id, which we shall call the "session value"
__write: php passes this function a "session id" and a "session value" and expects the app to store that mapping in the db
So, going back to the Session constructor, after the call to set the php callbacks, session_start() is called. This is obviously a function implemented in php core. It will now look at the cookie passed by the browser.
If this is a first time web user, there will be no cookie, so session_start() will generate a "session id" - an 128 bit value that is random enough to have a very low probability of collision. Then since session_start()
knows about the callback functions that were already set, it will call __open and __read
If you look at Session::__open() you will see it does nothing except return "true", signaling php that everything was ok on the initialization step. Now let's look at Session::__read().
As I said earlier, php passes a "key", which is nothing but the session id it just generated. The Session::read() code is expected to now retrieve the "session value" mapped to this "session id" by querying the db. Again if this is a first time user, there will be no value in the db and nothing will be returned to session_start(). Now session_start() will create a $_SESSION hash, that will be empty as nothing was returned to session_start() from Session_read(). It will also send this cookie value to the browser using a Set-Cookie response header. The cookie value looks like this:
CAKEPHP=session_id
Here the string CAKEPHP is defined in app/config/core.php and can be changed. For the curious, notice that this is passed into session.name in the Session::__initSession() function, that is how session_start() knows what name to use for the cookie. The session id in the cookie is of course what session_start() generated - that 128 bit random number we talked about.
At this point your application can write values to $_SESSION, whatever values you write will be saved to the database, upon a session_write_close() call to php. session_write_close() will use the 2 other callbacks, __write and __close to do that.
So at what point should the application call session_write_close()? It should be called before control leaves the current web request. One such method is when Controller::redirect() is called - see cake/libs/controller/controller.php. This function results in a redirect to the browser which takes the user to a different page, potentially hitting a different web server. So the current state must be saved to the database before that. The other method is the more natural path, of the request simply completing. The way CakePHP is designed, we're always guaranteed to have an instance of the ConnectionManager class running. The destructor of ConnectionManager calls session_write_close() - look in cake/libs/model/connection_manager.php.
So as you can see, if you use CakePHP, your app doesn't need to call session_write_close(), it is automatically handled for you by the framework.
So what happens inside session_write_close()? It will call Session::__write() passing the session id as the "key" and the $_SESSION hash as the "value". Session__write() first checks if a mapping exists for this session id, if so it updates it with the new values from the $_SESSION object - remember these may be values you set in you application, like user name, user email etc. For a first time user, there will be no mapping in the database, so Session::__write() inserts a new row.
Then session_write_close() will call Session::__close() which according to a coin flip (well actually using a function with lower odds than 0.5), will clean session mappings that have expired. How does CakePHP know if the cookies have expired? - interestingly enough the session cleanup here has nothing to do with the actual cookie expiration in the browser. CakePHP in fact uses a long lived cookie on the browser, but sets a configurable timeout for the mappings at the server. The timeout is calculated using the formula:
$timeout = CAKE_SESSION_TIMEOUT * factor,
where $factor is one of 10,100 or 300 for CAKE_SECURITY settings of 'high', 'medium' and 'low'. Both CAKE_SESSION_TIMEOUT and CAKE_SECURITY settings can be changed in app/config/core.php.
So we followed through what happens for a first time web user to your site - before the request leaves the page, a cookie is generated, sent to the browser, an empty $_SESSION object created that can be manipulated by the application, and then the $_SESSION object is saved to the database.
Now when the user visits the site again, the browser will send the cookie back to the site. Now session_start() has a cookie and it will extract the session id from it. Then when in calls Session::__read() it will pass the session id as the "key". The row will be found in the table and the session object will be returned to session_start(). session_start() will now create the $_SESSION object based on this value that Session::__read() returned.
Again, you're free to change, add values to the $_SESSION object and CakePHP guarantees that any changes will be written to the database before control leaves the web request logic.
CakePHP also provides some helper functions to write/read to/from the $_SESSION object. They are Session::write() and Session::read() respectively.
Hope this de-mystifies how sessions work a bit and helps in using the Session object with confidence.
CakePHP provides a class called Session, that attempts to hide these details a step further. If you want to identify a web user across many web pages, all you need to do is enable sessions and then you can store information of the user in a Session object which will be always available at the web backend.
CakePHP provides a "helper" and a "component" that provides the next layer of abstraction to their basic Session object. If you're coding the "view" of the MVC, you use the helper - if you're coding the controller, you would use the "component".
This is all good for smallish-scale apps. As you start using more than one web server, you can't keep the cookie mappings in the web server, as the web requests are now going to one of many web servers. CakePHP has a mechanism using php session callbacks to allow you to store the cookie mappings in a central database.
This article attempts to walk you through the architecture of cakephp + php that allows this customization.
I would be referring to cake/libs/session.php so you should have it open in an editor as you go through this.
If we look in cake/libs/session.php, Session::__construct() function, we see that there is a call to $this->__initSession() followed by a call to session_start(). The __initSession() function sets some options in php that configures session handling at that level. Follow the switch statement on CAKE_SESSION_SAVE, and if it is set to 'database', you will see that session_set_save_handler() is used to tell php about some callbacks, namely:
__open : an initialization function
__close: a cleanup function, that is generally used to purge unnecessary data (gc)
__read: php passes this function a "session id" and expects the app to return the data that is mapped to this session id, which we shall call the "session value"
__write: php passes this function a "session id" and a "session value" and expects the app to store that mapping in the db
So, going back to the Session constructor, after the call to set the php callbacks, session_start() is called. This is obviously a function implemented in php core. It will now look at the cookie passed by the browser.
If this is a first time web user, there will be no cookie, so session_start() will generate a "session id" - an 128 bit value that is random enough to have a very low probability of collision. Then since session_start()
knows about the callback functions that were already set, it will call __open and __read
If you look at Session::__open() you will see it does nothing except return "true", signaling php that everything was ok on the initialization step. Now let's look at Session::__read().
As I said earlier, php passes a "key", which is nothing but the session id it just generated. The Session::read() code is expected to now retrieve the "session value" mapped to this "session id" by querying the db. Again if this is a first time user, there will be no value in the db and nothing will be returned to session_start(). Now session_start() will create a $_SESSION hash, that will be empty as nothing was returned to session_start() from Session_read(). It will also send this cookie value to the browser using a Set-Cookie response header. The cookie value looks like this:
CAKEPHP=session_id
Here the string CAKEPHP is defined in app/config/core.php and can be changed. For the curious, notice that this is passed into session.name in the Session::__initSession() function, that is how session_start() knows what name to use for the cookie. The session id in the cookie is of course what session_start() generated - that 128 bit random number we talked about.
At this point your application can write values to $_SESSION, whatever values you write will be saved to the database, upon a session_write_close() call to php. session_write_close() will use the 2 other callbacks, __write and __close to do that.
So at what point should the application call session_write_close()? It should be called before control leaves the current web request. One such method is when Controller::redirect() is called - see cake/libs/controller/controller.php. This function results in a redirect to the browser which takes the user to a different page, potentially hitting a different web server. So the current state must be saved to the database before that. The other method is the more natural path, of the request simply completing. The way CakePHP is designed, we're always guaranteed to have an instance of the ConnectionManager class running. The destructor of ConnectionManager calls session_write_close() - look in cake/libs/model/connection_manager.php.
So as you can see, if you use CakePHP, your app doesn't need to call session_write_close(), it is automatically handled for you by the framework.
So what happens inside session_write_close()? It will call Session::__write() passing the session id as the "key" and the $_SESSION hash as the "value". Session__write() first checks if a mapping exists for this session id, if so it updates it with the new values from the $_SESSION object - remember these may be values you set in you application, like user name, user email etc. For a first time user, there will be no mapping in the database, so Session::__write() inserts a new row.
Then session_write_close() will call Session::__close() which according to a coin flip (well actually using a function with lower odds than 0.5), will clean session mappings that have expired. How does CakePHP know if the cookies have expired? - interestingly enough the session cleanup here has nothing to do with the actual cookie expiration in the browser. CakePHP in fact uses a long lived cookie on the browser, but sets a configurable timeout for the mappings at the server. The timeout is calculated using the formula:
$timeout = CAKE_SESSION_TIMEOUT * factor,
where $factor is one of 10,100 or 300 for CAKE_SECURITY settings of 'high', 'medium' and 'low'. Both CAKE_SESSION_TIMEOUT and CAKE_SECURITY settings can be changed in app/config/core.php.
So we followed through what happens for a first time web user to your site - before the request leaves the page, a cookie is generated, sent to the browser, an empty $_SESSION object created that can be manipulated by the application, and then the $_SESSION object is saved to the database.
Now when the user visits the site again, the browser will send the cookie back to the site. Now session_start() has a cookie and it will extract the session id from it. Then when in calls Session::__read() it will pass the session id as the "key". The row will be found in the table and the session object will be returned to session_start(). session_start() will now create the $_SESSION object based on this value that Session::__read() returned.
Again, you're free to change, add values to the $_SESSION object and CakePHP guarantees that any changes will be written to the database before control leaves the web request logic.
CakePHP also provides some helper functions to write/read to/from the $_SESSION object. They are Session::write() and Session::read() respectively.
Hope this de-mystifies how sessions work a bit and helps in using the Session object with confidence.
Monday, March 10, 2008
Asynchronous file uploads
Here is a screen-shot of the async file upload I talked about in the earlier post.
The iframe is the "Upload File" UI element. It is embedded in the main form.
The main form has no idea of there being a file input element. All the file input logic is in the form element within the iframe. This way the file upload can even be re-used in different forms.
In fact livemocha has 2 forms where users can upload pictures. The first is when the user activates their account, the next is when they later would want to edit their profile information. I use the iframe code for both of those views. And that includes the controler logic that processes the image as well.
GMail like Asynchronous file upload from a form
Web forms that allow you to upload files typically insist on clicking on a "Submit" button to upload the file to the web server.
Most input data can be submitted to the web server from the client using Ajax, thus bypassing a "Submit" button. However file data cannot be serialized and passed to the web server with Ajax.
It is possible to use Javascript (NOT Ajax) and call form.submit() directly, thus bypassing the "Submit" button.
The only issue here is that, we have no control over how the submit() call renders data. One specific issue is that this call would cause the whole form to re-load which results in the user losing any other data she may have entered on the same page.
It is possible to use an iFrame within the form, include the file input field within the iFrame and call iFrame.form.submit() to upload the file. Then only the iFrame will be re-loaded and the rest of the form will remain intact.
GMail uses a technique similar to this, that allows them to "see" the attachment before the user clicks "Send" on the email - this is a very useful feature, as users immediately get feedback, GMail can perform validations - eg: virus checks - on the file as the user is composing the email etc.
Examine the html at http://livemocha.com/profiles/create to get an idea
Notice the "src" parameter points to a different form for the file upload part. That is the mini-form that posts the file data asynchronously (using javascript) to the server.
The javascript does the direct submit to the webserver bypassing the "Submit" button - in fact the mini-form inside the iframe does not have a Submit button as you can see. The javascript is triggered via the "onchange" notification on the file input field, i.e: when a file is selected for upload by the user.
This is a feature that is now available at http://www.livemocha.com/profiles/basic - check it out so you can see some more nice interactions. For example, you can immediately show the new picture uploaded to the user - normally this is difficult to do on the web browser, as on the client we have no access to the local file system.
Most input data can be submitted to the web server from the client using Ajax, thus bypassing a "Submit" button. However file data cannot be serialized and passed to the web server with Ajax.
It is possible to use Javascript (NOT Ajax) and call form.submit() directly, thus bypassing the "Submit" button.
The only issue here is that, we have no control over how the submit() call renders data. One specific issue is that this call would cause the whole form to re-load which results in the user losing any other data she may have entered on the same page.
It is possible to use an iFrame within the form, include the file input field within the iFrame and call iFrame.form.submit() to upload the file. Then only the iFrame will be re-loaded and the rest of the form will remain intact.
GMail uses a technique similar to this, that allows them to "see" the attachment before the user clicks "Send" on the email - this is a very useful feature, as users immediately get feedback, GMail can perform validations - eg: virus checks - on the file as the user is composing the email etc.
Examine the html at http://livemocha.com/profiles/create to get an idea
Notice the "src" parameter points to a different form for the file upload part. That is the mini-form that posts the file data asynchronously (using javascript) to the server.
The javascript does the direct submit to the webserver bypassing the "Submit" button - in fact the mini-form inside the iframe does not have a Submit button as you can see. The javascript is triggered via the "onchange" notification on the file input field, i.e: when a file is selected for upload by the user.
This is a feature that is now available at http://www.livemocha.com/profiles/basic - check it out so you can see some more nice interactions. For example, you can immediately show the new picture uploaded to the user - normally this is difficult to do on the web browser, as on the client we have no access to the local file system.
Subscribe to:
Posts (Atom)