Friday, September 19, 2008

perl and mysql - what's wrong with the import data?

mysql> load data infile '/Users/thushara/db/sitecat1.csv' into table sitecategorytoproductsetcategory;
ERROR 1062 (23000): Duplicate entry '0' for key 1

That is what mysql complained on my import file which looks like this:

[~/db] head sitecat1.csv 1000 1 2001 68001

1001 1 2001 98001

1002 1 2001 107001

1003 1 2001 59001

1004 1 2001 94001

Which i generated from a file like this:

[~/db] head sitecat.csv

using this:

[~/db] perl -ne '@x=split(/,/);$n = 1000+$i++;print "$n\t1\t$x[0]\t$x[1]\n"' sitecat.csv > sitecat1.csv

problem identified:

there is an extra new line being generated (not the last \n i add there) which creates a blank line in the output file for every data line. when mysql tries to import the blank line, it gives the "Duplicate entry '0' for key 1" error.

correct perl:

[~/db] perl -ne '@x=split(/,/);$n = 1000+$i++;print "$n\t1\t$x[0]\t$x[1]"' sitecat.csv > sitecat1.csv

which creates:

[~/db] head sitecat1.csv
1000 1 2001 68001
1001 1 2001 98001
1002 1 2001 107001
1003 1 2001 59001
1004 1 2001 94001
1005 1 2001 88001
1006 1 2001 58001
1007 1 2001 92001
1008 1 2001 87001
1009 1 2072 80001

no blank lines which gives:

mysql> load data infile '/Users/thushara/db/sitecat1.csv' into table sitecategorytoproductsetcategory;
Query OK, 229 rows affected (0.01 sec)
Records: 229 Deleted: 0 Skipped: 0 Warnings: 0


Monday, September 15, 2008

Implementing a Heap in Java

A Heap is a data structure storing a collection of objects, allowing you to pick the object of the highest value and insert objects of arbitrary values efficiently.

Simply looking at the highest value takes O(1) time. Removing the highest value - thus making sure that the next highest value will be available for the next call - takes O(log n). Insertion takes O(log n).

Java has an inbuilt collection class - java.util.PriorityQueue that can be used as a Heap. The caveat is that it is the minimum value, not the maximum that can be retrieved efficiently.

So, there are three ways of dealing with the situation.

1. Implement Comparable interface on the object being stored.

For the object type (class) stored in the PriorityQueue, implement the Comparable interface. Comparable interface has one function - [int CompareTo(T t)] that should return -1 if this object is less than t, +1 if this object is greater than t, 0 if the two objects are equal. So what we could do is "flip" this definition and implement a compareTo function that returns -1 if this object is greater than t, +1 if this object is less than t, 0 if they are equal. Thus when the PriorityQueue insert function tries to compare the value of the object being inserted to the current head of the queue, the greater value will be brought to the top.

2. Pass an object derived from Comparator to the PriorityQueue constructor

It is possible when we construct the PriorityQueue to pass it an object derived from Comparator class. This should implement a [int compare T t1, T t2] similar to above.

3. Negate the value being compared inside the object

It is possible to implement the Comparable interface for the PriorityQueue according the the standard definition, but negate the value being stored in the object that is being compared to. For ex:, imagine there is a guest_count field in an Object WebPage, and we have a PriorityQueue of WebPage objects where we should be able to pop the highest visited (guest_count) Web Page quickly. We store the negation of the number of guests in guest_count, so that the usual comparison ends up pushing the highest visited page to the top of the PriorityQueue.

So which method should you use?

#1 is not recommended as it inherently alters the way any collection of the object is sorted. Imagine, somewhere else in the code, you have a vector of WebPage objects, and you call sort() on it. sort() will use the Comparable interface and sort the WebPage objects in descending order, probably not what you intended.

#3 is quite hacky - i consider it clever though, as it involves the minimum amount of work. But it is much more readable to keep the guest_count field positive vs negative.

#2 is my pick as the "reversed comparison" of the object is only used with respect to the PriorityQueue and is thus isolated from the workings of the object in different contexts.

At work I had to use a Heap data structure to iteratively apply an "ad selection" algorithm to the add with the most number of clicks. I ended up using #2 above.

Tuesday, September 09, 2008

unix redirection related pitfall

Today I wrote a little script that fetched around 800 web pages from a blogging site, parsed each page to extract some key information that I then later saved to a text file. This was a work related thing that my employer needed.

So I used curl and redirected everything to a file like so:

curl -b PHPSESSID=6aad03adf83b4c73b36e4d33edccb698 "$COUNTER&type=AdvBlogEntry&SortBy=&Order=" &> z.html

I used cmd &> z.html to get all the output (including anything on stderr)

However, when I ran my parser script on the output file, z.html, sometimes the parsing was off. The problem was the header info that was sent to stderr by curl got mixed in the middle of the z.html file due to the way the O/S wrote the two buffers to the file.

This is the header:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 72883 0 72883 0 0 31259 0 --:--:-- 0:00:02 --:--:-- 34718

When the parser failed, I would examine z.html and see lines like this:

<td align=^M100 46097 0 46097 0 0 21752 0 --:--:-- 0:00:02 --:--:-- 23847"left" class="tabledata">Philosophy & Religion</td>

So the header output got mixed in right after the align= attribute, which confused the parser.

The solution was to simply direct only stdout to the file, and I also used the silent option (-s) to curl so that the parser could print its results directly to stdout. This is the correct cmd:

curl -s -b PHPSESSID=6aad03adf83b4c73b36e4d33edccb698 "$COUNTER&type=AdvBlogEntry&SortBy=&Order=" > z.html

Monday, September 08, 2008

use ncftp to upload whole directories using ftp

yesterday, i came across the problem of figuring out how to transfer a whole directory to a hosting site which allowed no ssh access. without ssh, it wasn't possible to copy a tar file and untar it. the only protocol available was ftp. ftp doesn't allow you to upload a whole directory. if you use `mput *`, it will upload all the files within the directory, but it will not recurse into sub-directories.

i found a script someone had written that actually allowed downloading a full directory recursively. and there were a few posts that mentioned a `curl -T` way of doing it, but i couldn't get it to work.

then i hit upon ncftp, that worked really well.


ncftp -u username -p password

and when inside the ftp shell,

put -R somedir