Tuesday, September 09, 2008

Unix redirection pitfall

Today I wrote a little script that fetched around 800 web pages from a blogging site and parsed each page to extract some key information, which I then saved to a text file. This was a work-related task that my employer needed.

So I used curl and redirected everything to a file like so:

curl -b PHPSESSID=6aad03adf83b4c73b36e4d33edccb698 "http://www.site.com/path/to/file.php?start=$COUNTER&type=AdvBlogEntry&SortBy=&Order=" &> z.html

I used cmd &> z.html to capture all of the output, including anything written to stderr. In bash, &> file is shorthand for > file 2>&1.

However, when I ran my parser script on the output file, z.html, the parsing was sometimes off. The problem was that the progress meter header that curl writes to stderr ended up in the middle of z.html: stdout and stderr both pointed at the same file, so the two streams were interleaved as they were written.

This is the header:
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 72883    0 72883    0     0  31259      0 --:--:--  0:00:02 --:--:-- 34718

When the parser failed, I would examine z.html and see lines like this:

<td align=^M100 46097 0 46097 0 0 21752 0 --:--:-- 0:00:02 --:--:-- 23847"left" class="tabledata">Philosophy & Religion</td>

So the header output (preceded by a carriage return, shown here as ^M) landed right after align=, in the middle of the tag, which confused the parser.
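To see the effect without hitting a real site, here is a minimal sketch. fake_fetch is a made-up stand-in for curl (not part of my original script) that writes a bit of HTML to stdout and a progress-style line to stderr. It compares sending both streams to one file with sending only stdout there.

```shell
#!/bin/sh
# Hypothetical stand-in for curl: page body on stdout, progress line on stderr.
fake_fetch() {
  printf '<td align="left" class="tabledata">Philosophy</td>\n'
  printf '100 46097 (progress meter)\n' >&2
}

workdir=$(mktemp -d)

# Both streams into one file. In bash, `cmd &> file` means the same thing.
fake_fetch > "$workdir/mixed.html" 2>&1

# Only stdout into the file; stderr is kept apart in its own log.
fake_fetch > "$workdir/clean.html" 2> "$workdir/progress.log"

# Count the lines in each file that contain the progress text.
grep -c 'progress meter' "$workdir/mixed.html"
grep -c 'progress meter' "$workdir/clean.html" || true
```

In this toy version the stray line always lands at the end of mixed.html. With a real download the two streams are written at different moments, which is presumably why the progress line showed up in the middle of a tag.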

The solution was simply to redirect only stdout to the file. I also passed curl the silent option (-s), which suppresses the progress meter, so that the only thing printed to the terminal was the parser's results. This is the corrected command:

curl -s -b PHPSESSID=6aad03adf83b4c73b36e4d33edccb698 "http://www.site.com/path/to/file.php?start=$COUNTER&type=AdvBlogEntry&SortBy=&Order=" > z.html
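One possible refinement, sketched below, is to wrap the fetch in a small function that sends the page to one file and any curl error messages to a separate log. This is not what my script did. The function name fetch_page, the SESSION_ID variable, and the log file name are all made up for illustration. The extra -S flag asks curl to keep reporting real errors on stderr even though -s has silenced the progress meter.

```shell
#!/bin/sh
# Hypothetical helper: fetch one listing page.
#   $1 = value for the start= parameter
#   $2 = file that receives the HTML (stdout only)
#   $3 = log file that collects curl's error messages (stderr, appended)
# SESSION_ID is an assumed variable holding the PHPSESSID cookie value.
fetch_page() {
  start=$1
  out=$2
  errlog=$3
  # -s hides the progress meter; -S still prints error messages to stderr.
  curl -sS -b "PHPSESSID=$SESSION_ID" \
    "http://www.site.com/path/to/file.php?start=${start}&type=AdvBlogEntry&SortBy=&Order=" \
    > "$out" 2>> "$errlog"
}
```

It could then be called as fetch_page "$COUNTER" z.html curl-errors.log before running the parser on z.html, so that a failed download leaves a trace in the log instead of in the page.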
