Tuesday, June 30, 2009

mv is not atomic in Mac OS

You shouldn't rely on `mv` being atomic on the regular file system under Mac OS. I had a script that had to regularly update a file read by a different script. To keep the reader from ever seeing a partial write, I wrote to a temporary file and then `mv`ed it to the permanent location. While this works on Linux, it doesn't work on Mac OS.

To demonstrate, open two terminal windows on your Mac and in one type this:

while true; do echo this better be a whole sentence > x1.txt; mv x1.txt x.txt; done


on the other, run this script:

while true
do
    F=`cat x.txt`
    echo $F
    if [ "$F" = "this better be a whole sentence" ]
    then
        echo ok
    else
        echo bad
        exit 1
    fi
done


notice the output:

mpire@brwdbs02:~$ ./x.sh
this better be a whole sentence
ok
this better be a whole sentence
ok
ok
this better be a whole sentence
ok
this better be a whole sentence
ok
cat: x.txt: No such file or directory

bad
mpire@brwdbs02:~$
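One way to live with this on a Mac (a sketch I'm adding here, not part of the original scripts) is to make the reader treat the brief window where the file is missing as transient, and retry instead of failing:

```shell
#!/bin/sh
# read_when_ready FILE: keep retrying until the file can be read in full.
# This papers over the non-atomic rename window rather than fixing it.
read_when_ready() {
  file=$1
  while true; do
    if contents=$(cat "$file" 2>/dev/null); then
      printf '%s\n' "$contents"
      return 0
    fi
    sleep 1   # back off while the writer is mid-rename
  done
}
```

The writer loop stays exactly as before; the reader just stops treating "No such file or directory" as fatal. Note this still won't save you from reading a half-written sentence if the copy itself is in flight.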

Performance Improvement in org.apache.hadoop.io.Text class


I wrote earlier about a performance improvement I made to Hadoop. After discussing it with the Hadoop devs, notably Chris Douglas, the change was made in the core org.apache.hadoop.io.Text class itself. This has the additional benefit of improving a text-handling class used throughout Hadoop, and it avoids the extra memory footprint of allocating a separate OutputStream instance.

This improvement will be available in Hadoop 0.21.0.

Note the difference in YourKit profiling data with the new Text class:

Thursday, June 25, 2009

running Hadoop tests


Install JDK 1.5 and Apache Forrest, then run this command:

ant -Djava5.home=/System/Library/Frameworks/JavaVM.framework/Versions/1.5/Home/ -Dforrest.home=/Users/thushara/apache-forrest-0.8 -Djavac.args="-Xlint -Xmaxwarns 1000" clean test tar

Monday, June 22, 2009

bash: single line for loop

Run commands on multiple files at once:

[~/hadoop-src] for f in *; do echo $f; done
common
hdfs
mapreduce
[~/hadoop-src] for f in *; do svn up $f/trunk; done
At revision 787534.

Fetching external item into 'hdfs/trunk/src/test/bin'
External at revision 787534.

At revision 787534.

Fetching external item into 'mapreduce/trunk/src/test/bin'
External at revision 787534.

At revision 787534.
[~/hadoop-src]
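One caveat worth adding (my note, not in the original): an unquoted `$f` gets word-split, so the loop breaks on filenames containing spaces. Quote the variable when the files aren't all simple names:

```shell
#!/bin/sh
# quoting "$f" keeps a filename like "my file.txt" as one word
for f in *; do
  printf '%s\n' "$f"
done
```

`echo $f` happened to work above because svn checkout directories have no spaces in their names.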

Friday, June 19, 2009

date -d is different from Linux to Mac


You're probably familiar with:

date -d '1 hour ago'

Well, it works on the Linux command line, but no such luck on the Mac.

Here is the code you need if your script has to work on both OSes:

OS=`uname -a`
if [[ $OS == Darwin* ]]
then
    TODAY=`date -v-1H +"%Y-%m-%d.%H"`
else
    TODAY=`date -d '1 hour ago' +"%Y-%m-%d.%H"`
fi
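An alternative sketch (my variation, not from the original): instead of keying off `uname`, try the BSD flag first and fall back to the GNU one when it fails. This also covers other BSDs for free:

```shell
#!/bin/sh
# hours_ago N: print the timestamp N hours in the past, on BSD/macOS or GNU date
hours_ago() {
  n=$1
  # BSD date understands -v; GNU date rejects it, so we fall back to -d
  date -v-"${n}"H +"%Y-%m-%d.%H" 2>/dev/null ||
    date -d "${n} hours ago" +"%Y-%m-%d.%H"
}
```

Feature-detecting the flag keeps the script honest even if `uname` output varies.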

Mac Office Excel - all text exports appear as one line


If you use Mac Office to export a spreadsheet as text, any Unix utility will see the file as one long line. The reason is that Office inserts a bare carriage return (^M), classic Mac OS style, with no line feed.

To fix this, open the text file in vi and type:

:s/^V^M/^V^M/g

this will appear in the editor as

:s/^M/^M/g

since ^V is just an escape that lets you type control characters literally: the ^M in the pattern matches the carriage return, and the ^M in the replacement is treated by vi as a line break.
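If you'd rather not open an editor at all, the same conversion can be done from the shell (my addition, not from the original post) with `tr`, replacing every carriage return with a line feed:

```shell
#!/bin/sh
# cr_to_lf: convert classic-Mac CR line endings on stdin to Unix LF on stdout
cr_to_lf() {
  tr '\r' '\n'
}
# e.g.: cr_to_lf < export.txt > export-unix.txt   (export.txt is a hypothetical name)
```

This is handy when the export step is itself part of a script.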

Thursday, June 18, 2009

Downgrading Subversion Working Copy

Sometimes an IDE coupled to SVN (e.g. IntelliJ) may upgrade your Subversion working copy to the latest SVN server format, which is incompatible with the command-line svn client you have. If your distro ships a command-line svn of the latest version, you are probably fine. But if you aren't so lucky (for example, a Mac OS user with svn 1.4 on the command line), you may want to downgrade the working copy using instructions here.

Friday, June 05, 2009

Hadoop - reading large lines (several MB) is slow


I ran into a performance issue running a Hadoop map/reduce job on input that at times contained lines as long as 200MB. The issue was in org.apache.hadoop.util.LineReader.

LineReader uses org.apache.hadoop.io.Text to accumulate potentially large lines of text. Unfortunately, the Text class does not behave well for large text: each append grows the backing byte array only to the exact size needed, so assembling a long line copies the same bytes over and over.

Here is the YourKit profile of a simple block of code using the Text class:



Here is a profile when Text is replaced with ByteArrayOutputStream:



Notice that the Text.append version took 10 times longer to run.

With this simple change to LineReader, my map/reduce task, which initially took over 20 minutes and then crashed (the Hadoop TaskTracker was timing out child tasks that took too long), finished in under 30 seconds:

public int readLine(Text str, int maxLineLength,
                    int maxBytesToConsume) throws IOException {
  str.clear();
  boolean hadFinalNewline = false;
  boolean hadFinalReturn = false;
  boolean hitEndOfFile = false;
  int startPosn = bufferPosn;
  long bytesConsumed = 0;
  // accumulate the line in a ByteArrayOutputStream, which grows
  // its buffer exponentially, instead of appending to Text directly
  ByteArrayOutputStream os = new ByteArrayOutputStream();
  outerLoop: while (true) {
    if (bufferPosn >= bufferLength) {
      if (!backfill()) {
        hitEndOfFile = true;
        break;
      }
    }
    startPosn = bufferPosn;
    for (; bufferPosn < bufferLength; ++bufferPosn) {
      switch (buffer[bufferPosn]) {
        case '\n':
          hadFinalNewline = true;
          bufferPosn += 1;
          break outerLoop;
        case '\r':
          if (hadFinalReturn) {
            // leave this \r in the stream, so we'll get it next time
            break outerLoop;
          }
          hadFinalReturn = true;
          break;
        default:
          if (hadFinalReturn) {
            break outerLoop;
          }
      }
    }
    bytesConsumed += bufferPosn - startPosn;
    int length = bufferPosn - startPosn - (hadFinalReturn ? 1 : 0);
    length = (int) Math.min(length, maxLineLength - os.size());
    if (length >= 0) {
      os.write(buffer, startPosn, length);
    }
    if (bytesConsumed >= maxBytesToConsume) {
      str.set(os.toByteArray());
      return (int) Math.min(bytesConsumed, (long) Integer.MAX_VALUE);
    }
  }
  int newlineLength = (hadFinalNewline ? 1 : 0) + (hadFinalReturn ? 1 : 0);
  if (!hitEndOfFile) {
    bytesConsumed += bufferPosn - startPosn;
    int length = bufferPosn - startPosn - newlineLength;
    length = (int) Math.min(length, maxLineLength - os.size());
    if (length > 0) {
      os.write(buffer, startPosn, length);
    }
  }

  str.set(os.toByteArray());
  return (int) Math.min(bytesConsumed, (long) Integer.MAX_VALUE);
}