Wednesday, November 23, 2011

HDFS namenode won't start after disk full error

If the NameNode in a Hadoop cluster won't restart after a disk full error, and you don't mind losing some data, you can do the following to get it back up.

Find the 'edits' file under the directory named by the dfs.name.dir setting, in its current/ subdirectory, and overwrite it with this byte sequence:

printf "\xff\xff\xff\xee\xff" > edits

After that, you should be able to start Hadoop again. Credit here.
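
A slightly fuller sketch of the whole procedure, with a backup of the corrupt file first. The paths and the hadoop-daemon.sh commands below are assumptions - substitute your own dfs.name.dir value and whatever start/stop commands your Hadoop version uses:

#!/bin/bash
# stop the NameNode if it is (partially) running; command may vary by version
bin/hadoop-daemon.sh stop namenode

# back up the corrupt edits file before overwriting it
NAME_DIR=/data/hadoop/name   # assumed value of dfs.name.dir
cp "$NAME_DIR/current/edits" "$NAME_DIR/current/edits.bak"

# write the minimal valid edits sequence
printf "\xff\xff\xff\xee\xff" > "$NAME_DIR/current/edits"

# start the NameNode again
bin/hadoop-daemon.sh start namenode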

Friday, November 04, 2011

Bash - prevent multiple copies of script from running

Since each bash command spawns its own process, we can't use file locks to get single-copy semantics. Why? Because file locks belong to the process that takes them and are automatically released when that process dies. So it is pointless to expect a Linux command to lock a file for us: the moment that command returns, its lock is released, defeating the purpose of the lock completely!
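
To see this concretely, consider the flock(1) utility, which takes a lock on a file, runs a command, and exits. A sketch (the /tmp/lockfile path is just for illustration):

#!/bin/bash
# flock takes the lock, runs 'true', then exits - and the kernel
# releases the lock the instant the flock process dies.
flock /tmp/lockfile true
# By this line the lock is already gone, so it protected nothing.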

One easy way to prevent multiple copies from running is to find a Linux command that performs an operation and reports whether it succeeded, all in a single atomic step. Crucially, the command must fail the second time it is run. The command to make a directory - mkdir - is one such command: it fails if the directory already exists.
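
For example, in an illustrative shell session:

$ mkdir /tmp/lockdemo; echo $?
0
$ mkdir /tmp/lockdemo; echo $?
mkdir: cannot create directory '/tmp/lockdemo': File exists
1

The second call fails with a non-zero exit status, which is exactly the signal we need.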

So the script tries to mkdir a particular directory - call it the lock directory. If mkdir fails, another copy is already running and we don't start. If it succeeds, we must remove the lock directory when the script ends so that the script can run again later. We do this using the trap command - trap makes sure a given command executes when the script exits, no matter where the exit happens.

Here is the code:

#!/bin/bash
# Try to create the lock directory; mkdir is atomic, so exactly
# one copy of this script can succeed.
mkdir /tmp/locka 2>/dev/null || {
    echo "another copy is already running" >&2
    exit 1
}
# Remove the lock directory when the script exits, whatever the reason.
trap "rmdir /tmp/locka" EXIT
# Script work goes here; the sleep below stands in for real work
# so the locking can be tested.
sleep 10
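
To test it, save the script as, say, single.sh, run it in one terminal, and run it again in a second terminal while the first copy is still sleeping; the second invocation should exit immediately with status 1:

$ ./single.sh &
$ ./single.sh
another copy is already running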