tech
Friday, January 13, 2012
Thursday, January 05, 2012
Wednesday, November 23, 2011
HDFS namenode won't start after disk full error
If the NameNode in a Hadoop cluster will not restart after a disk-full error, and you can accept losing some data, the following should get it back up.
Find the 'edits' file in the current directory under Hadoop's dfs.name.dir and write this byte sequence to it:
printf "\xff\xff\xff\xee\xff" > edits
After that, you should be able to start Hadoop. Credit here.
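Since this overwrites the edits file, it may be worth keeping a copy first. The following is only a sketch: reset_edits is a hypothetical helper name, and the name directory is passed in by the caller rather than read from the Hadoop configuration. It writes the same five bytes as the printf above, in octal notation so that it also works in shells whose printf does not understand \x.

```shell
# reset_edits NAME_DIR
# Hypothetical helper, not part of Hadoop. NAME_DIR is assumed to be the
# directory that dfs.name.dir points to on the NameNode host.
reset_edits() {
    local current="$1/current"
    local edits="$current/edits"

    # Refuse to do anything if the edits file is not where we expect it.
    [ -f "$edits" ] || { echo "no edits file in $current" >&2; return 1; }

    # Keep a timestamped copy so the original file can be restored.
    cp -p "$edits" "$edits.bak.$(date +%Y%m%d%H%M%S)" || return 1

    # Octal form of \xff\xff\xff\xee\xff, the sequence given in the post.
    printf '\377\377\377\356\377' > "$edits"
}
```

You would call it as reset_edits /path/to/name/dir, substituting your own dfs.name.dir value, and then start Hadoop as before.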
Friday, November 04, 2011
Bash - prevent multiple copies of script from running
Since each command in a bash script spawns its own process, we can't simply have a command lock a file to get single-copy semantics. Why? Because file locks are held per process and are cleared automatically when that process dies. So it makes no sense to expect a Linux command to lock a file on the script's behalf: as soon as that command returns, the file is unlocked again, which defeats the purpose of the lock completely!
One easy way to prevent multiple copies from running is to use a Linux command that atomically performs an operation and reports whether it succeeded, and that fails when the operation has already been done. The command to make a directory, mkdir, is one such command.
So the script can try to mkdir a particular directory, which we will call the lock directory. If that fails, another copy holds the lock and we don't start. If it works, we must remove the lock directory when the script ends so that the script can run again. We do this using the trap command: a trap on EXIT makes sure a given command is executed when the script exits, at whatever point that happens.
Here is the code:
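#!/bin/bash
# Try to take the lock. mkdir fails if the directory already exists,
# which means another copy is running.
mkdir /tmp/locka 2>/dev/null || {
    echo "another copy is already running" >&2
    exit 1
}
# Release the lock whenever the script exits.
trap "rmdir /tmp/locka" EXIT
# Script work goes here. The sleep 10 below is to test this
# without having a real script.
sleep 10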
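To see the mkdir behaviour on its own, here is a small sketch that takes the lock twice in a row. The function names acquire_lock and release_lock and the demo directory name are made up for this illustration and are not part of the script above.

```shell
# Hypothetical helpers wrapping the mkdir/rmdir pair used as a lock.
acquire_lock() { mkdir "$1" 2>/dev/null; }
release_lock() { rmdir "$1" 2>/dev/null; }

# Demo lock directory; the name is arbitrary.
lockdir="${TMPDIR:-/tmp}/demo-lock.$$"

acquire_lock "$lockdir" && echo "first attempt: got the lock"
acquire_lock "$lockdir" || echo "second attempt: refused, lock is held"
release_lock "$lockdir"
acquire_lock "$lockdir" && echo "after release: got the lock again"
release_lock "$lockdir"
```

All three messages should be printed, in that order: the second mkdir fails because the directory already exists, and succeeds again once rmdir has removed it.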
Thursday, July 28, 2011
Java : write binary data to a mysql out file
I needed to generate, from within Java code, a MySQL outfile containing both text and binary data. The binary data is content that has been gzipped and stored as a blob in a MySQL table. While it is trivial to write binary data to a blob field directly using JDO, for performance reasons we had to use the "LOAD DATA INFILE" approach. Thus the first step was to create an outfile.
Here is the function that converts binary data to a form that can be written to an outfile. It follows the algorithm implemented by MySQL for its "SELECT INTO OUTFILE" functionality as described here.
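import java.io.ByteArrayOutputStream;

// Escape a blob for an outfile that uses tab-separated fields,
// newline-terminated lines and backslash as the escape character.
public static byte[] getEscapedBlob(byte[] blob) {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    for (int i = 0; i < blob.length; i++) {
        if (blob[i] == '\t' || blob[i] == '\n' || blob[i] == '\\') {
            // Field terminator, line terminator and the escape character
            // itself are written with a backslash in front.
            bos.write('\\');
            bos.write(blob[i]);
        } else if (blob[i] == 0) {
            // A NUL byte is written as a backslash followed by ASCII '0'.
            bos.write('\\');
            bos.write('0');
        } else {
            bos.write(blob[i]);
        }
    }
    return bos.toByteArray();
}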
This is how you would use this function to generate a MySQL outfile.
// Generate an infile for MySQL.
byte[] out = getEscapedBlob(data);
String nonBlobFields = "\\N\t10\t20100301\t18\t1102010\t2010-03-01 00:00:00\t";
byte[] nonBlobData = nonBlobFields.getBytes("UTF-8");
// try-with-resources closes the stream even if one of the writes fails.
try (BufferedOutputStream f = new BufferedOutputStream(new FileOutputStream("/path/to/data.csv"))) {
    f.write(nonBlobData, 0, nonBlobData.length);
    f.write(out, 0, out.length);
    f.write('\n');
}
This writes the non-blob fields (a NULL, some integers and a timestamp) followed by the blob data to the outfile, which can then be loaded back using "LOAD DATA INFILE".
Thursday, July 21, 2011
Ubuntu : Install packages on a cluster of machines
Sometimes you have a cluster of machines on which some packages need to be installed. It would be nice to automate this so that you can do everything from a single terminal. We have seen before how a command can be run on multiple machines from a single terminal. This only works if you have password-less ssh set up between the machine you are running the command from and the machines on which you want the command to actually run. The one aspect that makes installing software a little harder is that you need to be root to install packages, and ssh keys are generally not set up for root.
However, sudo has an option, -S, that makes it read the password from stdin. Combined with a bash loop, this gives a one-liner that installs a package across a cluster of machines.
for m in m1 m2 m3 m4 ; do echo $m; ssh $m "echo password | sudo -S apt-get -y install curl" ; done
The -S option makes sudo read the password from standard input instead of the terminal, so the command will not prompt you for a password or complain about a missing tty. The -y option for apt-get prevents it from prompting you prior to the install.
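If you would rather not put the password inside the command string, one possible variation is to pipe it through ssh's standard input, which ssh forwards to sudo -S on the remote side. This is a sketch under a few assumptions: install_pkg and the SUDO_PASS variable are names invented here, the package name is assumed to be a plain word with no shell metacharacters, and the list of failed hosts is an extra that the one-liner does not have.

```shell
# install_pkg PACKAGE HOST...
# Hypothetical helper. Expects the sudo password in SUDO_PASS (assumed name).
install_pkg() {
    local pkg="$1"
    shift
    local failed=""
    local m
    for m in "$@"; do
        echo "== $m"
        # The password goes over ssh's stdin to sudo -S. The -p '' option
        # gives sudo an empty prompt string so nothing is echoed back.
        if ! printf '%s\n' "$SUDO_PASS" | ssh "$m" "sudo -S -p '' apt-get -y install $pkg"; then
            failed="$failed $m"
        fi
    done
    # Report the machines where the install did not succeed.
    [ -z "$failed" ] || { echo "failed on:$failed" >&2; return 1; }
}
```

In bash you could read the password once with read -rs SUDO_PASS and then run install_pkg curl m1 m2 m3 m4.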
Friday, July 15, 2011
Mac / Microsoft Excel / newlines (\r \n)
It is frequently the case that the business department hands over Excel files to the engineering department for some type of data processing. The first step is to convert the spreadsheet to a proper comma-separated text file (csv).
If you are doing this conversion using Microsoft Excel on a Mac, you'll note that the resulting file does not have Unix-style newlines. A Unix newline is the 0x0a character, also written as \n. What Excel produces is the 0x0d character, also written as \r.
Most Linux commands do not recognize \r as a line ending. There are several ways to convert the \r characters to proper Linux-style line endings, and using the vi editor is a common one. However, if the Excel spreadsheet has blank columns, Excel sometimes insists on writing a possibly large number of \r characters at the end of the csv file. The vi method would write a newline for each of these \r characters, and that is not ideal.
Instead, you could use this perl one-liner to do both: turn every \r into \n, and drop the trailing \r characters:
perl -ne 's/([^\r])\r/$1\n/g; s/\r//g; print;' imported.csv
The first regular expression replaces any non-\r character followed by \r with that same character followed by \n. A \r that directly follows another \r does not match this pattern, so the surplus trailing \r characters are left alone. The second regular expression then removes them.