Thursday, July 28, 2011

Java : write binary data to a MySQL outfile

I needed to generate, within Java code, a MySQL outfile containing both text and binary data. The binary data is content that has been gzipped and stored as a blob in a MySQL table. While it is trivial to write binary data to a blob field directly using JDO, for performance reasons we had to use the "LOAD DATA INFILE" approach. The first step, then, was to create an outfile.

Here is a function that converts binary data into a form that can be written to an outfile. It follows the escaping algorithm that MySQL implements for its "SELECT ... INTO OUTFILE" functionality, as described in the MySQL documentation.

    public static byte[] getEscapedBlob(byte[] blob) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        for (int i = 0; i < blob.length; i++) {
            if (blob[i] == '\t' || blob[i] == '\n' || blob[i] == '\\') {
                // escape the field separator, line terminator and escape character
                bos.write('\\');
                bos.write(blob[i]);
            } else if (blob[i] == 0) {
                // ASCII NUL is written as the two characters '\' and '0'
                bos.write('\\');
                bos.write('0');
            } else {
                bos.write(blob[i]);
            }
        }
        return bos.toByteArray();
    }

Here is how you would use this function to generate a MySQL outfile.

    // generate the data file for the mysql load; "data" holds the gzipped blob bytes
    byte[] out = getEscapedBlob(data);
    BufferedOutputStream f = new BufferedOutputStream(new FileOutputStream("/path/to/data.csv"));
    // the non-blob columns: \N denotes NULL, fields are tab separated
    String nonBlobFields = "\\N\t10\t20100301\t18\t1102010\t2010-03-01 00:00:00\t";
    byte[] nonBlobData = nonBlobFields.getBytes("UTF-8");
    f.write(nonBlobData, 0, nonBlobData.length);
    f.write(out, 0, out.length);
    f.write('\n');
    f.close();


This writes some integer data followed by the blob data to the outfile, which can then be loaded back using "LOAD DATA INFILE".
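
As a sketch of the load step (the user, database and table names below are placeholders), the file can be loaded with the mysql client. The format written above matches the LOAD DATA defaults: tab-separated fields, \n line endings, backslash escaping and \N for NULL.

    # placeholders: adjust user, database and table to your schema;
    # depending on server settings you may also need to enable --local-infile
    mysql --local-infile=1 -u dbuser -p mydb \
      -e "LOAD DATA LOCAL INFILE '/path/to/data.csv' INTO TABLE my_table"

With LOCAL, the file is read by the client, so the path only needs to be readable on the machine running the command.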

Thursday, July 21, 2011

Ubuntu : Install packages on a cluster of machines

Sometimes you have a cluster of machines on which some packages need to be installed. It would be nice to automate this so that you can do everything from a single terminal. We have seen before how a command can be run on multiple machines from a single terminal. This only works if you have password-less ssh set up between the machine that you are running the command from and the cluster on which you want the command to actually run. The only aspect that makes this a little harder for installing software is that you need to be root to install packages, and ssh keys are not generally set up for root.

However, sudo has a -S option that makes it read the password from stdin. We can combine this with the bash loop to come up with a one-liner that installs a package across a cluster of machines.

for m in m1 m2 m3 m4 ; do echo $m; ssh $m "echo password | sudo -S apt-get -y install curl" ; done

The -S option ensures that the command will not prompt you for a password or complain about a missing tty. The -y option for apt-get prevents it from asking for confirmation prior to the install.
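
As a variation (a sketch only, with the same placeholder host names), you can prompt for the password once with read -s so that it is not hard-coded in the command or saved in your shell history:

    read -s -p "sudo password: " pw; echo
    for m in m1 m2 m3 m4 ; do
        echo $m
        # note: the password is still briefly visible in the remote process list
        ssh $m "echo '$pw' | sudo -S apt-get -y install curl"
    done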

Friday, July 15, 2011

Mac / Microsoft Excel / newlines (\r \n)

It is frequently the case that the business department hands over Excel files to the engineering department for some type of data processing. The first step here is to convert them to proper comma-separated text files (csv).

If you are doing this conversion using Microsoft Excel on a Mac, you'll notice that the resulting file does not have Unix-style newlines. A Unix newline is the 0x0a character, also written as \n. What Excel produces is the 0x0d character, also written as \r.
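
You can see this for yourself by dumping the first few bytes of the file (just one way to check; od is part of coreutils):

    head -c 64 imported.csv | od -c

The \r characters will show up where you would expect \n.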

Most Linux commands do not recognize \r as a line ending. There are several ways to convert the \r characters to proper Linux-style line endings; using the vi editor is a common one. However, there is an additional issue: if the Excel spreadsheet has blank columns, Excel sometimes insists on writing a possibly large number of \r characters at the end of the csv file. The vi method would write a newline for each of these \r characters, which is not ideal.

Instead, you could use this perl one-liner to accomplish both at once: turn every \r into a \n, except for the trailing \r characters, which are dropped:

perl -ne 's/([^\r])\r/$1\n/g; s/\r//g; print;'  imported.csv

The first regular expression replaces any non-\r character followed by \r with that character followed by a \n. Since the trailing \r characters do not match this pattern, they are left alone; the second regexp then removes them.
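
The one-liner writes the converted text to stdout, so in practice you would redirect it to a new file (the output filename here is just an example):

    perl -ne 's/([^\r])\r/$1\n/g; s/\r//g; print;'  imported.csv > imported-unix.csv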

Wednesday, July 06, 2011

Linux Shell, HUP and process status on logout

It used to be the case that all processes a user started were killed by the shell upon logout. Not any more, as recent experiments with Ubuntu 10.04 show.

The shell can be configured to send a HUP signal to its children when the shell exits. This is controlled by the huponexit shell option as explained in the bash man page:

If the huponexit shell option has been set with shopt, bash sends a SIGHUP to all jobs when an interactive login shell exits.

Determine the setting of huponexit with:

shopt huponexit

If it is "off", then processes started by the user will remain running after logout. This setting makes it easier to start a long running process simply from within the shell, without invoking a screen and without having to wrap the process in nohup.

Here is a discussion on the issue.

However, this setting can cause problems for interactive sessions, where a new login may end up referring to an earlier user's now-stale processes.