Wednesday, December 29, 2010

bash one liner : find process start time

Sometimes you want to find the start time of a process. You might want to check the start time of a particular process running on many Linux servers. Rather than logging into each machine, we can use password-less ssh to do this from one machine. But first, we need to craft the bash one liner to do this on one particular machine.

Say you want to find out the start time of a process called 'foobar'. This is what you could do:

ps auxww | grep foobar | grep -v '/bin/sh' | grep -v grep | tr -s '\t' ' ' | cut -f 9 -d ' '

Notice I use "grep -v" to eliminate certain processes that are not relevant. I omit the process that starts foobar ("/bin/sh"). I also omit the "grep" command we are using from the output. If your process is not started explicitly by the shell, you need not do the former, but filtering out "grep" is always useful.

The interesting parts are at the end of the command line. We want the 9th column, which holds the process start time (the START column of ps aux output). However, since ps pads its columns with runs of whitespace, we first need to squeeze those runs down to a single space so that cut can split on it. Here the tr -s invocation converts any tabs to spaces and squeezes repeated spaces into one.
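
For reference, a ps aux line looks roughly like this (the values are made up); START is the 9th column once the whitespace runs are squeezed:

USER   PID %CPU %MEM    VSZ   RSS TTY  STAT START   TIME COMMAND
app   4321  0.0  0.1  12345  6789 ?    S    Oct12   0:03 /usr/bin/foobar --daemon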

Now that we have this handy command, and you are tasked with checking the process start time across a dozen or more machines, it is simple enough to wrap this in a nice one-liner bash loop:

for m in host1 host2 host3 host4 ; do echo $m; ssh $m "ps auxww | grep foobar | grep -v '/bin/sh' | grep -v grep | tr -s '\t' ' ' | cut -f 9 -d ' '"  ; done

Monday, December 20, 2010

python : don't use sys.exit() inside signal handlers

It is common to want to exit the program on handling a kill signal. But you should probably not use the standard sys.exit() function for this. Instead use the os._exit() function.

The reason is that sys.exit() works by raising a SystemExit exception, which propagates up from whatever stack frame was executing when the interpreter received the signal. If that frame happens to be inside a try/except block that catches broadly, the exception is swallowed, control returns to that block, and this is probably not what you intended.

This happened to me in an automated script last night, and since I wasn't aware of this behaviour of sys.exit(), it puzzled me a bit. The logs showed that the script was stopping, but then it carried on from the point where the kill had interrupted it.

Here is the relevant part of the log:

running update table set somedate="2010-12-20 10:34:27" where id=4329
running update table set somedate="2010-12-20 10:34:27" where id=4330
Stopping as requested..
commiting mysql buffers
stopped
failed: update table set somedate="2010-12-20 10:34:27" where id=4330
running update table set somedate="2010-12-20 10:34:27" where id=4346
failed: update table set somedate="2010-12-20 10:34:27" where id=4346


Notice how the script just carried on from the point of interruption, and how every statement fails after the failed stop. The failures are due to the cleanup done in the signal handler: the db connection has already been closed.

Here is the stack trace at the point where the signal was received (I could get this by doing another "kill", as the try/except logic was particularly long and the script was still stuck there; you might not be so lucky!):

Traceback (most recent call last):
  File "/path/to/script.py", line 165, in <module>
    exec_retry(cursor,mysql,1)
  File "/path/to/script.py", line 72, in exec_retry
    time.sleep(secs)
  File "/path/to/script.py", line 59, in kill_handler
    conn.commit()
_mysql_exceptions.OperationalError: (2006, 'MySQL server has gone away')


This is the point where the signal was received, specifically within the time.sleep() call:

def exec_retry(cursor, cmd, secs):
    retries=0
    while retries<2:
        try:
            return cursor.execute(cmd)
        except:
            retries+=1
            time.sleep(secs)
    print "failed: %s" % cmd
    return 0
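
For completeness, here is a minimal sketch of the recommended pattern (the handler name and the cleanup steps are illustrative, not from the original script):

import os
import signal

def kill_handler(signum, frame):
    print "Stopping as requested.."
    # do whatever cleanup must happen here (flush logs, commit/close the db, ...)
    # sys.exit(0) would merely raise SystemExit in the interrupted frame,
    # where a broad try/except (like the retry loop above) can swallow it
    os._exit(0)    # terminate immediately, no exception propagation

signal.signal(signal.SIGTERM, kill_handler)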


Friday, December 10, 2010

use perl BEGIN / END blocks for summations

Various Perl one-liners are very useful in data manipulation. The "-ne" mode in perl runs the given code over each line of stdin. However, if you want to do a summation and only print the final tally, you can make use of the BEGIN / END blocks in Perl: initialize the counter in the BEGIN block, print the sum in the END block.

Say there is a file of numbers called "nums", one number per line, and we want to sum them:

cat nums | perl -ne 'BEGIN{$s=0;} chomp; $s+=$_; END {print "$s\n"}'
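
For example, with a three-line file (the numbers are made up):

$ cat nums
3
5
7
$ cat nums | perl -ne 'BEGIN{$s=0;} chomp; $s+=$_; END {print "$s\n"}'
15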

Tuesday, November 23, 2010

bash one liners to setup password-less SSH


ssh user@host "cat >> .ssh/authorized_keys2" < .ssh/id_rsa.pub

This will append the ssh public key on the local machine to the authorized_keys file in the remote machine so that the local machine will in the future be able to ssh to the remote without a password.

If you are setting up multiple machines this way, this one liner is faster than having to ssh into each remote to update the authorized_keys file.

You could build on this to setup multiple machines with a single command:


for m in host1 host2 host3 host4; do ssh user@$m "cat >> .ssh/authorized_keys2" < .ssh/id_rsa.pub; done
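
All of this assumes a public key already exists on the local machine; if it doesn't, generate a key pair first (a sketch, using an empty passphrase):

[ -f ~/.ssh/id_rsa.pub ] || ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa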

Monday, November 15, 2010

C2 A0 characters confusing bash?

Have you had a perfectly typed shell command fail on you, like this :

user@host:$ ps auxww | grep java
 grep: command not found


Here is the hex output of a correct "grep" and an incorrect "grep" command line:

user@host$ hexdump -C /tmp/x
00000000  70 73 20 61 75 78 77 77  20 7c 20 67 72 65 70 20  |ps auxww | grep |
00000010  6a 61 76 61 0a                                    |java.|
00000015
user@host$ hexdump -C /tmp/y
00000000  70 73 20 61 75 78 77 77  20 7c c2 a0 67 72 65 70  |ps auxww |..grep|
00000010  20 6a 61 76 61 0a                                 | java.|
00000016


The second output is the faulty one: the byte sequence "C2 A0" causes the problem. C2 A0 is the UTF-8 encoding of the non-breaking space (U+00A0), and somehow my keyboard at times produces it instead of "20", the plain space, thus confusing the shell.

This is on an ssh session to Linux 2.6, from a Mac.
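
If you suspect a pasted command or a script contains these bytes, one way to scrub them is a substitution along these lines (a sketch, assuming GNU sed, which understands \xHH escapes; the filename is illustrative):

# replace UTF-8 non-breaking spaces (C2 A0) with plain spaces, in place
LC_ALL=C sed -i 's/\xc2\xa0/ /g' myscript.sh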

Saturday, November 13, 2010

mysql deadlocks with concurrent inserts

It is possible to cause deadlocks in mysql (InnoDB) with concurrent insert statements, without any explicit multi-statement transactions in progress. Deadlocks are possible even when the inserts don't collide on any key.

The deadlocks occur due to gap locking done by mysql. There are several reasons for gap locking, and in this particular case, it has to do with preserving a unique key constraint on an index. The situation presents itself to us this way: There is a unique key constraint on a column and we are doing an insert. Mysql has to make sure that the lock it takes is sufficient to prevent another concurrent insert from adding a record with the same key, thus breaking the unique key constraint.

The mysql InnoDB engine performs row locking on inserts. If column A has a unique key constraint, and we are adding the value "bbb" for column A in an insert statement, mysql needs to lock the gap in the index between the two existing records where "bbb" will be inserted.

To illustrate the deadlock, let us start with a table schema:

CREATE TABLE vegetable (
   id bigint(10) NOT NULL auto_increment,
   name varchar(255) NOT NULL,
   PRIMARY KEY (id),
   UNIQUE KEY uk_name (name)
) ENGINE=InnoDB;

Let us assume the existence of these records in the table, and look at them in 'name' index order:

id name
10 ggg
05 jjj

Now, imagine two concurrent connections executing the following inserts in the following order:

Connection 1:

insert ignore into vegetable values(null, "ppp");

For this insert to proceed, connection 1 will lock the gap between "jjj" and "ppp" in the name index.

Connection 2:

insert ignore into vegetable values (null,"iii");

This will require locking the gap after "ggg", up to "iii". Since the lock from connection 1 does not span this, it will succeed.

insert ignore into vegetable values (null, "mmm");

This needs to lock the gap after "jjj" up to "mmm". Since connection 1 holds a lock between "jjj" and "ppp", effectively spanning the lock connection 2 is attempting to take, this will block.

Connection 1:

insert ignore into vegetable values (null, "hhh");

This requires the gap lock between "ggg" and "hhh". It will block, as it falls within the ["ggg" to "iii"] lock held by connection 2.


Thus we have both connections blocked on each other. This is the deadlock.

Here is the same sequence laid out as a timeline. The steps are numbered 1) through 4):

Step  Connection 1     Connection 2     Result
1)    insert "ppp"                      locks the gap ("jjj", "ppp"]
2)                     insert "iii"     locks the gap ("ggg", "iii"]
3)                     insert "mmm"     blocks on connection 1's ("jjj", "ppp"] lock
4)    insert "hhh"                      blocks on connection 2's ("ggg", "iii"] lock - deadlock
You can avoid this if you can order the inserts on each connection in the same direction. The deadlock happens because connection 2 inserts in ascending order of the index, while connection 1 inserts in descending order.

If you can't do this for practical reasons, you could retry the operation. Unless there is a high level of concurrency with a high load on the db where each transaction takes a heavy hit, a simple retry should work.
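
As a sketch of the retry approach, in Python with MySQLdb to match the scripts above (1213 is MySQL's ER_LOCK_DEADLOCK error code; the function and variable names are illustrative):

import MySQLdb

def insert_with_retry(cursor, sql, max_retries=3):
    for attempt in range(max_retries):
        try:
            return cursor.execute(sql)
        except MySQLdb.OperationalError, e:
            if e.args[0] != 1213:   # not a deadlock, re-raise
                raise
    return 0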

Monday, November 08, 2010

Bash math, command expansion and pipes


How do you add the number of lines in two files?


echo $((`wc -l /path/to/file1.txt|cut -f 1 -d ' '`+`wc -l /path/to/file2.txt|cut -f 1 -d ' '`)) 

Saturday, October 23, 2010

Handling gzipped HTTP response with Transfer-Encoding: chunked

This explains the basic protocol of sending data using Transfer-Encoding: chunked. Quite a number of servers send gzipped data this way. I needed to handle this for the asynchronous crawler I built.

A chunked transfer consists of a series of chunks. Each chunk is preceded by its size in hexadecimal, terminated by a CRLF pair, and the transfer ends with a final zero-length chunk.
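
For reference, a chunked, gzipped response looks roughly like this on the wire (sizes are in hex; the payload placeholders are made up):

HTTP/1.1 200 OK
Content-Encoding: gzip
Transfer-Encoding: chunked

1f4
<0x1f4 = 500 bytes of the gzip stream>
a0
<0xa0 = 160 more bytes of the gzip stream>
0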

The first time I wrote code to handle the chunked, gzipped transfer, I had the code decompress each chunk individually. This worked. But later, content from other URLs did not decompress properly. Probing with Wireshark, I came to realize that individual chunks cannot be reliably decompressed, because the server does not compress each chunk separately: it compresses the full content, then chunks up the compressed stream and sends each chunk on the wire. So I needed to reassemble the full message first and then decompress it. The reason my first attempt worked for the URL I was testing was that, in that particular case, all the data arrived in a single chunk.

Here is the relevant code as I couldn't find this easily anywhere:


String enc = httpHeaders.get("Content-Encoding");

if (enc != null && enc.toLowerCase().equals("gzip")) {
  String te = httpHeaders.get("Transfer-Encoding");
  if (te != null && te.toLowerCase().equals("chunked")) {
    int idx = httpHeaders.length; //start of the body, right after the headers
    ByteArrayOutputStream os = new ByteArrayOutputStream();
    int numBytes = -1;

    try {
      do {
        //read the chunk size line (hex digits terminated by CRLF)
        StringBuilder numBytesBuf = new StringBuilder();
        for (; idx<bytes.length && bytes[idx]!='\r'; idx++) {
          if (Utils.isHex((char)bytes[idx]))
            numBytesBuf.append((char)bytes[idx]);
        }
        if (idx >= bytes.length)
          throw new IOException("incorrect chunked encoding for : " + retryURL.getCurrentURL() + " based on " + retryURL.getOrigURL());
        idx += 2; //skip over '\r\n'
        try {
          numBytes = Integer.parseInt(numBytesBuf.toString(), 16);
        } catch (NumberFormatException e) {
          throw new IOException("incorrect chunked encoding for : " + retryURL.getCurrentURL() + " based on " + retryURL.getOrigURL(), e);
        }
        if (numBytes > 0) {
          //idx points to the start of the chunk data and numBytes is its length;
          //append the raw (still compressed) chunk to the buffer
          os.write(bytes, idx, idx+numBytes <= bytes.length ? numBytes : bytes.length-idx);
          if (idx+numBytes > bytes.length) {
            System.err.println("incorrect chunked encoding, " + (idx+numBytes) + " is outside " + bytes.length + " for: " + retryURL.getCurrentURL() + " based on " + retryURL.getOrigURL());
            break;
          }
        }
        idx += (numBytes+2); //+2 for the '\r\n' trailing the chunk data
      } while (numBytes > 0);

      //decompress the reassembled gzip stream in one go
      GZIPInputStream zip = new GZIPInputStream(new ByteArrayInputStream(os.toByteArray(), 0, os.size()));
      byte[] buf = new byte[1024];
      int len;

      try {
        while ((len = zip.read(buf, 0, buf.length)) > 0) { //decompress from <zip> into <buf>
          f.write(buf, 0, len);                            //write the decompressed bytes out via <f>
        }
      } catch (IOException e) {
        if (!e.getMessage().equals("Corrupt GZIP trailer") && !e.getMessage().equals("Unexpected end of ZLIB input stream"))
          throw e;
        else {
          System.out.println("handled spurious " + e.getMessage() + " on: " + retryURL.getCurrentURL() + " based on " + retryURL.getOrigURL());
        }
      }
      os.close();
      zip.close();
    } catch (Exception e) {
      System.err.println("failed on: " + retryURL.getCurrentURL() + " based on " + retryURL.getOrigURL());
      System.err.println(ExceptionsUtil.toString(e));
    }
  }
}

GZipInputStream throws spurious exceptions in how it handles the end of the file

Sun bug database mentions here and here about spurious errors thrown by GZipInputStream.

But unfortunately, there are no fixes. These spurious exceptions are also not thrown just for large files; I have seen them for files as small as 5K. Here is a set of decompressed files, sorted by size, that I managed to generate by ignoring these spurious errors:

x02:~$ cat /tmp/z | perl -ne 'if (/based on (.*)$/) {system("ls -ltr `echo -n $1|md5`.url");}' | sort -n -t ' ' -k 5
-rw-r--r-- 1 mpire mpire 5501 2010-10-23 14:03 409ff1c2b7ce2887db8a5c98d395b543.url
-rw-r--r-- 1 mpire mpire 38681 2010-10-23 14:03 8bc8f64132dd2e4bdbccff555cfa6966.url
-rw-r--r-- 1 mpire mpire 44554 2010-10-23 14:01 ba1c53b23f747efb3aa3a7531da80fb1.url
-rw-r--r-- 1 mpire mpire 45415 2010-10-23 14:03 073cd89a26dd69bac8f8a734bcaec7f1.url
-rw-r--r-- 1 mpire mpire 46058 2010-10-23 14:00 f0ae11a51e1975838c26c428ce14308a.url
-rw-r--r-- 1 mpire mpire 46192 2010-10-23 14:03 73972b286ec326e91404008b4c125e5a.url
-rw-r--r-- 1 mpire mpire 46414 2010-10-23 14:00 c6389b488bf912ddece9075884ff7c80.url
-rw-r--r-- 1 mpire mpire 47030 2010-10-23 14:00 07d9fe64764458b55626d9eb047d5d4b.url
-rw-r--r-- 1 mpire mpire 47565 2010-10-23 14:03 67a9869337a777c7c6a8411fd55e1b39.url
-rw-r--r-- 1 mpire mpire 49034 2010-10-23 14:01 0792cd32f0ef59cbe3592a5c2a7b5744.url
-rw-r--r-- 1 mpire mpire 58397 2010-10-23 14:03 3c973b623e9d27cd36234222f6542788.url
-rw-r--r-- 1 mpire mpire 58981 2010-10-23 14:03 780f0be38cbe0e73de788597cb482af4.url
-rw-r--r-- 1 mpire mpire 59177 2010-10-23 14:01 15dd65994702db5e360704144874f3f8.url
-rw-r--r-- 1 mpire mpire 60043 2010-10-23 14:03 c355cd84d17be2cf405b04fb3663d181.url
-rw-r--r-- 1 mpire mpire 63189 2010-10-23 14:03 43ab8fabf72b5564bc0e8b1ff3fcebe7.url
-rw-r--r-- 1 mpire mpire 70235 2010-10-23 14:01 3a24135855df237d829960a15cd8b170.url
-rw-r--r-- 1 mpire mpire 71536 2010-10-23 14:01 48f80b3fb1c51ea52313fe76b55a3849.url
-rw-r--r-- 1 mpire mpire 76932 2010-10-23 14:03 8c789913e5666e793c38d02001486532.url
-rw-r--r-- 1 mpire mpire 78825 2010-10-23 14:01 49728436874d94d8b1ab0ef17d8b4736.url
-rw-r--r-- 1 mpire mpire 80459 2010-10-23 14:05 c52a68d48d2c23ef846950dec999f084.url
-rw-r--r-- 1 mpire mpire 80459 2010-10-23 14:05 c52a68d48d2c23ef846950dec999f084.url
-rw-r--r-- 1 mpire mpire 83001 2010-10-23 14:00 a963a7357bf480fe24040ecf05e8927d.url
-rw-r--r-- 1 mpire mpire 105473 2010-10-23 14:01 db76aed3163beb6ad49670866045a9d8.url
-rw-r--r-- 1 mpire mpire 109405 2010-10-23 14:01 70da57c0544c122766a9ad6772757f2b.url
-rw-r--r-- 1 mpire mpire 110921 2010-10-23 14:01 579e0f08a036befe374ff8c70126a2bb.url
-rw-r--r-- 1 mpire mpire 111880 2010-10-23 14:00 3091fb91ae4a92b06392acead7170a57.url
-rw-r--r-- 1 mpire mpire 116796 2010-10-23 14:05 a326a52188abc263e2f8804444b48c8c.url
-rw-r--r-- 1 mpire mpire 154209 2010-10-23 14:06 e90211e8f799a88dc70ff83d1aba0748.url
-rw-r--r-- 1 mpire mpire 159089 2010-10-23 14:01 8ef49b62542f8283acba4e573e28ea59.url
-rw-r--r-- 1 mpire mpire 168786 2010-10-23 14:01 487d9c54fbd3e4b760c91dbf5f754d32.url
-rw-r--r-- 1 mpire mpire 212561 2010-10-23 14:03 b3cb42c946716cfe7e31096bd458e9f2.url
-rw-r--r-- 1 mpire mpire 222257 2010-10-23 14:03 f803ac8246dd2ab7dd16f7d28d5b4594.url


The decompressed files are good. I ran into this issue reading gzipped files from web servers that zip content and send this zipped data in chunks using Transfer-Encoding: chunked.

The error does not seem to be related to the Java libraries. I saved the gzipped content and tried to unzip it with gunzip. This failed as well:

x02:/tmp$ gunzip 0792cd32f0ef59cbe3592a5c2a7b5744.gz 

gzip: 0792cd32f0ef59cbe3592a5c2a7b5744.gz: unexpected end of file

Thursday, October 21, 2010

Extracting HTTP headers, handling partial headers correctly

Here is the correct code for the previous post:

    //extract all HTTP request response headers locally and return the size of the headers
    private int extractHttpHeaders(byte[] httpHeader) {
        final String delimiter = ": ";
        List<StringBuilder> headers = new ArrayList<StringBuilder>();
        for (int i=0; i<httpHeader.length; i++) {
            StringBuilder line = new StringBuilder();
            for (; i<httpHeader.length && (char)httpHeader[i]!='\n'; i++) {
                line.append((char)httpHeader[i]);
            }
            if (i==httpHeader.length) {
                //partial header, full headers have not been received yet
                //this will break out of the loop
            }
            else if (line.length()==0 || (line.length()==1 && (char)httpHeader[i-1]=='\r')) {
                //all headers received
                httpHeaders.length = i+1;
                return i+1;
            } else {
                //line has a header, add it
                int colonAt = line.indexOf(delimiter);
                if (colonAt != -1) {
                    String value = line.substring(colonAt+delimiter.length(), line.charAt(line.length()-1)=='\r' ? line.length()-1 : line.length()).trim();
                    if (value.length() > 0)
                        httpHeaders.put(line.substring(0,colonAt), value);
                }
            }
        }
        //partial header, full headers have not been received yet
        httpHeaders.clear();
        return -1; //full headers have not been received yet
    }

The only difference is unconditionally clearing the headers map if we never see the blank line that terminates the headers (an empty line, i.e. a lone CRLF pair). If that has not been seen, we are still looking at a partial header, even though each line received so far may be complete.

Extracting HTTP headers, handling partial headers (incorrectly)

This one is for the network fiends who are crazy enough to read raw HTTP packets and construct HTML content. I had to do this to take advantage of the highly efficient epoll() mechanism in Linux. The full framework is described here.

The problem is that each time I get some data from a server, I need to anticipate partial data. It is possible that the client has received only a partial HTTP header. If so, the logic should discard whatever header data it has built up so that parsing can be attempted again later, hopefully once the full headers have been delivered.

I want to illustrate this version that has a subtle bug:

    //extract all HTTP request response headers locally and return the size of the headers
    private int extractHttpHeaders(byte[] httpHeader) {
        final String delimiter = ": ";
        List<StringBuilder> headers = new ArrayList<StringBuilder>();
        for (int i=0; i<httpHeader.length; i++) {
            StringBuilder line = new StringBuilder();
            for (; i<httpHeader.length && (char)httpHeader[i]!='\n'; i++) {
                line.append((char)httpHeader[i]);
            }
            if (i==httpHeader.length) {
                //partial header, full headers have not been received yet
                httpHeaders.clear();
            }
            else if (line.length()==0 || (line.length()==1 && (char)httpHeader[i-1]=='\r')) {
                //all headers received
                httpHeaders.length = i+1;
                return i+1;
            } else {
                //line has a header, add it
                int colonAt = line.indexOf(delimiter);
                if (colonAt != -1) {
                    String value = line.substring(colonAt+delimiter.length(), line.charAt(line.length()-1)=='\r' ? line.length()-1 : line.length()).trim();
                    if (value.length() > 0)
                        httpHeaders.put(line.substring(0,colonAt), value);
                }
            }
        }
        return -1; //full headers have not been received yet
    }


It uses the Headers class - the httpHeaders you see there is an object of that type. Here is the code for the Headers class:

    class Headers {
        private Map<String, String> httpHeaders = new HashMap<String, String>();
        private int length = -1;
        private int getContentLength() {
            String s = httpHeaders.get("content-length");
            try {
                return s != null ? Integer.valueOf(s) : Integer.MAX_VALUE;
            } catch (NumberFormatException e) {
                return Integer.MAX_VALUE;
            }
        }
        private void clear() {
            httpHeaders.clear();
        }
        private void put(String k, String v) {
            httpHeaders.put(k.toLowerCase(),v);
        }
        private String get(String k) {
            return httpHeaders.get(k.toLowerCase());
        }
        private int numHeaders() {
            return httpHeaders.size();
        }
        private boolean isBinary() {
            return Utils.isContentTypeBinary(httpHeaders.get("content-type"));
        }
    };

Before I point out the bug, here is how this is supposed to work. I have an asynchronous network probing loop that fills byte arrays with data from the servers. Whenever I have new data in these byte arrays, I call extractHttpHeaders(). But that code has to guard against working with partial headers. If the headers are partial, I abandon parsing, clear the headers data structure I was building, and return.

In the main probe code, I would use Headers.numHeaders() to determine if all headers have been received. Since I clear the headers on partial headers, this is an acceptable method.

Except there is a subtle bug: sometimes I can receive a partial header but return without clearing the headers map. Then, the next time the probe code gets some data, it will incorrectly assume that the full headers have already been received (as Headers.numHeaders() will return a positive number).

To appreciate the bug, imagine I receive headers on an exact line boundary. For example, assume I receive this:

Date: Thu, 21 Oct 2010 18:39:20 GMT
Server: Apache/2.2.15 (Fedora)
X-Powered-By: PHP/5.2.13

Then I would never hit the partial-line condition: every line received so far is complete, each with its terminating CRLF, even though the blank line that ends the headers has not arrived yet.

This is the condition that clears the headers on partial lines:

            if (i==httpHeader.length) {
                //partial header, full headers have not been received yet
                httpHeaders.clear();
            }

But that condition won't be hit in this case, as we only have complete lines. The code exits the for loop without clearing the headers map.

A simple change in the first code block is all that is required to fix this. I will have that on the next post.

Monday, October 18, 2010

Mysql - LOAD DATA INFILE could lose data

Last week, when we encountered a sudden data loss in mysql, the first thing to do was to pore through the bin logs looking for "DELETE" statements. It took quite a while to figure out that the data "loss" was really a change of ids in one table, and that it had to do with a LOAD DATA INFILE statement.

To illustrate, assume we have these two tables:


CREATE TABLE vegetables (
  id bigint(10) NOT NULL AUTO_INCREMENT,
  name varchar(255) NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY uk_name (name)
);

CREATE TABLE sales (
  id bigint(10) NOT NULL AUTO_INCREMENT,
  saledate datetime,
  vegetableId bigint(10) NOT NULL,
  PRIMARY KEY (id)
);


The sales table has information about vegetable sales, each sales record has a connection to a vegetable record.

If we load data into the vegetables table with a LOAD DATA statement of this sort, it is possible to break the existing, implicit foreign key relationships from the sales records to the vegetables records:


load data infile '/path/to/file.csv' replace into table vegetables;


The problem is the keyword "REPLACE" used here. It tells mysql what to do if it finds that any record in the csv file violates a key constraint. In this example, if the csv file had a record with the vegetable name "brocolli", and brocolli was already in the vegetables table, then the keyword "REPLACE" tells mysql to replace the row on the table with the new data from the csv file.

This still seems harmless, as replacing "brocolli" with "brocolli" is certainly not earth shattering. But what about the id? REPLACE deletes the old row and inserts a new one, and since id is an auto_increment field, the new row gets a fresh id. That means any sales record that referenced brocolli is now left with a pointer to nowhere. In other words, for all practical purposes, all sales of brocolli have disappeared.

This problem would not have happened if the sales table explicitly specified the FOREIGN KEY constraint on its vegetableId field.
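
For example, either of these would have surfaced or avoided the problem (a sketch; the constraint name is mine):

-- reject the delete + re-insert that REPLACE performs on a conflicting row
ALTER TABLE sales ADD CONSTRAINT fk_sales_vegetable
    FOREIGN KEY (vegetableId) REFERENCES vegetables (id);

-- or leave existing rows (and their ids) alone when the csv collides on the unique key
LOAD DATA INFILE '/path/to/file.csv' IGNORE INTO TABLE vegetables;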

In any case, is this data loss so terrible? Once we see the mistake, could we not quickly correct it? After all, it is just a change in the ids, and the old sales records are still in the table. Well, think about it: even though those records are there, what if rows were replaced for more than one vegetable? Assume both "brocolli" and "arugala" were replaced. How can we determine which vegetableIds were for brocolli and which were for arugala? The ids in the vegetables table have been reassigned, so it is impossible to reconstruct the connection from the sales table. No, this is a bad one.

In fact, an incident similar to this happened where I work. We had to restore the db to get the data back; it was not a pleasant experience.

Monday, October 11, 2010

Java URLConnection provides no fail-safe timeout on reads

While it does sport a setReadTimeout() call, it does not work if the server returns some data and then gets stuck. There are servers like this in the real world, for example try this:

curl http://www.geo-mobile.com/nuke/index.php

You have to use a separate timer thread to time out such reads.


                try {
                    URLConnection con = urlObj.openConnection();
                    con.setConnectTimeout(5000);
                    con.setReadTimeout(5000);
                    new Thread(new InterruptThread(Thread.currentThread(), con)).start();
                    retVal = new Source(con);
                } catch (IOException e) {
                    System.err.println("aborted parsing due to I/O: " + e.getMessage());
                    throw new IndexingException("aborted parsing due to timeout - " + aUrl);
                }

In this example, the Source constructor (from the Jericho parser) retrieves a web page and can get stuck on a socket read call. The timeouts have no effect there, and it is not possible to interrupt a thread blocked on I/O using Thread.interrupt().

The solution is the hand-crafted InterruptThread class:


public class InterruptThread implements Runnable {
    Thread parent;
    URLConnection con;
    public InterruptThread(Thread parent, URLConnection con) {
        this.parent = parent;
        this.con = con;
    }

    public void run() {
        try {
            Thread.sleep(5000);
        } catch (InterruptedException e) {

        }
        System.out.println("Timer thread forcing parent to quit connection");
        ((HttpURLConnection)con).disconnect();
        System.out.println("Timer thread closed connection held by parent, exiting");
    }
}

Here, the InterruptThread forcibly closes the URLConnection held by the parent thread, forcing the Source constructor of the parent thread to throw an IOException, which is handled by its catch block, thus avoiding the hang.

Tuesday, October 05, 2010

Java : Get difference between two sets

Use this generic method to get the difference between two Sets. The sets should have ordered iterators; I used TreeSet in the calling code.



    // returns the difference between two sets <set1>, <set2> as a pair
    // where pair.one is the set of new entries and pair.two is the set of deleted entries
    // between <set1> and <set2>
    public static <T extends Comparable<T>> Pair<Set<T>,Set<T>> diff(Set<T> set1, Set<T> set2) {
        Set<T> deleted = new TreeSet<T>();
        Set<T> added = new TreeSet<T>();

        Iterator<T> it1 = set1.iterator();
        Iterator<T> it2 = set2.iterator();

        int diff = 0;
        T obj1 = null;
        T obj2 = null;
        while (it1.hasNext() && it2.hasNext()) {
            if (diff == 0) {
                obj1 = it1.next();
                obj2 = it2.next();
            } else if (diff < 0) {
                deleted.add(obj1);
                obj1 = it1.next();
            } else {
                added.add(obj2);
                obj2 = it2.next();
            }

            diff = obj1.compareTo(obj2);
        }

        if (diff < 0) {
            deleted.add(obj1);
            boolean addFlag = true;
            while (it1.hasNext()) {
                obj1 = it1.next();
                diff = obj1.compareTo(obj2);
                if (diff != 0) {
                   deleted.add(obj1);
                } else {
                   addFlag = false; 
                }
            }
            if (addFlag) {
                added.add(obj2);
            }
        } else if (diff > 0){
            added.add(obj2);
            boolean addFlag = true;
            while (it2.hasNext()) {
                obj2 = it2.next();
                diff = obj1.compareTo(obj2);
                if (diff != 0) {
                    added.add(obj2);
                } else {
                    addFlag = false;
                }
            }
            if (addFlag) {
                deleted.add(obj1);
            }
        }
        Iterator<T> remIt = it1.hasNext() ? it1 : it2;
        Set<T> remSet = it1.hasNext() ?  deleted : added;

        T obj;
        while (remIt.hasNext()) {
            obj = remIt.next();
            remSet.add(obj);
        }
        return new Pair<Set<T>, Set<T>>(added, deleted);


    }

Comparing two Integers - don't expect auto-unboxing

If you wish to compare two Integers with ==, explicitly compare their intValue(). Otherwise what happens is a reference comparison of the two objects, not a comparison of their values (and because small values are cached, it can even appear to work for values between -128 and 127).
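
A minimal illustration (the values are chosen to fall outside the small-integer cache):

    Integer a = 1000, b = 1000;
    System.out.println(a == b);                        // false: compares object references
    System.out.println(a.equals(b));                   // true
    System.out.println(a.intValue() == b.intValue());  // true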

I was bitten by this as I was developing a test for a generic diff utility I had written. Here is the test code:


    public void test_diff_sets_rnd() {
        Random r = new Random();
        int n = r.nextInt(100);
        Set<Integer> set1 = new TreeSet<Integer>();
        Set<Integer> set2 = new TreeSet<Integer>();
        List<Integer> list = new ArrayList<Integer>();
        for (int i=0; i<n; i++) {
            int num = r.nextInt(100);
            set1.add(num);
            if (set2.add(num))
                list.add(num);
        }

        Set<Integer> removed = new TreeSet<Integer>();
        Set<Integer> added = new TreeSet<Integer>();

        n = set1.size()>0 ? r.nextInt(set1.size()) : 0;
        for (int i=0; i<n; i++) {
            int num = r.nextInt(set1.size());
            int remElt = list.get(num);
            if (set2.remove(remElt))
                removed.add(remElt);
        }

        n = r.nextInt(100);
        for (int i=0; i<n; i++) {
            int num = 100+r.nextInt(100);
            if (set2.add(num))
                added.add(num);
        }

        Pair<Set<Integer>, Set<Integer>> p = Utils.diff(set1,set2);

        System.out.println("set1:");
        for (Integer elt : set1) {
            System.out.print(elt + "\t");
        }
        System.out.println();

        System.out.println("set2:");
        for (Integer elt : set2) {
            System.out.print(elt + "\t");
        }
        System.out.println();

        System.out.println("added:");
        for (Integer elt : p.one) {
            System.out.print(elt + "\t");
        }
        System.out.println();

        System.out.println("removed:");
        for (Integer elt : p.two) {
            System.out.print(elt + "\t");
        }
        System.out.println();

        assertTrue(p.one.size()==added.size());
        assertTrue(p.two.size()==removed.size());

        for (Iterator<Integer> it1 = p.one.iterator(), it2 = added.iterator(); it1.hasNext();) {
            assertTrue(it1.next().intValue()==it2.next().intValue());
        }
        for (Iterator<Integer> it1 = p.two.iterator(), it2 = removed.iterator(); it1.hasNext();) {
            assertTrue(it1.next().intValue()==it2.next().intValue());
        }

    }

I had to use the intValue() explicitly in the last two for loops.

Tuesday, July 13, 2010

Why am I writing lower_bound() in Java again?

The second and third time I found myself looking for a lower_bound() function in the generally accepted STL sense, I knew I'm not a Java-type developer. Java developers must have no need for these functions that I find very useful. A search for "java lower bound" or "java upper bound" does not yield the expected result.

Apache Commons has some good collections utilities, but not the lower_bound() / upper_bound() pair.

It is still a mystery why the Java library folks decided to leave these functions out; they are quite common for look-up tables. Today, I wanted to use them to look up limit values for input ranges from a table. The idea is that each input range maps to a number. The ranges are consecutive, say 0-25, 25-50, 50-100, 100-500, and each range maps to a number, say 100, 212, 320, 400. The look-up function needs to return the correct number for an input value falling within a range.

For example:

10 : 100
25 : 212
30 : 212
200 : 400

To satisfy the lower_bound, upper_bound standard specs, a number below the smallest range (say -10) would map to 100, and a number above the largest range will map to nothing.

lower_bound() should return an index to the first element in the sorted container that is equal to or above the number being looked up, and -1 if there is no such element.

upper_bound() should return an index to the first element in the sorted container that is above the number being looked up, and -1 if there is no such element.

Here are the functions I ended up writing:

    public static int lower_bound(Comparable[] arr, Comparable key) {
        int len = arr.length;
        int lo = 0;
        int hi = len-1;
        int mid = (lo + hi)/2;
        while (true) {
            int cmp = arr[mid].compareTo(key);
            if (cmp == 0 || cmp > 0) {
                hi = mid-1;
                if (hi < lo)
                    return mid;
            } else {
                lo = mid+1;
                if (hi < lo)
                    return mid<len-1?mid+1:-1;
            }
            mid = (lo + hi)/2;
        }
    }

    public static int upper_bound(Comparable[] arr, Comparable key) {
        int len = arr.length;
        int lo = 0;
        int hi = len-1;
        int mid = (lo + hi)/2;
        while (true) {
            int cmp = arr[mid].compareTo(key);
            if (cmp == 0 || cmp < 0) {
                lo = mid+1;
                if (hi < lo)
                    return mid<len-1?mid+1:-1;
            } else {
                hi = mid-1;
                if (hi < lo)
                    return mid;
            }
            mid = (lo + hi)/2;
        }
    }
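
As a quick illustration of the range table example above, here is a sketch of the look-up using upper_bound() (the array contents are the example values from above; the variable names are mine):

    // ranges 0-25, 25-50, 50-100, 100-500 map to 100, 212, 320, 400
    Integer[] limits = { 25, 50, 100, 500 };
    int[] values = { 100, 212, 320, 400 };

    int j = upper_bound(limits, 30);            // first element above 30 -> index 1
    int result = (j != -1) ? values[j] : -1;    // 212, matching the table above
    // upper_bound(limits, 600) returns -1: above the largest range, maps to nothing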

Here are the tests:

    public void test_lower_bound() {
        final int arrSize = 500;
        final int maxNum = 1000;
        Integer[] bArr = new Integer[arrSize];
        Random r = new Random();
        for (int k=0; k<arrSize; k++) {
            bArr[k]=r.nextInt(maxNum);
        }
        Arrays.sort(bArr);

        final int numIter = 2000;
        for (int k=0; k<numIter; k++) {
            int n = r.nextInt(maxNum);
            int j = Utils.lower_bound(bArr, n);
            assertTrue((bArr[arrSize-1]<n && j==-1) || (bArr[j]>=n && (j==0 || bArr[j-1]<n)));
        }
    }

    public void test_upper_bound() {
        final int arrSize = 500;
        final int maxNum = 1000;
        Integer[] bArr = new Integer[arrSize];
        Random r = new Random();
        for (int k=0; k<arrSize; k++) {
            bArr[k]=r.nextInt(maxNum);
        }
        Arrays.sort(bArr);

        final int numIter = 2000;
        for (int k=0; k<numIter; k++) {
            int n = r.nextInt(maxNum);
            int j = Utils.upper_bound(bArr, n);
            assertTrue((bArr[arrSize-1]<=n && j==-1) || (bArr[j]>n && (j==0 || bArr[j-1]<=n)));
        }

    }

Saturday, July 10, 2010

beware of NoClassDefFoundError in java.util.concurrent callbacks

If you use a particular class inside a Callable used with java.util.concurrent, and that class happens to be missing, the NoClassDefFoundError thrown will be captured by the concurrency library and a different exception - ExecutionException - will be thrown in its place.

This can catch you by surprise - pun intended - and can cost some time in debugging, especially as it tends to happen after code is deployed on a new machine that is missing a jar, for example.

Here is an example of the issue:

    class Work implements Callable<Pair<String,byte[]>> {
        private String domain;
        public Work(String dom) {
            this.domain = dom;
        }
        public Pair<String, byte[]> call() {
            try {
                return new Pair<String, byte[]>(domain, InetAddress.getByName(domain).getAddress());
            } catch (UnknownHostException e) {
                return null;
            } catch (Exception e) {
                System.err.println("unexpected DNS error for: " + domain + " "+e.getMessage());
                return null;
            }
        }
    }


  //caller code

  ExecutorService pool = Executors.newFixedThreadPool(16);
  Future<Pair<String,byte[]>> future = pool.submit(new Work("http://lynx.com"));
  
  //sometime later, retrieve the future...

  if (future.isDone()) {
    try {
      Pair<String,byte[]> pair = future.get();
      if (pair != null) {
        //work with the results
      }

    } catch (ExecutionException e) {
        System.err.println("thread pool error " + e.getMessage());
    } catch (InterruptedException e) {
        System.err.println("thread pool interrupted " + e.getMessage());
    }
  } 

Here, the scenario is that the Pair class is missing. That class is used inside the Callable that returns the result from the worker thread to the caller. When the pool runs Callable.call(), the JVM throws NoClassDefFoundError; the concurrency library dutifully catches and stores it, and when the caller issues future.get(), it is re-thrown wrapped in an ExecutionException.

It pays to catch the ExecutionException and log it (the original error is available via getCause()), even if you never expect to run the application in scenarios where these would normally be a worry.
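
For example, a catch block along these lines surfaces the real cause (a sketch):

    } catch (ExecutionException e) {
        // the NoClassDefFoundError (or whatever was thrown inside call()) is the cause
        System.err.println("thread pool error: " + e.getCause());
    }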

If the ExecutionException were not handled at all, this application would not fail visibly, but it would likely not do what was intended.

Friday, July 09, 2010

block quotes in bash

:<<supercalifragilisticexpialidocious
echo oki doki goobledi loks
supercalifragilisticexpialidocious

The long string is the delimiter; it can be any string of your choosing, as long as it does not appear inside the block. This works because ":" is the shell's no-op builtin: the here-document is fed to it as standard input and simply ignored, so nothing in the block is executed. (If the block contains $ or backticks, quote the delimiter - :<<'supercalifragilisticexpialidocious' - so the shell does not expand them while reading the here-document.)

Thursday, May 13, 2010

The annoying preciseness of Java : Charset.isSupported

If the charset is supported, return true, else return false - it does sound pretty simple, doesn't it.

It would be simple if the language designers focused on usability vs pristine accuracy. Java, obviously went for the latter.

Thus, if you ask whether an illegal charset name is supported, it won't return false; it throws IllegalCharsetNameException instead. How precise.


How annoying. When all you wanted was to check for the availability of a charset, you end up having to check for an exception as well as the return value of Charset.isSupported().
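
Something like this wrapper does the job (a sketch; the method name is mine):

    import java.nio.charset.Charset;

    static boolean isCharsetSupported(String name) {
        if (name == null)
            return false;
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) { // also covers IllegalCharsetNameException
            return false;
        }
    }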


Such is the art of programming in Java.

Jericho Parser new version fixes choking on unusual charset

Each day I find something completely wacko on the Net. Today it is an extremely interesting charset present in the headers of a certain site:


mpire@brwdbs01:~$ curl -I http://uk.real.com/realplayer/
HTTP/1.1 200 OK
Expires: 0
Date: Thu, 13 May 2010 22:44:16 GMT
Content-Length: 2690
Server: Caudium
Connection: close
Content-Type: text/html; charset='.().'
pragma: no-cache
X-Host-Name: hhnode21.euro.real.com
X-Got-Fish: Yes
Accept-Ranges: bytes
MIME-Version: 1.0
Cache-Control: no-cache, no-store, max-age=0, private, must-revalidate, proxy-revalidate

A charset of '.().'. Strange indeed. This manages to choke the Jericho parser I'm using, and I'm not too sure what to do about it. The parser does a pretty nice job with all kinds of encodings, but since it has no idea what this one is, it gives up. I could try to make it fall back to the default (ISO-8859-1).

This has been fixed in a newer version of the Jericho Parser.

Wednesday, May 12, 2010

Normalizing a URL

Today, I had this interesting problem to do with fetching a web page.
I was processing HTTP meta-refresh headers and ran into this type of header:

<meta http-equiv="REFRESH" content="0; URL=../cgi-bin/main2.cgi">

If I try to turn this into an absolute url (the url this content was fetched from being http://popyard.com), I would be trying to do this:

> curl -I http://popyard.com/../cgi-bin/main2.cgi
> HTTP/1.1 400 Bad Request

However, it turns out that the browser deals with this just fine. Faced with a malformed URL, it guesses and keeps going, fetching http://popyard.com/cgi-bin/main2.cgi

This required me to do some cleanup of the url to handle these dot segments (..). There is a well-known procedure for doing this (RFC 3986 calls it remove_dot_segments). I coded it up simply, using a stack-based approach.

Here is the code:

    //removes .. sequences from the url string handling extra .. sequences by stopping
    //at the domain, thus always returning a correct url.
    //ex: http://www.ex.com/a/xt/../myspace.html => http://www.ex.com/a/myspace.html
    //    http://www.ex.com/a/xt/../../../myspace.html => http://www.ex.com/myspace.html    
    public static String normalizePath(String url) {
        if (url.indexOf("..") == -1)
            return url; //no .. seqs, no need to normalize
        String[] toks = url.split("/");
        int i;
        for (i=0; i<toks.length && (toks[i].length() == 0 || toks[i].toLowerCase().indexOf(":") != -1); i++);
        if (i==toks.length)
            return url;     //no proper path found, simply return the url
        // toks[i] is the domain

        LinkedList<String> s = new LinkedList<String>();
        for (; i<toks.length; i++) {
            if (!toks[i].equals(".."))
                s.push(toks[i]);
            else if (s.size()>1)
                s.pop();
        }

        if (s.size()<1)
            return url;     //no proper domain found, simply return the url

        int idx = url.indexOf("://");
        StringBuilder sb = new StringBuilder();
        sb.append( (idx != -1 ? url.substring(0, idx+3) : "")).append(s.removeLast()); //get proto://domain

        while (s.size()>0) {
            sb.append("/").append(s.removeLast());
        }

        return sb.toString();
    }


Basically, what this code does is split the url into the domain and the path components and use a stack to manipulate these terms. We push each term onto the stack, unless the term is a dot segment - ".." - in which case, if there are at least two terms on the stack, we pop the top one. This way, we never pop the last term on the stack, which is the domain.

Then we can build the normalized URL by following the terms in the stack from bottom to top.

This is also the reason we can't use an actual stack class: we need to traverse the terms from bottom to top. So we use a LinkedList instead, pushing and popping at the head and reading the terms back out with removeLast().
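
Applied to the meta-refresh case above, a quick check (assuming normalizePath() is visible from the calling code):

    String fixed = normalizePath("http://popyard.com/../cgi-bin/main2.cgi");
    // fixed is now "http://popyard.com/cgi-bin/main2.cgi"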

Saturday, February 20, 2010

Interesting performance characteristic in Lucene.Analysis.StopFilter

I was tracking down a slow indexing process and got this callstack with YourKit:


[YourKit call stack screenshot showing Collection.addAll() called from each tokenStream() invocation]

There is a Collection.addAll() call for each invocation of tokenStream(), which is called many times per document. The StopFilter constructor is the culprit. When I checked it, it looks like this:

  public StopFilter(boolean enablePositionIncrements, TokenStream input, Set stopWords, boolean ignoreCase)
  {
    super(input);
    if (stopWords instanceof CharArraySet) {
      this.stopWords = (CharArraySet)stopWords;
    } else {
      this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
      this.stopWords.addAll(stopWords);
    }
    this.enablePositionIncrements = enablePositionIncrements;
    init();
  }



As you can see, if the stopWords set is not of type org.apache.lucene.analysis.CharArraySet, a whole copy is made. I was using a regular Set for the stopWords and that was the problem.

This was very instructive. The reason I didn't use Lucene's StopFilter.makeStopSet() and passed a regular Set instead was that I assumed Lucene built a plain Set under the covers anyway, and I wanted to save it the trouble since I could generate a Set directly from my input.

Another reason to profile, profile and profile....

After changing my code to use the Lucene Set (by calling StopFilter.makeStopSet), here is the result - faster for sure:


[YourKit screenshot after the change]
The performance improvement can be significant. In a real-world scenario where this code was used to build a 10G index, I saw a 4X speed-up in indexing.
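
For reference, the change amounts to something like this (a sketch; the tokenizer, reader and stop words here are made up, and it assumes the Lucene 2.x/3.0-era StopFilter API shown above):

    // build the stop set once, as Lucene's CharArraySet, and reuse it for every tokenStream() call
    Set stopSet = StopFilter.makeStopSet(new String[] {"a", "an", "the"}, true);
    TokenStream ts = new StopFilter(true, new WhitespaceTokenizer(reader), stopSet, true);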