Saturday, October 23, 2010

Handling gzipped HTTP response with Transfer-Encoding: chunked

This explains the basic protocol of sending data using Transfer-Encoding: chunked. Quite a number of servers send gzipped data this way. I needed to handle this for the asynchronous crawler I built.

A chunked transfer consists of a series of chunks. Each chunk is preceded by its size in hexadecimal, terminated by a CRLF pair, and the chunk data itself is also followed by a CRLF. A final zero-length chunk marks the end of the transfer.
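
For example, a chunked body carrying the nine bytes "Wikipedia" would look like this on the wire (an illustration, not captured from any particular server), written here as a Java string literal:

    String chunkedBody =
        "4\r\n" +       // the next chunk is 4 (hex) bytes long
        "Wiki\r\n" +    // 4 bytes of data, followed by CRLF
        "5\r\n" +
        "pedia\r\n" +
        "0\r\n" +       // the zero-length chunk marks the end of the body
        "\r\n";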

The first time I wrote code to handle the chunked, gzipped transfer, I had the code decompress each chunk individually. This worked. But later, content from other URLs did not decompress properly. Probing with Wireshark, I came to realize that individual chunks cannot be reliably decompressed on their own, because the server does not compress each chunk separately. The server compresses the full content, then chunks up the compressed stream and sends each chunk on the wire. Thus, I needed to reassemble the full message first and only then decompress it. The reason my first attempt worked for the URL I was testing was that, in that particular case, all the data arrived in a single chunk.
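
As a minimal, self-contained illustration of this point (not part of the crawler code), the snippet below gzips a payload once, slices the compressed bytes into two artificial "chunks", reassembles them, and only then decompresses; decompressing either slice on its own would fail:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.util.Arrays;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    public class ChunkedGzipDemo {
        public static void main(String[] args) throws Exception {
            // "server side": compress the full content once
            ByteArrayOutputStream gz = new ByteArrayOutputStream();
            GZIPOutputStream out = new GZIPOutputStream(gz);
            out.write("hello hello hello hello hello".getBytes("UTF-8"));
            out.close();
            byte[] compressed = gz.toByteArray();

            // ...then slice the compressed stream into two wire "chunks"
            byte[] chunk1 = Arrays.copyOfRange(compressed, 0, compressed.length / 2);
            byte[] chunk2 = Arrays.copyOfRange(compressed, compressed.length / 2, compressed.length);

            // "client side": reassemble the chunk payloads first...
            ByteArrayOutputStream whole = new ByteArrayOutputStream();
            whole.write(chunk1);
            whole.write(chunk2);

            // ...and decompress the reassembled stream in one go
            GZIPInputStream zip = new GZIPInputStream(new ByteArrayInputStream(whole.toByteArray()));
            byte[] buf = new byte[1024];
            StringBuilder sb = new StringBuilder();
            for (int len; (len = zip.read(buf)) > 0; )
                sb.append(new String(buf, 0, len, "UTF-8"));
            System.out.println(sb); // prints the original payload
        }
    }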

Here is the relevant code as I couldn't find this easily anywhere:


String enc = httpHeaders.get("Content-Encoding");

if (enc != null && enc.toLowerCase().equals("gzip")) {
  String te = httpHeaders.get("Transfer-Encoding");
  if (te != null && te.toLowerCase().equals("chunked")) {
    int idx = httpHeaders.length;                      // start of the body, right after the headers
    ByteArrayOutputStream os = new ByteArrayOutputStream();
    int numBytes = -1;

    try {
      do {
        // read the chunk size: hex digits terminated by CRLF
        StringBuilder numBytesBuf = new StringBuilder();
        for (; idx < bytes.length && bytes[idx] != '\r'; idx++) {
          if (Utils.isHex((char) bytes[idx]))
            numBytesBuf.append((char) bytes[idx]);
        }
        if (idx >= bytes.length)
          throw new IOException("incorrect chunked encoding for : " + retryURL.getCurrentURL() + " based on " + retryURL.getOrigURL());
        idx += 2; // skip over '\r\n'
        try {
          numBytes = Integer.parseInt(numBytesBuf.toString(), 16);
        } catch (NumberFormatException e) {
          throw new IOException("incorrect chunked encoding for : " + retryURL.getCurrentURL() + " based on " + retryURL.getOrigURL(), e);
        }
        if (numBytes > 0) {
          // idx points to the start of the chunk data and numBytes is its length
          os.write(bytes, idx, idx + numBytes <= bytes.length ? numBytes : bytes.length - idx);
          if (idx + numBytes > bytes.length) {
            System.err.println("incorrect chunked encoding, " + (idx + numBytes) + " is outside " + bytes.length + " for: " + retryURL.getCurrentURL() + " based on " + retryURL.getOrigURL());
            break;
          }
        }
        idx += (numBytes + 2); // +2 for the '\r\n' that terminates the chunk data
      } while (numBytes > 0);

      // decompress the reassembled gzip stream in one go
      GZIPInputStream zip = new GZIPInputStream(new ByteArrayInputStream(os.toByteArray(), 0, os.size()));
      byte[] buf = new byte[1024];
      int len;

      try {
        while ((len = zip.read(buf, 0, buf.length)) > 0) { // decompress from <zip> into <buf>
          f.write(buf, 0, len);                            // write the decompressed bytes from <buf> to the output <f>
        }
      } catch (IOException e) {
        if (!e.getMessage().equals("Corrupt GZIP trailer") && !e.getMessage().equals("Unexpected end of ZLIB input stream"))
          throw e;
        else {
          System.out.println("handled spurious " + e.getMessage() + " on: " + retryURL.getCurrentURL() + " based on " + retryURL.getOrigURL());
        }
      }
      os.close();
      zip.close();
    } catch (Exception e) {
      System.err.println("failed on: " + retryURL.getCurrentURL() + " based on " + retryURL.getOrigURL());
      System.err.println(ExceptionsUtil.toString(e));
    }
  }
}

GZIPInputStream throws spurious exceptions when handling the end of the stream

The Sun bug database mentions (here and here) spurious errors thrown by GZIPInputStream.

Unfortunately, there are no fixes. These spurious exceptions are not thrown just for large files, either; I have seen them for files as small as 5K. Here is a set of decompressed files, sorted by size, that I managed to generate by ignoring these spurious errors:

x02:~$ cat /tmp/z | perl -ne 'if (/based on (.*)$/) {system("ls -ltr `echo -n $1|md5`.url");}' | sort -n -t ' ' -k 5
-rw-r--r-- 1 mpire mpire 5501 2010-10-23 14:03 409ff1c2b7ce2887db8a5c98d395b543.url
-rw-r--r-- 1 mpire mpire 38681 2010-10-23 14:03 8bc8f64132dd2e4bdbccff555cfa6966.url
-rw-r--r-- 1 mpire mpire 44554 2010-10-23 14:01 ba1c53b23f747efb3aa3a7531da80fb1.url
-rw-r--r-- 1 mpire mpire 45415 2010-10-23 14:03 073cd89a26dd69bac8f8a734bcaec7f1.url
-rw-r--r-- 1 mpire mpire 46058 2010-10-23 14:00 f0ae11a51e1975838c26c428ce14308a.url
-rw-r--r-- 1 mpire mpire 46192 2010-10-23 14:03 73972b286ec326e91404008b4c125e5a.url
-rw-r--r-- 1 mpire mpire 46414 2010-10-23 14:00 c6389b488bf912ddece9075884ff7c80.url
-rw-r--r-- 1 mpire mpire 47030 2010-10-23 14:00 07d9fe64764458b55626d9eb047d5d4b.url
-rw-r--r-- 1 mpire mpire 47565 2010-10-23 14:03 67a9869337a777c7c6a8411fd55e1b39.url
-rw-r--r-- 1 mpire mpire 49034 2010-10-23 14:01 0792cd32f0ef59cbe3592a5c2a7b5744.url
-rw-r--r-- 1 mpire mpire 58397 2010-10-23 14:03 3c973b623e9d27cd36234222f6542788.url
-rw-r--r-- 1 mpire mpire 58981 2010-10-23 14:03 780f0be38cbe0e73de788597cb482af4.url
-rw-r--r-- 1 mpire mpire 59177 2010-10-23 14:01 15dd65994702db5e360704144874f3f8.url
-rw-r--r-- 1 mpire mpire 60043 2010-10-23 14:03 c355cd84d17be2cf405b04fb3663d181.url
-rw-r--r-- 1 mpire mpire 63189 2010-10-23 14:03 43ab8fabf72b5564bc0e8b1ff3fcebe7.url
-rw-r--r-- 1 mpire mpire 70235 2010-10-23 14:01 3a24135855df237d829960a15cd8b170.url
-rw-r--r-- 1 mpire mpire 71536 2010-10-23 14:01 48f80b3fb1c51ea52313fe76b55a3849.url
-rw-r--r-- 1 mpire mpire 76932 2010-10-23 14:03 8c789913e5666e793c38d02001486532.url
-rw-r--r-- 1 mpire mpire 78825 2010-10-23 14:01 49728436874d94d8b1ab0ef17d8b4736.url
-rw-r--r-- 1 mpire mpire 80459 2010-10-23 14:05 c52a68d48d2c23ef846950dec999f084.url
-rw-r--r-- 1 mpire mpire 80459 2010-10-23 14:05 c52a68d48d2c23ef846950dec999f084.url
-rw-r--r-- 1 mpire mpire 83001 2010-10-23 14:00 a963a7357bf480fe24040ecf05e8927d.url
-rw-r--r-- 1 mpire mpire 105473 2010-10-23 14:01 db76aed3163beb6ad49670866045a9d8.url
-rw-r--r-- 1 mpire mpire 109405 2010-10-23 14:01 70da57c0544c122766a9ad6772757f2b.url
-rw-r--r-- 1 mpire mpire 110921 2010-10-23 14:01 579e0f08a036befe374ff8c70126a2bb.url
-rw-r--r-- 1 mpire mpire 111880 2010-10-23 14:00 3091fb91ae4a92b06392acead7170a57.url
-rw-r--r-- 1 mpire mpire 116796 2010-10-23 14:05 a326a52188abc263e2f8804444b48c8c.url
-rw-r--r-- 1 mpire mpire 154209 2010-10-23 14:06 e90211e8f799a88dc70ff83d1aba0748.url
-rw-r--r-- 1 mpire mpire 159089 2010-10-23 14:01 8ef49b62542f8283acba4e573e28ea59.url
-rw-r--r-- 1 mpire mpire 168786 2010-10-23 14:01 487d9c54fbd3e4b760c91dbf5f754d32.url
-rw-r--r-- 1 mpire mpire 212561 2010-10-23 14:03 b3cb42c946716cfe7e31096bd458e9f2.url
-rw-r--r-- 1 mpire mpire 222257 2010-10-23 14:03 f803ac8246dd2ab7dd16f7d28d5b4594.url


The decompressed files are good. I ran into this issue while reading gzipped content from web servers that compress the content and send the compressed data in chunks using Transfer-Encoding: chunked.

The error does not seem to be specific to the Java libraries. I saved the gzipped content and tried to decompress it with gunzip. This failed as well:

x02:/tmp$ gunzip 0792cd32f0ef59cbe3592a5c2a7b5744.gz 

gzip: 0792cd32f0ef59cbe3592a5c2a7b5744.gz: unexpected end of file

Thursday, October 21, 2010

Extracting HTTP headers, handling partial headers correctly

Here is the correct code for the previous post:

    //extract all HTTP request response headers locally and return the size of the headers
    private int extractHttpHeaders(byte[] httpHeader) {
        final String delimiter = ": ";
        List<StringBuilder> headers = new ArrayList<StringBuilder>();
        for (int i=0; i<httpHeader.length; i++) {
            StringBuilder line = new StringBuilder();
            for (; i<httpHeader.length && (char)httpHeader[i]!='\n'; i++) {
                line.append((char)httpHeader[i]);
            }
            if (i==httpHeader.length) {
                //partial header, full headers have not been received yet
                //this will break out of the loop
            }
            else if (line.length()==0 || (line.length()==1 && (char)httpHeader[i-1]=='\r')) {
                //all headers received
                httpHeaders.length = i+1;
                return i+1;
            } else {
                //line has a header, add it
                int colonAt = line.indexOf(delimiter);
                if (colonAt != -1) {
                    String value = line.substring(colonAt+delimiter.length(), line.charAt(line.length()-1)=='\r' ? line.length()-1 : line.length()).trim();
                    if (value.length() > 0)
                        httpHeaders.put(line.substring(0,colonAt), value);
                }
            }
        }
        //partial header, full headers have not been received yet
        httpHeaders.clear();
        return -1; //full headers have not been received yet
    }

The only difference is unconditionally clearing the headers map when we don't see a blank line terminating the headers. The correct termination of the headers is a blank line, i.e. a lone CRLF pair. If that has not been seen, then we only have a partial header block, even though each individual line in it may have been fully received.
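
To make the termination rule concrete, here is a sketch (hypothetical header bytes, not taken from the crawler) of the two cases extractHttpHeaders() has to distinguish:

    // complete: ends with a blank line, so the method returns the header length
    byte[] complete = ("HTTP/1.1 200 OK\r\n"
                     + "Content-Type: text/html\r\n"
                     + "\r\n").getBytes();

    // partial: no blank line yet, so the method clears the map and returns -1
    byte[] partial  = ("HTTP/1.1 200 OK\r\n"
                     + "Content-Type: text/html\r\n").getBytes();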

Extracting HTTP headers, handling partial headers (incorrectly)

This one is for the network fiends who are crazy enough to read raw HTTP responses and build up the HTML content themselves. I had to do this to take advantage of the highly efficient epoll() mechanism in Linux. The full framework is described here.

The problem is that, each time I get some data from a server, I need to anticipate partial data. It is possible that the client has received only a partial HTTP header block. If that is the case, the logic discards the header data built so far, so that parsing can be attempted again once the full headers have been delivered.

I want to illustrate this version that has a subtle bug:

    //extract all HTTP request response headers locally and return the size of the headers
    private int extractHttpHeaders(byte[] httpHeader) {
        final String delimiter = ": ";
        List<StringBuilder> headers = new ArrayList<StringBuilder>();
        for (int i=0; i<httpHeader.length; i++) {
            StringBuilder line = new StringBuilder();
            for (; i<httpHeader.length && (char)httpHeader[i]!='\n'; i++) {
                line.append((char)httpHeader[i]);
            }
            if (i==httpHeader.length) {
                //partial header, full headers have not been received yet
                httpHeaders.clear();
            }
            else if (line.length()==0 || (line.length()==1 && (char)httpHeader[i-1]=='\r')) {
                //all headers received
                httpHeaders.length = i+1;
                return i+1;
            } else {
                //line has a header, add it
                int colonAt = line.indexOf(delimiter);
                if (colonAt != -1) {
                    String value = line.substring(colonAt+delimiter.length(), line.charAt(line.length()-1)=='\r' ? line.length()-1 : line.length()).trim();
                    if (value.length() > 0)
                        httpHeaders.put(line.substring(0,colonAt), value);
                }
            }
        }
        return -1; //full headers have not been received yet
    }


It uses the Headers class - the httpHeaders you see there is an object of that type. Here is the code for the Headers class:

    class Headers {
        private Map<String, String> httpHeaders = new HashMap<String, String>();
        private int length = -1;
        private int getContentLength() {
            String s = httpHeaders.get("content-length");
            try {
            return s != null ? Integer.valueOf(s) : Integer.MAX_VALUE;
            } catch (NumberFormatException e) {
                return Integer.MAX_VALUE;
            }
        }
        private void clear() {
            httpHeaders.clear();
        }
        private void put(String k, String v) {
            httpHeaders.put(k.toLowerCase(),v);
        }
        private String get(String k) {
            return httpHeaders.get(k.toLowerCase());
        }
        private int numHeaders() {
            return httpHeaders.size();
        }
        private boolean isBinary() {
            return Utils.isContentTypeBinary(httpHeaders.get("content-type"));
        }
    };

Before I point out the bug, here is how this is supposed to work. I have an asynchronous network probing loop that fills byte arrays with data from the servers. Whenever I have new data in these byte arrays, I call extractHttpHeaders(). But that code has to guard against working with partial headers. If the headers are partial, I abandon parsing, clear the headers data structure I was building, and return.

In the main probe code, I would use Headers.numHeaders() to determine whether all headers have been received. Since I clear the headers whenever they are partial, this is an acceptable test.
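
Roughly, the check in the probe code looks like this (a sketch only, not the actual probe loop; processBody() and the buffered byte array are hypothetical stand-ins for the real per-connection state):

    extractHttpHeaders(buffered);        // parse whatever has arrived so far
    if (httpHeaders.numHeaders() > 0) {
        // taken to mean "all headers have been received" -- only safe if a
        // partial header block always leaves the map empty
        processBody(buffered);           // hypothetical
    }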

Except, there is a subtle bug: sometimes I can receive a partial header block but return without clearing the headers map. Then, the next time the probe code gets some data, it will incorrectly assume that the full headers have already been received (as Headers.numHeaders() will return a positive number).

To appreciate the bug, imagine I receive headers ending on an exact line boundary. For example, assume I receive this:

Date: Thu, 21 Oct 2010 18:39:20 GMT
Server: Apache/2.2.15 (Fedora)
X-Powered-By: PHP/5.2.13

Then I would never hit the condition of getting a partial line, because every line of the header arrived complete, each with its terminating CRLF, even though the blank line that ends the header block has not arrived yet.

This is the condition that clears the headers on partial lines:

            if (i==httpHeader.length) {
                //partial header, full headers have not been received yet
                httpHeaders.clear();
            }

But that branch won't be hit in this case, as we only have full lines. The code exits the for loop and returns without ever clearing the headers map.
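
In other words, with input like the example above (shown here as the raw bytes), the buggy version returns -1 but leaves three entries in the map, so the next call to numHeaders() already reports a positive count:

    byte[] partialOnLineBoundary =
        ("Date: Thu, 21 Oct 2010 18:39:20 GMT\r\n"
       + "Server: Apache/2.2.15 (Fedora)\r\n"
       + "X-Powered-By: PHP/5.2.13\r\n").getBytes();   // complete lines, but no blank line yet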

A simple change in the first code block is all that is required to fix this. I will have that on the next post.

Monday, October 18, 2010

Mysql - LOAD DATA INFILE could lose data

Last week, when we encountered a sudden data loss in mysql, the first thing to do was to pore over the bin logs looking for "DELETE" statements. It took quite a while to figure out that the data "loss" was really a change of ids in one table, and that it had to do with a LOAD DATA INFILE statement.

To illustrate, assume we have these two tables:


vegetables (
  id bigint(10) NOT NULL AUTO_INCREMENT,
  name varchar(255) NOT NULL,
  UNIQUE KEY uk_name (name)
);

sales (
  id bigint(10) NOT NULL AUTO_INCREMENT,
  saledate datetime,
  vegetableId bigint(10) NOT NULL
);


The sales table has information about vegetable sales, each sales record has a connection to a vegetable record.

If we load data into the vegetables table with a LOAD DATA INFILE statement of this sort, it is possible to break the existing foreign-key relationships from the sales records to the vegetables records:


load data infile '/path/to/file.csv' replace into table vegetables;


The problem is the keyword "REPLACE" used here. It tells mysql what to do if it finds that any record in the csv file violates a key constraint. In this example, if the csv file had a record with the vegetable name "brocolli", and brocolli was already in the vegetables table, then the keyword "REPLACE" tells mysql to replace the row on the table with the new data from the csv file.

This still might not seem like a problem, since replacing "brocolli" with "brocolli" is certainly not earth shattering. But what about the id? REPLACE actually deletes the existing row and inserts a new one, and since id is an AUTO_INCREMENT column, the new row gets a fresh id. That means any sales record that referenced brocolli is now left with a pointer to nowhere. In other words, for all practical purposes, all sales of brocolli have now disappeared.
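
A hypothetical walk-through of the mechanism (the ids are made up for illustration):

    -- say 'brocolli' was originally inserted and got id = 7;
    -- the sales rows for it carry vegetableId = 7
    REPLACE INTO vegetables (name) VALUES ('brocolli');
    -- the old row (id = 7) is deleted and a new row is inserted,
    -- which gets the next AUTO_INCREMENT id, say 8;
    -- the sales rows still point at 7, which no longer exists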

This problem would not have happened if the sales table explicitly specified the FOREIGN KEY constraint on its vegetableId field.

In any case, is this data loss really so terrible? Once we see the mistake, couldn't we quickly correct it? After all, it is just a change of ids, and the old sales records are still in the table. Well, think about it: even though those records are there, what if we had replacements for more than one vegetable? Assume both "brocolli" and "arugala" were replaced. How can we determine which vegetableIds were for brocolli and which were for arugala? The ids in the vegetables table have been replaced, so it is impossible to make the connection from the sales table back to the right vegetable. No, this is a bad one.

In fact, an incident similar to this happened where I work. We had to restore the database to get the data back; it was not a pleasant experience.

Monday, October 11, 2010

Java URLConnection provides no fail-safe timeout on reads

While it does sport a setReadTimeout() call, that timeout does not help if the server returns some data and then gets stuck mid-response. There are servers like this in the real world; for example, try this:

curl http://www.geo-mobile.com/nuke/index.php

You have to use a separate timer thread to time out such reads.


                try {
                    URLConnection con = urlObj.openConnection();
                    con.setConnectTimeout(5000);
                    con.setReadTimeout(5000);
                    new Thread(new InterruptThread(Thread.currentThread(), con)).start();
                    retVal = new Source(con);
                } catch (IOException e) {
                    System.err.println("aborted parsing due to I/O: " + e.getMessage());
                    throw new IndexingException("aborted parsing due to timeout - " + aUrl);
                }

In this example, the Source constructor, using the Jericho parser, retrieves a web page, and it can get stuck on a socket read. The timeouts don't have any effect, and it is not possible to interrupt a thread waiting on I/O using Thread.interrupt().

The solution is the hand-crafted InterruptThread class:


public class InterruptThread implements Runnable {
    Thread parent;
    URLConnection con;
    public InterruptThread(Thread parent, URLConnection con) {
        this.parent = parent;
        this.con = con;
    }

    public void run() {
        try {
            Thread.sleep(5000);
        } catch (InterruptedException e) {

        }
        System.out.println("Timer thread forcing parent to quit connection");
        ((HttpURLConnection)con).disconnect();
        System.out.println("Timer thread closed connection held by parent, exiting");
    }
}

Here, the InterruptThread forcibly closes the URLConnection held by the parent thread, forcing the Source constructor of the parent thread to throw an IOException, which is handled by its catch block, thus avoiding the hang.

Tuesday, October 05, 2010

Java : Get difference between two sets

Use this generic method to get the difference between two Sets. The sets should have ordered iterators; I used TreeSet in the calling code.



    // returns the difference between two sets <set1>, <set2> as a pair
    // where pair.one is the set of new entries and pair.two is the set of deleted entries
    // between <set1> and <set2>
    public static <T extends Comparable<T>> Pair<Set<T>,Set<T>> diff(Set<T> set1, Set<T> set2) {
        Set<T> deleted = new TreeSet<T>();
        Set<T> added = new TreeSet<T>();

        Iterator<T> it1 = set1.iterator();
        Iterator<T> it2 = set2.iterator();

        int diff = 0;
        T obj1 = null;
        T obj2 = null;
        while (it1.hasNext() && it2.hasNext()) {
            if (diff == 0) {
                obj1 = it1.next();
                obj2 = it2.next();
            } else if (diff < 0) {
                deleted.add(obj1);
                obj1 = it1.next();
            } else {
                added.add(obj2);
                obj2 = it2.next();
            }

            diff = obj1.compareTo(obj2);
        }

        if (diff < 0) {
            deleted.add(obj1);
            boolean addFlag = true;
            while (it1.hasNext()) {
                obj1 = it1.next();
                diff = obj1.compareTo(obj2);
                if (diff != 0) {
                   deleted.add(obj1);
                } else {
                   addFlag = false; 
                }
            }
            if (addFlag) {
                added.add(obj2);
            }
        } else if (diff > 0){
            added.add(obj2);
            boolean addFlag = true;
            while (it2.hasNext()) {
                obj2 = it2.next();
                diff = obj1.compareTo(obj2);
                if (diff != 0) {
                    added.add(obj2);
                } else {
                    addFlag = false;
                }
            }
            if (addFlag) {
                deleted.add(obj1);
            }
        }
        Iterator<T> remIt = it1.hasNext() ? it1 : it2;
        Set<T> remSet = it1.hasNext() ?  deleted : added;

        T obj;
        while (remIt.hasNext()) {
            obj = remIt.next();
            remSet.add(obj);
        }
        return new Pair<Set<T>, Set<T>>(added, deleted);


    }
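
A quick usage sketch (assuming the Pair class and the enclosing Utils class from my code base):

    Set<Integer> set1 = new TreeSet<Integer>(Arrays.asList(1, 2, 3, 5));
    Set<Integer> set2 = new TreeSet<Integer>(Arrays.asList(2, 3, 4, 6));
    Pair<Set<Integer>, Set<Integer>> p = Utils.diff(set1, set2);
    // p.one (added)   : [4, 6]  -- in set2 but not in set1
    // p.two (deleted) : [1, 5]  -- in set1 but not in set2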

Comparing two Integers - don't expect auto-unboxing

If you wish to compare two Integers, explicitly compare their intValue() (or use equals()). Otherwise, == performs a reference comparison of the two Integer objects; auto-unboxing does not kick in when both operands are of type Integer.
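
A small self-contained illustration:

    Integer a = 1000, b = 1000;
    System.out.println(a == b);                         // false on a standard JVM: compares references
    System.out.println(a.intValue() == b.intValue());   // true: compares values
    System.out.println(a.equals(b));                    // true as well

    Integer c = 100, d = 100;
    System.out.println(c == d);   // true, but only because values in -128..127 are cached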

I was bitten by this as I was developing a test for a generic diff utility I had written. Here is the test code:


    public void test_diff_sets_rnd() {
        Random r = new Random();
        int n = r.nextInt(100);
        Set<Integer> set1 = new TreeSet<Integer>();
        Set<Integer> set2 = new TreeSet<Integer>();
        List<Integer> list = new ArrayList<Integer>();
        for (int i=0; i<n; i++) {
            int num = r.nextInt(100);
            set1.add(num);
            if (set2.add(num))
                list.add(num);
        }

        Set<Integer> removed = new TreeSet<Integer>();
        Set<Integer> added = new TreeSet<Integer>();

        n = set1.size()>0 ? r.nextInt(set1.size()) : 0;
        for (int i=0; i<n; i++) {
            int num = r.nextInt(set1.size());
            int remElt = list.get(num);
            if (set2.remove(remElt))
                removed.add(remElt);
        }

        n = r.nextInt(100);
        for (int i=0; i<n; i++) {
            int num = 100+r.nextInt(100);
            if (set2.add(num))
                added.add(num);
        }

        Pair<Set<Integer>, Set<Integer>> p = Utils.diff(set1,set2);

        System.out.println("set1:");
        for (Integer elt : set1) {
            System.out.print(elt + "\t");
        }
        System.out.println();

        System.out.println("set2:");
        for (Integer elt : set2) {
            System.out.print(elt + "\t");
        }
        System.out.println();

        System.out.println("added:");
        for (Integer elt : p.one) {
            System.out.print(elt + "\t");
        }
        System.out.println();

        System.out.println("removed:");
        for (Integer elt : p.two) {
            System.out.print(elt + "\t");
        }
        System.out.println();

        assertTrue(p.one.size()==added.size());
        assertTrue(p.two.size()==removed.size());

        for (Iterator<Integer> it1 = p.one.iterator(), it2 = added.iterator(); it1.hasNext();) {
            assertTrue(it1.next().intValue()==it2.next().intValue());
        }
        for (Iterator<Integer> it1 = p.two.iterator(), it2 = removed.iterator(); it1.hasNext();) {
            assertTrue(it1.next().intValue()==it2.next().intValue());
        }

    }

I had to use the intValue() explicitly in the last two for loops.