Saturday, October 23, 2010

Handling gzipped HTTP response with Transfer-Encoding: chunked

This explains the basic protocol of sending data using Transfer-Encoding: chunked. Quite a number of servers send gzipped data this way. I needed to handle this for the asynchronous crawler I built.

A chunked transfer consists of a series of chunks. Each chunk is preceded by its size, written in hexadecimal and terminated by a CRLF pair, and the chunk data itself is followed by another CRLF pair. A final zero-length chunk marks the end of the body.
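To make the framing concrete, here is a minimal sketch of stripping it from a body held fully in memory. The class and method names (ChunkedBodySketch, dechunk) are made up for illustration and are not part of the crawler; the sketch assumes the array starts at the first chunk-size line and it ignores any trailer headers after the last chunk.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not the crawler's own code.
class ChunkedBodySketch {

    /** Removes the chunk framing from body, which must begin at the first chunk-size line. */
    static byte[] dechunk(byte[] body) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int idx = 0;
        while (true) {
            // Locate the CRLF that terminates the chunk-size line.
            int lineEnd = idx;
            while (lineEnd + 1 < body.length
                    && !(body[lineEnd] == '\r' && body[lineEnd + 1] == '\n')) {
                lineEnd++;
            }
            if (lineEnd + 1 >= body.length) {
                throw new IOException("chunk size line is not terminated by CRLF");
            }
            String sizeLine = new String(body, idx, lineEnd - idx, StandardCharsets.US_ASCII);
            int semi = sizeLine.indexOf(';'); // an optional chunk extension follows ';'
            String hex = (semi >= 0 ? sizeLine.substring(0, semi) : sizeLine).trim();
            int size;
            try {
                size = Integer.parseInt(hex, 16); // the size is hexadecimal
            } catch (NumberFormatException e) {
                throw new IOException("bad chunk size: " + sizeLine, e);
            }
            idx = lineEnd + 2; // skip the CRLF after the size
            if (size == 0) {
                return out.toByteArray(); // the zero-length chunk ends the body
            }
            if (size < 0 || size > body.length - idx) {
                throw new IOException("chunk runs past the end of the data");
            }
            out.write(body, idx, size);
            idx += size + 2; // skip the chunk data and its trailing CRLF
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] wire = "4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(dechunk(wire), StandardCharsets.US_ASCII));
    }
}
```

The sample body in main carries two chunks of 4 and 5 bytes followed by the terminating zero-length chunk.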

The first time I wrote code to handle a chunked gzipped transfer, I had the code decompress each chunk. This worked. But later, content from other URLs did not decompress properly. Probing with Wireshark, I came to realize that an individual chunk cannot be reliably decompressed on its own, because the server does not compress each chunk before sending it. The server compresses the full content, then splits the compressed stream into chunks and sends each chunk on the wire. Thus, I needed to first rebuild the full message from its chunks and only then decompress it. My first attempt worked only because, for the URL I was testing, all the data happened to arrive in a single chunk.
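The effect is easy to reproduce without a server. The sketch below is an illustration with made-up names (GzipThenChunkSketch and its helpers): it gzips some text, cuts the compressed bytes into two pieces the way a server might chunk them, and tries to decompress each piece and then the reassembled whole.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class for illustration only.
class GzipThenChunkSketch {

    static byte[] gzip(byte[] plain) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream zip = new GZIPOutputStream(out)) {
            zip.write(plain);
        }
        return out.toByteArray();
    }

    static byte[] gunzip(byte[] compressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream zip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[1024];
            int len;
            while ((len = zip.read(buf)) != -1) {
                out.write(buf, 0, len);
            }
        }
        return out.toByteArray();
    }

    /** True if data decompresses cleanly as one complete gzip stream. */
    static boolean canGunzip(byte[] data) {
        try {
            gunzip(data);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    static byte[] join(byte[] a, byte[] b) {
        byte[] all = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, all, a.length, b.length);
        return all;
    }

    public static void main(String[] args) throws IOException {
        byte[] plain = "The server compresses the full content, then chunks it up."
                .getBytes(StandardCharsets.UTF_8);
        byte[] gz = gzip(plain);
        // Split the compressed stream at an arbitrary point, as chunking would.
        byte[] first = Arrays.copyOfRange(gz, 0, gz.length / 2);
        byte[] second = Arrays.copyOfRange(gz, gz.length / 2, gz.length);
        System.out.println("first piece alone:  " + canGunzip(first));
        System.out.println("second piece alone: " + canGunzip(second));
        System.out.println("reassembled:        " + canGunzip(join(first, second)));
    }
}
```

The first piece carries the gzip header but stops partway through the stream, and the second piece has no gzip header at all, so only the reassembled bytes form a complete stream.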

Here is the relevant code, since I couldn't easily find it anywhere:


//Fragment from the response handler; needs java.io.ByteArrayInputStream, java.io.ByteArrayOutputStream,
//java.io.IOException and java.util.zip.GZIPInputStream
String enc = httpHeaders.get("Content-Encoding");

if (enc != null && enc.trim().equalsIgnoreCase("gzip")) {
  String te = httpHeaders.get("Transfer-Encoding");
  if (te != null && te.trim().equalsIgnoreCase("chunked")) {
    int idx = httpHeaders.length; //start at the first body byte in <bytes>, just past the headers
    ByteArrayOutputStream os = new ByteArrayOutputStream(); //collects the de-chunked, still compressed body
    int numBytes = -1;

    try {
      do {
        //read the hex chunk size, which ends at '\r'; ignore any chunk extension after ';'
        StringBuilder numBytesBuf = new StringBuilder();
        boolean inExtension = false;
        for (; idx < bytes.length && bytes[idx] != '\r'; idx++) {
          if (bytes[idx] == ';')
            inExtension = true;
          else if (!inExtension && Utils.isHex((char) bytes[idx]))
            numBytesBuf.append((char) bytes[idx]);
        }
        if (idx >= bytes.length)
          throw new IOException("incorrect chunked encoding for : " + retryURL.getCurrentURL() + " based on " + retryURL.getOrigURL());
        idx += 2; //skip over '\r\n'
        try {
          numBytes = Integer.parseInt(numBytesBuf.toString(), 16);
        } catch (NumberFormatException e) {
          throw new IOException("incorrect chunked encoding for : " + retryURL.getCurrentURL() + " based on " + retryURL.getOrigURL(), e);
        }
        if (numBytes > 0) {
          //idx points to the start of the chunk data and numBytes is its length
          os.write(bytes, idx, idx + numBytes <= bytes.length ? numBytes : bytes.length - idx);
          if (idx + numBytes > bytes.length) {
            System.err.println("incorrect chunked encoding, " + (idx + numBytes) + " is outside " + bytes.length + " for: " + retryURL.getCurrentURL() + " based on " + retryURL.getOrigURL());
            break;
          }
        }
        idx += (numBytes + 2); //+2 for '\r\n'
      } while (numBytes > 0);

      //all chunks are collected, so decompress the full message in one go
      GZIPInputStream zip = new GZIPInputStream(new ByteArrayInputStream(os.toByteArray(), 0, os.size()));
      byte[] buf = new byte[1024];
      int len;

      try {
        for (; (len = zip.read(buf, 0, buf.length)) > 0;) { //decompress from <zip> to <buf>
          f.write(buf, 0, len); //transfer from fixed size <buf> to the output <f>
        }
      } catch (IOException e) {
        String msg = e.getMessage(); //may be null, so compare from the literal side
        if (!"Corrupt GZIP trailer".equals(msg) && !"Unexpected end of ZLIB input stream".equals(msg))
          throw e;
        else {
          System.out.println("handled spurious " + msg + " on: " + retryURL.getCurrentURL() + " based on " + retryURL.getOrigURL());
        }
      }
      os.close();
      zip.close();
    } catch (Exception e) {
      System.err.println("failed on: " + retryURL.getCurrentURL() + " based on " + retryURL.getOrigURL());
      System.err.println(ExceptionsUtil.toString(e));
    }
  }
}
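The code above tolerates a cut-off stream by comparing exception message text, which depends on the exact wording used by the JDK. A possible alternative, sketched here with made-up names (TolerantGunzipSketch, gunzipLenient), is to catch by exception type instead. It assumes, as in the OpenJDK sources I am aware of, that a truncated stream surfaces as java.io.EOFException. It deliberately leaves a corrupt trailer alone, because that is reported as a java.util.zip.ZipException, a type that also covers other kinds of corruption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration only; not the crawler's own code.
class TolerantGunzipSketch {

    /** Decompresses as much as possible; a truncated stream yields the bytes read so far. */
    static byte[] gunzipLenient(byte[] compressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream zip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[1024];
            int len;
            while ((len = zip.read(buf)) != -1) {
                out.write(buf, 0, len);
            }
        } catch (EOFException e) {
            // Assumption: EOFException means the input ended early; keep the partial output.
        }
        return out.toByteArray();
    }

    static byte[] gzip(byte[] plain) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream zip = new GZIPOutputStream(out)) {
            zip.write(plain);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] plain = "Thus, I needed to first build the full message and then decompress it."
                .getBytes(StandardCharsets.UTF_8);
        byte[] gz = gzip(plain);
        byte[] truncated = Arrays.copyOf(gz, gz.length - 12); // drop the trailer and a little data
        System.out.println("full:      " + gunzipLenient(gz).length + " bytes");
        System.out.println("truncated: " + gunzipLenient(truncated).length + " bytes");
    }
}
```

Whether partial output is acceptable depends on the caller; for a crawler that only indexes text it may be, but anything that needs integrity should let the exception propagate.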

1 comment:

Anonymous said...

Please post complete program