Thursday, October 21, 2010

Extracting HTTP headers, handling partial headers (incorrectly)

This one is for the network fiends who are crazy enough to read raw HTTP packets and construct HTML content. I had to do this to take advantage of the highly efficient epoll() mechanism in Linux. The full framework is described here.

The catch is that every read from a server may deliver only part of the response, so the client can end up holding a partial HTTP header. When that happens, the code discards the header data it has built so far and tries again on the next read, by which time the full headers have hopefully arrived.

I want to illustrate this version that has a subtle bug:

    //extract all HTTP response headers into httpHeaders; return the size of the header block, or -1 if it is incomplete
    private int extractHttpHeaders(byte[] httpHeader) {
        final String delimiter = ": ";
        List<StringBuilder> headers = new ArrayList<StringBuilder>();
        for (int i=0; i<httpHeader.length; i++) {
            StringBuilder line = new StringBuilder();
            for (; i<httpHeader.length && (char)httpHeader[i]!='\n'; i++) {
                line.append((char)httpHeader[i]);
            }
            if (i==httpHeader.length) {
                //partial header, full headers have not been received yet
                httpHeaders.clear();
            }
            else if (line.length()==0 || (line.length()==1 && (char)httpHeader[i-1]=='\r')) {
                //all headers received
                httpHeaders.length = i+1;
                return i+1;
            } else {
                //line has a header, add it
                int colonAt = line.indexOf(delimiter);
                if (colonAt != -1) {
                    String value = line.substring(colonAt+delimiter.length(), line.charAt(line.length()-1)=='\r' ? line.length()-1 : line.length()).trim();
                    if (value.length() > 0)
                        httpHeaders.put(line.substring(0,colonAt), value);
                }
            }
        }
        return -1; //full headers have not been received yet
    }


It uses the Headers class - the httpHeaders you see there is an object of that type. Here is the code for the Headers class:

    class Headers {
        private Map<String, String> httpHeaders = new HashMap<String, String>();
        private int length = -1;
        private int getContentLength() {
            String s = httpHeaders.get("content-length");
            try {
                return s != null ? Integer.valueOf(s) : Integer.MAX_VALUE;
            } catch (NumberFormatException e) {
                return Integer.MAX_VALUE;
            }
        }
        private void clear() {
            httpHeaders.clear();
        }
        private void put(String k, String v) {
            httpHeaders.put(k.toLowerCase(),v);
        }
        private String get(String k) {
            return httpHeaders.get(k.toLowerCase());
        }
        private int numHeaders() {
            return httpHeaders.size();
        }
        private boolean isBinary() {
            return Utils.isContentTypeBinary(httpHeaders.get("content-type"));
        }
    };

Before I point out the bug, here is how this is supposed to work. I have an asynchronous network probing loop that fills byte arrays with data from the servers. Whenever new data lands in one of these byte arrays, I call extractHttpHeaders() on it. That code has to guard against working with partial headers: if the headers are incomplete, I abandon parsing, clear the headers data structure I was building, and return.

In the main probe code, I use Headers.numHeaders() to determine whether all headers have been received. Since I clear the headers whenever they are partial, a non-zero count should mean the headers are complete.

Except there is a subtle bug: sometimes I can receive partial headers but return without clearing the headers map. The next time the probe code gets some data, it will incorrectly assume that the full headers have already been received (as Headers.numHeaders() will return a positive number).

To appreciate the bug, imagine I receive headers that stop on an exact line boundary. For example, assume I receive this:

    Date: Thu, 21 Oct 2010 18:39:20 GMT
    Server: Apache/2.2.15 (Fedora)
    X-Powered-By: PHP/5.2.13

Here I would not hit the partial-line condition, because every line that arrived is complete and ends with its own CRLF. What has not arrived yet is the blank line that marks the end of the headers.

This is the condition that clears the headers on partial lines:

            if (i==httpHeader.length) {
                //partial header, full headers have not been received yet
                httpHeaders.clear();
            }

But that won't be hit in this case, as every line is complete. The code exits the for loop and returns -1 without clearing the headers map.
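To see the two cases side by side, here is a minimal, self-contained sketch that reproduces the behaviour. It is a simplified copy of the parsing loop above with the bug left in. The class name PartialHeaderDemo, the static map standing in for the Headers object, and the parse() helper are assumptions made for this demo and are not part of the framework.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class PartialHeaderDemo {
    // Stand-in for the Headers object used in the post (hypothetical simplification).
    static final Map<String, String> httpHeaders = new HashMap<>();

    // Simplified copy of the parsing loop, bug included.
    static int extractHttpHeaders(byte[] data) {
        final String delimiter = ": ";
        for (int i = 0; i < data.length; i++) {
            StringBuilder line = new StringBuilder();
            for (; i < data.length && (char) data[i] != '\n'; i++) {
                line.append((char) data[i]);
            }
            if (i == data.length) {
                // Data ended in the middle of a line: discard what was built.
                httpHeaders.clear();
            } else if (line.length() == 0 || (line.length() == 1 && line.charAt(0) == '\r')) {
                // Blank line: all headers received.
                return i + 1;
            } else {
                int colonAt = line.indexOf(delimiter);
                if (colonAt != -1) {
                    // trim() also drops the trailing '\r'
                    String value = line.substring(colonAt + delimiter.length()).trim();
                    if (value.length() > 0) {
                        httpHeaders.put(line.substring(0, colonAt).toLowerCase(), value);
                    }
                }
            }
        }
        return -1; // full headers have not been received yet
    }

    // Hypothetical helper: start from an empty map and parse one buffer.
    static int parse(String raw) {
        httpHeaders.clear();
        return extractHttpHeaders(raw.getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        String twoLines = "Date: Thu, 21 Oct 2010 18:39:20 GMT\r\n"
                        + "Server: Apache/2.2.15 (Fedora)\r\n";

        // Case 1: data stops in the middle of the third line, so the map is cleared.
        int r1 = parse(twoLines + "X-Powered-By: PH");
        System.out.println(r1 + " " + httpHeaders.size()); // prints "-1 0"

        // Case 2: data stops exactly after a CRLF, so the map is left populated.
        int r2 = parse(twoLines + "X-Powered-By: PHP/5.2.13\r\n");
        System.out.println(r2 + " " + httpHeaders.size()); // prints "-1 3"
    }
}
```

Both calls return -1, but only the first leaves the map empty. In the second case a check based on the number of stored headers would wrongly conclude that the headers are complete.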

A simple change in the first code block is all that is required to fix this. I will have that in the next post.
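As a rough preview, one change that would close this gap is to clear the map on every path that returns -1, including the one after the for loop. The sketch below is only one possible approach under that assumption and not necessarily the final code. As before, the class name PartialHeaderFixSketch and the static map are placeholders for the real Headers object.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class PartialHeaderFixSketch {
    // Stand-in for the Headers object used in the post (hypothetical simplification).
    static final Map<String, String> httpHeaders = new HashMap<>();

    static int extractHttpHeaders(byte[] data) {
        final String delimiter = ": ";
        for (int i = 0; i < data.length; i++) {
            StringBuilder line = new StringBuilder();
            for (; i < data.length && (char) data[i] != '\n'; i++) {
                line.append((char) data[i]);
            }
            if (i == data.length) {
                // Partial line: discard what was built.
                httpHeaders.clear();
            } else if (line.length() == 0 || (line.length() == 1 && line.charAt(0) == '\r')) {
                // Blank line: all headers received, keep the map.
                return i + 1;
            } else {
                int colonAt = line.indexOf(delimiter);
                if (colonAt != -1) {
                    String value = line.substring(colonAt + delimiter.length()).trim();
                    if (value.length() > 0) {
                        httpHeaders.put(line.substring(0, colonAt).toLowerCase(), value);
                    }
                }
            }
        }
        // Assumed change: reaching this point means no blank line was seen,
        // so the headers are incomplete even if every line was whole.
        httpHeaders.clear();
        return -1;
    }

    public static void main(String[] args) {
        String threeLines = "Date: Thu, 21 Oct 2010 18:39:20 GMT\r\n"
                          + "Server: Apache/2.2.15 (Fedora)\r\n"
                          + "X-Powered-By: PHP/5.2.13\r\n";

        // First read stops on a line boundary: the map is now left empty.
        byte[] first = threeLines.getBytes(StandardCharsets.US_ASCII);
        System.out.println(extractHttpHeaders(first) + " " + httpHeaders.size()); // prints "-1 0"

        // Second read delivers the blank line: headers are complete.
        byte[] full = (threeLines + "\r\n").getBytes(StandardCharsets.US_ASCII);
        int size = extractHttpHeaders(full);
        System.out.println((size == full.length) + " " + httpHeaders.size()); // prints "true 3"
    }
}
```

This works because each call parses the buffer again from the first byte, so nothing is lost by throwing the map away whenever the blank line has not been seen.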
