Thursday, May 13, 2010

The annoying preciseness of Java: Charset.isSupported

If the charset is supported, return true; otherwise return false. It sounds pretty simple, doesn't it?

It would be simple if the language designers had focused on usability rather than pristine accuracy. Java, obviously, went for the latter.

Thus, if you ask whether an illegally named charset is supported, it won't return false; it will throw an IllegalCharsetNameException instead. How precise.


How annoying. Now, when all you wanted was to check for the availability of a charset, you suddenly end up checking for an exception as well as the return value of Charset.isSupported.
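
Something like this little wrapper is what you end up writing (a sketch; the helper name is mine, not from any library):

    import java.nio.charset.Charset;
    import java.nio.charset.IllegalCharsetNameException;

    public class Charsets {
        //hypothetical helper: answers the question isSupported should have
        //answered by itself - an illegally named charset is not a supported one
        public static boolean isReallySupported(String name) {
            if (name == null)
                return false;   //isSupported throws on null too
            try {
                return Charset.isSupported(name);
            } catch (IllegalCharsetNameException e) {
                return false;   //illegal name => not supported
            }
        }
    }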


Such is the art of programming in Java.

New Jericho Parser version fixes choking on unusual charset

Each day I find something completely wacko on the Net. Today it is an extremely interesting charset present in the headers of a certain site:


mpire@brwdbs01:~$ curl -I http://uk.real.com/realplayer/
HTTP/1.1 200 OK
Expires: 0
Date: Thu, 13 May 2010 22:44:16 GMT
Content-Length: 2690
Server: Caudium
Connection: close
Content-Type: text/html; charset='.().'
pragma: no-cache
X-Host-Name: hhnode21.euro.real.com
X-Got-Fish: Yes
Accept-Ranges: bytes
MIME-Version: 1.0
Cache-Control: no-cache, no-store, max-age=0, private, must-revalidate, proxy-revalidate

A charset of '.().'. Strange indeed. This manages to choke the Jericho parser I'm using, and I'm not too sure what to do about it. The parser does a pretty nice job of handling all kinds of encodings, but since it has no idea what this one is, it gives up. I could try making it fall back to the default (ISO-8859-1).
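
Something along these lines is what I have in mind - a rough sketch of the fallback, assuming the declared charset name has already been pulled out of the Content-Type header (the helper is mine, not Jericho's):

    import java.nio.charset.Charset;

    public class CharsetFallback {
        private static final Charset DEFAULT = Charset.forName("ISO-8859-1");

        //hypothetical helper: resolve a declared charset name, falling back
        //to ISO-8859-1 when the name is nonsense like '.().'
        public static Charset resolve(String declared) {
            if (declared == null)
                return DEFAULT;
            try {
                return Charset.forName(declared.trim());
            } catch (IllegalArgumentException e) {
                //covers both IllegalCharsetNameException and
                //UnsupportedCharsetException
                return DEFAULT;
            }
        }
    }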

This has been fixed in a newer version of the Jericho Parser.

Wednesday, May 12, 2010

Normalizing a URL

Today, I had this interesting problem to do with fetching a web page.
I was processing HTML meta-refresh tags and ran into this one:

<meta http-equiv="REFRESH" content="0; URL=../cgi-bin/main2.cgi">
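
As an aside, pulling the target out of that content attribute is a quick regex job. This sketch uses a pattern and helper name of my own, not the actual crawler code:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class MetaRefresh {
        //matches content values like "0; URL=../cgi-bin/main2.cgi",
        //with or without quotes around the url, case-insensitively
        private static final Pattern REFRESH = Pattern.compile(
                "^\\s*\\d+\\s*;\\s*url\\s*=\\s*['\"]?([^'\"\\s]+)",
                Pattern.CASE_INSENSITIVE);

        //hypothetical helper: returns the refresh target, or null if none
        public static String extractUrl(String content) {
            Matcher m = REFRESH.matcher(content);
            return m.find() ? m.group(1) : null;
        }
    }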

If I try to turn this into an absolute URL (the URL this content was fetched from being http://popyard.com), I would be trying to do this:

> curl -I http://popyard.com/../cgi-bin/main2.cgi
> HTTP/1.1 400 Bad Request

However, it turns out that the browser deals with this just fine. Faced with a malformed URL, it guesses and keeps going, fetching http://popyard.com/cgi-bin/main2.cgi

This required me to do some cleanup of the URL to handle these dot segments (..). There is a well-known procedure for this - RFC 3986 calls it remove_dot_segments - and I coded it up using a simple stack-based approach.

Here is the code:

    //removes .. sequences from the url string, handling extra .. sequences by
    //stopping at the domain, thus always returning a correct url.
    //ex: http://www.ex.com/a/xt/../myspace.html => http://www.ex.com/a/myspace.html
    //    http://www.ex.com/a/xt/../../../myspace.html => http://www.ex.com/myspace.html
    //needs: import java.util.LinkedList;
    public static String normalizePath(String url) {
        if (url.indexOf("..") == -1)
            return url; //no .. segments, no need to normalize

        String[] toks = url.split("/");

        //skip the scheme token ("http:") and the empty token produced by the
        //"//" after it; the first token left standing is the domain
        int i;
        for (i = 0; i < toks.length
                && (toks[i].length() == 0 || toks[i].indexOf(":") != -1); i++)
            ;
        if (i == toks.length)
            return url;     //no proper path found, simply return the url
        // toks[i] is the domain

        //push each segment; a ".." pops the segment above it, but never the
        //domain sitting at the bottom of the stack
        LinkedList<String> s = new LinkedList<String>();
        for (; i < toks.length; i++) {
            if (!toks[i].equals(".."))
                s.push(toks[i]);
            else if (s.size() > 1)
                s.pop();
        }

        if (s.size() < 1)
            return url;     //no proper domain found, simply return the url

        //rebuild the url bottom-up: proto://domain first, then each segment
        int idx = url.indexOf("://");
        StringBuilder sb = new StringBuilder();
        sb.append(idx != -1 ? url.substring(0, idx + 3) : "").append(s.removeLast());

        while (s.size() > 0) {
            sb.append("/").append(s.removeLast());
        }

        return sb.toString();
    }


Basically, what this code does is split the URL into the domain and the path components and use a stack to manipulate these terms. We push each term onto the stack, unless the term is a dot segment - ".." - in which case, if there are at least two terms on the stack, we pop the top one. This way, we never pop the bottom term on the stack, which is the domain.

Then we can read off the normalized URL by following the terms in the stack from bottom to top.

That traversal is the reason we can't really use a plain stack, as it doesn't let us walk from the bottom to the top. So we use a LinkedList instead: push and pop work at the head, while removeLast reads from the tail.
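
For example (results per my reading of the code above):

    normalizePath("http://popyard.com/../cgi-bin/main2.cgi")
    // => "http://popyard.com/cgi-bin/main2.cgi"

    normalizePath("http://www.ex.com/a/xt/../../../myspace.html")
    // => "http://www.ex.com/myspace.html" - the extra ..s stop at the domain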