Wednesday, May 12, 2010

Normalizing a URL

Today, I ran into an interesting problem while fetching a web page.
I was processing HTML meta-refresh tags and came across this one:

<meta http-equiv="REFRESH" content="0; URL=../cgi-bin/main2.cgi">

If I try to turn this into an absolute URL, relative to the URL this content was fetched from, I end up making a request like this:

> curl -I
> HTTP/1.1 400 Bad Request

However, it turns out that the browser deals with this just fine. Faced with a malformed URL, it guesses and keeps going, fetching the intended page.

This required me to do some cleanup of the URL to handle these dot segments (..). There is a well-known procedure for doing this (RFC 3986, section 5.2.4, "Remove Dot Segments"). I coded it up using a simple stack-based approach.
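For comparison, the JDK can already collapse dot segments that sit in the middle of a path. The snippet below is only a minimal sketch: the class name `UriNormalizeSketch` and the host `example.com` are placeholders of mine, not the site from this post.

```java
import java.net.URI;

public class UriNormalizeSketch {
    // Collapses "." and ".." segments using the JDK's built-in normalizer.
    public static String normalize(String url) {
        return URI.create(url).normalize().toString();
    }

    public static void main(String[] args) {
        // example.com is a placeholder host, used only for illustration.
        // prints http://example.com/a/cgi-bin/main2.cgi
        System.out.println(normalize("http://example.com/a/b/../cgi-bin/main2.cgi"));
    }
}
```

As far as I can tell from the `URI.normalize()` documentation, a ".." that has no preceding segment to cancel against is kept rather than dropped, which is exactly the case the stack-based code handles by stopping at the domain.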

Here is the code:

    //removes .. sequences from the url string handling extra .. sequences by stopping
    //at the domain, thus always returning a correct url.
    //ex: =>
    // =>    
    public static String normalizePath(String url) {
        if (url.indexOf("..") == -1)
            return url; //no .. seqs, no need to normalize
        String[] toks = url.split("/");
        int i;
        //skip the scheme token ("http:") and the empty token between the two slashes;
        //endsWith avoids mistaking a host with a port (host:8080) for the scheme
        for (i=0; i<toks.length && (toks[i].length() == 0 || toks[i].endsWith(":")); i++);
        if (i==toks.length)
            return url;     //no proper path found, simply return the url
        // toks[i] is the domain

        //java.util.LinkedList used as a stack: push/pop at the head, read from the tail
        LinkedList<String> s = new LinkedList<String>();
        for (; i<toks.length; i++) {
            if (!toks[i].equals(".."))
                s.addFirst(toks[i]);    //push the term
            else if (s.size()>1)
                s.removeFirst();        //pop the top path term, never the domain
        }

        if (s.size()<1)
            return url;     //no proper domain found, simply return the url

        int idx = url.indexOf("://");
        StringBuilder sb = new StringBuilder();
        sb.append( (idx != -1 ? url.substring(0, idx+3) : "")).append(s.removeLast()); //get proto://domain

        while (s.size()>0) {
            sb.append("/").append(s.removeLast());  //append path terms, bottom to top
        }

        return sb.toString();
    }

Basically, this code splits the URL into the domain and the path components and uses a stack to manipulate these terms. We push each term onto the stack, unless the term is a dot segment (".."), in which case, if there are at least two terms on the stack, we pop the top one. This way, we never pop the bottom term of the stack, which is the domain.

Then we can build the normalized URL by following the terms in the stack from bottom to top.

This is also why we can't really use a pure stack, which only gives access to its top element: we need to traverse the terms from the bottom to the top. So we use a LinkedList instead.
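The push, pop, and bottom-to-top read can be seen in isolation in the sketch below. It uses `java.util.ArrayDeque` rather than the LinkedList from the code above, purely as an illustration; the class name `DotSegmentSketch`, the method `collapse`, and the domain `example.com` are hypothetical names of mine.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

public class DotSegmentSketch {
    // Resolves ".." in a list of terms whose first entry is the domain.
    public static String collapse(String[] terms) {
        Deque<String> stack = new ArrayDeque<>();
        for (String term : terms) {
            if (!term.equals("..")) {
                stack.push(term);        // push onto the top
            } else if (stack.size() > 1) {
                stack.pop();             // pop the top, but never the domain
            }
        }
        // descendingIterator() walks from the bottom (the domain) to the top.
        StringBuilder sb = new StringBuilder();
        Iterator<String> it = stack.descendingIterator();
        while (it.hasNext()) {
            if (sb.length() > 0) sb.append('/');
            sb.append(it.next());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // example.com is a placeholder domain, used only for illustration.
        // prints example.com/cgi-bin/main2.cgi
        System.out.println(collapse(new String[] {"example.com", "..", "cgi-bin", "main2.cgi"}));
    }
}
```

The extra ".." right after the domain is simply ignored, because popping is only allowed while at least two terms are on the stack.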
