Thursday, May 13, 2010

Jericho Parser new version fixes choking on unusual charset

Each day I find something completely wacko on the Net. Today it is an extremely interesting charset present in the headers of a certain site:


mpire@brwdbs01:~$ curl -I http://uk.real.com/realplayer/
HTTP/1.1 200 OK
Expires: 0
Date: Thu, 13 May 2010 22:44:16 GMT
Content-Length: 2690
Server: Caudium
Connection: close
Content-Type: text/html; charset='.().'
pragma: no-cache
X-Host-Name: hhnode21.euro.real.com
X-Got-Fish: Yes
Accept-Ranges: bytes
MIME-Version: 1.0
Cache-Control: no-cache, no-store, max-age=0, private, must-revalidate, proxy-revalidate

a charset of '.().' Strange indeed. This manages to choke the JerichoParser I'm using and I'm not too sure what to do about it. The parser does a pretty nice job parsing all kinds of encodings, and since it has no idea of this kind, it gives up. I could try and make it use the default (ISO-8859-1).

This has been fixed on a newer version of the Jericho Parser.

No comments: