Thursday, May 13, 2010

New Jericho Parser version fixes choking on unusual charset

Each day I find something completely wacko on the Net. Today it is an extremely interesting charset present in the headers of a certain site:

mpire@brwdbs01:~$ curl -I
HTTP/1.1 200 OK
Expires: 0
Date: Thu, 13 May 2010 22:44:16 GMT
Content-Length: 2690
Server: Caudium
Connection: close
Content-Type: text/html; charset='.().'
pragma: no-cache
X-Got-Fish: Yes
Accept-Ranges: bytes
MIME-Version: 1.0
Cache-Control: no-cache, no-store, max-age=0, private, must-revalidate, proxy-revalidate

A charset of '.().'. Strange indeed. This manages to choke the Jericho parser I'm using, and I'm not too sure what to do about it. The parser does a pretty nice job with all kinds of encodings, but since it does not recognize this one, it gives up. I could try to make it fall back to the default (ISO-8859-1).

This has been fixed in a newer version of the Jericho Parser.
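
For anyone stuck on an older version, the fallback idea mentioned above could look roughly like the sketch below. It is not part of Jericho's API. The class name CharsetFallback and the method resolve are hypothetical names made up for this illustration, and it relies only on the standard java.nio.charset classes. It pulls the charset parameter out of a Content-Type value and returns ISO-8859-1 whenever the name is missing, illegal, or unknown to the JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jericho: picks a usable charset from a
// Content-Type header value and falls back to ISO-8859-1 when it cannot.
final class CharsetFallback {

    // Captures the value of the charset parameter, with optional double quotes,
    // e.g. "text/html; charset=utf-8" or "text/html; charset=\"utf-8\"".
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("charset\\s*=\\s*\"?([^\";]*)\"?", Pattern.CASE_INSENSITIVE);

    private CharsetFallback() {
    }

    static Charset resolve(String contentType) {
        Charset fallback = StandardCharsets.ISO_8859_1;
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        if (!m.find()) {
            return fallback; // no charset parameter at all
        }
        String name = m.group(1).trim();
        if (name.isEmpty()) {
            return fallback;
        }
        try {
            return Charset.forName(name);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            // Names such as '.().' are not legal charset names; use the default.
            return fallback;
        }
    }
}
```

With something like this, the raw bytes of the response can be decoded with new String(bytes, CharsetFallback.resolve(contentTypeHeader)) and the resulting text handed to the parser, so the parser never has to interpret the broken header itself.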
