Thursday, January 15, 2009

Java HTMLParser and meta http-equiv redirects



Java HTMLParser can be used to parse a web page into more meaningful chunks. For example, we can parse for links, H1 tags or images.

Some web pages (URLs) return an HTTP redirect response (a 3xx status code) to the browser instead of a 200 status code with content. The browser then requests the redirected page, whose address is given in the Location header supplied with the redirect response.
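
To make the status-code check concrete, here is a minimal sketch that uses only the JDK. The class name RedirectStatus and its method names are made up for illustration; 307 and 308 are written as literals because HttpURLConnection defines no constants for them.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper for illustration; not part of HTMLParser.
final class RedirectStatus {

    // True for the status codes that normally come with a Location header.
    static boolean isRedirect(int status) {
        return status == HttpURLConnection.HTTP_MOVED_PERM   // 301
            || status == HttpURLConnection.HTTP_MOVED_TEMP   // 302
            || status == HttpURLConnection.HTTP_SEE_OTHER    // 303
            || status == 307
            || status == 308;
    }

    // Fetches the url without following redirects and returns the Location
    // header if the server answered with a redirect, otherwise null.
    static String redirectTarget(String url) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        con.setInstanceFollowRedirects(false);
        try {
            return isRedirect(con.getResponseCode()) ? con.getHeaderField("Location") : null;
        } finally {
            con.disconnect();
        }
    }
}
```

Calling redirectTarget on a URL shows what the browser would see before it makes the second request.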

The HTMLParser handles this type of redirection automatically. However, there is another form of a redirect that is accomplished by an HTML meta tag like so:

<meta http-equiv="refresh" content="2;url=http://webdesign.about.com">

A web page containing this tag instructs the browser to fetch the URL given in the content attribute after a 2-second delay. The content attribute thus gives the browser two pieces of information: the delay before the redirect and the new URL.

Also note that if the URL portion is omitted from the content attribute, the tag acts as an automatic refresh of the same URL.
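
Splitting the content value into its two parts can be sketched as follows. The class MetaRefresh and its fields are hypothetical names, and the pattern is an assumption about what pages tend to contain in practice (upper-case URL, spaces around the separator, quotes around the address); it only handles whole-second delays.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical value class for illustration; not part of HTMLParser.
final class MetaRefresh {

    // A delay in whole seconds, optionally followed by ";url=..." or ",url=...".
    private static final Pattern CONTENT = Pattern.compile(
            "^\\s*(\\d{1,9})\\s*(?:[;,]\\s*url\\s*=\\s*['\"]?([^'\"]*?)['\"]?)?\\s*$",
            Pattern.CASE_INSENSITIVE);

    final int delaySeconds;
    final String url; // null means "refresh the same page"

    private MetaRefresh(int delaySeconds, String url) {
        this.delaySeconds = delaySeconds;
        this.url = url;
    }

    // Returns the parsed value, or null if the content is not a refresh directive.
    static MetaRefresh parse(String content) {
        if (content == null) {
            return null;
        }
        Matcher m = CONTENT.matcher(content);
        if (!m.matches()) {
            return null;
        }
        String target = m.group(2);
        if (target != null && target.isEmpty()) {
            target = null;
        }
        return new MetaRefresh(Integer.parseInt(m.group(1)), target);
    }
}
```

For the tag above, MetaRefresh.parse("2;url=http://webdesign.about.com") gives a delay of 2 and the about.com address; for a plain "30" it gives a delay of 30 and a null url.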

HTMLParser will not automatically redirect from a meta http-equiv tag.
Even HttpURLConnection has no facility for following this type of directive automatically. (It is possible to construct an HttpURLConnection and pass it to the HTMLParser, but like I said, this won't cover this special form of redirect either.)

So this is a case where we need to follow the redirect ourselves. It is possible to extend HTMLParser to have this ability. Here is the bare-bones code one would use for following the redirect:


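import java.net.HttpURLConnection;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.MetaTag;
import org.htmlparser.util.NodeList;

// Fragment: assumes a String variable "url" is in scope and that the
// enclosing method handles IOException and ParserException.

// Let HttpURLConnection follow ordinary HTTP (3xx) redirects.
HttpURLConnection.setFollowRedirects(true);
HttpURLConnection urlCon = (HttpURLConnection) new URL(url).openConnection();

Parser parser = new Parser(urlCon);

// Accept upper case, optional whitespace and quotes, e.g. "0; URL='http://...'".
Pattern pattern = Pattern.compile(
        "\\d+\\s*[;,]\\s*url\\s*=\\s*['\"]?([^'\"]+)", Pattern.CASE_INSENSITIVE);

// Handle any meta http-equiv refresh redirect.
NodeList nodes = parser.extractAllNodesThatMatch(new NodeClassFilter(MetaTag.class));
for (int i = 0; i < nodes.size(); i++) {
    Node node = nodes.elementAt(i);
    MetaTag meta = (MetaTag) node;
    String httpEquiv = meta.getHttpEquiv();
    String content = meta.getMetaContent();
    if ("refresh".equalsIgnoreCase(httpEquiv) && content != null) {
        Matcher matcher = pattern.matcher(content);
        if (matcher.find()) {
            // Point the parser at the redirect target, then stop scanning
            // the meta tags of the old page.
            url = matcher.group(1).trim();
            urlCon = (HttpURLConnection) new URL(url).openConnection();
            parser = new Parser(urlCon);
            break;
        }
    }
}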
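
The bare-bones code assumes the refresh target is an absolute URL and follows at most one refresh. A relative target such as "/new/page" would make new URL(url) throw, and pages that refresh to each other would loop forever if we simply repeated the step. Here is a rough sketch of how both could be handled; RefreshFollower, resolve, follow and the nextTarget callback are hypothetical names, and the callback stands in for "fetch the page with the Parser and return its refresh URL, or null".

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.function.Function;

// Hypothetical helper for illustration; not part of HTMLParser.
final class RefreshFollower {

    // Resolves a refresh target (absolute or relative) against the page it came from.
    static String resolve(String pageUrl, String target) {
        try {
            return new URL(new URL(pageUrl), target.trim()).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Bad redirect target: " + target, e);
        }
    }

    // Follows refresh targets until a page has none or maxHops is reached.
    // nextTarget returns the refresh URL found on a page, or null if there is none.
    static String follow(String startUrl, Function<String, String> nextTarget, int maxHops) {
        String current = startUrl;
        for (int hop = 0; hop < maxHops; hop++) {
            String target = nextTarget.apply(current);
            if (target == null) {
                return current;
            }
            current = resolve(current, target);
        }
        return current; // hop limit reached; give up rather than loop forever
    }
}
```

The hop limit is a judgment call; a small number such as 5 is probably enough for real pages, but that figure is a guess rather than anything HTMLParser prescribes.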

2 comments:

Neno said...

Great post... almost there to solve my problem.

Can I just ask which package did you take the Parser from?

Cheers,

Nenad Bartonicek

Neno said...

Found it!

org.htmlparser

Cheers,

N.