Wednesday, January 28, 2009

Java Collections.copy behavior is unexpected


Yesterday I was trying to sort a Collection returned by JDO. Of course this is not possible, as JDO result sets are read-only. So I looked for a Java method that would let me copy the JDO collection into a new collection, which I would then sort.

The first method that popped up was Collections.copy(List dest, List src). Note that it takes Lists, not arbitrary Collections. After I coded up the copy, it crashed with an IndexOutOfBoundsException thrown from inside the copy method.

Upon further investigation, I found that the destination list must already contain at least as many elements as the source list. A freshly created ArrayList fails this check even if it was constructed with enough capacity, because capacity is not size. This is not what one expects from a copy method. It seems like Sun introduced this method as a convenience for overwriting the contents of an existing list with the elements of another list. You can read about it here.

Fortunately, if your destination is a List, there is a List.addAll(Collection c) method that appends every element of a collection to the list. Calling it on a new, empty list gives you the copy.
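Here is a minimal sketch of both behaviors using only java.util. The class and method names (CopyThenSort, sortedCopy, copyIntoEmptyListFails) are made up for illustration, and an unmodifiable collection stands in for the read-only JDO result, which is an assumption on my part rather than what JDO actually returns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

class CopyThenSort {

    // Copies any collection into a fresh, modifiable list and sorts the copy.
    // The source collection is never modified.
    static <T extends Comparable<? super T>> List<T> sortedCopy(Collection<? extends T> source) {
        // The constructor argument is only a capacity hint; the list is still empty.
        List<T> copy = new ArrayList<T>(source.size());
        copy.addAll(source);
        Collections.sort(copy);
        return copy;
    }

    // Returns true if Collections.copy rejects an empty destination list.
    static boolean copyIntoEmptyListFails(List<String> source) {
        List<String> dest = new ArrayList<String>(source.size()); // capacity, not size
        try {
            Collections.copy(dest, source);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Stand-in for a read-only result (assumption: not a real JDO collection).
        Collection<String> readOnly =
                Collections.unmodifiableCollection(Arrays.asList("pear", "apple", "fig"));

        System.out.println(sortedCopy(readOnly));                            // [apple, fig, pear]
        System.out.println(copyIntoEmptyListFails(Arrays.asList("a", "b"))); // true
    }
}
```

The ArrayList(Collection c) constructor would do the copy in one step as well; addAll is used here only because it is the method mentioned above.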

Thursday, January 15, 2009

Java HTMLParser and meta http-equiv redirects



Java HTMLParser can be used to parse a web page into more meaningful chunks. For example, we can parse for links, H1 tags or images.

Some URLs return an HTTP redirect response to the browser instead of the 200 status code that accompanies normal content. The browser then makes a request to the redirected page, whose address is found in the Location header supplied with the redirect response.

The HTMLParser handles this type of redirection automatically. However, there is another form of a redirect that is accomplished by an HTML meta tag like so:

<meta http-equiv="refresh" content="2;url=http://webdesign.about.com">

A web page containing this tag instructs the browser to fetch the URL given in the content attribute after a 2 second delay. The content attribute thus provides two pieces of data to the browser: the delay before the redirect and the new URL.

Also note that if the url portion is omitted from the content attribute, the tag serves as an automatic refresh of the same URL.
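To make the two parts of the content value concrete, here is a rough sketch that splits it into a delay and an optional URL using only java.util.regex. The MetaRefresh class and its parse method are hypothetical names of mine, not part of HTMLParser, and the pattern assumes the common "delay;url=target" shape (with optional whitespace and quotes) rather than every variant a browser accepts.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaRefresh {

    // Delay in seconds, optionally followed by ";url=target" (or ",url=target").
    // Whitespace around the separators and quotes around the target are tolerated.
    private static final Pattern CONTENT = Pattern.compile(
            "^\\s*(\\d{1,9})\\s*(?:[;,]\\s*url\\s*=\\s*['\"]?([^'\"]*?)['\"]?\\s*)?$",
            Pattern.CASE_INSENSITIVE);

    final int delaySeconds;
    final String url; // null means "refresh the same page"

    private MetaRefresh(int delaySeconds, String url) {
        this.delaySeconds = delaySeconds;
        this.url = url;
    }

    // Returns null when the value does not look like a refresh directive.
    static MetaRefresh parse(String content) {
        if (content == null) {
            return null;
        }
        Matcher m = CONTENT.matcher(content);
        if (!m.matches()) {
            return null;
        }
        String target = m.group(2);
        if (target != null && target.isEmpty()) {
            target = null; // "2;url=" is treated like a plain refresh
        }
        return new MetaRefresh(Integer.parseInt(m.group(1)), target);
    }

    public static void main(String[] args) {
        MetaRefresh redirect = parse("2;url=http://webdesign.about.com");
        System.out.println(redirect.delaySeconds + " -> " + redirect.url);

        MetaRefresh reload = parse("5");
        System.out.println(reload.delaySeconds + " -> " + reload.url);
    }
}
```

The target can also be a relative URL, which this sketch returns as-is; resolving it against the page address is left to the caller.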

HTMLParser will not automatically redirect from a meta http-equiv tag.
Even HttpURLConnection has no facility for redirecting automatically from this type of directive. (It is possible to construct an HttpURLConnection and pass it to the HTMLParser, but that won't cover this special form of redirect either.)

So this is a case where we need to follow the redirect ourselves. It is possible to extend HTMLParser to have this ability. Here is the bare-bones code one would use for following the redirect:


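import java.net.HttpURLConnection;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.MetaTag;
import org.htmlparser.util.NodeList;

// url is a String holding the address of the page to fetch
// ordinary HTTP redirects are followed by the connection itself
HttpURLConnection.setFollowRedirects(true);
HttpURLConnection urlCon = (HttpURLConnection) new URL(url).openConnection();

Parser parser = new Parser(urlCon);

// handle any meta http-equiv refresh redirects;
// the pattern ignores case and tolerates spaces, e.g. "2; URL=http://..."
Pattern pattern = Pattern.compile("\\d+\\s*;\\s*url\\s*=\\s*(.*)", Pattern.CASE_INSENSITIVE);
NodeList nodes = parser.extractAllNodesThatMatch(new NodeClassFilter(MetaTag.class));
for (int i = 0; i < nodes.size(); i++) {
    MetaTag meta = (MetaTag) nodes.elementAt(i);
    String httpEquiv = meta.getHttpEquiv();
    String content = meta.getMetaContent();
    if (httpEquiv != null && httpEquiv.equalsIgnoreCase("refresh") && content != null) {
        Matcher matcher = pattern.matcher(content);
        if (matcher.find()) {
            // point the parser at the redirect target and stop scanning the old page
            url = matcher.group(1).trim();
            urlCon = (HttpURLConnection) new URL(url).openConnection();
            parser = new Parser(urlCon);
            break;
        }
    }
}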