Wednesday, June 05, 2013

Java: splitting on a single character

If you want to split a Java String on a single character on a compute-intensive path in your code, you might want to steer clear of String.split. The JDK function uses a regular expression for splitting, and before JDK 1.7 String.split had no optimization for single characters.

An optimization was introduced in JDK 1.7, but if your split character happens to have special meaning in a regular expression (e.g. ^ or |), the optimization will not apply.

I used org.apache.commons.lang.StringUtils.split to gain a roughly 3X advantage over the split call used in our servers.
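If pulling in Commons Lang is not an option, the same idea can be sketched with nothing but String.indexOf. This is a minimal illustration, not the StringUtils implementation: the class name CharSplit and its split method are hypothetical, and the behavior is deliberately simple in that every empty token is kept. That differs from String.split, which drops trailing empty strings, and from StringUtils.split, which treats adjacent separators as one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the JDK or Commons Lang.
class CharSplit {

    // Splits s on every occurrence of sep without any regex involvement.
    // Empty tokens (leading, trailing, or between adjacent separators) are kept.
    static String[] split(String s, char sep) {
        List<String> parts = new ArrayList<String>();
        int start = 0;
        int idx;
        while ((idx = s.indexOf(sep, start)) != -1) {
            parts.add(s.substring(start, idx));
            start = idx + 1;
        }
        // Whatever follows the last separator (possibly an empty string).
        parts.add(s.substring(start));
        return parts.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String var = "here|is|a|string|that|must|be|split";
        for (String part : split(var, '|')) {
            System.out.println(part);
        }
    }
}
```

Whether a hand-rolled loop like this beats StringUtils.split on your workload is something to measure rather than assume.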

Here is the performance test:

import org.apache.commons.lang.StringUtils;

public class TSplit {

    public static void main(String[] args) {
        if (args.length == 0) {
            System.err.println("Usage: TSplit jdk|nojdk");
            System.exit(-1);
        }
        String var = "here|is|a|string|that|must|be|split";
        if ("jdk".equals(args[0])) {
            // JDK path: String.split takes a regex, so the pipe must be escaped.
            for (int i = 0; i < 10000000; i++) {
                String[] splits = var.split("\\|");
            }
        } else {
            // Commons Lang path: the separator is a plain char, no regex involved.
            for (int i = 0; i < 10000000; i++) {
                String[] splits = StringUtils.split(var, '|');
            }
        }
    }

}

The results from the test:

[~/] time java -cp `echo /path/to/jars/*.jar|tr ' ' :` TSplit jdk

real 0m16.027s
user 0m16.245s
sys 0m0.412s
[~/] time java -cp `echo /path/to/jars/*.jar|tr ' ' :` TSplit nojdk

real 0m5.354s
user 0m5.395s
sys 0m0.304s

As this post shows, users who encountered these problems before 1.7 have sometimes gone as far as hacking their code to pre-compile the single split character into a regular expression. Unfortunately, this means that if and when they upgrade to 1.7, the optimization that Sun added will have no effect, because the optimization lives in String.split and a pre-compiled Pattern never goes through it.
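For reference, the pre-compilation workaround usually looks something like the sketch below. The class name PrecompiledSplit and the constant PIPE are made up for illustration; only Pattern.compile, Pattern.split and String.split are real JDK calls. Both methods return the same tokens, but the first always goes through the regex engine, whatever String.split does internally on the running JDK.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical example class showing the two call styles side by side.
class PrecompiledSplit {

    // Compiled once and reused: the pre-1.7 workaround.
    private static final Pattern PIPE = Pattern.compile("\\|");

    static String[] viaPattern(String s) {
        // Always uses the regex engine, on any JDK version.
        return PIPE.split(s);
    }

    static String[] viaString(String s) {
        // String.split decides internally how to handle the argument.
        return s.split("\\|");
    }

    public static void main(String[] args) {
        String var = "here|is|a|string|that|must|be|split";
        System.out.println(Arrays.toString(viaPattern(var)));
        System.out.println(Arrays.toString(viaString(var)));
    }
}
```

If you carry a workaround like this, it is worth re-running a benchmark such as TSplit after a JDK upgrade to see whether it still pays off.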
