Wednesday, June 19, 2013

Java : remove specified characters from a string (and quickly!)

Recently, I had to remove extraneous carriage returns and line feeds from URLs on a path that had real-time performance concerns. There was already some code that did this job, albeit inefficiently :

    public static String remove(final String string,
                                final char remove)
    {
        if (string == null) throw new IllegalArgumentException("string is null");

        int index = 0;
        final StringBuilder stringBuilder = new StringBuilder(string);
        while ((index = indexOf(stringBuilder, remove, index)) != -1) {
            stringBuilder.deleteCharAt(index);
        }
        return stringBuilder.toString();
    }

This code, is inefficient as it forces the StringBuilder.deleteCharAt to shift all characters in the buffer left, each time a character that need to be removed is found. Here is how StringBuilder.deleteCharAt is implemented :
764 public AbstractStringBuilder deleteCharAt(int index) {
765         if ((index < 0) || (index >= count))
766             throw new StringIndexOutOfBoundsException(index);
767         System.arraycopy(value, index+1, value, index, count-index-1);
768         count--;
769         return this;
770     } 

So I searched around for a hopefully more efficient implementation of this basic function. There was a StringUtils (apache commons lang) function that looked hopeful. However, it was not really meant to remove characters from a string, but to replace them: StringUtils.replaceChars(String str, String searchChars, String replaceChars) Here is how that looks like:
public static String replaceChars(String str, String searchChars, String replaceChars) {
4168        if (isEmpty(str) || isEmpty(searchChars)) {
4169            return str;
4170        }
4171        if (replaceChars == null) {
4172            replaceChars = EMPTY;
4173        }
4174        boolean modified = false;
4175        int replaceCharsLength = replaceChars.length();
4176        int strLength = str.length();
4177        StrBuilder buf = new StrBuilder(strLength);
4178        for (int i = 0; i < strLength; i++) {
4179            char ch = str.charAt(i);
4180            int index = searchChars.indexOf(ch);
4181            if (index >= 0) {
4182                modified = true;
4183                if (index < replaceCharsLength) {
4184                    buf.append(replaceChars.charAt(index));
4185                }
4186            } else {
4187                buf.append(ch);
4188            }
4189        }
4190        if (modified) {
4191            return buf.toString();
4192        }
4193        return str;
4194    }

The code does not optimize for the case where the resulting string would be shorter than the original, so this code is likely to be slow. Next, I found a Guava function in com.google.common.base.CharMatcher : String removeFrom(CharSequence sequence). That used quite an interesting algorithm as follows:
  public String removeFrom(CharSequence sequence) {
    String string = sequence.toString();
    int pos = indexIn(string);
    if (pos == -1) {
      return string;
    }

    char[] chars = string.toCharArray();
    int spread = 1;

    // This unusual loop comes from extensive benchmarking                                                                                
    OUT: while (true) {
      pos++;
      while (true) {
        if (pos == chars.length) {
          break OUT;
        }
        if (matches(chars[pos])) {
          break;
        }
        chars[pos - spread] = chars[pos];
        pos++;
      }
      spread++;
    }
    return new String(chars, 0, pos - spread);
  }

It keeps track of the distance between the source and destination in the "spread" variable. However, this means that for each left shift, it needs to do a subtraction. Since these all seemed less than optimal, I decided to code my own. This was my first function, where I try to keep track of the first index for the destination.
    public static String remove2(final String string, final char... chars) {
        char[] arr = string.toCharArray();
        int dst = -1;
        for (int src=0; src<arr.length; src++) {
            boolean rm = false;
            for (char c : chars) {
                if (c == arr[src]) {
                    rm = true;
                    if (dst == -1)
                        dst = src; //set first dst pos 
                }
            }
            if (!rm && dst != -1) {
                arr[dst++] = arr[src];
            }
        }
        return dst == -1 ? string : new String(arr, 0, dst);
    }


Upon a hunch that the JVM should be smart enough to not copy a value from an address to itsef, I then decided to remove the check for the first destination index. Here is that version:
   public static String remove3(final String string, final char... chars) {
        char[] arr = string.toCharArray();
        int dst=0;
        for (int src=0; src<arr.length; src++) {
            boolean rm = false;
            for (char c : chars) {
                if (c == arr[src]) {
                    rm = true;
                }
            }
            if (!rm) {
                arr[dst++] = arr[src];
            }
        }
        return dst == string.length() ? string : new String(arr, 0, dst);
    }


Next I benchmarked the different functions with a simple test, using just one input. Depending on input, your results may vary, so it best to work with inputs you are likely to see in your scenario.

remove : 20.092s
Guava : 25.164s
apache commons langs : 1m33.573s
remove2 : 19.272s
remove3 : 12.109s

So for this simple test, Guava was worse than our hand-optimized remove3(). The apache common langs function took a big hit as it was not optimizing for removal. I'm curious why the Googlers went with their approach on Guava. It may be that all JVMs are not as smart as to prevent a dumb copy from the same address to itself, and Guava makes sure such code does not have a chance to get to the JVM, since the variable "spread" always being greater than 0, this assignment can never be from one address to itself :

chars[pos - spread] = chars[pos];


2 comments:

Ashok said...

It would perform better if u put a break after "rm = true".

thushara said...

Thanks, yes - if there are many characters to be removed, that will become an important optimization.