public static String remove(final String string,
final char remove)
{
if (string == null) throw new IllegalArgumentException("string is null");
int index = 0;
final StringBuilder stringBuilder = new StringBuilder(string);
while ((index = indexOf(stringBuilder, remove, index)) != -1) {
stringBuilder.deleteCharAt(index);
}
return stringBuilder.toString();
}
This code, is inefficient as it forces the StringBuilder.deleteCharAt to shift all characters in the buffer left, each time a character that need to be removed is found. Here is how StringBuilder.deleteCharAt is implemented :
764 public AbstractStringBuilder deleteCharAt(int index) {
765 if ((index < 0) || (index >= count))
766 throw new StringIndexOutOfBoundsException(index);
767 System.arraycopy(value, index+1, value, index, count-index-1);
768 count--;
769 return this;
770 }
So I searched around for a hopefully more efficient implementation of this basic function. There was a StringUtils (apache commons lang) function that looked hopeful. However, it was not really meant to remove characters from a string, but to replace them: StringUtils.replaceChars(String str, String searchChars, String replaceChars) Here is how that looks like:
public static String replaceChars(String str, String searchChars, String replaceChars) {
4168 if (isEmpty(str) || isEmpty(searchChars)) {
4169 return str;
4170 }
4171 if (replaceChars == null) {
4172 replaceChars = EMPTY;
4173 }
4174 boolean modified = false;
4175 int replaceCharsLength = replaceChars.length();
4176 int strLength = str.length();
4177 StrBuilder buf = new StrBuilder(strLength);
4178 for (int i = 0; i < strLength; i++) {
4179 char ch = str.charAt(i);
4180 int index = searchChars.indexOf(ch);
4181 if (index >= 0) {
4182 modified = true;
4183 if (index < replaceCharsLength) {
4184 buf.append(replaceChars.charAt(index));
4185 }
4186 } else {
4187 buf.append(ch);
4188 }
4189 }
4190 if (modified) {
4191 return buf.toString();
4192 }
4193 return str;
4194 }
The code does not optimize for the case where the resulting string would be shorter than the original, so this code is likely to be slow. Next, I found a Guava function in com.google.common.base.CharMatcher : String removeFrom(CharSequence sequence). That used quite an interesting algorithm as follows:
public String removeFrom(CharSequence sequence) {
String string = sequence.toString();
int pos = indexIn(string);
if (pos == -1) {
return string;
}
char[] chars = string.toCharArray();
int spread = 1;
// This unusual loop comes from extensive benchmarking
OUT: while (true) {
pos++;
while (true) {
if (pos == chars.length) {
break OUT;
}
if (matches(chars[pos])) {
break;
}
chars[pos - spread] = chars[pos];
pos++;
}
spread++;
}
return new String(chars, 0, pos - spread);
}
It keeps track of the distance between the source and destination in the "spread" variable. However, this means that for each left shift, it needs to do a subtraction. Since these all seemed less than optimal, I decided to code my own. This was my first function, where I try to keep track of the first index for the destination.
public static String remove2(final String string, final char... chars) {
char[] arr = string.toCharArray();
int dst = -1;
for (int src=0; src<arr.length; src++) {
boolean rm = false;
for (char c : chars) {
if (c == arr[src]) {
rm = true;
if (dst == -1)
dst = src; //set first dst pos
}
}
if (!rm && dst != -1) {
arr[dst++] = arr[src];
}
}
return dst == -1 ? string : new String(arr, 0, dst);
}
Upon a hunch that the JVM should be smart enough to not copy a value from an address to itsef, I then decided to remove the check for the first destination index. Here is that version:
public static String remove3(final String string, final char... chars) {
char[] arr = string.toCharArray();
int dst=0;
for (int src=0; src<arr.length; src++) {
boolean rm = false;
for (char c : chars) {
if (c == arr[src]) {
rm = true;
}
}
if (!rm) {
arr[dst++] = arr[src];
}
}
return dst == string.length() ? string : new String(arr, 0, dst);
}
Next I benchmarked the different functions with a simple test, using just one input. Depending on input, your results may vary, so it best to work with inputs you are likely to see in your scenario.
remove : 20.092s
Guava : 25.164s
apache commons langs : 1m33.573s
remove2 : 19.272s
remove3 : 12.109s
So for this simple test, Guava was worse than our hand-optimized remove3(). The apache common langs function took a big hit as it was not optimizing for removal. I'm curious why the Googlers went with their approach on Guava. It may be that all JVMs are not as smart as to prevent a dumb copy from the same address to itself, and Guava makes sure such code does not have a chance to get to the JVM, since the variable "spread" always being greater than 0, this assignment can never be from one address to itself :
chars[pos - spread] = chars[pos];
2 comments:
It would perform better if u put a break after "rm = true".
Thanks, yes - if there are many characters to be removed, that will become an important optimization.
Post a Comment