tech: October 2008

Thursday, October 30, 2008

Java string replace is slow

Java String.replace uses regular expressions and can be slow, especially for large strings requiring many replacements.

If you don't need regular expression replacement, you can use this function I wrote that performs better.

static public String replaceString(String orig, String search, String replace) {
  int cap = replace.length() > search.length() ? orig.length() + (replace.length() - search.length())*20 : orig.length();
  StringBuilder out = new StringBuilder(cap);

  int prev = 0;
  CharSequence okPart;
  for (int i=orig.indexOf(search); i!=-1; i=orig.indexOf(search, prev)) {
    okPart = orig.subSequence(prev, i);
    out.append(okPart).append(replace);
    prev = i + search.length();
  }
  if (prev < orig.length()) {
    okPart = orig.subSequence(prev, orig.length());
    out.append(okPart);
  }
  return out.toString();
}

Here is the performance data with Java String.replace:

Here is the performance data with the new function:

Interpreting the data:

With String.replace, time inside doFlowAction() was broken down 71/29% for enQueueWBImpression/getCPCWidgetCache. The hottest function inside getCPCWidgetCache was String.replace that accounted for 93% of the time.

With the new replacement function,
With String.replace, time inside doFlowAction() was broken down 89/11% for enQueueWBImpression/getCPCWidgetCache. This reduction of time spent inside getCPCWidgetCache (a 62% gain) is due to the replacement of the String.replace with the new string function.

Friday, October 24, 2008

Find the libraries your executable needs

LD_DEBUG is an environment variable that can be used to look under the hood of the Linux loader.
For ex: setting LD_DEBUG to libs will show you what the loader is doing to find the shared libraries needed to run an executable.

output:

mpire@sandbox mpire $ LD_DEBUG=libs ls
23994: find library=libacl.so.1; searching
23994: search path=/site/mpire/sys/mysql/lib/mysql/tls/i686/m ....

Thursday, October 16, 2008

mysql - find missing rows in a join

The left join can be used to get all rows from a secondary table matching the rows of the first table, along with those rows from the first table including any rows that doesn't find a match on the secondary table. for rows with no match, NULL would be returned on the secondary table column. this can be useful when you want to find the rows that have no match:

select a.id, b.id from a left join b on (a.secid=b.id) where b.id is null;

And most times, you want to correct this by providing a good value for the NULL column in the secondary table. If you want to set all those rows with a NULL column to the same value, you can do:

update a left join b on (a.secid=b.id) set a.secid=12629002 where b.id is null;

Wednesday, October 15, 2008

Java - you can't return a value through a parameter to a function

In Java all parameters are strictly input. There are no output parameters. Anything you need to return from a function needs to be returned through the function return value.

[Here's a rationale for why Java disallows out parameters.]

Of course what I said is not strictly true that if you pass an object to a function, you can change a member variable of the object within the function and that would be a way to pass a value back to the caller.

But it is often the case, you want two values to be passed back to the caller. With Java, you would need to implement a little class with these two objects and pass that class back. This is the example I'm going through today:

I need to get a resultset from a JDO query. So I would like to move the JDO queries to a Manager class, and code a function like:

Collection getResults(PersistenceManager pm);

But the problem is that inside getResults, I need to instantiate a query object that takes up some database resources on the client side. It is a good idea to clean up the query object so that the database handles could be freed. It is very important if the query is executed repeatedly in a loop, or if the result set is large.

The nice thing about using the above call at the call site is its elegance:

Collection results = mgr.getResults(pm);
for (Object o : results) {
// process the result records
}

Now this would change to something like:

CollectionAndQuery cq = results.getResults(pm);
for (Object o: cq.results) {
// process
}
cq.query.closeAll();

And the CollectionAndQuery class has to be defined at global scope as it is needed in both the call site and the manager.

Now, if Java had a generic Pair object, we could have used that for this purpose. [CollectionAndQuery is really a specialized Pair object] Unfortunately there is no Pair object in Java.

Friday, October 10, 2008

got bitten by rand() % count again

Old Habits Die Hard (OHDH)

Coming from a C background, and having to work with hash buckets often, I'm a big fan of the modulus. If you have a key that you need to map to a number where the number is potentially smaller than the value of the key, you just modulus the key, right?
So when the number is generated randomly, I'm used to this code a lot:

int slot = rand() % size;

where size would be some kind of an array that probably is bounded by "size".

However, the rand() functions in perl, java,ruby do not work like this.

perl: rand() returns a floating point number between 0 and 1.
java: rand() returns the full range of the int, including negative numbers, so the modulus gets whacked.
php: ah, just like the C library implementation!
ruby: similar to perl, rand() returns a floating point number between 0 and 1.

so there are other variations of rand() functions that you can use in these language/libraries. for java we can use:

int num = new Random.nextInt(size);

this returns numbers >= 0 and < size

perl is similar.

Here's how you install perl mysql modules on the Mac

http://www.quicomm.com/apm_dbddbi.htm

nuf said!

Tuesday, October 07, 2008

Java has no Pair object?

Often times we work with pairs of objects vs single objects. We may need to store a list of such pairs. STL provides a pair class. Java does not. It would be a useful addition.

Currently I'm building a TreeMap and I need to use a set of pair objects (int weight -> advertisement) to populate it. Not having a pair object leads me to implement a pair data structure.

Java has no upper_bound?

I can't find an upper_bound() equivalent in the Java sorted containers.
Looking at TreeMap, we can use the TreeMap.tailMap.firstKey to get the lower_bound.
No function for upper_bound.

I googled on this for a bit and came across some posts that mis-represented what lower_bound, upper_bound means using the standard definition.

This is a useful function and we can code it up with TreeMap.tailmap, but it is useful enough I feel it should be available in the sorted containers in the JDK.

Today I needed an upper_bound() utility for the following task. I keep track of a set of advertisements to be served to the browser. Each advertisement has a weight associated with it based on its past performance. The weights of all advertisements add up to 100.

So for each web request, I need to return an advertisement based on these weights. If I had 2 advertisements with weights 80,20 then I would return the first advertisement 80% of the time.

I implement this with a TreeMap that maps the cumulative weight to the Advertisement. So for the example above it will look like this:

80 -> Ad 1
100 -> Ad 2

Now, all I need to do is to generate a random number greater >= 0 and < 100, and take its upper_bound. That would point me to the advertisement I should return.

Using lower_bound is technically in-correct. Unless I generate the random numbers >=1 and <=100. This can be done, but ideally if Java had an upper_bound() there will be less such work we have to do as programmers.

Friday, October 03, 2008

What's wrong with this Java code?

//returns a date that is before today
public static Date getBackDate(int numDays) {
Date date = new Date();
date.setTime(date.getTime()-numDays*24*60*60*1000);
return date;
}

Wednesday, October 01, 2008

Java PriorityQueue initialCapacity cannot be zero

It is mentioned in the docs, but who reads them, right?

Looking through the PriorityQueue implementation, I couldn't really tell why this restriction is necessary. They use an Object[] array to represent the heap and store items from index 1 vs index 0 to help the math. They also have a - int size - that keeps track of the number of elements in the PriorityQueue.

Seems like they could have made this work even for a PriorityQueue with an initialCapacity of 0. This could avoid some bugs from the call-site where a PriorityQueue is initialized with an initialCapacity obtained from another source (ex: the number of records that match a certain criteria in a table) and this could very well be zero.

Update:
Andrew Hayley who made this patch - http://gcc.gnu.org/ml/java-patches/2006-q3/msg00426.html - did that only to pass the tests. As I hear from other devs involved in the JDK, I will add those details here.