Monday, December 21, 2009

An In-Memory Dictionary with Java

I needed to build an in-memory dictionary to do fast lookups on terms retrieved from a web page. The dictionary is around 13K English words, including names, abbreviations, etc. It was compiled from the SCOWL project.

The first thing I realized was that Java uses around 90M of memory to store 65M of string data. At first glance this seems to be due to overhead in the String class - here is the relevant class structure showing the overhead:

public final class String implements java.io.Serializable, java.lang.Comparable<java.lang.String>, java.lang.CharSequence {
    private final char[] value;
    private final int offset;
    private final int count;
    private int hash;
}


As you can see, apart from the char array, there are offset and count fields and a cached hash of the string.

The offset and count fields can limit memory usage: if many strings can be represented as substrings of others, only one copy of the char[] value array is needed, and the individual strings reference the same array with different offset and count values.

The hash is used to speed up certain data access operations when Strings are used as keys in a hash-based collection like HashMap or HashSet. Rather than computing the hash each time a String is searched for, the hash can be stored in the String object. Since a String is immutable, this value never needs to be updated once set, making it easy to maintain.
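The caching works because hashCode() is a pure function of the character data: the documented polynomial s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], which never changes for an immutable String. A small sketch (the class name is mine) reproducing the algorithm:

```java
public class HashCacheDemo {
    // Reimplements the documented String.hashCode() polynomial:
    // s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
    static int polynomialHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String word = "hello";
        // The documented algorithm matches the built-in result,
        // so caching the value once is always safe for a String.
        System.out.println(polynomialHash(word) == word.hashCode()); // true
    }
}
```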

But of course, it adds an extra 4 bytes of overhead.

However, the biggest contributor seems to be the size of the char data type. Since a char must be able to hold a Unicode (UTF-16) code unit, it takes 2 bytes. So even without the other overhead, we should expect twice the number of bytes that a byte array would need to store the same text.

And interestingly, the offset/count trick seems to find enough shared substrings that the total memory required stays below double what one would expect with a byte array. So what we thought was overhead in fact helped us here.
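To put numbers on the char-size point: for plain ASCII dictionary words, a UTF-8 byte array needs exactly half the space of the char[] backing a String. A quick sketch (the class name is mine):

```java
import java.io.UnsupportedEncodingException;

public class CharVsByte {
    // bytes consumed by the char[] backing a String (2 bytes per char)
    static int charArrayBytes(String s) {
        return s.length() * 2;
    }

    // bytes needed to store the same text as UTF-8
    static int utf8Bytes(String s) throws UnsupportedEncodingException {
        return s.getBytes("UTF-8").length;
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // ASCII words: UTF-8 uses one byte per character, chars use two
        System.out.println(charArrayBytes("dictionary")); // 20
        System.out.println(utf8Bytes("dictionary"));      // 10
    }
}
```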

Next I tried to figure out the collision rate when the Strings are stored in a HashSet, which is what I was using for my dictionary. I found 938 collisions from 628033 terms [0.14935522%], which is quite OK. Just to make sure there was no unusual clustering around certain keys, I checked how many slots had more than one collision, meaning three or more keys would be stored in those slots. There were only 7 such slots, and each had just 3 keys hashed onto it.

So the maximum length of the list at each hash slot was 3. This was quite acceptable.

Here is the code used to find the collision info.

import java.io.*;
import java.util.Set;
import java.util.HashSet;
import java.util.Map;
import java.util.HashMap;

public class Collider {
    public static void main(String[] args) {
        String dictDir = "/Users/thushara/code/platform/AdXpose/src/com/adxpose/affinity/en_dict";
        File[] files = new File(dictDir).listFiles();
        Set<Integer> set = new HashSet<Integer>();
        Map<Integer, Integer> collisions = new HashMap<Integer, Integer>();
        int dups = 0;
        int tot = 0;
        for (File file : files) {
            try {
                BufferedReader br = new BufferedReader(new FileReader(file));
                String word;
                while ((word = br.readLine()) != null) {
                    int hash = word.hashCode();
                    if (set.contains(hash)) {
                        dups++;
                        collisions.put(hash, collisions.get(hash) == null ? 1 : collisions.get(hash) + 1);
                    } else {
                        set.add(hash);
                    }
                    tot++;
                }
                br.close();
            } catch (Exception e) {
                // skip files that can't be read
            }
        }
        System.out.println("found " + dups + " collisions from " + tot + " terms [" + ((float) dups) / tot * 100 + "%]");
        for (Map.Entry<Integer, Integer> entry : collisions.entrySet()) {
            if (entry.getValue() > 1) System.out.println(entry.getKey() + ":" + entry.getValue());
        }
    }
}



with this output:


found 938 collisions from 628033 terms [0.14935522%]
78204:2
83505:2
76282:2
71462:2
-1367596602:2
79072:2
94424379:2


Of course, the problem of collisions is described by the Birthday Paradox. We could also use this theory to check how close String.hashCode() comes to an optimal (uniform) hash.
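As a rough sanity check, the birthday approximation puts the expected number of colliding pairs for n keys hashed uniformly into m values at about n²/2m. A quick sketch (the class name is mine) for the run above; the observed 938 is well above the roughly 46 a uniform 32-bit hash would give, presumably because short dictionary words hash into a narrow band of small values:

```java
public class BirthdayEstimate {
    // Expected number of colliding pairs when n keys are hashed
    // uniformly into m values: C(n,2)/m, roughly n^2 / (2m)
    static double expectedCollisions(long n, long m) {
        return (double) n * (n - 1) / 2.0 / m;
    }

    public static void main(String[] args) {
        long n = 628033L;  // terms counted in the dictionary run above
        long m = 1L << 32; // size of the 32-bit hash space
        System.out.printf("expected ~%.1f collisions, observed 938%n",
                expectedCollisions(n, m));
    }
}
```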

Next I will make a simple light string class based on a byte array using the last byte as a terminator (like in the traditional C string, storing 0 as the last byte).
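A minimal sketch of what such a class could look like (the name LightString is hypothetical; note that in Java the byte array's own length already bounds the string, so an explicit 0 terminator is not strictly needed the way it is in C):

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// A compact ASCII-oriented string: one byte per character, no offset/count
// fields, and the same polynomial hash as String so lookups behave similarly.
public final class LightString {
    private static final Charset UTF8 = Charset.forName("UTF-8");
    private final byte[] value;

    public LightString(String s) {
        this.value = s.getBytes(UTF8);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof LightString && Arrays.equals(value, ((LightString) o).value);
    }

    @Override
    public int hashCode() {
        int h = 0;
        for (byte b : value) h = 31 * h + b; // same polynomial as String.hashCode()
        return h;
    }

    @Override
    public String toString() {
        return new String(value, UTF8);
    }
}
```

For pure ASCII content the byte-wise polynomial produces the same value as String.hashCode(), so a HashSet of LightString should see the same collision behavior as the measurement above.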

Thursday, November 19, 2009

Java: building an asynchronous web page fetcher


Today I leveraged the earlier epoll() based framework (FishHooks) I built to fetch web pages. FishHooks is protocol-agnostic, so most of the code in the fetcher has to do with HTTP: sending the proper request, parsing the response, and so on.

There are, as usual, two files. The driver program ASyncWgetClient.java will use FishHooks to start the epoll() loop. WgetParam.java will extend ConnectionParam and implement the state machine peculiar to the HTTP request/response cycle. Among other things, we will handle the 30X redirects here.


import java.util.List;
import java.util.ArrayList;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.nio.charset.CharacterCodingException;

/**
* @author: thushara
* @version: Nov 18, 2009
*/
public class ASyncWgetClient {

public void process() {

class DNSBlob {
String url;
String path;
String host;
int port;
URLWithRetries retryURL;
byte[] inetAddr;

public DNSBlob(String url, String path, String host, int port, URLWithRetries retryURL, byte[] inetAddr) {
this.url = url;
this.path = path;
this.host = host;
this.port = port;
this.retryURL = retryURL;
this.inetAddr = inetAddr;
}
};

//ConnectionParam.verbose = true;
//ASyncSocketHandler.verbose = true;

ASyncSocketHandler hndlr = new ASyncSocketHandler();

String[] urls = {"http://zerohedge.com", "http://moneycentral.msn.com", "http://www.nakedcapitalism.com",
"http://youtube.com", "http://democracynow.org", "http://love.com", "http://anncoulter.com",
"http://peace.org", "http://grunge.com", "http://rockandroll.com", "http://shopping.com",
"http://luck.com", "http://schools.org", "http://whipitthemovie.com",
"http://war.org", "http://benstiller.com", "http://www.oprahwinfrey.com", "http://porche.com",
"http://spanish.com", "http://healingart.com", "http://flash.com"};


List<DNSBlob> dnsList = new ArrayList<DNSBlob>();

for (String url : urls) {
URLWithRetries retryURL = new URLWithRetries(url);
URL urlObj = null;
try {
urlObj = new URL(url);
} catch (MalformedURLException e) {
WgetParam.writeErrorFile(WgetParam.getErrorFileName(url), url+" is not correct, skipping...");
continue;
}
if (urlObj.getProtocol().equals("https")) {
WgetParam.writeErrorFile(WgetParam.getErrorFileName(url), url+" not handling https, skipping...");
continue;
}
String host = urlObj.getHost();
String path = urlObj.getPath();
if (urlObj.getQuery() != null) path += ("?"+urlObj.getQuery());
if (path.length() == 0) path ="/";
int port = urlObj.getDefaultPort();

DNSResolver dnsRes = new DNSResolver(host);
Thread t = new Thread(dnsRes);
t.start();
try {
t.join(1000);
} catch (InterruptedException e) {}
byte[] inetAddr = dnsRes.get();

if (inetAddr == null) {
WgetParam.writeErrorFile(WgetParam.getErrorFileName(url), host + " has no DNS entry");
continue;
}

dnsList.add(new DNSBlob(url, path, host, port, retryURL, inetAddr));
}

ConnectionParam[] params = new WgetParam[urls.length];
for (int i=0; i<dnsList.size(); i++) {
DNSBlob dnsBlob = dnsList.get(i);
String request = "GET "+dnsBlob.path+" HTTP/1.0\r\nHost: "+dnsBlob.host+"\r\nAccept: */*\r\n\r\n";
try {
params[i] = new WgetParam(dnsBlob.inetAddr, dnsBlob.port, request, dnsBlob.retryURL, hndlr);
} catch (CharacterCodingException e) {
WgetParam.writeErrorFile(WgetParam.getErrorFileName(dnsBlob.url), e.getMessage());
//just leave null
}
}

try {
hndlr.connect_all(params);
} catch (Exception e) {
System.err.println("unexpected :" + e.getMessage());
e.printStackTrace();
}
}

public class DNSResolver implements Runnable {
private String domain;
private byte[] inetAddr;

public DNSResolver(String domain) {
this.domain = domain;
}

public void run() {
try {
byte[] addr = InetAddress.getByName(domain).getAddress();
set(addr);
} catch (UnknownHostException e) {
// leave inetAddr null to signal failure
}
}

public synchronized void set(byte[] inetAddr) {
this.inetAddr = inetAddr;
}
public synchronized byte[] get() {
return inetAddr;
}
};

public static void main(String[] args) {
ASyncSocketHandler.verbose = true;
ASyncWgetClient aWget = new ASyncWgetClient();
aWget.process();
}
}




import java.io.*;
import java.nio.charset.CharacterCodingException;
import java.nio.channels.FileChannel;
import java.nio.MappedByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.InetAddress;
import java.net.UnknownHostException;

/**
* @author: thushara
* @version: Nov 18, 2009
*/
public class WgetParam extends ConnectionParam {

public static final String OUTPUT_DIR = "/Users/thushara/affinity3/";

private URLWithRetries retryURL;
private ASyncSocketHandler hndlr;

public WgetParam(byte[] ip, int port, String request, URLWithRetries retryURL, ASyncSocketHandler hndlr) throws CharacterCodingException {
super(ip, port, request);
this.retryURL = retryURL;
this.hndlr = hndlr;
}

// returns the next op we're interested in: OP_READ / OP_WRITE / OP_CLOSE
// after connecting, we can proceed to send the GET request
public int connected_ok() {
return ASyncSocketHandler.OP_WRITE;
}

public int connect_failed() {
return ASyncSocketHandler.OP_CLOSE;
}

// returns the next op we're interested in: OP_READ / OP_WRITE / OP_CLOSE
// param: eod = true iff the complete request was written to the socket
// if the request is not fully written, return the signal to write again
// else we can go on to read, so return the read signal.
public int write_ok(boolean eod) {
return eod ? ASyncSocketHandler.OP_READ : ASyncSocketHandler.OP_WRITE;
}

// param: eod = true iff all data was read from the socket; it has no more data
// if all data is read, this socket won't return any more data so
// we should close the socket, signal such
// else we should continue reading, signal such.
public int read_ok(boolean eod) {
if (eod) {
byte[] resp = getResponse();

String fName = getFileName(retryURL.getOrigURL());
String path = OUTPUT_DIR+fName;
// the status code occupies bytes 9-11 of the status line, e.g. "HTTP/1.0 200 OK"
int code = ((int)(resp[9]-48))*100 + (resp[10]-48)*10 + resp[11]-48;
switch(code) {
case 200:
case 403:
case 404:
case 400:
try {
writeFileWithoutHeaders(resp, path);
} catch (IOException e) {
writeErrorFile(getErrorFileName(retryURL.getOrigURL()), e.getMessage());
}
break;
case 301:
case 302:
case 303:
case 305: // Location: contains proxy, check if it is full url or just the proxy server name
case 307:
if (retryURL.hitLimit()) {
writeErrorFile(getErrorFileName(retryURL.getOrigURL()), "URL attempts too many redirections - giving up");
break;
}
String redirURL = null;
try {
redirURL = getLocation(resp, retryURL.getDomain());
} catch (MalformedURLException e) {
writeErrorFile(getErrorFileName(retryURL.getOrigURL()), e.getMessage());
}
addURL(redirURL);
break;
//case 400:
// writeErrorFile(getErrorFileName(retryURL.getOrigURL()), "bad request: 400");
// break;
case 408: // request timed out, try again
addURL(retryURL.url);
break;
default:
writeErrorFile(getErrorFileName(retryURL.getOrigURL()), "unexpected response code: " + code);
break;
}

}
return eod ? ASyncSocketHandler.OP_CLOSE : ASyncSocketHandler.OP_READ;
}

private void addURL(String redirURL) {
retryURL.nextURL(redirURL);
URL url = null;
try {
url = new URL(retryURL.url);
} catch (MalformedURLException e) {
writeErrorFile(getErrorFileName(retryURL.getOrigURL()), e.getMessage());
return;
}
if (url.getProtocol().equals("https")) {
writeErrorFile(getErrorFileName(retryURL.getOrigURL()), url+" not handling https, skipping...");
return;
}
String host = url.getHost();
String path = url.getPath();
if (url.getQuery() != null) path += ("?"+url.getQuery());
if (path.length() == 0) path = "/";
int port = url.getDefaultPort();
InetAddress inetAddr = null;
try {
inetAddr = InetAddress.getByName(host);
} catch (UnknownHostException e) {
writeErrorFile(getErrorFileName(retryURL.getOrigURL()), e.getMessage());
return;
}
String request = "GET "+path+" HTTP/1.0\r\nHost: "+host+"\r\nAccept: */*\r\n\r\n";
ConnectionParam connParm = null;
try {
connParm = new WgetParam(inetAddr.getAddress(), port, request, retryURL, hndlr);
} catch (CharacterCodingException e) {
writeErrorFile(getErrorFileName(retryURL.getOrigURL()), e.getMessage());
return;
}
try {
hndlr.connect(connParm);
} catch (ASyncSocketHandler.SelectorNotOpenException e) {}
catch (IOException e) {
writeErrorFile(getErrorFileName(retryURL.getOrigURL()), e.getMessage());
}
}

private static String getFileName(String url) {
try {
MessageDigest md = MessageDigest.getInstance("MD5");
byte[] urlBytes = url.getBytes("UTF-8");
md.update(urlBytes, 0, urlBytes.length); // byte length, not char count, for non-ASCII URLs
byte[] hs = md.digest();
return String.format("%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x.url", hs[0],hs[1],hs[2],hs[3],hs[4],hs[5],hs[6],hs[7],hs[8],hs[9],hs[10],hs[11],hs[12],hs[13],hs[14],hs[15]);
} catch (NoSuchAlgorithmException e) {
System.err.println(e);
e.printStackTrace();
System.exit(-1);
} catch (UnsupportedEncodingException e) {
System.err.println(e);
e.printStackTrace();
System.exit(-1);
}
return null;
}

private static String getLocation(byte[] httpHeader, String domain) {
String location = null;
for (int i=0; i<Integer.MAX_VALUE; i++) {
StringBuffer line = new StringBuffer();
for (char c = (char)httpHeader[i]; c!= '\n'; i++,c=(char)httpHeader[i]) {
line.append(c);
}
if (line.length()==0 || line.length()==1) break; // no Location field
if (line.length()>"Location".length() && line.substring(0,"Location".length()).toLowerCase().equals("location")) {
location = line.substring("Location".length()+2, line.length()-1);
break;
}
}
if (location == null) return domain; // no Location header; fall back to the domain
return location.toLowerCase().startsWith("http://") || location.toLowerCase().startsWith("https://") ? location : domain + (location.substring(0,1).equals("/") ? location : "/"+location);
}

private void writeFileWithoutHeaders(byte[] bytes, String path) throws IOException {
File temp = File.createTempFile("page",null);
FileOutputStream f = new FileOutputStream(temp);
f.write(bytes);
f.close();

FileChannel chan = new FileInputStream(temp).getChannel();
MappedByteBuffer buf = chan.map(FileChannel.MapMode.READ_ONLY, 0, chan.size());
int htmlOffset = 0;
for (int i=0; i<Integer.MAX_VALUE; i++) {
int j=0;
for (char c = (char)buf.get(i); c!= '\n'; i++,j++,c=(char)buf.get(i));
if (j==0 || (j==1 && buf.get(i-1) == '\r')) {
htmlOffset = i+1;
break;
}
}
FileChannel chan2 = new FileOutputStream(path).getChannel();
chan.transferTo(htmlOffset, chan.size()-htmlOffset, chan2);
chan.close();
chan2.close();
}

//two convenience methods to note errors, used from both WgetParam class and
//ASyncWgetClient class.
public static String getErrorFileName(String url) {
String fName = getFileName(url);
return fName.substring(0, fName.length()-3)+"bad";
}

public static void writeErrorFile(String fName, String error) {
try {
FileWriter fstream = new FileWriter(OUTPUT_DIR+fName);
BufferedWriter out = new BufferedWriter(fstream);
out.write(error);
out.close();
} catch (Exception e) {//Catch exception if any
System.err.println("Error writing error file: " + e.getMessage());
}
}


}



You can see there is a lot more code here than in the simpler client talking to time-of-day servers. Most of it deals with the HTTP protocol, but you still don't touch sockets, channels or byte buffers.

Tuesday, November 10, 2009

Java : A callback API for epoll(), building on top of Java NIO


I just finished the first draft of the Java callback API that allows you to do asynchronous socket communication using the scalable epoll() mechanism of Linux. This is built on top of Java NIO, so it will take advantage of the best underlying network mechanism the OS has to offer. On Linux 2.6+ this will likely be epoll() (FreeBSD offers the analogous kqueue).

Use the API if you need to get data from a large number of hosts quickly without the overhead of a thread per host/connection. A thread per connection has serious issues as the number of simultaneous connections increases, especially if memory is scarce.

The Java NIO API can certainly be used for this purpose, but it is somewhat difficult to use and there are various caveats to guard against. You need to understand the Selector, SelectionKey and SocketChannel classes and their interplay. You need to know the logic for cancelling SelectionKeys and setting interest ops appropriately. And you need to understand the cryptic ByteBuffer class and its variants, and possibly how ByteArrayOutputStream works as well. You need to figure out how to use these classes to store the data separately for each connection, keeping in mind that the data will likely arrive interleaved. You also should not call select() once all sockets have been completely read, or it will block; this means you have to keep track of pending hosts.

The callback framework I developed aims to relieve you of all these details. It lets you focus on the flow of calls between the client and each host along with the application logic once you get data from the hosts.

Usage:
Extend the abstract class ConnectionParam. (Let's name it ConnectionParamImpl)
Create a ConnectionParamImpl object for each connection and stuff these into an array.
Call ASyncSocketHandler.connect_all(array_of_connections) - the argument is the above array.

Since we extended the ConnectionParam class, we need to implement some methods. These are simply the callback methods that tell us that data has either been written to the network, or read from the network. This is where we get a chance to examine the data and do what we need to do from the application's point of view.

    // returns the next op we're interested in: OP_READ / OP_WRITE / OP_CLOSE
public abstract int connected_ok();
public abstract int connect_failed();
public abstract int write_ok(boolean eod);
public abstract int read_ok(boolean eod);



Each of the overridden methods needs to return an int signaling the next network operation you are interested in. The available options are:

ASyncSocketHandler.OP_READ
ASyncSocketHandler.OP_WRITE
ASyncSocketHandler.OP_CLOSE



This is a simple usage scenario that gets output from a bunch of known time of day servers.

There are two objects you need to implement:

ConnectionParamImpl class and ASyncEchoClient class.
ConnectionParamImpl represents a connection to a single host.
ASyncEchoClient is the application entry point.

import java.io.UnsupportedEncodingException;

/**
* @author: thushara
* @version: Nov 8, 2009
*/
public class ConnectionParamImpl extends ConnectionParam {

    public ConnectionParamImpl(byte[] ip, int port) {
        super(ip, port);
    }

    // returns the next op we're interested in: OP_READ / OP_WRITE / OP_CLOSE
    // after connecting, we can proceed to read
    public int connected_ok() {
        return ASyncSocketHandler.OP_READ;
    }

    public int connect_failed() {
        return ASyncSocketHandler.OP_CLOSE;
    }

    // returns the next op we're interested in: OP_READ / OP_WRITE / OP_CLOSE
    // param: eod = true iff the complete request was written to the socket
    // if the request is not fully written, return the signal to write again
    // else we can go on to read, so return the read signal.
    public int write_ok(boolean eod) {
        return eod ? ASyncSocketHandler.OP_READ : ASyncSocketHandler.OP_WRITE;
    }

    // param: eod = true iff all data was read from the socket; it has no more data
    // if all data is read, this socket won't return any more data,
    // so we should close the socket; signal such.
    // else we should continue reading; signal such.
    public int read_ok(boolean eod) {
        if (eod) {
            byte[] resp = getResponse();
            try {
                System.out.println(new String(resp, "UTF-8"));
            } catch (UnsupportedEncodingException e) {
                System.out.println("bad chars - non utf-8");
            }
        }
        return eod ? ASyncSocketHandler.OP_CLOSE : ASyncSocketHandler.OP_READ;
    }

}



/**
* @author: thushara
* @version: Nov 8, 2009
*/
public class ASyncEchoClient {
    public static void main(String[] args) {

        ConnectionParam.verbose = true;
        ASyncSocketHandler.verbose = true;

        ASyncSocketHandler hndlr = new ASyncSocketHandler();
        int port = 13;

        String[] ips = {"64.90.182.55", "129.6.15.28", "129.6.15.29", "206.246.118.250", "64.236.96.53", "68.216.79.113", "208.66.175.36",
                "173.14.47.149", "64.113.32.5", "132.163.4.101", "132.163.4.102", "132.163.4.103", "192.43.244.18", "128.138.140.44",
                "128.138.188.172", "198.60.73.8", "131.107.13.100", "207.200.81.113", "69.25.96.13", "64.125.78.85", "64.147.116.229"};
        ConnectionParam[] params = new ConnectionParamImpl[ips.length];

        for (int j = 0; j < ips.length; j++) {
            String ip = ips[j];
            String[] sb = ip.split("\\.");
            byte[] bytes = new byte[sb.length];
            for (int i = 0; i < sb.length; i++) {
                bytes[i] = (byte) Integer.parseInt(sb[i]);
            }
            params[j] = new ConnectionParamImpl(bytes, port);
        }

        try {
            hndlr.connect_all(params);
        } catch (Exception e) {
            System.err.println("unexpected :" + e.getMessage());
            e.printStackTrace();
        }
    }
}


Notice that you don't see a single reference to a socket or a byte buffer here. All the communication details are wrapped in ASyncSocketHandler.java and ConnectionParam.java.

This is the output you should see, basically the time of day as reported by each host:


55145 09-11-10 23:13:25 00 0 0 434.7 UTC(NIST) *
55145 09-11-10 23:13:25 00 0 0 401.7 UTC(NIST) *
55145 09-11-10 23:13:25 00 0 0 398.1 UTC(NIST) *
55145 09-11-10 23:13:25 00 0 0 402.8 UTC(NIST) *
55145 09-11-10 23:13:25 00 0 0 383.4 UTC(NIST) *
55145 09-11-10 23:13:25 00 0 0 386.1 UTC(NIST) *
55145 09-11-10 23:13:25 00 0 0 348.4 UTC(NIST) *
55145 09-11-10 23:13:25 00 0 0 347.4 UTC(NIST) *
55145 09-11-10 23:13:25 00 0 0 348.2 UTC(NIST) *
55145 09-11-10 23:13:25 00 0 0 347.9 UTC(NIST) *
55145 09-11-10 23:13:25 00 0 0 339.2 UTC(NIST) *
55145 09-11-10 23:13:25 00 0 0 316.3 UTC(NIST) *
55145 09-11-10 23:13:25 00 0 0 306.9 UTC(NIST) *
55145 09-11-10 23:13:25 00 0 0 310.7 UTC(NIST) *
55145 09-11-10 23:13:25 00 0 0 313.4 UTC(NIST) *
55145 09-11-10 23:13:25 00 0 0 298.7 UTC(NIST) *
55145 09-11-10 23:13:25 00 0 0 295.4 UTC(NIST) *
55145 09-11-10 23:13:25 00 0 0 274.6 UTC(NIST) *
55145 09-11-10 23:13:25 00 0 0 268.1 UTC(NIST) *
55145 09-11-10 23:13:25 00 0 0 268.0 UTC(NIST) *
55145 09-11-10 23:13:25 00 0 0 275.9 UTC(NIST) *

This is the framework implementation. There are two files.

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.SocketChannel;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CharacterCodingException;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.net.InetSocketAddress;

/**
* @author: thushara
* @version: Nov 6, 2009
*/
public abstract class ConnectionParam {
private byte[] ip;
private int port;
private ByteBuffer request;
private ByteBuffer response;
private int respBufSize = 1024;
private ByteArrayOutputStream outStrm;
private SocketChannel client;

public static boolean verbose = false;

public ConnectionParam(byte[] ip, int port) {
this.ip = ip;
this.port = port;
response = ByteBuffer.allocate(respBufSize);
outStrm = new ByteArrayOutputStream();
}

public ConnectionParam(byte[] ip, int port, String request) throws CharacterCodingException {
this(ip,port);
setRequest(request);
}

public ConnectionParam(byte[] ip, int port, String request, int respBufSize) throws CharacterCodingException {
this(ip,port);
setRequest(request);
this.respBufSize = respBufSize;
}

public ConnectionParam(byte[] ip, int port, int respBufSize) {
this(ip,port);
this.respBufSize = respBufSize;
}

public void setRequest(String request) throws CharacterCodingException {
Charset charset = Charset.forName( "UTF-8" );
CharsetEncoder encoder = charset.newEncoder();
this.request = encoder.encode(CharBuffer.wrap(request));
}

public byte[] getResponse() {
int respSize = outStrm.size() + response.position();
byte[] out = new byte[respSize];
System.arraycopy(outStrm.toByteArray(),0,out,0,outStrm.size());

int prevPos = response.position();
response.flip();
response.get(out,outStrm.size(),response.limit());
response.position(prevPos);
response.limit(response.capacity());

return out;
}

private InetAddress getAddr() {
try {
return InetAddress.getByAddress(ip);
} catch (UnknownHostException e) {
return null;
}
}

private void store() throws IOException {
response.flip();
byte[] bytes = new byte[response.limit()];
response.get(bytes);
outStrm.write(bytes, 0, bytes.length);
response.clear(); //initialize for reading from socket again
}


public SocketChannel getSocketChannel() throws IOException {
if (client != null) return client;

client = SocketChannel.open();
client.configureBlocking(false);
client.connect(new InetSocketAddress(getAddr(),port));
return client;
}

public String toString() {
InetAddress addr = null;
try {
addr = InetAddress.getByAddress(ip);
} catch (UnknownHostException e) {
return "";
}
return addr.toString() + ":" + port;
//return String.format("%d.%d.%d.%d:%d", ip[0], ip[1], ip[2], ip[3], port);
}

//returns true if all data is written to the socket
public boolean write() throws IOException {
int numBytes = getSocketChannel().write(request);
if (numBytes == 0 && verbose) {
System.err.println("nothing to write to server");
}
return !request.hasRemaining();
}

//returns true if all data is read, our end of the socket should be closed
public boolean read() throws IOException {
int numBytes = 0;
try {
numBytes = getSocketChannel().read(response);
} catch (IOException e) {
//System.err.println("end of data");
store();
return true;
}
if (numBytes == 0 && verbose) {
System.err.println("nothing to read from server");
}
if (numBytes == -1) {
//System.err.println("end of data");
store();
return true;
}

while (response.remaining() == 0 && numBytes > 0) {
store();
//read again from socket
numBytes = getSocketChannel().read(response);
}

return false;

}

// returns the next op we're interested in: OP_READ / OP_WRITE / OP_CLOSE
public abstract int connected_ok();
public abstract int connect_failed();
public abstract int write_ok(boolean eod);
public abstract int read_ok(boolean eod);
}



import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.nio.channels.SelectionKey;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.Iterator;
import java.net.ConnectException;

/**
* @author: thushara
* @version: Nov 6, 2009
*/
public class ASyncSocketHandler {
private Selector selector;
private HashSet<ConnectionParam> clients = new HashSet<ConnectionParam>();
private int timeout;

public static final int OP_READ = SelectionKey.OP_READ;
public static final int OP_WRITE = SelectionKey.OP_WRITE;
public static final int OP_CLOSE = 128; //leave enough room for possible expansion of SelectionKey enum

public static boolean verbose = false;

public void connect(ConnectionParam connParm) throws IOException {
SocketChannel client = connParm.getSocketChannel();
SelectionKey clientKey = client.register(selector, SelectionKey.OP_CONNECT);
clientKey.attach(connParm);
clients.add(connParm);
}

public void connect_all(ConnectionParam[] connParms) throws IOException {
selector = Selector.open();
for (ConnectionParam connParm : connParms) {
connect(connParm);
}

while (clients.size() > 0) {
if (verbose) {
System.out.println("waiting for sockets to be ready. currently waiting for:");
dumpClients();
}
int numReady = 0;

try {
numReady = selector.select(timeout);
} catch (IOException e) {
if (verbose) {
System.err.println("select failed, unrecoverable!" + e.getMessage());
}
break;
}
if (numReady <= 0) {
if (verbose) {
System.err.println("No sockets ready for " + timeout/1000 + " secs, consider dead remotes, bailing...");
}
break;
}

Set<SelectionKey> keys = selector.selectedKeys(); // can't recover from exception thrown here
Iterator i = keys.iterator();
ConnectionParam connParm = null;

SelectionKey key = null;
while (i.hasNext()) {
try {
key = (SelectionKey)i.next();
i.remove();
SocketChannel channel = (SocketChannel)key.channel();
connParm = (ConnectionParam)key.attachment();
if (key.isConnectable()) {
if (channel.isConnectionPending()) {
boolean ok = false;
try {
ok = channel.finishConnect();
} catch (ConnectException e) {
if (verbose) {
System.err.println("error processing " + connParm.toString());
e.printStackTrace();
}
}

if (!ok) {
if (verbose) {
System.err.println("not connected, giving up");
}
//cleanup
if (key != null) {
try {
key.channel().close();
} catch (IOException x) {}
key.cancel();
clients.remove(key.attachment());
continue;
}
connParm.connect_failed();
} else {
int op = connParm.connected_ok();
key.interestOps(op);
}
}
}
if (key.isWritable()) {
boolean completedWrite = connParm.write();
int op = connParm.write_ok(completedWrite);
if (op == ASyncSocketHandler.OP_CLOSE) {
channel.close();
key.cancel();
clients.remove(connParm);
} else {
key.interestOps(op);
}
}
if (key.isReadable()) {
boolean endOfData = connParm.read();
int op = connParm.read_ok(endOfData);
if (op == ASyncSocketHandler.OP_CLOSE || endOfData) {
channel.close();
key.cancel();
clients.remove(connParm);
} else {
key.interestOps(op);
}
}

} catch (IOException e) {
// drop this key and continue with the remaining ones
}
}
}
}

private void dumpClients() {
for (ConnectionParam connParm : clients) {
System.out.println(connParm.toString());
}
}

public void setTimeout(int timeout) {
this.timeout = timeout;
}

}



I'm making the project available here:

Monday, November 09, 2009

Java : fetching web pages using NIO - timeouts


I'm developing a call back framework on top of Java NIO, that will allow for easier implementations of asynchronous page downloads.

As you add sockets to a single selector, you run into request time-out issues. This happens because you normally register all the sockets with the selector and then wait for the ones that are ready. Some connections, after waiting a few seconds for a request, will time out at the remote end.

The HTTP response code for this is 408. It is easy to handle: remove the SelectionKey from the selector and add the socket back. Since the application will now be dealing with fewer sockets, the retried socket will most probably not time out the next time around.

The latency is mostly in domain name resolution. I will next try doing name resolution totally outside socket registration, so I don't eat into the connection time with the slow DNS lookup calls. This should reduce the number of 408 responses, and I should be able to add more sockets to the selector.

Wednesday, November 04, 2009

Resolving domain names quickly with Java

Domain name resolution can take upward of 5 seconds depending on your ISP. Generally, when the resolution works, the DNS response comes back quickly, within a second. Any DNS entry not discovered within a second is likely to be dead.

Java provides InetAddress.getByName(domain) that does the DNS resolution for the given domain. There is no timeout you can specify to this function.

In the interest of providing a responsive user interface, as well as improving performance of background tasks, you can hack a timeout together using a Thread. Feel free to reuse this code:

public class DNSResolver implements Runnable {
    private String domain;
    private InetAddress inetAddr;

    public DNSResolver(String domain) {
        this.domain = domain;
    }

    public void run() {
        try {
            InetAddress addr = InetAddress.getByName(domain);
            set(addr);
        } catch (UnknownHostException e) {
            // leave inetAddr null to signal failure
        }
    }

    public synchronized void set(InetAddress inetAddr) {
        this.inetAddr = inetAddr;
    }

    public synchronized InetAddress get() {
        return inetAddr;
    }
}


Use this class as follows:

DNSResolver dnsRes = new DNSResolver(host);
Thread t = new Thread(dnsRes);
t.start();
try {
    t.join(1000); // wait at most one second for the lookup
} catch (InterruptedException e) {}
InetAddress inetAddr = dnsRes.get();



Basically, we start a thread that executes the blocking call to resolve the domain name, and we wait up to 1 second (1000 ms) for the thread to exit. If the DNS resolution happens under a second, the thread exits and join() returns early. If not, control returns to the main thread after the 1000 ms timeout; the resolver thread keeps running and eventually exits, but the main thread has stopped caring at that point and treats the domain name as unresolved.

Caveat:
Note that if you resolve a large number of domains that are unreachable, you will soon have lots of threads taking up memory, and you could run out of JVM heap. So keep that in mind as you use this class.
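One way to cap that thread growth, sketched here with java.util.concurrent (the class name, pool size, and daemon-thread choice are mine, not from the code above), is to push the lookups through a fixed thread pool and wait on a Future instead of join():

```java
import java.net.InetAddress;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PooledDNSResolver {
    // a fixed pool caps how many resolver threads can ever be in flight;
    // daemon threads so stuck lookups cannot keep the JVM alive
    private static final ExecutorService pool =
        Executors.newFixedThreadPool(10, new ThreadFactory() {
            public Thread newThread(Runnable r) {
                Thread t = new Thread(r);
                t.setDaemon(true);
                return t;
            }
        });

    public static InetAddress resolve(final String domain, long timeoutMs) {
        Future<InetAddress> f = pool.submit(new Callable<InetAddress>() {
            public InetAddress call() throws Exception {
                return InetAddress.getByName(domain);
            }
        });
        try {
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            f.cancel(true);   // stop waiting; treat the name as unresolved
            return null;
        } catch (InterruptedException e) {
            return null;
        } catch (ExecutionException e) {
            return null;      // UnknownHostException lands here
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("localhost", 2000) != null);
    }
}
```

With this shape, a slow lookup only ties up one of the ten pool threads instead of spawning a fresh thread per domain.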

Possible usage scenario:
If you are doing socket communication using nio, and you need to resolve DNS for each socket connection attached to the selector, it is important to time out DNS quickly, as otherwise some sockets might get closed at the remote end. This is especially true for HTTP connections: if you connect to an HTTP server and do not send a request for a few seconds, the server will close its end of the socket.

Tuesday, October 27, 2009

Mac/SnowLeopard and Wireshark


Wireshark has to be tweaked a bit due to permission issues on the Berkeley Packet Filters. This has the info.

Since these settings are lost on reboot, a start up item can be created as shown here.

Also, if a module not found dialog box appears, follow these steps (taken from the long thread here):

You need to set the path to the folder that Wireshark looks in for MIBs &c, because the default in version 1.0.6 for Mac OS X is incorrect. In Wireshark, do the following:

• From the Edit menu, select Preferences...
• In the left pane of the Wireshark: Preferences window, click on Name Resolution
• For SMI (MIB and PIB) paths, click the Edit button
• In the SMI Paths window, click the New button
• In the SMI Paths: New window, in the name text box, type /usr/share/snmp/mibs/ and click OK
• Click OK
• Click OK
• From the File menu, select Quit

I found that it was also necessary to quit and restart X11 for the changed Wireshark preferences to take effect:

• From the X11 menu, select Quit X11

This gets rid of the MIB-loading errors for me. YMMV. By the way, Wireshark is supposed to still be usable even if you can't get rid of these errors.

Friday, October 23, 2009

many options of lsof

this is a handy reference to the many ways of using lsof.

Thursday, October 22, 2009

Mac and launchd - removing programs that refuse to be killed


Mac uses the launchd daemon to start processes at boot time, much like the init.d scripts in Linux. Except launchd has a configuration file (with an extension of .plist) for each such process, and these files live in a few different places. So if you want to stop an auto launch, you will be hunting around, as I just did.

The official doc states all the locations a plist file can live in.

In my case, I was using an educational prototype from UW called vanish. This spawns a couple of processes using launchd, so if I kill the processes, they come right back up. I found the plist files under ~/LaunchAgents, and removing them did the trick.

Tuesday, October 20, 2009

Java ByteBuffer : how does this work?


The first time you encounter the ByteBuffer, you may run into some surprises. The function that flips most folks is in fact ByteBuffer.flip(). To understand flip(), and to not be flipped by it and other such idioms, we will look at what this class is and how it should be used.

Basically, a ByteBuffer allows us to read data repeatedly from some input, like a non-blocking socket. ByteBuffer keeps track of the position of the last byte written, so you don't need to. You can keep writing to the same ByteBuffer, rest assured that previous data will not be overwritten.

This is pretty handy in asynchronous I/O (using Java NIO package) as data from asynchronous sockets don't always arrive all at once. We need to map buffers to sockets and keep reading until there is no more data from the remote end.

So what about this flip()? Well, the way the ByteBuffer class was designed, data is read into the buffer starting at position and up to limit, and data is written out of it starting at position and up to limit as well. So, if you followed that, after reading some data from a socket the position has advanced, and reading the buffer's contents now will not get any data, as position sits at the end of the useful input. So flip() basically sets the limit to the current position (the end of the useful input, to be exact) and the position to 0 (the start).

So think about read and write operations on the ByteBuffer manipulating data within position and limit and you will see more clearly the need to flip once in a while.
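Here is a small, self-contained demonstration of that position/limit bookkeeping (the class and method names are mine):

```java
import java.nio.ByteBuffer;

public class FlipDemo {
    // write two bytes, flip, and read them back
    static String readBack() {
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.put((byte) 'h');
        buf.put((byte) 'i');
        // after the writes: position=2, limit=16; a get() here would read garbage
        buf.flip();
        // after flip(): position=0, limit=2 - reads cover exactly what was written
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return new String(out);
    }

    public static void main(String[] args) {
        System.out.println(readBack());   // hi
    }
}
```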

Happy flipping!

Reading UTF-8 data from asynchronous sockets to the file system


Using asynchronous sockets, data is generally read into ByteBuffer objects. The general pattern is to read multiple times until there is no more data, and each time the ByteBuffer fills up, to transfer its contents to a larger buffer, like a ByteArrayOutputStream.

Now if you want to manipulate data collected (which is now in the ByteArrayOutputStream) as a String, it has to be decoded. This can be done using the CharsetDecoder object like this:

ByteArrayOutputStream outStrm;

// read data into outStrm using nio

CharBuffer charBuffer = CharBuffer.allocate(outStrm.size());
byte[] ba = outStrm.toByteArray();
ByteBuffer byteBuffer = ByteBuffer.wrap(ba);
Charset charset = Charset.forName("UTF-8");
CharsetDecoder decoder = charset.newDecoder();
CoderResult res = decoder.decode(byteBuffer, charBuffer, true);
res = decoder.flush(charBuffer);
String out = charBuffer.flip().toString();


However, all this decoding does is translate UTF-8 byte sequences into their respective code points. As a result, we can't save this data to a file (OutputStream) correctly.

If you were to print the out string to the display, it is not guaranteed to print valid UTF-8. Of course it will work for the single byte characters, but not necessarily for the multi-byte characters. Ex: the byte sequence 0xC2 0xA0 is the UTF-8 encoding of a non-breaking space, code point 0xA0. The above decoding will decode this to the code point 0xA0, but if you now write that out byte-for-byte, it will not be stored as UTF-8, as the decoding stripped the UTF-8 encoding and replaced it with code points.

So the correct approach is to simply write the byte buffer to an output stream like this:

outStrm.writeTo(System.out);


This will present UTF-8 characters to the output stream and thus the file will be saved as correct UTF-8 data.
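A short round trip makes the point concrete (class and method names are mine): the two UTF-8 bytes collapse to one code point when decoded, and only re-encoding, or writing the original bytes as writeTo() does, gives back valid UTF-8:

```java
import java.io.ByteArrayOutputStream;

public class Utf8RoundTrip {
    // returns {decoded length, decoded code point, re-encoded byte count}
    static int[] stats() throws Exception {
        // the UTF-8 encoding of a non-breaking space (code point U+00A0)
        byte[] raw = { (byte) 0xC2, (byte) 0xA0 };
        ByteArrayOutputStream outStrm = new ByteArrayOutputStream();
        outStrm.write(raw, 0, raw.length);

        // decoding collapses the two UTF-8 bytes into a single code point
        String decoded = new String(outStrm.toByteArray(), "UTF-8");

        // encoding the string back produces the original two bytes - the
        // same bytes writeTo() hands to the stream without any lossy step
        byte[] reEncoded = decoded.getBytes("UTF-8");
        return new int[] { decoded.length(), decoded.charAt(0), reEncoded.length };
    }

    public static void main(String[] args) throws Exception {
        int[] s = stats();
        System.out.println(s[0] + " " + s[1] + " " + s[2]);   // 1 160 2
    }
}
```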

Monday, September 14, 2009

A lenient URL decoder for Java


The URLDecoder class in the JDK insists on strict parsing of escape sequences in an encoded URL string. Sometimes an application might want to decode correctly escaped sequences and leave incorrect sequences intact. In fact the Sun documentation states that this aspect of decode handling is implementation dependent. Sun's implementation, however, is strict: it throws an exception when it encounters an improper escape sequence rather than treating it as regular text.
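The strict behavior is easy to demonstrate (a small sketch; the sample inputs and class name are mine):

```java
import java.net.URLDecoder;

public class StrictDecodeDemo {
    static String good() throws Exception {
        // a well-formed escape decodes fine
        return URLDecoder.decode("a%20b", "UTF-8");   // "a b"
    }

    static boolean strictThrows() throws Exception {
        try {
            // "%zz" is not a valid escape: Sun's decoder throws instead of
            // passing the text through untouched
            URLDecoder.decode("100%zz", "UTF-8");
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(good());           // a b
        System.out.println(strictThrows());   // true
    }
}
```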

I couldn't find a lenient implementation, so I hand-crafted this from the original source for the URLDecoder class found here.

Following is the lenient decode.

    public static String decodeLenient(String s, String enc)
            throws UnsupportedEncodingException {

        boolean needToChange = false;
        StringBuffer sb = new StringBuffer();
        int numChars = s.length();
        int i = 0;

        if (enc.length() == 0) {
            throw new UnsupportedEncodingException("URLDecoder: empty string enc parameter");
        }

        while (i < numChars) {
            char c = s.charAt(i);
            switch (c) {
            case '+':
                sb.append(' ');
                i++;
                needToChange = true;
                break;
            case '%':
                /*
                 * Starting with this instance of %, process all
                 * consecutive substrings of the form %xy. Each
                 * substring %xy will yield a byte. Convert all
                 * consecutive bytes obtained this way to whatever
                 * character(s) they represent in the provided
                 * encoding.
                 */

                // (numChars-i)/3 is an upper bound for the number
                // of remaining bytes
                byte[] bytes = new byte[(numChars - i) / 3];
                int pos = 0;

                while (((i + 2) < numChars) && (c == '%')) {
                    String hex = s.substring(i + 1, i + 3);
                    try {
                        bytes[pos] = (byte) Integer.parseInt(hex, 16);
                        pos++;
                    } catch (NumberFormatException e) {
                        // not a valid escape: flush the bytes collected so
                        // far and keep the "%xy" sequence as literal text
                        sb.append(new String(bytes, 0, pos, enc));
                        sb.append("%");
                        sb.append(hex);
                        pos = 0;
                    }

                    i += 3;
                    if (i < numChars)
                        c = s.charAt(i);
                }

                sb.append(new String(bytes, 0, pos, enc));

                // A trailing, incomplete byte encoding such as
                // "%x" will be treated as unencoded text
                if ((i < numChars) && (c == '%')) {
                    for (; i < numChars; i++) {
                        sb.append(s.charAt(i));
                    }
                }

                needToChange = true;
                break;
            default:
                sb.append(c);
                i++;
                break;
            }
        }

        return (needToChange ? sb.toString() : s);
    }

Remove intrusive bing popups from content sites


Today, as I was perusing the moneycentral.com site in my quest to acquire the financial knowledge of the Wall Street robber barons, I stumbled on the latest feature from Bing: hover popups that show a search listing.

They are rather annoying for two reasons:
1) They appear over the text you are reading
2) It is very difficult to navigate around them, as all it takes is for the mouse cursor to touch the link area for the helpful popup to appear

So I modified my previous GreaseMonkey script to handle this as well. Fortunately it is easy, as this popup is fired off an anchor tag with an attribute of "itxtdid". All we need to do is remove the attribute, and the popup is disabled.

Below is the new script. Enjoy.

// ==UserScript==
// @name test
// @namespace http://userscripts.org/thushara/
// @include *
// ==/UserScript==
var allLinks, thisLink;
allLinks = document.evaluate(
    '//a[@href]',
    document,
    null,
    XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE,
    null);
for (var i = 0; i < allLinks.snapshotLength; i++) {
    thisLink = allLinks.snapshotItem(i);
    // disable bing search links
    if (thisLink.href.substring(0, 19) == "http://www.bing.com") {
        thisLink.removeAttribute("href");
    }
    // disable the hover popup
    if (thisLink.hasAttribute("itxtdid")) {
        thisLink.removeAttribute("itxtdid");
    }
}

Tuesday, September 01, 2009

Ubuntu 8.04 - sound on flash player with firefox

If sound works generally, but not inside Flash Player (in FireFox), try this:

thushara@agni:~$ sudo apt-get install libflashsupport
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following NEW packages will be installed:
libflashsupport
0 upgraded, 1 newly installed, 0 to remove and 131 not upgraded.
Need to get 8326B of archives.
After this operation, 65.5kB of additional disk space will be used.
Get:1 http://us.archive.ubuntu.com hardy/universe libflashsupport 1.9-0ubuntu1 [8326B]
Fetched 8326B in 0s (14.9kB/s)
Selecting previously deselected package libflashsupport.
(Reading database ... 99009 files and directories currently installed.)
Unpacking libflashsupport (from .../libflashsupport_1.9-0ubuntu1_i386.deb) ...
Setting up libflashsupport (1.9-0ubuntu1) ...

Processing triggers for libc6 ...
ldconfig deferred processing now taking place
thushara@agni:~$


restart FireFox and sound should be available...

Friday, August 14, 2009

remove bing links from content sites



Lately, I have seen numerous "bing" links appearing in certain content sites I frequent. A case in point is http://articles.moneycentral.msn.com

It is somewhat insidious: unless I happen to notice it is a search link, I follow it imagining it will take me to some good content.

So I wrote a GreaseMonkey script to disable those links. Here goes:

// ==UserScript==
// @name test
// @namespace http://userscripts.org/thushara/
// @include http://articles.moneycentral.msn.com/*
// ==/UserScript==
var allLinks, thisLink;
allLinks = document.evaluate(
    '//a[@href]',
    document,
    null,
    XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE,
    null);
for (var i = 0; i < allLinks.snapshotLength; i++) {
    thisLink = allLinks.snapshotItem(i);
    // disable bing search links
    if (thisLink.href.substring(0, 19) == "http://www.bing.com") {
        thisLink.removeAttribute("href");
    }
}


I'm not too familiar with XPath, but I believe it is possible to specify "http://www.bing.com" inside the XPath query itself, so that there is no need to iterate through all the links finding the search links. I couldn't get the syntax right for this. Please post if you find a way around this.

Thursday, July 30, 2009

groovy, mysql and case sensitivity

Groovy seems to have a somewhat hard to grok policy on case sensitivity as it pertains to mysql columns. To illustrate, for a table apiaccess with a column Domain this fails as of groovy version 1.6.3:

  query = "select domain from apiaccess where apiaccessid=3052";
row = sql.firstRow(query);
dom = row.Domain;


The reason is that Domain in row.Domain does not match, with regard to case, the domain in the query string.

This works:

  query = "select domain from apiaccess where apiaccessid=3052";
row = sql.firstRow(query);
dom = row.domain;


However, on an earlier version (perhaps the 1.6 RC1 candidate), the first block of code worked: there, groovy expected a match with the actual mysql column name rather than with the name as written in the query string.

On both versions, the case used in the query string does not need to match the mysql column names.

Tuesday, June 30, 2009

mv is not atomic in Mac OS

you shouldn't rely on `mv` being atomic on the regular file system under MacOS. i had a script that had to regularly update a file that is read by a different script. under this scenario i resorted to writing a temporary file and then `mv`ing the file to the permanent location. while this works for linux, it doesn't work for MacOS.

to demonstrate, open two command windows on your Mac and in one type this:

while true; do echo this better be a whole sentence > x1.txt; mv x1.txt x.txt; done


on the other, run this script:

while true
do
F=`cat x.txt`
echo $F
if [ "$F" = "this better be a whole sentence" ]
then
echo ok
else
echo bad
exit -1
fi
done


notice the output:

mpire@brwdbs02:~$ ./x.sh
this better be a whole sentence
ok
this better be a whole sentence
ok
ok
this better be a whole sentence
ok
this better be a whole sentence
ok
cat: x.txt: No such file or directory

bad
[~]


bad
mpire@brwdbs02:~$

Performance Improvement in org.apache.hadoop.io.Text class


I wrote earlier about a performance improvement I made to Hadoop. After discussing it with the Hadoop devs, notably Chris Douglas, the change was folded into the core org.apache.hadoop.io.Text class. This has the additional benefit of improving a core text handling class used commonly in Hadoop, and we avoid the additional memory footprint of an extra OutputStream instance.

This improvement will be available in hadoop 0.21.0.

Note the difference in YourKit profiling data with the new Text class:

Thursday, June 25, 2009

running Hadoop tests


install jdk 1.5 and Apache Forrest. then,
run this command:

ant -Djava5.home=/System/Library/Frameworks/JavaVM.framework/Versions/1.5/Home/ -Dforrest.home=/Users/thushara/apache-forrest-0.8 -Djavac.args="-Xlint -Xmaxwarns 1000" clean test tar

Monday, June 22, 2009

bash: single line for loop

run commands on multiple files at once:

[~/hadoop-src] for f in *; do echo $f; done
common
hdfs
mapreduce
[~/hadoop-src]for f in *; do svn up $f/trunk; done
At revision 787534.

Fetching external item into 'hdfs/trunk/src/test/bin'
External at revision 787534.

At revision 787534.

Fetching external item into 'mapreduce/trunk/src/test/bin'
External at revision 787534.

At revision 787534.
[~/hadoop-src]

Friday, June 19, 2009

date -d is different from Linux to Mac


familiar with:

date -d '1 hour ago'

well, it will work on the Linux command line, but no such luck on the Mac

here is the code you need if your script is to work on both OSes:

OS=`uname -a`
if [[ $OS == Darwin* ]]
then
TODAY=`date -v-1H +"%Y-%m-%d.%H"`
else
TODAY=`date -d '1 hour ago' +"%Y-%m-%d.%H"`
fi

Mac Office Excel - all text exports appear as one line


If you use Mac Office to convert a spreadsheet to text, you will see one long line with any Unix utility. The reason is that Office ends each line with a bare carriage return (^M) and no line feed.

To fix this, bring up the text file inside the vi editor and type:

:s/^V^M/^V^M/g

this will appear on the editor like

:s/^M/^M/g

as ^V is really an escape for control characters

Thursday, June 18, 2009

Downgrading Subversion Working Copy

Sometimes, an IDE coupled to SVN (ex: Intellij) might upgrade your subversion working copy to the latest SVN server format that is incompatible with the command line svn client you have. In this case, if your distro has a command line svn of the latest version, you are probably fine. But if you don't have such luck (for ex: being a MacOS user who has v 1.4 of svn command line), you might want to downgrade the working copy using instructions here.

Friday, June 05, 2009

Hadoop - reading large lines (several MB) is slow


I ran into a performance issue running a Hadoop map/reduce job on an input that at times contained lines as long as 200MB. The issue was in org.apache.hadoop.util.LineReader.

LineReader uses org.apache.hadoop.io.Text to store potentially large lines of text. Unfortunately Text class does not behave well for large text.

Here is the yourKit profile of a simple block of code using Text class:



Here is a profile when Text is replaced with ByteArrayOutputStream:



Notice the Text.append version took 10 times longer to run.
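One likely reason for the gap: the old Text class appears to grow its backing array to just the size needed on each append, so building a very long line recopies the whole buffer every time, while ByteArrayOutputStream doubles its capacity. Here is a toy model of the two growth strategies, not the Hadoop classes themselves (class and method names are mine):

```java
import java.io.ByteArrayOutputStream;

public class GrowthDemo {
    static int exactFitAppend(byte[] chunk, int appends) {
        byte[] buf = new byte[0];
        for (int i = 0; i < appends; i++) {
            // reallocate to the exact new size and copy everything:
            // O(n^2) bytes copied in total
            byte[] bigger = new byte[buf.length + chunk.length];
            System.arraycopy(buf, 0, bigger, 0, buf.length);
            System.arraycopy(chunk, 0, bigger, buf.length, chunk.length);
            buf = bigger;
        }
        return buf.length;
    }

    static int doublingAppend(byte[] chunk, int appends) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        for (int i = 0; i < appends; i++) {
            os.write(chunk, 0, chunk.length);   // amortized O(n) in total
        }
        return os.size();
    }

    public static void main(String[] args) {
        byte[] chunk = new byte[100];
        // both build the same 500 KB buffer; time them to watch the gap widen
        System.out.println(exactFitAppend(chunk, 5000));   // 500000
        System.out.println(doublingAppend(chunk, 5000));   // 500000
    }
}
```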

I could get my map/reduce task, which initially took over 20 minutes and crashed (as the hadoop TaskTracker was timing out child tasks that took too long), to finish in under 30s with this simple change to LineReader:

  public int readLine(Text str, int maxLineLength,
                      int maxBytesToConsume) throws IOException {
    str.clear();
    boolean hadFinalNewline = false;
    boolean hadFinalReturn = false;
    boolean hitEndOfFile = false;
    int startPosn = bufferPosn;
    long bytesConsumed = 0;
    ByteArrayOutputStream os = new ByteArrayOutputStream();
    outerLoop: while (true) {
      if (bufferPosn >= bufferLength) {
        if (!backfill()) {
          hitEndOfFile = true;
          break;
        }
      }
      startPosn = bufferPosn;
      for (; bufferPosn < bufferLength; ++bufferPosn) {
        switch (buffer[bufferPosn]) {
        case '\n':
          hadFinalNewline = true;
          bufferPosn += 1;
          break outerLoop;
        case '\r':
          if (hadFinalReturn) {
            // leave this \r in the stream, so we'll get it next time
            break outerLoop;
          }
          hadFinalReturn = true;
          break;
        default:
          if (hadFinalReturn) {
            break outerLoop;
          }
        }
      }
      bytesConsumed += bufferPosn - startPosn;
      int length = bufferPosn - startPosn - (hadFinalReturn ? 1 : 0);
      length = (int) Math.min(length, maxLineLength - os.size());
      if (length >= 0) {
        os.write(buffer, startPosn, length);
        LOG.info("os.size= " + os.size() + " just wrote from " + startPosn + " to " + length + " bytes");
      }
      if (bytesConsumed >= maxBytesToConsume) {
        str.set(os.toByteArray());
        return (int) Math.min(bytesConsumed, (long) Integer.MAX_VALUE);
      }
    }
    LOG.info("finished reading line");
    int newlineLength = (hadFinalNewline ? 1 : 0) + (hadFinalReturn ? 1 : 0);
    if (!hitEndOfFile) {
      bytesConsumed += bufferPosn - startPosn;
      int length = bufferPosn - startPosn - newlineLength;
      length = (int) Math.min(length, maxLineLength - os.size());
      if (length > 0) {
        os.write(buffer, startPosn, length);
      }
    }

    str.set(os.toByteArray());
    return (int) Math.min(bytesConsumed, (long) Integer.MAX_VALUE);
  }

Wednesday, May 13, 2009

Stop Emacs splash screen



Put this in your ~/.emacs :

;; no splash screen
(setq inhibit-startup-message t)


You can also start emacs with -Q option to avoid the splash screen.

seems this is a debated point!

Monday, May 11, 2009

Flash plugin for FireFox on Ubuntu 8.04.1

The installation doesn't work without a minor tweak after the fact:

thushara@agni:~$ sudo ln -s /usr/lib/libssl3.so.1d /usr/lib/libssl3.so
thushara@agni:~$ sudo ln -s /usr/lib/libplds4.so.0d /usr/lib/libplds4.so
thushara@agni:~$ sudo ln -s /usr/lib/libplc4.so.0d /usr/lib/libplc4.so
thushara@agni:~$ sudo ln -s /usr/lib/libnspr4.so.0d /usr/lib/libnspr4.so
thushara@agni:~$ sudo ln -s /usr/lib/libnss3.so.1d /usr/lib/libnss3.so


how did i figure this out? - just start firefox on the command-line and watch it complain:

thushara@agni:~$ firefox
LoadPlugin: failed to initialize shared library libXt.so [libXt.so: cannot open shared object file: No such file or directory]
LoadPlugin: failed to initialize shared library libXext.so [libXext.so: cannot open shared object file: No such file or directory]
LoadPlugin: failed to initialize shared library /usr/lib/adobe-flashplugin/libflashplayer.so [libnss3.so: cannot open shared object file: No such file or directory]
LoadPlugin: failed to initialize shared library /usr/lib/adobe-flashplugin/libflashplayer.so [libnss3.so: cannot open shared object file: No such file or directory]

out of these, our problem is with libflashplayer.so; it seems a dependent module (libnss3.so) is not being found.

so now, use ld to see the dependencies for the libflashplayer.so :

thushara@agni:~$ ld /usr/lib/adobe-flashplugin/libflashplayer.so
ld: warning: libnss3.so, needed by /usr/lib/adobe-flashplugin/libflashplayer.so, not found (try using -rpath or -rpath-link)
ld: warning: libsmime3.so, needed by /usr/lib/adobe-flashplugin/libflashplayer.so, not found (try using -rpath or -rpath-link)
ld: warning: libssl3.so, needed by /usr/lib/adobe-flashplugin/libflashplayer.so, not found (try using -rpath or -rpath-link)
ld: warning: libplds4.so, needed by /usr/lib/adobe-flashplugin/libflashplayer.so, not found (try using -rpath or -rpath-link)
ld: warning: libplc4.so, needed by /usr/lib/adobe-flashplugin/libflashplayer.so, not found (try using -rpath or -rpath-link)
ld: warning: libnspr4.so, needed by /usr/lib/adobe-flashplugin/libflashplayer.so, not found (try using -rpath or -rpath-link)
ld: warning: cannot find entry symbol _start; not setting start address
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `NSS_CMSMessage_GetContentInfo@NSS_3.2'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `CERT_GetDefaultCertDB@NSS_3.2'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `NSS_CMSContentInfo_GetContentTypeTag@NSS_3.2'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `SSL_ForceHandshake@NSS_3.2'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `NSS_CMSSignedData_VerifySignerInfo@NSS_3.2'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `PR_Initialized'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `NSS_IsInitialized@NSS_3.9.2'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `CERT_DecodeCertFromPackage@NSS_3.4'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `NSS_CMSMessage_Destroy@NSS_3.2'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `SSL_OptionSet@NSS_3.2'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `NSS_CMSContentInfo_GetContent@NSS_3.2'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `SSL_ResetHandshake@NSS_3.2'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `PR_Read'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `SSL_AuthCertificate@NSS_3.2'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `SECMOD_CloseUserDB@NSS_3.11'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `CERT_ChangeCertTrust@NSS_3.2'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `PR_Write'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `NSS_Shutdown@NSS_3.2'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `PR_Init'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `CERT_DestroyCertificate@NSS_3.2'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `SSL_ImportFD@NSS_3.2'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `PR_Now'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `SECMOD_OpenUserDB@NSS_3.11'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `CERT_VerifyCACertForUsage@NSS_3.6'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `PR_ImportTCPSocket'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `NSS_Init@NSS_3.2'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `SSL_AuthCertificateHook@NSS_3.2'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `NSS_SetDomesticPolicy@NSS_3.2'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `NSS_CMSMessage_CreateFromDER@NSS_3.2'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `NSS_CMSMessage_GetContent@NSS_3.2'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `NSS_CMSSignedData_SignerInfoCount@NSS_3.2'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `PK11_FreeSlot@NSS_3.2'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `NSS_CMSSignedData_ImportCerts@NSS_3.2'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `PR_GetError'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `NSS_CMSSignedData_VerifyCertsOnly@NSS_3.2'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `PR_Close'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `SSL_BadCertHook@NSS_3.2'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `SSL_SetURL@NSS_3.2'
/usr/lib/adobe-flashplugin/libflashplayer.so: undefined reference to `PR_SetSocketOption'


from this, we find that certain libs are missing, or are they? let's just check if they are somewhere sneaky:

thushara@agni:~$ locate libnss3.so
/usr/lib/libnss3.so.1d
thushara@agni:~$ locate libsmime3.so
/usr/lib/libsmime3.so.1d
thushara@agni:~$ locate libssl3.so
/usr/lib/libssl3.so.1d
thushara@agni:~$ locate libplds4.so
/usr/lib/libplds4.so.0d
thushara@agni:~$ locate libplc4.so
/usr/lib/libplc4.so.0d
thushara@agni:~$ locate libnspr4.so
/usr/lib/libnspr4.so.0d

thus we can just symlink our way out of this:

thushara@agni:~$ sudo ln -s /usr/lib/libssl3.so.1d /usr/lib/libssl3.so
thushara@agni:~$ sudo ln -s /usr/lib/libplds4.so.0d /usr/lib/libplds4.so
thushara@agni:~$ sudo ln -s /usr/lib/libplc4.so.0d /usr/lib/libplc4.so
thushara@agni:~$ sudo ln -s /usr/lib/libnspr4.so.0d /usr/lib/libnspr4.so
thushara@agni:~$ sudo ln -s /usr/lib/libnss3.so.1d /usr/lib/libnss3.so

hit youtube and enJoy ~

Friday, April 17, 2009

Tomcat web.xml file - order is important under web-app


Some servlet containers are picky about the ordering of the elements under <web-app> in web.xml; the Servlet 2.3 DTD declares the children of <web-app> as an ordered sequence, so an out-of-order file can fail validation.

Today I had to change a web.xml so that the <filter> tag was placed before the <listener> tag.

Tuesday, March 17, 2009

Java - threads have no return codes


Java threads do not return exit codes. A thread gets its own stack, so an exception thrown inside the thread unwinds that stack only; the parent thread never sees it, and only the thread throwing the exception can catch it.

The java.util.concurrent package solves this with a new abstraction, the task: a FutureTask wraps a Callable which, unlike Runnable, can return a value, and that result (or the exception) is handed back to the submitting thread through a Future.
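A minimal sketch of that pattern, using an ExecutorService to run the Callable (the value 42 is just a stand-in exit code):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CallableDemo {
    static int returnCode() throws Exception {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        Future<Integer> future = exec.submit(new Callable<Integer>() {
            public Integer call() {
                return 42;   // the worker thread's "return code"
            }
        });
        // get() blocks for the value; if call() had thrown, get() would
        // rethrow it here, wrapped in an ExecutionException, so the
        // submitting thread sees both results and failures
        int code = future.get();
        exec.shutdown();
        return code;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(returnCode());   // 42
    }
}
```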

Tuesday, February 24, 2009

`watch` with the Mac


The unix command line utility watch allows you to monitor the output of a command continuously. I found this port, which works on my Mac after a minor modification to its Makefile: since a Mac system may not have the /usr/local dir, BINDIR in the Makefile should point somewhere else, say /usr/bin.

Wednesday, February 04, 2009

java HTMLParser - handling EncodingChangeException


If you are analyzing a web page containing characters with different encodings, the HTMLParser may throw an exception of type EncodingChangeException.

The correct way to handle this exception is to reset the parser and re-try parsing. On the second attempt, the HTMLParser is aware of the multiple encodings and manages to parse the page without exceptions.

ex:

NodeList nodes = null;
try {
    nodes = parser.extractAllNodesThatMatch(new NodeClassFilter(TitleTag.class));
} catch (EncodingChangeException ex) {
    // accommodate the new encoding, re-parse
    parser.reset();
    nodes = parser.extractAllNodesThatMatch(new NodeClassFilter(TitleTag.class));
}

Wednesday, January 28, 2009

java Collections.copy behavior is unexpected


Yesterday I was trying to sort a Collection returned by JDO. Of course this is not possible, as JDO resultsets are read-only. So I looked for a Java function that would copy the JDO collection into a new collection, which I could then sort.

The first function that popped up was Collections.copy(List dest, List src). After I coded up the copy, it crashed with an IndexOutOfBoundsException inside the copy method.

Upon further investigation, I found that the destination list has to be at least as large as the source list. This is not what one expects from a copy method; it seems Sun introduced this function as a convenience for overwriting the elements of one existing list with another's. You can read about it here.

Fortunately, if your collection is a List, there is a List.addAll(Collection c) method that would copy a collection to the list.
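The pitfall and the fix, side by side (a small sketch with made-up data; the class name is mine):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class CopyDemo {
    static boolean copyIntoEmptyThrows(List<String> src) {
        try {
            // dest.size() < src.size(), so this fails: copy() overwrites
            // existing slots, it does not grow the destination
            Collections.copy(new ArrayList<String>(), src);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<String> src = Arrays.asList("c", "a", "b");
        System.out.println(copyIntoEmptyThrows(src));   // true

        // the copy constructor (or addAll) gives a fresh, sortable list
        List<String> dest = new ArrayList<String>(src);
        Collections.sort(dest);
        System.out.println(dest);   // [a, b, c]
    }
}
```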