Tuesday, February 22, 2011

Java code to tail -N a text file

This code lets you iterate over the last N lines of a specified file, like tail -n. It also has a "head" method, which lets you iterate over the file from the beginning.
import java.io.*;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.Iterator;
import java.util.NoSuchElementException;

public class MMapFile {

    public class MMapIterator implements Iterator<String> {
        private int offset;

        public MMapIterator(int offset) {
            this.offset = offset;
        }
        
        public boolean hasNext() {
            return offset < cb.limit();
        }

        public String next() {
            ByteArrayOutputStream sb = new ByteArrayOutputStream();
            if (offset >= cb.limit())
                throw new NoSuchElementException();
            for (; offset < cb.limit(); offset++) {
                byte c = (cb.get(offset));
                if (c == '\n') {
                    offset++;
                    break;
                }
                if (c != '\r') {
                    sb.write(c);
                }

            }
            try {
                return sb.toString("UTF-8");
            } catch (UnsupportedEncodingException e) {
                // UTF-8 is always supported; fall back to the default charset
            }
            return sb.toString();
        }

        public void remove() {
            throw new UnsupportedOperationException();
        }
    }


    private ByteBuffer cb;
    private long size;
    private long numLines = -1;

    public MMapFile(String file) throws IOException {
        FileChannel fc = new FileInputStream(new File(file)).getChannel();
        try {
            size = fc.size();
            cb = fc.map(FileChannel.MapMode.READ_ONLY, 0, size);
        } finally {
            fc.close();  // the mapping remains valid after the channel is closed
        }
    }

    public long getNumLines() {
        if (numLines != -1) return numLines;  //cache number of lines
        long cnt = 0;
        for (int i=0; i <size; i++) {
            if (cb.get(i) == '\n')
                cnt++;
        }
        numLines = cnt;
        return cnt;
    }

    public Iterator<String> tail(long lines) {
        long cnt=0;
        long i=0;
        for (i=size-1; i>=0; i--) {
            if (cb.get((int)i) == '\n') {
                cnt++;
                if (cnt == lines+1)
                    break;
            }
        }
        return new MMapIterator((int)i+1);
    }

    public Iterator<String> head() {
        return new MMapIterator(0);
    }

    public static void main(String[] args) {
        try {
            Iterator<String> it = new MMapFile("/test.txt").head();
            while (it.hasNext()) {
                System.out.println(it.next());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }

        System.out.println();

        try {
            Iterator<String> it = new MMapFile("/test.txt").tail(2);
            while (it.hasNext()) {
                System.out.println(it.next());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }

        System.out.println();

        try {
            System.out.println("lines: "+new MMapFile("/test.txt").getNumLines());
        } catch (Exception e) {
            e.printStackTrace();
        }

    }

}

The technique is simply to map the file into memory using java.nio.channels.FileChannel.map from the Java NIO library and then manipulate the file data as if it were an in-memory buffer.

For the "tail" function, we walk backwards over the mapped bytes counting newlines. Once we find the starting offset, the MMapIterator class provides a convenient way to iterate over the lines.

One point in the MMapIterator.next() implementation needs care: making sure the bytes are converted to a String with the appropriate encoding. We use "UTF-8", but if your input file uses a different encoding, this should be changed.

Monday, February 21, 2011

Java: not all 8-bit data can be cast to the char type

If you have a byte and want to make a String from it, a simple cast to char will only work for 7-bit ASCII. If the byte holds an extended-ASCII value, the cast will not produce the correct character code, because the byte is sign-extended.

This is the sure way to get all 8-bit characters represented in Strings:

new String(new byte[]{c}, "ISO-8859-1")

This has to do with the default encoding used by the JVM, which is likely not "ISO-8859-1". On Linux it is likely "UTF-8", and on Snow Leopard (Mac) it is "MacRoman".
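A minimal sketch of the difference (the class name is mine; the byte 0xE9 is 'é' in ISO-8859-1):

```java
import java.io.UnsupportedEncodingException;

public class ByteToString {
    // ISO-8859-1 maps all 256 byte values directly onto the first
    // 256 Unicode code points, so every 8-bit value decodes correctly.
    public static String decode(byte b) throws UnsupportedEncodingException {
        return new String(new byte[]{b}, "ISO-8859-1");
    }

    public static void main(String[] args) throws Exception {
        byte b = (byte) 0xE9;              // 'é' in ISO-8859-1
        System.out.println(decode(b));     // the correct character
        // A bare cast sign-extends the byte and yields the wrong
        // code point (0xFFE9 instead of 0x00E9):
        System.out.println((char) b == '\u00E9');  // prints "false"
    }
}
```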

Friday, February 18, 2011

Running HBase - some issues to be aware of

I want to take a moment to note down a few issues I had with setting up a distributed HBase environment in case it helps someone else.

First, I set up Hadoop from the 0.20-append branch as described here. I used two machines: the first machine was the master, and both machines were used as slaves. Here is a guide I used.

mkdir ~/hadoop-0.20-append
cd ~/hadoop-0.20-append
svn co https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-append/ .
ant jar

At the end of this, you will have the Hadoop jar file under ~/hadoop-0.20-append/build.

The first mistake I made was to use the IP address of the name node for fs.default.name in the conf/core-site.xml file. There is a bug in the Hadoop 0.20 release that prevents the use of an IP address in this context.

Interestingly, the basic HDFS shell commands (e.g., get, ls) worked with the IP address in fs.default.name. The problem only cropped up after I set up HBase and tried to use the HBase shell.
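For reference, the working form of the setting looks like this (the hostname "master1" and port are hypothetical; a raw IP in the value is what triggers the bug):

```xml
<!-- conf/core-site.xml -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://master1:9000/</value>
</property>
```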

To set up HBase, I followed the steps outlined here.

Before I discovered the IP-related issue, I encountered an error that showed I was not following the steps faithfully enough. While HBase ships with a version of Hadoop presumably from the 0.20-append branch, it was not identical to the version I built from that branch. As the documentation states, I then copied the Hadoop jar I built over the jar shipped with HBase.

Next, I ran into the IP issue. Generally, changing fs.default.name and restarting the Hadoop cluster is not enough in such cases, since data has already been written to the HDFS name and data directories, and any mismatch triggers further "namespace mismatch" errors. Thus, before changing fs.default.name, I removed the directories specified by dfs.name.dir and dfs.data.dir. In the case of dfs.data.dir, I had to remove it on both slaves. Then I changed the IP over to the machine name, formatted the name node, and restarted the Hadoop cluster.
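The sequence above can be sketched as commands; the directory paths are hypothetical stand-ins for whatever your hdfs-site.xml sets for dfs.name.dir and dfs.data.dir:

```shell
bin/stop-all.sh                  # stop the Hadoop cluster first
rm -rf /var/hadoop/dfs/name      # on the master (dfs.name.dir)
rm -rf /var/hadoop/dfs/data      # on every slave (dfs.data.dir)
# edit conf/core-site.xml: fs.default.name -> hostname, not IP
bin/hadoop namenode -format      # re-format with the new namespace
bin/start-all.sh
```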

It still was not over. This time, the issue was that these machine names were not in DNS. They were simply names assigned to the machines and were not in the domain name system the machines use to communicate with one another. So I went into the /etc/hosts file on both machines and added the appropriate entries to let each box resolve the names to IPs.
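The entries look like this (the names and addresses here are hypothetical; use your own):

```
# /etc/hosts, on both machines
192.168.1.10   master1
192.168.1.11   slave1
```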

After which, I could create a table and insert rows into it as explained here.

The next step was to programmatically create a table and add rows to it. I adapted the example from here. By default, the programming API finds the HBase configuration files via the classpath, so I added the path to the hbase/conf directory to the classpath to get the program to work. Alternatively, you could use
org.apache.hadoop.hbase.HBaseConfiguration.addHbaseResources(org.apache.hadoop.conf.Configuration)
which would have you use
org.apache.hadoop.conf.Configuration.addResource(org.apache.hadoop.fs.Path)
to add the paths to the individual configuration files.
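A sketch of that alternative (the class name and the config path are mine; this assumes the HBase and Hadoop jars are on the classpath):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class HBaseConfigExample {
    public static Configuration load() {
        Configuration conf = new Configuration();
        // Point at the config file explicitly instead of relying on
        // the classpath; the path below is hypothetical.
        conf.addResource(new Path("/opt/hbase/conf/hbase-site.xml"));
        // Or pull in the standard HBase resources:
        HBaseConfiguration.addHbaseResources(conf);
        return conf;
    }
}
```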

Friday, February 11, 2011

Debugging a most curious hang / spin

Here is the stack trace from a JVM that shows 1/4 core of the machine being 100% utilized.

2011-02-11 14:48:05
Full thread dump Java HotSpot(TM) Server VM (19.0-b09 mixed mode):

"pool-23-thread-4" prio=10 tid=0x29e81800 nid=0x50dd waiting for monitor entry [0x2b83f000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at net.htmlparser.jericho.TagType.getTagAt(TagType.java:667)
    at net.htmlparser.jericho.Tag.parseAllgetNextTag(Tag.java:631)
    at net.htmlparser.jericho.Tag.parseAll(Tag.java:607)
    at net.htmlparser.jericho.Source.fullSequentialParse(Source.java:609)
    at HTMLTagExtractorUsingJerichoParser.handleRedirects(HTMLTagExtractorUsingJerichoParser.java:217)
    at HTMLTagExtractorUsingJerichoParser.parse(HTMLTagExtractorUsingJerichoParser.java:51)
    at LinksProcessor.add(LinksProcessor.java:222)
    at LinksProcessor.run(LinksProcessor.java:77)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

"pool-23-thread-3" prio=10 tid=0x29e7f400 nid=0x50dc waiting on condition [0x2c355000]
   java.lang.Thread.State: RUNNABLE
    at java.util.Arrays.copyOfRange(Arrays.java:3209)
    at java.lang.String.<init>(String.java:215)
    at java.lang.StringBuilder.toString(StringBuilder.java:430)
    at net.htmlparser.jericho.StartTag.getStartDelimiter(StartTag.java:600)
    at net.htmlparser.jericho.StartTag.getNext(StartTag.java:660)
    at net.htmlparser.jericho.StartTag.getEndTag(StartTag.java:777)
    at net.htmlparser.jericho.StartTag.getEndTagInternal(StartTag.java:566)
    at net.htmlparser.jericho.StartTag.getElement(StartTag.java:167)
    at net.htmlparser.jericho.Element.getChildElements(Element.java:327)
    at net.htmlparser.jericho.Element.getChildElements(Element.java:335)
    at net.htmlparser.jericho.Element.getChildElements(Element.java:335)
    at net.htmlparser.jericho.Element.getChildElements(Element.java:335)
    at net.htmlparser.jericho.Element.getChildElements(Element.java:335)
    at net.htmlparser.jericho.Element.getChildElements(Element.java:335)
    at net.htmlparser.jericho.Element.getChildElements(Element.java:335)
    at net.htmlparser.jericho.Source.getChildElements(Source.java:721)
    at net.htmlparser.jericho.Element.getParentElement(Element.java:282)
    at HTMLTagExtractorUsingJerichoParser.getRedirectURL(HTMLTagExtractorUsingJerichoParser.java:232)
    at HTMLTagExtractorUsingJerichoParser.handleRedirects(HTMLTagExtractorUsingJerichoParser.java:219)
    at HTMLTagExtractorUsingJerichoParser.parse(HTMLTagExtractorUsingJerichoParser.java:51)
    at LinksProcessor.add(LinksProcessor.java:222)
    at LinksProcessor.run(LinksProcessor.java:77)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

"pool-23-thread-2" prio=10 tid=0x29e82400 nid=0x50db waiting for monitor entry [0x28cf6000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at java.util.ArrayList.<init>(ArrayList.java:112)
    at java.util.ArrayList.<init>(ArrayList.java:119)
    at net.htmlparser.jericho.Element.getChildElements(Element.java:309)
    at net.htmlparser.jericho.Element.getChildElements(Element.java:335)
    at net.htmlparser.jericho.Element.getChildElements(Element.java:335)
    at net.htmlparser.jericho.Element.getChildElements(Element.java:335)
    at net.htmlparser.jericho.Element.getChildElements(Element.java:335)
    at net.htmlparser.jericho.Element.getChildElements(Element.java:335)
    at net.htmlparser.jericho.Element.getChildElements(Element.java:335)
    at net.htmlparser.jericho.Element.getChildElements(Element.java:335)
    at net.htmlparser.jericho.Element.getChildElements(Element.java:335)
    at net.htmlparser.jericho.Element.getChildElements(Element.java:335)
    at net.htmlparser.jericho.Element.getChildElements(Element.java:335)
    at net.htmlparser.jericho.Element.getChildElements(Element.java:335)
    at net.htmlparser.jericho.Element.getChildElements(Element.java:335)
    at net.htmlparser.jericho.Element.getChildElements(Element.java:335)
    at net.htmlparser.jericho.Element.getChildElements(Element.java:335)
    at net.htmlparser.jericho.Element.getChildElements(Element.java:335)
    at net.htmlparser.jericho.Element.getChildElements(Element.java:335)
    at net.htmlparser.jericho.Element.getChildElements(Element.java:335)
    at net.htmlparser.jericho.Element.getChildElements(Element.java:335)
    at net.htmlparser.jericho.Element.getChildElements(Element.java:335)
    at net.htmlparser.jericho.Element.getChildElements(Element.java:335)
    at net.htmlparser.jericho.Source.getChildElements(Source.java:721)
    at net.htmlparser.jericho.Element.getParentElement(Element.java:282)
    at HTMLTagExtractorUsingJerichoParser.getRedirectURL(HTMLTagExtractorUsingJerichoParser.java:232)
    at HTMLTagExtractorUsingJerichoParser.handleRedirects(HTMLTagExtractorUsingJerichoParser.java:219)
    at HTMLTagExtractorUsingJerichoParser.parse(HTMLTagExtractorUsingJerichoParser.java:51)
    at LinksProcessor.add(LinksProcessor.java:222)
    at LinksProcessor.run(LinksProcessor.java:77)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

"pool-23-thread-1" prio=10 tid=0x29e7cc00 nid=0x50da waiting for monitor entry [0x2a8e8000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at java.util.LinkedList.<init>(LinkedList.java:78)
    at net.htmlparser.jericho.Attributes.construct(Attributes.java:109)
    at net.htmlparser.jericho.Attributes.construct(Attributes.java:78)
    at net.htmlparser.jericho.StartTagType.parseAttributes(StartTagType.java:672)
    at net.htmlparser.jericho.StartTagTypeGenericImplementation.constructTagAt(StartTagTypeGenericImplementation.java:132)
    at net.htmlparser.jericho.TagType.getTagAt(TagType.java:674)
    at net.htmlparser.jericho.Tag.parseAllgetNextTag(Tag.java:631)
    at net.htmlparser.jericho.Tag.parseAll(Tag.java:607)
    at net.htmlparser.jericho.Source.fullSequentialParse(Source.java:609)
    at HTMLTagExtractorUsingJerichoParser.handleRedirects(HTMLTagExtractorUsingJerichoParser.java:217)
    at HTMLTagExtractorUsingJerichoParser.parse(HTMLTagExtractorUsingJerichoParser.java:51)
    at LinksProcessor.add(LinksProcessor.java:222)
    at LinksProcessor.run(LinksProcessor.java:77)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

"Low Memory Detector" daemon prio=10 tid=0x2efac400 nid=0x47dd runnable [0x00000000]
   java.lang.Thread.State: RUNNABLE

"CompilerThread1" daemon prio=10 tid=0x2efaa000 nid=0x47dc waiting on condition [0x00000000]
   java.lang.Thread.State: RUNNABLE

"CompilerThread0" daemon prio=10 tid=0x2efa8000 nid=0x47db waiting on condition [0x00000000]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x2efa6800 nid=0x47da waiting on condition [0x00000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x2ef96400 nid=0x47d9 in Object.wait() [0x2ec65000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x35c53470> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
    - locked <0x35c53470> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
    at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=10 tid=0x2ef94c00 nid=0x47d8 in Object.wait() [0x2ece6000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x35c53448> (a java.lang.ref.Reference$Lock)
    at java.lang.Object.wait(Object.java:485)
    at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
    - locked <0x35c53448> (a java.lang.ref.Reference$Lock)

"main" prio=10 tid=0x091c2800 nid=0x47d2 waiting for monitor entry [0xb6b22000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.lucene.index.TermBuffer.toTerm(TermBuffer.java:122)
    at org.apache.lucene.index.SegmentTermEnum.term(SegmentTermEnum.java:169)
    at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:233)
    at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:179)
    at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:57)
    at org.apache.lucene.index.DocumentsWriter.applyDeletes(DocumentsWriter.java:1002)
    - locked <0x35c98398> (a org.apache.lucene.index.DocumentsWriter)
    at org.apache.lucene.index.DocumentsWriter.applyDeletes(DocumentsWriter.java:958)
    - locked <0x35c98398> (a org.apache.lucene.index.DocumentsWriter)
    at org.apache.lucene.index.IndexWriter.applyDeletes(IndexWriter.java:5207)
    - locked <0x35c98198> (a org.apache.lucene.index.IndexWriter)
    at org.apache.lucene.index.IndexWriter.doFlushInternal(IndexWriter.java:4370)
    - locked <0x35c98198> (a org.apache.lucene.index.IndexWriter)
    at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:4209)
    - locked <0x35c98198> (a org.apache.lucene.index.IndexWriter)
    at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:4200)
    at org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:4078)
    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:4151)
    - locked <0x35c99188> (a java.lang.Object)
    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:4124)
    at IndexerHelper.indexLinks(IndexerHelper.java:137)
    at indexLinks(Gauntlet.java:330)
    at battle(Gauntlet.java:353)
    at main(Gauntlet.java:435)

"VM Thread" prio=10 tid=0x2ef92400 nid=0x47d7 runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x091c9c00 nid=0x47d3 runnable

"GC task thread#1 (ParallelGC)" prio=10 tid=0x091cb400 nid=0x47d4 runnable

"GC task thread#2 (ParallelGC)" prio=10 tid=0x091cc800 nid=0x47d5 runnable

"GC task thread#3 (ParallelGC)" prio=10 tid=0x091ce000 nid=0x47d6 runnable

"VM Periodic Task Thread" prio=10 tid=0x091dd000 nid=0x47de waiting on condition

The main thread has issued a cancellation to its child threads and proceeded to commit the Lucene buffers the children have been filling. However, none of the child threads appear to be in the part of the code where the Lucene buffers are modified. And yet the Lucene commit() does not finish. It is making progress, as I can see the index being updated in the file system, but for some reason it is spinning.


The 4 child threads also seem to be spinning. Somehow, not all 4 cores are being used: just one core is 100% utilized.


top - 15:00:19 up 22 days, 23:49,  2 users,  load average: 1.00, 1.00, 0.94
Tasks: 121 total,   1 running, 120 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.3%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3872764k total,  3791916k used,    80848k free,   121416k buffers
Swap:  3906552k total,     9192k used,  3897360k free,  1228492k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                      
18385 user     20   0 2388m 2.2g  10m S  100 58.8   2225:32 java -Xmx2G -Xss512k ...
26819 user     20   0  2548 1196  904 R    0  0.0   0:00.09 top                                                              
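As an aside, a hot thread in top can be matched to a stack in the dump by converting its decimal thread id to hex and looking for the matching nid field. A sketch, using the PID from the output above (finding the per-thread id requires top -H):

```shell
PID=18385
TID=18385                     # decimal id of the hot thread, from: top -H -p $PID
NID=$(printf '0x%x' "$TID")   # jstack prints thread ids in hex as "nid=0x..."
echo "$NID"
# Then locate the matching thread in the dump:
#   jstack $PID | grep "nid=$NID"
```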

Tuesday, February 01, 2011

Beware of using Lucene NIOFSDirectory from a thread pool

NIOFSDirectory does not handle Thread.interrupt() well: if a thread is interrupted during I/O, it throws a ClosedByInterruptException.

I was using a thread pool (from the java.util.concurrent package) and noticed that terminating the pool could result in a ClosedByInterruptException being thrown.

The warning is provided in the new documentation.

You are still OK if you can wait for the thread pool to finish its tasks: ExecutorService.shutdown() followed by ExecutorService.awaitTermination() will not cause the concurrent package to interrupt the NIOFSDirectory code.
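A minimal sketch of that safe shutdown sequence (the class and the sleeping task are mine; in the real program the tasks would be the Lucene-writing workers):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class GracefulShutdown {
    // Submits a few tasks, then shuts the pool down WITHOUT interrupting
    // them. Returns true if every task finished before the timeout.
    public static boolean runAndShutdown() throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        for (int i = 0; i < 4; i++) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        Thread.sleep(50);  // stand-in for I/O on an NIOFSDirectory
                    } catch (InterruptedException e) {
                        // shutdownNow() would land here mid-I/O and cause
                        // a ClosedByInterruptException in the NIO code
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }
        pool.shutdown();  // stop accepting tasks; queued tasks still run
        return pool.awaitTermination(10, TimeUnit.SECONDS);  // sends no interrupts
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("clean shutdown: " + runAndShutdown());
    }
}
```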

The problem crops up if you have to resort to ExecutorService.shutdownNow(), as that drains the queued-up tasks and interrupts the actively executing ones, which ends up interrupting the NIOFSDirectory code.