Friday, February 18, 2011

Running HBase - some issues to be aware of

I want to take a moment to note down a few issues I ran into while setting up a distributed HBase environment, in case it helps someone else.

First, I set up Hadoop from the 0.20-append branch as described here. I used two machines, with the first machine as the master and both machines acting as slaves; this is the guide I used.

mkdir ~/hadoop-0.20-append
cd ~/hadoop-0.20-append
svn co https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-append/ .
ant jar

At the end of this, you will have the Hadoop jar file under ~/hadoop-0.20-append/build.
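For the two-machine layout described above, the conf/slaves file on the master simply lists every host that should run a DataNode and TaskTracker. A rough sketch, with placeholder host names, would be:

# conf/slaves on the master (host names are placeholders)
master-host
slave-host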

The first mistake I made was to use the IP address of the name node for fs.default.name in the conf/core-site.xml file. There is a bug in the Hadoop 0.20 release that prevents the use of an IP address in this context.
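For context, the property in question in conf/core-site.xml looked roughly like this, with an IP address where a hostname should be (the address and port are placeholders):

<property>
  <name>fs.default.name</name>
  <value>hdfs://192.168.1.10:9000</value>  <!-- should be a hostname, not an IP -->
</property>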

Interestingly, the basic HDFS shell commands (e.g., get, ls) worked with the IP address used for fs.default.name. The problem only cropped up after I set up HBase and tried to use the HBase shell.
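For example, commands along these lines worked fine even with the IP address configured (the paths are just examples):

bin/hadoop fs -ls /
bin/hadoop fs -put somefile.txt /tmp/somefile.txt
bin/hadoop fs -get /tmp/somefile.txt copy-of-somefile.txt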

To set up HBase, I followed the steps outlined here.

Before I discovered the IP-related issue, I encountered an error that showed I was not following the steps faithfully enough. While HBase ships with a version of Hadoop presumably built from the 0.20-append branch, it was not identical to the version I had built myself. As stated in the documentation, I copied the Hadoop jar I built over the jar shipped with HBase.
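Roughly, the copy amounted to something like the following; the exact jar file names depend on the build and on the HBase release, so treat them as placeholders:

# back up the Hadoop jar that ships with HBase, then drop in the one built above
cd ~/hbase/lib                  # path to the HBase install is a placeholder
mv hadoop-core-0.20-append-rXXXX.jar hadoop-core-0.20-append-rXXXX.jar.orig
cp ~/hadoop-0.20-append/build/hadoop-core-*.jar .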

Next, I ran into the IP issue. Simply changing fs.default.name and restarting the Hadoop cluster is not enough in such cases, because data has already been written to the HDFS name and data directories, and any mismatch triggers further "namespace mismatch" errors. Thus, before changing fs.default.name, I removed the directories specified by dfs.name.dir and dfs.data.dir; in the case of dfs.data.dir, I had to remove it on both slaves. Then I changed the IP address over to the machine name, formatted the name node, and restarted the Hadoop cluster.
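A rough sketch of that sequence, with placeholder directory paths (use whatever dfs.name.dir and dfs.data.dir point to in your conf/hdfs-site.xml):

# on the master: stop the cluster
bin/stop-all.sh
# remove the name node directory (dfs.name.dir) on the master
rm -rf /home/hadoop/dfs/name
# remove the data node directory (dfs.data.dir) on both slaves
rm -rf /home/hadoop/dfs/data
# after editing fs.default.name in conf/core-site.xml, reformat and restart
bin/hadoop namenode -format
bin/start-all.sh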

It still was not over. This time the issue was that these machine names were not in DNS: they were simply names assigned to the machines and were not registered in the domain name system the machines use to resolve one another. So I went into the /etc/hosts file on both machines and added entries to allow each box to resolve the other's name to an IP.
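The entries looked something like this (the IP addresses and host names here are made up):

# /etc/hosts on both machines
192.168.1.10    master-host
192.168.1.11    slave-host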

After that, I could create a table and insert rows into it from the HBase shell, as explained here.
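From the HBase shell, that amounts to something like the following; the table name, column family, and values are placeholders:

$ bin/hbase shell
hbase> create 'testtable', 'cf'
hbase> put 'testtable', 'row1', 'cf:col1', 'value1'
hbase> scan 'testtable'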

The next step was to programmatically create a table and add rows to it. I adapted the example from here. By default, the client API finds the HBase configuration files via the classpath, so I added the path to the hbase/conf directory to the classpath to get the program to work. Alternatively, you could use
org.apache.hadoop.hbase.HBaseConfiguration.addHbaseResources(org.apache.hadoop.conf.Configuration)
which would have you use
org.apache.hadoop.conf.Configuration.addResource(org.apache.hadoop.fs.Path)
to add the individual configuration files.
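Here is a minimal sketch of what the program looked like, assuming the HBase 0.90-era client API; the table name, column family, and values are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath, which is why hbase/conf
        // needs to be on the classpath when running this program.
        Configuration conf = HBaseConfiguration.create();

        // Create a table with a single column family.
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor desc = new HTableDescriptor("testtable");
        desc.addFamily(new HColumnDescriptor("cf"));
        admin.createTable(desc);

        // Insert one row into the new table.
        HTable table = new HTable(conf, "testtable");
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col1"), Bytes.toBytes("value1"));
        table.put(put);
        table.close();
    }
}

If you go the addHbaseResources route instead, the table-creation and insert code stays the same; only how the Configuration object is populated changes.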
