Thursday, October 08, 2015

Cartesian product in Scala

val s  = Seq("1","2","3","4")
val t  = Seq("a","b","c","d","e","f")
val u  = Seq("x","y")
val v  = Seq("m","l","n","o","p")

val sq = Seq(s,t,u,v)

sq.foldLeft(Seq(""))((b, a) => b.flatMap(i => a.map(j => i + j)))

res9: Seq[String] = List(1axm, 1axl, 1axn, 1axo, 1axp, 1aym, 1ayl, 1ayn, 1ayo, 1ayp, 1bxm, 1bxl, 1bxn, 1bxo, 1bxp, 1bym, 1byl, 1byn, 1byo, 1byp, 1cxm, 1cxl, 1cxn, 1cxo, 1cxp, 1cym, 1cyl, 1cyn, 1cyo, 1cyp, 1dxm, 1dxl, 1dxn, 1dxo, 1dxp, 1dym, 1dyl, 1dyn, 1dyo, 1dyp, 1exm, 1exl, 1exn, 1exo, 1exp, 1eym, 1eyl, 1eyn, 1eyo, 1eyp, 1fxm, 1fxl, 1fxn, 1fxo, 1fxp, 1fym, 1fyl, 1fyn, 1fyo, 1fyp, 2axm, 2axl, 2axn, 2axo, 2axp, 2aym, 2ayl, 2ayn, 2ayo, 2ayp, 2bxm, 2bxl, 2bxn, 2bxo, 2bxp, 2bym, 2byl, 2byn, 2byo, 2byp, 2cxm, 2cxl, 2cxn, 2cxo, 2cxp, 2cym, 2cyl, 2cyn, 2cyo, 2cyp, 2dxm, 2dxl, 2dxn, 2dxo, 2dxp, 2dym, 2dyl, 2dyn, 2dyo, 2dyp, 2exm, 2exl, 2exn, 2exo, 2exp, 2eym, 2eyl, 2eyn, 2eyo, 2eyp, 2fxm, 2fxl, 2fxn, 2fxo, 2fxp, 2fym, 2fyl, 2fyn, 2fyo, 2fyp, 3axm, 3axl, 3axn, 3axo, 3axp, 3aym, 3ayl, 3ayn, 3ayo...
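
The fold starts from a single empty combination and, for each input sequence, extends every partial combination with every element of that sequence. A minimal sketch of the same idea for elements of any type follows; the names CartesianProduct and product are made up for this illustration and are not part of the Scala library.

```scala
object CartesianProduct {
  // Start with one empty combination, then extend each partial
  // combination with every element of the next sequence.
  def product[A](seqs: Seq[Seq[A]]): Seq[Seq[A]] =
    seqs.foldLeft(Seq(Seq.empty[A])) { (acc, next) =>
      for (prefix <- acc; elem <- next) yield prefix :+ elem
    }
}
```

With the four sequences above, CartesianProduct.product(sq) should yield 4 * 6 * 2 * 5 = 240 combinations, and mapping each one through mkString gives the same strings as the fold over strings.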

Saturday, September 05, 2015

Scala : Another example of collectFirst

Say we want to find the population of the native country of the first European bird in a list of birds. One way is to flatMap over the birds, dropping the non-European ones, and then take the head of the resulting list. But that visits every bird, even though we only need the first match. If the list is large, this is very inefficient.

The other approach is to use find to get the first European bird, and then look up its native country's population. But then we look up the bird's native country twice: once to test the continent and once to get the population. In this example that is cheap, but if the lookup requires an expensive database retrieval, we may not want to do it twice. collectFirst matters most when the operation is very expensive: imagine a statistical calculation that produces the value for the match.

case class Bird(name: String)

object PlayCollect {

  val birds = Seq(Bird("jaybird"), Bird("owl"), Bird("sparrow"))

  val nativeCountries  = Map( (Bird("jaybird")->"Malaysia"), (Bird("sparrow")->"Ireland"), (Bird("owl")->"Chile") )
  val continents = Map( ("Malaysia" -> "Asia"), ("Ireland" -> "Europe"), ("Chile" -> "South America") )
  val population = Map( ("Malaysia"->10000000), ("Ireland" -> 12000000), "Chile"->45000000 )

  val popEuBird = new PartialFunction[Bird, Int] {
    // Only called for birds that isDefinedAt accepts
    def apply(b: Bird) = {
      continents(nativeCountries(b)) match {
        case "Europe" => population(nativeCountries(b))
      }
    }

    def isDefinedAt(b: Bird) = {
      continents(nativeCountries(b)) match {
        case "Europe" => true
        case _ => false
      }
    }
  }

  def populationOfNativeCountryOfFirstEuropeanBird: Option[Int] = {
    birds collectFirst popEuBird
  }

  def main(args: Array[String]): Unit = {
    populationOfNativeCountryOfFirstEuropeanBird.foreach{pop => println(pop)}
  }
}
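
The three approaches discussed above can be compared side by side. The sketch below is only an illustration under a few assumptions: the lookups counter is a hypothetical stand-in for an expensive retrieval, the helper europeanPopulation and the extractor EuropeanPopulation are names invented here, and the bird list is reordered and extended with a "robin" so that the differences show up in the counts.

```scala
object CollectFirstComparison {
  final case class Bird(name: String)

  val birds = Seq(Bird("jaybird"), Bird("sparrow"), Bird("owl"), Bird("robin"))

  val nativeCountries = Map(
    Bird("jaybird") -> "Malaysia", Bird("sparrow") -> "Ireland",
    Bird("owl") -> "Chile", Bird("robin") -> "Ireland")
  val continents = Map("Malaysia" -> "Asia", "Ireland" -> "Europe", "Chile" -> "South America")
  val population = Map("Malaysia" -> 10000000, "Ireland" -> 12000000, "Chile" -> 45000000)

  // Hypothetical counter standing in for the cost of an expensive lookup.
  var lookups = 0
  def nativeCountry(b: Bird): String = { lookups += 1; nativeCountries(b) }

  // Looks up the country once; yields the population only for European birds.
  def europeanPopulation(b: Bird): Option[Int] = {
    val country = nativeCountry(b)
    if (continents(country) == "Europe") Some(population(country)) else None
  }

  // 1. flatMap: visits every bird, even after the first match.
  def viaFlatMap: Option[Int] =
    birds.flatMap(b => europeanPopulation(b)).headOption

  // 2. find: stops early, but looks up the matching bird's country again.
  def viaFind: Option[Int] =
    birds.find(b => continents(nativeCountry(b)) == "Europe")
      .map(b => population(nativeCountry(b)))

  // 3. collectFirst with an extractor: stops early and reuses the value
  //    that was computed while matching.
  object EuropeanPopulation {
    def unapply(b: Bird): Option[Int] = europeanPopulation(b)
  }
  def viaCollectFirst: Option[Int] =
    birds.collectFirst { case EuropeanPopulation(p) => p }

  // Runs one variant and returns (result, number of lookups it needed).
  def measure(run: => Option[Int]): (Option[Int], Int) = {
    lookups = 0
    val result = run
    (result, lookups)
  }
}
```

All three variants return Some(12000000). With this data, measure reports 4 lookups for the flatMap version and 3 for the find version. The collectFirst version should need fewer still, though the exact count depends on how the Scala version in use implements collectFirst.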

Wednesday, September 02, 2015

Scala : collectFirst example

On occasion, you want to find the first occurrence of an item in a list, then transform it to another type. Using partial functions with collectFirst, we can accomplish this.

collectFirst first calls isDefinedAt to decide whether apply should be called for the current item in the list. It therefore skips over the non-matching elements without throwing a MatchError.

trait Animal
case class Mammal(name: String) extends Animal
case class Bird(name: String) extends Animal

val animals = Seq(Mammal("elephant"), Mammal("tiger"), Bird("raven"), Mammal("monkey"), Bird("peacock"), Bird("sparrow"))

    val matchBird = new PartialFunction[Animal, Bird] {
      // Only called for animals that isDefinedAt accepts
      def apply(p: Animal) = {
        p match {
          case b@Bird(name) => b
        }
      }

      def isDefinedAt(p: Animal) = {
          p match {
            case Bird(name) => true
            case _ => false
          }
      }
    }

    animals collectFirst matchBird

    res11: Option[Bird] = Some(Bird(raven))
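
The same partial function can also be written as a pattern-matching literal, in which case the compiler derives isDefinedAt from the case clauses. The sketch below assumes this shorter form is acceptable for the use case; the wrapper object CollectFirstLiteral and the value directCall exist only for this illustration.

```scala
object CollectFirstLiteral {
  sealed trait Animal
  final case class Mammal(name: String) extends Animal
  final case class Bird(name: String) extends Animal

  val animals: Seq[Animal] = Seq(Mammal("elephant"), Mammal("tiger"), Bird("raven"),
    Mammal("monkey"), Bird("peacock"), Bird("sparrow"))

  // A case-clause literal typed as PartialFunction gets its
  // isDefinedAt generated from the patterns.
  val matchBird: PartialFunction[Animal, Bird] = { case b: Bird => b }

  // Skips the mammals and stops at the first bird.
  val firstBird: Option[Bird] = animals.collectFirst(matchBird)

  // Applying the partial function directly to a non-matching element
  // throws a MatchError, which Try captures as a Failure.
  val directCall = scala.util.Try(matchBird(Mammal("tiger")))
}
```

Here firstBird is Some(Bird(raven)), as in the REPL result above, while directCall is a Failure because nothing guards the direct call.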

Friday, February 27, 2015

Titan : using native hadoop libraries on MacOSX

Once you have built the native Hadoop libraries on Mac OS X, you need to add this bit of code to bin/ so that it can find them:
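if [ -e "${HADOOP_PREFIX}/lib/native/libhadoop.dylib" ]; then
   LIB_PATH="${HADOOP_PREFIX}/lib/native"   # assumed: this definition is missing from the excerpt
   if [ -n "${LD_LIBRARY_PATH:-}" ]; then
       LD_LIBRARY_PATH="${LIB_PATH}:${LD_LIBRARY_PATH}"     # For Linux
   fi
   if [ -n "${DYLD_LIBRARY_PATH:-}" ]; then
       DYLD_LIBRARY_PATH="${LIB_PATH}:${DYLD_LIBRARY_PATH}" # For Mac OS X (assumed: body missing from the excerpt)
   fi
fi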
The only oddity here is that the script uses "set -u" at the top, which makes bash abort with an error when you expand an unset variable. So you have to append ":-" to the variables that you are testing. You can see that in the lines that test LD_LIBRARY_PATH and DYLD_LIBRARY_PATH.

Saturday, December 20, 2014

deleting an iterator in accumulo

I learnt the hard way that setting an iterator in the accumulo shell sets it for a table permanently. To make matters worse, I set this iterator in the metadata table and made everything fail.

Removing the iterator was tricky. First I had to find out what accumulo had decided to call the iterator, as I had specified only the java class and not a name.

Here is the command I had used to set it:

setiter -class org.apache.accumulo.core.iterators.FirstEntryInRowIterator -p 99 -scan

Here is how I found what the iterators for accumulo.metadata table were called:

config -t accumulo.metadata -f iterator

SCOPE      | NAME                                                  | VALUE
table      | table.iterator.majc.bulkLoadFilter .................. | 20,org.apache.accumulo.server.iterators.MetadataBulkLoadFilter
table      | table.iterator.majc.vers ............................ | 10,org.apache.accumulo.core.iterators.user.VersioningIterator
table      | table.iterator.majc.vers.opt.maxVersions ............ | 1
table      | table.iterator.minc.vers ............................ | 10,org.apache.accumulo.core.iterators.user.VersioningIterator
table      | table.iterator.minc.vers.opt.maxVersions ............ | 1
table      | table.iterator.scan.firstEntry ...................... | 99,org.apache.accumulo.core.iterators.FirstEntryInRowIterator
table      | table.iterator.scan.firstEntry.opt.scansBeforeSeek .. | 10
table      | table.iterator.scan.vers ............................ | 10,org.apache.accumulo.core.iterators.user.VersioningIterator
table      | table.iterator.scan.vers.opt.maxVersions ............ | 1

The iterator I added seemed to be named "table.iterator.scan.firstEntry", so I tried to delete that:

root@work accumulo.metadata> deleteiter -n table.iterator.scan.firstEntry -t accumulo.metadata

2014-12-20 15:13:42,854 [shell.Shell] WARN : no iterators found that match your criteria

You have to specify just the last part of the iterator name:

root@work accumulo.metadata> deleteiter -scan -n firstEntry -t accumulo.metadata

Wednesday, November 19, 2014

grep many files while printing file name

A useful trick I found:

find . -exec grep -n hello /dev/null {} \;

When grep is given more than one file, it prints the file name in front of each matching line (the -n option adds the line number). So we pass the handy /dev/null as one extra file to do the job.


Friday, May 23, 2014

Decoding HTML pages with Content-Encoding : deflate

Not all web servers implement the zlib protocol the same way when they return data with Content-Encoding set to deflate. Some servers return a zlib header as specified in RFC 1950, but some return the raw compressed data alone.

Java's Inflater class can be used to deal with both cases, but first we must check for the header. The header is the first two bytes, and the check is simple:

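    // imports needed: java.io.BufferedOutputStream, java.io.IOException,
    //                 java.util.zip.DataFormatException, java.util.zip.Inflater

    static boolean isZlibHeader(byte[] bytes, int offset) {
        // Java bytes are signed: mask with 0xFF to get the unsigned value before comparison
        if (bytes.length < offset + 2) return false;
        int byte1 = bytes[offset] & 0xFF;
        int byte2 = bytes[offset + 1] & 0xFF;
        return byte1 == 0x78 && (byte2 == 0x01 || byte2 == 0x9c || byte2 == 0xDA);
    }

    private void inflateToFile(byte[] encBytes, int offset, int size, BufferedOutputStream f) throws IOException {
        // nowrap = true makes the Inflater expect raw deflate data,
        // so the 2-byte zlib header is skipped when it is present
        boolean hasHeader = isZlibHeader(encBytes, offset);
        Inflater inflator = new Inflater(true);
        inflator.setInput(encBytes, hasHeader ? offset+2 : offset, hasHeader ? size-2 : size);
        byte[] buf = new byte[4096];
        int nbytes = 0;
        do {
            try {
                nbytes = inflator.inflate(buf);
                if (nbytes > 0) {
                    f.write(buf, 0, nbytes);
                }
            } catch (DataFormatException e) {
                //handle error
                nbytes = 0; // stop instead of retrying on corrupt input
            }
        } while (nbytes > 0);
        inflator.end();
    }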
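
The header check can be exercised end to end by compressing some data in both formats and inflating it again. The Scala sketch below is a variation on the code above, not a copy of it: instead of skipping the two header bytes by hand, it picks the Inflater's nowrap flag from the header check. The object DeflateDecoding and its deflate helper are made up here purely to generate test data.

```scala
import java.io.ByteArrayOutputStream
import java.util.zip.{Deflater, Inflater}

object DeflateDecoding {
  // Same check as above: 0x78 followed by one of the known flag bytes.
  def hasZlibHeader(bytes: Array[Byte]): Boolean =
    bytes.length >= 2 && (bytes(0) & 0xFF) == 0x78 &&
      Set(0x01, 0x9C, 0xDA).contains(bytes(1) & 0xFF)

  def inflate(bytes: Array[Byte]): Array[Byte] = {
    // nowrap = true expects raw deflate data; false expects a zlib wrapper.
    val inflater = new Inflater(!hasZlibHeader(bytes))
    inflater.setInput(bytes)
    val out = new ByteArrayOutputStream()
    val buf = new Array[Byte](4096)
    try {
      var n = inflater.inflate(buf)
      while (n > 0) {
        out.write(buf, 0, n)
        n = inflater.inflate(buf)
      }
    } finally inflater.end()
    out.toByteArray
  }

  // Illustration-only helper: compresses data with (raw = false) or
  // without (raw = true) the zlib wrapper.
  def deflate(data: Array[Byte], raw: Boolean): Array[Byte] = {
    val deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, raw)
    deflater.setInput(data)
    deflater.finish()
    val out = new ByteArrayOutputStream()
    val buf = new Array[Byte](4096)
    while (!deflater.finished()) {
      val n = deflater.deflate(buf)
      out.write(buf, 0, n)
    }
    deflater.end()
    out.toByteArray
  }
}
```

Letting the Inflater consume the zlib wrapper itself also means the trailing checksum is verified, which the skip-two-bytes approach does not do. Whether that stricter behaviour is wanted for sloppy servers is a judgment call.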

An example URL that had to be processed this way : Here is the Wireshark capture, showing the Content-Encoding set to deflate as well as the zlib header in the de-chunked body (the first 2 bytes "78 9c") in the bottom pane of the display: