tech: 2018

Monday, June 04, 2018

JSON to objects in a few languages

When working with data services, we often have a need to convert JSON strings to objects that model our data. Here is a list of code snippets in different languages that convert this Facebook Graph JSON data.

The list is presented in ascending order of verbosity. Predictably, JavaScript is the most succinct, whereas Java needs a copious amount of code. Scala avoids some of that verbosity by using case classes, which remove the boilerplate code for constructors. Jackson (Java) relies on getters and setters to identify which attributes of the object are to be serialized, causing code bloat.

JSON:
Javascript:
Ruby:
Python:
Scala:
Java:
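
For a sense of what the Python variant involves, here is a minimal sketch. The payload is a made-up stand-in shaped loosely like a friends list, not real Graph API output, and the names Friend, FriendList and parse_friend_list are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import List

# Made-up sample payload (an assumption, not real Graph API data).
SAMPLE = """
{
  "data": [
    {"id": "1", "name": "Alice Example"},
    {"id": "2", "name": "Bob Example"}
  ],
  "paging": {"next": "https://example.invalid/next-page"}
}
"""


@dataclass
class Friend:
    id: str
    name: str


@dataclass
class FriendList:
    data: List[Friend]
    next_page: str


def parse_friend_list(text):
    """Convert a JSON string into typed objects that model the data."""
    raw = json.loads(text)
    friends = [Friend(**item) for item in raw["data"]]
    next_page = raw.get("paging", {}).get("next", "")
    return FriendList(data=friends, next_page=next_page)


friend_list = parse_friend_list(SAMPLE)
print(friend_list.data[0].name)
```

Dataclasses play roughly the role that case classes play in Scala: the constructor is generated from the field declarations.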

Friday, April 20, 2018

Goldbach Conjecture

In the 18th century, two mathematicians arrived at a conjecture that is now known by the name of its originator: the Goldbach conjecture. It says that every even number greater than 2 can be expressed as the sum of two primes. No proof is known yet, but it has been checked to hold up to at least 400 trillion.

A program to test the Goldbach conjecture for a given integer:

This program demonstrates two algorithms that are well known.

The sieve of Eratosthenes to calculate all primes up to a given number
A linear algorithm to find if two numbers in a list sum to a given number.

To verify the Goldbach conjecture for a given n, we use the sieve to find all prime numbers up to n, then use the linear algorithm to find two primes from this list that sum to n.
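
The two steps can be sketched roughly as follows. This is a minimal illustration rather than the original program: the function names are made up, and the linear step is assumed to be the two-pointer scan over a sorted list.

```python
def primes_up_to(n):
    """Sieve of Eratosthenes: all primes <= n, in ascending order."""
    if n < 2:
        return []
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    p = 2
    while p * p <= n:
        if is_prime[p]:
            # Cross out every multiple of p, starting at p*p.
            for multiple in range(p * p, n + 1, p):
                is_prime[multiple] = False
        p += 1
    return [i for i, flag in enumerate(is_prime) if flag]


def pair_with_sum(sorted_nums, target):
    """Linear two-pointer scan over a sorted list.

    Returns a pair (a, b) with a + b == target, or None.
    lo == hi is allowed so the same prime may be used twice (4 = 2 + 2).
    """
    lo, hi = 0, len(sorted_nums) - 1
    while lo <= hi:
        total = sorted_nums[lo] + sorted_nums[hi]
        if total == target:
            return sorted_nums[lo], sorted_nums[hi]
        if total < target:
            lo += 1
        else:
            hi -= 1
    return None


def goldbach_pair(n):
    """Find two primes that sum to the even number n (n > 2)."""
    if n <= 2 or n % 2 != 0:
        raise ValueError("n must be an even integer greater than 2")
    return pair_with_sum(primes_up_to(n), n)


print(goldbach_pair(100))
```

The sieve already returns the primes in sorted order, so the two-pointer scan needs no extra sorting step.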

Friday, April 06, 2018

Timing with Jupyter notebook

Pieces of code can be timed within the Jupyter notebook using the %timeit magic.

Here is an example where a grid walk algorithm is implemented three times with progressively better run time, timed with %timeit and graphed using bokeh:

Code:

def num_paths(n):
    M = [[0] * n for i in range(n)]
    for i in range(n):
        M[n-1][i] = 1

    for r in range(n-2, -1, -1):
        for c in range(n-r-1, n):
            M[r][c] = M[r][c-1] + M[r+1][c]
    return M[0][n-1]

def num_paths_from(r, c, n, M):
    if M[r][c] > 0:
        return M[r][c]
    if r == 0 and c == n-1:
        return 1
    paths = ([(x,y) for (x,y) in 
              [(r-1, c), (r, c+1)] if y >= n-x-1 
                                   and y<n])
    npaths = 0
    for x,y in paths:
        npaths += num_paths_from(x,y,n,M)
    M[r][c] = npaths
    return npaths

def num_pathz_from(r, c, n):
    if r == 0 and c == n-1:
        return 1
    paths = ([(x,y) for (x,y) in 
              [(r-1, c), (r, c+1)] if y >= n-x-1 
                                   and y<n])
    npaths = 0
    for x,y in paths:
        npaths += num_pathz_from(x,y,n)
    return npaths

def num_paths_slow(n):
    M = [[0] * n for i in range(n)]
    return num_paths_from(n-1, 0, n, M)

def num_paths_super_slow(n):
    return num_pathz_from(n-1, 0, n)


for sz in range(5,15):
    %timeit num_paths(sz)
    %timeit num_paths_slow(sz)
    %timeit num_paths_super_slow(sz)

Timing:

100000 loops, best of 3: 7.74 µs per loop
10000 loops, best of 3: 26.2 µs per loop
10000 loops, best of 3: 62.1 µs per loop
100000 loops, best of 3: 9.27 µs per loop
10000 loops, best of 3: 32.9 µs per loop
10000 loops, best of 3: 200 µs per loop
100000 loops, best of 3: 11.3 µs per loop
10000 loops, best of 3: 43 µs per loop
1000 loops, best of 3: 615 µs per loop
100000 loops, best of 3: 13.9 µs per loop
10000 loops, best of 3: 56.9 µs per loop
100 loops, best of 3: 2.05 ms per loop
100000 loops, best of 3: 16.6 µs per loop
10000 loops, best of 3: 70.9 µs per loop
100 loops, best of 3: 6.67 ms per loop
100000 loops, best of 3: 19.4 µs per loop
10000 loops, best of 3: 97.4 µs per loop
10 loops, best of 3: 23.7 ms per loop
10000 loops, best of 3: 22.1 µs per loop
10000 loops, best of 3: 105 µs per loop
10 loops, best of 3: 80.2 ms per loop
10000 loops, best of 3: 25.6 µs per loop
10000 loops, best of 3: 135 µs per loop
1 loop, best of 3: 287 ms per loop
10000 loops, best of 3: 29.8 µs per loop
10000 loops, best of 3: 149 µs per loop
1 loop, best of 3: 1.05 s per loop
10000 loops, best of 3: 32.7 µs per loop
10000 loops, best of 3: 171 µs per loop
1 loop, best of 3: 3.78 s per loop

Chart:

Code for the plot:

from bokeh.palettes import Spectral11
from bokeh.plotting import figure, show, output_file

p = figure(plot_width=300, plot_height=300)
slowest = [62,200,615,2050,6670,23700,80200,287000,1050000,3780000]
slower = [26,32,43,56,70,97,105,135,149,171]
fast = [7,9,11,13,16,19,22,25,29,32]
st = 5
end = 8
mypalette=Spectral11[0:3]
p.multi_line(xs=[list(range(st,end)), list(range(st,end)), list(range(st,end))], 
             ys=[slowest[:end-st], 
                 slower[:end-st],
                 fast[:end-st]
                ],
             line_color=mypalette,
             line_width=5
             )

show(p)

This shows how the algorithm with exponential time complexity deteriorates for higher values of n:

Now that I've shown you a bunch of performance numbers and a visualization, you may be curious about the algorithm. It is a contrived example: find the number of paths from one corner of a grid (bottom left) to another (top right), moving only up or right. The squares to the north of the diagonal from top right to bottom left are out of bounds - that is, the path must stay on or to the right of the diagonal. In this image, we show the problem for n = 5.

The exponential algorithm recursively finds the number of paths from each point to the end point (the top right corner). But since you can reach a single point by a number of paths (and this number increases exponentially with n), the same computation of finding the number of paths from this point to the grid corner is repeated, causing the slowdown.

The next improvement is to remember the number of paths once calculated. Say we are on [4,2]: we calculate the number of paths to the grid end from here and record it in M[4][2]. Next time we are at [4,2], we no longer need to calculate it again, as the result can be looked up from M[4][2].
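
The same remember-and-look-up idea can also be sketched with functools.lru_cache from the standard library instead of a hand-managed matrix M. This is an alternative illustration, not the code timed above, and the name num_paths_memo is made up.

```python
from functools import lru_cache


def num_paths_memo(n):
    """Count paths from (n-1, 0) to (0, n-1), moving up or right,
    staying on or to the right of the diagonal."""

    @lru_cache(maxsize=None)
    def paths_from(r, c):
        if r == 0 and c == n - 1:
            return 1
        total = 0
        # Candidate moves: one step up, or one step right.
        for x, y in ((r - 1, c), (r, c + 1)):
            # Keep only cells inside the grid and not north of the diagonal.
            if x >= 0 and n - x - 1 <= y < n:
                total += paths_from(x, y)
        return total

    return paths_from(n - 1, 0)


print(num_paths_memo(5))
```

The cache key is the (r, c) pair, so each cell is computed once, just as with the matrix lookup.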

The last algorithm uses dynamic programming to do even less work. It is based on the simple observation that a cell (i,j) can only be reached from 2 cells: the cell to its immediate left, (i,j-1), and the cell right below it, (i+1,j). There is just a single step from each of these two to (i,j). So if we know the number of paths to those two cells, we can add them up to find the number of paths to (i,j). We then keep calculating the number of paths to each cell, walking from the bottom row up and going right along the columns, until we eventually fill the cell at the top right, (0, n-1).

Wednesday, April 04, 2018

Pandas snippets

Here are some snippets that can come in handy when cleaning data with pandas. They were useful to me in completing the coursework for a Python data science course.

Extract a subset of columns from the dataframe based on a regular expression:
Code:

import pandas as pd

persona1 = pd.Series({
                        'Last Post On': '02/04/2017',
                        'Friends-2015': 10,
                        'Friends-2016': 20,
                        'Friends-2017': 300
})

persona2 = pd.Series({
                        'Last Post On': '02/04/2018',
                        'Friends-2015': 100,
                        'Friends-2016': 240,
                        'Friends-2017': 560
})

persona3 = pd.Series({
                        'Last Post On': '02/04/2014',
                        'Friends-2015': 120,
                        'Friends-2016': 120,
                        'Friends-2017': 120
})

df = pd.DataFrame([persona1, persona2, persona3], 
                  index=['Chris', 'Bella', 'Laura'])
# Raw string so that \d reaches the regex engine unchanged
df.filter(regex=r"Friends-\d{4}")

Output:

	Friends-2015	Friends-2016	Friends-2017
Chris	10	20	300
Bella	100	240	560
Laura	120	120	120

Set a column based on the value of both the current row and adjacent rows:

For this example, we define regulars at the gym as those who went to the gym at least 3 months in a row last year:
Code:

import datetime
df = pd.DataFrame({'Month': 
                   [datetime.date(2008, i, 1).strftime('%B')
                             for i in range(1,13)] * 3, 
                   'visited': [False]*36},
                   index=['Alice']*12 + 
                         ['Bob']*12 + 
                         ['Bridgett']*12)

df = df.reset_index()

def make_regular(r, name):
    # Mark February through April as visited for the given person,
    # keeping any visits that are already recorded.
    in_streak = r['Month'] in ('February', 'March', 'April')
    r['visited'] = r['visited'] or (r['index'] == name and in_streak)
    return r

df = df.apply(make_regular, axis=1, args=('Alice',))
df = df.apply(make_regular, axis=1, args=('Bob',))
regular = ((df['visited'] == True) & 
          (df['visited'].shift(-1) == True) & 
          (df['visited'].shift(-2) == True))
df[regular]['index'].values.tolist()

Output:

['Alice', 'Bob']
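
One caveat: shift(-1) on the whole column also looks across the boundary between two people, so one person's December is compared with the next person's January. That does not matter for the data above, but a per-person shift avoids it. Here is a minimal sketch, assuming the person is stored in a column called 'name' (a made-up name; the code above uses 'index' after reset_index).

```python
import calendar

import pandas as pd

months = list(calendar.month_name)[1:]  # 'January' .. 'December'
df = pd.DataFrame({
    'name': ['Alice'] * 12 + ['Bob'] * 12 + ['Bridgett'] * 12,
    'Month': months * 3,
    'visited': False,
})

# Alice and Bob visit February through April, as in the example above.
streak = ['February', 'March', 'April']
mask = df['name'].isin(['Alice', 'Bob']) & df['Month'].isin(streak)
df.loc[mask, 'visited'] = True

# Shift within each person's rows, so windows never span two people.
by_person = df.groupby('name')['visited']
starts_streak = (df['visited']
                 & by_person.shift(-1).eq(True)
                 & by_person.shift(-2).eq(True))

regulars = df.loc[starts_streak, 'name'].drop_duplicates().tolist()
print(regulars)
```

The comparison with True via eq() turns the missing values that shift introduces at the end of each group into False.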

Friday, March 23, 2018

Pushing your code to pypi

Here is a good document that describes how to push your code to the PyPI repository.

A URL has changed slightly. In your ~/.pypirc set the URL as follows:

[pypitest]
repository=https://test.pypi.org/legacy/

The register step is no longer required. All you need to do is upload the files.

python setup.py sdist upload -r pypitest

Each time you initiate an upload, you'd need to change the version number and the URL.

While this uploaded the package to test.pypi.org, the upload steps had changed for pypi.org:

thushara@ figleaf (master)$ python setup.py sdist upload -r pypi
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'install_requires'
  warnings.warn(msg)
running sdist
running check
warning: sdist: manifest template 'MANIFEST.in' does not exist (using default file list)
warning: sdist: standard file not found: should have one of README, README.txt
writing manifest file 'MANIFEST'
creating figleaf-0.2
creating figleaf-0.2/figleaf
making hard links in figleaf-0.2...
hard linking setup.cfg -> figleaf-0.2
hard linking setup.py -> figleaf-0.2
hard linking figleaf/__init__.py -> figleaf-0.2/figleaf
hard linking figleaf/graph.py -> figleaf-0.2/figleaf
Creating tar archive
removing 'figleaf-0.2' (and everything under it)
running upload
Submitting dist/figleaf-0.2.tar.gz to https://pypi.python.org/pypi
Upload failed (410): Gone (This API has been deprecated and removed from legacy PyPI in favor of using the APIs available in the new PyPI.org implementation of PyPI (located at https://pypi.org/). For more information about migrating your use of this API to PyPI.org, please see https://packaging.python.org/guides/migrating-to-pypi-org/#uploading. For more information about the sunsetting of this API, please see https://mail.python.org/pipermail/distutils-sig/2017-June/030766.html)
error: Upload failed (410): Gone (This API has been deprecated and removed from legacy PyPI in favor of using the APIs available in the new PyPI.org implementation of PyPI (located at https://pypi.org/). For more information about migrating your use of this API to PyPI.org, please see https://packaging.python.org/guides/migrating-to-pypi-org/#uploading. For more information about the sunsetting of this API, please see https://mail.python.org/pipermail/distutils-sig/2017-June/030766.html)

To upload to pypi I used twine. Installing that on MacOS High Sierra required disabling SIP (System Integrity Protection).

In ~/.pypirc, I removed the repository line under [pypi]

python setup.py sdist

Remove old tars under dist, and

twine upload dist/*

Now I could see the project under pypi

Installing Twine on MacOS High Sierra

thushara@ wildhops (master)*$ sudo -H pip install twine
Password:
Collecting twine
  Downloading twine-1.11.0-py2.py3-none-any.whl
Collecting pkginfo>=1.4.2 (from twine)
  Downloading pkginfo-1.4.2-py2.py3-none-any.whl
Requirement already satisfied: setuptools>=0.7.0 in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from twine)
Collecting tqdm>=4.14 (from twine)
  Downloading tqdm-4.19.8-py2.py3-none-any.whl (52kB)
    100% |████████████████████████████████| 61kB 2.1MB/s
Collecting requests-toolbelt>=0.8.0 (from twine)
  Downloading requests_toolbelt-0.8.0-py2.py3-none-any.whl (54kB)
    100% |████████████████████████████████| 61kB 1.6MB/s
Requirement already satisfied: requests!=2.15,!=2.16,>=2.5.0 in /Library/Python/2.7/site-packages (from twine)
Installing collected packages: pkginfo, tqdm, requests-toolbelt, twine
Exception:
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/pip/basecommand.py", line 215, in main
    status = self.run(options, args)
  File "/Library/Python/2.7/site-packages/pip/commands/install.py", line 342, in run
    prefix=options.prefix_path,
  File "/Library/Python/2.7/site-packages/pip/req/req_set.py", line 784, in install
    **kwargs
  File "/Library/Python/2.7/site-packages/pip/req/req_install.py", line 851, in install
    self.move_wheel_files(self.source_dir, root=root, prefix=prefix)
  File "/Library/Python/2.7/site-packages/pip/req/req_install.py", line 1064, in move_wheel_files
    isolated=self.isolated,
  File "/Library/Python/2.7/site-packages/pip/wheel.py", line 377, in move_wheel_files
    clobber(source, dest, False, fixer=fixer, filter=filter)
  File "/Library/Python/2.7/site-packages/pip/wheel.py", line 316, in clobber
    ensure_dir(destdir)
  File "/Library/Python/2.7/site-packages/pip/utils/__init__.py", line 83, in ensure_dir
    os.makedirs(path)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/os.py", line 150, in makedirs
    makedirs(head, mode)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/os.py", line 157, in makedirs
    mkdir(name, mode)
OSError: [Errno 1] Operation not permitted: '/System/Library/Frameworks/Python.framework/Versions/2.7/man'

The only way I found to get write access under /System is to boot into Recovery Mode and run this command in the Terminal:

csrutil disable

Reboot, then install again.

Thursday, March 22, 2018

A Graph in Python - and pythonic surprises

I started implementing a Graph in Python for a project and encountered some unexpected behavior. See if you can spot the problem.

Code for the graph is here:

However, this is buggy. Each time an edge is added to one node, it gets added to every node that was created without an explicit edge list. Adding an edge from 'bellevue' to 'lynwood' added the edge to both vertices 'bellevue' and 'lynwood'.

Code/Output:

g.add_node(GraphNode('seattle', [Edge('seattle', 'bellevue', 'dist', 10), Edge('seattle', 'lynwood', 'dist', 20)]))

g.add_edge(('bellevue', 'lynwood', 'dist', 5))

print (g)

bellevue -> bellevue:lynwood:dist:5
lynwood -> bellevue:lynwood:dist:5
seattle -> seattle:bellevue:dist:10 seattle:lynwood:dist:20

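After a lengthy debugging stint, the issue was identified to be the way Python evaluates default argument values to functions: a default value is evaluated only once, when the function is defined, so a mutable default such as an empty list is shared by every call that does not pass its own value.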
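
The effect can be reproduced in a few lines. The classes below are simplified, hypothetical stand-ins and not the actual graph code; they assume the node constructor took its edge list as a default argument.

```python
class GraphNode:
    """Hypothetical node with the bug: the default list is created once,
    at function definition time, and shared by every node that omits edges."""

    def __init__(self, name, edges=[]):
        self.name = name
        self.edges = edges


class FixedGraphNode:
    """Same node with the usual fix: default to None, build a new list per call."""

    def __init__(self, name, edges=None):
        self.name = name
        self.edges = [] if edges is None else list(edges)


# Buggy version: both nodes end up holding the very same list object.
a, b = GraphNode('bellevue'), GraphNode('lynwood')
a.edges.append(('bellevue', 'lynwood', 'dist', 5))
print(b.edges)             # the edge shows up on 'lynwood' as well
print(a.edges is b.edges)  # True: one shared list

# Fixed version: each node gets its own list.
c, d = FixedGraphNode('bellevue'), FixedGraphNode('lynwood')
c.edges.append(('bellevue', 'lynwood', 'dist', 5))
print(d.edges)             # []
```

This would also explain why 'seattle' in the output above is unaffected: it was created with an explicit edge list, so it never touched the shared default.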