Monday, July 3, 2017

Easy debug with Jupyter

Ok, there are plenty of debuggers out there, starting with ipdb. But in many cases, I'm just interested in inspecting the value of the variables of the function where the error happened. In fact, this type of post mortem debugging is very effective and I developed a package for that: jupydbg (on github)

It's really simple: first install the package (from pypi: pip install jupydbg or download from github and unzip in your pythonpath)

The first package, catch, can be used as follows:
  • import catch from the module
  • invoke the magic %%catch in your cell where an error occurs
  • selects the frame that you want to inspect
  • click on the button next to it
  • then call %eval in a next cell to inspect the data


The second package is to run unittest interactively.
import jupydbg.jupytest
then %test <the suite, case or method>





Monday, May 29, 2017

Size matters

In a recent project whose objective is shot transition detection, the simplest algorithm I was testing required the calculation of the standard deviation of each image. With numpy, assuming that X is an array containing the images, it's as simple as:

N = X.shape[0] # number of images
X0 = X.reshape((N,-1)) #flattens the images
np.std(X0,axis=1) # calculates one value by image

It's however relatively slow. The formula is simple. std could be calculated as
np.mean(X**2,axis=1)-np.mean(X,axis=1)**2
or
np.mean((X-np.mean(X,axis=1))**2,axis=1)

I tested the first formula as written above, with X a vector of uint8 (unsigned integer on 8 bits). There was a good surprise: it was much faster. And a bad one: the result was completely wrong. The reason is that the image is made of pixels, which are coded as 3 bytes, each of them containing the luminosity of each channel (red, green, blue) from 0 to 255. It's stored in an unsigned byte, which goes from 0 to 255 also, for memory efficiency. The problem is that by default, it will store the results of its operations in a vector of the same type, which means that instead of calculating $X^2$, it will calculate $X^2 \mod 256$.

Most of the time, numpy is smart and runs the calculation in the proper type: for instance np.std above uses double precision float numbers (float64). However converting to that type requires time, which explains the poor performance. Since I needed also to calculate the mean of the array, the best performance I could achieve was using the following:

mean = np.mean(X0,axis=1)
np.sqrt(np.mean(X0.astype(np.uint16)**2,axis=1)-mean**2)

The conversion to unit16 is required to process the squaring, but requires less time than the conversion to double precision. Numpy calculates the mean using a double but we need one by image, not one by pixel. For a vector or 4000 images of 245760 pixels each, the calculation now takes 1.36s compared to 2.46s initially.

Friday, May 12, 2017

Python and Excel

In personal projects as well as in my job, I've often had to load data coming from Excel files. Tens of them. In all sorts of flavors, shapes and versions. csv, xls, xlsx and more, such pdf exports. The worst of if is that even inside a workbook the file has its own formatting: tables separated with blank lines, changing headers, strangely formatted dates, nothing can stop the creativity of the authors.

There are many Python packages to load them: csv is pretty well managed with the included csv module or numpy or pandas. xlrd can read xls files with formatting and xlsx without. openpyxl can read and write xlsx files. But I felt I needed a higher level package that could handle the layout inside a sheet.

Here's Sheetparser, a Python module whose idea is to describe the layout of the sheet with spatial patterns, allowing for changes in actual position of the elements. 
Let's take an example: the sheet below is made of 2 tables and some lines, separated by empty lines:


If the size of the tables change between 2 versions of the file, it can become quite tedious to write the code to read the data. With sheetpattern it's a simple as:

from sheetpattern import *
sheet = load_workbook(filename)['Sheet2']
pattern = Sheet('sheet', Rows,
                Table, Empty, Table, Empty,
                Line, Line)
context = PythonObjectContext()
pattern.match_range(sheet, context)
assert context.table.data[0][0] == 'a11'
        
And it's easy to make that code tolerant to small layout changes. I think the library works well and changed the way I load and write Excel sheets programmatically.

It was also a good exercise to learn how to properly publish a package:
 
 

Tuesday, May 9, 2017

Neural networks from the ground up with python (part 2)

In a previous post, I showed that a 1 layer "network" with a linear activation function implemented a linear regression. Let's now use the sigmoid function and do some classification.

The activation function: 1D classifier

Classification is about tagging data: associate to each input a class. Let's take the following very simple problem: we have 40 values between 0 and 1. Each of them is tagged with 0 or 1: 0 if X is less than 0.5 and 1 otherwise.

size = 40
X_train = np.linspace(0,1,size)
y_train = (X_train > 0.5)

We want to predict the tag given the value

Let's use one single neuron, which does the following calculation:
$WX+b$.
We now use an activation function, a function that further transforms the previous calculation. We use sigmoid, which is defined as $\sigma(t) = \frac 1 {1+e^{-x}}$.

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

We now calculate our predicted Y as $\sigma(WX+b)$. W will change the slope and b (in fact -b/W) shift the curve horizontally.

So the idea will be to find the right parameters for W and b so the S shape function is the "closest" to the training parameters. This is done by defining a distance function and minimizing the loss, which is the sum of distances.

We see in the figure on the right a poorly fitted S shape, with W=15 and b = -0.3*W. Clearly we can do better.


The best W and b are found as the parameters that minimize the error. That can be done very simply with Keras as follows:
model = Sequential()
model.add(Dense(1, input_dim=1))
model.add(Activation("sigmoid"))
model.compile(loss=losses.binary_crossentropy, optimizer='sgd')
model.fit(X_train, y_train, nb_epoch=1000, verbose=False)

The result is a well centered sigmoid function that properly predicts the tag.
This is actually equivalent to do a logistic regression.

Tuesday, March 28, 2017

Using Java components with Jython

I had recently to work with KVStore, from Oracle, whose API only exists in Java. I ended up installing a full Java environment, but I also tested the (eventually simpler) solution to call the API from Jython.

Jython is a port of the Python language to the Java virtual machine. Compared to CPython, it means that some library that use C implementation won't be available (e.g. Numpy). However it can readily access Java classes and provides conversion for all the basic types between the Java and Python world. The nice thing about Jython is that it gives an interpreter to play with Java libraries.

In order to install and run Jython:
  1. You will need a Java JRE
  2. Download Jython, either the stand-alone version or the installer. I use the stand-alone version below to reduce the installation footprint.
  3. If you used the installer, there's a jython.exe command in the jython/bin directory. If you use the installer, the command will be:java -jar <path to the jar>\jython-standalone-2.7.0.jar <other command line arguments>
  4. Add additional .jar as required. In my case, I add to install 2 jars from Oracle KVStore. There are several ways to do it.
    1. the first option is to add all the .jar files to the environment variable JYTHONPATH, before starting jython. 
    2. Or add it to the sys.path directly from the program file. For instance:
import sys
import os
jar_dir = <replace this by the path to the vendor jars directory>

for i in os.listdir(jar_dir):
    sys.path.append(os.path.join(jar_dir,i))

And then it's easy to mix Java and Python code:

from java.util import ArrayList
a=ArrayList()
a.add(1)
print a

Friday, March 24, 2017

Neural networks from the ground up with Python (part 1)

Implementing neural networks and leveraging all the power hidden in your CPU or GPU has become as simple as a few lines of Python code with libraries such as Keras and Tensorflow. Complex applications are remarkable, but how does that work? In order to understand the complex, I like to start with the simple concepts and build in complexity.

Single neuron and linear regression

A single neuron doesn't do very much: given a parameter W (the weight), a bias b and a activation function, it calculates for X the value $\hat Y = f (WX + b)$
We will use the sigmoid function or a linear function (Y=X) for f in this article.

If we use a linear function, then $\hat Y = WX + b$. A straight line. And training it on a data set (X,y) with a quadratic error function is equivalent to do a linear regression: in both cases, it looks for the solution of $$ \min_{W,b} \sum_{i} | Wx_i+b - y_i |^2 $$ Just slower and less accurate because in the case of the linear regression we use an analytic result instead of running an optimization. Here's the code, assuming that you installed Keras. I use it with Tensorflow:

import numpy as np
import keras
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras import optimizers
import keras.objectives as losses

# build the model with Keras: 1 layer, 1 neuron
model = Sequential()
model.add(Dense(1, input_dim=1))

# prepare the training set
np.random.seed(42)
X_train = (np.random.rand(50)*10).reshape(-1,1)
Y_train = 0.1*X_train+2 + (np.random.randn(50)*.1).reshape(-1,1)

# use the mean square error and optimize with gradient descent
model.compile(loss=losses.mean_squared_error, optimizer='sgd')
model.fit(X_train, Y_train, nb_epoch=200, verbose=False) 

# diplay the result
import numpy as np                 
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(X_train.squeeze(),model.predict(X_train).squeeze(),'.r')
plt.scatter(X_train.squeeze(),Y_train.squeeze()) 

And the result is:

That's a very inefficient way to do a linear regression, but it shows well the training and prediction.

With N neurons, we would be able to calculate a linear combination of a vector of size N: X=(X[0],...,X[N-1])


We can obtain W and b with the following lines:
for layer in model.layers:
    h=layer.get_weights()
    print (h)
Which yields:
[array([[ 0.13543533]], dtype=float32), array([ 1.79489422], dtype=float32)]
That's 0.13 instead of 0.1 and 1.79 instead of 2. Not great, but with 500 more iterations I have the following W=0.09437597 and b = 2.00935483. Getting close!


By the way, the proper way to do linear regression is of course:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train,Y_train)
print(lr.intercept_,lr.coef_)

Tuesday, March 14, 2017

Monitoring memory usage in a Jupyter notebook

As I was working on a Jupyter notebook, I realized that my computer was slowing down dramatically. A quick check on the memory showed that my Jupyter notebook was growing out of control. We don't necessarily do "big data" but running data analysis on reasonable large data sets for sure has a cost.

There is the command %whos but it doesn't show everything, and the size of the data is not easy to read. So I wrote the little piece of code below: it displays all objects that use more than 1MB, and the total. The line starts with the list of names that are used in the notebook for that object.

import numpy as np
import pandas as pd
def show_mem_usage():
    '''Displays memory usage from inspection
    of global variables in this notebook'''
    gl = sys._getframe(1).f_globals
    vars= {}
    for k,v in list(gl.items()):
        # for pandas dataframes
        if hasattr(v, 'memory_usage'):
            mem = v.memory_usage(deep=True)
            if not np.isscalar(mem):
                mem = mem.sum()
            vars.setdefault(id(v),[mem]).append(k)
        # work around for a bug
        elif isinstance(v,pd.Panel):
            v = v.values
        vars.setdefault(id(v),[sys.getsizeof(v)]).append(k)
    total = 0
    for k,(value,*names) in vars.items():
        if value>1e6:
            print(names,"%.3fMB"%(value/1e6))
        total += value
    print("%.3fMB"%(total/1e6))

It assumes that the data is stored in numpy arrays or pandas dataframe. If the data is stored in a variable that is not directly a global variable, but is indirectly referenced, then it will not be listed. For many applications in Data Analytics, this is sufficient. Just delete the variables (del XXX) that you no longer need and the garbage collector will recover the memory.

Thanks to that code, I realized that  Jupyter stores the result of a cell in a variable named as the cell execution number, prefixed with a _ (e.g _1), and it will keep that variable until you restart the kernel. Therefore, the proper way to display the result of a cell is to print it, not to just write it at the end of the cell code.