Monday, May 29, 2017

Size matters

In a recent project whose objective was shot transition detection, the simplest algorithm I tested required calculating the standard deviation of each image. With numpy, assuming that X is an array containing the images, it's as simple as:

import numpy as np

N = X.shape[0]          # number of images
X0 = X.reshape((N, -1)) # flatten each image into a single row
np.std(X0, axis=1)      # one standard deviation per image

It's however relatively slow. The formula is simple: the variance (the square of the std) can be calculated as
np.mean(X0**2, axis=1) - np.mean(X0, axis=1)**2
or
np.mean((X0 - np.mean(X0, axis=1, keepdims=True))**2, axis=1)
and the std is its square root.

I tested the first formula as written above, on the uint8 data (unsigned 8-bit integers). There was a good surprise: it was much faster. And a bad one: the result was completely wrong. The reason is that each image is made of pixels, each coded as 3 bytes containing the luminosity of the red, green and blue channels, from 0 to 255. For memory efficiency they are stored as unsigned bytes, whose range is also 0 to 255. The problem is that, by default, numpy stores the result of an operation in an array of the same type as its operands, which means that instead of calculating $X^2$, it calculates $X^2 \bmod 256$.
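A minimal sketch of the overflow (the array name and values are only illustrative):

import numpy as np

pixels = np.array([3, 100, 200], dtype=np.uint8)
print(pixels ** 2)                    # [ 9, 16, 64]: 100**2 and 200**2 wrap modulo 256
print(pixels.astype(np.uint16) ** 2)  # [ 9, 10000, 40000]: correct once the type is widened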

Most of the time, numpy is smart and runs the calculation in the proper type: for instance np.std above uses double precision floating point numbers (float64). However, converting to that type takes time, which explains the poor performance. Since I also needed to calculate the mean of the array, the best performance I could achieve was with the following:

mean = np.mean(X0, axis=1)  # one mean per image, computed as float64
np.sqrt(np.mean(X0.astype(np.uint16)**2, axis=1) - mean**2)

The conversion to uint16 is required for the squaring, but takes less time than a conversion to double precision. Numpy still computes the means as doubles, but we only need one per image, not one per pixel. For an array of 4000 images of 245760 pixels each, the calculation now takes 1.36s compared to 2.46s initially.
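Putting the pieces together, a small helper along these lines computes the per-image std (the function name is mine, not from the original code):

import numpy as np

def per_image_std(X):
    """Standard deviation of each image in a uint8 array of shape (N, ...)."""
    X0 = X.reshape((X.shape[0], -1))                     # flatten each image
    mean = np.mean(X0, axis=1)                           # one float64 mean per image
    sq_mean = np.mean(X0.astype(np.uint16)**2, axis=1)   # uint16 avoids the uint8 overflow
    return np.sqrt(sq_mean - mean**2)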

Friday, May 12, 2017

Python and Excel

In personal projects as well as in my job, I've often had to load data coming from Excel files. Tens of them. In all sorts of flavors, shapes and versions: csv, xls, xlsx and more, such as pdf exports. The worst of it is that even inside a workbook each sheet has its own formatting: tables separated by blank lines, changing headers, strangely formatted dates; nothing can stop the creativity of the authors.

There are many Python packages to load them: csv is handled well by the included csv module, by numpy or by pandas. xlrd can read xls files with formatting and xlsx without. openpyxl can read and write xlsx files. But I felt I needed a higher level package that could handle the layout inside a sheet.
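To give an idea of the level these packages work at, loading looks roughly like this with pandas and openpyxl (file and sheet names are placeholders):

import pandas as pd
from openpyxl import load_workbook

# pandas: one call per file, the whole sheet comes back as a single DataFrame
df = pd.read_csv("report.csv")
df = pd.read_excel("report.xlsx")

# openpyxl: cell-level access, all layout handling is left to the caller
wb = load_workbook("report.xlsx")
ws = wb["Sheet1"]
value = ws.cell(row=1, column=1).value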

Here's Sheetparser, a Python module whose idea is to describe the layout of a sheet with spatial patterns, allowing for changes in the actual position of the elements.
Let's take an example: the sheet below is made of 2 tables and some lines, separated by empty lines:


If the size of the tables changes between 2 versions of the file, it can become quite tedious to write the code to read the data. With sheetparser it's as simple as:

from sheetparser import *

sheet = load_workbook(filename)['Sheet2']
pattern = Sheet('sheet', Rows,
                Table, Empty, Table, Empty,
                Line, Line)
context = PythonObjectContext()
pattern.match_range(sheet, context)
assert context.table.data[0][0] == 'a11'
And it's easy to make that code tolerant to small layout changes. I think the library works well, and it changed the way I load and write Excel sheets programmatically.

It was also a good exercise to learn how to properly publish a package.

Tuesday, May 9, 2017

Neural networks from the ground up with python (part 2)

In a previous post, I showed that a 1-layer "network" with a linear activation function implemented a linear regression. Let's now use the sigmoid function and do some classification.

The activation function: 1D classifier

Classification is about tagging data: associating a class with each input. Let's take the following very simple problem: we have 40 values between 0 and 1. Each of them is tagged with 0 or 1: 0 if the value is below 0.5 and 1 otherwise.

import numpy as np

size = 40
X_train = np.linspace(0, 1, size)  # 40 evenly spaced values between 0 and 1
y_train = (X_train > 0.5)          # tag is 1 when the value is above 0.5

We want to predict the tag given the value.

Let's use a single neuron, which does the following calculation:
$WX+b$.
We now add an activation function, a function that further transforms the result of that calculation. We use the sigmoid, defined as $\sigma(x) = \frac{1}{1+e^{-x}}$.

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

We now calculate our predicted Y as $\sigma(WX+b)$. W changes the slope and b (in fact $-b/W$) shifts the curve horizontally.

So the idea is to find the right parameters W and b so that the S-shaped function is the "closest" to the training data. This is done by defining a distance function and minimizing the loss, which is the sum of the distances over the training set.

We see in the figure on the right a poorly fitted S shape, with W=15 and b = -0.3*W. Clearly we can do better.
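For illustration, the predictions and a binary cross-entropy loss for those hand-picked parameters can be computed like this, using the sigmoid and training data defined above (this snippet is mine, not from the original post):

W, b = 15.0, -0.3 * 15.0
y_pred = sigmoid(W * X_train + b)   # predicted probability of class 1 for each value

# binary cross-entropy: the "distance" between predictions and tags, averaged over the set
eps = 1e-7  # avoids log(0)
loss = -np.mean(y_train * np.log(y_pred + eps) + (1 - y_train) * np.log(1 - y_pred + eps))
print(loss)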


The best W and b are found as the parameters that minimize the error. That can be done very simply with Keras as follows:
from keras import losses
from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()
model.add(Dense(1, input_dim=1))   # one neuron: W*X + b
model.add(Activation("sigmoid"))   # sigmoid activation
model.compile(loss=losses.binary_crossentropy, optimizer='sgd')
model.fit(X_train, y_train, nb_epoch=1000, verbose=False)

The result is a well-centered sigmoid function that properly predicts the tag.
This is actually equivalent to doing a logistic regression.
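One possible sanity check (not from the original post) is to read back the learned parameters and the predictions from the model trained above:

W_learned, b_learned = model.get_weights()   # kernel and bias of the Dense layer
print(W_learned, b_learned)                  # -b/W should end up close to 0.5

y_pred = model.predict(X_train).ravel() > 0.5
print((y_pred == y_train).mean())            # fraction of correctly predicted tags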