Tuesday, March 28, 2017

Using Java components with Jython

I recently had to work with KVStore, from Oracle, whose API only exists in Java. I ended up installing a full Java environment, but I also tested the (ultimately simpler) solution of calling the API from Jython.

Jython is a port of the Python language to the Java virtual machine. Compared to CPython, this means that libraries relying on C implementations (e.g. Numpy) won't be available. However, it can readily access Java classes and provides conversion for all the basic types between the Java and Python worlds. The nice thing about Jython is that it gives you an interpreter to play with Java libraries.

In order to install and run Jython:
  1. You will need a Java JRE
  2. Download Jython, either the stand-alone version or the installer. I use the stand-alone version below to reduce the installation footprint.
  3. If you used the installer, there's a jython.exe command in the jython/bin directory. If you use the stand-alone version, the command will be: java -jar <path to the jar>\jython-standalone-2.7.0.jar <other command line arguments>
  4. Add additional .jar files as required. In my case, I had to install 2 jars from Oracle KVStore. There are several ways to do it.
    1. The first option is to add all the .jar files to the environment variable JYTHONPATH before starting Jython.
    2. Or add them to sys.path directly from the program file. For instance:
import sys
import os
jar_dir = "<replace this by the path to the vendor jars directory>"

for i in os.listdir(jar_dir):
    sys.path.append(os.path.join(jar_dir,i))
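
For the first option (JYTHONPATH), the value can also be assembled with a few lines of plain Python rather than typed by hand. This is only a sketch: build_jythonpath is a hypothetical helper name, and it assumes that all the vendor jars sit in a single directory.

```python
import os

def build_jythonpath(jar_dir):
    """Return a JYTHONPATH-style string listing every .jar file in jar_dir."""
    jars = sorted(
        os.path.join(jar_dir, name)
        for name in os.listdir(jar_dir)
        if name.lower().endswith(".jar")
    )
    # os.pathsep is ';' on Windows and ':' on other systems
    return os.pathsep.join(jars)
```

Print the result of build_jythonpath on your jar directory and copy it into the environment variable before starting Jython.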

And then it's easy to mix Java and Python code:

from java.util import ArrayList
a=ArrayList()
a.add(1)
print a

Friday, March 24, 2017

Neural networks from the ground up with Python (part 1)

Implementing neural networks and leveraging all the power hidden in your CPU or GPU has become as simple as a few lines of Python code with libraries such as Keras and Tensorflow. Complex applications are remarkable, but how does that work? In order to understand the complex, I like to start with the simple concepts and build in complexity.

Single neuron and linear regression

A single neuron doesn't do very much: given a parameter W (the weight), a bias b and an activation function f, it calculates for X the value $\hat Y = f (WX + b)$
We will use the sigmoid function or a linear function (f(x)=x) for f in this article.
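
To make the formula concrete, here is a minimal sketch of a single neuron in plain Python, for a scalar input. The function names (neuron, sigmoid, linear) are mine and are not part of Keras.

```python
import math

def sigmoid(z):
    """Sigmoid activation: squashes z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def linear(z):
    """Linear activation: returns z unchanged."""
    return z

def neuron(x, w, b, f=linear):
    """Output of a single neuron for a scalar input x: f(w*x + b)."""
    return f(w * x + b)

print(neuron(3.0, w=0.1, b=2.0))             # linear activation: 0.1*3 + 2
print(neuron(0.0, w=1.0, b=0.0, f=sigmoid))  # sigmoid(0) = 0.5
```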

If we use a linear function, then $\hat Y = WX + b$: a straight line. Training it on a data set (X,y) with a quadratic error function is equivalent to doing a linear regression: in both cases, we look for the solution of $$ \min_{W,b} \sum_{i} | Wx_i+b - y_i |^2 $$ The neuron is just slower and less accurate, because linear regression uses an analytic result instead of running an optimization. Here's the code, assuming that you installed Keras. I use it with Tensorflow:

import numpy as np
import keras
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras import optimizers
import keras.objectives as losses

# build the model with Keras: 1 layer, 1 neuron
model = Sequential()
model.add(Dense(1, input_dim=1))

# prepare the training set
np.random.seed(42)
X_train = (np.random.rand(50)*10).reshape(-1,1)
Y_train = 0.1*X_train+2 + (np.random.randn(50)*.1).reshape(-1,1)

# use the mean square error and optimize with gradient descent
model.compile(loss=losses.mean_squared_error, optimizer='sgd')
model.fit(X_train, Y_train, nb_epoch=200, verbose=False) 

# display the result
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(X_train.squeeze(),model.predict(X_train).squeeze(),'.r')
plt.scatter(X_train.squeeze(),Y_train.squeeze()) 

And the result is:

That's a very inefficient way to do a linear regression, but it illustrates training and prediction well.
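
To see what the optimization does, here is a sketch of plain full-batch gradient descent on the same mean squared error, written without any library. It is not what Keras runs internally (its 'sgd' optimizer works on mini-batches), and the learning rate and number of steps are values I picked for this data set.

```python
import random

# same kind of training set as above: y = 0.1*x + 2 plus a little noise
random.seed(42)
xs = [random.uniform(0, 10) for _ in range(50)]
ys = [0.1 * x + 2 + random.gauss(0, 0.1) for x in xs]

def gradient_descent(xs, ys, lr=0.01, steps=10000):
    """Minimize the mean of (w*x + b - y)^2 over w and b."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # partial derivatives of the mean squared error
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # step against the gradient
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w, b = gradient_descent(xs, ys)
print(w, b)
```

The bias converges much more slowly than the weight, which is why so many steps are needed before b gets close to 2.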

With N inputs instead of one, the neuron would calculate a linear combination of a vector of size N: X=(X[0],...,X[N-1])


We can obtain W and b with the following lines:
for layer in model.layers:
    h=layer.get_weights()
    print (h)
Which yields:
[array([[ 0.13543533]], dtype=float32), array([ 1.79489422], dtype=float32)]
That's 0.13 instead of 0.1 and 1.79 instead of 2. Not great, but after 500 more iterations I get W=0.09437597 and b = 2.00935483. Getting close!


By the way, the proper way to do linear regression is of course:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train,Y_train)
print(lr.intercept_,lr.coef_)
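
For a single input variable, the analytic result mentioned earlier fits in a few lines. The sketch below uses the textbook least-squares formulas (slope = covariance / variance); it is not scikit-learn's implementation, and fit_line is a name I made up.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = w*x + b for scalar x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b

# points lying exactly on y = 0.1*x + 2
print(fit_line([0.0, 1.0, 2.0, 3.0], [2.0, 2.1, 2.2, 2.3]))
```

No iteration and no learning rate: the answer comes out in one pass over the data.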

Tuesday, March 14, 2017

Monitoring memory usage in a Jupyter notebook

As I was working on a Jupyter notebook, I realized that my computer was slowing down dramatically. A quick check on the memory showed that my Jupyter notebook was growing out of control. We don't necessarily do "big data", but running data analysis on reasonably large data sets certainly has a cost.

There is the command %whos but it doesn't show everything, and the size of the data is not easy to read. So I wrote the little piece of code below: it displays all objects that use more than 1MB, and the total. The line starts with the list of names that are used in the notebook for that object.

import sys
import numpy as np
import pandas as pd
def show_mem_usage():
    '''Displays memory usage from inspection
    of global variables in this notebook'''
    gl = sys._getframe(1).f_globals
    sizes = {}
    # pd.Panel no longer exists in recent pandas versions
    panel_type = getattr(pd, 'Panel', ())
    for k,v in list(gl.items()):
        # for pandas dataframes and series
        if hasattr(v, 'memory_usage'):
            mem = v.memory_usage(deep=True)
            if not np.isscalar(mem):
                mem = mem.sum()
        else:
            # work around for a bug: measure the underlying array of a Panel
            if isinstance(v, panel_type):
                v = v.values
            mem = sys.getsizeof(v)
        # one entry per object: [size, name1, name2, ...]
        sizes.setdefault(id(v),[mem]).append(k)
    total = 0
    for value,*names in sizes.values():
        if value>1e6:
            print(names,"%.3fMB"%(value/1e6))
        total += value
    print("%.3fMB"%(total/1e6))

It assumes that the data is stored in numpy arrays or pandas dataframes. If the data is not directly held by a global variable but only indirectly referenced (for example inside a list or a dict), then it will not be counted. For many applications in Data Analytics, this is sufficient. Just delete the variables (del XXX) that you no longer need and the garbage collector will recover the memory.
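
The limitation on indirectly referenced data comes from sys.getsizeof, which only measures the container itself and not what it points to. The sketch below illustrates the difference with a small recursive helper. deep_sizeof is a hypothetical name, it only follows the built-in containers, and the result is an approximation.

```python
import sys

def deep_sizeof(obj, seen=None):
    """Approximate size of obj plus the objects held in built-in containers."""
    seen = set() if seen is None else seen
    if id(obj) in seen:  # count shared objects only once
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_sizeof(k, seen) + deep_sizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_sizeof(item, seen) for item in obj)
    return size

# a small dict that indirectly holds about 3MB of strings
data = {"rows": ["x" * 1000000 for _ in range(3)]}
print(sys.getsizeof(data), deep_sizeof(data))
```

The first number stays tiny because it only covers the dict structure, while the second one includes the strings.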

Thanks to that code, I realized that Jupyter stores the result of a cell in a variable named after the cell execution number, prefixed with a _ (e.g. _1), and it keeps that variable until you restart the kernel. Therefore, the proper way to display the result of a cell is to print it, not to just write it at the end of the cell code.

Tuesday, March 7, 2017

Distributing Jupyter Workbook with images

Jupyter is an awesome tool. It's an easy way to share information mixed with code, and the Jupyter extension associated with Chrome makes it even easier to insert images.
However, when images are included in a notebook, the notebook file actually contains only a link to the image file:
 <img src="QERDFASRTSFDGSDTRWERDSFG.PNG"/>
which means that if you send the .ipynb file, the images will appear as broken.

The first solution that comes to mind is to zip the notebook and the images into one file and distribute that. It works perfectly well, but it's rather inconvenient if there are many images. Managing deleted images is also quite painful.

The second solution is to embed the image data in the tag. This is possible with the data option in the image source:
<img src="data:image/png;base64,<data>"/>

That unfortunately doesn't work: Jupyter sanitizes the code and removes the data. The result is still a broken image.

Here's a simple solution based on the fact that Jupyter doesn't sanitize Python code. It is therefore possible to include images by writing in a code cell:
display(HTML('<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAACAAAAAgCAMAAABEpIrGAAAAYFBMVEX///82aZT/zj7/2m7/1FZDcpv/88+btMra4+v/55//8MKCobxchahoj6//5JP/3Xv/12K0x9fy9vj/6qv/0UqnvdD/9tv/+ef/4Ib/7bdPfKHB0d6Oq8N0l7Xn7fLN2eQE1eizAAAA2klEQVQ4jZ3SSRaDIBAE0KYBFQxmUDMP979lVAiQUMkitXBhfeHRSPTKbrw5MUW79kEgjRYpLQArUVMbRV0CIZZHiEZgS2PaBIG3/AE+YP7qshJliJkHs/RbUHvA3E/9Tv8AwwRG2AfAFWXTy+OsB2YeMYhupAfr4vxLbk3ne5YZqO/x4DZ8z6wSaOyGy0wruPC9BfUCrmEBBYGJgyTY8yFOEgOZ7gKDyt+mO0OgZJVfenj7/UcBm/8NejIIpEHNel2CUwTHQ8dVCcimNRTqZ9JJxRtp9t/PCvIEy2AHqB9InbEAAAAASUVORK5CYII="/>'))
So in short the idea is to convert all images to a code cell such as the above.
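
For a single image, the line of such a code cell can be produced with the standard base64 module. This is a minimal sketch: image_cell_source is a hypothetical helper, and the MIME type defaults to PNG as an assumption.

```python
import base64

def image_cell_source(image_bytes, mime="image/png"):
    """Return the code-cell line that displays image_bytes inline."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return "display(HTML('<img src=\"data:%s;base64,%s\"/>'))" % (mime, encoded)

# first four bytes of a PNG file, just to show the shape of the output
print(image_cell_source(b"\x89PNG"))
```

With a real image, pass the bytes read from the file opened in 'rb' mode.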
Doing this manually on an actual Jupyter notebook is rather tedious, so I'm including the following code that does the job, and adds some code to hide the added code:
import json,base64
import re
import sys


def embed_images(notebookname):
    with open(notebookname,encoding='utf-8') as f:
        notebook = json.load(f)
    file_re = re.compile('<img +src="(.*)"/>')
    for cell in notebook['cells']:
        if cell['cell_type']=='code':
            source = cell['source']
            # the marker is embedded in the generated line, so test for containment
            if source and '#dispimage__' in source[0]:
                print('found it')
                break
    else:
        notebook['cells'].insert(0,{
                "cell_type": "code", "execution_count": None,
                "metadata": {"collapsed": True},
                "outputs": [],
                "source": ['from IPython.core.display import HTML,display\n']
                })
        notebook['cells'].append({
                "cell_type": "code", "execution_count": None,"metadata": {"collapsed": True},
                "outputs": [],
                "source": [
     # each source line needs its own trailing newline in the notebook format
     '# Embedded image display: all images are included in python code\n',
     '# so the file can be distributed without attached image files\n',
     'display(HTML("""\n',
     '<script>function sel(){return $("div.input:contains(\'#dispimage__\')")};\n',
     '$(function(){setTimeout(function() {sel().hide()},3000)});\n',
     '</script><button onclick="sel().toggle()">Show/hide</button>\n',
     '"""))']
                })
    # embed images
    cells = notebook['cells']
    for i in range(len(cells)):
        cell = cells[i]
        source = cell['source']
        if cell['cell_type']=='markdown' and len(source)==1:
            m = file_re.match(source[0])
            if m:
                filename = m.group(1)
                print(filename)
                with open(filename,'rb') as f:
                    s = f.read()
                image = base64.b64encode(s)
                cells[i] = {
                    "cell_type": "code",
                    "execution_count": None,
                    "metadata": {
                        "collapsed": True
                    },
                    "outputs": [],
                    "source": [
     # the marker goes in a trailing comment so it is not rendered next to the image
     "display(HTML('<img src=\"data:image/png;base64,%s\"/>')) #dispimage__"%image.decode('utf-8')
                        ]
                    }

    with open('embed_'+notebookname,'w',encoding='utf-8') as f:
        json.dump(notebook,f)

if __name__ == '__main__':
    embed_images(sys.argv[1])

All you need to do is to copy the above code into a file, then run
python [thatfilename.py] [yournotebook.ipynb]

It creates a new notebook whose name starts with embed_. The script adds a button at the end of the notebook to hide or display the added code.