Tuesday, March 14, 2017

Monitoring memory usage in a Jupyter notebook

As I was working on a Jupyter notebook, I realized that my computer was slowing down dramatically. A quick check on the memory showed that my Jupyter notebook was growing out of control. We don't necessarily do "big data", but running data analysis on reasonably large data sets certainly has a cost.

There is the %whos command, but it doesn't show everything, and the size of the data is not easy to read. So I wrote the little piece of code below: it displays all objects that use more than 1MB, followed by the total. Each line starts with the list of names used in the notebook for that object.

import sys
import numpy as np
import pandas as pd
def show_mem_usage():
    '''Displays memory usage from inspection
    of global variables in this notebook'''
    gl = sys._getframe(1).f_globals
    vars= {}
    for k,v in list(gl.items()):
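        # for pandas dataframes and series; skip classes (e.g. an imported
        # DataFrame class), whose memory_usage needs an instance to be called
        if hasattr(v, 'memory_usage') and not isinstance(v, type):
            mem = v.memory_usage(deep=True)
            if not np.isscalar(mem):
                mem = mem.sum()
        else:
            # work around for a bug with pd.Panel (no longer in recent pandas)
            if isinstance(v, getattr(pd, 'Panel', ())):
                v = v.values
            mem = sys.getsizeof(v)
        # record each name once, keyed by object identity
        vars.setdefault(id(v),[mem]).append(k)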
    total = 0
    for k,(value,*names) in vars.items():
        if value>1e6:
            print(names,"%.3fMB"%(value/1e6))
        total += value
    print("%.3fMB"%(total/1e6))

It assumes that the data is stored in numpy arrays or pandas dataframes. If the data is not stored directly in a global variable but is only referenced indirectly (for example inside a list or a dict), then it will not be listed. For many applications in Data Analytics, this is sufficient. Just delete the variables that you no longer need (del XXX) and the garbage collector will recover the memory once nothing else refers to the data.
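
The limitation on indirectly referenced data can be relaxed by walking into containers. Here is a minimal sketch, not part of the function above: deep_size is a hypothetical helper that adds up the numpy arrays and pandas objects found inside dicts, lists, tuples and sets. It assumes that everything else is small enough to be measured with sys.getsizeof, so the result is only an approximation.

```python
import sys
import numpy as np

def deep_size(obj, seen=None):
    '''Approximate size in bytes of obj, following dicts, lists,
    tuples and sets (hypothetical helper, for illustration only)'''
    if seen is None:
        seen = set()
    # count every object once, even if it is referenced several times
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    if isinstance(obj, np.ndarray):
        # nbytes is the size of the array elements, without the object header
        return obj.nbytes
    # pandas dataframes and series (but not the classes themselves)
    if hasattr(obj, 'memory_usage') and not isinstance(obj, type):
        mem = obj.memory_usage(deep=True)
        return int(mem if np.isscalar(mem) else mem.sum())
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_size(k, seen) + deep_size(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_size(item, seen) for item in obj)
    return size

# an 8MB array hidden inside a dict: sys.getsizeof only sees the dict
cache = {'a': np.zeros(1000000)}
print("%.3fMB" % (sys.getsizeof(cache) / 1e6))
print("%.3fMB" % (deep_size(cache) / 1e6))
```

The seen set keeps an array that appears in several containers from being counted more than once, which mirrors the id(v) trick used in show_mem_usage.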

Thanks to that code, I realized that Jupyter stores the result of a cell in a variable named after the cell execution number, prefixed with a _ (e.g. _1), and it will keep that variable until you restart the kernel. Therefore, the proper way to display the result of a cell is to print it, not to just write it at the end of the cell code.
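
To see which of these cached results are still around, one can look for names of the form _N among the globals. This is a minimal sketch, assuming the cached results show up in the notebook's global namespace as described above. cached_outputs is a hypothetical helper, and it relies on sys.getsizeof, so containers and dataframes are under-counted.

```python
import re
import sys

def cached_outputs(namespace):
    '''Returns (name, size in bytes) for each variable named like a
    cell result (_1, _2, ...), largest first (hypothetical helper)'''
    pattern = re.compile(r'^_\d+$')
    found = [(name, sys.getsizeof(value))
             for name, value in list(namespace.items())
             if pattern.match(name)]
    return sorted(found, key=lambda item: item[1], reverse=True)

# in a notebook, call it on the notebook's own globals
for name, size in cached_outputs(globals()):
    print(name, "%.3fMB" % (size / 1e6))
```

A large entry in that list means that the result of an earlier cell is still being kept alive.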

1 comment:

  1. I tried using your code, but got the following:

    TypeError Traceback (most recent call last)
    in ()
    ----> 1 show_mem_usage()
    2
    3 # clean up the date column to move from 'object' to real date
    4
    5 # original format is '02.01.2013'

    in show_mem_usage()
    9 # for pandas dataframes
    10 if hasattr(v, 'memory_usage'):
    ---> 11 mem = v.memory_usage(deep=True)
    12 if not np.isscalar(mem):
    13 mem = mem.sum()

    TypeError: memory_usage() missing 1 required positional argument: 'self'
