Friday, May 12, 2017

Python and Excel

In personal projects as well as in my job, I've often had to load data coming from Excel files. Tens of them. In all sorts of flavors, shapes and versions. csv, xls, xlsx and more, such pdf exports. The worst of if is that even inside a workbook the file has its own formatting: tables separated with blank lines, changing headers, strangely formatted dates, nothing can stop the creativity of the authors.

There are many Python packages to load them: csv is pretty well managed with the included csv module or numpy or pandas. xlrd can read xls files with formatting and xlsx without. openpyxl can read and write xlsx files. But I felt I needed a higher level package that could handle the layout inside a sheet.

Here's Sheetparser, a Python module whose idea is to describe the layout of the sheet with spatial patterns, allowing for changes in actual position of the elements. 
Let's take an example: the sheet below is made of 2 tables and some lines, separated by empty lines:


If the size of the tables change between 2 versions of the file, it can become quite tedious to write the code to read the data. With sheetpattern it's a simple as:

from sheetpattern import *
sheet = load_workbook(filename)['Sheet2']
pattern = Sheet('sheet', Rows,
                Table, Empty, Table, Empty,
                Line, Line)
context = PythonObjectContext()
pattern.match_range(sheet, context)
assert context.table.data[0][0] == 'a11'
        
And it's easy to make that code tolerant to small layout changes. I think the library works well and changed the way I load and write Excel sheets programmatically.

It was also a good exercise to learn how to properly publish a package:
 
 

No comments:

Post a Comment