For a while now, I've been using Pandas to do input/output of Excel and CSV files. It's great! It detects encodings, deals with types such as datetimes and missing values, and can handle different formats (xls, xlsx, xlsm, xlsb, ...), all from a single function call. Highly recommend.
However, that is all I use Pandas for: input/output of Excel and CSV files. I don't use it to mangle data. I don't use it to write to SQL databases. I don't even use it for batch operations on columns. All I want is for my data to get out of the Excel files and into native Python objects as fast as possible.
The reason for this is simple: I want to use the tools that are present in the Python ecosystem to do my job. Python has been around since 1991, and in the time since, a wealth of Python tools have developed around the core language. This core language has no idea what a DataFrame (arguably the fundamental Pandas building block) is. As a consequence, neither do the tools.
There is a real disconnect between "the Pandas ecosystem" and "the Python ecosystem", even though the former is a subset of the latter! This is caused by a few things.
The original reason I started eschewing Pandas was simply the import time. I was working on a serverless backend for a web application, and Pandas was taking half a second to load on almost every request. Since the data was small (hundreds of rows, tens of columns), I decided to try and rewrite the backend in pure Python.
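If you want to see what that import costs on your own machine, here is a rough sketch; exact numbers depend on hardware, disk cache, and pandas version, and `python -X importtime -c "import pandas"` gives a more detailed breakdown.

```python
# Rough sketch: time how long the pandas import alone takes.
# Numbers vary with hardware, disk cache, and pandas version.
import time

start = time.perf_counter()
import pandas  # the import we're measuring

print(f"import pandas took {time.perf_counter() - start:.2f}s")
```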
To my surprise, not only did the import overhead disappear; the code itself also got faster. Now that I know a little bit more about writing fast code, I see that I could have expected this. While Pandas (Numpy) operations tend to be vectorized (i.e. rows are calculated in batches rather than one at a time), there is some overhead to obtain this privilege: data has to be transferred out of Python objects into Numpy arrays, metadata has to be tracked to make sure the fast C code can run unobstructed, and so on. There is also the kitchen sink of features that Pandas has: the Python interpreter simply has to run through a lot of code before it gets to the juicy bits. Remember, kids, the fastest code is the code that doesn't execute!
Here's a table of comparisons between pure Python, Pandas, and Numpy: adding a list of integers between 0 and 100 for various sizes. Notice that Python outruns Pandas (but not Numpy) for this simple task until about 2000 elements. And this is a task that doesn't involve much data wrangling: a straightforward add. Once you start mixing Python and Pandas code in the same place, you start paying this overhead multiple times.
I would like to add more data to substantiate the claim here, but haven't got round to it.
Size | Python | Pandas | Numpy |
---|---|---|---|
10 | 0.00075s | 0.052s | 0.00065s |
100 | 0.0041s | 0.049s | 0.00064s |
1000 | 0.034s | 0.049s | 0.0011s |
2500 | 0.082s | 0.051s | 0.0016s |
5000 | 0.16s | 0.053s | 0.0024s |
10000 | 0.33s | 0.055s | 0.0041s |
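For what it's worth, here is a minimal sketch of how such a comparison could be run. This is my reconstruction of the setup, not the script behind the numbers above, and the repetition count is an assumption, so treat the table as indicative rather than exact.

```python
# Minimal benchmark sketch: sum a list of random integers in [0, 100)
# with pure Python, a pandas Series, and a numpy array.
import random
import timeit

import numpy as np
import pandas as pd

for size in (10, 100, 1000, 2500, 5000, 10000):
    data = [random.randrange(100) for _ in range(size)]
    series = pd.Series(data)
    array = np.array(data)

    py_time = timeit.timeit(lambda: sum(data), number=1000)
    pd_time = timeit.timeit(lambda: series.sum(), number=1000)
    np_time = timeit.timeit(lambda: array.sum(), number=1000)
    print(f"{size:>6} | {py_time:.5f}s | {pd_time:.5f}s | {np_time:.5f}s")
```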
I alluded to types in the Performance section. Indeed, Numpy specifies its own data types, and for good reason: if you want to go fast, you need to make sure you're operating directly on the actual data. Part of the power of Python is that, under the hood, everything is a PyObject. This makes it possible to pass strings into functions written to add numbers and still have everything work out. The downside is that the PyObject is a box around the actual data, so getting at the underlying value always involves some indirection and overhead.
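To make the boxing concrete, here is a small sketch; the sizes printed are CPython implementation details and may differ slightly between versions.

```python
# Every Python object carries PyObject overhead; numpy stores raw values.
import sys

import numpy as np

print(sys.getsizeof(42))                         # ~28 bytes for one boxed Python int
print(np.array([42], dtype=np.int64).itemsize)   # 8 bytes per element in an int64 array
```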
Because Pandas is built on top of Numpy, it inherits these "own data types". This leads to problems when trying to combine Pandas types and regular Python types. For instance, the line `s.astype(int)` (where `s` is a Pandas Series) does different things depending on the platform you execute it on, none of them consistent with the semantics of `int`.
Python's integer type is (thanks to the magic of PyObject) arbitrarily sized. It can handle any number you ask it to, no matter how large. However, Numpy's integer types are closer to the metal: fixed-width machine integers. I ran into this casting to `int` for real when reading some data from an Excel file and having to cast a float to an integer. I got a C long instead of a Python integer. This was bad, because the maximum value you can store in a C long on Windows is a little over 2 billion, and here I was doing calculations for a fund which was investing over 10 billion euros. The data ended up containing a negative number, because the conversion a) didn't warn that it was giving me a 32-bit integer instead of the arbitrary-size integer I asked for, and b) didn't error out once the numbers did start to overflow.
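Here is a small sketch of that failure mode. I'm forcing the 32-bit type explicitly so the wraparound reproduces on any platform; on Windows this is what `astype(int)` historically mapped to.

```python
# Silent wraparound when ten billion is squeezed into a 32-bit integer.
import numpy as np

big = np.array([10_000_000_000])   # int64 array holding ten billion
print(big.astype(np.int32))        # [1410065408] -- no warning, no error
print(int(big[0]))                 # a plain Python int holds it just fine
```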
Another area where Pandas goes its own way is in dealing with time. Since Pandas was originally developed for time series analysis, it makes sense that the folks at AQR would run into the limits of Python's built-in suite of datetime types, and that they would roll their own. Unfortunately, this results in hassle every time you try to mix and match the two: they simply don't understand each other.
This means that when you start working with datetimes in Pandas, you have to learn a whole new API, new ways of doing stuff that you already know how to do in pure Python. But you also can't put pure Python time-related objects in a DataFrame, because that will automatically convert them to Pandas' version of time-related objects.
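A small sketch of that conversion in action (the exact class path printed is a pandas implementation detail):

```python
# Standard-library datetimes are converted on the way into a Series,
# and come back out as pandas Timestamps.
import datetime

import pandas as pd

s = pd.Series([datetime.datetime(2021, 1, 1)])
print(s.dtype)          # datetime64[ns]
print(type(s.iloc[0]))  # <class 'pandas._libs.tslibs.timestamps.Timestamp'>
```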
Another area is missing values. Due to historical concerns, Pandas has not one, not two, not even three, but FOUR "missing value" types: `pandas.NA`, `numpy.nan`, `pandas.NaT`, and it also recognizes good old `None`. From the historical perspective, it's understandable: first you build on top of Numpy, so you get `numpy.nan`. Then you build your own time handling type on top of that, so you need something equivalent, which turns into `pandas.NaT` (`None` is also a PyObject, so it makes sense to roll your own for performance reasons). `pandas.NA` is an attempt to unify all of the above. However, we're still stuck in the XKCD 927 phase of that process, so we have four. And even if we only had one, it still wouldn't be the same one used in regular Python.
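To see three of them show up at once (the column names here are my own illustrative choices):

```python
# One DataFrame, three flavours of "missing": NaN, NaT, and <NA>.
# The None values passed in get converted along the way.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "floats": [1.5, np.nan],                      # numpy.nan
    "dates": [pd.Timestamp("2021-01-01"), None],  # None becomes pandas.NaT
    "ints": pd.array([1, None], dtype="Int64"),   # None becomes pandas.NA
})
print(df)
print(df.isna())  # all three count as missing
```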
And this is one of my central issues with Pandas: it provides convenience around reading, slicing, and manipulating tabular data, but in the process of doing so, it pulls in a whole slew of crap that has nothing to do with that! It's not just time-related types. It's integers, floats, text types, the whole shebang. Using Pandas is almost like using an entirely different programming language.
Pandas is a very full-featured library. It does everything from data reading, to type casting, array mathematics, string manipulation, datetime wrangling, data writing, and data reshaping. All this functionality does have a cost.
The first cost is in performance. Even if you don't use a particular feature, the code still has to check whether you wanted to use that feature, which generates overhead. This doesn't seem too bad, but try reading the code for `pandas.read_excel`.
At import time, to get `pandas.read_excel` into the namespace, the interpreter goes from `pandas/__init__.py` to `pandas/io/api.py` to `pandas/io/excel/__init__.py` to `pandas/io/excel/_base.py`. That's four layers of indirection already. This is not the biggest amount of overhead, and everything is easy to find, but it gives you a flavour of what's to come.
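Rather than chasing the re-exports by hand, you can also ask Python directly where the function lives; a quick sketch:

```python
# Where does read_excel actually come from?
import inspect

import pandas as pd

print(pd.read_excel.__module__)              # pandas.io.excel._base
print(inspect.getsourcefile(pd.read_excel))  # .../pandas/io/excel/_base.py
```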
In `_base.py`, you can find the definition of `read_excel`. After skipping past the overloads (these are read by the interpreter on import, but are ignored during the rest of the program) and stepping into the definition, we find the possibility of opening an ExcelFile in case we hadn't already (I usually don't). Then this function punts all the work to `io.parse` inside a `try`/`finally` block to make sure the file gets closed at the end.
So, off to `io.parse` we go! Oh, but `io` is an object, which by now is guaranteed to be an `ExcelFile`. So that means we have to go find `ExcelFile.parse`. This happens to exist in the same file, on line 1633. Reading this method is somewhat disappointing: it just calls `self._reader.parse`, so we need to figure out what is in `self._reader` and go from there.
Let's assume we called `read_excel` with just a filepath, so we get the basic `ExcelFile`, and moreover that we're using the `openpyxl` engine. Common sense and some code reading tell us we get the `OpenpyxlReader` from `_openpyxl.py`. At this point I had already forgotten which function we're trying to call, but luckily we kept notes!
Right, `parse` it is. And this seems... not defined? In the 99 lines of `OpenpyxlReader`, there is no `def parse`. Looking closer, we see that this reader inherits from `BaseExcelReader`, which lives back in `_base.py`.
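You can confirm that hop without reading any source, though these are private modules, so the import paths below are version-dependent (and `_openpyxl.py` needs openpyxl installed):

```python
# parse isn't defined on OpenpyxlReader itself; it's inherited.
from pandas.io.excel._base import BaseExcelReader
from pandas.io.excel._openpyxl import OpenpyxlReader

print("parse" in vars(OpenpyxlReader))              # False
print(issubclass(OpenpyxlReader, BaseExcelReader))  # True
```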
Now we're finally at the meat of the method (line 675), which took us 6 hops. As you would expect, there is a decent amount of input validation happening: after all, this is a widely used library and who knows what people are going to throw at it. But you have to do all this work every time you want to read in an Excel sheet, even if you know you passed the parameters in correctly. [3]
The first bit of meaningful work (as in, actually doing something with the Excel sheet or the data it contains) doesn't happen in this method at all. It's on line 738, which gets the sheet by a specified name (or by index, on line 740). That code is actually defined back in the `OpenpyxlReader`.
After this there's a bunch more code that takes care of figuring out what kind of headers and indexes the file has, and what we should do about them. And then, as you probably guessed because it has been happening throughout this story, the function punts the remaining work of actually parsing the data to `TextParser`, which lives in `pandas/io/parsers/readers.py`. And this function just calls `__init__` on `TextFileReader`, so it's a glorified constructor which should live on `TextFileReader` in the first place. What the calling code actually does is call `read` on said class.
Let me stop for a second here and tell you this: if your use for a class is "call `__init__` and then a method", that's not a class, that's a function. Wisdom courtesy of Jack Diederich.
And then, finally, once we look at `read` (line 1742 in readers.py), we see the mythical DataFrame being created: on line 1775, we find `df = DataFrame(col_dict, columns=columns, index=index)`. Great!
But where does `col_dict` come from? Glad you asked: from `self._engine.read` (line 1758). So after you have burrowed yourself knee-deep in the layers and layers of indirection that go into this code, you don't (I certainly didn't) even remember where the engine argument was passed in. All I know is that the engine we passed in somewhere was "openpyxl".
But as it turns out, this `self._engine` is another type of engine entirely, namely one which parses a bunch of text (the clue is in the name "TextReader").
You can keep pulling at this yarn, and more stuff will keep coming out, but let's stop there. The point is, this code is very hard to read: there's a lot of hopping around while you're trying to remember what your place in the code was, a lot of back and forth, and a lot of stuff that seems only tangentially related to the task at hand.
Part of this lies in the design: Pandas is built using OOP principles, and so it seems natural to have a base class for reading Excel files that handles all the common logic, where the subclasses implement their specific piece of the puzzle. But as you can tell, this is terrible for code readability (and thus maintainability). A much cleaner design would be to write a function that does all of this header and text extraction business, taking in the data already extracted from the Excel sheet and putting it into a neat DataFrame. Extract - Transform - Load. We've known about this process since the 70s. You can even wrap it in a nice little function to maintain API compatibility, something like this:
```python
def read_excel(io, engine, sheet_name=None, **kwds):
    # Extract
    data = engine.extract_raw_data(io, sheet_name=sheet_name)  # Doesn't have to be a function!
    # Transform
    index, columns = find_columns_and_index(data, **kwds)
    parsed_data = parse_text(data, **kwds)
    # Load
    return DataFrame(parsed_data, columns=columns, index=index)
```
Another factor is the giant number of features the Excel reader has, and the fact that `read_excel` is the entry point for reading Excel-related files, no matter what kind of file you are reading. It has many, many options: some of them relate to the extraction step, some to the transformation step, and some to the load step. One example is `storage_options`, which gets passed on to the ExcelFile constructor and influences how the file is opened. It allows you to open an Excel file from a URL.
Yes, you can open Excel files from URLs. And via `storage_options` you can even pass authentication headers, like a Bearer token, to the underlying urllib3 request. I mean, that's a nice feature, but what's wrong with just passing in a BytesIO?
```python
import pandas as pd
from io import BytesIO
import requests

# Download the file ourselves, then hand pandas the raw bytes.
response = requests.get('https://github.com/pandas-dev/pandas/blob/main/pandas/tests/io/data/excel/test1.xlsm?raw=true')
response.raise_for_status()
df = pd.read_excel(BytesIO(response.content))
```
With this method, you can also adjust the security settings of your request, the data passed in, how the response is unwrapped, and so on. At my current workplace, all internal network traffic is unencrypted and then re-encrypted at the network boundary. That does lead to some SSL headaches every now and again, and having control over the network request is a boon there.
It's also more flexible: what if the Excel sheet were in a database instead, say some SQL Server which stores binary blobs of Excel files? With the above pattern, it's obvious: I just pull down the Excel file like normal, using something like `pyodbc`, and then pass it in exactly the same way. What's the alternative? Add yet another option to `read_excel`? Have `read_sql` grow the capability of parsing Excel files from database tables?
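A sketch of what that looks like, with a made-up DSN, table, and column names; the point is that `read_excel` still only ever sees a BytesIO:

```python
# Pull an Excel blob out of a database and feed it to pandas unchanged.
from io import BytesIO

import pandas as pd
import pyodbc

conn = pyodbc.connect("DSN=reports")  # hypothetical DSN
blob, = conn.execute(
    "SELECT file_contents FROM excel_files WHERE name = ?",  # hypothetical table
    "budget.xlsx",
).fetchone()
df = pd.read_excel(BytesIO(blob))
```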
So, uhh, yeah. Big rant; here's the point: Pandas is a big ball of features. A very useful big ball of features, but the features are not very orthogonal: they intersect, interlock, and interdepend. The ways in which they do so are very hard to anticipate from reading the code. I wish people would stop writing code like that.