Pandas.DataFrame info always answers and is sometimes right

TIL

Today I learned that the .info() of a pandas.DataFrame will always give you an answer, but it is wildly difficult to know how accurate it is, because it depends on the underlying data types. If everything is a NumPy Dtype, then the result you get is the amount of memory NumPy is using under the hood, which is very accurate. But as soon as you have something like a Python string in the data frame, now pandas will tell you how much memory the NumPy data types are using, and it will give you some quick estimate of how much memory space the Python string types are using because numpy is just storing a pointer to that data in memory, but this is far less accurate because it doesn't actually know what the strings are.

Memory Usage Deep or True

To get an honest answer about a DataFrame's memory usage, you need to pass memory_usage='deep' to the .info() method.


# This is the default
In [50]: df.info(memory_usage=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   s1      10 non-null     int64 
 1   s2      10 non-null     int64 
 2   s3      10 non-null     object
dtypes: int64(2), object(1)
memory usage: 372.0+ bytes

Notice that the memory usage has a + in it? You can force a deeper analysis by passing memory_usage='deep'


In [51]: df.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   s1      10 non-null     int64 
 1   s2      10 non-null     int64 
 2   s3      10 non-null     object
dtypes: int64(2), object(1)
memory usage: 792.0 bytes