Pandas-Select-Dtypes

On my team we often have to change the data types of columns in a pandas.DataFrame for a variety of reasons. The main one is an artifact of EDA: a file is read in via pandas but the data types come out somewhat wonky (e.g. dates show up as strings, or a column that should be an integer comes in as a float, etc.). The best solution, I think, is to use the dtype keyword argument of whichever pd.read_X method is doing the reading. However, there is another way, which is to coerce the data types at runtime instead of at load time.
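To make the two options concrete, here is a minimal sketch of load-time versus runtime coercion. The CSV text and the column names (id, signup_date, score) are made up for illustration; note that dates are usually handled through parse_dates rather than dtype when reading a CSV.

```python
import io

import pandas as pd

# Hypothetical CSV contents, inlined so the example is self-contained.
csv_text = "id,signup_date,score\n1,2021-01-05,3.5\n2,2021-02-10,4.0\n"

# Load time: declare the dtypes up front; date columns go through parse_dates.
df_load = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"id": "int64", "score": "float64"},
    parse_dates=["signup_date"],
)

# Runtime: read with the defaults first, then coerce afterwards.
df_run = pd.read_csv(io.StringIO(csv_text))
df_run["signup_date"] = pd.to_datetime(df_run["signup_date"])

print(df_load.dtypes)
print(df_run.dtypes)
```

Either way the date column ends up as a datetime; the load-time version just keeps the cleanup next to the read call.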

A handy way to do the runtime coercion is pandas.DataFrame.select_dtypes, which returns only the columns whose dtype matches what you ask for.

Here is an example that finds the columns read in as datetime64 by hand, for the case where the developer would prefer to work with pandas datetimes.

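import pandas as pd

df = pd.read_csv("./file-with-confusing-dtypes.csv")
for c in df.columns:
    # Comparing the dtype to the string "datetime64" can miss datetime64[ns], so use the helper.
    if pd.api.types.is_datetime64_any_dtype(df[c]):
        df[c] = pd.to_datetime(df[c])  # df[c], not df.c, which looks for a column named "c"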

Here is the difference in code flow between select_dtypes and manually finding the datetime64 columns:

df = pd.read_csv("./file-with-confusing-dtypes.csv")
# Iterating over the filtered frame yields the matching column names.
for c in df.select_dtypes(include="datetime64"):
    df[c] = pd.to_datetime(df[c])
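The same pattern covers the "integer that came in as a float" case mentioned at the top. This is only a sketch: the frame and its column names (user_id, score, name) are invented for illustration, and casting whole-number floats to the nullable Int64 dtype is one possible choice, not the only one.

```python
import pandas as pd

# Hypothetical frame: user_id should be an integer but arrived as float.
df = pd.DataFrame({
    "user_id": [1.0, 2.0, 3.0],
    "score": [0.5, 1.25, 3.0],
    "name": ["a", "b", "c"],
})

# select_dtypes returns a new frame, so reassigning columns on df inside the loop is safe.
for c in df.select_dtypes(include="float64"):
    col = df[c]
    # Only cast columns whose non-missing values are all whole numbers.
    if (col.dropna() % 1 == 0).all():
        df[c] = col.astype("Int64")

print(df.dtypes)
```

Here only user_id is cast; score keeps its float dtype because it holds fractional values, and name is never visited.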

The difference isn't huge, but it's the little steps in leveling up that turn script-kiddie scripts into clean-looking functions.
