Pandas-String-Contains

TL;DR

pandas.Series.str.contains treats its pattern as a regular expression by default (regex=True)!
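To see what that default means in practice, here is a minimal sketch. The Series `s` and its values are made up for illustration; the only thing being shown is how the same pattern behaves with and without `regex=False`.

```python
import pandas as pd

# Hypothetical example data, not from the session below.
s = pd.Series(["string1", "a.b", "plain"])

# Default (regex=True): "." is a regex wildcard, so it matches any character.
as_regex = s.str.contains(".")

# regex=False: "." is treated as a literal dot.
as_literal = s.str.contains(".", regex=False)

print(as_regex.tolist())    # [True, True, True]
print(as_literal.tolist())  # [False, True, False]
```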

Use case

We often need to filter pandas DataFrames based on several string values in a Series.

Notice that sweet pyflyby import 😁!

sandbox   main via 3.8.11(sandbox) ipython
❯ df = pd.DataFrame({"A": ["string1", "string2", "string3"]})
[PYFLYBY] import pandas as pd

sandbox   main via 3.8.11(sandbox) ipython
❯ df

         A
0  string1
1  string2
2  string3

sandbox   main via 3.8.11(sandbox) ipython
❯ df[df.A.str.contains('1') | df.A.str.contains('2')]

         A
0  string1
1  string2

And this isn't the worst thing in the world, especially for such a tiny example...

But what if we had dozens or more values to filter on?

Then it looks so much nicer to create an iterable of the values we want to filter on and join them with an appropriate regex operator (in this case | for inclusive or).


sandbox   main via 3.8.11(sandbox) ipython
❯ vals = ["1", "2"]  # iterable with whatever is appropriate for your use case

sandbox   main via 3.8.11(sandbox) ipython
❯ df[df.A.str.contains("|".join(vals), regex=True)]

         A
0  string1
1  string2
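One caveat with joining on |: because the pattern is a regex, any value containing special characters like ( or + will be interpreted as regex syntax. A minimal sketch of one way around that, assuming you want every value matched literally, is to run each value through the standard library's `re.escape` before joining. The extra rows and the values in `vals` here are hypothetical.

```python
import re

import pandas as pd

# Hypothetical data: the original three rows plus two with regex metacharacters.
df = pd.DataFrame({"A": ["string1", "string2", "string3", "price (USD)", "a+b"]})
vals = ["1", "(USD)", "a+b"]  # hypothetical values to match literally

# Escape each value so "(", ")" and "+" are matched as plain characters,
# then join with "|" as before.
pattern = "|".join(re.escape(v) for v in vals)

filtered = df[df.A.str.contains(pattern)]
print(filtered)
```

This should keep the rows for string1, price (USD) and a+b. If none of your values contain regex metacharacters, the plain join above is all you need.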

Fin

This is a super nice and concise way to do the kind of filtering my team does on a daily basis!