I often struggle to remember the correct way to do and
type comparisons when working in pandas.
I remember learning long long ago that and
and &
are different, the former being lazy boolean evaluation whereas the latter is a bitwise operation.
I learned a lot from this SO post
Lists
Python list
objects can contain unlike elements - ie. [True, 'foo', 1, '1', [1,2,3]]
is a valid list with booleans, strings, integers, and another list.
Because of this, we can't use &
to compare two lists since they can't be combined in a consistent and meaningful way.
However we can use and
since it doesn't do bitwise operations, it just evaluates the boolean value of the list (basically if it's non-empty then bool(my_list)
evaluates to True
)
Here's an example:
sandbox NO VCS via 3.8.11(sandbox) ipython
❯ my_list = [1, "2", "foo", [True], False]
sandbox NO VCS via 3.8.11(sandbox) ipython
❯ bool(my_list)
True
If we compare my_list
with another_list
using and
then the comparision will go:
if bool(my_list):
if bool(another_list):
<operation>
else:
break
Let's see another example:
sandbox NO VCS via 3.8.11(sandbox) ipython
❯ another_list = [False, False]
sandbox NO VCS via 3.8.11(sandbox) ipython
❯ my_list and another_list
[False, False]
bool(my_list)
evaluated to True
, and bool(another_list)
also evaluated to True
even though it's full of False
values because the object is non-empty.
sandbox NO VCS via 3.8.11(sandbox) ipython
❯ if my_list and another_list:
...: print("foo")
foo
So using and
in this case results in a True
conditional, so the print
statement is executed.
Feels kind of counter-intuitive at first glance, to me anyways...
However, we can't use &
because there isn't a meaningful to do bitwise operations over these two lists:
sandbox NO VCS via 3.8.11(sandbox) ipython
❯ my_list & another_list
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ <ipython-input-19-a2a16cebb3da>:1 in <cell line: 1> │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: unsupported operand type(s) for &: 'list' and 'list'
Numpy
numpy
arrays are special and they have a lot of fancy vectorization utilities built-in which make them great and fast for mathematical operations but now our logical comparisons need to be handled with a different kind of care.
First thing though - without some trickery they do not hold mixed data types like a list
does (necessary, I think, for the vectorized optimization that numpy is built on top of)
With that out of the way here's the main thing for this post, we can't just evaluate the bool
of an array - numpy says no no no.
sandbox NO VCS via 3.8.11(sandbox) ipython
❯ arr = np.array(["1", 2, True, False])
sandbox NO VCS via 3.8.11(sandbox) ipython
❯ arr
array(['1', '2', 'True', 'False'], dtype='<U21')
sandbox NO VCS via 3.8.11(sandbox) ipython
❯ bool(arr)
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ <ipython-input-25-4e8c5dd85b93>:1 in <cell line: 1> │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
This means that using
and
withnumpy
arrays doesn't really make sense because we probably care about the truth value of each element (bitwise), not the truth value of the array.
Notice that when I print arr
all the elements are a string - and the dtype
is <U21
for all elements.
This is not how I instantiated the array so be aware of that behavior with numpy.
<U21
is a dtype expressing the values are 'Little Endian', Unicode, 12 characters. See here for docs for docs
So for logical comparisions we should look at the error message then...
Our handy error message says to try any
or all
Because the datatypes in this example are basically strings, using arr.any()
will result in an error that I do not fully understand, but any(arr)
and all(arr)
work...
sandbox NO VCS via 3.8.11(sandbox) ipython
❯ if arr.any():
...: print("foo")
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ <ipython-input-48-25ecac52db96>:1 in <cell line: 1> │
│ /home/u_paynen3/personal/sandbox/.venv/sandbox/lib/python3.8/site-packages/numpy/core/_methods.p │
│ y:57 in _any │
│ │
│ 54 def _any(a, axis=None, dtype=None, out=None, keepdims=False, *, where=True): │
│ 55 │ # Parsing keyword arguments is currently fairly slow, so avoid it for now │
│ 56 │ if where is True: │
│ ❱ 57 │ │ return umr_any(a, axis, dtype, out, keepdims) │
│ 58 │ return umr_any(a, axis, dtype, out, keepdims, where=where) │
│ 59 │
│ 60 def _all(a, axis=None, dtype=None, out=None, keepdims=False, *, where=True): │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
UFuncTypeError: ufunc 'logical_or' did not contain a loop with signature matching types (None, <class 'numpy.dtype[str_]'>) -> None
sandbox NO VCS via 3.8.11(sandbox) ipython
❯ if all(arr):
...: print("foo")
foo
sandbox NO VCS via 3.8.11(sandbox) ipython
❯ if any(arr):
...: print("foo")
foo
Let's change the example to just use integers and see what happens:
sandbox NO VCS via 3.8.11(sandbox) ipython
❯ arr2 = np.array([1, True, False])
sandbox NO VCS via 3.8.11(sandbox) ipython
❯ arr2
array([1, 1, 0])
sandbox NO VCS via 3.8.11(sandbox) ipython
❯ if arr2.any():
...: print("foo")
foo
sandbox NO VCS via 3.8.11(sandbox) ipython
❯ if arr2.all():
...: print("foo")
Ah, now some sanity...
First, the booleans are stored as integers, which based on this discussion makes sense.
Next we check if any
values (this is a bitwise operation) are True
, which we see they are so the conditional evaluates to True
.
Howver, if we check that all
values are True
we see they aren't, the last value is False
or 0
so the conditional fails.
This is a different way to evaluate logical conditions than with lists and it's because of the special nature of numpy arrays that allows them to be compared bitwise but on the flip side, there isn't a meaningful way to evaluate the truth value
of an array.
Pandas
Now for pandas
, which under the hood is a lot of numpy
but not fully.
pandas.Series
objects can hold mixed data types like lists, however to logically evaluate truth values we have to treat them like numpy arrays.
sandbox NO VCS via 3.8.11(sandbox) ipython
❯ s = pd.Series([1, "foo", True, False])
sandbox NO VCS via 3.8.11(sandbox) ipython
❯ s
0 1
1 foo
2 True
3 False
dtype: object
sandbox NO VCS via 3.8.11(sandbox) ipython
❯ bool(s)
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ <ipython-input-60-68e48e81da14>:1 in <cell line: 1> │
│ /home/u_paynen3/personal/sandbox/.venv/sandbox/lib/python3.8/site-packages/pandas/core/generic.p │
│ y:1527 in __nonzero__ │
│ │
│ 1524 │ │
│ 1525 │ @final │
│ 1526 │ def __nonzero__(self): │
│ ❱ 1527 │ │ raise ValueError( │
│ 1528 │ │ │ f"The truth value of a {type(self).__name__} is ambiguous. " │
│ 1529 │ │ │ "Use a.empty, a.bool(), a.item(), a.any() or a.all()." │
│ 1530 │ │ ) │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Just like with numpy, we can't evaluate the truth value of the series in a meaningful way, but bitwise operations make perfect sense...
❯ if s.any():
...: print("foo")
foo
sandbox NO VCS via 3.8.11(sandbox) ipython
❯ if s.all():
...: print("foo")
I thought this was about and
and &
...
Right, so recall that and
is a lazy boolean evaluation (ie. it evaluates the 'truth value' an object) whereas &
does bitwise comparison.
What we see then with pandas
and numpy
is that if we want to do logical comparisons, we need to do them bitwise, ie. use &
.
Keep in mind though that the data types make a big deal - we can't use &
with strings because the bitwise operation isn't supported, for strings we need to use the boolean evaluation.
The Original Point
My main use case for this is finding elements in a dataframe/series based on 2 or more columns aligning row values...
Say I have a dataframe like this:
sandbox NO VCS via 3.8.11(sandbox) ipython
❯ df
s s2 s3
0 1 0 foo
1 1 a bar
2 1 b baz
3 2 a fee
4 2 0 fi
Example use case is I want to get the values in s3
where s
is 1 and s2
is 'a'. ie. I'm just after bar
for now...
Up until now I've always just tried df.s3[(df.s == 1) and (df.s2 == "a")]
the first time and every single time I've gotten this error that I just haven't ever fully understood:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
But after this deep dive I think I've grasped that and
doesn't actually do what I want here, and in order to do the bitwise comparision I need to use &
sandbox NO VCS via 3.8.11(sandbox) ipython
❯ df.s3[(df.s == 1) & (df.s2 == "a")]
1 bar
Name: s3, dtype: object
End
Hopefully this set of ramblings brings some clarity to and
and &
and you can Google one less error in the future in your logical comparison workflows 😄