I was struggling this afternoon to find a way of selecting a few columns of my Pandas DataFrame by checking for the occurrence of a certain pattern in their names (labels?).
I had been looking for something like contains or isin for ndarrays / pd.Series, but had no luck.
This frustrated me quite a bit, as I was already checking the columns of my DataFrame for occurrences of specific string patterns, as in:
hp = ~(df.target_column.str.contains('some_text') | df.target_column.str.contains('other_text'))
df_cln = df[hp]
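For reference, here is a minimal, self-contained sketch of that row filter. The data is made up purely for illustration; only target_column and the two patterns come from my snippet above.

```python
import pandas as pd

# Hypothetical data, invented for this example only.
df = pd.DataFrame({
    "target_column": ["some_text here", "clean row", "has other_text", "also clean"],
    "value": [1, 2, 3, 4],
})

# Boolean mask: True for rows whose target_column matches neither pattern.
hp = ~(df.target_column.str.contains("some_text")
       | df.target_column.str.contains("other_text"))

# Keep only the rows where the mask is True.
df_cln = df[hp]
print(df_cln)
```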
However, no matter how I banged my head, I could not apply .str.contains() to the object returned by df.columns (which is an Index), nor to the one returned by df.columns.values (which is an ndarray). It works fine, though, for what is returned by the "slicing" operation df[column_name], i.e. a Series.
My first solution involved a for loop and the creation of a helper list:
ll = []
for a in df.columns:
    if a.startswith('start_exp1') or a.startswith('start_exp2'):
        ll.append(a)
df[ll]
(one could apply any of the str functions, of course)
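A runnable sketch of the loop approach, assuming a small DataFrame whose column names I invented for the example. It also uses the fact that str.startswith accepts a tuple of prefixes, which shortens the condition a little.

```python
import pandas as pd

# Hypothetical DataFrame; the column names are made up for illustration.
df = pd.DataFrame(
    [[1, 2, 3]],
    columns=["start_exp1_a", "start_exp2_b", "other_c"],
)

# Collect the labels that start with either prefix.
# str.startswith accepts a tuple, so one call covers both prefixes.
ll = []
for a in df.columns:
    if a.startswith(("start_exp1", "start_exp2")):
        ll.append(a)

# Select the matching columns by label.
selected = df[ll]
print(selected)
```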
Then, I found the map function and got it to work with the following code:
import re
sel = df.columns.map(lambda x: bool(re.search('your_regex', x)))
df[df.columns[sel]]
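Here is a self-contained sketch of the map approach. The column names and the regex are placeholders I made up. As a precaution, and this is my own assumption rather than a documented requirement, the mapped result is converted to a plain NumPy boolean array before it is used as a column mask.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical DataFrame; the column names are made up for illustration.
df = pd.DataFrame(
    [[1, 2, 3]],
    columns=["start_exp1_a", "start_exp2_b", "other_c"],
)

# Placeholder pattern standing in for 'your_regex'.
pattern = r"^start_exp[12]"

# Map every column label to True/False depending on whether the regex matches.
sel = df.columns.map(lambda x: bool(re.search(pattern, x)))

# Convert to a boolean array and use it as a column mask.
mask = np.asarray(sel, dtype=bool)
selected = df.loc[:, mask]
print(selected)
```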
Of course, in the first solution I could have performed the same kind of regex checking, because I can apply it to the str data type returned by the iteration.
I am very new to Python and never really programmed anything so I am not too familiar with speed/timing/efficiency, but I tend to think that the second method - using a map - could potentially be faster, besides looking more elegant to my untrained eye.
I am curious to know what you think of it, and what possible alternatives there would be. Given my level of noobness, I would really appreciate it if you could correct any mistakes I may have made in the code and point me in the right direction.
Thanks,
Michele
EDIT: I just found the Index method Index.to_series(), which returns (ehm) a Series to which I could apply .str.contains('whatever').
However, this is not quite as powerful as a true regex, and I could not find a way of passing the result of Index.to_series().str to the re.search() function.
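A small sketch of the to_series() route, with made-up column names and pattern. As far as I can tell from the pandas documentation, Series.str.contains treats its pattern as a regular expression by default (regex=True), so it may cover the re.search() use case without an explicit call to re.

```python
import pandas as pd

# Hypothetical DataFrame; the column names are made up for illustration.
df = pd.DataFrame(
    [[1, 2, 3]],
    columns=["start_exp1_a", "start_exp2_b", "other_c"],
)

# Turn the column Index into a Series so the .str accessor is available.
labels = df.columns.to_series()

# The pattern is interpreted as a regex (regex=True is the default).
mask = labels.str.contains(r"^start_exp[12]")

# Use the underlying boolean values as a column mask.
selected = df.loc[:, mask.values]
print(selected)
```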