Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - how to drop columns missing column names AND data

I read CSV files that are generated with Excel. These can contain empty columns to the right of the data range and empty rows below it. Empty here means really empty: no column header, no data whatsoever, clearly an artifact.

In a first iteration I just used

pd.read_csv().dropna(axis=1, how='all', inplace=False).dropna(axis='index', how='all', inplace=False) 

which seemed to work fine. But it also removes legitimately empty columns, meaning regular columns that have a column name and are supposed to be empty because that is their data.

I want to keep every column that has a proper column name OR contains data. Someone might have just forgotten to give a column name, but it is still a proper column.

So, per https://stackoverflow.com/a/43983654/2215053 I first used

unnamed_cols_mask = basedata_df2.columns.str.contains('^Unnamed')
basedata_df2.loc[:, ~unnamed_cols_mask] + basedata_df2.loc[:, unnamed_cols_mask].dropna(axis=1, how='all', inplace=False)

which looks and feels clean, but it scrambles the column order.

So now I go with:

df = pd.read_csv().dropna(axis='index', how='all', inplace=False)
df = df[[column_name for column_name in df.columns.array if not column_name.startswith('Unnamed: ') or not df[column_name].isnull().all()]]

Which works. But there should be an obviously right way to accomplish this frequently occurring task. So how could I do this better?

Specifically: Is there a way to make sure the column names starting with 'Unnamed: ' were created by the pd.read_csv() and not originally imported from the csv?

Question from: https://stackoverflow.com/questions/65887119/how-to-drop-columns-missing-column-names-and-data


1 Answer


Unfortunately, I don't think there is a built-in function for this, and pandas.read_csv has no option for it either. But you can apply the following code:

# True for each column that contains only NaN values
all_na = df.isna().all(axis='index')
# True for each column that got a generic 'Unnamed: N' name
unnamed = df.columns.str.startswith('Unnamed: ')
# drop the columns that have no explicit name and contain only NaN values
df = df.drop(columns=df.columns[unnamed & all_na.to_numpy()])
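On the specific question of telling generated 'Unnamed: ' names apart from names that really are in the file: one possible approach, offered here as a minimal sketch and not as a built-in pandas feature, is to read the header line once more with header=None, so that empty header cells come back as empty strings and not as generated labels. The function name read_csv_drop_artifacts and the SAMPLE data are hypothetical, and the sketch assumes that every data row has the same number of fields as the header row.

```python
import io

import pandas as pd


def read_csv_drop_artifacts(source: str) -> pd.DataFrame:
    """Read CSV text and drop columns that have neither a header nor data.

    Hypothetical helper: assumes header and data rows have the same field count.
    """
    # Read only the first line as plain data. With keep_default_na=False an
    # empty header cell stays an empty string instead of becoming NaN.
    raw_header = pd.read_csv(
        io.StringIO(source), header=None, nrows=1, dtype=str, keep_default_na=False
    )
    header_missing = raw_header.iloc[0].eq("").to_numpy()

    # Regular read: pandas labels the empty header cells 'Unnamed: N'.
    df = pd.read_csv(io.StringIO(source))
    all_na = df.isna().all(axis="index").to_numpy()

    # Drop a column only if its header cell was empty AND it holds no data,
    # then drop rows that are completely empty.
    df = df.loc[:, ~(header_missing & all_na)]
    return df.dropna(axis="index", how="all")


# Sample data: 'note' and 'Unnamed: 9' are named but empty, the fourth column
# has data but no name, the last two columns and the last row are artifacts.
SAMPLE = "id,note,Unnamed: 9,,,\n1,,,7,,\n2,,,8,,\n,,,,,\n"

result = read_csv_drop_artifacts(SAMPLE)
print(list(result.columns))
```

In this sketch a column that is literally called 'Unnamed: 9' in the file is kept even though it is empty, because the decision is based on the raw header cell and not on the name prefix. The cost is that the header line is parsed twice; with a file on disk the same path would be passed to both read_csv calls.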
