So, I read CSV files that are generated using Excel.
Those can contain empty columns to the right of the data range/table and empty rows below it.
Empty here means really empty: no column header, no data whatsoever, clearly an artifact.
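For illustration, a hypothetical such file might look like this (the trailing commas are the Excel artifact; comment is a named column that is intentionally empty):

    name,age,comment,,
    Alice,30,,,
    Bob,25,,,
    ,,,,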
In a first iteration I just used

    df = pd.read_csv(path).dropna(axis=1, how='all').dropna(axis='index', how='all')

which seemed to work fine.
But it also removes legitimately empty columns: regular columns that do have a column name and really are supposed to be empty, because that is their data.
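With the hypothetical file from above, for example (path pointing at that file):

    df = pd.read_csv(path).dropna(axis=1, how='all')
    list(df.columns)  # ['name', 'age'] -- the named 'comment' column is dropped too, since it is all-NaN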
I do want to keep all columns that have a proper column name OR contain data: someone might have just forgotten to give a column a name, but it is still a proper column.
So, per https://stackoverflow.com/a/43983654/2215053 I first used

    unnamed_cols_mask = basedata_df2.columns.str.contains('^Unnamed')
    basedata_df2 = pd.concat(
        [basedata_df2.loc[:, ~unnamed_cols_mask],
         basedata_df2.loc[:, unnamed_cols_mask].dropna(axis=1, how='all')],
        axis=1)
which looks and feels clean, but it scrambles the column order.
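I suppose the same mask idea could be kept order-preserving by selecting with one combined boolean mask instead of concatenating two slices back together (a sketch, applied to the original frame; keep a column if it is properly named or holds any data):

    keep = ~unnamed_cols_mask | basedata_df2.notna().any().to_numpy()
    basedata_df2 = basedata_df2.loc[:, keep]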
So now I go with:
    df = pd.read_csv(path).dropna(axis='index', how='all')
    df = df[[col for col in df.columns
             if not col.startswith('Unnamed: ') or not df[col].isnull().all()]]
Which works.
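For reuse I could wrap this in a small helper (a sketch; drop_empty_unnamed is my own name, not a pandas API):

    def drop_empty_unnamed(df):
        # Keep a column if it has a real name or holds any data;
        # drop only auto-named columns that are entirely empty.
        keep = [col for col in df.columns
                if not col.startswith('Unnamed: ') or not df[col].isnull().all()]
        return df[keep]

    df = drop_empty_unnamed(pd.read_csv(path).dropna(axis='index', how='all'))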
But surely there is an obviously right way to accomplish this frequently occurring task?
So how could I do this better?
Specifically: is there a way to make sure that column names starting with 'Unnamed: ' were created by pd.read_csv()
and were not already present in the CSV itself?
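One direction I could imagine (a sketch, not a built-in pandas feature; it assumes a plain single-header-line CSV whose delimiter and encoding match the read_csv defaults, and no implicit index column): read the raw header line myself and treat a position as truly nameless only if its header cell is empty in the file:

    import csv

    with open(path, newline='') as f:
        raw_header = next(csv.reader(f))

    df = pd.read_csv(path)
    # a position is truly nameless only if its header cell is empty in the raw file
    nameless = [i >= len(raw_header) or raw_header[i].strip() == ''
                for i in range(df.shape[1])]
    keep = [i for i in range(df.shape[1])
            if not nameless[i] or not df.iloc[:, i].isna().all()]
    df = df.iloc[:, keep]

But this reads the file twice and feels like working around pandas rather than with it.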
question from:
https://stackoverflow.com/questions/65887119/how-to-drop-columns-missing-column-names-and-data