Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
419 views
in Technique[技术] by (71.8m points)

pandas - Merging multiple CSV files based on specific column - Python

I'm trying to combine about 101 CSV files in Pandas. Each file has the 2 time columns, and a 'value' column. I'd like to keep the 2 times columns as they are the same across the CSV files, and then merge the 'value' column from each of the 101 CSVs into a new CSV file.

Using pd.merge I can combine 2 files using the below

data1 = {'time': ['00:00','01:00','02:00'], 
        'local_time': ['09:30','10:30','11:30'],
        'value': ['265.591','330.766','360.962']}

data2 = {'time': ['00:00','01:00','02:00'], 
        'local_time': ['09:30','10:30','11:30'],
        'value': ['521.217','588.034','588.034']}

df_1 = pd.DataFrame(data1)
df_2 = pd.DataFrame(data2)
locs = ['_A11','_B10']

df_test = pd.merge(df_1,df_2, on=['time','local_time'], how='inner', suffixes = (locs)
)

print(df_test)

This yields:

    time local_time value_A11 value_B10
0  00:00      09:30   265.591   521.217
1  01:00      10:30   330.766   588.034
2  02:00      11:30   360.962   588.034

However, I'm not quite sure how to combine the next 99 csv files or if this even the best way to approach this task.

I'm aiming to get something like:

    time local_time value_A11 value_B10 value_B11 ...
0  00:00      09:30   265.591   521.217       123 ...
1  01:00      10:30   330.766   588.034       456 ...
2  02:00      11:30   360.962   588.034       789 ...

Any help would be very much appreciated!

EDIT 1:

Colin's example worked, however I've been loading in the dataframes into an array as such:

import glob
import os

# create and sort list of file names in folder
fl = glob.glob('*.csv')
sorted_fl = sorted(fl)

# open csv files from list and store in df
df_list = [pd.read_csv(f, header=3) for f in sorted_fl]

#test df
df_list[0]

I was wondering how I could amend the for loop so that it can feed the array through? Thanks again!

EDIT 2: Errors from answer to edit 1

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-144-772c1d15f228> in <module>
     14 # loop through each dataframe and merge it with existing one
     15 for i, df in enumerate(df_list[1:]):
---> 16   df_output = pd.merge(df_list[0], df, on=['time','local_time'], how='inner', suffixes = (['_' + str(i), '_' + str(i+1)]))

~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
     79         copy=copy,
     80         indicator=indicator,
---> 81         validate=validate,
     82     )
     83     return op.get_result()

~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py in __init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate)
    628         # validate the merge keys dtypes. We may need to coerce
    629         # to avoid incompat dtypes
--> 630         self._maybe_coerce_merge_keys()
    631 
    632         # If argument passed to validate,

~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py in _maybe_coerce_merge_keys(self)
   1136                     inferred_right in string_types and inferred_left not in string_types
   1137                 ):
-> 1138                     raise ValueError(msg)
   1139 
   1140             # datetimelikes must match exactly

ValueError: You are trying to merge on object and float64 columns. If you wish to proceed you should use pd.concat

EDIT 3

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-3-cce982321079> in <module>
     11 
     12 # change datatype to datetime for first df
---> 13 df['local_time'] = pd.to_datetime(df_list[0]['local_time'])
     14 df['time'] = pd.to_datetime(df_list[0]['time'])
     15 

NameError: name 'df' is not defined
question from:https://stackoverflow.com/questions/65866668/merging-multiple-csv-files-based-on-specific-column-python

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

This seems like a good approach. I would just set up the merge and suffixes a little differently so you can loop through each dataframe, like below. Each new value column will be merged to df_test.

EDIT: Updated code to work with OP's edit

EDIT 2 : Fixed datatype for OP's error

import pandas as pd    
import glob
import os

# create and sort list of file names in folder
fl = glob.glob('*.csv')
sorted_fl = sorted(fl)

# open csv files from list and store in df
df_list = [pd.read_csv(f, header=3) for f in sorted_fl]

# change datatype to datetime for first df
df['local_time'] = pd.to_datetime(df_list[0]['local_time'])
df['time'] = pd.to_datetime(df_list[0]['time'])


# loop through each dataframe and merge it with existing one
for i, df in enumerate(df_list[1:]):

  # change datatype to datetime
  df['local_time'] = pd.to_datetime(df['local_time'])
  df['time'] = pd.to_datetime(df['time'])

  df_output = pd.merge(df_list[0], df, on=['time','local_time'], how='inner', suffixes = (['_' + str(i), '_' + str(i+1)]))

#print(df_output)
'''
    time local_time  value_0  value_1  value_2  value_3
0  00:00      09:30  738.591  265.591  521.217  856.217
1  01:00      10:30  217.766  330.766  588.034  346.034
2  02:00      11:30  295.962  360.962  588.034  645.034
'''

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...