I'm trying to combine about 101 CSV files in Pandas. Each file has the 2 time columns, and a 'value' column. I'd like to keep the 2 times columns as they are the same across the CSV files, and then merge the 'value' column from each of the 101 CSVs into a new CSV file.
Using pd.merge I can combine 2 files using the below
data1 = {'time': ['00:00','01:00','02:00'],
'local_time': ['09:30','10:30','11:30'],
'value': ['265.591','330.766','360.962']}
data2 = {'time': ['00:00','01:00','02:00'],
'local_time': ['09:30','10:30','11:30'],
'value': ['521.217','588.034','588.034']}
df_1 = pd.DataFrame(data1)
df_2 = pd.DataFrame(data2)
locs = ['_A11','_B10']
df_test = pd.merge(df_1,df_2, on=['time','local_time'], how='inner', suffixes = (locs)
)
print(df_test)
This yields:
time local_time value_A11 value_B10
0 00:00 09:30 265.591 521.217
1 01:00 10:30 330.766 588.034
2 02:00 11:30 360.962 588.034
However, I'm not quite sure how to combine the next 99 csv files or if this even the best way to approach this task.
I'm aiming to get something like:
time local_time value_A11 value_B10 value_B11 ...
0 00:00 09:30 265.591 521.217 123 ...
1 01:00 10:30 330.766 588.034 456 ...
2 02:00 11:30 360.962 588.034 789 ...
Any help would be very much appreciated!
EDIT 1:
Colin's example worked, however I've been loading in the dataframes into an array as such:
import glob
import os
# create and sort list of file names in folder
fl = glob.glob('*.csv')
sorted_fl = sorted(fl)
# open csv files from list and store in df
df_list = [pd.read_csv(f, header=3) for f in sorted_fl]
#test df
df_list[0]
I was wondering how I could amend the for loop so that it can feed the array through? Thanks again!
EDIT 2: Errors from answer to edit 1
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-144-772c1d15f228> in <module>
14 # loop through each dataframe and merge it with existing one
15 for i, df in enumerate(df_list[1:]):
---> 16 df_output = pd.merge(df_list[0], df, on=['time','local_time'], how='inner', suffixes = (['_' + str(i), '_' + str(i+1)]))
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
79 copy=copy,
80 indicator=indicator,
---> 81 validate=validate,
82 )
83 return op.get_result()
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py in __init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate)
628 # validate the merge keys dtypes. We may need to coerce
629 # to avoid incompat dtypes
--> 630 self._maybe_coerce_merge_keys()
631
632 # If argument passed to validate,
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py in _maybe_coerce_merge_keys(self)
1136 inferred_right in string_types and inferred_left not in string_types
1137 ):
-> 1138 raise ValueError(msg)
1139
1140 # datetimelikes must match exactly
ValueError: You are trying to merge on object and float64 columns. If you wish to proceed you should use pd.concat
EDIT 3
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-3-cce982321079> in <module>
11
12 # change datatype to datetime for first df
---> 13 df['local_time'] = pd.to_datetime(df_list[0]['local_time'])
14 df['time'] = pd.to_datetime(df_list[0]['time'])
15
NameError: name 'df' is not defined
question from:
https://stackoverflow.com/questions/65866668/merging-multiple-csv-files-based-on-specific-column-python