Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

pandas - Extracting dataframes from a dictionary of dataframes

I have a directory containing many CSV files, which I have loaded into a dictionary of dataframes.

Here are three small sample CSV files to illustrate:

    import os
    import csv
    import pandas as pd

    #create 3 small csv files for test purposes
    os.chdir('c:/test')
    with open('dat1990.csv','w',newline='') as fp:
        a=csv.writer(fp,delimiter=',')
        data = [['Stock','Sales','Year'],
                ['100','24','1990'],
                ['120','33','1990'],
                ['23','5','1990']]
        a.writerows(data)

    with open('dat1991.csv','w',newline='') as fp:
        a=csv.writer(fp,delimiter=',')
        data = [['Stock','Sales','Year'],
                ['400','35','1991'],
                ['450','55','1991'],
                ['34','6','1991']]
        a.writerows(data)

    with open('other1991.csv','w',newline='') as fp:
        a=csv.writer(fp,delimiter=',')
        data = [['Stock','Sales','Year'],
                ['500','56','1991'],
                ['600','44','1991'],
                ['56','55','1991']]
        a.writerows(data)

Create a dictionary that maps a name to each CSV file to be loaded as a dataframe:

    dfcsv_dict = {'dat1990': 'dat1990.csv', 'dat1991': 'dat1991.csv', 
        'other1991': 'other1991.csv'}

Create a simple function for importing a CSV file into pandas:

    def myimport(csvfile):
        return pd.read_csv(csvfile)

Iterate through the dictionary to import all CSV files into pandas dataframes:

    df_dict = {}
    for k, v in dfcsv_dict.items():
        df_dict[k] = myimport(v)

Given that I may now have thousands of dataframes within the unified dictionary object, how can I select a few and "extract" them out of the dictionary?

For example, how would I extract just two of these three dataframes nested in the dictionary, something like:

    dat1990 = df_dict['dat1990']
    dat1991 = df_dict['dat1991']

but without using literal assignments? Maybe some sort of looping structure over the dictionary, ideally with a way to select a subgroup based on a string sequence in the dictionary key, e.g. all dataframes whose name contains dat or 1991.

I don't want another "sub dictionary" but want to extract them as named "standalone" dataframes as the above code illustrates.

I am using python 3.5.


1 Answer


This is an old question from Jan 2016 but since no one answered, here is an answer from Oct 2019. Might be useful for future reference.

I think you can skip the step of creating a dictionary of dataframes. I previously wrote an answer on how to create a single master dataframe from multiple CSV files, and adding a column in the master dataframe with a string extracted from the CSV filename. I think you could essentially do the same thing here.

Create a dataframe of csv files based on timestamp intervals

Steps:

  1. Create path to folder with files
  2. Create list of files in folder
  3. Create empty dataframe to store CSV dataframes
  4. Loop through each csv as a dataframe
  5. Add a column with the filename as a string
  6. Concatenate the individual dataframe to the master dataframe
  7. Use a dataframe filter mask to create new dataframe

    import os
    import pandas as pd

    # Step 1: create a path to the folder, syntax for Windows OS
    # (raw string, so the backslash is not treated as an escape character)
    path_test_folder = r'C:\test'

    # Step 2: create a list of CSV files in the folder
    files_in_folder = os.listdir(path_test_folder)
    files_in_folder = [x for x in files_in_folder if x.endswith('.csv')]

    # Step 3: create empty master dataframe to store CSV files
    df_master = pd.DataFrame()

    # Step 4: loop through the files in folder
    for each_csv in files_in_folder:

        # temporary dataframe for the CSV
        path_csv = os.path.join(path_test_folder, each_csv)
        temp_df = pd.read_csv(path_csv)

        # Step 5: add a column with the filename as a string
        temp_df['str_filename'] = str(each_csv)

        # Step 6: combine into master dataframe
        df_master = pd.concat([df_master, temp_df])

    # Step 7: then filter on your filenames
    mask_filter = df_master['str_filename'].isin(['dat1990.csv', 'dat1991.csv'])
    df_filter = df_master.loc[mask_filter]
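
The question also asks about selecting by a string sequence in the name (for example everything containing 1991) rather than by an explicit list of filenames. A minimal sketch of how that might look on the master dataframe follows. It assumes the `str_filename` column from the code above; the in-memory `df_master` here is only a stand-in built from the question's sample rows so the snippet runs without any files on disk, and `mask_1991`, `df_1991` and `frames_by_file` are illustrative names, not part of the original answer.

```python
import pandas as pd

# Stand-in for df_master: the question's sample rows, each tagged with
# the filename it would have come from (assumed column 'str_filename').
df_master = pd.DataFrame({
    'Stock': [100, 120, 23, 400, 450, 34, 500, 600, 56],
    'Sales': [24, 33, 5, 35, 55, 6, 56, 44, 55],
    'Year': [1990] * 3 + [1991] * 6,
    'str_filename': ['dat1990.csv'] * 3
                    + ['dat1991.csv'] * 3
                    + ['other1991.csv'] * 3,
})

# Keep only rows whose filename contains the literal substring '1991'.
mask_1991 = df_master['str_filename'].str.contains('1991', regex=False)
df_1991 = df_master.loc[mask_1991]

# Optionally split the selection back into one dataframe per source file.
frames_by_file = {
    name: group.drop(columns='str_filename')
    for name, group in df_1991.groupby('str_filename')
}
```

Swapping `str.contains('1991', regex=False)` for `str.startswith('dat')` would select by prefix instead. The last step does produce a dictionary again, so it is only worth using if per-file frames are needed after filtering; otherwise `df_1991` can be used directly.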
