I'm relatively new to pandas but use it to plot and compare trends in industrial and economic data across countries and time. The DataFrames are organised like this:
#create sample df
import pandas as pd

df1 = pd.DataFrame(columns=['2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010',
                            '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018'],
                   index=['United Arab Emirates', 'Argentina', 'Australia', 'Austria', 'Bulgaria',
                          'Brazil', 'Canada'])
df2 = pd.DataFrame(columns=['2004', '2005', '2006', '2007', '2008', '2009', '2010',
                            '2011', '2012', '2013', '2014', '2015', '2016'],
                   index=['Argentina', 'Australia', 'Austria', 'Bulgaria',
                          'Brazil', 'Canada', 'Switzerland', 'Chile', 'Colombia'])
df3 = pd.DataFrame(columns=['2005', '2006', '2007', '2008', '2009', '2010',
                            '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018'],
                   index=['Argentina', 'Australia', 'Austria', 'Bulgaria',
                          'Brazil', 'Canada'])
This data comes from different sources, so the DataFrames do not always contain the same countries and years. To scatter plot them, I need to wrangle the DataFrames so that they all have the same shape, with identical rows and columns (the same lists of countries and years). I am doing this as follows:
Concatenate the DataFrames with an inner join along each axis, which gives the lists of countries and years that are common to all of them:
#create lists of countries and years common to all df
dfList = [df1, df2, df3]
merged = pd.concat(dfList, axis = 1, join='inner')
countryList = merged.index
merged = pd.concat(dfList, axis=0, join='inner')
yearList = merged.columns
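For comparison, the same common labels can be computed without building the concatenated frames, by intersecting the indexes directly. This is only a minimal sketch: the frames `a`, `b`, `c` and the names `common_countries` and `common_years` are small hypothetical stand-ins, not the data above.

```python
from functools import reduce

import pandas as pd

# Hypothetical stand-in frames with partly overlapping labels.
a = pd.DataFrame(columns=['2004', '2005', '2006'], index=['Argentina', 'Austria', 'Brazil'])
b = pd.DataFrame(columns=['2005', '2006', '2007'], index=['Austria', 'Brazil', 'Chile'])
c = pd.DataFrame(columns=['2005', '2006'], index=['Argentina', 'Austria', 'Brazil'])
frames = [a, b, c]

# Fold Index.intersection over the row labels, then over the column labels.
common_countries = reduce(lambda left, right: left.intersection(right),
                          [f.index for f in frames])
common_years = reduce(lambda left, right: left.intersection(right),
                      [f.columns for f in frames])

print(sorted(common_countries))
print(sorted(common_years))
```

Both approaches should yield the same set of labels; the intersection version just avoids creating a wide intermediate frame.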
However, I am having problems writing a function that loops through the DataFrames and removes the columns and rows (years and countries) that are not contained in yearList and countryList. The following function seems to run okay but does not change the columns or rows of the DataFrames. I'm pretty sure this is down to my misunderstanding of how variables work within loops, but I haven't been able to find anything on this as it applies to whole DataFrames.
Can anyone point out why this loop isn't working or suggest a more elegant / efficient way of wrangling a group of df so that they all contain identically labelled indices & columns? Many thanks in advance.
#loop through all df removing all rows / cols that are not in countryList & yearList
def countryyear(x):
    for x in dfList:
        x = x[x.index.isin(countryList)]
        x = x.loc[:,x.columns.isin(yearList)]
    #return x
countryyear(dfList)
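A likely reason the loop above has no visible effect: `x = x[...]` creates a new DataFrame and rebinds the local name `x` to it, so neither the list nor `df1`, `df2`, `df3` is modified, and the function returns nothing. One way around this, sketched here under assumptions, is to collect the filtered frames and return them. The function name `trim_to_common` and the small frames `a` and `b` are hypothetical, used only for illustration.

```python
import pandas as pd

def trim_to_common(frames, countries, years):
    """Return new frames restricted to the given row and column labels."""
    trimmed = []
    for frame in frames:
        # Boolean indexing returns a new object, so it has to be stored;
        # rebinding the loop variable alone would leave the inputs untouched.
        trimmed.append(frame.loc[frame.index.isin(countries),
                                 frame.columns.isin(years)])
    return trimmed

# Hypothetical stand-in frames filled with zeros.
a = pd.DataFrame(0, columns=['2004', '2005', '2006'], index=['Argentina', 'Austria', 'Brazil'])
b = pd.DataFrame(0, columns=['2005', '2006', '2007'], index=['Austria', 'Brazil', 'Chile'])

countries = a.index.intersection(b.index)
years = a.columns.intersection(b.columns)

# Unpack the returned list back into names; the originals are unchanged.
a_trim, b_trim = trim_to_common([a, b], countries, years)
print(a_trim.shape, b_trim.shape)
```

Returning a new list keeps the original frames intact, which can be handy if the untrimmed data is still needed for other plots.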
question from:
https://stackoverflow.com/questions/65881577/how-do-i-wrangle-a-number-of-df-so-that-they-all-contain-identically-labelled-in