df = pd.read_csv(file_path, sep=',', header=0, index_col=False, names=None)
Explanation: the read_csv function has many parameters; I have specified only the ones you are likely to use most often. A few key points:
a) header=0 means the column names are in the first row of the file; if the file has no header row, specify header=None.
b) index_col=False means the first column of the data is not used as the index of the data frame. If the first column really is an index, set index_col=0 (or pass the column name) instead.
c) names=None means you are not specifying the column names and want them inferred from the csv file, which requires that the row given by header=some_number contains the column names. Otherwise, pass the names here in the same order as the data appears in the csv file.
If you are reading a text file separated by spaces or tabs, simply change sep to sep=" " or sep='\t'.
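As a minimal sketch of the points above, the snippet below reads a comma-separated source with a header row and a tab-separated source without one. The in-memory strings, the sample values, and the variable names (csv_text, tsv_text, df_csv, df_tsv) are assumptions made up for illustration; in practice you would pass a file path.

```python
import io

import pandas as pd

# Hypothetical in-memory data standing in for files on disk.
csv_text = "name,age,height\nAsha,31,162\nBen,45,180\n"
tsv_text = "Asha\t31\t162\nBen\t45\t180\n"

# The first row holds the column names, so header=0 and names is left as None.
df_csv = pd.read_csv(io.StringIO(csv_text), sep=',', header=0, index_col=False)

# No header row here: use header=None and supply the column names explicitly.
df_tsv = pd.read_csv(
    io.StringIO(tsv_text),
    sep='\t',
    header=None,
    names=['name', 'age', 'height'],
)

print(df_csv)
print(df_tsv)
```

Both calls should produce data frames with the same three columns and two rows.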
Examples:
a)
df.iat[1,2] returns the element at row position 1 and column position 2, counting from 0 (that is, the second row and the third column). Note that the number 1 here does not correspond to the value 1 in the index column of the dataframe; the index of df may not contain 1 at all. It works like Python list indexing.
b)
df.at[first,col_name] returns the value in the row whose index value is first and the column named col_name.
c)
df.loc[list_of_indices,list_of_cols]
e.g. df.loc[[4,5],['age','height']]
Selects the rows whose index labels match and the columns whose names match.
d)
df.iloc[[0,1],[5,6]] is used for integer-based indexing; it returns the rows at positions 0 and 1 for the columns at positions 5 and 6 (counting from 0).
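The four accessors can be compared side by side in a small sketch. The data frame, its values, and its index labels (10, 20, 30) are invented for illustration; the labels are chosen to differ from the row positions so the label/position distinction is visible.

```python
import pandas as pd

# Hypothetical data; index labels deliberately differ from positions 0, 1, 2.
df = pd.DataFrame(
    {'age': [25, 32, 41], 'height': [160, 175, 168], 'weight': [55, 80, 70]},
    index=[10, 20, 30],
)

by_position = df.iat[1, 2]       # second row, third column (by position)
by_label = df.at[20, 'weight']   # row labelled 20, column named 'weight'

# Label-based selection of several rows and columns.
label_slice = df.loc[[10, 30], ['age', 'height']]

# Position-based selection: rows 0 and 1, columns 1 and 2.
position_slice = df.iloc[[0, 1], [1, 2]]

print(by_position, by_label)
print(label_slice)
print(position_slice)
```

Here df.iat[1, 2] and df.at[20, 'weight'] point at the same cell, one by position and one by label.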
13. How do you iterate over rows?
iterrows() and itertuples()
total = 0  # avoid shadowing the built-in sum()
for i, row in df.iterrows():
    total += row['height']
iterrows() returns an iterator over the rows, yielding each row as an (index, Series) pair. Do not rely on modifying a row to change the dataframe: depending on the data types, the row may be a copy rather than a view, in which case writing to it has no effect on the dataframe.
itertuples() returns named tuples
for row in df.itertuples():
    print(row.age)
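A small runnable sketch of both iteration styles follows. The data frame and its values are made up for illustration, and the comparison with the column-wise sum is an addition of mine, included only to check the loop result.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({'age': [25, 32, 41], 'height': [160, 175, 168]})

# iterrows() yields (index, Series) pairs.
total_height = 0
for idx, row in df.iterrows():
    total_height += row['height']

# itertuples() yields named tuples; columns are available as attributes.
ages = [row.age for row in df.itertuples()]

# The same total computed directly on the column, without a Python loop.
column_total = df['height'].sum()

print(total_height, column_total, ages)
```

For simple aggregations like this, the column-wise call avoids the explicit loop entirely.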
14. How do you sort by columns?
df.sort_values(by = list_of_cols,ascending=True)
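As a sketch of sorting by more than one column, the example below sorts a made-up data frame by age and then by height. The data and the variable names are assumptions for illustration; passing a list to ascending to mix sort directions is an option not mentioned above.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    'name': ['Cy', 'Asha', 'Ben', 'Dee'],
    'age': [32, 25, 32, 25],
    'height': [175, 160, 168, 171],
})

# Sort by age, breaking ties by height, both ascending.
by_age_height = df.sort_values(by=['age', 'height'], ascending=True)

# One direction per column: age ascending, height descending.
mixed = df.sort_values(by=['age', 'height'], ascending=[True, False])

print(by_age_height)
print(mixed)
```

sort_values returns a new data frame; df itself keeps its original row order unless you assign the result back.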
15. How do you apply a function to each element of a series?
df['series_name'].apply(f)
where f is the function you want to apply to each element of the series. If you also want to pass arguments to the custom function, you could modify it like this.
def f(x, **kwargs):
    # do something with x and kwargs
    return value_to_store
df['series_name'].apply(f, a=1, b=2, c=3)
If you want to apply a function to more than one series, then:
def f(row):
    age = row['age']
    height = row['height']
    return age + height  # placeholder: replace with your own computation
df[['age','height']].apply(f, axis=1)
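If you don't use axis=1, f is applied to each column instead (axis=0 is the default), so it receives the whole age series and then the whole height series rather than one row at a time. With axis=1, f receives each row in turn, giving you the age and height of that row for any manipulation you want.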
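The three cases above (element-wise with keyword arguments, row-wise with axis=1, and the default column-wise behaviour) can be sketched as follows. The functions scale and height_per_year, their parameters, and the sample data are hypothetical names invented for this example.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({'age': [25, 32, 41], 'height': [160, 175, 168]})


def scale(x, factor=1, offset=0):
    """Hypothetical element-wise function taking extra keyword arguments."""
    return x * factor + offset


# Extra keyword arguments are forwarded to the function for each element.
scaled = df['height'].apply(scale, factor=2, offset=1)


def height_per_year(row):
    """Hypothetical row-wise function using two columns of the same row."""
    return row['height'] / row['age']


# axis=1: the function receives one row (a Series) at a time.
ratio = df[['age', 'height']].apply(height_per_year, axis=1)

# Default axis=0: the function receives one whole column at a time.
col_max = df[['age', 'height']].apply(lambda col: col.max())

print(scaled.tolist())
print(ratio.tolist())
print(col_max.to_dict())
```

The last call shows why axis matters: with the default axis the function sees whole columns, so it returns one value per column rather than one per row.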
df1 --> name, age, height
df2 --> name, age, height
result = pd.concat([df1,df2],axis=0)
For horizontal concatenation,
df1 --> name, age
df2 --> height, salary
result = pd.concat([df1,df2], axis=1)
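Both directions of concatenation can be tried with a short sketch. The data frames, their values, and the variable names (df_top, df_bottom, df_right) are invented for illustration; ignore_index=True is an optional extra that renumbers the rows of the stacked result.

```python
import pandas as pd

# Hypothetical data frames with the same columns.
df_top = pd.DataFrame({'name': ['Asha', 'Ben'], 'age': [31, 45]})
df_bottom = pd.DataFrame({'name': ['Cy'], 'age': [28]})

# Vertical concatenation: rows of df_bottom are appended below df_top.
stacked = pd.concat([df_top, df_bottom], axis=0, ignore_index=True)

# Hypothetical data frame with different columns but the same row index.
df_right = pd.DataFrame({'height': [162, 180], 'salary': [50000, 62000]})

# Horizontal concatenation: columns are placed side by side, aligned on index.
side_by_side = pd.concat([df_top, df_right], axis=1)

print(stacked)
print(side_by_side)
```

Note that axis=1 aligns rows by index label, not by order, so the two frames should share the same index for the rows to line up.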
21. How do you merge two dataframes?
For the previous example, assume you have an employee database forming two dataframes like
df1 --> name, age, height
df2 --> name, salary, pincode, sick_leaves_taken
You may want to combine these two dataframes so that each row has all the details of an employee. To achieve this, you have to perform a merge operation.
df1.merge(df2, on=['name'],how='inner')
This operation returns a dataframe in which each row consists of name, age, height, salary, pincode, sick_leaves_taken.
how='inner' means a row is included in the result only if its name appears in both data frames. For more, read: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html?highlight=merge#pandas.DataFrame.merge
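To make the effect of how='inner' concrete, here is a minimal sketch with made-up employee records. All names and values are hypothetical; the how='left' call is included only for contrast and is not part of the example above.

```python
import pandas as pd

# Hypothetical employee data; 'Cy' is missing from df2 and 'Dee' from df1.
df1 = pd.DataFrame({
    'name': ['Asha', 'Ben', 'Cy'],
    'age': [31, 45, 28],
    'height': [162, 180, 170],
})
df2 = pd.DataFrame({
    'name': ['Asha', 'Ben', 'Dee'],
    'salary': [50000, 62000, 58000],
    'pincode': ['560001', '110001', '400001'],
    'sick_leaves_taken': [2, 0, 5],
})

# Inner merge: only names present in both data frames are kept.
inner = df1.merge(df2, on=['name'], how='inner')

# Left merge, for contrast: every row of df1 is kept, and columns from df2
# are filled with NaN where there is no matching name.
left = df1.merge(df2, on=['name'], how='left')

print(inner)
print(left)
```

With this data the inner merge keeps only Asha and Ben, while the left merge also keeps Cy with missing salary details.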