Pandas Time Series Data Processing Tutorial [Python]


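This article covers methods and tips for working with time series data in Pandas DataFrames.

Time series data

Convert a column to datetime using a given format

#imports used throughout the snippets in this article
import numpy as np
import pandas as pd

df['day_time'] = pd.to_datetime(df['day_time'], format='%Y-%m-%d %H:%M:%S')
0 2012-10-12 00:00:00
1 2012-10-12 00:30:00
2 2012-10-12 01:00:00
3 2012-10-12 01:30:00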
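Real data often contains a few malformed timestamps. The following is a minimal sketch, assuming hypothetical sample values, of how the errors argument of pd.to_datetime turns unparseable strings into NaT so they can be counted instead of aborting the conversion.

```python
import pandas as pd

# Hypothetical sample data; the last entry is deliberately malformed.
df = pd.DataFrame(
    {"day_time": ["2012-10-12 00:00:00", "2012-10-12 00:30:00", "not a date"]}
)

# errors="coerce" replaces unparseable strings with NaT instead of raising.
df["day_time"] = pd.to_datetime(
    df["day_time"], format="%Y-%m-%d %H:%M:%S", errors="coerce"
)

# Count how many values could not be parsed.
n_bad = df["day_time"].isna().sum()
print(df.dtypes)
print("unparseable rows:", n_bad)
```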

Re-index a DataFrame to fill in missing timestamps (every 30 minutes in the example below), taking each new row from the nearest existing one. Before running this, df needs a datetime index.

full_idx = pd.date_range(start=df['day_time'].min(), end=df['day_time'].max(), freq='30min')
df = (
    df
    .groupby('LCLid', as_index=False)
    .apply(lambda group: group.reindex(full_idx, method='nearest'))
    .reset_index(level=0, drop=True)
    .sort_index()
)
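The same idea can be written as an explicit loop, which is easier to step through. This is a sketch under assumptions: the two meters and their readings are made up, and the energy column name is hypothetical. Each group is reindexed onto the full half-hourly range, and gaps are filled from the nearest existing reading.

```python
import pandas as pd

# Hypothetical readings for two meters; each is missing one half-hour slot.
df = pd.DataFrame(
    {
        "day_time": pd.to_datetime(
            [
                "2012-10-12 00:00",
                "2012-10-12 01:00",
                "2012-10-12 00:00",
                "2012-10-12 00:30",
            ]
        ),
        "LCLid": ["A", "A", "B", "B"],
        "energy": [0.1, 0.3, 0.5, 0.6],
    }
).set_index("day_time")

# Every half-hour slot between the first and last timestamp.
full_idx = pd.date_range(df.index.min(), df.index.max(), freq="30min")

parts = []
for lclid, group in df.groupby("LCLid"):
    # reindex with a fill method needs a sorted, unique index per group.
    filled = group.sort_index().reindex(full_idx, method="nearest")
    filled["LCLid"] = lclid
    parts.append(filled)

result = pd.concat(parts).sort_index()
print(result)
```

Every meter ends up with one row per slot, so downstream code can rely on a regular grid.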

Find missing dates in a DataFrame

# Note date_range is inclusive of the end date
ref_date_range = pd.date_range('2012-02-05 00:00:00', '2014-02-08 23:30:00', freq='30min')

ref_df = pd.DataFrame(np.random.randint(1, 20, (ref_date_range.shape[0], 1)))
ref_df.index = ref_date_range

# check for missing datetimeindex values based on reference index (with all values)
missing_dates = ref_df.index[~ref_df.index.isin(df.index)]

missing_dates

>>DatetimeIndex(['2013-09-09 23:00:00', '2013-09-09 23:30:00',
               '2013-09-10 00:00:00', '2013-09-10 00:30:00'],
              dtype='datetime64[ns]', freq='30T')
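If the reference DataFrame is only needed for its index, the comparison can be done on the indexes directly. A minimal sketch with made-up data, using Index.difference to list the expected timestamps that are absent from df:

```python
import pandas as pd

# The complete set of timestamps we expect to see.
expected = pd.date_range("2013-09-09 22:00", "2013-09-10 01:00", freq="30min")

# Hypothetical data: build a full frame, then drop two rows to simulate gaps.
df = pd.DataFrame({"value": range(len(expected))}, index=expected)
df = df.drop(expected[[2, 3]])

# Timestamps that are in the expected range but not in df.
missing_dates = expected.difference(df.index)
print(missing_dates)
```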

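Split a DataFrame by date using a datetime column

#pd.datetime was removed in pandas 1.0; pd.Timestamp takes the same arguments
split_date = pd.Timestamp(2014, 2, 2)
df_train = df.loc[df['day_time'] < split_date]
df_test = df.loc[df['day_time'] >= split_date]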

Find the nearest date in a DataFrame (here we assume the index is a datetime field)

dt = pd.to_datetime('2016-04-23 11:00:00')
#get_loc(..., method='nearest') was removed in pandas 2.0; get_indexer works in old and new versions
pos = df.index.get_indexer([dt], method='nearest')[0]
#get index date
idx = df.index[pos]
#row to series
s = df.iloc[pos]
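In pandas 2.0 and later, Index.get_loc no longer accepts a method argument. The sketch below, with made-up timestamps, uses Index.get_indexer, which accepts method='nearest' in both older and newer versions and returns integer positions. The index must be sorted for this to work.

```python
import pandas as pd

# Hypothetical data with a sorted datetime index.
idx = pd.to_datetime(["2016-04-23 09:00", "2016-04-23 10:30", "2016-04-23 12:00"])
df = pd.DataFrame({"value": [1, 2, 3]}, index=idx)

dt = pd.Timestamp("2016-04-23 11:00:00")

# get_indexer takes a list of targets and returns an array of positions.
pos = df.index.get_indexer([dt], method="nearest")[0]

nearest_time = df.index[pos]  # the index label closest to dt
row = df.iloc[pos]            # the matching row as a Series
print(pos, nearest_time)
```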

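Calculate the delta between the datetimes of consecutive rows (assuming the index is a datetime)

df['t_val'] = df.index
#fill the first row with a zero Timedelta so the column keeps its timedelta dtype
df['delta'] = (df['t_val'] - df['t_val'].shift()).fillna(pd.Timedelta(0))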
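A shorter variant, sketched here on made-up data, uses diff() on the index and then converts the resulting timedeltas to minutes. The delta_minutes column name is an assumption for illustration.

```python
import pandas as pd

# Hypothetical data with irregular spacing between rows.
idx = pd.to_datetime(["2014-02-02 00:00", "2014-02-02 00:30", "2014-02-02 02:00"])
df = pd.DataFrame({"data": [0.45, 0.41, 0.50]}, index=idx)

# diff() gives the gap to the previous row; the first row has no predecessor.
df["delta"] = df.index.to_series().diff().fillna(pd.Timedelta(0))

# Convert the timedeltas to a plain number of minutes.
df["delta_minutes"] = df["delta"].dt.total_seconds() / 60
print(df)
```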

Calculate the running delta between a date column and a given date (here we use the first date in the date column as the date to difference against).

dt = pd.to_datetime(str(train_df['date'].iloc[0]))
dt
>>Timestamp('2016-01-10 00:00:00')
#use total_seconds(): .seconds drops whole days from the difference
train_df['elapsed'] = (train_df['date'] - dt).dt.total_seconds()
#convert seconds to hours
train_df['elapsed'] = train_df['elapsed'] / 3600
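One detail is easy to get wrong here: the seconds attribute of a timedelta holds only the sub-day remainder, whereas total_seconds() covers the whole span. A small sketch with made-up dates shows the difference once a gap exceeds 24 hours.

```python
import pandas as pd

# Hypothetical dates; the last one is more than a day after the first.
train_df = pd.DataFrame(
    {"date": pd.to_datetime(["2016-01-10 00:00", "2016-01-10 06:00", "2016-01-11 12:00"])}
)

dt = train_df["date"].iloc[0]
delta = train_df["date"] - dt

# Full elapsed time in hours, including whole days.
train_df["elapsed"] = delta.dt.total_seconds() / 3600

# For comparison: .seconds ignores the days part of each timedelta.
seconds_component = delta.dt.seconds
print(train_df)
print(seconds_component.tolist())
```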

Housekeeping

Reset the index

            data
day_time
2014-02-02  0.45
2014-02-02  0.41
df.reset_index(inplace=True)
    day_time  data
0 2014-02-02  0.45
1 2014-02-02  0.41
#to drop it
df.reset_index(drop=True, inplace=True)

Set the index

df = df.set_index('day_time')

Reset the index without keeping the original index

df = df.reset_index(drop=True)

Drop columns

df.drop(columns=['col_to_drop', 'other_col_to_drop'], inplace=True)

Rename columns

df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'}, inplace=True)

Sort a DataFrame by column_1, then by column_2, in ascending order

df.sort_values(by=['column_1', 'column_2'])
#descending
df.sort_values(by='column_1', ascending=False)

Select

Select rows from a DataFrame based on the values in a column

A very useful snippet:

df.loc[df['column_name'] == some_value]
df.loc[df['column_name'].isin(some_values)]
df.loc[(df['column_name'] == some_value) & df['other_column'].isin(some_values)]
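To make the masks concrete, here is a sketch on a small made-up frame. It also shows negation with ~, which is not in the snippet above but follows the same pattern. Each condition needs its own parentheses when combined with & because of operator precedence.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame(
    {"column_name": ["a", "b", "c", "a"], "other_column": [1, 2, 3, 4]}
)
some_value = "a"
some_values = [1, 3]

# Rows where the column equals a single value.
exact = df.loc[df["column_name"] == some_value]

# Rows matching both conditions.
combined = df.loc[
    (df["column_name"] == some_value) & df["other_column"].isin(some_values)
]

# Rows where the column is NOT in a list of values.
negated = df.loc[~df["column_name"].isin(["a"])]
print(exact, combined, negated, sep="\n\n")
```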

Select columns from a DataFrame

df1 = df[['a','b']]

Get the unique values in a column

acorns = df.Acorn.unique()
#same as
acorns = df['Acorn'].unique()

Group by a column, apply an operation, then convert the result back to a DataFrame

df = df.groupby(['LCLid']).mean().reset_index()
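When the frame also holds non-numeric columns, it can help to name the column being aggregated. A minimal sketch, assuming a hypothetical energy column, where as_index=False keeps the group key as a regular column so no reset_index is needed:

```python
import pandas as pd

# Hypothetical meter readings.
df = pd.DataFrame(
    {
        "LCLid": ["MAC000006", "MAC000006", "MAC005178"],
        "energy": [0.25, 0.75, 1.0],
    }
)

# One row per LCLid with the mean of the selected column.
means = df.groupby("LCLid", as_index=False)["energy"].mean()
print(means)
```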

Get the row with the smallest value in a column

lowest_row = df.iloc[df['column_1'].argmin()]

Select by row number

my_series = df.iloc[0]
my_df = df.iloc[[0]]

Select by column number

df.iloc[:,0]

Replace

Replace rows in a DataFrame with rows from another DataFrame that has the same index.

#for example first I created a new dataframe based on a selection
#.copy() avoids a SettingWithCopyWarning when df_b is modified
df_b = df_a.loc[df_a['machine_id'].isnull()].copy()
#replace column with value from another column
df_b['machine_id'] = df_b['box_id']
#now replace rows in original dataframe
df_a.loc[df_b.index] = df_b
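For the specific case shown above, filling missing values in one column from another, a single fillna call may be enough. This sketch uses made-up values and relies on fillna aligning the two Series by index.

```python
import pandas as pd

# Hypothetical data: the second row has no machine_id.
df_a = pd.DataFrame(
    {"machine_id": ["m1", None, "m3"], "box_id": ["b1", "b2", "b3"]}
)

# Where machine_id is missing, take the value from box_id in the same row.
df_a["machine_id"] = df_a["machine_id"].fillna(df_a["box_id"])
print(df_a)
```

The two-step version remains useful when the replacement rows need more processing than a simple column copy.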

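Replace values in a column by row index

#.loc slices by label and includes the end point, so rows 0, 1 and 2 are set
df.loc[0:2,'col'] = 42

Iterate over rows

Using iterrows

for index, row in df.iterrows():
    print(row["type"], row["value"])

Using itertuples (faster)

for row in df.itertuples():
    print(getattr(row, "type"), getattr(row, "value"))

If you need to modify the rows you are iterating over, use apply:

def my_fn(c):
    return c + 1
df['plus_one'] = df.apply(lambda row: my_fn(row['value']), axis=1)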

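Alternatively, loop over the index and set each value with at (here <something>, x and y are placeholders for your own condition and values):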

for i in df.index:
    if <something>:
        df.at[i, 'ifor'] = x
    else:
        df.at[i, 'ifor'] = y
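When the condition depends only on column values, the loop can usually be replaced by a vectorized call. The sketch below swaps in numpy.where as an alternative technique; the condition (value > 0) and the values of x and y are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data and placeholder values.
df = pd.DataFrame({"value": [-1, 0, 5]})
x, y = "positive", "non-positive"

# np.where picks x where the condition holds and y elsewhere, row by row.
df["ifor"] = np.where(df["value"] > 0, x, y)
print(df)
```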

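NaNs

Replace NaN in a DataFrame or a column with zero (or some other value)

df = df.fillna(0)
#assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
df['some_column'] = df['some_column'].fillna(0)

Count the NaNs in a column

df['energy(kWh/hh)'].isna().sum()

Find which columns contain NaN, list those columns, then select the columns that have one or more NaN:

#which cols have nan
df.isna().any()
#list of cols with nan
df.columns[df.isna().any()].tolist()
#select cols with nan
df.loc[:, df.isna().any()]

Get the rows where a column is NaN

df[df['Col2'].isnull()]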
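The NaN checks above can be tried on a small made-up frame. This sketch counts NaNs per column, lists the affected columns, and selects rows that contain a NaN in any column (the any(axis=1) variant is an addition to the snippets above).

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in columns a and c.
df = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0], "b": [1, 2, 3], "c": [np.nan, np.nan, 1.0]}
)

# Number of NaNs in each column.
nan_counts = df.isna().sum()

# Names of the columns that contain at least one NaN.
cols_with_nan = df.columns[df.isna().any()].tolist()

# Rows that contain a NaN in any column.
rows_any = df[df.isna().any(axis=1)]
print(nan_counts, cols_with_nan, rows_any, sep="\n\n")
```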

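Data analysis

Display the last n rows of a DataFrame

df.tail(n=2)

Display the transpose of the head of a DataFrame. We pass len(list(df)), the number of columns, to head so that every column is shown as a row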

df.head().T.head(len(list(df)))
>>                                  0                    1                    2                    3                    4
index             2012-02-05 00:00:00  2012-02-05 00:00:00  2012-02-05 00:00:00  2012-02-05 00:00:00  2012-02-05 00:00:00
LCLid                       MAC000006            MAC005178            MAC000066            MAC004510            MAC004882
energy(kWh/hh)                  0.042                0.561                0.037                0.254                0.426
dayYear                          2012                 2012                 2012                 2012                 2012
dayMonth                            2                    2                    2                    2                    2
dayWeek                             5                    5                    5                    5                    5
dayDay                              5                    5                    5                    5                    5
dayDayofweek                        6                    6                    6                    6                    6
dayDayofyear                       36                   36                   36                   36                   36

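String operations

Replace specific characters in a column

#regex=False treats '?' as a literal character rather than a regex quantifier
df['bankHoliday'] = df['bankHoliday'].str.replace('?', '', regex=False)

Concatenate two columns

df['concat'] = df["id"].astype(str) + '-' + df["name"]

Merges

Merge DataFrames on multiple columns

df = pd.merge(X, y, on=['city', 'year', 'weekofyear'])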
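By default pd.merge performs an inner join, so rows without a match on every key are silently dropped. The following sketch, on made-up data with hypothetical temp and total_cases columns, contrasts that with a left join and uses indicator=True to flag the unmatched rows.

```python
import pandas as pd

keys = ["city", "year", "weekofyear"]

# Hypothetical feature and label tables; one feature row has no label.
X = pd.DataFrame(
    {
        "city": ["sj", "sj", "iq"],
        "year": [1990, 1990, 2000],
        "weekofyear": [18, 19, 26],
        "temp": [25.0, 26.0, 27.0],
    }
)
y = pd.DataFrame(
    {
        "city": ["sj", "iq"],
        "year": [1990, 2000],
        "weekofyear": [18, 26],
        "total_cases": [4, 0],
    }
)

# Default inner join: only rows whose three keys match in both tables.
inner = pd.merge(X, y, on=keys)

# Left join keeps every row of X; _merge records where each row came from.
left = pd.merge(X, y, on=keys, how="left", indicator=True)
unmatched = left[left["_merge"] == "left_only"]
print(inner, left, sep="\n\n")
```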

Concat / vertical append

#DataFrame.append was removed in pandas 2.0; use pd.concat instead
df = pd.concat([df1, df2], ignore_index=True)
#or
frames = [df1, df2, df3]
result = pd.concat(frames)

Split

Split a DataFrame into N roughly equal-sized DataFrames

from multiprocessing import Process

#NUM_CORES and proc_chunk are assumed to be defined elsewhere
jobs = []
idxs = df.index.values
chunked = np.array_split(idxs, NUM_CORES)
for chunk in chunked:
    part_df = df.loc[df.index.isin(chunk)]
    #run some process on the part
    p = Process(target=proc_chunk, args=[part_df])
    jobs.append(p)
    p.start()
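The chunking step can be checked on its own, without multiprocessing. This sketch splits by row position rather than by index label, which also works when index labels repeat; NUM_CHUNKS and the sample frame are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical data: ten rows to split into three parts.
df = pd.DataFrame({"value": range(10)})
NUM_CHUNKS = 3

# array_split tolerates sizes that do not divide evenly.
positions = np.array_split(np.arange(len(df)), NUM_CHUNKS)
chunks = [df.iloc[pos] for pos in positions]

sizes = [len(chunk) for chunk in chunks]
print(sizes)
```

Concatenating the chunks in order gives back the original frame, which is a quick way to confirm that no rows were lost or duplicated.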

Type conversion

Change the type of a column in a DataFrame

df_test[['value']] = df_test[['value']].astype(int)

Adding data

Add an empty column

df["nan_column"] = np.nan
df["zero_column"] = 0

Data types

Convert columns "a" and "b" to numeric, coercing non-numeric values to NaN

df[['a', 'b']] = df[['a', 'b']].apply(pd.to_numeric, errors='coerce')

Create a DataFrame from a list of dictionaries

df = pd.DataFrame([sig_dict, id_dict, phase_dict, target_dict])
df = df.T
df.columns = ['signal', 'id', 'phase', 'target']

Numpy

As an alternative to concatenating DataFrames you can use numpy, which is less memory-intensive than pandas and therefore useful for large merges

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7,8]])
a, b
>(array([[1, 2],
 [3, 4]]), array([[5, 6],
 [7, 8]]))
c=np.concatenate((a, b), axis=1)
c
>array([[1, 2, 5, 6],
       [3, 4, 7, 8]])
df = pd.DataFrame(c)
df.head()
>   0  1  2  3
 0  1  2  5  6
 1  3  4  7  8
#read each parquet part with pyarrow and stack the values with numpy
import pyarrow.parquet as pq
for i in range(10):
    df = pq.read_table(path+f'df_{i}.parquet').to_pandas()
    vals = df.values
    if i > 0:
        #axis=1 to concat horizontally
        np_vals = np.concatenate((np_vals, vals), axis=1)
    else:
        np_vals = vals
np.savetxt(path+'df_np.csv', np_vals, delimiter=",")

Import/Export

Group by a column, then export each group to a separate CSV file:

#x.name is the group key, so each file is named after its LCLid
f = lambda x: x.to_csv('{0}.csv'.format(x.name.lower()), index=False)
df.groupby('LCLid').apply(f)
#for example our original dataframe may be:
     day_time            LCLid     energy(kWh/hh)
289  2012-02-05 00:00:00 MAC004954 0.45
289  2012-02-05 00:30:00 MAC004954 0.46
6100 2012-02-05 05:30:00 MAC000041 0.23
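The same export can be written as an explicit loop over the groups, which makes the file naming easier to follow. This is a sketch under assumptions: the sample rows are made up, and the files go to a temporary directory rather than the working directory.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical readings for two meters.
df = pd.DataFrame(
    {
        "LCLid": ["MAC004954", "MAC004954", "MAC000041"],
        "energy": [0.45, 0.46, 0.23],
    }
)

# Write into a throwaway directory for this example.
out_dir = Path(tempfile.mkdtemp())

written = []
for lclid, group in df.groupby("LCLid"):
    # One CSV per meter, named after the lower-cased group key.
    path = out_dir / f"{lclid.lower()}.csv"
    group.to_csv(path, index=False)
    written.append(path.name)

print(written)
```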

Import/export in Feather format

Here we save the DataFrame in Feather format (very fast to read back). Note that you may run into problems saving Feather files larger than about 2 GB with pandas == 0.23.4.

df.to_feather('df_data.feather')
import feather as ftr
df = ftr.read_dataframe('df_data.feather')

Import/export in Parquet format

import pyarrow.parquet as pq
df.to_parquet('data.parquet')
df = pq.read_table('data.parquet').to_pandas()

Save without the index

df.to_csv('file.csv', index=False)

Read in, specifying new column names

df = pd.read_csv('signals.csv', names=['phase', 'amplitude'])

datetime64 date and time codes

From the NumPy reference (https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html). Date units:

Code Meaning
Y    year
M    month
W    week
D    day

Time units:

Code Meaning
h    hour
m    minute
s    second
ms   millisecond
us   microsecond
ns   nanosecond
ps   picosecond
fs   femtosecond
as   attosecond
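These codes are the unit strings accepted by numpy.datetime64 and numpy.timedelta64. A short sketch with arbitrary example dates shows how the unit is chosen and how arithmetic between units behaves:

```python
import numpy as np

# A date stored with day resolution.
day = np.datetime64("2012-10-12", "D")

# Adding a timedelta in hours yields a result with hour resolution.
later = day + np.timedelta64(36, "h")

# Dividing two timedeltas gives a plain number: how many 30-minute slots in 90 minutes.
half_hours = np.timedelta64(90, "m") / np.timedelta64(30, "m")

# Month arithmetic uses the M code and rolls over the year boundary.
month = np.datetime64("2012-10", "M") + np.timedelta64(3, "M")

print(later, half_hours, month)
```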

References

Pandas for time series data – tricks and tips
