pandas教程

pandas教程

阅读该教程,您需要python基础和numpy的预备知识

表格基础

1
2
3
4
import pandas as pd
import numpy as np
s=pd.Series([1,3,6,np.nan,44,1])
print(s)
1
2
3
4
5
6
7
0     1.0
1 3.0
2 6.0
3 NaN
4 44.0
5 1.0
dtype: float64

shape获取表格长度,返回元组(行数,列数)

日期序列

1
2
3
4
import pandas as pd
import numpy as np
dates=pd.date_range('20160101',periods=6)
print(dates)
1
2
3
DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
'2016-01-05', '2016-01-06'],
dtype='datetime64[ns]', freq='D')

DataFrame生成表格

利用index和columns设置表头

1
2
3
4
5
import pandas as pd
import numpy as np
dates=pd.date_range('20160101',periods=6)
df=pd.DataFrame(np.random.randn(6,4),index=dates,columns=['a','b','c','d'])
print(df)
1
2
3
4
5
6
7
                   a         b         c         d
2016-01-01 0.902005 -0.463295 0.066307 1.064827
2016-01-02 -0.430363 -0.213347 -0.162768 -0.369665
2016-01-03 -0.923885 -1.808873 1.686310 0.518044
2016-01-04 -0.343998 -0.204547 -0.443543 -0.461751
2016-01-05 -0.309939 -1.194092 -0.590669 -0.418642
2016-01-06 0.845991 -0.515617 -0.394218 0.477955

采用默认的表头

1
2
3
4
import pandas as pd
import numpy as np
df=pd.DataFrame(np.arange(12).reshape((3,4)))
print(df)
1
2
3
4
   0  1   2   3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11

使用字典建立表格

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
import pandas as pd
import numpy as np
df=pd.DataFrame({'A':1.,
'B':pd.Timestamp('20130102'),
'C':pd.Series(1,index=list(range(4)),dtype='float32'),
'D':np.array([3]*4,dtype='int32'),
'E':pd.Categorical(["test","train","test","train"]),
'F':'foo'
})
print("df表格:\n",df)
print("各列的类型\n",df.dtypes)
print("打印各行表头\n",df.index)
print("打印各列表头\n",df.columns)
print("取出所有列\n",df.values) # 取出所有列
print("表格统计数据\n",df.describe()) # 打印表格统计数据
t=df.T # 转置表格
print("转置后的表格\n",t)
df1=df.sort_index(axis=1,ascending=False) # df不变,排序结果在df1,对列表头排序
print("对列表头排序\n",df1)
df2=df.sort_index(axis=0,ascending=False) # df不变,排序结果在df2,对行表头排序
print("对行表头排序\n",df2)
# 还可以对值排序
df3=df.sort_values(by='E')
print("对值排序\n",df3)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
df表格:
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
各列的类型
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
打印各行表头
Index([0, 1, 2, 3], dtype='int64')
打印各列表头
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
取出所有列
[[1.0 Timestamp('2013-01-02 00:00:00') 1.0 3 'test' 'foo']
[1.0 Timestamp('2013-01-02 00:00:00') 1.0 3 'train' 'foo']
[1.0 Timestamp('2013-01-02 00:00:00') 1.0 3 'test' 'foo']
[1.0 Timestamp('2013-01-02 00:00:00') 1.0 3 'train' 'foo']]
表格统计数据
A B C D
count 4.0 4 4.0 4.0
mean 1.0 2013-01-02 00:00:00 1.0 3.0
min 1.0 2013-01-02 00:00:00 1.0 3.0
25% 1.0 2013-01-02 00:00:00 1.0 3.0
50% 1.0 2013-01-02 00:00:00 1.0 3.0
75% 1.0 2013-01-02 00:00:00 1.0 3.0
max 1.0 2013-01-02 00:00:00 1.0 3.0
std 0.0 NaN 0.0 0.0
转置后的表格
0 ... 3
A 1.0 ... 1.0
B 2013-01-02 00:00:00 ... 2013-01-02 00:00:00
C 1.0 ... 1.0
D 3 ... 3
E test ... train
F foo ... foo

[6 rows x 4 columns]
对列表头排序
F E D C B A
0 foo test 3 1.0 2013-01-02 1.0
1 foo train 3 1.0 2013-01-02 1.0
2 foo test 3 1.0 2013-01-02 1.0
3 foo train 3 1.0 2013-01-02 1.0
对行表头排序
A B C D E F
3 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
0 1.0 2013-01-02 1.0 3 test foo
对值排序
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
2 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
3 1.0 2013-01-02 1.0 3 train foo

选择数据

1
2
3
4
5
6
7
8
9
import pandas as pd
import numpy as np
dates=pd.date_range('20130101',periods=6)
df=pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])
print("表格\n",df)
print("名为A的列\n",df['A'])
print("名为A的列\n",df.A)
print("[0,3)行\n",df[0:3])
print("'20130102':'20130104'\n",df['20130102':'20130104'])
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
表格
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
名为A的列
2013-01-01 0
2013-01-02 4
2013-01-03 8
2013-01-04 12
2013-01-05 16
2013-01-06 20
Freq: D, Name: A, dtype: int32
名为A的列
2013-01-01 0
2013-01-02 4
2013-01-03 8
2013-01-04 12
2013-01-05 16
2013-01-06 20
Freq: D, Name: A, dtype: int32
[0,3)行
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
'20130102':'20130104'
A B C D
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15

loc: select by label

1
2
3
4
5
6
7
8
import pandas as pd
import numpy as np
dates=pd.date_range('20130101',periods=6)
df=pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])
print("表格\n",df)
print("取出'20130102'行\n",df.loc['20130102'])
print("取出A列和B列\n",df.loc[:,['A','B']])
print("取出指定行和指定列\n",df.loc['20130102',['A','B']])
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
表格
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
取出'20130102'行
A 4
B 5
C 6
D 7
Name: 2013-01-02 00:00:00, dtype: int32
取出A列和B列
A B
2013-01-01 0 1
2013-01-02 4 5
2013-01-03 8 9
2013-01-04 12 13
2013-01-05 16 17
2013-01-06 20 21
取出指定行和指定列
A 4
B 5
Name: 2013-01-02 00:00:00, dtype: int32

iloc: select by position

1
2
3
4
5
6
7
8
import pandas as pd
import numpy as np
dates=pd.date_range('20130101',periods=6)
df=pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])
print("表格\n",df)
print(df.iloc[3,1])
print(df.iloc[3:5,1])
print(df.iloc[[1,3,5],1:3]) # 不连续地筛选
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
表格
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
13
2013-01-04 13
2013-01-05 17
Freq: D, Name: B, dtype: int32
B C
2013-01-02 5 6
2013-01-04 13 14
2013-01-06 21 22

bool判断

1
2
3
4
5
6
7
import pandas as pd
import numpy as np
dates=pd.date_range('20130101',periods=6)
df=pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])
print("表格\n",df)
print(df.A>8)
print(df[df.A>8])
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
表格
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
2013-01-01 False
2013-01-02 False
2013-01-03 False
2013-01-04 True
2013-01-05 True
2013-01-06 True
Freq: D, Name: A, dtype: bool
A B C D
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23

修改指定位置的值

1
2
3
4
5
6
import pandas as pd
import numpy as np
dates=pd.date_range('20130101',periods=6)
df=pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])
df.iloc[2,2]=114514
print("表格\n",df)
1
2
3
4
5
6
7
8
表格
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 114514 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23

修改符合条件的表格元素

1
2
3
4
5
6
7
import pandas as pd
import numpy as np
dates=pd.date_range('20130101',periods=6)
df=pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])
print("表格\n",df)
df[df>5]=-1
print(df)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
表格
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 -1 -1
2013-01-03 -1 -1 -1 -1
2013-01-04 -1 -1 -1 -1
2013-01-05 -1 -1 -1 -1
2013-01-06 -1 -1 -1 -1

修改符合条件的行

1
2
3
4
5
6
7
import pandas as pd
import numpy as np
dates=pd.date_range('20130101',periods=6)
df=pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])
print("表格\n",df)
df[df.A>4]=-11
print(df)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
表格
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 -11 -11 -11 -11
2013-01-04 -11 -11 -11 -11
2013-01-05 -11 -11 -11 -11
2013-01-06 -11 -11 -11 -11
1
2
3
4
5
6
7
import pandas as pd
import numpy as np
dates=pd.date_range('20130101',periods=6)
df=pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])
print("表格\n",df)
df[df>4]=df*10+6
print(df)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
表格
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 56 66 76
2013-01-03 86 96 106 116
2013-01-04 126 136 146 156
2013-01-05 166 176 186 196
2013-01-06 206 216 226 236

插入新的列

1
2
3
4
5
6
7
8
9
import pandas as pd
import numpy as np
dates=pd.date_range('20130101',periods=6)
df=pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])
print("表格\n",df)
df['F']=np.nan
df['G']=np.arange(6)
df['H']=pd.Series([6,5,4,3,2,1],index=pd.date_range('20130101',periods=6))
print(df)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
表格
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
A B C D F G H
2013-01-01 0 1 2 3 NaN 0 6
2013-01-02 4 5 6 7 NaN 1 5
2013-01-03 8 9 10 11 NaN 2 4
2013-01-04 12 13 14 15 NaN 3 3
2013-01-05 16 17 18 19 NaN 4 2
2013-01-06 20 21 22 23 NaN 5 1

处理丢失的数据

dropna删除丢失的数据行列

axis=0时删除含有nan的行

axis=1时删除含有nan的列

how=‘any’表示只要有nan就删除

how=‘all’表示删除所有都是nan的行/列

1
2
3
4
5
6
7
8
9
import pandas as pd
import numpy as np
dates=pd.date_range('20130101',periods=6)
df=pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])
df.iloc[1,2]=np.nan
df.iloc[4,:]=np.nan
print("表格\n",df)
print(df.dropna(axis=0,how='any'))
print(df.dropna(axis=0,how='all'))
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
表格
A B C D
2013-01-01 0.0 1.0 2.0 3.0
2013-01-02 4.0 5.0 NaN 7.0
2013-01-03 8.0 9.0 10.0 11.0
2013-01-04 12.0 13.0 14.0 15.0
2013-01-05 NaN NaN NaN NaN
2013-01-06 20.0 21.0 22.0 23.0
A B C D
2013-01-01 0.0 1.0 2.0 3.0
2013-01-03 8.0 9.0 10.0 11.0
2013-01-04 12.0 13.0 14.0 15.0
2013-01-06 20.0 21.0 22.0 23.0
A B C D
2013-01-01 0.0 1.0 2.0 3.0
2013-01-02 4.0 5.0 NaN 7.0
2013-01-03 8.0 9.0 10.0 11.0
2013-01-04 12.0 13.0 14.0 15.0
2013-01-06 20.0 21.0 22.0 23.0

fillna给nan赋值 isnull返回对应布尔矩阵

1
2
3
4
5
6
7
8
9
10
import pandas as pd
import numpy as np
dates=pd.date_range('20130101',periods=6)
df=pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])
df.iloc[1,2]=np.nan
df.iloc[4,:]=np.nan
print("表格\n",df)
print(df.fillna(value=0))
print(df.isnull())
print(np.any(df.isnull())==True) # 在表格中寻找是否存在一个nan
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
表格
A B C D
2013-01-01 0.0 1.0 2.0 3.0
2013-01-02 4.0 5.0 NaN 7.0
2013-01-03 8.0 9.0 10.0 11.0
2013-01-04 12.0 13.0 14.0 15.0
2013-01-05 NaN NaN NaN NaN
2013-01-06 20.0 21.0 22.0 23.0
A B C D
2013-01-01 0.0 1.0 2.0 3.0
2013-01-02 4.0 5.0 0.0 7.0
2013-01-03 8.0 9.0 10.0 11.0
2013-01-04 12.0 13.0 14.0 15.0
2013-01-05 0.0 0.0 0.0 0.0
2013-01-06 20.0 21.0 22.0 23.0
A B C D
2013-01-01 False False False False
2013-01-02 False False True False
2013-01-03 False False False False
2013-01-04 False False False False
2013-01-05 True True True True
2013-01-06 False False False False
True

读写文件

读文件:

read_csv

read_excel

read_hdf

read_sql

read_json

read_msgpack (experimental)

read_html

read_gbq (experimental)

read_stata

read_sas

read_clipboard

read_pickle

写文件:

to_csv

to_excel

to_hdf

to_sql

to_json

to_msgpack (experimental)

to_html

to_gbq (experimental)

to_stata

to_clipboard

to_pickle

测试文件下载:

students.xlsx

students.csv

读取csv

1
2
3
4
import pandas as pd
import numpy as np
data=pd.read_csv('students.csv',sep='\t')
print(data)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
    Student ID   name  age  gender
0 1100 Kelly 22 Female
1 1101 Clo 21 Female
2 1102 Tilly 22 Female
3 1103 Tony 24 Male
4 1104 David 20 Male
5 1105 Catty 22 Female
6 1106 M 3 Female
7 1107 name 43 Male
8 1108 A 13 Male
9 1109 S 12 Male
10 1110 David 33 Male
11 1111 Dw 3 Female
12 1112 Q 23 Male

读取excel

参考文献

1
2
3
4
5
6
7
import pandas as pd
import numpy as np
file='students.xlsx'
f=open(file,'rb')
df=pd.read_excel(f,sheet_name='Sheet1')
print(df)
f.close()
1
2
3
4
5
6
7
8
9
10
11
12
13
14
    Student ID   name  age  gender
0 1100 Kelly 22 Female
1 1101 Clo 21 Female
2 1102 Tilly 22 Female
3 1103 Tony 24 Male
4 1104 David 20 Male
5 1105 Catty 22 Female
6 1106 M 3 Female
7 1107 name 43 Male
8 1108 A 13 Male
9 1109 S 12 Male
10 1110 David 33 Male
11 1111 Dw 3 Female
12 1112 Q 23 Male

输出excel

1
2
3
with pd.ExcelWriter(filepath) as writer:
df.to_excel(writer, sheet_name='Sheet1')
df2.to_excel(writer, sheet_name='Sheet2')

示例一:

示例文件students.xlsx

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
import pandas as pd
import numpy as np
file='students.xlsx'
f=open(file,'rb')
df=pd.read_excel(f,sheet_name='Sheet1')
print(df)
f.close()

df1=df.iloc[:,2]>=10
df2=df[df1]
df2=df2.sort_values(by='age')

print(df2)
for i in range(df2.shape[0]):
if df2.iloc[i,3]=="Male":
df2.iloc[i,3]="沃尔玛塑料袋"
else:
df2.iloc[i,3]="武装直升机"
if len(df2.iloc[i,1])<3:
df2.iloc[i,1]="hahaha"+df2.iloc[i,1]
df2.iloc[i,0]+=22330000
df2=df2.reset_index() # 重置index
print(df2)

# 输出为excel
filepath=".\\df2.xlsx"

with pd.ExcelWriter(filepath) as writer:
df.to_excel(writer, sheet_name='Sheet1')
df2.to_excel(writer, sheet_name='Sheet2')

concat拼接

1
2
3
4
5
6
7
8
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d'])
df2 = pd.DataFrame(np.ones((3,4))*1, columns=['a','b','c','d'])
df3 = pd.DataFrame(np.ones((3,4))*2, columns=['a','b','c','d'])
print("拼接前:\n",df1,'\n',df2,'\n',df3)
res = pd.concat([df1, df2, df3], axis=0, ignore_index=True)
print("拼接后:\n",res)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
拼接前:
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
a b c d
0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
a b c d
0 2.0 2.0 2.0 2.0
1 2.0 2.0 2.0 2.0
2 2.0 2.0 2.0 2.0
拼接后:
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
5 1.0 1.0 1.0 1.0
6 2.0 2.0 2.0 2.0
7 2.0 2.0 2.0 2.0
8 2.0 2.0 2.0 2.0

如果不写ignore_index=True,即默认ignore_index=False, 拼接的结果为

1
2
3
4
5
6
7
8
9
10
      a    b    c    d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
0 2.0 2.0 2.0 2.0
1 2.0 2.0 2.0 2.0
2 2.0 2.0 2.0 2.0

拼接表头不同的列

如果拼接表头不同的列,如下例子所示

axis=0时对列操作,在列的方向上拼接;axis=1时对行操作,在行的方向上拼接

join=inner时,会去除因为拼接错位而产生的含有NaN的行/列

1
2
3
4
5
6
7
8
9
10
11
12
13
14
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d'], index=[1,2,3])
df2 = pd.DataFrame(np.ones((3,4))*1, columns=['b','c','d', 'e'], index=[2,3,4])
print("df1:\n",df1)
print("df2:\n",df2)
res = pd.concat([df1, df2], axis=1, join='outer')
print("axis=1, join='outer'\n",res)
res = pd.concat([df1, df2], axis=1, join='inner')
print("axis=1, join='inner'\n",res)
res = pd.concat([df1, df2], axis=0, join='outer')
print("axis=0, join='outer'\n",res)
res = pd.concat([df1, df2], axis=0, join='inner')
print("axis=0, join='inner'\n",res)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
df1:
a b c d
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
df2:
b c d e
2 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
axis=1, join='outer'
a b c d b c d e
1 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
3 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
4 NaN NaN NaN NaN 1.0 1.0 1.0 1.0
axis=1, join='inner'
a b c d b c d e
2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
3 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
axis=0, join='outer'
a b c d e
1 0.0 0.0 0.0 0.0 NaN
2 0.0 0.0 0.0 0.0 NaN
3 0.0 0.0 0.0 0.0 NaN
2 NaN 1.0 1.0 1.0 1.0
3 NaN 1.0 1.0 1.0 1.0
4 NaN 1.0 1.0 1.0 1.0
axis=0, join='inner'
b c d
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 0.0 0.0 0.0
2 1.0 1.0 1.0
3 1.0 1.0 1.0
4 1.0 1.0 1.0

merge拼接

1
2
3
4
5
6
7
8
9
10
11
12
import pandas as pd
import numpy as np
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
print("left:\n",left)
print("right:\n",right)
res=pd.merge(left,right,on='key') # 按照key列合并
print("res:\n",res)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
left:
key A B
0 K0 A0 B0
1 K1 A1 B1
2 K2 A2 B2
3 K3 A3 B3
right:
key C D
0 K0 C0 D0
1 K1 C1 D1
2 K2 C2 D2
3 K3 C3 D3
res:
key A B C D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K2 A2 B2 C2 D2
3 K3 A3 B3 C3 D3

考虑多个key

how = [‘left’, ‘right’, ‘outer’, ‘inner’]

inner:必须所有key相同才保留

outer:不要求所有key相同,可能会产生含有NaN的行

left和right是使用到的两组数据名,可以以被选择的数据作为模板进行合并

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
import pandas as pd
import numpy as np
# consider two keys
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
print("left:\n",left)
print("right:\n",right)
res = pd.merge(left, right, on=['key1', 'key2'], how='inner') # default for how='inner'
# how = ['left', 'right', 'outer', 'inner']
print("res:\n",res)
res = pd.merge(left, right, on=['key1', 'key2'], how='left')
print("res:\n",res)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
left:
key1 key2 A B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
right:
key1 key2 C D
0 K0 K0 C0 D0
1 K1 K0 C1 D1
2 K1 K0 C2 D2
3 K2 K0 C3 D3
res:
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
res:
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN

使用indicator

1
2
3
4
5
6
7
8
9
10
11
12
import pandas as pd
import numpy as np
# indicator
df1 = pd.DataFrame({'col1':[0,1], 'col_left':['a','b']})
df2 = pd.DataFrame({'col1':[1,2,2],'col_right':[2,2,2]})
print("df1\n",df1)
print("df2\n",df2)
res = pd.merge(df1, df2, on='col1', how='outer', indicator=True)
print("res:\n",res)
# give the indicator a custom name(自定义名字)
res = pd.merge(df1, df2, on='col1', how='outer', indicator='indicator_column')
print("res:\n",res)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
df1
col1 col_left
0 0 a
1 1 b
df2
col1 col_right
0 1 2
1 2 2
2 2 2
res:
col1 col_left col_right _merge
0 0 a NaN left_only
1 1 b 2.0 both
2 2 NaN 2.0 right_only
3 2 NaN 2.0 right_only
res:
col1 col_left col_right indicator_column
0 0 a NaN left_only
1 1 b 2.0 both
2 2 NaN 2.0 right_only
3 2 NaN 2.0 right_only

left_index和right_index合并

合并行index

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
import pandas as pd
import numpy as np
# merged by index
lef = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']},
index=['K0', 'K1', 'K2'])
righ = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
'D': ['D0', 'D2', 'D3']},
index=['K0', 'K2', 'K3'])
print(lef)
print(righ)
# left_index and right_index
res = pd.merge(lef, righ, left_index=True, right_index=True, how='outer')
print("res:\n",res)
res = pd.merge(lef, righ, left_index=True, right_index=True, how='inner')
print("res:\n",res)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
     A   B
K0 A0 B0
K1 A1 B1
K2 A2 B2
C D
K0 C0 D0
K2 C2 D2
K3 C3 D3
res:
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
K3 NaN NaN C3 D3
res:
A B C D
K0 A0 B0 C0 D0
K2 A2 B2 C2 D2

suffixes处理会重复的列

1
2
3
4
5
6
7
import pandas as pd
import numpy as np
# handle overlapping
boys = pd.DataFrame({'k': ['K0', 'K1', 'K2'], 'age': [1, 2, 3]})
girls = pd.DataFrame({'k': ['K0', 'K0', 'K3'], 'age': [4, 5, 6]})
res = pd.merge(boys, girls, on='k', suffixes=['_boy', '_girl'], how='inner')
print(res)
1
2
3
    k  age_boy  age_girl
0 K0 1 4
1 K0 1 5


pandas教程
https://blog.algorithmpark.xyz/2023/09/02/language/pandas/index/
作者
CJL
发布于
2023年9月2日
更新于
2024年2月15日
许可协议