Python Basics: Common Pandas Methods with Examples

 

1. pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)

            data: supports many data types (list, dict, ndarray, scalar, etc.)

            index: optional; the row labels. When omitted, an integer index starting at 0 is used. Once created, the Index object is immutable; its labels can be read but not modified in place.

            dtype: the data type, optional

            name: the Series name (used as the column name inside a DataFrame), optional

import pandas as pd   # all of the examples below assume this import

# when index is omitted, the default index is an integer sequence starting at 0
s = pd.Series(['a', 'b', 'c'])
print(s)
--------------------------------------------------------
out:
0    a
1    b
2    c
dtype: object
========================================================

# passing an explicit index
s = pd.Series(['Leslie', 'Jack', 'Mike'], [2, 1, 3])
print(s)
--------------------------------------------------------
out:
2    Leslie
1    Jack
3    Mike
dtype: object
========================================================

# passing the data as a dict, whose keys become the index
s = pd.Series({2: 'Leslie', 1: 'Jack', 3: 'Mike'})
print(s)
--------------------------------------------------------
out:
2    Leslie
1    Jack
3    Mike
dtype: object
========================================================

# an index list also acts as a filter on the dict keys
s = pd.Series({2: 'Leslie', 1: 'Jack', 3: 'Mike'}, [2, 3])
print(s)
--------------------------------------------------------
out:
2    Leslie
3    Mike
dtype: object
========================================================

# specifying the column name with name
price = pd.Series(['68', '90'], name='price', index=['JAVA IN ACTION', 'Python Data Science Handbook'])
print(price)
--------------------------------------------------------
out:
JAVA IN ACTION                  68
Python Data Science Handbook    90
Name: price, dtype: object

 Note: when name is omitted, a Series placed into a DataFrame gets a default integer column label starting from 0, as the quick check below shows.
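
A quick check of this behavior (a minimal sketch):

s = pd.Series(['a', 'b'])       # no name given
print(pd.DataFrame(s))          # the single column is labeled 0
print(pd.DataFrame(s).columns)  # RangeIndex starting at 0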


2. pandas.DataFrame([data], [index]) builds a DataFrame from rows

    A DataFrame can be viewed as a set of pandas Series aligned on a shared row index; the most basic way to build one is from existing Series objects.

    data: the collection of rows, each one an already-built pandas Series

    index: the collection of row labels; when omitted, an integer index starting at 0 is used

java = pd.Series({'price': 68, 'count': 1})
python = pd.Series({'price': 90, 'count': 1})
frame = pd.DataFrame(data=[java, python], index=['JAVA IN ACTION', 'Python Data Science Handbook'])
print(frame)

Output:

 Note: the data and index arguments must be list-like collections; otherwise an error is raised.


3. pandas.DataFrame({dic}) builds a DataFrame from columns

    A dict of Series can be used to build the data column by column.

# the price column, one entry per book
price = pd.Series({'JAVA IN ACTION': 68, 'Python Data Science Handbook': 90})
# the count column, one entry per book
count = pd.Series({'JAVA IN ACTION': 1, 'Python Data Science Handbook': 1})
# build the DataFrame from a dict of Series
frame = pd.DataFrame({'price': price, 'count': count})
print(frame)

The result is the same as above: pandas aligns the data of each column by the shared row index. The sketch below shows what happens when the indexes do not fully match.
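
A minimal sketch of that alignment with hypothetical keys: labels present in only one Series are kept, and the missing cells become NaN:

price = pd.Series({'A': 68, 'B': 90})
count = pd.Series({'B': 1, 'C': 2})
print(pd.DataFrame({'price': price, 'count': count}))
# row A has NaN for count, row C has NaN for price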


 4. pandas.DataFrame([list]) builds a DataFrame from a list of Series

Note: the biggest difference between the list form and the dict form is orientation: with a dict each Series becomes a column, keyed by the row index, while with a list each Series becomes a row.

price1 = pd.Series(['68', '90'], name='price1', index=['JAVA IN ACTION', 'Python Data Science Handbook'])
count1 = pd.Series(['1', '1'], name='count1', index=['JAVA IN ACTION', 'Python Data Science Handbook'])
frame1 = pd.DataFrame([price1, count1])
print(frame1)

Compared with the previous example, when a DataFrame is built from a list, the entries are treated as rows by default; transposing recovers the column layout, as sketched below.
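
If the column-per-Series layout is wanted instead, transposing the result is the quickest fix:

# .T swaps rows and columns
print(frame1.T)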


 5. Selecting data with loc and iloc

data = pd.Series([28, 18, 32, 25], index=['Leslie', 'Rose', 'Jack', 'Mike'])

With an explicit (label-based) slice such as data['Leslie':'Jack'], the result includes the final label, Jack. (Note: for label slicing to work, the names must be the index, so they are passed as index above.)

With an implicit (position-based) slice such as data[0:2], the result excludes the final position.

To avoid confusion, prefer loc (explicit) and iloc (implicit).

data['Leslie':'Jack'] is equivalent to data.loc['Leslie':'Jack']

data[0:2] is equivalent to data.iloc[0:2]
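
A runnable check of the two access styles, using the Series defined above:

data = pd.Series([28, 18, 32, 25], index=['Leslie', 'Rose', 'Jack', 'Mike'])
print(data.loc['Leslie':'Jack'])   # label slice: Leslie, Rose, Jack (end label included)
print(data.iloc[0:2])              # position slice: Leslie, Rose (end position excluded)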

 

loc can also take a boolean condition to filter the data.

age = pd.Series({'Leslie': 28, 'Jack': 32, 'Rose': 18})
address = pd.Series({'Jack': 'Beijing', 'Rose': 'Shanghai', 'Leslie': 'Guangzhou'})
person = pd.DataFrame({'address': address, 'age': age})
print(person.loc[person['age'] < 30])

Output:

Filtering on multiple conditions:

age = pd.Series({'Leslie': 28, 'Jack': 32, 'Rose': 18})
address = pd.Series({'Jack': 'Beijing', 'Rose': 'Shanghai', 'Leslie': 'Guangzhou'})
person = pd.DataFrame({'address': address, 'age': age})
print(person.loc[(person['age'] < 30) & (person['age'] > 20)])


6. Multi-level row indexes

A row index can be split into multiple levels.

import numpy as np   # this and later examples also assume numpy

test = pd.DataFrame(data=np.random.rand(4, 2),
                    index=[['index0', 'index0', 'index1', 'index1'], [0, 1, 0, 1]],
                    columns=['column0', 'column1'])
print(test)

Output:

 

Naming the index levels makes the hierarchy easier to manage.

test1 = pd.DataFrame(data=np.random.rand(4, 2),
                     index=[['index0', 'index0', 'index1', 'index1'], [0, 1, 0, 1]],
                     columns=['column0', 'column1'])
test1.index.names = ['indexName0', 'indexName1']
print(test1)

Output:


7. Creating multi-level row indexes explicitly with pandas.MultiIndex

From parallel arrays, with MultiIndex.from_arrays():

data = [['Python Learning from Scratch', '1', '68'], ['Pro Apache Hadoop', '1', '105'], ['Python Crash Course', '2', '89'],
        ['Beginning Python From Novice', '1', '76'], ['Python Applications', '2', '120'], ['Deep Learning with TensorFlow', '1', '58']]
index = pd.MultiIndex.from_arrays([['Leslie', 'Leslie', 'Jack', 'Jack', 'Mike', 'Mike'], [2020, 2021, 2020, 2021, 2020, 2021]])
column = ['Book', 'Count', 'Price']
book = pd.DataFrame(data=data, index=index, columns=column)

From tuples of index values, with MultiIndex.from_tuples():

data = [['Python Learning from Scratch', '1', '68'], ['Pro Apache Hadoop', '1', '105'], ['Python Crash Course', '2', '89'],
        ['Beginning Python From Novice', '1', '76'], ['Python Applications', '2', '120'], ['Deep Learning with TensorFlow', '1', '58']]
index = pd.MultiIndex.from_tuples([('Leslie', 2020), ('Leslie', 2021), ('Jack', 2020), ('Jack', 2021), ('Mike', 2020), ('Mike', 2021)])
column = ['Book', 'Count', 'Price']
book = pd.DataFrame(data=data, index=index, columns=column)

From a Cartesian product, with MultiIndex.from_product():

data = [['Python Learning from Scratch', '1', '68'], ['Pro Apache Hadoop', '1', '105'], ['Python Crash Course', '2', '89'],
        ['Beginning Python From Novice', '1', '76'], ['Python Applications', '2', '120'], ['Deep Learning with TensorFlow', '1', '58']]
index = pd.MultiIndex.from_product([['Leslie', 'Jack', 'Mike'], [2020, 2021]])
column = ['Book', 'Count', 'Price']
book = pd.DataFrame(data=data, index=index, columns=column)

All three approaches produce the same result; they simply suit different situations.


8. Raising and lowering the dimension of a multi-level index

Continuing with the example above, stack(level) raises the dimension of a DataFrame by moving column labels into the row index, while unstack(level) lowers it by pivoting a row index level out into columns.

Note: both operations return a copy of the data; modifying the copy does not affect the original.

data = [['Python Learning from Scratch', 1, 68], ['Pro Apache Hadoop', 1, 105], ['Python Crash Course', 2, 89],
        ['Beginning Python From Novice', 1, 76], ['Python Applications', 2, 120], ['Deep Learning with TensorFlow', 1, 58]]
index = pd.MultiIndex.from_tuples([('Leslie', 2020), ('Leslie', 2021), ('Jack', 2020), ('Jack', 2021), ('Mike', 2020), ('Mike', 2021)])
column = ['Book', 'Count', 'Price']
book = pd.DataFrame(data=data, index=index, columns=column)
# total price per row
total = book['Price'] * book['Count']
print(str(total) + '\n')
# lower the dimension: pivot the inner index level (the year) into columns
print(total.unstack())

Output:

 

The level parameter selects which index level to pivot: level=0 converts the first level of the row index (the name here), while level=1 converts the second level (the year).

data = [['Python Learning from Scratch', 1, 68], ['Pro Apache Hadoop', 1, 105], ['Python Crash Course', 2, 89],
        ['Beginning Python From Novice', 1, 76], ['Python Applications', 2, 120], ['Deep Learning with TensorFlow', 1, 58]]
index = pd.MultiIndex.from_tuples([('Leslie', 2020), ('Leslie', 2021), ('Jack', 2020), ('Jack', 2021), ('Mike', 2020), ('Mike', 2021)])
column = ['Book', 'Count', 'Price']
book = pd.DataFrame(data=data, index=index, columns=column)
# total price per row
total = book['Price'] * book['Count']
# pivot the first index level (the name) into columns
print(total.unstack(level=0))

The result is exactly the transpose of the previous example; setting level to 1 reproduces the earlier output.

 

stack raises the dimension of the data; its level parameter behaves like that of unstack.

data = [['Python Learning from Scratch', 1, 68], ['Pro Apache Hadoop', 1, 105], ['Python Crash Course', 2, 89],
        ['Beginning Python From Novice', 1, 76], ['Python Applications', 2, 120], ['Deep Learning with TensorFlow', 1, 58]]
index = pd.MultiIndex.from_tuples([('Leslie', 2020), ('Leslie', 2021), ('Jack', 2020), ('Jack', 2021), ('Mike', 2020), ('Mike', 2021)])
column = ['Book', 'Count', 'Price']
book = pd.DataFrame(data=data, index=index, columns=column)
print(book.stack())

Output:

 

Two other common index-resetting methods are reset_index() and set_index().

reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill='') turns row labels into columns

  level: default None; remove only the given levels from the index. By default all levels are removed.

  drop: default False; if True, do not insert the old index as a column, and reset to the default integer index.

  inplace: bool, default False; modify the DataFrame in place rather than creating a new object.

  col_level: int or str, default 0; if the columns have multiple levels, the level the labels are inserted into. By default the first level.

  col_fill: object, default ''; if the columns have multiple levels, how the other levels are named. If None, the index name is repeated.

age = pd.Series({'Leslie': 28, 'Jack': 32, 'Rose': 18})
address = pd.Series({'Jack': 'Beijing', 'Rose': 'Shanghai', 'Leslie': 'Guangzhou'})
person = pd.DataFrame({'address': address, 'age': age})
print(str(person) + "\n")
# turn the name index into a column; the new column is named 'index' by default
person = person.reset_index()
# rename that column to 'name'
person.rename(columns={'index': 'name'}, inplace=True)
print(person)

Output:

 

set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)

  keys: label, array, or list of labels/arrays; the column name(s) to turn into the index, single or multiple
  drop: bool, default True; delete the columns that are used as the new index
  append: bool, default False; append to the existing index instead of replacing it
  inplace: bool, default False; whether to modify the DataFrame in place
  verify_integrity: bool, default False; check the new index for duplicates

age = pd.Series({'Leslie': 28, 'Jack': 32, 'Rose': 18})
address = pd.Series({'Jack': 'Beijing', 'Rose': 'Shanghai', 'Leslie': 'Guangzhou'})
person = pd.DataFrame({'address': address, 'age': age})
print(str(person) + "\n")
# turn the name row index into a column; the default column name is 'index'
person = person.reset_index()
# rename that column to 'name'
person.rename(columns={'index': 'name'}, inplace=True)
print(str(person) + "\n")
# turn the name column back into a row index
person = person.set_index(['name'], append=True)
print(person)

Output:


9. Adding a column to a DataFrame with insert

insert(loc, column, value, allow_duplicates=False) adds a column directly to a DataFrame

  • loc: the positional index at which to insert the new column
  • column: the column name
  • value: the data for the new column
age = pd.Series({'Leslie': 28, 'Jack': 32, 'Rose': 18})
address = pd.Series({'Jack': 'Beijing', 'Rose': 'Shanghai', 'Leslie': 'Guangzhou'})
person = pd.DataFrame({'address': address, 'age': age})
person.insert(2, 'sex', ['male', 'male', 'female'])
print(person)

Output:


10. Sorting with sort_index

If a MultiIndex is not sorted, slicing it will often raise an error such as UnsortedIndexError. (Note: sorting returns a copy of the original data; modifying the copy does not change the original.)

 In that case, sort the data with sort_index() before slicing. (The older sortlevel() method has been deprecated in favor of sort_index().)

data = [['Python Learning from Scratch', 1, 68], ['Pro Apache Hadoop', 1, 105], ['Python Crash Course', 2, 89],
        ['Beginning Python From Novice', 1, 76], ['Python Applications', 2, 120], ['Deep Learning with TensorFlow', 1, 58]]
index = pd.MultiIndex.from_tuples([('Leslie', 2020), ('Leslie', 2021), ('Jack', 2020), ('Jack', 2021), ('Mike', 2020), ('Mike', 2021)])
column = ['Book', 'Count', 'Price']
book = pd.DataFrame(data=data, index=index, columns=column)
# sort by index first
book = book.sort_index()
print(str(book.loc['Leslie':, :]) + '\n')
print(book.loc[('Leslie', 2021):, :'Count'])

Output:


11. Statistics over a multi-level index

Methods such as mean(), sum(), and max() can aggregate data over a multi-level index, and the level parameter selects which dimension the statistic is computed over.

data = [['Python Learning from Scratch', 1, 68], ['Pro Apache Hadoop', 1, 105], ['Python Crash Course', 2, 89],
        ['Beginning Python From Novice', 1, 76], ['Python Applications', 2, 120], ['Deep Learning with TensorFlow', 1, 58]]
index = pd.MultiIndex.from_tuples([('Leslie', 2020), ('Leslie', 2021), ('Jack', 2020), ('Jack', 2021), ('Mike', 2020), ('Mike', 2021)])
column = ['Book', 'Count', 'Price']
book = pd.DataFrame(data=data, index=index, columns=column)
book = book.sort_index()
# the original data
print(str(book) + '\n')
# totals per name
print(str(book.sum(level=0)) + '\n')
# means per year
print(str(book.mean(level=1)) + '\n')
# maxima per year
print(book.max(level=1))

Output: note that when computing the means and sums, pandas automatically ignores non-matching fields such as Book.
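
As a side note, recent pandas releases deprecate and then remove the level argument on these aggregation methods; assuming pandas 1.5 or later, the equivalent groupby form is the safer choice (a sketch):

# equivalent aggregations expressed through groupby on an index level
print(book.groupby(level=0).sum(numeric_only=True))    # totals per name
print(book.groupby(level=1).mean(numeric_only=True))   # means per year
print(book.groupby(level=1).max(numeric_only=True))    # maxima per year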


12. Simple concatenation with pandas.concat

pd.concat(objs, axis=0, join="outer", ignore_index=False, keys=None, levels=None,
          names=None, verify_integrity=False, sort=False, copy=True)

  • objs: a sequence (list) of Series or DataFrame objects
  • axis: the axis to concatenate along; 0 for rows, 1 for columns
  • join: the join mode, inner or outer
  • ignore_index: whether to reset the index
  • verify_integrity: raise an error on duplicate index values

By default, concat aligns the frames on their columns, marks missing columns with NaN, and allows duplicate index values (set ignore_index to True if duplicates are unwanted).
If verify_integrity is set to True, a duplicate index raises an exception, as the sketch below shows.
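
A minimal sketch of those two parameters, using two small hypothetical frames that share the default 0..1 index:

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3, 4]})
print(pd.concat([a, b], ignore_index=True))    # the index is renumbered 0..3
try:
    pd.concat([a, b], verify_integrity=True)   # indexes 0 and 1 appear twice
except ValueError as e:
    print('duplicate index:', e)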

data2 = [['Python Learning from Scratch', 68, 'Eric Matthes'], ['Pro Apache Hadoop', 72, 'Magnus Lie'], ['Python Crash Course', 98, 'Wes Mckinney']]
data3 = [['Beginning Python From Novice', 'Brandon Rhodes'], ['Python Applications', 'John Goerzen'], ['Deep Learning with TensorFlow', 'Md Rezaul']]

column2 = ['Book', 'Price', 'Author']
column3 = ['Book', 'Author']

book2 = pd.DataFrame(data=data2, columns=column2).sort_index()
book3 = pd.DataFrame(data=data3, columns=column3).sort_index()

print(pd.concat([book2, book3]))

Output:

To drop the incomplete columns, set the join parameter to 'inner'.

data2 = [['Python Learning from Scratch', 68, 'Eric Matthes'], ['Pro Apache Hadoop', 72, 'Magnus Lie'], ['Python Crash Course', 98, 'Wes Mckinney']]
data3 = [['Beginning Python From Novice', 'Brandon Rhodes'], ['Python Applications', 'John Goerzen'], ['Deep Learning with TensorFlow', 'Md Rezaul']]

column2 = ['Book', 'Price', 'Author']
column3 = ['Book', 'Author']

book2 = pd.DataFrame(data=data2, columns=column2).sort_index()
book3 = pd.DataFrame(data=data3, columns=column3).sort_index()

print(pd.concat([book2, book3], join='inner'))

Output:


13. Merging and joining with merge

pandas.merge(left, right, how="inner", on=None, left_on=None, right_on=None,
             left_index=False, right_index=False, sort=False,
             suffixes=("_x", "_y"), copy=True, indicator=False, validate=None)

  • left: the left dataset
  • right: the right dataset
  • how: the join type; inner (default), outer, left, or right
  • on: the join key; when omitted, the columns common to left and right are used
  • left_on: the join key column name(s) in the left dataset
  • right_on: the join key column name(s) in the right dataset
  • left_index: whether to use the left index as the join key
  • right_index: whether to use the right index as the join key
  • sort: whether to sort the result
  • suffixes: suffixes appended to duplicated column names
  • copy: default True; merge into copied data
  • indicator: adds a _merge column recording whether each row came from the left side only, the right side only, or both (see the sketch after this list)
  • validate: the expected relation (1:1 one-to-one, 1:m one-to-many, m:1 many-to-one, m:m many-to-many)
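
A minimal sketch of indicator and validate, using two hypothetical frames:

left = pd.DataFrame({'Book': ['A', 'B'], 'Author': ['x', 'y']})
right = pd.DataFrame({'Book': ['B', 'C'], 'Price': [10, 20]})
merged = pd.merge(left, right, on='Book', how='outer',
                  indicator=True,   # adds a _merge column: left_only / right_only / both
                  validate='1:1')   # raises MergeError if the key repeats on either side
print(merged)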

merge is the most commonly used way to combine data; its usage closely mirrors joins in SQL databases, and it supports one-to-one, one-to-many, and many-to-many relations.
merge also represents missing values with NaN; unlike concat, it generates a new index by default.
The join key is configured with the on parameter; when omitted, the columns shared by left and right are used, which in this example is the Book column.

_book = [['Python Learning from Scratch', 'Eric Matthes'], ['Pro Apache Hadoop', 'Magnus Lie'], ['Python Crash Course', 'Wes Mckinney'],
         ['Beginning Python From Novice', 'Brandon Rhodes'], ['Python Applications', 'John Goerzen'], ['Deep Learning with TensorFlow', 'Md Rezaul']]
column1 = ['Book', 'Author']
book = pd.DataFrame(data=_book, columns=column1)

_price = [['Python Learning from Scratch', 68, 2], ['Pro Apache Hadoop', 105, 3], ['Python Crash Course', 89, 1],
          ['Beginning Python From Novice', 76, 2], ['Python Applications', 120], ['Deep Learning with TensorFlow', 58, 3]]

column2 = ['Book', 'Price', 'Count']
price = pd.DataFrame(data=_price, columns=column2)

print(pd.merge(book, price, on='Book'))

Output: the book at index 4 has no Count value, so it defaults to NaN.

When the key columns have different names, specify them separately with left_on and right_on.

_book = [['Python Learning from Scratch', 'Eric Matthes'], ['Pro Apache Hadoop', 'Magnus Lie'], ['Python Crash Course', 'Wes Mckinney'],
         ['Beginning Python From Novice', 'Brandon Rhodes'], ['Python Applications', 'John Goerzen'], ['Deep Learning with TensorFlow', 'Md Rezaul']]
column1 = ['Name', 'Author']
book = pd.DataFrame(data=_book, columns=column1)

_price = [['Python Learning from Scratch', 68, 2], ['Pro Apache Hadoop', 105, 3], ['Python Crash Course', 89, 1],
          ['Beginning Python From Novice', 76, 2], ['Python Applications', 120], ['Deep Learning with TensorFlow', 58, 3]]
column2 = ['Book', 'Price', 'Count']
price = pd.DataFrame(data=_price, columns=column2)

pd.set_option('display.max_columns', None)
print(pd.merge(book, price, left_on='Name', right_on='Book'))

Output:

To avoid showing both key columns, Name and Book, the duplicate can be removed with drop().

_book = [['Python Learning from Scratch', 'Eric Matthes'], ['Pro Apache Hadoop', 'Magnus Lie'], ['Python Crash Course', 'Wes Mckinney'],
         ['Beginning Python From Novice', 'Brandon Rhodes'], ['Python Applications', 'John Goerzen'], ['Deep Learning with TensorFlow', 'Md Rezaul']]
column1 = ['Name', 'Author']
book = pd.DataFrame(data=_book, columns=column1)

_price = [['Python Learning from Scratch', 68, 2], ['Pro Apache Hadoop', 105, 3], ['Python Crash Course', 89, 1],
          ['Beginning Python From Novice', 76, 2], ['Python Applications', 120], ['Deep Learning with TensorFlow', 58, 3]]
column2 = ['Book', 'Price', 'Count']
price = pd.DataFrame(data=_price, columns=column2)

pd.set_option('display.max_columns', None)
print(pd.merge(book, price, left_on='Name', right_on='Book').drop('Name', axis=1))

Output:

Merging on the index is also possible, through left_index and right_index.

_book = [['Python Learning from Scratch', 'Eric Matthes'], ['Pro Apache Hadoop', 'Magnus Lie'], ['Python Crash Course', 'Wes Mckinney'],
         ['Beginning Python From Novice', 'Brandon Rhodes'], ['Python Applications', 'John Goerzen'], ['Deep Learning with TensorFlow', 'Md Rezaul']]
column1 = ['Name', 'Author']
book = pd.DataFrame(data=_book, columns=column1)

_price = [['Python Learning from Scratch', 68, 2], ['Pro Apache Hadoop', 105, 3], ['Python Crash Course', 89, 1],
          ['Beginning Python From Novice', 76, 2], ['Python Applications', 120, 1], ['Deep Learning with TensorFlow', 58, 3]]
column2 = ['Book', 'Price', 'Count']
price = pd.DataFrame(data=_price, columns=column2)

pd.set_option('display.max_columns', None)
print(pd.merge(book, price, left_index=True, right_index=True).drop('Name', axis=1))

Output:

The examples above all use the default inner join (how='inner'), which returns the intersection of the data; setting how='outer' returns the union instead.

The book frame below has no entry for Deep Learning with TensorFlow, so by default the merged result contains only 5 rows.

_book = [['Python Learning from Scratch', 'Eric Matthes'], ['Pro Apache Hadoop', 'Magnus Lie'], ['Python Crash Course', 'Wes Mckinney'],
         ['Beginning Python From Novice', 'Brandon Rhodes'], ['Python Applications', 'John Goerzen']]
column1 = ['Name', 'Author']
book = pd.DataFrame(data=_book, columns=column1)

_price = [['Python Learning from Scratch', 68, 2], ['Pro Apache Hadoop', 105, 3], ['Python Crash Course', 89, 1],
          ['Beginning Python From Novice', 76, 2], ['Python Applications', 120, 1], ['Deep Learning with TensorFlow', 58, 3]]
column2 = ['Book', 'Price', 'Count']
price = pd.DataFrame(data=_price, columns=column2)

pd.set_option('display.max_columns', None)
print(pd.merge(book, price, left_index=True, right_index=True, how='inner').drop('Name', axis=1))

Output:

With how set to 'outer', the output becomes:

 

Similarly, setting how to left or right produces a left or right join.

_book = [['Python Learning from Scratch', 'Eric Matthes'], ['Pro Apache Hadoop', 'Magnus Lie'], ['Python Crash Course', 'Wes Mckinney'],
         ['Beginning Python From Novice', 'Brandon Rhodes'], ['Python Applications', 'John Goerzen']]
column1 = ['Name', 'Author']
book = pd.DataFrame(data=_book, columns=column1)

_price = [['Python Learning from Scratch', 68, 2], ['Pro Apache Hadoop', 105, 3], ['Python Crash Course', 89, 1],
          ['Beginning Python From Novice', 76, 2], ['Deep Learning with TensorFlow', 58, 3]]
column2 = ['Book', 'Price', 'Count']
price = pd.DataFrame(data=_price, columns=column2)

pd.set_option('display.max_columns', None)
print(pd.merge(book, price, left_on='Name', right_on='Book', how='left').drop('Name', axis=1))

Output:


 14. Column statistics with describe

pandas also has a very convenient describe function, which computes a set of common statistics (count, mean, std, min, and so on) for every numeric column.

_book = [['Python Learning from Scratch', 'Python', 68, 2], ['Pro Apache Hadoop', 'Hadoop', 105, 3], ['Python Crash Course', 'Python', 89, 1],
         ['Beginning Python From Novice', 'Python', 76, 4], ['Deep Learning with TensorFlow', 'TensorFlow', 58, 3], ['Hadoop:The Definitive Guide', 'Hadoop', 99, 3]]
column = ['Book', 'Type', 'Price', 'Count']
book = pd.DataFrame(data=_book, columns=column)
print(book.describe())

Output:


15. Grouped computation with groupby

groupby splits the data into groups before computing. Common aggregations include count (number of rows), mean, median, min, max, std (standard deviation), var (variance), mad (mean absolute deviation), prod (product of all items), and sum.

_book = [['Python Learning from Scratch', 'Python', 68, 2], ['Pro Apache Hadoop', 'Hadoop', 105, 3], ['Python Crash Course', 'Python', 89, 1],
         ['Beginning Python From Novice', 'Python', 76, 2], ['Deep Learning with TensorFlow', 'TensorFlow', 58, 3], ['Hadoop:The Definitive Guide', 'Hadoop', 99, 3]]
column = ['Book', 'Type', 'Price', 'Count']
book = pd.DataFrame(data=_book, columns=column)

print(book.groupby('Type').sum())

Output:

 

The grouped computation can also target a single column.

_book = [['Python Learning from Scratch', 'Python', 68, 2], ['Pro Apache Hadoop', 'Hadoop', 105, 3], ['Python Crash Course', 'Python', 89, 1],
         ['Beginning Python From Novice', 'Python', 76, 4], ['Deep Learning with TensorFlow', 'TensorFlow', 58, 3], ['Hadoop:The Definitive Guide', 'Hadoop', 99, 3]]
column = ['Book', 'Type', 'Price', 'Count']
book = pd.DataFrame(data=_book, columns=column)
print(str(book) + '\n')
print(book.groupby('Type')['Count'].describe())

Output:

 

Beyond plain calculations, groups also support aggregate (accumulation), filter, transform, and apply operations.

aggregate can apply a different accumulation to each column; the example below computes the average price and the total sales count for each book type.

_book = [['Python Learning from Scratch', 'Python', 68, 2], ['Pro Apache Hadoop', 'Hadoop', 105, 3], ['Python Crash Course', 'Python', 89, 1],
         ['Beginning Python From Novice', 'Python', 76, 4], ['Deep Learning with TensorFlow', 'TensorFlow', 58, 3], ['Hadoop:The Definitive Guide', 'Hadoop', 99, 3]]
column = ['Book', 'Type', 'Price', 'Count']
book = pd.DataFrame(data=_book, columns=column)
print(str(book) + '\n')
print(book.groupby('Type').aggregate({'Price': 'mean', 'Count': 'sum'}))

Output:

 

filter is the familiar conditional filter: only groups that satisfy the condition are included in the result.
The func argument receives one group's dataset and returns a bool, which decides whether that group passes the filter.
The example below finds the book types whose total sales exceed 6.

def func(x):
    return sum(x['Count']) > 6

_book = [['Python Learning from Scratch', 'Python', 68, 2], ['Pro Apache Hadoop', 'Hadoop', 105, 3], ['Python Crash Course', 'Python', 89, 1],
         ['Beginning Python From Novice', 'Python', 76, 4], ['Deep Learning with TensorFlow', 'TensorFlow', 58, 3], ['Hadoop:The Definitive Guide', 'Hadoop', 99, 3]]
column = ['Book', 'Type', 'Price', 'Count']
book = pd.DataFrame(data=_book, columns=column)
print(str(book) + '\n')
print(book.groupby('Type').filter(func))

Output:

 

transform runs a computation over all the data in each group and returns a brand-new dataset of the same shape; the most common use is computing each value's deviation from its group mean.

_book = [['Python Learning from Scratch', 'Python', 68, 2], ['Pro Apache Hadoop', 'Hadoop', 105, 3], ['Python Crash Course', 'Python', 89, 1],
         ['Beginning Python From Novice', 'Python', 76, 4], ['Deep Learning with TensorFlow', 'TensorFlow', 58, 3], ['Hadoop:The Definitive Guide', 'Hadoop', 99, 3]]
column = ['Book', 'Type', 'Price', 'Count']
book = pd.DataFrame(data=_book, columns=column)
print(str(book) + '\n')
print(book.groupby('Type')['Price'].transform(lambda x: x - x.mean()))

Output:

 

apply runs an arbitrary function on each group; the only difference is that its input is a DataFrame and its return value is a dataset.
The example below computes each book's share of sales within its group.

def data(x):
    x.insert(4, 'Rate', '')
    x['Rate'] = x['Count'] / sum(x['Count']) * 100
    return x

_book = [['Python Learning from Scratch', 'Python', 68, 2], ['Pro Apache Hadoop', 'Hadoop', 105, 3], ['Python Crash Course', 'Python', 89, 1],
         ['Beginning Python From Novice', 'Python', 76, 4], ['Deep Learning with TensorFlow', 'TensorFlow', 58, 3], ['Hadoop:The Definitive Guide', 'Hadoop', 99, 3]]
column = ['Book', 'Type', 'Price', 'Count']
book = pd.DataFrame(data=_book, columns=column)

print(book.groupby('Type').apply(data).sort_values('Type'))

Output:

Besides columns, groupby can split on indexes, data, lists, and other keys, provided the array is the same length as the DataFrame.
The example below groups the data by a predefined list.

_book = [['Python Learning from Scratch', 'Python', 68, 2], ['Pro Apache Hadoop', 'Hadoop', 105, 3], ['Python Crash Course', 'Python', 89, 1],
         ['Beginning Python From Novice', 'Python', 76, 4], ['Deep Learning with TensorFlow', 'TensorFlow', 58, 3], ['Hadoop:The Definitive Guide', 'Hadoop', 99, 3]]
column = ['Book', 'Type', 'Price', 'Count']
book = pd.DataFrame(data=_book, columns=column)
print(str(book) + '\n')
index = [0, 1, 0, 2, 1, 3]
print(book.groupby(index).sum())

Output:

A dict can also be used to group the data.
The example below maps books of Type Python and TensorFlow into an AI category and Type Hadoop into a BD category before aggregating.

_book = [['Python Learning from Scratch', 'Python', 68, 2], ['Pro Apache Hadoop', 'Hadoop', 105, 3], ['Python Crash Course', 'Python', 89, 1],
         ['Beginning Python From Novice', 'Python', 76, 4], ['Deep Learning with TensorFlow', 'TensorFlow', 58, 3], ['Hadoop:The Definitive Guide', 'Hadoop', 99, 3]]
column = ['Book', 'Type', 'Price', 'Count']
book = pd.DataFrame(data=_book, columns=column).set_index('Type')
print(str(book) + '\n')
mapping = {'Python': 'AI', 'TensorFlow': 'AI', 'Hadoop': 'BD'}
print(book.groupby(mapping).sum())

Output:

 

Data can also be grouped by a combination of keys, which returns a result with a multi-level index.
The example below again maps Python and TensorFlow into AI and Hadoop into BD, and within each category the list key subdivides the statistics further.

_book = [['Python Learning from Scratch', 'Python', 68, 2], ['Pro Apache Hadoop', 'Hadoop', 105, 3], ['Python Crash Course', 'Python', 89, 1],
         ['Beginning Python From Novice', 'Python', 76, 4], ['Deep Learning with TensorFlow', 'TensorFlow', 58, 3], ['Hadoop:The Definitive Guide', 'Hadoop', 99, 3]]
column = ['Book', 'Type', 'Price', 'Count']
book = pd.DataFrame(data=_book, columns=column).set_index('Type')
print(str(book) + '\n')
index = [0, 2, 0, 0, 1, 2]
mapping = {'Python': 'AI', 'TensorFlow': 'AI', 'Hadoop': 'BD'}
print(book.groupby([mapping, index]).sum())

Output:


 16. Pivot tables with pivot_table

Imagine a dataset holding each book's development language (Language), type (Type), unit price (Price), and sales count (Count), and suppose we want the average Price grouped by both Language and Type. With the previous section's tools, this can be done through groupby:

_book = [['Python Learning from Scratch', 'Python', 'AI', 68, 2], ['Pro Apache Hadoop', 'Hadoop', 'BG', 105, 3], ['Python Crash Course', 'Python', 'AI', 89, 1],
         ['Beginning Python From Novice', 'Python', 'AI', 76, 4], ['Deep Learning with TensorFlow', 'TensorFlow', 'AI', 58, 3],
         ['Hadoop:The Definitive Guide', 'Hadoop', 'BG', 99, 3], ['HBase: The Definitive Guide', 'HBase', 'BG', 108, 2], ['HBase In Action', 'HBase', 'BG', 79, 2]]
column = ['Book', 'Language', 'Type', 'Price', 'Count']
book = pd.DataFrame(data=_book, columns=column)
print(str(book) + '\n')

print(str(book.groupby(['Language', 'Type'])['Price'].mean().unstack()) + '\n')

Output:

 

 However, this is verbose and hard to read; a developer often has to study it for a while to see the intent. For this reason pandas provides a dedicated method:

pivot_table(values=None, index=None, columns=None, aggfunc="mean",
            fill_value=None, margins=False, dropna=True, margins_name="All", observed=False)

  • values: optional; the values to aggregate, used much like the values of pivot. By default all value columns are shown.
  • index: the keys to group by on the rows; an array used here must be the same length as the data.
  • columns: the keys to group by on the columns.
  • aggfunc: the aggregation function; every value in the resulting dataframe is computed through aggfunc, which defaults to the mean. It accepts several forms:
       aggfunc = [ np.mean ]
       aggfunc = [ np.sum, np.mean ]
       aggfunc = { 'Price': 'mean' }
       aggfunc = { 'Price': [np.mean] }
       aggfunc = { 'Price': np.mean, 'Count': np.sum }
       aggfunc = { 'Price': 'mean', 'Count': 'sum' }
  • fill_value: the value used to fill NA cells; no filling by default.
  • margins: add row and column totals; not shown by default.
  • dropna: drop columns whose entries are all NA; dropped by default.
  • margins_name: the name of the totals row/column when margins is True

 The following achieves the same result far more simply and readably; because aggfunc defaults to the mean, it can be omitted when a single column is being aggregated.

_book = [['Python Learning from Scratch', 'Python', 'AI', 68, 2], ['Pro Apache Hadoop', 'Hadoop', 'BG', 105, 3], ['Python Crash Course', 'Python', 'AI', 89, 1],
         ['Beginning Python From Novice', 'Python', 'AI', 76, 4], ['Deep Learning with TensorFlow', 'TensorFlow', 'AI', 58, 3],
         ['Hadoop:The Definitive Guide', 'Hadoop', 'BG', 99, 3], ['HBase: The Definitive Guide', 'HBase', 'BG', 108, 2], ['HBase In Action', 'HBase', 'BG', 79, 2]]
column = ['Book', 'Language', 'Type', 'Price', 'Count']
book = pd.DataFrame(data=_book, columns=column)
print(str(book) + '\n')
print(book.pivot_table(values='Price', index='Language', columns='Type'))

Output:

 

 For computations over several columns, the aggfunc parameter can assign a different function to each column; the example below computes the average Price and the total Count.

_book = [['Python Learning from Scratch', 'Python', 'AI', 68, 2], ['Pro Apache Hadoop', 'Hadoop', 'BG', 105, 3], ['Python Crash Course', 'Python', 'AI', 89, 1],
         ['Beginning Python From Novice', 'Python', 'AI', 76, 4], ['Deep Learning with TensorFlow', 'TensorFlow', 'AI', 58, 3],
         ['Hadoop:The Definitive Guide', 'Hadoop', 'BG', 99, 3], ['HBase: The Definitive Guide', 'HBase', 'BG', 108, 2], ['HBase In Action', 'HBase', 'BG', 79, 2]]
column = ['Book', 'Language', 'Type', 'Price', 'Count']
book = pd.DataFrame(data=_book, columns=column)
print(str(book) + '\n')
print(book.pivot_table(index='Language', columns='Type', aggfunc={'Price': np.mean, 'Count': 'sum'}))

Output:


17. High-performance column arithmetic with eval and query

pandas also provides an eval function for fast arithmetic on a DataFrame, including quickly generating new columns.
The example below computes the total price as unit price times count and inserts it into the DataFrame.

data = [['Python Learning from Scratch', 1, 68], ['Pro Apache Hadoop', 1, 105], ['Python Crash Course', 2, 89],
        ['Beginning Python From Novice', 1, 76], ['Python Applications', 2, 120], ['Deep Learning with TensorFlow', 1, 58]]
column = ['Book', 'Count', 'Price']
book = pd.DataFrame(data=data, index=None, columns=column)
print(str(book) + '\n')
book.eval('Total=Price*Count', inplace=True)
print(book)

Output:

 

 

eval can also use outside variables; the example below computes the difference between each Price and the average price.
Note that an external variable used inside an eval expression must be prefixed with the @ symbol.

data = [['Python Learning from Scratch', 1, 68], ['Pro Apache Hadoop', 1, 105], ['Python Crash Course', 2, 89],
        ['Beginning Python From Novice', 1, 76], ['Python Applications', 2, 120], ['Deep Learning with TensorFlow', 1, 58]]
column = ['Book', 'Count', 'Price']
book = pd.DataFrame(data=data, index=None, columns=column)
print(str(book) + '\n')
avg = book['Price'].mean()
book.eval('Dif=Price-@avg', inplace=True)
print(book)

 

In addition to eval for calculations, pandas provides query for conditional selection.
The example below prints the books priced above the average.

data = [['Python Learning from Scratch', 1, 68], ['Pro Apache Hadoop', 1, 105], ['Python Crash Course', 2, 89],
        ['Beginning Python From Novice', 1, 76], ['Python Applications', 2, 120], ['Deep Learning with TensorFlow', 1, 58]]
column = ['Book', 'Count', 'Price']
book = pd.DataFrame(data=data, index=None, columns=column)
print(str(book) + '\n')
avg = book['Price'].mean()
result = book.query('Price>@avg')
print(result)

Output:

Because eval() and query() are implemented on top of the numexpr library, they are faster than the equivalent repeated Python operations and use less memory, so they are recommended when computing over large datasets.
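
A rough way to see the difference on a larger frame (a sketch; exact numbers vary by machine, and pandas falls back to plain Python evaluation when numexpr is not installed):

import time

df = pd.DataFrame(np.random.rand(1_000_000, 2), columns=['Price', 'Count'])

start = time.perf_counter()
_ = df['Price'] * df['Count']
print('plain:', time.perf_counter() - start)

start = time.perf_counter()
_ = df.eval('Price * Count')
print('eval :', time.perf_counter() - start)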

 
Source: https://www.cnblogs.com/leslies2/p/14764130.html
