Advanced Pandas Notes (1): GroupBy Key Points and Pitfalls


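If all Pandas could do was put data into a tidy DataFrame, it would never have become the central component of the data analysis stack. In practice, data is described through a series of summary statistics, and conclusions come from grouping the data and comparing the groups against each other.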

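GroupBy is exactly that kind of tool. SQL had already been the standard tool of advanced data analysts for decades before Pandas appeared, in large part because of its standard SELECT xx FROM xx WHERE condition GROUP BY xx HAVING condition pattern.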

Thanks to Wes McKinney and his team, we now have a more flexible and adaptable tool alongside SQL, instead of being stuck in a SQL shell or in plain Python.

[Example] Expressing a SQL statement in Pandas

SQL

SELECT Column1, Column2, AVG(Column3), SUM(Column4)
FROM SomeTable
WHERE Condition1
GROUP BY Column1, Column2
HAVING Condition2

Pandas

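df[Condition1].groupby([Column1, Column2], as_index=False).agg({Column3: "mean", Column4: "sum"}).query(Condition2)

Here Condition2 is an expression on the aggregated columns. HAVING is a row filter on the aggregated result, so it maps to query (or boolean indexing); DataFrame.filter selects row or column labels and cannot express it.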
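To make the mapping concrete, here is a minimal sketch on a made-up table. The frame `sales`, its column names (`region`, `product`, `price`, `qty`) and both conditions are illustrative assumptions, not part of the original example. WHERE becomes a boolean mask before grouping, and HAVING becomes a row filter on the aggregated result.

```python
import pandas as pd

# Hypothetical table standing in for SomeTable.
sales = pd.DataFrame({
    "region":  ["east", "east", "east", "west", "west", "west"],
    "product": ["a", "a", "b", "a", "a", "b"],
    "price":   [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "qty":     [1, 2, 6, 4, 5, 0],
})

# SELECT region, product, AVG(price), SUM(qty)
# FROM sales
# WHERE qty > 0
# GROUP BY region, product
# HAVING SUM(qty) > 3
result = (
    sales[sales["qty"] > 0]                            # WHERE
    .groupby(["region", "product"], as_index=False)    # GROUP BY
    .agg({"price": "mean", "qty": "sum"})              # AVG / SUM
    .query("qty > 3")                                  # HAVING, on aggregated rows
    .reset_index(drop=True)
)
print(result)
```

After aggregation the column `qty` holds the per-group sum, which is why the HAVING condition can refer to it by name.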

Group By: split - apply - combine

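GroupBy can be broken down into three steps:

Splitting: divide the data into groups by key
Applying: run a function on each group independently
Combining: assemble the per-group results into one object

So how is this smooth sequence actually carried out?

Splitting is done by groupby
Applying is done by agg, apply, transform and filter, which perform the actual work
Combining is handled by pandas itself, in a way similar to concat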

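The apply step usually takes one of four forms:

Aggregation: compute summary statistics, one result per group
Apply: run an arbitrary function on each group
Transformation: compute group-wise values and return a result shaped like the input
Filtration: keep or drop whole groups based on a group-level condition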
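The four kinds of operation listed above can be sketched on a tiny frame. The frame and the names `key` and `val` are made up for illustration; the point is only the shape of each result.

```python
import pandas as pd

# Small made-up frame; `key` and `val` are illustrative names.
df = pd.DataFrame({"key": ["a", "a", "b", "b", "b"],
                   "val": [1, 2, 3, 4, 5]})
g = df.groupby("key")

# Aggregation: one value per group, indexed by the group key.
agg_out = g["val"].agg("sum")

# Transformation: same length and index as the input.
trans_out = g["val"].transform(lambda s: s - s.mean())

# Filtration: whole groups are kept or dropped; kept rows keep their index.
filt_out = g.filter(lambda grp: len(grp) >= 3)

# Apply: an arbitrary function per group (here the range of val).
apply_out = g["val"].apply(lambda s: s.max() - s.min())

print(agg_out, trans_out, filt_out, apply_out, sep="\n\n")
```

Aggregation and this particular apply collapse each group to one row, transformation keeps all five rows, and filtration keeps only the rows of group `b`.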

Note that the apply, agg, transform and filter methods discussed here are those of pandas.core.groupby.DataFrameGroupBy. Do not confuse them with the pandas.DataFrame methods of the same names.
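The distinction matters most for filter, where the two methods do unrelated things. The following is a small sketch with a made-up frame; it assumes `DataFrameGroupBy` can be imported from `pandas.core.groupby`, the same path the text above uses.

```python
import pandas as pd
from pandas.core.groupby import DataFrameGroupBy

df = pd.DataFrame({"A": ["foo", "bar", "foo"], "C": [1, 2, 3]})
g = df.groupby("A")

# The grouped object is a DataFrameGroupBy, not a DataFrame.
is_groupby = isinstance(g, DataFrameGroupBy)
is_frame = isinstance(g, pd.DataFrame)

# DataFrame.filter selects by label (here: the column named "C").
by_label = df.filter(items=["C"])

# DataFrameGroupBy.filter keeps the rows of every group for which
# the function returns True (here: groups whose C values sum to more than 2).
by_group = g.filter(lambda grp: grp["C"].sum() > 2)

print(is_groupby, is_frame)
print(by_label)
print(by_group)
```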

First, import the modules we need

import numpy as np
import pandas as pd
import sys, traceback
from itertools import chain

Part 1: GroupBy in Detail

df_0 = pd.DataFrame({'A': list(chain(*[['foo', 'bar']*4])),
                     'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                     'C': np.random.randn(8),
                     'D': np.random.randn(8)})

df_0

	A	B	C	D
0	foo	one	1.145852	0.210586
1	bar	one	-1.343518	-2.064735
2	foo	two	0.544624	1.125505
3	bar	three	1.090288	-0.296160
4	foo	two	-1.854274	1.348597
5	bar	two	-0.246072	-0.598949
6	foo	one	0.348484	0.429300
7	bar	three	1.477379	0.917027

Talk 1: Things to watch when creating a GroupBy object

Good Practice

df_01 = df_0.copy()
df_01.groupby(["A", "B"], as_index=False, sort=False).agg({"C": "sum", "D": "mean"})

	A	B	C	D
0	foo	one	1.494336	0.319943
1	bar	one	-1.343518	-2.064735
2	foo	two	-1.309649	1.237051
3	bar	three	2.567667	0.310433
4	bar	two	-0.246072	-0.598949

Poor Practice

df_02 = df_0.copy()
df_02.groupby(["A", "B"]).agg({"C": "sum", "D": "mean"}).reset_index()

	A	B	C	D
0	bar	one	-1.343518	-2.064735
1	bar	three	2.567667	0.310433
2	bar	two	-0.246072	-0.598949
3	foo	one	1.494336	0.319943
4	foo	two	-1.309649	1.237051

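Passing as_index=False directly is a good habit: when the DataFrame is very large (say, gigabytes), aggregating with the group keys as the index and then calling reset_index() costs extra time.
Sorting is "expensive" in any data operation. If you only need the groups and do not care about their order, turn sorting off when creating the GroupBy object, because it is on by default (sort=True). This matters most on large datasets.
Note that groupby keeps the rows inside each group in the order they had in the original DataFrame, whether or not sorting is enabled.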
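A short sketch of both points, on a made-up frame whose keys deliberately appear out of alphabetical order: `sort` only changes the order of the group keys, while the rows inside each group keep their original relative order either way.

```python
import pandas as pd

df = pd.DataFrame({"key": ["b", "a", "b", "a"],
                   "val": [1, 2, 3, 4]})

# sort=True (the default): group keys come out sorted.
sorted_keys = df.groupby("key")["val"].sum().index.tolist()

# sort=False: group keys appear in order of first occurrence.
unsorted_keys = df.groupby("key", sort=False)["val"].sum().index.tolist()

# In both cases the rows inside a group keep their original order.
rows_sorted = df.groupby("key")["val"].apply(list).to_dict()
rows_unsorted = df.groupby("key", sort=False)["val"].apply(list).to_dict()

print(sorted_keys, unsorted_keys)
print(rows_sorted, rows_unsorted)
```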

To get a DataFrame with a multi-level index instead, keep the default as_index=True, as in the example below:

df_03 = df_0.copy()
df_03.groupby(["A", "B"]).agg({"C": "sum", "D": "mean"})

		C	D
A	B
bar	one	-1.343518	-2.064735
	three	2.567667	0.310433
	two	-0.246072	-0.598949
foo	one	1.494336	0.319943
	two	-1.309649	1.237051

Note that as_index only takes effect for aggregation. For other operations, such as transform, setting it has no effect.

df_04 = df_0.copy()
df_04.groupby(["A", "B"], as_index=True).transform(lambda x: x * x)

	C	D
0	1.312976	0.044347
1	1.805040	4.263130
2	0.296616	1.266761
3	1.188727	0.087711
4	3.438331	1.818714
5	0.060552	0.358740
6	0.121441	0.184298
7	2.182650	0.840938

As you can see, we get a new DataFrame of the same length as df_0. We also wanted A and B to become the index, but that did not happen.
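If A and B really are wanted as the index of a transformed result, one possible workaround (an assumption of this sketch, not something the original text prescribes) is to move them into the index first and group by those index levels. transform returns a result indexed like its input, so the MultiIndex is preserved.

```python
import pandas as pd

# Made-up frame with the same column layout as df_0.
df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"],
                   "B": ["one", "one", "two", "one"],
                   "C": [1.0, 2.0, 3.0, 4.0]})

# as_index has no effect on transform: the result is indexed like df.
plain = df.groupby(["A", "B"], as_index=True).transform(lambda x: x * x)

# Workaround: set A and B as the index, then group by the index levels.
indexed = (
    df.set_index(["A", "B"])
    .groupby(level=["A", "B"])
    .transform(lambda x: x * x)
)

print(plain)
print(indexed)
```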

Talk 2: Using `pd.Grouper`

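pd.Grouper makes groupby more powerful and flexible than plain column keys: besides ordinary grouping, it can group a datetime column at a chosen frequency, the way resample does.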

df_1 = pd.read_excel("dataset/sample-salesv3.xlsx")  # forward slash avoids the invalid "\s" escape
df_1["date"] = pd.to_datetime(df_1["date"])

df_1.head()

	account number	name	sku	quantity	unit price	ext price	date
0	740150	Barton LLC	B1-20000	39	86.69	3380.91	2014-01-01 07:21:51
1	714466	Trantow-Barrows	S2-77896	-1	63.16	-63.16	2014-01-01 10:00:47
2	218895	Kulas Inc	B1-69924	23	90.70	2086.10	2014-01-01 13:24:58
3	307599	Kassulke, Ondricka and Metz	S1-65481	41	21.05	863.05	2014-01-01 15:05:22
4	412290	Jerde-Hilpert	S2-34077	6	83.21	499.26	2014-01-01 23:26:55

[Example] Compute the total ext price for each month

df_1.set_index("date").resample("M")["ext price"].sum()

date
2014-01-31    185361.66
2014-02-28    146211.62
2014-03-31    203921.38
2014-04-30    174574.11
2014-05-31    165418.55
2014-06-30    174089.33
2014-07-31    191662.11
2014-08-31    153778.59
2014-09-30    168443.17
2014-10-31    171495.32
2014-11-30    119961.22
2014-12-31    163867.26
Freq: M, Name: ext price, dtype: float64

df_1.groupby(pd.Grouper(key="date", freq="M"))["ext price"].sum()

date
2014-01-31    185361.66
2014-02-28    146211.62
2014-03-31    203921.38
2014-04-30    174574.11
2014-05-31    165418.55
2014-06-30    174089.33
2014-07-31    191662.11
2014-08-31    153778.59
2014-09-30    168443.17
2014-10-31    171495.32
2014-11-30    119961.22
2014-12-31    163867.26
Freq: M, Name: ext price, dtype: float64

Both forms give the same result, and the second one looks a little harder to read. Now consider another example.

[Example] Compute the total ext price for each customer in each month

df_1.set_index("date").groupby("name")["ext price"].resample("M").sum().head(20)

name                             date      
Barton LLC                       2014-01-31     6177.57
                                 2014-02-28    12218.03
                                 2014-03-31     3513.53
                                 2014-04-30    11474.20
                                 2014-05-31    10220.17
                                 2014-06-30    10463.73
                                 2014-07-31     6750.48
                                 2014-08-31    17541.46
                                 2014-09-30    14053.61
                                 2014-10-31     9351.68
                                 2014-11-30     4901.14
                                 2014-12-31     2772.90
Cronin, Oberbrunner and Spencer  2014-01-31     1141.75
                                 2014-02-28    13976.26
                                 2014-03-31    11691.62
                                 2014-04-30     3685.44
                                 2014-05-31     6760.11
                                 2014-06-30     5379.67
                                 2014-07-31     6020.30
                                 2014-08-31     5399.58
Name: ext price, dtype: float64

df_1.groupby(["name", pd.Grouper(key="date",freq="M")])["ext price"].sum().head(20)

name                             date      
Barton LLC                       2014-01-31     6177.57
                                 2014-02-28    12218.03
                                 2014-03-31     3513.53
                                 2014-04-30    11474.20
                                 2014-05-31    10220.17
                                 2014-06-30    10463.73
                                 2014-07-31     6750.48
                                 2014-08-31    17541.46
                                 2014-09-30    14053.61
                                 2014-10-31     9351.68
                                 2014-11-30     4901.14
                                 2014-12-31     2772.90
Cronin, Oberbrunner and Spencer  2014-01-31     1141.75
                                 2014-02-28    13976.26
                                 2014-03-31    11691.62
                                 2014-04-30     3685.44
                                 2014-05-31     6760.11
                                 2014-06-30     5379.67
                                 2014-07-31     6020.30
                                 2014-08-31     5399.58
Name: ext price, dtype: float64
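The equivalence of the two forms can be checked without the Excel file. The sketch below uses a made-up `orders` frame whose column names mirror the sales data, though its values do not. It passes a `pd.offsets.MonthEnd()` object instead of the string "M", because recent pandas releases prefer the alias "ME" for month-end and an offset object sidesteps the alias question.

```python
import pandas as pd

# Made-up orders; only the column names follow the sales data above.
orders = pd.DataFrame({
    "name": ["x", "x", "y", "x", "y"],
    "date": pd.to_datetime(["2014-01-05", "2014-01-20", "2014-01-07",
                            "2014-02-03", "2014-02-11"]),
    "ext price": [10.0, 20.0, 5.0, 7.0, 8.0],
})

month_end = pd.offsets.MonthEnd()  # month-end frequency as an offset object

# Form 1: date as index, group by customer, then resample each group.
via_resample = (
    orders.set_index("date")
    .groupby("name")["ext price"]
    .resample(month_end)
    .sum()
)

# Form 2: a single groupby with the customer and a time-based Grouper.
via_grouper = (
    orders.groupby(["name", pd.Grouper(key="date", freq=month_end)])["ext price"]
    .sum()
)

print(via_resample)
print(via_grouper)
```

With pd.Grouper the time key is just one more entry in the groupby list, so it combines with any number of other keys and leaves the frame's index untouched.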