2012 美国大选献金项目数据分析

众所周知,美国大选是一个非常耗钱的事情。每个总统参加竞选时,都需要大量的经费。这些经费,几乎不会由候选人自己出,它们往往来自于政治献金。

美国法律要求,政治献金数据必须公开透明。于是,我们有了可以分析的数据。

需求:

  • 加载数据
  • 查看数据的基本信息
  • 指定数据截取,将如下字段的数据进行提取,其他数据舍弃
    • cand_nm:候选人姓名
    • contbr_nm:捐赠人姓名
    • contbr_st:捐赠人所在州
    • contbr_employer:捐赠人所在公司
    • contbr_occupation:捐赠人职业
    • contb_receipt_amt:捐赠数额(美元)
    • contb_receipt_dt:捐款的日期
  • 对新数据进行总览,查看是否存在缺失数据
  • 用统计学指标快速描述数值型属性的概要
  • 空值处理。可能因为忘记填写或者保密等等原因,相关字段出现了空值,将其填充为 NOT PROVIDE
  • 异常值处理。将捐款金额 <= 0 的数据删除
  • 新建一列为各个候选人所在党派 party
  • 查看 party 这一列中有哪些不同的元素
  • 统计 party 列中各个元素出现次数
  • 查看各个党派收到的政治献金总数 contb_receipt_amt
  • 查看具体每天各个党派收到的政治献金总数 contb_receipt_amt
  • 将表中日期格式转换为 'yyyy-mm-dd'
  • 查看老兵(捐献者职业)DISABLED VETERAN 主要支持谁

在开始之前,导入各种需要的模块:

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

加载数据。数据存储在文本文件中,直接用 read_csv 读取即可:

df = pd.read_csv('data/usa_election.txt')
df.head()

前五条数据为:

	cmte_id	cand_id	cand_nm	contbr_nm	contbr_city	contbr_st	contbr_zip	contbr_employer	contbr_occupation	contb_receipt_amt	contb_receipt_dt	receipt_desc	memo_cd	memo_text	form_tp	file_num
0	C00410118	P20002978	Bachmann, Michelle	HARVEY, WILLIAM	MOBILE	AL	3.6601e+08	RETIRED	RETIRED	250.0	20-JUN-11	NaN	NaN	NaN	SA17A	736166
1	C00410118	P20002978	Bachmann, Michelle	HARVEY, WILLIAM	MOBILE	AL	3.6601e+08	RETIRED	RETIRED	50.0	23-JUN-11	NaN	NaN	NaN	SA17A	736166
2	C00410118	P20002978	Bachmann, Michelle	SMITH, LANIER	LANETT	AL	3.68633e+08	INFORMATION REQUESTED	INFORMATION REQUESTED	250.0	05-JUL-11	NaN	NaN	NaN	SA17A	749073
3	C00410118	P20002978	Bachmann, Michelle	BLEVINS, DARONDA	PIGGOTT	AR	7.24548e+08	NONE	RETIRED	250.0	01-AUG-11	NaN	NaN	NaN	SA17A	749073
4	C00410118	P20002978	Bachmann, Michelle	WARDENBURG, HAROLD	HOT SPRINGS NATION	AR	7.19016e+08	NONE	RETIRED	300.0	20-JUN-11	NaN	NaN	NaN	SA17A	736166

查看数据的基本信息,调用 info 即可搞定:

df.info()

数据信息拿到。我们可以分析出来,总共有 536041 条数据,有些列是含有空值的:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 536041 entries, 0 to 536040
Data columns (total 16 columns):
cmte_id              536041 non-null object
cand_id              536041 non-null object
cand_nm              536041 non-null object
contbr_nm            536041 non-null object
contbr_city          536026 non-null object
contbr_st            536040 non-null object
contbr_zip           535973 non-null object
contbr_employer      525088 non-null object
contbr_occupation    530520 non-null object
contb_receipt_amt    536041 non-null float64
contb_receipt_dt     536041 non-null object
receipt_desc         8479 non-null object
memo_cd              49718 non-null object
memo_text            52740 non-null object
form_tp              536041 non-null object
file_num             536041 non-null int64
dtypes: float64(1), int64(1), object(14)
memory usage: 65.4+ MB

指定数据截取,将指定字段的数据进行提取,其他数据舍弃。注意提取数据的时候,内部是一个列表:

data = df[['cand_nm', 'contbr_nm', 'contbr_st', 'contbr_employer', 'contbr_occupation', 'contb_receipt_amt', 'contb_receipt_dt']]
data.head()

截取后的前五条数据为:

	cand_nm	contbr_nm	contbr_st	contbr_employer	contbr_occupation	contb_receipt_amt	contb_receipt_dt
0	Bachmann, Michelle	HARVEY, WILLIAM	AL	RETIRED	RETIRED	250.0	20-JUN-11
1	Bachmann, Michelle	HARVEY, WILLIAM	AL	RETIRED	RETIRED	50.0	23-JUN-11
2	Bachmann, Michelle	SMITH, LANIER	AL	INFORMATION REQUESTED	INFORMATION REQUESTED	250.0	05-JUL-11
3	Bachmann, Michelle	BLEVINS, DARONDA	AR	NONE	RETIRED	250.0	01-AUG-11
4	Bachmann, Michelle	WARDENBURG, HAROLD	AR	NONE	RETIRED	300.0	20-JUN-11

对新数据进行总览,查看是否存在缺失数据

data.info()

除了捐助者和被捐助者,其余列都有空值存在:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 536041 entries, 0 to 536040
Data columns (total 7 columns):
cand_nm              536041 non-null object
contbr_nm            536041 non-null object
contbr_st            536040 non-null object
contbr_employer      525088 non-null object
contbr_occupation    530520 non-null object
contb_receipt_amt    536041 non-null float64
contb_receipt_dt     536041 non-null object
dtypes: float64(1), object(6)
memory usage: 28.6+ MB

用统计学指标快速描述数值型属性的概要

data.describe()

因为只有献金金额是数值型数据,所以结果中只有现金金额这一列:

	contb_receipt_amt
count	5.360410e+05
mean	3.750373e+02
std	3.564436e+03
min	-3.080000e+04
25%	5.000000e+01
50%	1.000000e+02
75%	2.500000e+02
max	1.944042e+06

空值处理。可能因为忘记填写或者保密等等原因,相关字段出现了空值,将其填充为 NOT PROVIDE

data.fillna('NOT PROVIDE', inplace=True)
data.info()

然后就可以看到,数据中已经没有空值了:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 536041 entries, 0 to 536040
Data columns (total 7 columns):
cand_nm              536041 non-null object
contbr_nm            536041 non-null object
contbr_st            536041 non-null object
contbr_employer      536041 non-null object
contbr_occupation    536041 non-null object
contb_receipt_amt    536041 non-null float64
contb_receipt_dt     536041 non-null object
dtypes: float64(1), object(6)
memory usage: 28.6+ MB

异常值处理。将捐款金额 <= 0 的数据删除:

data = data.loc[data['contb_receipt_amt'] > 0]
data.describe()

从统计数据中可以看出,最小值也是大于 0 的:

	contb_receipt_amt
count	5.303140e+05
mean	3.931043e+02
std	3.576869e+03
min	1.000000e-02
25%	5.000000e+01
50%	1.000000e+02
75%	2.500000e+02
max	1.944042e+06

新建一列为各个候选人所在党派 party,使用映射可以实现:

dic = {
    'Bachmann, Michelle': 'Republican',
    'Romney, Mitt': 'Republican',
    'Obama, Barack': 'Democrat',
    "Roemer, Charles E. 'Buddy' III": 'Reform',
    'Pawlenty, Timothy': 'Republican',
    'Johnson, Gary Earl': 'Libertarian',
    'Paul, Ron': 'Republican',
    'Santorum, Rick': 'Republican',
    'Cain, Herman': 'Republican',
    'Gingrich, Newt': 'Republican',
    'McCotter, Thaddeus G': 'Republican',
    'Huntsman, Jon': 'Republican',
    'Perry, Rick': 'Republican',
}
data['party'] = data['cand_nm'].map(dic)
data.head()

数据中的前五行为:

	cand_nm	contbr_nm	contbr_st	contbr_employer	contbr_occupation	contb_receipt_amt	contb_receipt_dt	party
0	Bachmann, Michelle	HARVEY, WILLIAM	AL	RETIRED	RETIRED	250.0	20-JUN-11	Republican
1	Bachmann, Michelle	HARVEY, WILLIAM	AL	RETIRED	RETIRED	50.0	23-JUN-11	Republican
2	Bachmann, Michelle	SMITH, LANIER	AL	INFORMATION REQUESTED	INFORMATION REQUESTED	250.0	05-JUL-11	Republican
3	Bachmann, Michelle	BLEVINS, DARONDA	AR	NONE	RETIRED	250.0	01-AUG-11	Republican
4	Bachmann, Michelle	WARDENBURG, HAROLD	AR	NONE	RETIRED	300.0	20-JUN-11	Republican

查看 party 这一列中有哪些不同的元素

data['party'].unique()

候选人来自的党派有:

array(['Republican', 'Democrat', 'Reform', 'Libertarian'], dtype=object)

统计 party 列中各个元素出现次数。一般情况下,只有能够接收到足够献金的党派胜选的概率才会更大:

data['party'].value_counts()

毫无疑问,民主和共和两党收到的献金是最多的:

Democrat       289999
Republican     234300
Reform           5313
Libertarian       702
Name: party, dtype: int64

查看各个党派收到的政治献金总数 contb_receipt_amt

data.groupby('party')['contb_receipt_amt'].sum()

各党派收到献金的总数为:

party
Democrat       8.259441e+07
Libertarian    4.132769e+05
Reform         3.429658e+05
Republican     1.251181e+08
Name: contb_receipt_amt, dtype: float64

查看具体每天各个党派收到的政治献金总数 contb_receipt_amt

data.groupby(by = ['contb_receipt_dt', 'party'])['contb_receipt_amt'].sum()

各党派每天收到的献金总数为:

contb_receipt_dt  party      
01-APR-11         Reform              50.00
                  Republican       12635.00
01-AUG-11         Democrat        182198.00
                  Libertarian       1000.00
                  Reform            1847.00
                  Republican      268903.02
01-DEC-11         Democrat        651982.82
                  Libertarian        725.00
                  Reform             875.00
                  Republican      505255.96
01-FEB-11         Republican         250.00
01-JAN-11         Republican        8600.00
01-JAN-12         Democrat         74303.80
                  Reform             515.00
                  Republican       76804.72
01-JUL-11         Democrat        175364.00
                  Libertarian       2000.00
                  Reform             100.00
                  Republican      125973.72
01-JUN-11         Democrat        148409.00
                  Libertarian        500.00
                  Reform              50.00
                  Republican      435609.20
01-MAR-11         Republican        1000.00
01-MAY-11         Democrat         82644.00
                  Reform             480.00
                  Republican       28663.87
01-NOV-11         Democrat        129309.87
                  Libertarian       3000.00
                  Reform            1792.00
                                    ...    
30-OCT-11         Reform            3910.00
                  Republican       46413.16
30-SEP-11         Democrat       3409587.24
                  Libertarian        550.00
                  Reform            2050.00
                  Republican     5094824.20
31-AUG-11         Democrat        375487.44
                  Libertarian      10750.00
                  Reform             450.00
                  Republican     1038330.90
31-DEC-11         Democrat       3571793.57
                  Reform             695.00
                  Republican     1165777.72
31-JAN-11         Republican        6000.00
31-JAN-12         Democrat       1421887.31
                  Reform             150.00
                  Republican      963681.41
31-JUL-11         Democrat         20305.00
                  Reform            1066.00
                  Republican       12781.02
31-MAR-11         Reform             200.00
                  Republican       74575.00
31-MAY-11         Democrat        352005.66
                  Libertarian        250.00
                  Reform             100.00
                  Republican      313839.80
31-OCT-11         Democrat        216971.87
                  Libertarian       4250.00
                  Reform            3205.00
                  Republican      751542.36
Name: contb_receipt_amt, Length: 1183, dtype: float64

将表中日期格式转换为 'yyyy-mm-dd',可通过运算工具实现:

months = {'JAN': 1, 'FEB': 2, 'MAR': 3, 'APR': 4, 'MAY': 5, 'JUN': 6,
          'JUL': 7, 'AUG': 8, 'SEP': 9, 'OCT': 10, 'NOV': 11, 'DEC': 12}
def transfer_date(d):
    day, month, year = d.split('-')
    month = months[month]
    return '20%s-%02d-%2s' % (year, month, day)
data['contb_receipt_dt'] = data['contb_receipt_dt'].map(transfer_date)
data.head()

数据的前五行为:

	cand_nm	contbr_nm	contbr_st	contbr_employer	contbr_occupation	contb_receipt_amt	contb_receipt_dt	party
0	Bachmann, Michelle	HARVEY, WILLIAM	AL	RETIRED	RETIRED	250.0	2011-06-20	Republican
1	Bachmann, Michelle	HARVEY, WILLIAM	AL	RETIRED	RETIRED	50.0	2011-06-23	Republican
2	Bachmann, Michelle	SMITH, LANIER	AL	INFORMATION REQUESTED	INFORMATION REQUESTED	250.0	2011-07-05	Republican
3	Bachmann, Michelle	BLEVINS, DARONDA	AR	NONE	RETIRED	250.0	2011-08-01	Republican
4	Bachmann, Michelle	WARDENBURG, HAROLD	AR	NONE	RETIRED	300.0	2011-06-20	Republican

查看老兵(捐献者职业)DISABLED VETERAN 主要支持谁。

我们认为,哪个候选人接收到老兵的献金越多,老兵就越支持谁。

运算分两步,首先提取出老兵对应的行数据,然后根据候选人分组,运算得到老兵给每个候选人的献金:

old_soldier = data.loc[data['contbr_occupation'] == 'DISABLED VETERAN']
old_soldier.groupby('cand_nm').groups

我们可以看到有哪些老兵给哪位候选人献金:

{'Cain, Herman': Int64Index([459440, 459447, 464508], dtype='int64'),
 'Obama, Barack': Int64Index([149790, 150910, 174041, 175244, 183790, 184211, 184417, 184761,
             185140, 185264, 186026, 186361, 194750, 205094, 210809, 210810,
             210811, 225948, 225954, 231074, 231773, 232434, 233434, 235014,
             235112, 235135, 235846, 244060, 260154, 260544, 264154, 289927],
            dtype='int64'),
 'Paul, Ron': Int64Index([379163, 379537, 379555, 396211, 408043, 408743, 419163, 420332,
             420517, 421427, 421429, 421535, 421621, 421741, 421754, 421839,
             421895, 429191, 430649, 431049, 435011, 446880],
            dtype='int64'),
 'Santorum, Rick': Int64Index([453659, 453660, 453661], dtype='int64')}

对分组数据中的献金金额求和,即可得到每位候选人收到老兵献金的总额:

old_soldier.groupby('cand_nm')['contb_receipt_amt'].sum()

可见,奥巴马是最受老兵支持的:

cand_nm
Cain, Herman       300.00
Obama, Barack     4205.00
Paul, Ron         2425.49
Santorum, Rick     250.00
Name: contb_receipt_amt, dtype: float64