23. Data Cleaning in Pandas: Data Transformation

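Recall the Categorical Data type from earlier: it stores the original data with a dictionary-like encoding, which reduces the storage space required. In other words, the data held in a DataFrame is not always the raw data. For example, a protocol may be fixed at collection time in which 1 stands for one thing and 2 stands for another. Once developers receive data encoded this way, they need to transform it so that it faithfully reflects what the data means. This chapter covers data transformation and introduces functions such as map and applymap that convert data into other data.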

23.1 The map function

The map function maps the data in a column to other data. The syntax is:

outerSeries.map(innerSeries)

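The values of the Series that calls map (outerSeries) are replaced by values from the argument innerSeries. The rule is that each value of outerSeries is matched against the index of innerSeries, so the result has the index of outerSeries and the values of innerSeries. A value of outerSeries that has no match in the index of innerSeries becomes NaN.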

import pandas as pd

# The values of oSeries are looked up in the index of iSeries
oSeries = pd.Series(["a", "b", "c"], index=[2, 3, 1])
iSeries = pd.Series([100, 200, 300], index=["c", "b", "a"])
print(oSeries)
print(iSeries)
print(oSeries.map(iSeries))

Output:

2    a
3    b
1    c
dtype: object
c    100
b    200
a    300
dtype: int64
2    300
3    200
1    100
dtype: int64
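
The same lookup also works when the argument to map is a plain dict, which fits the protocol-decoding case from the chapter introduction. The following is a minimal sketch: the codes and labels (1 for "ok", 2 for "error") are made up for illustration and do not come from any real protocol. The code 3 has no entry in the mapping, so it comes back as NaN.

```python
import pandas as pd

# Hypothetical protocol: 1 -> "ok", 2 -> "error" (illustrative only)
codes = pd.Series([1, 2, 1, 3], index=["w", "x", "y", "z"])
labels = {1: "ok", 2: "error"}

# Each value of codes is looked up among the keys of labels
decoded = codes.map(labels)
print(decoded)
```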

23.2 The replace function

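The fillna function from an earlier chapter fills NaN values, for example with 0. The replace function covered here substitutes arbitrary values with other values. It can be used in several ways: it can replace a single value or several values at once.

One-to-one replacement: pass replace two arguments, the value to be replaced and the value to replace it with.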

import pandas as pd

ss = pd.Series(["a", "b", "c"], index=[2, 3, 1])
print(ss)
# Replace every "b" with "hello", modifying ss itself
ss.replace("b", "hello", inplace=True)
print(ss)

Output:

2    a
3    b
1    c
dtype: object
2        a
3    hello
1        c
dtype: object
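
When inplace is left at its default of False, replace returns a new Series and leaves the original unchanged. A small sketch with the same data as above; the variable name replaced is only for illustration:

```python
import pandas as pd

ss = pd.Series(["a", "b", "c"], index=[2, 3, 1])

# Without inplace=True the result is a new Series
replaced = ss.replace("b", "hello")

print(ss.tolist())        # the original is untouched
print(replaced.tolist())  # "b" has become "hello"
```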

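Many-to-many replacement: pass two lists, the values to be replaced and their replacements. The lists are matched position by position.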

import pandas as pd

ss = pd.Series(["a", "b", "c", "a", "c"], index=[2, 3, 1, 4, 5])
print(ss)
# "c" becomes "hello" and "a" becomes "world"
ss.replace(["c", "a"], ["hello", "world"], inplace=True)
print(ss)

Output:

2    a
3    b
1    c
4    a
5    c
dtype: object
2    world
3        b
1    hello
4    world
5    hello
dtype: object
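
A list of values can also be replaced by one single value, which gives a many-to-one replacement. A minimal sketch with the same data; the replacement string "x" and the name merged are arbitrary choices for illustration:

```python
import pandas as pd

ss = pd.Series(["a", "b", "c", "a", "c"], index=[2, 3, 1, 4, 5])

# Every "c" and every "a" is replaced by the same value
merged = ss.replace(["c", "a"], "x")
print(merged)
```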

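Replacement specified with a dictionary.

1) For a Series, each dictionary key is a value to be replaced and the corresponding dictionary value is its replacement.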

import pandas as pd

ss = pd.Series(["a", "b", "c", "a", "c"], index=[2, 3, 1, 4, 5])
print(ss)
# Keys are the old values, dictionary values are the new ones
ss.replace({"c": "hello", "a": "world"}, inplace=True)
print(ss)

Output:

2    a
3    b
1    c
4    a
5    c
dtype: object
2    world
3        b
1    hello
4    world
5    hello
dtype: object
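
Dictionary-based replace looks similar to map from section 23.1, but the two treat unmatched values differently: replace keeps them as they are, while map turns them into NaN. The sketch below contrasts the two on one Series; the names mapping, by_replace and by_map are illustrative.

```python
import pandas as pd

ss = pd.Series(["a", "b", "c"], index=[2, 3, 1])
mapping = {"c": "hello", "a": "world"}

by_replace = ss.replace(mapping)  # "b" has no key, so it stays "b"
by_map = ss.map(mapping)          # "b" has no key, so it becomes NaN

print(by_replace)
print(by_map)
```

As a rule of thumb, replace suits partial substitutions, and map suits a full recoding where a missing entry should stand out.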

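2) For a DataFrame, the dictionary keys name the columns and the dictionary values give the data to be replaced in each column; the second argument is the replacement.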

import pandas as pd

idx = [1, 3, 2, 4]
val = {'growth': [100, 125, 150, 200],
       'name': "hello the cruel world".split()}
df = pd.DataFrame(val, index=idx)
print(df)
# Look for "the" only in the "name" column and replace it with "THE"
df.replace({"name": "the"}, "THE", inplace=True)
print(df)

Output:

   growth   name
1     100  hello
3     125    the
2     150  cruel
4     200  world
   growth   name
1     100  hello
3     125    THE
2     150  cruel
4     200  world
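
The column and the old/new pairs can also be written as one nested dictionary, so that no second argument is needed. A hedged sketch on the same data: it relies on the nested form {column: {old: new}} accepted by DataFrame.replace, and the extra pair "cruel" to "kind" and the name result are made up for illustration.

```python
import pandas as pd

val = {"growth": [100, 125, 150, 200],
       "name": "hello the cruel world".split()}
df = pd.DataFrame(val, index=[1, 3, 2, 4])

# Outer key: column to search; inner dict: old value -> new value
result = df.replace({"name": {"the": "THE", "cruel": "kind"}})
print(result)
```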