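Some figures suggest that data analysts and data scientists spend about 80% of their time cleaning data and only 20% analyzing it and writing reports. Data from the real world tends to look nothing like the sample datasets used in class: it is usually very dirty and takes a lot of time to clean. Dirty data generally has two causes: human input error and technical error. Untidy data gets in the way of analysis and can even produce wrong reports that mislead decisions. So when we receive a pile of data, the first job is to establish where it came from and to clean it as systematically as possible before moving on to analysis. This post walks through some common basic data cleaning operations step by step. The data comes from DataCamp; I made the relatively clean original somewhat dirtier so that it could serve as an example. It is a CSV file called airlines, containing several days of flights departing from San Francisco International Airport. Let's first take a quick look at what airlines.csv contains.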

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('datasets/airlines.csv')
df.info()
df.head()

Out [1]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2487 entries, 0 to 2486
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   id                2487 non-null   int64  
 1   full_name         2487 non-null   object 
 2   day               2487 non-null   object 
 3   airline           2487 non-null   object 
 4   destination       2487 non-null   object 
 5   dest_region       2487 non-null   object 
 6   dest_size         2487 non-null   object 
 7   boarding_area     2487 non-null   object 
 8   dept_time         2487 non-null   object 
 9   temperature       2487 non-null   float64
 10  wait_min          2487 non-null   object 
 11  cleanliness       2487 non-null   object 
 12  safety            2487 non-null   object 
 13  satisfaction      2487 non-null   object 
 14  rating            2487 non-null   int64  
 15  frequent_flyer    2487 non-null   int64  
 16  points_gain       2487 non-null   int64  
 17  points_last_time  2487 non-null   int64  
 18  total_points      2487 non-null   int64  
dtypes: float64(1), int64(6), object(12)
memory usage: 369.3+ KB

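From the output we can tell that this df contains an id, passenger name, airline, destination, destination region, boarding area, departure time, temperature, waiting time, some satisfaction survey answers, a rating, and frequent-flyer and points information. In the Spyder IDE you can also open the data in an Excel-like table view to get a more intuitive feel for it.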

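Data type problems

From df.info() and a look at the data we found two data type problems. The first is wait_min, the waiting time: it should be numeric (int or float), but it is stored as object. The second is frequent_flyer: it contains the numbers 0, 1 and 2, which presumably stand for different types or tiers of frequent-flyer membership, so it should be category rather than int. The first thing to do is convert these columns to the correct types: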

# Convert frequent_flyer to the category type with astype()
df['frequent_flyer_cat'] = df['frequent_flyer'].astype('category')
# Use assert to verify the conversion
assert df['frequent_flyer_cat'].dtype == 'category'
df['frequent_flyer_cat'].describe()
# Use str.strip() to remove 'mins' from wait_min, then astype() to convert the result to int
df['wait_time'] = df['wait_min'].str.strip('mins').astype('int')
# Use assert to verify the conversion
assert df['wait_time'].dtype == 'int'
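
A caveat on the conversion above: `str.strip('mins')` treats its argument as a set of characters rather than as a suffix, which happens to work because the values look like `85 mins`. A slightly more defensive sketch, on a made-up toy Series rather than the real airlines data, extracts the digits with a regular expression and lets `pd.to_numeric` flag anything it cannot parse:

```python
import pandas as pd

# Toy values for illustration only; the last entry is deliberately malformed.
wait_min = pd.Series(['85 mins', '170 mins', ' 75 mins', 'unknown'])

# Pull out the first run of digits; rows without digits become NaN.
digits = wait_min.str.extract(r'(\d+)', expand=False)

# errors='coerce' turns anything that still cannot be parsed into NaN
# instead of raising, so bad rows can be inspected afterwards.
wait_time = pd.to_numeric(digits, errors='coerce')

print(wait_time.tolist())
print(wait_time.isna().sum(), 'unparseable value(s)')
```

Because of the NaN the result is float rather than int; whether to drop or impute the flagged rows is a separate decision.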

Next we print the cleaned wait_time variable alongside the original to check the result, and then look at the average passenger waiting time:

print(df[['wait_min', 'wait_time']])
print(df['wait_time'].mean())
Out [2]:
      wait_min  wait_time
0      85 mins         85
1      80 mins         80
2      75 mins         75
3     170 mins        170
4     140 mins        140
       ...        ...
2482  100 mins        100
2483  124 mins        124
2484  124 mins        124
2485  335 mins        335
2486  335 mins        335
[2487 rows x 2 columns]
165.92279855247287

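Data range problems

Sometimes the data we receive contains out-of-range values, for example a flight that departed today recorded with a future date, or numbers outside the allowed range. In the airlines data, some values in rating fall outside the preset scale: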

plt.hist(df['rating'])
plt.title('Average rating of satisfaction (1-5)')

The histogram shows some ratings above the 1-5 scale. There are usually four ways to handle out-of-range data:

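Simply drop the rows
Cap the values at a set minimum and maximum
Set the out-of-range values to missing and impute them
Assign a specific value based on the situation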

df[df['rating'] > 5 ] 
Out [4]:
      id            full_name    ...    frequent_flyer_cat  wait_time
0        1      Dr. Neil Dunlap  ...                  0        85
1        3     Marquise Osborne  ...                  0        80
2        4  Miss. Marissa Doyle  ...                  0        75
3        5      Cassidy Meadows  ...                  1       170
4        6       Salvatore Vega  ...                  0       140
5        9         Darion Lopez  ...                  0       110
6       10           Bailee Lam  ...                  0       155
7       11          Reilly Koch  ...                  0       200
8       12           Evan Dixon  ...                  0        80
1458  2107       William Huerta  ...                  2       100
[10 rows x 21 columns]

Only 10 rows have a rating above 5. For a df with more than 2,000 rows, dropping them will not noticeably affect the later analysis, so here we simply drop them:

# Option 1: keep only the rows that satisfy the condition
df = df[df['rating'] <= 5]
# Option 2: use drop() (either option alone is enough)
df.drop(df[df['rating'] > 5].index, inplace = True)
# Use assert to verify that the rows are gone
assert df['rating'].max() <= 5
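
Dropping is only the first of the four options listed earlier. As a rough sketch of the other three, here is a made-up ratings Series (not the airlines data); the median imputation and the fixed value of 5 are arbitrary assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical ratings on a 1-5 scale; 6 and 9 are out of range.
rating = pd.Series([3, 5, 6, 1, 9, 4], dtype='float')

# Option 2: cap values at the allowed minimum and maximum.
capped = rating.clip(lower=1, upper=5)

# Option 3: mark out-of-range values as missing, then impute
# (here with the median of the valid values, one of many choices).
as_missing = rating.where(rating.between(1, 5), np.nan)
imputed = as_missing.fillna(as_missing.median())

# Option 4: assign a fixed value chosen from domain knowledge
# (the value 5 here is an arbitrary assumption).
fixed = rating.mask(rating > 5, 5)

print(capped.tolist())
print(imputed.tolist())
print(fixed.tolist())
```

Which option fits depends on why the values are out of range; capping and fixed values hide the problem, while the missing-value route keeps it visible until it is imputed.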

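Duplicate data problems

Duplicate data is a very common problem. It usually has one of three causes: human input error, problems when joining or merging data, and bugs or design flaws. We can use .duplicated() to check whether df contains duplicates: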

# Produces a boolean Series named duplicates: True for every row whose id is duplicated, False otherwise
duplicates = df.duplicated(subset = 'id', keep = False)
# Select all duplicated rows
duplicated_passenger = df[duplicates].sort_values('id')
print(duplicated_passenger)
Out [5]:
       id             full_name  ... frequent_flyer_cat   wait_time
2467  3286           Lilly Wong  ...                  1        95
2468  3286           Lilly Wong  ...                  1        95
2469  3287         Prince Poole  ...                  0        65
2470  3287         Prince Poole  ...                  0        65
2471  3288           Koen Meyer  ...                  0        85
2472  3288           Koen Meyer  ...                  0        85
2473  3289  Christian Blackburn  ...                  0        95
2474  3289  Christian Blackburn  ...                  0        95
2476  3291    Lindsay Valentine  ...                  0        95
2475  3291    Lindsay Valentine  ...                  0        95
2477  3292       Johnny Mueller  ...                  0       145
2478  3292       Johnny Mueller  ...                  0       120
2479  9001          Devyn Rocha  ...                  0       135
2480  9001          Devyn Rocha  ...                  0       135
2481  9002            Zoe Payne  ...                  0       120
2482  9002            Zoe Payne  ...                  0       120
2483  9003        Karley Burton  ...                  1       124
2484  9003        Karley Burton  ...                  1       124
2485  9004    Evangeline Flores  ...                  2       335
2486  9004    Evangeline Flores  ...                  2       335
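
The `keep` argument decides which copies get flagged, and `keep = False` is what makes both rows of each pair show up in the output above. A minimal sketch on an invented four-row frame:

```python
import pandas as pd

ids = pd.DataFrame({'id': [1, 2, 2, 3]})

# keep='first' (the default) flags only the later copies.
flag_later = ids.duplicated(subset='id', keep='first')
# keep='last' flags only the earlier copies.
flag_earlier = ids.duplicated(subset='id', keep='last')
# keep=False flags every row that has a twin, which allows
# side-by-side inspection of all copies.
flag_all = ids.duplicated(subset='id', keep=False)

print(flag_later.tolist())
print(flag_earlier.tolist())
print(flag_all.tolist())
```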

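Looking at the duplicates above, most are exact duplicates: identical rows that appear two or more times. For these we can simply use .drop_duplicates() to keep one copy. Rows that are not exact duplicates need further handling, and the output above contains one such pair: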

        id            full_name  ... frequent_flyer_cat wait_time
2477  3292       Johnny Mueller  ...                  0       145
2478  3292       Johnny Mueller  ...                  0       120

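The two rows are identical except for wait_time. A difference like this can be assessed against what actually happened and then imputed. In this case we can simply take the mean of the two wait_time values: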

# First drop the exact duplicates
pass_dup = df.drop_duplicates()
# Build the statistics dictionary for agg()
column_names = ['id']
statistics = {'wait_time': 'mean'}
df_wt = pass_dup.groupby(by = column_names).agg(statistics).reset_index()
# Merge the aggregated df_wt with the de-duplicated pass_dup
df = pd.merge(df_wt, pass_dup, on='id')

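After merging the two frames we need to check for duplicates once more, because a merge can itself produce duplicate rows whenever a key appears more than once on either side: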

duplicates = df.duplicated(subset = column_names, keep = False)
duplicated_passenger = df[duplicates == True]
print(duplicated_passenger)
Out [6]:
        id  wait_time_x  ... frequent_flyer_cat wait_time_y
2462  3292        132.5  ...                  0         145
2463  3292        132.5  ...                  0         120

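The merge has reintroduced duplicates into df. The reason is that pass_dup still holds two rows for id 3292, with wait_time values 145 and 120, and each of them was matched with the averaged row. We only need to keep one of them: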

# Drop duplicates by id, keeping the first row of each group
df = df.drop_duplicates(subset = 'id', keep = 'first')
# Check again whether any duplicates remain
duplicates = df.duplicated(subset = column_names, keep = False)
duplicated_passenger = df[duplicates == True]
print(duplicated_passenger)
Out [7]:
Empty DataFrame
Columns: [id, wait_time_x, full_name, day, airline, destination, dest_region, dest_size, boarding_area, dept_time, temperature, wait_min, cleanliness, safety, satisfaction, rating, frequent_flyer, points_gain, points_last_time, total_points, frequent_flyer_cat, wait_time_y]
Index: []
# Verify that the duplicates are gone
assert duplicated_passenger.shape[0] == 0
# Drop the redundant wait_time_y column and rename wait_time_x
df = df.drop('wait_time_y', axis = 1)
df = df.rename(columns = {'wait_time_x': 'wait_time'})

The printout and the assert confirm that no duplicates remain.
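
The drop, aggregate, merge and drop-again sequence can also be collapsed into a single `groupby().agg()` call. The sketch below uses a made-up three-column frame rather than the airlines data, and it assumes that taking the first value is acceptable for every column other than `wait_time`:

```python
import pandas as pd

# Toy frame: id 2 is an exact duplicate, id 3 differs only in wait_time.
toy = pd.DataFrame({
    'id':        [1, 2, 2, 3, 3],
    'full_name': ['A', 'B', 'B', 'C', 'C'],
    'wait_time': [85, 95, 95, 145, 120],
})

# Average wait_time within each id and keep the first value elsewhere.
rules = {col: 'first' for col in toy.columns if col != 'id'}
rules['wait_time'] = 'mean'

deduped = toy.groupby('id', as_index=False).agg(rules)

print(deduped)
```

This avoids the `_x`/`_y` suffix columns entirely, at the cost of silently picking the first value if some other column also disagrees between copies.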

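Categorical data problems

Categorical variables often contain classification errors too, for example a category entered incorrectly during data collection. There are usually three ways to handle this kind of problem.

Drop the rows
Remap the values
Infer the correct value from context

In the airlines data, the survey column cleanliness has a categorical data problem.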

# Define the allowed answers for cleanliness, safety and satisfaction in a categories DataFrame for the comparisons below
categories = pd.DataFrame({'cleanliness': ["Clean", "Average", "Somewhat clean",
              "Somewhat dirty", "Dirty"],
              'safety': ["Neutral", "Very safe", "Somewhat safe",
              "Very unsafe", "Somewhat unsafe"],
              'satisfaction': ["Very satisfied", "Neutral",
              "Somewhat satisfied", "Somewhat unsatisfied", "Very unsatisfied"]})
print(categories)
Out [8]:
      cleanliness           safety          satisfaction
0           Clean          Neutral        Very satisfied
1         Average        Very safe               Neutral
2  Somewhat clean    Somewhat safe    Somewhat satisfied
3  Somewhat dirty      Very unsafe  Somewhat unsatisfied
4           Dirty  Somewhat unsafe      Very unsatisfied

Each variable has 5 allowed answers. Next we check whether the data contains any value outside these 15 answers:

print('Cleanliness: ', df['cleanliness'].unique(), "\n")
print('Safety: ', df['safety'].unique(), "\n")
print('Satisfaction: ', df['satisfaction'].unique(), "\n")
Out [9]:
Cleanliness:  ['Average' 'Unacceptable' 'Somewhat clean' 'Clean' 'Somewhat dirty'
 'Dirty'] 
Safety:  ['Somewhat safe' 'Very safe' 'Neutral' 'Somewhat unsafe' 'Very unsafe'] 
Satisfaction:  ['Somewhat satsified' 'Neutral' 'Very satisfied' 'Somewhat unsatisfied'
 'Very unsatisfied']

cleanliness contains the answer Unacceptable, which is outside the preset categories, so it needs to be handled.

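# Use set() to get the unique values of cleanliness, then difference() to find the ones missing from the preset categories
cat_clean = set(df['cleanliness']).difference(categories['cleanliness'])
# Flag the rows whose answer is not in the preset categories
cat_clean_rows = df['cleanliness'].isin(cat_clean)
# Print the rows that are outside the categories
print(df[cat_clean_rows])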
Out [10]:
  id  wait_time     ... total_points        frequent_flyer_cat
2   100      120.0  ...            0                  0
42  257      205.0  ...          838                  2
[2 rows x 21 columns]
# Print the rows that are within the categories
print(df[~cat_clean_rows])
Out [11]:
 id  wait_time  	   ... total_points        frequent_flyer_cat
0       13       90.0  ...            0                  0
1       14       89.0  ...          989                  1
3      101      140.0  ...          390                  2
4      102      200.0  ...         1816                  1
5      103      130.0  ...          832                  1
   ...        ...  ...          ...                ...
2462  3292      132.5  ...            0                  0
2464  9001      135.0  ...            0                  0
2465  9002      120.0  ...            0                  0
2466  9003      124.0  ...         1244                  1
2467  9004      335.0  ...         1322                  2
[2465 rows x 21 columns]
# Keep only the rows that conform to the categories
df = df[~cat_clean_rows]
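
The same set-difference check can be applied to several columns in one pass. A minimal sketch, assuming a hypothetical `allowed` dictionary and toy data (neither is taken from the original dataset):

```python
import pandas as pd

# Hypothetical allowed values per column.
allowed = {
    'cleanliness': {'Clean', 'Average', 'Dirty'},
    'safety': {'Very safe', 'Neutral', 'Very unsafe'},
}

toy = pd.DataFrame({
    'cleanliness': ['Clean', 'Unacceptable', 'Dirty'],
    'safety': ['Neutral', 'Very safe', 'Very unsafe'],
})

# Collect, for each column, the values that fall outside its allowed set.
unexpected = {
    col: set(toy[col].dropna()) - values
    for col, values in allowed.items()
}
print(unexpected)

# Boolean mask of rows that are valid in every checked column.
valid = pd.concat(
    [toy[col].isin(values) for col, values in allowed.items()], axis=1
).all(axis=1)
print(toy[valid])
```

Printing `unexpected` before filtering is worthwhile: a misspelled but otherwise legitimate answer is usually better remapped than dropped.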

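While browsing the data we noticed that dest_region has inconsistent capitalization and different labels for the same category, and that dest_size has values padded with spaces. Python treats each of these variants as a separate category, although several of them are really the same category entered inconsistently for one reason or another, so they need cleaning as well: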

# Print the unique values of the two variables and use value_counts() to see how many rows each category has
print(df['dest_region'].unique())
print(df['dest_region'].value_counts())
print(df['dest_size'].unique())
print(df['dest_size'].value_counts())
Out [12]:
['West US' 'EAST US' 'Midwest US' 'Canada/Mexico' 'East US' 'eur' 'Europe'
 'middle east' 'Middle East' 'Asia' 'Central/South America'
 'Australia/New Zealand']
West US                  855
East US                  367
Europe                   272
Midwest US               251
Asia                     226
Canada/Mexico            196
eur                       79
EAST US                   68
Australia/New Zealand     60
Middle East               48
Central/South America     22
middle east               21
Name: dest_region, dtype: int64
        
['Hub' 'Medium     ' '    Medium' '    Hub' 'Hub     ' 'Medium'
 'Small     ' '    Small' 'Small' 'Large     ' '    Large' 'Large']
Hub            1197
Medium          457
    Hub         222
Small           165
Hub             121
Large           110
    Medium       96
Medium           45
    Small        20
Small            15
    Large        11
Large             6
Name: dest_size, dtype: int64

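In dest_region, Python treats East US and EAST US, Middle East and middle east, and Europe and eur as different categories, although we know each pair is the same category. dest_size has a similar problem: values with leading or trailing spaces are treated as different from the same values without them. For dest_region we can convert all category names to upper case with .str.upper() or lower case with .str.lower(), and then replace eur with europe. For dest_size, stripping the whitespace solves the problem: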

# Convert all values in dest_region to lower case
df['dest_region'] = df['dest_region'].str.lower()
# Replace 'eur' with 'europe'
df['dest_region'] = df['dest_region'].replace({'eur':'europe'})
# Strip the whitespace from dest_size
df['dest_size'] = df['dest_size'].str.strip()
# Print again to inspect the categories
print(df['dest_region'].unique())
print(df['dest_region'].value_counts())
print(df['dest_size'].unique())
print(df['dest_size'].value_counts())
Out [13]:
['west us' 'east us' 'midwest us' 'canada/mexico' 'europe' 'middle east'
 'asia' 'central/south america' 'australia/new zealand']
west us                  855
east us                  435
europe                   351
midwest us               251
asia                     226
canada/mexico            196
middle east               69
australia/new zealand     60
central/south america     22
Name: dest_region, dtype: int64
['Hub' 'Medium' 'Small' 'Large']
Hub       1540
Medium     598
Small      200
Large      127
Name: dest_size, dtype: int64

Sometimes we want to present data more intuitively by grouping it into categories, for example the wait_time variable:

plt.hist(x = 'wait_time', data = df)
plt.show()

Most people waited between 0 and 300 minutes. To group the waiting time into short, medium and long, we can do this:

# Three bins: 0-60, 60-180 and 180 to infinity
label_ranges = [0, 60, 180, np.inf]
# Category names for the three bins
label_names = ['short', 'medium', 'long']
# Bin wait_time
df['wait_type'] = pd.cut(df['wait_time'], bins = label_ranges, 
                                labels = label_names)
plt.hist(x = 'wait_type', data = df)
plt.show()
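
One detail worth knowing about `pd.cut`: by default the bins are closed on the right, so with the edges above a wait of exactly 0 falls outside every bin and becomes NaN, while 60 lands in `short`. The toy values below are invented to show the edge cases, and `include_lowest=True` is one way to keep zero-minute waits:

```python
import numpy as np
import pandas as pd

edges = [0, 60, 180, np.inf]
names = ['short', 'medium', 'long']
waits = pd.Series([0, 60, 61, 180, 181])

# Default: intervals are (0, 60], (60, 180], (180, inf].
default_cut = pd.cut(waits, bins=edges, labels=names)

# include_lowest=True makes the first interval include its left edge.
inclusive_cut = pd.cut(waits, bins=edges, labels=names, include_lowest=True)

print(default_cut.tolist())
print(inclusive_cut.tolist())
```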

Besides binning, we can also remap values. For example, the day-of-week values in day can be remapped to weekday and weekend:

mappings = {'Monday':'weekday', 'Tuesday':'weekday', 'Wednesday': 'weekday', 
            'Thursday': 'weekday', 'Friday': 'weekday', 
            'Saturday': 'weekend', 'Sunday': 'weekend'}
df['day_week'] = df['day'].replace(mappings)
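
`replace` leaves any value that is missing from the dictionary untouched, so a misspelled day would slip through silently. A sketch of an alternative, `Series.map`, which returns NaN for unmapped values and therefore makes them easy to find (the toy data and the misspelling are invented):

```python
import pandas as pd

mappings = {'Monday': 'weekday', 'Tuesday': 'weekday', 'Wednesday': 'weekday',
            'Thursday': 'weekday', 'Friday': 'weekday',
            'Saturday': 'weekend', 'Sunday': 'weekend'}

day = pd.Series(['Monday', 'Sunday', 'Firday'])  # 'Firday' is a deliberate typo

replaced = day.replace(mappings)  # unmapped value passes through unchanged
mapped = day.map(mappings)        # unmapped value becomes NaN

print(replaced.tolist())
print(day[mapped.isna()].tolist())  # original values that had no mapping
```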

Text data cleaning

Some names in full_name carry honorifics such as Dr., Mr., Ms. and Miss. To make the data uniform we can remove them:

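# regex=False treats the dot literally, so only the exact text 'Dr.' etc. is removed
df['full_name'] = df['full_name'].str.replace('Dr.', '', regex=False).str.replace('Mr.', '', regex=False)\
    .str.replace('Ms.', '', regex=False).str.replace('Miss.', '', regex=False).str.strip()
# Verify that none of the removed honorifics remain (dots are escaped because str.contains uses regex by default)
assert not df['full_name'].str.contains(r'Ms\.|Mr\.|Miss\.|Dr\.').any()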

Text cleaning throws up many more problems in practice. In most cases we use replace() to substitute or delete text, and for more complex text we also turn to regular expressions.
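
As a sketch of the regular-expression route, the four chained `replace` calls can be folded into one anchored pattern. The pattern and the toy names below are assumptions for illustration; real data may contain honorifics that are not listed here:

```python
import pandas as pd

names = pd.Series(['Dr. Neil Dunlap', 'Miss. Marissa Doyle',
                   'Drew Missal', 'Ms. Jane Roe'])

# ^ anchors the match to the start of the string and \. matches a literal
# dot, so names that merely begin with 'Dr' or 'Miss' are left alone.
honorific = r'^(?:Dr|Mr|Ms|Miss)\.\s*'

cleaned = names.str.replace(honorific, '', regex=True).str.strip()

print(cleaned.tolist())
```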

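Format consistency problems

Inconsistent formatting is another frequent problem. In airlines, the departure date dept_time mixes several date formats and even contains 13/31/2018, a date that does not exist: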

# First try to convert dept_time to the appropriate datetime type
df['dept_time_dt'] = pd.to_datetime(df['dept_time'])

This raises an error:

DateParseError: Invalid date specified (13/31)

Because the date is out of range, pandas raises a DateParseError. We can work around this by passing the errors parameter to pd.to_datetime():

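# errors='coerce' turns unparseable dates into NaT
# (note: infer_datetime_format is deprecated as of pandas 2.0)
df['dept_time_dt'] = pd.to_datetime(df['dept_time'],
                                    infer_datetime_format = True,
                                    errors = 'coerce')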

df.info()
Out [14]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2465 entries, 0 to 2467
Data columns (total 24 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   id                  2465 non-null   int64         
 1   wait_time           2465 non-null   float64       
 2   full_name           2465 non-null   object        
 3   day                 2465 non-null   object        
 4   airline             2465 non-null   object        
 5   destination         2465 non-null   object        
 6   dest_region         2465 non-null   object        
 7   dest_size           2465 non-null   object        
 8   boarding_area       2465 non-null   object        
 9   dept_time           2465 non-null   object        
 10  temperature         2465 non-null   float64       
 11  wait_min            2465 non-null   object        
 12  cleanliness         2465 non-null   object        
 13  safety              2465 non-null   object        
 14  satisfaction        2465 non-null   object        
 15  rating              2465 non-null   int64         
 16  frequent_flyer      2465 non-null   int64         
 17  points_gain         2465 non-null   int64         
 18  points_last_time    2465 non-null   int64         
 19  total_points        2465 non-null   int64         
 20  frequent_flyer_cat  2465 non-null   category      
 21  wait_type           2465 non-null   category      
 22  day_week            2465 non-null   object        
 23  dept_time_dt        2464 non-null   datetime64[ns]
dtypes: category(2), datetime64[ns](1), float64(2), int64(6), object(13)
memory usage: 447.9+ KB
df.head()
Out [15]:
  id  wait_time       full_name    ... wait_type day_week dept_time_dt
0   13       90.0     Krista Leon  ...    medium  weekday          NaT
1   14       89.0  Andrew Salazar  ...    medium  weekday   2018-12-31
3  101      140.0  Kelvin Richard  ...    medium  weekday   2018-12-31
4  102      200.0    Kylan Harper  ...      long  weekday   2018-12-31
5  103      130.0      Cesar Lang  ...    medium  weekday   2018-12-31
[5 rows x 24 columns]

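dept_time_dt now has the datetime64[ns] type, and with infer_datetime_format = True, December 31st, 2018 was converted to 2018-12-31, the same format as the rest of the data. The program cannot work out which day 13/31/2018 is supposed to be, so it was set to NaT.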
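
Format inference behaves differently across pandas versions, so when the set of formats is known it can be more predictable to parse each one explicitly and combine the results. A sketch under that assumption; the two format strings and the toy values are guesses at what such a column might contain, not taken from the dataset:

```python
import pandas as pd

dept_time = pd.Series(['12/31/2018', '2018-12-31', '13/31/2018'])

# Try each known format in turn; errors='coerce' yields NaT on mismatch.
formats = ['%m/%d/%Y', '%Y-%m-%d']
parsed = pd.Series(pd.NaT, index=dept_time.index, dtype='datetime64[ns]')
for fmt in formats:
    attempt = pd.to_datetime(dept_time, format=fmt, errors='coerce')
    # Fill only the positions that earlier formats could not parse.
    parsed = parsed.fillna(attempt)

print(parsed)
print(dept_time[parsed.isna()].tolist())  # values no format could parse
```

Values that survive as NaT after every format has been tried are the ones worth inspecting by hand.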

dept_time_dt also contains flights that departed today but were recorded with a future date. We handle those as follows:

# Get today's date
today = pd.to_datetime('today').floor('D')
# Select the values in `dept_time_dt` later than today and set them to today's date
df.loc[df['dept_time_dt'] > today, 'dept_time_dt'] = today
# Print the latest date now in the data
print(df['dept_time_dt'].max())

So far we have cleaned nearly every variable, and only a few remain untouched. Next let's check whether the temperature variable has any problems:

plt.scatter(x = 'dept_time_dt', y = 'temperature', data = df)
# Create title, xlabel and ylabel
plt.title('Temperature in Fahrenheit - SFO')
plt.xlabel('Dates')
plt.ylabel('Temperature in Fahrenheit')
plt.xticks(rotation = 90)
# Show plot
plt.show()

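The scatter plot shows some extremely cold readings in January 2018 and January 2019. San Francisco is unlikely to see temperatures around 10 degrees Fahrenheit; common sense says these are 10 degrees Celsius. We convert every temperature below 20: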

temp_cel = df.loc[df['temperature'] < 20, 'temperature']
# Celsius to Fahrenheit formula
temp_fah = 1.8 * temp_cel + 32
df.loc[df['temperature'] < 20, 'temperature'] = temp_fah
# Verify that no temperature below 20 remains
assert df['temperature'].min() > 20

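Cross-field validation

Finally we cross-check the frequent-flyer points to see whether they add up. There are three points variables: the points gained on this flight (points_gain), the points held before this flight (points_last_time), and the total (total_points).

So, total_points = points_gain + points_last_time

Let's verify that this holds: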

points_col = ['points_gain', 'points_last_time']
points_equ = df[points_col].sum(axis=1) == df['total_points']
consistent_points = df[points_equ]
inconsistent_points = df[~points_equ]
print("Number of inconsistent points: ", inconsistent_points.shape[0])
Out [16]:
Number of inconsistent points:  1

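There is 1 row where the totals do not match. In general, the simplest way to deal with inconsistent data like this is to drop it, or to set it to NA and impute it. In our example the fix is simple: pull out the inconsistent row, see what went wrong, and handle it accordingly.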
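
Before choosing between dropping and imputing, it helps to look at the size of the mismatch. A sketch on made-up numbers; recomputing `total_points` from its parts is only one possible repair and assumes the two component columns are the trustworthy ones:

```python
import pandas as pd

toy = pd.DataFrame({
    'points_gain':      [100, 250, 0],
    'points_last_time': [900, 750, 500],
    'total_points':     [1000, 1100, 500],  # second row is off by 100
})

expected = toy['points_gain'] + toy['points_last_time']
mismatch = toy['total_points'] != expected

# Show the offending rows together with the size of the discrepancy.
print(toy[mismatch].assign(difference=toy['total_points'] - expected))

# One possible repair: overwrite the total with the recomputed sum.
toy.loc[mismatch, 'total_points'] = expected[mismatch]
assert (toy['total_points'] == toy['points_gain'] + toy['points_last_time']).all()
```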

That concludes the cleaning of the airlines df. We can now move on to analyzing it.


Basic Data Cleaning in Python