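Reading Multiple Data Files

The most common way to read a file in pandas is pd.read_csv(filepath). It takes many parameters; see the documentation for details. Other frequently used readers include pd.read_excel(), pd.read_html(), and pd.read_json().

The most naive approach is to write one df = pd.read_csv() statement per file: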

In [1]: import pandas as pd
In [2]: dataframe0 = pd.read_csv('sales-jan-2015.csv')
In [3]: dataframe1 = pd.read_csv('sales-feb-2015.csv')

A more convenient way to read several files is a loop:

In [4]: filenames = ['sales-jan-2015.csv', 'sales-feb-2015.csv']
In [5]: dataframes = []
In [6]: for f in filenames:
   ...:     dataframes.append(pd.read_csv(f))

Alternatively, use a list comprehension:

In [7]: filenames = ['sales-jan-2015.csv', 'sales-feb-2015.csv']
In [8]: dataframes = [pd.read_csv(f) for f in filenames]

You can also combine a list comprehension with the glob module:

In [9]: from glob import glob
In [10]: filenames = glob('sales*.csv')
In [11]: dataframes = [pd.read_csv(f) for f in filenames]
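
As a self-contained illustration of the glob approach, the sketch below writes two tiny CSV files into a temporary directory and reads them back. The file contents and the Date and Units column names are made up for the example; only the file names follow the ones used above. Sorting the glob() result is a precaution, since glob() does not promise any particular order.

```python
import tempfile
from glob import glob
from pathlib import Path

import pandas as pd

# Create two tiny CSV files in a throwaway directory (hypothetical contents).
tmp_dir = Path(tempfile.mkdtemp())
(tmp_dir / "sales-jan-2015.csv").write_text("Date,Units\n2015-01-05,3\n2015-01-20,8\n")
(tmp_dir / "sales-feb-2015.csv").write_text("Date,Units\n2015-02-02,5\n")

# glob() returns matching paths in no guaranteed order, so sort them.
filenames = sorted(glob(str(tmp_dir / "sales*.csv")))
dataframes = [pd.read_csv(f) for f in filenames]

# Stack the per-file frames into one and renumber the rows from 0.
sales = pd.concat(dataframes, ignore_index=True)
print(sales)
```

Passing ignore_index=True is a design choice here: each file starts its row labels at 0, so keeping them would leave duplicate labels in the combined frame.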

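Concatenating and Merging DataFrames

append() vs. concat()

DataFrames can be combined in two directions, vertically and horizontally, which can sound confusing at first. Vertical concatenation stacks two or more DataFrames on top of each other into a single DataFrame; duplicated index values or column labels are left untouched, so it is simply a blunt top-to-bottom stack. Horizontal merging aligns rows that share the same index value and places their columns side by side, much like using VLOOKUP in Excel to combine two sheets that share an id but hold different data. Three functions are commonly used for this: append(), concat(), and merge(). append() can only concatenate vertically, whereas concat() works in either direction and merge() combines tables horizontally by matching keys. concat() concatenates vertically by default, which gives the same result as append(); to merge horizontally, set the axis parameter: concat(axis=0), the default, stacks vertically, and concat(axis=1) merges horizontally. Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so the append() calls in this post only run on older versions; pd.concat() is the replacement. A few examples follow: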

In [1]: population = pd.read_csv('population_00.csv', index_col=0)
In [2]: unemployment = pd.read_csv('unemployment_00.csv', index_col=0)
In [3]: print(population)
|               | 2010 Census Population |
|---------------|------------------------|
| Zip Code ZCTA |                        |
| 57538         | 322                    |
| 59916         | 130                    |
| 37660         | 40038                  |
| 2860          | 45199                  |
In [4]: print(unemployment)
|       | unemployment | participants |
|-------|--------------|--------------|
| Zip   |              |              |
| 2860  | 0.11         | 34447        |
| 46167 | 0.02         | 4800         |
| 1097  | 0.33         | 42           |
| 80808 | 0.07         | 4310         |

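The two data files contain the data shown above. The following example shows the result of stacking the DataFrames vertically with append() and with concat(axis=0), the default; both give exactly the same result: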

In [5]: population.append(unemployment)
Out[5]:
|       | 2010 Census Population                           | participants | unemployment |
|-------|--------------------------------------------------|--------------|--------------|
| 57538 | 322.0                                            | NaN          | NaN          |
| 59916 | 130.0                                            | NaN          | NaN          |
| 37660 | 40038.0                                          | NaN          | NaN          |
| 2860  | 45199.0                                          | NaN          | NaN          |
| 2860  | NaN                                              | 34447.0      | 0.11         |
| 46167 | NaN                                              | 4800.0       | 0.02         |
| 1097  | NaN                                              | 42.0         | 0.33         |
| 80808 | NaN                                              | 4310.0       | 0.07         |
In [6]: pd.concat([population, unemployment], axis=0)
Out[6]:
|       | 2010 Census Population                           | participants | unemployment |
|-------|--------------------------------------------------|--------------|--------------|
| 57538 | 322.0                                            | NaN          | NaN          |
| 59916 | 130.0                                            | NaN          | NaN          |
| 37660 | 40038.0                                          | NaN          | NaN          |
| 2860  | 45199.0                                          | NaN          | NaN          |
| 2860  | NaN                                              | 34447.0      | 0.11         |
| 46167 | NaN                                              | 4800.0       | 0.02         |
| 1097  | NaN                                              | 42.0         | 0.33         |
| 80808 | NaN                                              | 4310.0       | 0.07         |
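
To try the vertical stack without the CSV files, the sketch below rebuilds the two tables in memory from the values printed above; the in-memory construction is an assumption standing in for the original files. It uses pd.concat() only, so it also runs on pandas 2.0 and later, where DataFrame.append() no longer exists.

```python
import pandas as pd

# Rebuild the two tables from the values printed above (stand-ins for the CSVs).
population = pd.DataFrame(
    {"2010 Census Population": [322, 130, 40038, 45199]},
    index=pd.Index([57538, 59916, 37660, 2860], name="Zip Code ZCTA"),
)
unemployment = pd.DataFrame(
    {"unemployment": [0.11, 0.02, 0.33, 0.07],
     "participants": [34447, 4800, 42, 4310]},
    index=pd.Index([2860, 46167, 1097, 80808], name="Zip"),
)

# axis=0 (the default) stacks rows; columns missing from one frame become NaN.
stacked = pd.concat([population, unemployment], axis=0)
print(stacked)

# The shared ZIP code 2860 is kept twice because rows are not aligned.
has_duplicates = bool(stacked.index.duplicated().any())
print(has_duplicates)
```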

Notice that ZIP code 2860 appears twice. To combine the information that both DataFrames hold for the same ZIP code, we need a horizontal merge, which concat() makes easy with concat(axis=1) or concat(axis='columns'):

In [7]: pd.concat([population, unemployment], axis=1)
Out[7]:
|       | 2010 Census Population                           | unemployment | participants |
|-------|--------------------------------------------------|--------------|--------------|
| 1097  | NaN                                              | 0.33         | 42.0         |
| 2860  | 45199.0                                          | 0.11         | 34447.0      |
| 37660 | 40038.0                                          | NaN          | NaN          |
| 46167 | NaN                                              | 0.02         | 4800.0       |
| 57538 | 322.0                                            | NaN          | NaN          |
| 59916 | 130.0                                            | NaN          | NaN          |
| 80808 | NaN                                              | 0.07         | 4310.0       |
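
The horizontal case can be sketched the same way. The DataFrames are again rebuilt in memory from the printed values (an assumption standing in for the CSV files), and the snippet checks that axis=1 and axis='columns' are two spellings of the same operation.

```python
import pandas as pd

# In-memory stand-ins for population_00.csv and unemployment_00.csv.
population = pd.DataFrame(
    {"2010 Census Population": [322, 130, 40038, 45199]},
    index=pd.Index([57538, 59916, 37660, 2860], name="Zip Code ZCTA"),
)
unemployment = pd.DataFrame(
    {"unemployment": [0.11, 0.02, 0.33, 0.07],
     "participants": [34447, 4800, 42, 4310]},
    index=pd.Index([2860, 46167, 1097, 80808], name="Zip"),
)

# axis=1 aligns rows on the index and places the columns side by side.
by_number = pd.concat([population, unemployment], axis=1)
by_name = pd.concat([population, unemployment], axis="columns")
print(by_number)

# ZIP code 2860 exists in both inputs, so its row has no missing values.
row_2860_complete = bool(by_number.loc[2860].notna().all())
print(row_2860_complete)
```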

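concat() vs. merge()

Having covered vertical and horizontal combination with concat() and append(), we now turn to how concat() and merge() differ. Both can combine DataFrames side by side, but their usage and output differ, and this is where the concept of a join comes in. concat() defaults to an outer join, while merge() defaults to an inner join. Another important difference is that concat() aligns on the index, whereas merge() matches on column labels. If the key column has been set as the index, reset the index first so that the key becomes an ordinary column again; otherwise, as in the session below, merge() raises a KeyError saying the column cannot be found. (Recent pandas versions also accept index level names in on, left_on, and right_on.) The examples below reuse the population and unemployment DataFrames: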

In [1]: population = pd.read_csv('population_00.csv', index_col=0)
In [1]: unemployment = pd.read_csv('unemployment_00.csv', index_col=0)
In [1]: print(population)
|               | 2010 Census Population |
|---------------|------------------------|
| Zip Code ZCTA |                        |
| 57538         | 322                    |
| 59916         | 130                    |
| 37660         | 40038                  |
| 2860          | 45199                  |
In [2]: print(unemployment)
|       | unemployment | participants |
|-------|--------------|--------------|
| Zip   |              |              |
| 2860  | 0.11         | 34447        |
| 46167 | 0.02         | 4800         |
| 1097  | 0.33         | 42           |
| 80808 | 0.07         | 4310         |
In [3]: pd.concat([population, unemployment], axis=1)  # join defaults to 'outer'
Out[3]:
|       | 2010 Census Population                           | unemployment | participants |
|-------|--------------------------------------------------|--------------|--------------|
| 1097  | NaN                                              | 0.33         | 42.0         |
| 2860  | 45199.0                                          | 0.11         | 34447.0      |
| 37660 | 40038.0                                          | NaN          | NaN          |
| 46167 | NaN                                              | 0.02         | 4800.0       |
| 57538 | 322.0                                            | NaN          | NaN          |
| 59916 | 130.0                                            | NaN          | NaN          |
| 80808 | NaN                                              | 0.07         | 4310.0       |
In [4]: pd.concat([population, unemployment], axis=1, join='inner')  # join set to 'inner' instead of the default 'outer'
Out[4]:
|       | 2010 Census Population                           | unemployment | participants |
|-------|--------------------------------------------------|--------------|--------------|
| 2860  | 45199.0                                          | 0.11         | 34447.0      |
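
A compact way to see the effect of the join parameter is to compare the shapes of the two results. This is a sketch on in-memory copies of the tables above, not the original CSV files.

```python
import pandas as pd

# In-memory stand-ins for the two CSV files used in the session above.
population = pd.DataFrame(
    {"2010 Census Population": [322, 130, 40038, 45199]},
    index=pd.Index([57538, 59916, 37660, 2860], name="Zip Code ZCTA"),
)
unemployment = pd.DataFrame(
    {"unemployment": [0.11, 0.02, 0.33, 0.07],
     "participants": [34447, 4800, 42, 4310]},
    index=pd.Index([2860, 46167, 1097, 80808], name="Zip"),
)

# Outer join (default): the union of both indexes, with NaN where data is missing.
outer = pd.concat([population, unemployment], axis=1)

# Inner join: only index values present in both frames survive.
inner = pd.concat([population, unemployment], axis=1, join="inner")

print(outer.shape, inner.shape)
print(inner)
```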

Next, the same DataFrames are combined with merge():

In [5]: population = pd.read_csv('population_00.csv', index_col=0)
In [5]: unemployment = pd.read_csv('unemployment_00.csv', index_col=0)
    
# The first column of each file (ZipCode and Zip) is again loaded as the index; let's see what merge() does in that case
In [5]: pd.merge(population, unemployment, left_on='ZipCode', right_on='Zip')
Out[5]: KeyError: "None of ['ZipCode'] are in the columns"
        
# Because ZipCode has been set as the index, merge() cannot find a column with that name. Calling .reset_index(), or not setting an index when loading the data, solves the problem.
In [6]: population = population.reset_index()
In [6]: unemployment = unemployment.reset_index()
In [6]: pd.merge(population, unemployment, left_on='ZipCode', right_on='Zip')
    
# how defaults to 'inner'; note that merge() names the join-type parameter how, not join
Out[6]:
|   | ZipCode | 2010 Census Population | Zip  | Unemployment | Participants |
|---|---------|------------------------|------|--------------|--------------|
| 0 | 2860    | 45199                  | 2860 | 0.11         | 34447        |
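# The joins produced by merge() and concat() differ: the concat() result keeps the ZIP codes as its index, while the merge() result has a default integer index and keeps both key columns, ZipCode and Zip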
In [7]: pd.merge(population, unemployment, left_on='ZipCode', right_on='Zip',
               how='outer')
Out[7]: 
|   | ZipCode | 2010 Census Population | Zip     | Unemployment | Participants |
|---|---------|------------------------|---------|--------------|--------------|
| 0 | 57538.0 | 322.0                  | NaN     | NaN          | NaN          |
| 1 | 59916.0 | 130.0                  | NaN     | NaN          | NaN          |
| 2 | 37660.0 | 40038.0                | NaN     | NaN          | NaN          |
| 3 | 2860.0  | 45199.0                | 2860.0  | 0.11         | 34447.0      |
| 4 | NaN     | NaN                    | 46167.0 | 0.02         | 4800.0       |
| 5 | NaN     | NaN                    | 1097.0  | 0.33         | 42.0         |
| 6 | NaN     | NaN                    | 80808.0 | 0.07         | 4310.0       |
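# Oddly, ZipCode and Zip have become floats after the outer join.

For now I consider this change to ZipCode and Zip a bug and have reported it on GitHub; see the issue for any follow-up: https://github.com/pandas-dev/pandas/issues/34017 (A likely explanation is that unmatched rows receive NaN in the key columns, and an integer column cannot hold NaN, so it is upcast to float.)

There are ways around this, though: use concat() instead, or give the ZIP code column the same label in both DataFrames and merge with on=['Zip']. In my experiments, merging with on=['Zip'] does not show the problem: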
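
The dtype change is easy to reproduce on small in-memory copies of the two tables (stand-ins for the CSV files, with the key already stored as an ordinary column). The sketch compares the key dtypes before and after an outer merge, and also tries a merge on a shared key name for contrast.

```python
import pandas as pd

# Keys stored as ordinary columns, as after reset_index() in the session above.
population = pd.DataFrame({
    "ZipCode": [57538, 59916, 37660, 2860],
    "2010 Census Population": [322, 130, 40038, 45199],
})
unemployment = pd.DataFrame({
    "Zip": [2860, 46167, 1097, 80808],
    "Unemployment": [0.11, 0.02, 0.33, 0.07],
    "Participants": [34447, 4800, 42, 4310],
})

# Default inner join: only ZIP codes present in both tables.
inner = pd.merge(population, unemployment, left_on="ZipCode", right_on="Zip")

# Outer join with differently named keys: both key columns are kept, and
# each one gets NaN on the rows that exist only in the other table.
outer = pd.merge(population, unemployment,
                 left_on="ZipCode", right_on="Zip", how="outer")
print(population["ZipCode"].dtype, outer["ZipCode"].dtype, outer["Zip"].dtype)

# With a shared key name, the result has a single key column with no gaps.
same_key = pd.merge(population.rename(columns={"ZipCode": "Zip"}),
                    unemployment, on="Zip", how="outer")
print(same_key["Zip"].dtype)
```

If integer keys are needed after an outer merge on differently named columns, one option is to convert the key columns afterwards, for example with pandas' nullable integer dtype via astype('Int64').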

In [8]: population.rename(columns={'ZipCode':'Zip'}, inplace=True)  # rename the key column in population
In [8]: merge_2 = pd.merge(population, unemployment, on=['Zip'], how='outer')
In [8]: print(merge_2)
Out[8]:
|   | Zip   | 2010 Census Population | Unemployment | Participants |
|---|-------|------------------------|--------------|--------------|
| 0 | 57538 | 322.0                  | NaN          | NaN          |
| 1 | 59916 | 130.0                  | NaN          | NaN          |
| 2 | 37660 | 40038.0                | NaN          | NaN          |
| 3 | 2860  | 45199.0                | 0.11         | 34447.0      |
| 4 | 46167 | NaN                    | 0.02         | 4800.0       |
| 5 | 1097  | NaN                    | 0.33         | 42.0         |
| 6 | 80808 | NaN                    | 0.07         | 4310.0       |

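join() vs. concat()

join() supports four join types: how='left', how='right', how='inner', and how='outer'. merge() supports all of these as well. By now you should have a solid understanding of append(), concat(), join(), and merge(). merge() is the most powerful of the four, but that does not mean every operation calls for it; append() and concat() are sometimes more convenient. The final section summarizes the four functions and when to use each. First, here is join() in action:

In [1]: population.join(unemployment)  # join() defaults to how='left'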
Out[1]:
|         | 2010 Census Population | unemployment | participants |
|---------|------------------------|--------------|--------------|
| ZipCode |                        |              |              |
| 57538   | 322                    | NaN          | NaN          |
| 59916   | 130                    | NaN          | NaN          |
| 37660   | 40038                  | NaN          | NaN          |
| 2860    | 45199                  | 0.11         | 34447.0      |

df1.join(df2, how='left') uses the left DataFrame as the reference: population is to the left of unemployment, so the result follows population's index, ZipCode. Likewise, df1.join(df2, how='right') follows the index of unemployment:

In [2]: population.join(unemployment, how= 'right')
Out[2]:
|       | 2010 Census Population | unemployment | participants |
|-------|------------------------|--------------|--------------|
| Zip   |                        |              |              |
| 2860  | 45199.0                | 0.11         | 34447        |
| 46167 | NaN                    | 0.02         | 4800         |
| 1097  | NaN                    | 0.33         | 42           |
| 80808 | NaN                    | 0.07         | 4310         |

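Both join() and concat() align on the index, so the DataFrames being combined must have matching indexes. concat() lacks the left and right join types that join() offers, but for outer and inner joins the two produce the same result: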

population.join(unemployment, how='outer')
pd.concat([population, unemployment], join='outer', axis=1)
# the two calls above give the same result
population.join(unemployment, how='inner')
pd.concat([population, unemployment], join='inner', axis=1)
# the two calls above give the same result
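
The equivalence can be checked programmatically. The sketch below uses in-memory copies of the two tables (an assumption standing in for the CSV files) and sorts by index before comparing, because the two functions are not guaranteed to return rows in the same order. Index names and exact dtypes are left out of the comparison so that only the values are checked.

```python
import pandas as pd

# In-memory stand-ins for the two indexed tables used above.
population = pd.DataFrame(
    {"2010 Census Population": [322, 130, 40038, 45199]},
    index=pd.Index([57538, 59916, 37660, 2860], name="ZipCode"),
)
unemployment = pd.DataFrame(
    {"unemployment": [0.11, 0.02, 0.33, 0.07],
     "participants": [34447, 4800, 42, 4310]},
    index=pd.Index([2860, 46167, 1097, 80808], name="Zip"),
)

results = {}
for how in ("outer", "inner"):
    joined = population.join(unemployment, how=how)
    concatenated = pd.concat([population, unemployment], axis=1, join=how)
    # Raises AssertionError if the values differ; names and dtypes are ignored.
    pd.testing.assert_frame_equal(
        joined.sort_index(), concatenated.sort_index(),
        check_dtype=False, check_names=False,
    )
    results[how] = joined.shape

print(results)
```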

Summary of append(), concat(), join(), and merge()

append()

Syntax: df1.append(df2)

Description: append() simply stacks two DataFrames vertically; no index alignment is needed.

concat()

Syntax: pd.concat([df1, df2])

Description: concat() can combine multiple rows or columns, vertically or horizontally, using an inner or outer join; it aligns on the index.

join()

Syntax: df1.join(df2)

Description: join() supports several join types: besides inner and outer it also offers left and right. It aligns on the index as well.

merge()

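Syntax: pd.merge(df1, df2)

Description: the merge function with the most options. It does not require an index.

The merge_ordered() function

merge_ordered() performs two operations in a single call: merge() and sort_values().

pd.merge_ordered(hardware, software, on=['', ''], suffixes=['', ''], fill_method='ffill')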
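
Since the call above leaves on and suffixes blank, here is a small runnable sketch of the same idea. The hardware and software tables, their Date and Units columns, and the _hw and _sw suffixes are all hypothetical values chosen for illustration.

```python
import pandas as pd

# Hypothetical sales tables whose dates interleave.
hardware = pd.DataFrame({
    "Date": pd.to_datetime(["2015-02-02", "2015-02-04"]),
    "Units": [3, 7],
})
software = pd.DataFrame({
    "Date": pd.to_datetime(["2015-02-01", "2015-02-03"]),
    "Units": [10, 4],
})

# merge_ordered() does an outer join by default and returns rows ordered by
# the key; fill_method="ffill" forward-fills the gaps left by unmatched rows.
merged = pd.merge_ordered(hardware, software, on="Date",
                          suffixes=("_hw", "_sw"), fill_method="ffill")
print(merged)
```

Forward filling only looks backwards, so a gap in the very first row has nothing to copy from and stays NaN.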
