Introduction to Pandas

import pandas as pd

DataFrame

pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})
   Yes   No
0   50  131
1   21    2

A DataFrame behaves much like a dictionary whose keys are the column names (strings) and whose values are lists of entries, and it can be constructed from exactly such a dictionary:

pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'],
              'Sue': ['Pretty good.', 'Bland.']},
             index=['Product A', 'Product B'])
                     Bob           Sue
Product A    I liked it.  Pretty good.
Product B  It was awful.        Bland.

Here we manually assign row labels via the index parameter instead of using the default index, which starts from 0.

Series

A Series, by contrast, is a sequence of data values. If a DataFrame is a table, a Series is a list. And in fact you can create one with nothing more than a list:

pd.Series([1, 2, 3, 4, 5])
0    1
1    2
2    3
3    4
4    5
dtype: int64

A Series is, in essence, a single column of a DataFrame. So you can assign row labels to the Series the same way as before, using an index parameter. However, a Series does not have a column name; it only has one overall name:

pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'],
          name='Product A')
2015 Sales    30
2016 Sales    35
2017 Sales    40
Name: Product A, dtype: int64

Reading data files

data = pd.read_csv("test.csv")

Use the pd.read_csv() function to load a CSV file.
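As an aside, the original test.csv is never shown in these notes, but its contents can be reconstructed from the outputs below. Here is a sketch that regenerates an equivalent file (the values are inferred, so treat this as an assumption):

import pandas as pd

# Inferred from the outputs below. to_csv() writes the index by default,
# which produces the "Unnamed: 0" column seen in the first head() call.
pd.DataFrame({'col1': [1, 1, 2, 3, 5, 8],
              'col2': [2, 2, 4, 6, 10, 16],
              'col3': ['a', 'b', 'c', 'a', 'b', 'c']}).to_csv("test.csv")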

We can use the shape attribute to check how large the resulting DataFrame is:

data.shape
(6, 4)

We can examine the contents of the resultant DataFrame using the head() command, which grabs the first five rows:

data.head()
   Unnamed: 0  col1  col2 col3
0           0     1     2    a
1           1     1     2    b
2           2     2     4    c
3           3     3     6    a
4           4     5    10    b

The pd.read_csv() function is well-endowed, with over 30 optional parameters you can specify. For example, you can see in this dataset that the CSV file has a built-in index, which pandas did not pick up on automatically. To make pandas use that column for the index (instead of creating a new one from scratch), we can specify an index_col.

data = pd.read_csv("test.csv", index_col=0)
data.head()
   col1  col2 col3
0     1     2    a
1     1     2    b
2     2     4    c
3     3     6    a
4     5    10    b

Naive accessors

A DataFrame in pandas is very similar to a dictionary: you can access a column either with dictionary-style key indexing or as an attribute named after the column.

c1 = data['col1']  # dictionary-style (key-value) access
c2 = data.col1     # attribute-style access, using the column name
print(c1 == c2)
0    True
1    True
2    True
3    True
4    True
5    True
Name: col1, dtype: bool

Because each column of a DataFrame is organized like a list of values, we can access an individual entry with the indexing operator []:

data.col1[3]
3

Indexing in pandas

pandas has its own accessor operators, loc and iloc. They differ in how they interpret their arguments: iloc takes numerical positions, while loc takes labels (e.g. column names as strings).

iloc is conceptually simpler than loc because it ignores the dataset's indices. When we use iloc we treat the dataset like a big matrix.

loc, by contrast, uses the information in the indices to do its work. When we use loc we make our code more meaningful and easier to read.

Index-based selection

iloc selects data based on its numerical position in the data:

data.iloc[3, 0]  # row position 3, column position 0
3

Label-based selection

loc selects data based on its row index label and column name:

data.loc[3, 'col1']  # row label 3, column name 'col1'
3
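Both operators also accept slices and lists of positions or labels. A few illustrative patterns (a sketch, not from the original notes):

data.iloc[:3, 0]                 # first three rows of the first column, by position
data.iloc[[0, 2], :]             # rows at positions 0 and 2, all columns
data.loc[0:2, ['col1', 'col3']]  # note: a loc slice includes BOTH endpoints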

Conditional selection

A comparison on a column selects values with a condition; it returns a Series of boolean values:

data.col1 == 3
0    False
1    False
2    False
3     True
4    False
5    False
Name: col1, dtype: bool

By combining conditional selections, we can use the values of some columns to locate rows. Conditions can also be combined with the set-style operators & (and) and | (or), as sketched after the next example.

data.loc[data.col1 == 3]
   col1  col2 col3
3     3     6    a
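For instance, a combined condition might look like this (a sketch; each side must be parenthesized because & and | bind more tightly than the comparisons):

data.loc[(data.col1 > 1) & (data.col3 == 'c')]  # col1 greater than 1 AND col3 equal to 'c'
data.loc[(data.col1 == 1) | (data.col1 == 5)]   # col1 equal to 1 OR equal to 5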

pandas has some built-in conditional selectors, two of which we will highlight here.

isin()

The first is isin. isin lets you select data whose value "is in" a list of values. For example, you can use it to select rows whose col1 value is 2 or 5:

data.loc[data.col1.isin([2, 5])]
   col1  col2 col3
2     2     4    c
4     5    10    b

isnull() & notnull()

The second is isnull (and its companion notnull). These methods let you highlight values which are (or are not) empty (NaN).

data.loc[data.col1.isnull()]
   col1  col2 col3

data.loc[data.col1.notnull()]
   col1  col2 col3
0     1     2    a
1     1     2    b
2     2     4    c
3     3     6    a
4     5    10    b
5     8    16    c
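A handy companion trick (my addition, not from the original notes): since booleans sum as 0 and 1, summing the mask counts the missing entries.

data.col1.isnull().sum()  # number of NaN entries in col1; 0 for this dataset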

Summary functions

pandas provides some useful functions that report summary statistics:

data.col1.describe() # for numerical values
count    6.000000
mean     3.333333
std      2.732520
min      1.000000
25%      1.250000
50%      2.500000
75%      4.500000
max      8.000000
Name: col1, dtype: float64

describe() lists the summary statistics of that column.

data.col1.mean()
3.3333333333333335

A single statistic, such as the mean, can also be accessed directly.

data.col3.describe() # for string values
count     6
unique    3
top       c
freq      2
Name: col3, dtype: object

For string values, describe() lists frequency statistics; top is the string with the highest frequency.

print(data.col1.unique())
print(data.col3.unique())
[1 2 3 5 8]
['a' 'b' 'c']

data.col3.value_counts()
c    2
a    2
b    2
Name: col3, dtype: int64
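value_counts() is essentially a shortcut for grouping and counting. Assuming the groupby() machinery introduced below, this sketch produces the same counts:

data.groupby('col3').col3.count()  # count the rows in each col3 group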

Maps

map() works just like the map function in Python: it applies a function to every value in a Series and returns a new, transformed Series.

data.col2.map(lambda p: p / 2)
0    1.0
1    1.0
2    2.0
3    3.0
4    5.0
5    8.0
Name: col2, dtype: float64

Note: this operation does not change the original data.
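To keep the transformed values you must assign the result back to the column (a standard pandas pattern; this sketch is not run here, since the later examples assume the original values):

data.col2 = data.col2.map(lambda p: p / 2)  # overwrite col2 with the halved values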

The apply() function is the DataFrame-level equivalent of map(), although the two operate on values slightly differently: apply() hands the supplied function a whole row or column at a time, depending on the axis argument. Beware that it may change the original data!!!

def fun1(row):
    # with axis='columns', fun1 receives one row at a time
    row.col2 = row.col2 - 2 * row.col1
    return row  # must return the modified row


data.apply(fun1, axis='columns')
   col1  col2 col3
0     1     0    a
1     1     0    b
2     2     0    c
3     3     0    a
4     5     0    b
5     8     0    c

def fun2(col):
    # with axis='rows', fun2 receives one column at a time
    col.iloc[1] = col.iloc[5]  # in-place assignment on the passed-in column
    return col  # must return the modified column


data.apply(fun2, axis='rows')
/usr/local/lib/python3.7/site-packages/pandas/core/indexing.py:670: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)
   col1  col2 col3
0     1     2    a
1     8    16    c
2     2     4    c
3     3     6    a
4     5    10    b
5     8    16    c

apply() is meant for situations where you need complex data transformations. Like map(), it returns a new DataFrame and normally does not change the original data. However, the SettingWithCopyWarning above is a hint that fun2's in-place assignment wrote through to the underlying data: from this point on, row 1 of data actually holds the values of row 5, as the outputs in the following sections show.
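A mutation-free way to express fun1's transformation is to return a new value instead of modifying the row in place (a sketch of the same computation):

# Builds a Series of col2 - 2*col1 without touching the original rows
data.apply(lambda row: row.col2 - 2 * row.col1, axis='columns')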

Groupwise analysis

The groupby() function gathers rows with the same value in a given column into a group. For instance, col3 has only 3 distinct values (a, b, c), so data.groupby('col3') yields 3 groups, and we can manipulate these groups separately. More specifically, data.groupby('col3') behaves like a sliced-up DataFrame (technically a GroupBy object), and selecting one of its columns, such as data.groupby('col3').col1, lets us operate on the data in each group separately.

print(data.groupby('col3').col1.describe())
print(data.groupby('col3').col2.describe())
print(data.groupby('col3').col3.describe())
      count  mean       std  min  25%  50%  75%  max
col3                                                
a       2.0   2.0  1.414214  1.0  1.5  2.0  2.5  3.0
b       1.0   5.0       NaN  5.0  5.0  5.0  5.0  5.0
c       3.0   6.0  3.464102  2.0  5.0  8.0  8.0  8.0
      count  mean       std   min   25%   50%   75%   max
col3                                                     
a       2.0   4.0  2.828427   2.0   3.0   4.0   5.0   6.0
b       1.0  10.0       NaN  10.0  10.0  10.0  10.0  10.0
c       3.0  12.0  6.928203   4.0  10.0  16.0  16.0  16.0
     count unique top freq
col3                      
a        2      1   a    2
b        1      1   b    1
c        3      1   c    3

data.groupby(['col3', 'col1']).col2.describe()
           count  mean  std   min   25%   50%   75%   max
col3 col1
a    1       1.0   2.0  NaN   2.0   2.0   2.0   2.0   2.0
     3       1.0   6.0  NaN   6.0   6.0   6.0   6.0   6.0
b    5       1.0  10.0  NaN  10.0  10.0  10.0  10.0  10.0
c    2       1.0   4.0  NaN   4.0   4.0   4.0   4.0   4.0
     8       2.0  16.0  0.0  16.0  16.0  16.0  16.0  16.0

groupby() can also group by multiple columns, in which case the result has a multi-index. We can call reset_index() to convert it back to a normal index.

dt = data.groupby(['col3', 'col1']).col2.describe()
dt.reset_index()
  col3  col1  count  mean  std   min   25%   50%   75%   max
0    a     1    1.0   2.0  NaN   2.0   2.0   2.0   2.0   2.0
1    a     3    1.0   6.0  NaN   6.0   6.0   6.0   6.0   6.0
2    b     5    1.0  10.0  NaN  10.0  10.0  10.0  10.0  10.0
3    c     2    1.0   4.0  NaN   4.0   4.0   4.0   4.0   4.0
4    c     8    2.0  16.0  0.0  16.0  16.0  16.0  16.0  16.0

Another groupby() method worth mentioning is agg(), which lets you run a bunch of different functions on your DataFrame simultaneously. For example, we can generate a simple statistical summary of the dataset as follows:

data.groupby('col3').col1.agg([len, min, max])
      len  min  max
col3
a       2    1    3
b       1    5    5
c       3    2    8

Effective use of groupby() will allow you to do lots of really powerful things with your dataset.

Sorting

sort_values()

data.sort_values(by=["col1"], ascending=False)
   col1  col2 col3
1     8    16    c
5     8    16    c
4     5    10    b
3     3     6    a
2     2     4    c
0     1     2    a

sort_index()

sort_index() sorts by the index labels themselves and takes no by parameter (older pandas accepted by as an alias for sort_values, but it has been removed):

data.sort_index(ascending=False)
   col1  col2 col3
5     8    16    c
4     5    10    b
3     3     6    a
2     2     4    c
1     8    16    c
0     1     2    a
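sort_values() can also sort by several columns at once (a sketch; the column choice is illustrative):

data.sort_values(by=['col3', 'col1'])  # sort by col3 first, break ties with col1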

Missing data & Replace

Entries missing values are given the value NaN, short for "Not a Number". For technical reasons these NaN values are always of the float64 dtype.

Pandas provides some methods specific to missing data. To select NaN entries you can use pd.isnull() (or its companion pd.notnull()). This is meant to be used thusly:

data.loc[pd.isnull(data.col1)]
# data.col1.isnull() is equivalent to pd.isnull(data.col1)
   col1  col2 col3

Replacing missing values is a common operation. Pandas provides a really handy method for this problem: fillna(). fillna() provides a few different strategies for mitigating such data. For example, we can simply replace each NaN with an "Unknown":

data.col1.fillna("Unknown")
0    1
1    8
2    2
3    3
4    5
5    8
Name: col1, dtype: int64
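Besides a constant, fillna() accepts other strategies, such as filling with the column mean or backfilling with the next valid value (a sketch; col1 has no NaNs here, so both calls return it unchanged):

data.col1.fillna(data.col1.mean())  # replace each NaN with the column mean
data.col1.bfill()                   # backfill: copy the next valid value upward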

Alternatively, the replace() method is widely used. Here we replace every 'a' in col3 with 'b':

data.col3.replace('a', 'b')
0    b
1    c
2    c
3    b
4    b
5    c
Name: col3, dtype: object
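replace() also accepts a dictionary to perform several substitutions at once (an illustrative sketch):

data.col3.replace({'a': 'x', 'b': 'y'})  # map each key to its replacement value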

Renaming

The rename() function lets you change index labels and/or column names:

data.rename(columns={'col1': 'column1', 'col2': 'column2'})
   column1  column2 col3
0        1        2    a
1        8       16    c
2        2        4    c
3        3        6    a
4        5       10    b
5        8       16    c

data.rename(index={1: 'row1', 4: 'row4'})
      col1  col2 col3
0        1     2    a
row1     8    16    c
2        2     4    c
3        3     6    a
row4     5    10    b
5        8    16    c

The rename_axis() method gives a name to the row index and/or the column axis:

data.rename_axis('index', axis='rows').rename_axis('fields', axis='columns')
fields  col1  col2 col3
index
0          1     2    a
1          8    16    c
2          2     4    c
3          3     6    a
4          5    10    b
5          8    16    c

Combining

concat()

concat() stacks the given DataFrames on top of each other, along the row axis by default:

pd.concat([data, data])
   col1  col2 col3
0     1     2    a
1     8    16    c
2     2     4    c
3     3     6    a
4     5    10    b
5     8    16    c
0     1     2    a
1     8    16    c
2     2     4    c
3     3     6    a
4     5    10    b
5     8    16    c

join()

join() combines two DataFrames side by side on their index. Because both operands here share the same column names, the lsuffix and rsuffix arguments are required to disambiguate them:

data.join(data, lsuffix='lll', rsuffix='rrr')
   col1lll  col2lll col3lll  col1rrr  col2rrr col3rrr
0        1        2       a        1        2       a
1        8       16       c        8       16       c
2        2        4       c        2        4       c
3        3        6       a        3        6       a
4        5       10       b        5       10       b
5        8       16       c        8       16       c