Pandas

Jérémie Decock (www.jdhp.org)

Import directives

In [1]:
%matplotlib inline
#%matplotlib notebook

from IPython.display import display

import matplotlib
matplotlib.rcParams['figure.figsize'] = (9, 9)
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

import datetime
import pandas as pd
import numpy as np
In [2]:
pd.__version__
Out[2]:
'0.23.4'

Make data

Series (1D data)

With automatic indices

In [3]:
data_list = [1, 3, np.nan, 7]
series = pd.Series(data_list)
series
Out[3]:
0    1.0
1    3.0
2    NaN
3    7.0
dtype: float64
In [4]:
data_array = np.array(data_list)
series = pd.Series(data_array)
series
Out[4]:
0    1.0
1    3.0
2    NaN
3    7.0
dtype: float64

With defined indices

In [5]:
indices = pd.Series([1, 3, 5, 7])
series = pd.Series([10, 30, 50, 70], index=indices)
series
Out[5]:
1    10
3    30
5    50
7    70
dtype: int64
In [6]:
indices = pd.Series(['A', 'B', 'C', 'D'])
series = pd.Series([10, 30, 50, 70], index=indices)
series
Out[6]:
A    10
B    30
C    50
D    70
dtype: int64
In [7]:
data_dict = {'A': 10, 'B': 30, 'C': 50, 'D': 70}
series = pd.Series(data_dict)
series
Out[7]:
A    10
B    30
C    50
D    70
dtype: int64
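A side note (a minimal sketch, not from the original notebook): when both a dict and an explicit index are passed, pandas aligns the dict's values on the given index; labels missing from the dict become NaN (which forces a float dtype).

```python
import pandas as pd

data_dict = {'A': 10, 'B': 30, 'C': 50, 'D': 70}

# The explicit index selects (and orders) entries; 'E' is absent
# from the dict, so it is filled with NaN (which forces float64).
series = pd.Series(data_dict, index=['B', 'D', 'E'])
```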

Get information about a series

In [8]:
series.index
Out[8]:
Index(['A', 'B', 'C', 'D'], dtype='object')
In [9]:
series.values
Out[9]:
array([10, 30, 50, 70])
In [10]:
series.shape
Out[10]:
(4,)
In [11]:
series.dtypes
Out[11]:
dtype('int64')
In [12]:
series.describe()
Out[12]:
count     4.000000
mean     40.000000
std      25.819889
min      10.000000
25%      25.000000
50%      40.000000
75%      55.000000
max      70.000000
dtype: float64
In [13]:
type(series.describe())
Out[13]:
pandas.core.series.Series
In [14]:
series.memory_usage()
Out[14]:
224

Date ranges

In [15]:
dates = pd.date_range('20130101', periods=6)
dates
Out[15]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
In [16]:
dates = pd.date_range(start='2013-01-01', end='2013-01-08')
dates
Out[16]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06', '2013-01-07', '2013-01-08'],
              dtype='datetime64[ns]', freq='D')
In [17]:
dates = pd.date_range('2013-01-01', periods=4, freq='M')
dates
Out[17]:
DatetimeIndex(['2013-01-31', '2013-02-28', '2013-03-31', '2013-04-30'], dtype='datetime64[ns]', freq='M')
In [18]:
num_days = 7

data = np.random.random(num_days)
index = pd.date_range('2017-01-01', periods=num_days)

series = pd.Series(data, index)
series
Out[18]:
2017-01-01    0.252865
2017-01-02    0.486144
2017-01-03    0.736894
2017-01-04    0.417840
2017-01-05    0.925004
2017-01-06    0.450721
2017-01-07    0.447128
Freq: D, dtype: float64
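Series with a DatetimeIndex also support label-based indexing and slicing with date strings; slice endpoints are inclusive. A small sketch (using a fresh deterministic series rather than the random one above):

```python
import numpy as np
import pandas as pd

index = pd.date_range('2017-01-01', periods=7)
series = pd.Series(np.arange(7), index=index)

# Label-based slicing on a DatetimeIndex: both endpoints are included.
subset = series['2017-01-02':'2017-01-04']
```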

Frames (2D data)

With automatic indices and columns

In [19]:
data_list = [[1, 2, 3], [4, 5, 6]]
df = pd.DataFrame(data_list)
df
Out[19]:
0 1 2
0 1 2 3
1 4 5 6
In [20]:
data_array = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(data_array)
df
Out[20]:
0 1 2
0 1 2 3
1 4 5 6

With defined indices and columns

Using lists:

In [21]:
data = [[1, 2, 3], [4, 5, 6]]
index = [10, 20]
columns = ['A', 'B', 'C']

df = pd.DataFrame(data, index, columns)
df
Out[21]:
A B C
10 1 2 3
20 4 5 6

Using numpy arrays:

In [22]:
data = np.array([[1, 2, 3], [4, 5, 6]])
index = np.array([10, 20])
columns = np.array(['A', 'B', 'C'])

df = pd.DataFrame(data, index=index, columns=columns)
df
Out[22]:
A B C
10 1 2 3
20 4 5 6

Using Series:

In [23]:
data = np.array([[1, 2, 3], [4, 5, 6]])
index = pd.Series([10, 20])
columns = pd.Series(['A', 'B', 'C'])

df = pd.DataFrame(data, index=index, columns=columns)
df
Out[23]:
A B C
10 1 2 3
20 4 5 6

With columns from dict

Dictionary keys define the column labels.

In [24]:
data_dict = {'A': 'foo',
             'B': [10, 20, 30],
             'C': 3}
df = pd.DataFrame(data_dict)
df
Out[24]:
A B C
0 foo 10 3
1 foo 20 3
2 foo 30 3

To define the index as well:

In [25]:
data_dict = {'A': 'foo',
             'B': [10, 20, 30],
             'C': 3}
df = pd.DataFrame(data_dict, index=[10, 20, 30])
df
Out[25]:
A B C
10 foo 10 3
20 foo 20 3
30 foo 30 3

Get information about a dataframe

In [26]:
df.index
Out[26]:
Int64Index([10, 20, 30], dtype='int64')
In [27]:
df.columns
Out[27]:
Index(['A', 'B', 'C'], dtype='object')
In [28]:
df.values
Out[28]:
array([['foo', 10, 3],
       ['foo', 20, 3],
       ['foo', 30, 3]], dtype=object)
In [29]:
df.shape
Out[29]:
(3, 3)
In [30]:
df.dtypes
Out[30]:
A    object
B     int64
C     int64
dtype: object
In [31]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 10 to 30
Data columns (total 3 columns):
A    3 non-null object
B    3 non-null int64
C    3 non-null int64
dtypes: int64(2), object(1)
memory usage: 96.0+ bytes
In [32]:
df.describe()
Out[32]:
B C
count 3.0 3.0
mean 20.0 3.0
std 10.0 0.0
min 10.0 3.0
25% 15.0 3.0
50% 20.0 3.0
75% 25.0 3.0
max 30.0 3.0
In [33]:
type(df.describe())
Out[33]:
pandas.core.frame.DataFrame
In [34]:
df.memory_usage()
Out[34]:
Index    24
A        24
B        24
C        24
dtype: int64

More details about dtype

A DataFrame's columns can have different types. But what about rows?

What happens when a DataFrame whose columns have different types is transposed?

In [35]:
data_dict = {'A': 'foo',
             'B': [10, 20, 30],
             'C': 3}
df = pd.DataFrame(data_dict)
df
Out[35]:
A B C
0 foo 10 3
1 foo 20 3
2 foo 30 3
In [36]:
df.dtypes
Out[36]:
A    object
B     int64
C     int64
dtype: object
In [37]:
df2 = df.T
df2
Out[37]:
0 1 2
A foo foo foo
B 10 20 30
C 3 3 3
In [38]:
df2.dtypes
Out[38]:
0    object
1    object
2    object
dtype: object
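Consequently, transposing twice does not restore the original dtypes: every column comes back as object. If needed, infer_objects() can soft-convert object columns that hold a single underlying type (a sketch; the frame is rebuilt here so the example is self-contained):

```python
import pandas as pd

df = pd.DataFrame({'A': 'foo', 'B': [10, 20, 30], 'C': 3})

df_back = df.T.T                     # double transpose: all columns are now object
df_fixed = df_back.infer_objects()   # soft-convert homogeneous object columns
```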

Panels (3D data)

Panels are deprecated.

Pandas now focuses on 1D (Series) and 2D (DataFrame) data structures.

The recommended alternative for working with 3-dimensional data is the xarray Python library.

Another workaround: a DataFrame with a MultiIndex makes it easy to work with higher-dimensional data.

See http://pandas.pydata.org/pandas-docs/stable/dsintro.html#deprecate-panel.
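As a sketch of the MultiIndex workaround (the level names and values here are made up for illustration):

```python
import numpy as np
import pandas as pd

# A "3D" dataset encoded as a 2D frame with a two-level row index.
index = pd.MultiIndex.from_product(
    [['item1', 'item2'], pd.date_range('2013-01-01', periods=3)],
    names=['item', 'date'])

df = pd.DataFrame(np.arange(12).reshape(6, 2),
                  index=index,
                  columns=['A', 'B'])

# Selecting on the first level recovers an ordinary 2D frame.
sub_df = df.loc['item1']
```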

Panel4D and PanelND (ND data)

Panel4D and PanelND are deprecated.

Pandas now focuses on 1D (Series) and 2D (DataFrame) data structures.

The recommended alternative for working with n-dimensional data is the xarray Python library.

Another workaround: a DataFrame with a MultiIndex makes it easy to work with higher-dimensional data.

See http://pandas.pydata.org/pandas-docs/stable/dsintro.html#panel4d-and-panelnd-deprecated.

Export/import data (write/read files)

Reader functions are accessible from the top-level pd object.

Writer functions are accessible from data objects (i.e. Series, DataFrame or Panel objects).

In [39]:
data_array = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(data_array, index=[10, 20], columns=[100, 200, 300])
df
Out[39]:
100 200 300
10 1 2 3
20 4 5 6

HDF5 files

See python_pandas_hdf5_en.ipynb...

CSV files

Write CSV files

Simplest version:

In [40]:
df.to_csv(path_or_buf="python_pandas_io_test.csv")
In [41]:
!cat python_pandas_io_test.csv
,100,200,300
10,1,2,3
20,4,5,6

Setting more options:

In [42]:
# FYI, many other options are available
df.to_csv(path_or_buf="python_pandas_io_test.csv",
          sep=',',
          columns=None,
          header=True,
          index=True,
          index_label=None,
          compression=None,  # allowed values are 'gzip', 'bz2' or 'xz'
          date_format=None)
In [43]:
!cat python_pandas_io_test.csv
,100,200,300
10,1,2,3
20,4,5,6

Read CSV files

Simplest version:

In [44]:
df = pd.read_csv("python_pandas_io_test.csv")
df
Out[44]:
Unnamed: 0 100 200 300
0 10 1 2 3
1 20 4 5 6

Setting more options:

In [45]:
df = pd.read_csv("python_pandas_io_test.csv",
                 sep=',',
                 delimiter=None,
                 header='infer',
                 names=None,
                 index_col=0,
                 usecols=None,
                 squeeze=False,
                 prefix=None,
                 mangle_dupe_cols=True,
                 dtype=None,
                 engine=None,
                 converters=None,
                 true_values=None,
                 false_values=None,
                 skipinitialspace=False,
                 skiprows=None,
                 nrows=None,
                 na_values=None,
                 keep_default_na=True,
                 na_filter=True,
                 verbose=False,
                 skip_blank_lines=True,
                 parse_dates=False,
                 infer_datetime_format=False,
                 keep_date_col=False,
                 date_parser=None,
                 dayfirst=False,
                 iterator=False,
                 chunksize=None,
                 compression='infer',
                 thousands=None,
                 decimal=b'.',
                 lineterminator=None,
                 quotechar='"',
                 quoting=0,
                 escapechar=None,
                 comment=None,
                 encoding=None,
                 dialect=None,
                 #tupleize_cols=False,
                 error_bad_lines=True,
                 warn_bad_lines=True,
                 skipfooter=0,
                 #skip_footer=0,
                 doublequote=True,
                 delim_whitespace=False,
                 #as_recarray=False,
                 #compact_ints=False,
                 #use_unsigned=False,
                 low_memory=True,
                 #buffer_lines=None,
                 memory_map=False,
                 float_precision=None)
df
Out[45]:
100 200 300
10 1 2 3
20 4 5 6
In [46]:
!rm python_pandas_io_test.csv

JSON files

In [47]:
import io

Write JSON files

Simplest version
In [48]:
df.to_json(path_or_buf="python_pandas_io_test.json")
In [49]:
!cat python_pandas_io_test.json
{"100":{"10":1,"20":4},"200":{"10":2,"20":5},"300":{"10":3,"20":6}}
Setting orient="split"
In [50]:
df.to_json(path_or_buf="python_pandas_io_test_split.json",
           orient="split")
In [51]:
!cat python_pandas_io_test_split.json
{"columns":["100","200","300"],"index":[10,20],"data":[[1,2,3],[4,5,6]]}
Setting orient="records"
In [52]:
df.to_json(path_or_buf="python_pandas_io_test_records.json",
           orient="records")
In [53]:
!cat python_pandas_io_test_records.json
[{"100":1,"200":2,"300":3},{"100":4,"200":5,"300":6}]
Setting orient="index" (the default option for Series)
In [54]:
df.to_json(path_or_buf="python_pandas_io_test_index.json",
           orient="index")
In [55]:
!cat python_pandas_io_test_index.json
{"10":{"100":1,"200":2,"300":3},"20":{"100":4,"200":5,"300":6}}
Setting orient="columns" (the default option for DataFrame) (for DataFrame only)
In [56]:
df.to_json(path_or_buf="python_pandas_io_test_columns.json",
           orient="columns")
In [57]:
!cat python_pandas_io_test_columns.json
{"100":{"10":1,"20":4},"200":{"10":2,"20":5},"300":{"10":3,"20":6}}
Setting orient="values" (for DataFrame only)
In [58]:
df.to_json(path_or_buf="python_pandas_io_test_values.json",
           orient="values")
In [59]:
!cat python_pandas_io_test_values.json
[[1,2,3],[4,5,6]]
Setting more options
In [60]:
# FYI, many other options are available
df.to_json(path_or_buf="python_pandas_io_test.json",
           orient='columns',     # For DataFrame: 'split','records','index','columns' or 'values'
           date_format=None,     # None, 'epoch' or 'iso'
           double_precision=10,
           force_ascii=True,
           date_unit='ms')
In [61]:
!cat python_pandas_io_test.json
{"100":{"10":1,"20":4},"200":{"10":2,"20":5},"300":{"10":3,"20":6}}

Read JSON files

Using orient="split"

Dict like data {index -> [index], columns -> [columns], data -> [values]}

In [62]:
!cat python_pandas_io_test_split.json
{"columns":["100","200","300"],"index":[10,20],"data":[[1,2,3],[4,5,6]]}
In [63]:
df = pd.read_json("python_pandas_io_test_split.json",
                  orient="split")
df
Out[63]:
100 200 300
10 1 2 3
20 4 5 6
Using orient="records"

List like [{column -> value}, ... , {column -> value}]

In [64]:
!cat python_pandas_io_test_records.json
[{"100":1,"200":2,"300":3},{"100":4,"200":5,"300":6}]
In [65]:
df = pd.read_json("python_pandas_io_test_records.json",
                  orient="records")
df
Out[65]:
100 200 300
0 1 2 3
1 4 5 6
Using orient="index"

Dict like {index -> {column -> value}}

In [66]:
!cat python_pandas_io_test_index.json
{"10":{"100":1,"200":2,"300":3},"20":{"100":4,"200":5,"300":6}}
In [67]:
df = pd.read_json("python_pandas_io_test_index.json",
                  orient="index")
df
Out[67]:
100 200 300
10 1 2 3
20 4 5 6
Using orient="columns"

Dict like {column -> {index -> value}}

In [68]:
!cat python_pandas_io_test_columns.json
{"100":{"10":1,"20":4},"200":{"10":2,"20":5},"300":{"10":3,"20":6}}
In [69]:
df = pd.read_json("python_pandas_io_test_columns.json",
                  orient="columns")
df
Out[69]:
100 200 300
10 1 2 3
20 4 5 6
Using orient="values" (for DataFrame only)

Just the values array

In [70]:
!cat python_pandas_io_test_values.json
[[1,2,3],[4,5,6]]
In [71]:
df = pd.read_json("python_pandas_io_test_values.json",
                  orient="values")
df
Out[71]:
0 1 2
0 1 2 3
1 4 5 6
Setting more options
In [72]:
df = pd.read_json("python_pandas_io_test.json",
                  orient=None,
                  typ='frame',
                  dtype=True,
                  convert_axes=True,
                  convert_dates=True,
                  keep_default_dates=True,
                  numpy=False,
                  precise_float=False,
                  date_unit=None,
                  encoding=None,
                  lines=False)
df
Out[72]:
100 200 300
10 1 2 3
20 4 5 6
In [73]:
!rm python_pandas_io_test*.json

YAML

In [74]:
!echo "- {A: 1, B: 2}"     > python_pandas_io_test.yaml
!echo "- {A: 3}" >> python_pandas_io_test.yaml
!echo "- {B: 4}" >> python_pandas_io_test.yaml
In [75]:
!cat python_pandas_io_test.yaml
- {A: 1, B: 2}
- {A: 3}
- {B: 4}
In [76]:
try:
    import yaml

    with open('python_pandas_io_test.yaml', 'r') as f:
        df = pd.io.json.json_normalize(yaml.safe_load(f))

    print(df)
except ImportError:
    pass
     A    B
0  1.0  2.0
1  3.0  NaN
2  NaN  4.0
In [77]:
!rm python_pandas_io_test.yaml

Other file formats

Many other file formats can be used to import or export data with pandas.

See the following link for more information: http://pandas.pydata.org/pandas-docs/stable/io.html

Select columns

In [78]:
data_array = np.array([np.arange(1, 10, 1), np.arange(10, 100, 10), np.arange(100, 1000, 100)]).T
df = pd.DataFrame(data_array,
                  index=np.arange(1, 10, 1),
                  columns=['A', 'B', 'C'])
df
Out[78]:
A B C
1 1 10 100
2 2 20 200
3 3 30 300
4 4 40 400
5 5 50 500
6 6 60 600
7 7 70 700
8 8 80 800
9 9 90 900

Select a single column

The following instructions return a Series.

Label based selection

In [79]:
df.B
Out[79]:
1    10
2    20
3    30
4    40
5    50
6    60
7    70
8    80
9    90
Name: B, dtype: int64
In [80]:
df["B"]
Out[80]:
1    10
2    20
3    30
4    40
5    50
6    60
7    70
8    80
9    90
Name: B, dtype: int64
In [81]:
df.loc[:,"B"]
Out[81]:
1    10
2    20
3    30
4    40
5    50
6    60
7    70
8    80
9    90
Name: B, dtype: int64

Index based selection

In [82]:
df.iloc[:,1]
Out[82]:
1    10
2    20
3    30
4    40
5    50
6    60
7    70
8    80
9    90
Name: B, dtype: int64

Select multiple columns

Label based selection

In [83]:
df[['A','B']]
Out[83]:
A B
1 1 10
2 2 20
3 3 30
4 4 40
5 5 50
6 6 60
7 7 70
8 8 80
9 9 90
In [84]:
df.loc[:,['A','B']]
Out[84]:
A B
1 1 10
2 2 20
3 3 30
4 4 40
5 5 50
6 6 60
7 7 70
8 8 80
9 9 90

Index based selection

In [85]:
df.iloc[:,0:2]
Out[85]:
A B
1 1 10
2 2 20
3 3 30
4 4 40
5 5 50
6 6 60
7 7 70
8 8 80
9 9 90

Select rows

In [86]:
data_array = np.array([np.arange(1, 10, 1), np.arange(10, 100, 10), np.arange(100, 1000, 100)]).T
df = pd.DataFrame(data_array,
                  index=["i" + str(i+1) for i in range(9)],
                  columns=['A', 'B', 'C'])
df
Out[86]:
A B C
i1 1 10 100
i2 2 20 200
i3 3 30 300
i4 4 40 400
i5 5 50 500
i6 6 60 600
i7 7 70 700
i8 8 80 800
i9 9 90 900

Select a single row

The following instructions return a Series.

Label based selection

In [87]:
df.loc["i3"]
Out[87]:
A      3
B     30
C    300
Name: i3, dtype: int64
In [88]:
df.loc["i3",:]
Out[88]:
A      3
B     30
C    300
Name: i3, dtype: int64

Index based selection

In [89]:
df.iloc[2]      # Select over index
Out[89]:
A      3
B     30
C    300
Name: i3, dtype: int64
In [90]:
df.iloc[2,:]      # Select over index
Out[90]:
A      3
B     30
C    300
Name: i3, dtype: int64

Select multiple rows

Label based selection

In [91]:
df.loc[["i3", "i4"],:]
Out[91]:
A B C
i3 3 30 300
i4 4 40 400

Index based selection

In [92]:
df.iloc[2:4,:]      # Select over index
Out[92]:
A B C
i3 3 30 300
i4 4 40 400

Select rows based on values

In [93]:
df.B < 50.
Out[93]:
i1     True
i2     True
i3     True
i4     True
i5    False
i6    False
i7    False
i8    False
i9    False
Name: B, dtype: bool
In [94]:
type(df.B < 50.)
Out[94]:
pandas.core.series.Series
In [95]:
df[[True, True, True, True, False, False, False, False, False]]
Out[95]:
A B C
i1 1 10 100
i2 2 20 200
i3 3 30 300
i4 4 40 400
In [96]:
series_mask = pd.Series({'i1': True,
                         'i2': True,
                         'i3': True,
                         'i4': True,
                         'i5': False,
                         'i6': False,
                         'i7': False,
                         'i8': False,
                         'i9': False})
df[series_mask]
Out[96]:
A B C
i1 1 10 100
i2 2 20 200
i3 3 30 300
i4 4 40 400
In [97]:
df[df.B < 50.]
Out[97]:
A B C
i1 1 10 100
i2 2 20 200
i3 3 30 300
i4 4 40 400
In [98]:
df[df['B'] < 50.]
Out[98]:
A B C
i1 1 10 100
i2 2 20 200
i3 3 30 300
i4 4 40 400
In [99]:
df[(df.A >= 2) & (df.B < 50)]
Out[99]:
A B C
i2 2 20 200
i3 3 30 300
i4 4 40 400

This can also be written:

In [100]:
df.loc[(df.A >= 2) & (df.B < 50)]
Out[100]:
A B C
i2 2 20 200
i3 3 30 300
i4 4 40 400

This could also be written df[df.A >= 2][df.B < 50], but that is bad practice (known as "chained indexing").

"When setting values in a pandas object, care must be taken to avoid what is called chained indexing."

See the "Returning a view versus a copy" section of the pandas indexing documentation.
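A short illustration of why chained indexing matters when writing values (a sketch on a throwaway frame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(1, 10), 'B': np.arange(10, 100, 10)})

# A chained form like df[df.A >= 2]['B'] = 0 may silently write to a
# temporary copy (pandas emits a SettingWithCopyWarning).
# A single .loc call selects and assigns in one step, reliably:
df.loc[df.A >= 2, 'B'] = 0
```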

Select rows and columns

In [101]:
data_array = np.array([np.arange(1, 10, 1), np.arange(10, 100, 10), np.arange(100, 1000, 100)]).T
df = pd.DataFrame(data_array,
                  index=np.arange(1, 10, 1),
                  columns=['A', 'B', 'C'])
df
Out[101]:
A B C
1 1 10 100
2 2 20 200
3 3 30 300
4 4 40 400
5 5 50 500
6 6 60 600
7 7 70 700
8 8 80 800
9 9 90 900
In [102]:
df[(df.A >= 2) & (df.B < 50)]
Out[102]:
A B C
2 2 20 200
3 3 30 300
4 4 40 400
In [103]:
df[(df.B < 20) | (df.B > 50)]
Out[103]:
A B C
1 1 10 100
6 6 60 600
7 7 70 700
8 8 80 800
9 9 90 900
In [104]:
df.loc[(df.B < 20) | (df.B > 50), 'C']
Out[104]:
1    100
6    600
7    700
8    800
9    900
Name: C, dtype: int64
In [105]:
df[(df['A'] >= 2) & (df['B'] < 50)]
Out[105]:
A B C
2 2 20 200
3 3 30 300
4 4 40 400
In [106]:
df.loc[(df.A >= 2) & (df.B < 50), ['A','B']]
Out[106]:
A B
2 2 20
3 3 30
4 4 40

Setting values

Apply a function to selected column values

In [107]:
data_array = np.array([np.arange(1, 10, 1), np.arange(10, 100, 10), np.arange(100, 1000, 100)]).T
df = pd.DataFrame(data_array,
                  index=np.arange(1, 10, 1),
                  columns=['A', 'B', 'C'])
df
Out[107]:
A B C
1 1 10 100
2 2 20 200
3 3 30 300
4 4 40 400
5 5 50 500
6 6 60 600
7 7 70 700
8 8 80 800
9 9 90 900
In [108]:
df.B *= 2.
df
Out[108]:
A B C
1 1 20.0 100
2 2 40.0 200
3 3 60.0 300
4 4 80.0 400
5 5 100.0 500
6 6 120.0 600
7 7 140.0 700
8 8 160.0 800
9 9 180.0 900
In [109]:
df.B = pow(df.B, 2)
df
Out[109]:
A B C
1 1 400.0 100
2 2 1600.0 200
3 3 3600.0 300
4 4 6400.0 400
5 5 10000.0 500
6 6 14400.0 600
7 7 19600.0 700
8 8 25600.0 800
9 9 32400.0 900

Apply a function to selected rows values

In [110]:
data_array = np.array([np.arange(1, 10, 1), np.arange(10, 100, 10), np.arange(100, 1000, 100)]).T
df = pd.DataFrame(data_array,
                  index=np.arange(1, 10, 1),
                  columns=['A', 'B', 'C'])
df
Out[110]:
A B C
1 1 10 100
2 2 20 200
3 3 30 300
4 4 40 400
5 5 50 500
6 6 60 600
7 7 70 700
8 8 80 800
9 9 90 900
In [111]:
df[df.B < 50.] *= -1.
df
Out[111]:
A B C
1 -1.0 -10.0 -100.0
2 -2.0 -20.0 -200.0
3 -3.0 -30.0 -300.0
4 -4.0 -40.0 -400.0
5 5.0 50.0 500.0
6 6.0 60.0 600.0
7 7.0 70.0 700.0
8 8.0 80.0 800.0
9 9.0 90.0 900.0
In [112]:
# df['B'][df['B'] < 50.] = 0    # OK but chain indexing is bad...
# df.A[df.B < 50.] = 0          # OK but chain indexing is bad...
df.loc[df.B < 50., 'A'] = 0
df
Out[112]:
A B C
1 0.0 -10.0 -100.0
2 0.0 -20.0 -200.0
3 0.0 -30.0 -300.0
4 0.0 -40.0 -400.0
5 5.0 50.0 500.0
6 6.0 60.0 600.0
7 7.0 70.0 700.0
8 8.0 80.0 800.0
9 9.0 90.0 900.0

WARNING: df[df.B < 50.].A = 0 does NOT work, even though df.A[df.B < 50.] and df[df.B < 50.].A seem to produce the same result...

"When setting values in a pandas object, care must be taken to avoid what is called chained indexing."

See the "Returning a view versus a copy" section of the pandas indexing documentation.

In [113]:
df.loc[(df.B < 50.) & (df.B > 20), 'C'] = 0
df
Out[113]:
A B C
1 0.0 -10.0 -100.0
2 0.0 -20.0 -200.0
3 0.0 -30.0 -300.0
4 0.0 -40.0 -400.0
5 5.0 50.0 500.0
6 6.0 60.0 600.0
7 7.0 70.0 700.0
8 8.0 80.0 800.0
9 9.0 90.0 900.0
In [114]:
df.loc[(df.B < 20) | (df.B > 50), 'C'] = -1
df
Out[114]:
A B C
1 0.0 -10.0 -1.0
2 0.0 -20.0 -1.0
3 0.0 -30.0 -1.0
4 0.0 -40.0 -1.0
5 5.0 50.0 500.0
6 6.0 60.0 -1.0
7 7.0 70.0 -1.0
8 8.0 80.0 -1.0
9 9.0 90.0 -1.0
In [115]:
df[df.B < 50.] = pow(df[df.B < 50.], 2)
df
Out[115]:
A B C
1 0.0 100.0 1.0
2 0.0 400.0 1.0
3 0.0 900.0 1.0
4 0.0 1600.0 1.0
5 5.0 50.0 500.0
6 6.0 60.0 -1.0
7 7.0 70.0 -1.0
8 8.0 80.0 -1.0
9 9.0 90.0 -1.0

Sample rows

In [116]:
data_array = np.array([np.arange(1, 10, 1), np.arange(10, 100, 10), np.arange(100, 1000, 100)]).T
df = pd.DataFrame(data_array,
                  index=np.arange(1, 10, 1),
                  columns=['A', 'B', 'C'])
df
Out[116]:
A B C
1 1 10 100
2 2 20 200
3 3 30 300
4 4 40 400
5 5 50 500
6 6 60 600
7 7 70 700
8 8 80 800
9 9 90 900

With replacement

Draw 30 samples:

In [117]:
df.sample(n=30, replace=True)
Out[117]:
A B C
1 1 10 100
8 8 80 800
3 3 30 300
3 3 30 300
1 1 10 100
7 7 70 700
2 2 20 200
6 6 60 600
9 9 90 900
5 5 50 500
2 2 20 200
7 7 70 700
9 9 90 900
2 2 20 200
2 2 20 200
3 3 30 300
3 3 30 300
1 1 10 100
5 5 50 500
2 2 20 200
9 9 90 900
4 4 40 400
4 4 40 400
9 9 90 900
1 1 10 100
4 4 40 400
6 6 60 600
9 9 90 900
4 4 40 400
4 4 40 400

Sample 90% of the rows:

In [118]:
df.sample(frac=0.9, replace=True)
Out[118]:
A B C
8 8 80 800
3 3 30 300
8 8 80 800
1 1 10 100
2 2 20 200
6 6 60 600
2 2 20 200
5 5 50 500

Without replacement

Draw 3 samples:

In [119]:
df.sample(n=3)
Out[119]:
A B C
9 9 90 900
3 3 30 300
4 4 40 400

Sample 90% of the rows:

In [120]:
df.sample(frac=0.9)
Out[120]:
A B C
7 7 70 700
8 8 80 800
6 6 60 600
5 5 50 500
9 9 90 900
1 1 10 100
4 4 40 400
2 2 20 200

Weighted sampling

In [121]:
df.sample(n=30, replace=True, weights=np.arange(len(df)))
Out[121]:
A B C
5 5 50 500
8 8 80 800
9 9 90 900
4 4 40 400
7 7 70 700
4 4 40 400
9 9 90 900
9 9 90 900
8 8 80 800
9 9 90 900
7 7 70 700
7 7 70 700
9 9 90 900
6 6 60 600
6 6 60 600
6 6 60 600
3 3 30 300
8 8 80 800
4 4 40 400
6 6 60 600
7 7 70 700
7 7 70 700
9 9 90 900
9 9 90 900
6 6 60 600
9 9 90 900
9 9 90 900
9 9 90 900
5 5 50 500
5 5 50 500

Shuffle/permute rows

In [122]:
data_array = np.array([np.arange(1, 10, 1), np.arange(10, 100, 10), np.arange(100, 1000, 100)]).T
df = pd.DataFrame(data_array,
                  index=np.arange(1, 10, 1),
                  columns=['A', 'B', 'C'])
df
Out[122]:
A B C
1 1 10 100
2 2 20 200
3 3 30 300
4 4 40 400
5 5 50 500
6 6 60 600
7 7 70 700
8 8 80 800
9 9 90 900
In [123]:
df = df.sample(frac=1)
df
Out[123]:
A B C
5 5 50 500
1 1 10 100
8 8 80 800
7 7 70 700
6 6 60 600
2 2 20 200
3 3 30 300
9 9 90 900
4 4 40 400

To reset indexes too:

In [124]:
df = df.sample(frac=1).reset_index(drop=True)
df
Out[124]:
A B C
0 2 20 200
1 4 40 400
2 3 30 300
3 5 50 500
4 7 70 700
5 8 80 800
6 9 90 900
7 1 10 100
8 6 60 600

Sort a DataFrame

In [125]:
NROWS = 7

col1 = np.arange(1., NROWS, 1)
col2 = np.arange(10., NROWS*10, 10)
col3 = np.arange(100., NROWS*100, 100)
np.random.shuffle(col1)
np.random.shuffle(col2)
np.random.shuffle(col3)

data = np.array([col1,
                 col2,
                 col3]).T
index = np.arange(1, NROWS, 1)
columns = np.array(['A', 'B', 'C'])

np.random.shuffle(index)
np.random.shuffle(data)
np.random.shuffle(columns)

df = pd.DataFrame(data,
                  index=index,
                  columns=columns)
df
Out[125]:
C A B
1 2.0 40.0 100.0
5 1.0 30.0 500.0
4 6.0 50.0 600.0
6 3.0 20.0 400.0
3 5.0 60.0 300.0
2 4.0 10.0 200.0

Sorting by row index or column label

Sorting by rows

In [126]:
df.sort_index()
Out[126]:
C A B
1 2.0 40.0 100.0
2 4.0 10.0 200.0
3 5.0 60.0 300.0
4 6.0 50.0 600.0
5 1.0 30.0 500.0
6 3.0 20.0 400.0
In [127]:
df.sort_index(axis=0)    # axis=0 -> sort by row index
Out[127]:
C A B
1 2.0 40.0 100.0
2 4.0 10.0 200.0
3 5.0 60.0 300.0
4 6.0 50.0 600.0
5 1.0 30.0 500.0
6 3.0 20.0 400.0
In [128]:
df.sort_index(ascending=False)
Out[128]:
C A B
6 3.0 20.0 400.0
5 1.0 30.0 500.0
4 6.0 50.0 600.0
3 5.0 60.0 300.0
2 4.0 10.0 200.0
1 2.0 40.0 100.0

Sorting by columns

In [129]:
df.sort_index(axis=1)    # axis=1 -> sort by column label
Out[129]:
A B C
1 40.0 100.0 2.0
5 30.0 500.0 1.0
4 50.0 600.0 6.0
6 20.0 400.0 3.0
3 60.0 300.0 5.0
2 10.0 200.0 4.0
In [130]:
df.sort_index(axis=1, ascending=False)
Out[130]:
C B A
1 2.0 100.0 40.0
5 1.0 500.0 30.0
4 6.0 600.0 50.0
6 3.0 400.0 20.0
3 5.0 300.0 60.0
2 4.0 200.0 10.0

Sorting by values

In [131]:
df.sort_values(by='B')
Out[131]:
C A B
1 2.0 40.0 100.0
2 4.0 10.0 200.0
3 5.0 60.0 300.0
6 3.0 20.0 400.0
5 1.0 30.0 500.0
4 6.0 50.0 600.0
In [132]:
df.sort_values(by='B', ascending=False)
Out[132]:
C A B
4 6.0 50.0 600.0
5 1.0 30.0 500.0
6 3.0 20.0 400.0
3 5.0 60.0 300.0
2 4.0 10.0 200.0
1 2.0 40.0 100.0
In [133]:
df.sort_values(by='B', inplace=True)
df
Out[133]:
C A B
1 2.0 40.0 100.0
2 4.0 10.0 200.0
3 5.0 60.0 300.0
6 3.0 20.0 400.0
5 1.0 30.0 500.0
4 6.0 50.0 600.0
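sort_values also accepts several columns, with a per-column sort direction. A minimal sketch on a small throwaway frame:

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 1, 2, 1],
                   'B': [10, 20, 30, 40]})

# Sort by 'A' ascending, then break ties with 'B' descending.
df_sorted = df.sort_values(by=['A', 'B'], ascending=[True, False])
```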

Missing data

In [134]:
a = np.array([[3, np.nan, 5, np.nan, 7],
              [2, 4, np.nan, 3, 1],
              [3, 4, 5, 6, 1]]).T
df = pd.DataFrame(a,
                  columns=['A', 'B', 'C'])

df
Out[134]:
A B C
0 3.0 2.0 3.0
1 NaN 4.0 4.0
2 5.0 NaN 5.0
3 NaN 3.0 6.0
4 7.0 1.0 1.0

Get the boolean mask where values are NaN

In [135]:
df.isnull()
Out[135]:
A B C
0 False False False
1 True False False
2 False True False
3 True False False
4 False False False

Drop any rows that have missing data

In [136]:
df.dropna()
Out[136]:
A B C
0 3.0 2.0 3.0
4 7.0 1.0 1.0
In [137]:
df.dropna(how='any')      # but 'any' is the default value...
Out[137]:
A B C
0 3.0 2.0 3.0
4 7.0 1.0 1.0

Drop any rows that have missing data in a given column

In [138]:
df.dropna(subset=['B'])
Out[138]:
A B C
0 3.0 2.0 3.0
1 NaN 4.0 4.0
3 NaN 3.0 6.0
4 7.0 1.0 1.0
In [139]:
df.dropna(subset=['B', 'C'])
Out[139]:
A B C
0 3.0 2.0 3.0
1 NaN 4.0 4.0
3 NaN 3.0 6.0
4 7.0 1.0 1.0

Drop any columns that have missing data

In [140]:
df.dropna(axis=1)
Out[140]:
C
0 3.0
1 4.0
2 5.0
3 6.0
4 1.0
In [141]:
df.dropna(axis=1, how='any')      # but 'any' is the default value...
Out[141]:
C
0 3.0
1 4.0
2 5.0
3 6.0
4 1.0

Drop any columns that have missing data in a given row

In [142]:
df.dropna(axis=1, subset=[2])
Out[142]:
A C
0 3.0 3.0
1 NaN 4.0
2 5.0 5.0
3 NaN 6.0
4 7.0 1.0
In [143]:
df.dropna(axis=1, subset=[1, 2])
Out[143]:
C
0 3.0
1 4.0
2 5.0
3 6.0
4 1.0

Filling missing data

In [144]:
df.fillna(value=999)
Out[144]:
A B C
0 3.0 2.0 3.0
1 999.0 4.0 4.0
2 5.0 999.0 5.0
3 999.0 3.0 6.0
4 7.0 1.0 1.0
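fillna also accepts a per-column dict of fill values, and ffill() propagates the last valid observation forward. A sketch on a rebuilt copy of the frame above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [3., np.nan, 5., np.nan, 7.],
                   'B': [2., 4., np.nan, 3., 1.],
                   'C': [3., 4., 5., 6., 1.]})

df_dict_fill = df.fillna(value={'A': 0., 'B': -1.})  # per-column fill values
df_ffill = df.ffill()  # propagate the last valid value forward
```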

Count the number of NaN values in a given column

In [145]:
df.A.isnull().sum()
Out[145]:
2

Miscellaneous operations on data frames

Transpose of a data frame

In [146]:
data_array = np.array([np.arange(1, 10, 1), np.arange(10, 100, 10), np.arange(100, 1000, 100)]).T
df = pd.DataFrame(data_array,
                  index=np.arange(1, 10, 1),
                  columns=['A', 'B', 'C'])
df
Out[146]:
A B C
1 1 10 100
2 2 20 200
3 3 30 300
4 4 40 400
5 5 50 500
6 6 60 600
7 7 70 700
8 8 80 800
9 9 90 900
In [147]:
df.T
Out[147]:
1 2 3 4 5 6 7 8 9
A 1 2 3 4 5 6 7 8 9
B 10 20 30 40 50 60 70 80 90
C 100 200 300 400 500 600 700 800 900

Merge

In [148]:
a1 = np.array([np.arange(1, 5, 1), np.arange(10, 50, 10), np.arange(100, 500, 100)]).T
df1 = pd.DataFrame(a1,
                   columns=['ID', 'B', 'C'])

a2 = np.array([np.arange(1, 5, 1), np.arange(1000, 5000, 1000), np.arange(10000, 50000, 10000)]).T
df2 = pd.DataFrame(a2,
                   columns=['ID', 'B', 'C'])

display(df1)
display(df2)

df = pd.merge(df1, df2, on="ID", suffixes=('_1', '_2'))  #.dropna(how='any')

display(df)
ID B C
0 1 10 100
1 2 20 200
2 3 30 300
3 4 40 400
ID B C
0 1 1000 10000
1 2 2000 20000
2 3 3000 30000
3 4 4000 40000
ID B_1 C_1 B_2 C_2
0 1 10 100 1000 10000
1 2 20 200 2000 20000
2 3 30 300 3000 30000
3 4 40 400 4000 40000

Merge with NaN

In [149]:
a1 = np.array([np.arange(1, 5, 1), np.arange(10, 50, 10), np.arange(100, 500, 100)]).T
df1 = pd.DataFrame(a1,
                   columns=['ID', 'B', 'C'])

a2 = np.array([np.arange(1, 5, 1), np.arange(1000, 5000, 1000), np.arange(10000, 50000, 10000)]).T
df2 = pd.DataFrame(a2,
                   columns=['ID', 'B', 'C'])

df1.iloc[0,2] = np.nan
df1.iloc[1,1] = np.nan
df1.iloc[2,2] = np.nan
df1.iloc[3,1] = np.nan

df2.iloc[0,1] = np.nan
df2.iloc[1,2] = np.nan
df2.iloc[2,1] = np.nan
df2.iloc[3,2] = np.nan

df = pd.merge(df1, df2, on="ID", suffixes=('_1', '_2'))  #.dropna(how='any')

display(df1)
display(df2)
display(df)
ID B C
0 1 10.0 NaN
1 2 NaN 200.0
2 3 30.0 NaN
3 4 NaN 400.0
ID B C
0 1 NaN 10000.0
1 2 2000.0 NaN
2 3 NaN 30000.0
3 4 4000.0 NaN
ID B_1 C_1 B_2 C_2
0 1 10.0 NaN NaN 10000.0
1 2 NaN 200.0 2000.0 NaN
2 3 30.0 NaN NaN 30000.0
3 4 NaN 400.0 4000.0 NaN

Merge with missing rows

In [150]:
a1 = np.array([np.arange(1, 5, 1), np.arange(10, 50, 10), np.arange(100, 500, 100)]).T
df1 = pd.DataFrame(a1,
                   columns=['ID', 'B', 'C'])

a2 = np.array([np.arange(1, 3, 1), np.arange(1000, 3000, 1000), np.arange(10000, 30000, 10000)]).T
df2 = pd.DataFrame(a2,
                   columns=['ID', 'B', 'C'])

display(df1)
display(df2)

print("Left: use only keys from left frame (SQL: left outer join)")
df = pd.merge(df1, df2, on="ID", how="left", suffixes=('_1', '_2'))  #.dropna(how='any')
display(df)

print("Right: use only keys from right frame (SQL: right outer join)")
df = pd.merge(df1, df2, on="ID", how="right", suffixes=('_1', '_2'))  #.dropna(how='any')
display(df)

print("Inner: use intersection of keys from both frames (SQL: inner join) [DEFAULT]")
df = pd.merge(df1, df2, on="ID", how="inner", suffixes=('_1', '_2'))  #.dropna(how='any')
display(df)

print("Outer: use union of keys from both frames (SQL: full outer join)")
df = pd.merge(df1, df2, on="ID", how="outer", suffixes=('_1', '_2'))  #.dropna(how='any')
display(df)
ID B C
0 1 10 100
1 2 20 200
2 3 30 300
3 4 40 400
ID B C
0 1 1000 10000
1 2 2000 20000
Left: use only keys from left frame (SQL: left outer join)
ID B_1 C_1 B_2 C_2
0 1 10 100 1000.0 10000.0
1 2 20 200 2000.0 20000.0
2 3 30 300 NaN NaN
3 4 40 400 NaN NaN
Right: use only keys from right frame (SQL: right outer join)
ID B_1 C_1 B_2 C_2
0 1 10 100 1000 10000
1 2 20 200 2000 20000
Inner: use intersection of keys from both frames (SQL: inner join) [DEFAULT]
ID B_1 C_1 B_2 C_2
0 1 10 100 1000 10000
1 2 20 200 2000 20000
Outer: use union of keys from both frames (SQL: full outer join)
ID B_1 C_1 B_2 C_2
0 1 10 100 1000.0 10000.0
1 2 20 200 2000.0 20000.0
2 3 30 300 NaN NaN
3 4 40 400 NaN NaN
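With an outer merge, indicator=True adds a _merge column telling, for each row, which frame(s) the key came from. A sketch on simplified frames:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4], 'B': [10, 20, 30, 40]})
df2 = pd.DataFrame({'ID': [1, 2], 'C': [1000, 2000]})

# '_merge' is 'both', 'left_only' or 'right_only' for each row.
df = pd.merge(df1, df2, on='ID', how='outer', indicator=True)
```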

GroupBy

In [151]:
a = np.array([[3, 5, 5, 5, 7, 7, 7, 7],
              [2, 4, 4, 3, 1, 3, 3, 2],
              [3, 4, 5, 6, 1, 8, 9, 8]]).T
df = pd.DataFrame(a,
                  columns=['A', 'B', 'C'])

df
Out[151]:
A B C
0 3 2 3
1 5 4 4
2 5 4 5
3 5 3 6
4 7 1 1
5 7 3 8
6 7 3 9
7 7 2 8

GroupBy with single key

In [152]:
df.groupby(["A"]).count()
Out[152]:
B C
A
3 1 1
5 3 3
7 4 4
In [153]:
df.groupby(["A"]).sum().B
Out[153]:
A
3     2
5    11
7     9
Name: B, dtype: int64
In [154]:
df.groupby(["A"]).mean().B
Out[154]:
A
3    2.000000
5    3.666667
7    2.250000
Name: B, dtype: float64

GroupBy with multiple keys

In [155]:
df.groupby(["A","B"]).count()
Out[155]:
C
A B
3 2 1
5 3 1
4 2
7 1 1
2 1
3 2
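A groupby reduction is not limited to a single statistic: agg() applies several aggregations at once, producing one column level per statistic. A sketch, reusing the frame above:

```python
import numpy as np
import pandas as pd

a = np.array([[3, 5, 5, 5, 7, 7, 7, 7],
              [2, 4, 4, 3, 1, 3, 3, 2],
              [3, 4, 5, 6, 1, 8, 9, 8]]).T
df = pd.DataFrame(a, columns=['A', 'B', 'C'])

# One row per group; columns become a (column, statistic) MultiIndex
stats = df.groupby('A').agg(['min', 'max', 'mean'])
print(stats)
```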

Rolling

Basic example

Rolling with an aggregation window of size 2.

In [156]:
s = pd.Series([1., 0., 5., 2., 1.])
print("DATA:")
print(s)

mean_s = s.rolling(2).mean()
print()
print("ROLLING MEAN:")
print(mean_s)

sum_s = s.rolling(2).sum()
print()
print("ROLLING SUM:")
print(sum_s)

min_s = s.rolling(2).min()
print()
print("ROLLING MIN:")
print(min_s)

max_s = s.rolling(2).max()
print()
print("ROLLING MAX:")
print(max_s)

ax = s.plot(figsize=(18, 3), color="blue")
mean_s.plot(color="red", label="mean", ax=ax)
sum_s.plot(color="green", label="sum", style="--", alpha=0.5, ax=ax)
min_s.plot(color="black", label="min", style=":", alpha=0.25, ax=ax)
max_s.plot(color="black", label="max", style=":", alpha=0.25, ax=ax)
ax.legend();
DATA:
0    1.0
1    0.0
2    5.0
3    2.0
4    1.0
dtype: float64

ROLLING MEAN:
0    NaN
1    0.5
2    2.5
3    3.5
4    1.5
dtype: float64

ROLLING SUM:
0    NaN
1    1.0
2    5.0
3    7.0
4    3.0
dtype: float64

ROLLING MIN:
0    NaN
1    0.0
2    0.0
3    2.0
4    1.0
dtype: float64

ROLLING MAX:
0    NaN
1    1.0
2    5.0
3    5.0
4    2.0
dtype: float64
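The leading NaN appears because the first window is incomplete; if that is unwanted, min_periods relaxes the requirement. A minimal sketch:

```python
import pandas as pd

s = pd.Series([1., 0., 5., 2., 1.])

# With min_periods=1, incomplete windows (here the first one) are aggregated too
mean_s = s.rolling(2, min_periods=1).mean()
print(mean_s)
```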

More realistic example

In [157]:
index = np.arange(0, 20, 0.05)
s = pd.Series(np.sin(index))
s = s + np.random.normal(scale=0.4, size=s.shape)

ax = s.plot(figsize=(18, 3))
In [158]:
s.shape
Out[158]:
(400,)

Rolling with an aggregation window of size 20.

In [159]:
s_mean = s.rolling(20).mean()
s_median = s.rolling(20).median()
s_min = s.rolling(20).min()
s_max = s.rolling(20).max()

ax = s_mean.plot(figsize=(18, 8), color="red", label="mean", alpha=0.75)
s_median.plot(ax=ax, color="blue", label="median", alpha=0.75)
s_min.plot(ax=ax, color="blue", alpha=0.5, style=":", label="min")
s_max.plot(ax=ax, color="blue", alpha=0.5, style=":", label="max")

plt.fill_between(s_min.index, s_min.values, s_max.values, facecolor='blue', alpha=0.1)

ax.legend()
ax.set_xlabel('Time');
In [160]:
s_mean.shape
Out[160]:
(400,)

Pivot

In [161]:
df = pd.DataFrame([["i1", "A", 1],
                   ["i1", "B", 2],
                   ["i2", "A", 3],
                   ["i2", "B", 4]], columns=["foo", "bar", "baz"])
df
Out[161]:
foo bar baz
0 i1 A 1
1 i1 B 2
2 i2 A 3
3 i2 B 4
In [162]:
df.pivot(index="foo",  columns="bar", values="baz")
Out[162]:
bar A B
foo
i1 1 2
i2 3 4
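pivot() raises a ValueError if an (index, columns) pair occurs more than once; pivot_table() handles duplicates by aggregating them. A sketch with a hypothetical duplicated row added to the frame above:

```python
import pandas as pd

df = pd.DataFrame([["i1", "A", 1],
                   ["i1", "A", 3],   # duplicate ("i1", "A") pair
                   ["i1", "B", 2],
                   ["i2", "A", 3],
                   ["i2", "B", 4]], columns=["foo", "bar", "baz"])

# Duplicates are reduced with aggfunc (mean by default)
pt = df.pivot_table(index="foo", columns="bar", values="baz", aggfunc="mean")
print(pt)
```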

Count the number of occurrences of a column value

In [163]:
a = np.array([[3, 5, 5, 5, 7, 7, 7, 7],
              [2, 4, 4, 3, 1, 3, 3, 2],
              [3, 4, 5, 6, 1, 8, 9, 8]]).T
df = pd.DataFrame(a,
                  columns=['A', 'B', 'C'])

df
Out[163]:
A B C
0 3 2 3
1 5 4 4
2 5 4 5
3 5 3 6
4 7 1 1
5 7 3 8
6 7 3 9
7 7 2 8
In [164]:
df.A.value_counts()
Out[164]:
7    4
5    3
3    1
Name: A, dtype: int64
In [165]:
df.A.value_counts().plot.bar()
Out[165]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5f2971b0f0>
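value_counts can also return relative frequencies instead of raw counts with normalize=True. A sketch, reusing the frame above:

```python
import numpy as np
import pandas as pd

a = np.array([[3, 5, 5, 5, 7, 7, 7, 7],
              [2, 4, 4, 3, 1, 3, 3, 2],
              [3, 4, 5, 6, 1, 8, 9, 8]]).T
df = pd.DataFrame(a, columns=['A', 'B', 'C'])

# normalize=True divides each count by the total number of rows
freq = df.A.value_counts(normalize=True)
print(freq)
```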

Stats

In [166]:
df = pd.DataFrame(np.random.normal(size=100000))
In [167]:
df.quantile(0.50)
Out[167]:
0   -0.000576
Name: 0.5, dtype: float64
In [168]:
df.quantile([0.25, 0.75])
Out[168]:
0
0.25 -0.677411
0.75 0.670510
In [169]:
df.quantile([0.01, 0.001])
Out[169]:
0
0.010 -2.332336
0.001 -3.085517
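For a one-shot summary including custom percentiles, describe() accepts a percentiles argument. A sketch (seeded with a hypothetical RandomState for reproducibility):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # seeded for reproducibility
df = pd.DataFrame(rng.normal(size=100000))

# describe() reports count, mean, std, min, max plus the requested percentiles
summary = df.describe(percentiles=[0.01, 0.25, 0.75, 0.99])
print(summary)
```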

Time series

There are three main time-related types in pandas (each with an equivalent type for Series and DataFrame indices):

  • pandas.Timestamp (pandas.DatetimeIndex for indices): the pandas equivalent of Python's datetime.datetime
  • pandas.Period (pandas.PeriodIndex for indices): represents a period of time
  • pandas.Timedelta (pandas.TimedeltaIndex for indices): represents a duration (the difference between two dates or times), i.e. the pandas equivalent of Python's datetime.timedelta

A Timestamp is a point in time:

In [170]:
pd.Timestamp(year=2018, month=1, day=1, hour=12, minute=30)
Out[170]:
Timestamp('2018-01-01 12:30:00')

A Period is a range in time (with an "anchored" start time and an "anchored" end time):

In [171]:
p = pd.Period(freq='D', year=2018, month=1, day=1, hour=12, minute=30)
print(p)
print("Start time:", p.start_time)
print("End time:", p.end_time)
2018-01-01
Start time: 2018-01-01 00:00:00
End time: 2018-01-01 23:59:59.999999999

A Timedelta is a "floating" duration (i.e. not "anchored" in time):

In [172]:
print(pd.Timedelta(days=5, seconds=30))

ts1 = pd.Timestamp(year=2018, month=1, day=1, hour=12, minute=30)
ts2 = pd.Timestamp(year=2018, month=1, day=2, hour=12, minute=30)
print(ts2 - ts1)
5 days 00:00:30
1 days 00:00:00
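These three types combine naturally in arithmetic: adding a Timedelta to a Timestamp gives a new Timestamp, and pd.to_datetime parses common date strings. A minimal sketch (the parsed string is a hypothetical example):

```python
import pandas as pd

ts = pd.Timestamp('2018-01-01 12:30')
td = pd.Timedelta(days=5, seconds=30)

# Timestamp + Timedelta -> Timestamp
print(ts + td)

# pd.to_datetime parses many common string formats into a Timestamp
print(pd.to_datetime('2018-01-01T12:30:00'))
```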

Generate datetime index (with a fixed frequency)

In [173]:
pd.date_range('2018-01-01', '2018-03-01', freq='D')
Out[173]:
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08',
               '2018-01-09', '2018-01-10', '2018-01-11', '2018-01-12',
               '2018-01-13', '2018-01-14', '2018-01-15', '2018-01-16',
               '2018-01-17', '2018-01-18', '2018-01-19', '2018-01-20',
               '2018-01-21', '2018-01-22', '2018-01-23', '2018-01-24',
               '2018-01-25', '2018-01-26', '2018-01-27', '2018-01-28',
               '2018-01-29', '2018-01-30', '2018-01-31', '2018-02-01',
               '2018-02-02', '2018-02-03', '2018-02-04', '2018-02-05',
               '2018-02-06', '2018-02-07', '2018-02-08', '2018-02-09',
               '2018-02-10', '2018-02-11', '2018-02-12', '2018-02-13',
               '2018-02-14', '2018-02-15', '2018-02-16', '2018-02-17',
               '2018-02-18', '2018-02-19', '2018-02-20', '2018-02-21',
               '2018-02-22', '2018-02-23', '2018-02-24', '2018-02-25',
               '2018-02-26', '2018-02-27', '2018-02-28', '2018-03-01'],
              dtype='datetime64[ns]', freq='D')
In [174]:
pd.date_range('2018-01-01', periods=10, freq='h')
Out[174]:
DatetimeIndex(['2018-01-01 00:00:00', '2018-01-01 01:00:00',
               '2018-01-01 02:00:00', '2018-01-01 03:00:00',
               '2018-01-01 04:00:00', '2018-01-01 05:00:00',
               '2018-01-01 06:00:00', '2018-01-01 07:00:00',
               '2018-01-01 08:00:00', '2018-01-01 09:00:00'],
              dtype='datetime64[ns]', freq='H')
In [175]:
pd.date_range('1/1/2012', periods=10, freq='S')
Out[175]:
DatetimeIndex(['2012-01-01 00:00:00', '2012-01-01 00:00:01',
               '2012-01-01 00:00:02', '2012-01-01 00:00:03',
               '2012-01-01 00:00:04', '2012-01-01 00:00:05',
               '2012-01-01 00:00:06', '2012-01-01 00:00:07',
               '2012-01-01 00:00:08', '2012-01-01 00:00:09'],
              dtype='datetime64[ns]', freq='S')
In [176]:
pd.date_range('3/6/2012 00:00', periods=5, freq='D')
Out[176]:
DatetimeIndex(['2012-03-06', '2012-03-07', '2012-03-08', '2012-03-09',
               '2012-03-10'],
              dtype='datetime64[ns]', freq='D')
In [177]:
pd.date_range('1/1/2012', periods=5, freq='M')
Out[177]:
DatetimeIndex(['2012-01-31', '2012-02-29', '2012-03-31', '2012-04-30',
               '2012-05-31'],
              dtype='datetime64[ns]', freq='M')

Generate period index

In [178]:
pd.period_range('2018-01-01', '2018-03-01', freq='D')
Out[178]:
PeriodIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
             '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08',
             '2018-01-09', '2018-01-10', '2018-01-11', '2018-01-12',
             '2018-01-13', '2018-01-14', '2018-01-15', '2018-01-16',
             '2018-01-17', '2018-01-18', '2018-01-19', '2018-01-20',
             '2018-01-21', '2018-01-22', '2018-01-23', '2018-01-24',
             '2018-01-25', '2018-01-26', '2018-01-27', '2018-01-28',
             '2018-01-29', '2018-01-30', '2018-01-31', '2018-02-01',
             '2018-02-02', '2018-02-03', '2018-02-04', '2018-02-05',
             '2018-02-06', '2018-02-07', '2018-02-08', '2018-02-09',
             '2018-02-10', '2018-02-11', '2018-02-12', '2018-02-13',
             '2018-02-14', '2018-02-15', '2018-02-16', '2018-02-17',
             '2018-02-18', '2018-02-19', '2018-02-20', '2018-02-21',
             '2018-02-22', '2018-02-23', '2018-02-24', '2018-02-25',
             '2018-02-26', '2018-02-27', '2018-02-28', '2018-03-01'],
            dtype='period[D]', freq='D')
In [179]:
pd.date_range('2018-01-01', '2018-03-01', freq='D').to_period()
Out[179]:
PeriodIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
             '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08',
             '2018-01-09', '2018-01-10', '2018-01-11', '2018-01-12',
             '2018-01-13', '2018-01-14', '2018-01-15', '2018-01-16',
             '2018-01-17', '2018-01-18', '2018-01-19', '2018-01-20',
             '2018-01-21', '2018-01-22', '2018-01-23', '2018-01-24',
             '2018-01-25', '2018-01-26', '2018-01-27', '2018-01-28',
             '2018-01-29', '2018-01-30', '2018-01-31', '2018-02-01',
             '2018-02-02', '2018-02-03', '2018-02-04', '2018-02-05',
             '2018-02-06', '2018-02-07', '2018-02-08', '2018-02-09',
             '2018-02-10', '2018-02-11', '2018-02-12', '2018-02-13',
             '2018-02-14', '2018-02-15', '2018-02-16', '2018-02-17',
             '2018-02-18', '2018-02-19', '2018-02-20', '2018-02-21',
             '2018-02-22', '2018-02-23', '2018-02-24', '2018-02-25',
             '2018-02-26', '2018-02-27', '2018-02-28', '2018-03-01'],
            dtype='period[D]', freq='D')
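The conversion also works in the other direction: to_timestamp() turns a PeriodIndex back into a DatetimeIndex, using the start of each period by default. A sketch:

```python
import pandas as pd

pidx = pd.period_range('2018-01-01', periods=3, freq='D')

# to_timestamp() returns the start of each period as a DatetimeIndex
print(pidx.to_timestamp())
```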

Plot time series

In [180]:
dti = pd.date_range('2012-01-01 00:00', periods=40, freq='D')
ts = pd.Series(np.random.randint(0, 200, len(dti)), index=dti)
In [181]:
ts.plot();
In [182]:
ts.plot(x_compat=True);
In [183]:
dti = pd.date_range('2018-01-01 00:00', '2018-01-03 00:00', freq='H')
ts = pd.Series(np.random.randint(0, 100, len(dti)), index=dti)

ax = ts.plot(x_compat=True, figsize=(16, 4))    # x_compat=True is required because matplotlib doesn't understand pandas' native date ticks; it converts the axis to matplotlib dates

# set daily major ticks and hourly minor ticks
ax.xaxis.set_major_locator(mdates.DayLocator(interval=1))
ax.xaxis.set_minor_locator(mdates.HourLocator(interval=1))

# set formatter
ax.xaxis.set_major_formatter(mdates.DateFormatter('%d-%m-%Y'))

# set font and rotation for date tick labels
plt.gcf().autofmt_xdate()

Indexing (select datetime)

In [184]:
dti = pd.date_range('2012-1-1 00:00', periods=40, freq='D')
ts = pd.Series(np.random.randint(0, 200, len(dti)), index=dti)
In [185]:
ts
Out[185]:
2012-01-01     55
2012-01-02     51
2012-01-03     59
2012-01-04     27
2012-01-05    105
2012-01-06    146
2012-01-07    179
2012-01-08    105
2012-01-09     12
2012-01-10    160
2012-01-11     68
2012-01-12     51
2012-01-13    166
2012-01-14     66
2012-01-15    147
2012-01-16     67
2012-01-17     89
2012-01-18     56
2012-01-19    177
2012-01-20     35
2012-01-21      6
2012-01-22    139
2012-01-23    114
2012-01-24     44
2012-01-25      7
2012-01-26     68
2012-01-27     97
2012-01-28    142
2012-01-29     54
2012-01-30    193
2012-01-31     98
2012-02-01    171
2012-02-02     92
2012-02-03    150
2012-02-04     66
2012-02-05    181
2012-02-06     82
2012-02-07    164
2012-02-08     84
2012-02-09     96
Freq: D, dtype: int64
In [186]:
ts["2012-01-09"]
Out[186]:
12
In [187]:
ts[datetime.datetime(2012, 1, 9)]
Out[187]:
12
In [188]:
ts[ts.index < "2012-01-09"]
Out[188]:
2012-01-01     55
2012-01-02     51
2012-01-03     59
2012-01-04     27
2012-01-05    105
2012-01-06    146
2012-01-07    179
2012-01-08    105
Freq: D, dtype: int64
In [189]:
ts[ts.index > "2012-01-20"]
Out[189]:
2012-01-21      6
2012-01-22    139
2012-01-23    114
2012-01-24     44
2012-01-25      7
2012-01-26     68
2012-01-27     97
2012-01-28    142
2012-01-29     54
2012-01-30    193
2012-01-31     98
2012-02-01    171
2012-02-02     92
2012-02-03    150
2012-02-04     66
2012-02-05    181
2012-02-06     82
2012-02-07    164
2012-02-08     84
2012-02-09     96
Freq: D, dtype: int64
In [190]:
ts["2012-01-09":"2012-01-20"]
Out[190]:
2012-01-09     12
2012-01-10    160
2012-01-11     68
2012-01-12     51
2012-01-13    166
2012-01-14     66
2012-01-15    147
2012-01-16     67
2012-01-17     89
2012-01-18     56
2012-01-19    177
2012-01-20     35
Freq: D, dtype: int64
In [191]:
ts[datetime.datetime(2012, 1, 9):datetime.datetime(2012, 1, 20)]
Out[191]:
2012-01-09     12
2012-01-10    160
2012-01-11     68
2012-01-12     51
2012-01-13    166
2012-01-14     66
2012-01-15    147
2012-01-16     67
2012-01-17     89
2012-01-18     56
2012-01-19    177
2012-01-20     35
Freq: D, dtype: int64
In [192]:
ts[ts.index.day <= 3]
Out[192]:
2012-01-01     55
2012-01-02     51
2012-01-03     59
2012-02-01    171
2012-02-02     92
2012-02-03    150
dtype: int64
In [193]:
ts[ts.index.month == 2]
Out[193]:
2012-02-01    171
2012-02-02     92
2012-02-03    150
2012-02-04     66
2012-02-05    181
2012-02-06     82
2012-02-07    164
2012-02-08     84
2012-02-09     96
Freq: D, dtype: int64
In [194]:
ts["2012-02"]
Out[194]:
2012-02-01    171
2012-02-02     92
2012-02-03    150
2012-02-04     66
2012-02-05    181
2012-02-06     82
2012-02-07    164
2012-02-08     84
2012-02-09     96
Freq: D, dtype: int64
In [195]:
ts[ts.index.dayofweek == 1]
Out[195]:
2012-01-03     59
2012-01-10    160
2012-01-17     89
2012-01-24     44
2012-01-31     98
2012-02-07    164
dtype: int64
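Besides slicing and boolean masks, between_time() selects a time-of-day range across every day of the index. A sketch with a hypothetical hourly series:

```python
import numpy as np
import pandas as pd

dti = pd.date_range('2012-01-01 00:00', periods=48, freq='h')
ts = pd.Series(np.arange(48), index=dti)

# Keep only the 09:00-11:00 window (inclusive), for every day in the index
sel = ts.between_time('09:00', '11:00')
print(sel)
```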

Rolling

Basic example

Rolling window size: 1 day

In [196]:
dti = pd.DatetimeIndex(['2018-1-1 00:00', '2018-1-1 06:45', '2018-1-1 12:00',
                        '2018-1-2 00:00', '2018-1-2 06:00', '2018-1-2 12:00'])
ts = pd.Series([2., 1., 3., 2., 2., 0.], index=dti)

print("DATA:")
print(ts)

ax = ts.plot(figsize=(18, 3), style="*-", color="blue")
ax.vlines(pd.DatetimeIndex(['2018-1-1 00:00', '2018-1-2 00:00']), ymin=0, ymax=8, color="red", linestyle=":", alpha=0.3);

ts_rw = ts.rolling('D').sum()   # Rolling window size: 1 day

print()
print("ROLLING SUM:")
print(ts_rw)

ts_rw.plot(color="red", label="sum", style="*-", alpha=0.75, ax=ax)

ax.legend()
ax.set_xlabel('Time')
ax.grid(True);
DATA:
2018-01-01 00:00:00    2.0
2018-01-01 06:45:00    1.0
2018-01-01 12:00:00    3.0
2018-01-02 00:00:00    2.0
2018-01-02 06:00:00    2.0
2018-01-02 12:00:00    0.0
dtype: float64

ROLLING SUM:
2018-01-01 00:00:00    2.0
2018-01-01 06:45:00    3.0
2018-01-01 12:00:00    6.0
2018-01-02 00:00:00    6.0
2018-01-02 06:00:00    8.0
2018-01-02 12:00:00    4.0
dtype: float64
In [197]:
ts.rolling('6h').min()
Out[197]:
2018-01-01 00:00:00    2.0
2018-01-01 06:45:00    1.0
2018-01-01 12:00:00    1.0
2018-01-02 00:00:00    2.0
2018-01-02 06:00:00    2.0
2018-01-02 12:00:00    0.0
dtype: float64
In [198]:
ts.rolling('3h').mean()
Out[198]:
2018-01-01 00:00:00    2.0
2018-01-01 06:45:00    1.0
2018-01-01 12:00:00    3.0
2018-01-02 00:00:00    2.0
2018-01-02 06:00:00    2.0
2018-01-02 12:00:00    0.0
dtype: float64

More realistic example

In [199]:
dti = pd.date_range('1/1/2018 00:00', periods=6*480, freq='10min')
ts = pd.Series(np.sin(dti.hour * 2. * np.pi / 24.), index=dti)

ts = ts + np.random.normal(scale=0.4, size=ts.shape)

ax = ts.plot(figsize=(18, 3))
ax.vlines(pd.date_range('1/1/2018 00:00', periods=480/24, freq='D'), ymin=-2, ymax=2, color="red", linestyle=":", alpha=0.3);
In [200]:
ts.shape
Out[200]:
(2880,)
In [201]:
ts_mean = ts.rolling('5H').mean()
ts_median = ts.rolling('5H').median()
ts_min = ts.rolling('5H').min()
ts_max = ts.rolling('5H').max()

ax = ts_mean.plot(figsize=(18, 3), color="red", label="mean", alpha=0.75)
ts_median.plot(ax=ax, color="blue", label="median", alpha=0.75)
ts_min.plot(ax=ax, color="blue", alpha=0.5, style=":", label="min")
ts_max.plot(ax=ax, color="blue", alpha=0.5, style=":", label="max")

plt.fill_between(ts_min.index, ts_min.values, ts_max.values, facecolor='blue', alpha=0.1)

ax.legend()
ax.set_xlabel('Time');
In [202]:
ts_mean.shape
Out[202]:
(2880,)

Resampling

resample() is a time-based groupby followed by a reduction applied to each group. It is similar to a rolling() operation with a time-based window, except that each input value contributes to exactly one output bin.

Basic example

In [203]:
dti = pd.DatetimeIndex(['2018-1-1 00:00', '2018-1-1 06:45', '2018-1-1 12:00',
                        '2018-1-2 00:00', '2018-1-2 12:00'])
ts = pd.Series([1., 0., 5., 2., 0.], index=dti)

print("DATA:")
print(ts)

ax = ts.plot(figsize=(18, 3), style="*-", color="blue")
ax.vlines(pd.DatetimeIndex(['2018-1-1 00:00', '2018-1-2 00:00']), ymin=0, ymax=5, color="red", linestyle=":", alpha=0.3);

ts_resampled = ts.resample('D').mean()

print()
print("MEAN:")
print(ts_resampled)

ts_resampled.plot(color="red", style="*-", label="mean", alpha=0.75, ax=ax)

ax.legend()
ax.set_xlabel('Time');
DATA:
2018-01-01 00:00:00    1.0
2018-01-01 06:45:00    0.0
2018-01-01 12:00:00    5.0
2018-01-02 00:00:00    2.0
2018-01-02 12:00:00    0.0
dtype: float64

MEAN:
2018-01-01    2.0
2018-01-02    1.0
Freq: D, dtype: float64
In [204]:
ts.resample('6h').min()
Out[204]:
2018-01-01 00:00:00    1.0
2018-01-01 06:00:00    0.0
2018-01-01 12:00:00    5.0
2018-01-01 18:00:00    NaN
2018-01-02 00:00:00    2.0
2018-01-02 06:00:00    NaN
2018-01-02 12:00:00    0.0
Freq: 6H, dtype: float64
In [205]:
ts.resample('3h').sum()
Out[205]:
2018-01-01 00:00:00    1.0
2018-01-01 03:00:00    0.0
2018-01-01 06:00:00    0.0
2018-01-01 09:00:00    0.0
2018-01-01 12:00:00    5.0
2018-01-01 15:00:00    0.0
2018-01-01 18:00:00    0.0
2018-01-01 21:00:00    0.0
2018-01-02 00:00:00    2.0
2018-01-02 03:00:00    0.0
2018-01-02 06:00:00    0.0
2018-01-02 09:00:00    0.0
2018-01-02 12:00:00    0.0
Freq: 3H, dtype: float64
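resample() also works in the other direction: upsampling to a finer frequency creates empty bins that a fill method such as ffill() can populate. A sketch:

```python
import pandas as pd

dti = pd.DatetimeIndex(['2018-01-01 00:00', '2018-01-01 12:00', '2018-01-02 00:00'])
ts = pd.Series([1., 5., 2.], index=dti)

# Upsampling to 6-hour bins; ffill() propagates the last known value forward
up = ts.resample('6h').ffill()
print(up)
```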

Is there an offset?

No. resample('D') aggregates values for each calendar day, i.e. between "00:00:00" and "23:59:59", regardless of the first index of ts. See the examples below.

In [206]:
dti = pd.DatetimeIndex(['2018-1-1 12:00',
                        '2018-1-2 08:00', '2018-1-2 18:00', '2018-1-2 23:59:59',
                        '2018-1-3 00:00'])
ts = pd.Series([0.,
                10., 20., 30.,
                5.], index=dti)

print("DATA:")
print(ts)

ts_resampled = ts.resample('D').mean()

print()
print("MEAN:")
print(ts_resampled)

# Illustrative plot

ax = ts.plot(x_compat=True, figsize=(18, 3), style="*-", color="blue")
ax.vlines(pd.DatetimeIndex(['2018-1-1 00:00', '2018-1-2 00:00', '2018-1-3 00:00']), ymin=-10, ymax=40, color="red", linestyle=":", linewidth=2, alpha=0.5);
ax.vlines(pd.DatetimeIndex(['2018-1-1 12:00', '2018-1-2 12:00', '2018-1-3 12:00']), ymin=-10, ymax=40, color="green", linestyle=":", linewidth=2, alpha=0.5);

ax.plot(pd.DatetimeIndex(['2018-1-1 12:15', '2018-1-2 11:45']), [40, 40], marker="|", markersize=20, color="green")
ax.plot(pd.DatetimeIndex(['2018-1-2 12:15', '2018-1-3 11:45']), [40, 40], marker="|", markersize=20, color="green")

ax.plot(pd.DatetimeIndex(['2018-1-1 00:15', '2018-1-1 23:45']), [35, 35], marker="|", markersize=20, color="red")
ax.plot(pd.DatetimeIndex(['2018-1-2 00:15', '2018-1-2 23:45']), [35, 35], marker="|", markersize=20, color="red")

ts_resampled.plot(color="red", style="*-", label="mean", alpha=0.75, ax=ax)

# set daily major ticks and hourly minor ticks
ax.xaxis.set_major_locator(mdates.DayLocator(interval=1))
ax.xaxis.set_minor_locator(mdates.HourLocator(interval=1))

# set formatter
ax.xaxis.set_major_formatter(mdates.DateFormatter('%d-%m-%Y  %H:%M'))

# set font and rotation for date tick labels
plt.gcf().autofmt_xdate()

ax.legend()
ax.set_xlabel('Time');
DATA:
2018-01-01 12:00:00     0.0
2018-01-02 08:00:00    10.0
2018-01-02 18:00:00    20.0
2018-01-02 23:59:59    30.0
2018-01-03 00:00:00     5.0
dtype: float64

MEAN:
2018-01-01     0.0
2018-01-02    20.0
2018-01-03     5.0
Freq: D, dtype: float64
In [207]:
dti = pd.DatetimeIndex(['2018-1-1 01:00',
                        '2018-1-1 05:30', '2018-1-1 07:30',
                        '2018-1-1 10:00'])
ts = pd.Series([0.,
                10., 20.,
                5.], index=dti)

print("DATA:")
print(ts)

ts_resampled = ts.resample('5h').mean()

print()
print("MEAN:")
print(ts_resampled)

# Illustrative plot

ax = ts.plot(x_compat=True, figsize=(18, 3), style="*-", color="blue")
ax.vlines(pd.DatetimeIndex(['2018-1-1 00:00', '2018-1-1 05:00', '2018-1-1 10:00']), ymin=-10, ymax=40, color="red", linestyle=":", linewidth=2, alpha=0.5);
ax.vlines(pd.DatetimeIndex(['2018-1-1 01:00', '2018-1-1 06:00', '2018-1-1 11:00']), ymin=-10, ymax=40, color="green", linestyle=":", linewidth=2, alpha=0.5);

ax.plot(pd.DatetimeIndex(['2018-1-1 01:05', '2018-1-1 05:55']), [40, 40], marker="|", markersize=20, color="green")
ax.plot(pd.DatetimeIndex(['2018-1-1 06:05', '2018-1-1 10:55']), [40, 40], marker="|", markersize=20, color="green")

ax.plot(pd.DatetimeIndex(['2018-1-1 00:05', '2018-1-1 04:55']), [35, 35], marker="|", markersize=20, color="red")
ax.plot(pd.DatetimeIndex(['2018-1-1 05:05', '2018-1-1 09:55']), [35, 35], marker="|", markersize=20, color="red")

ts_resampled.plot(color="red", style="*-", label="mean", alpha=0.75, ax=ax)

# set hourly major ticks
ax.xaxis.set_major_locator(mdates.HourLocator(interval=1))
#ax.xaxis.set_minor_locator(mdates.HourLocator(interval=1))

# set formatter
ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))

# set font and rotation for date tick labels
plt.gcf().autofmt_xdate()

ax.legend()
ax.set_xlabel('Time');
DATA:
2018-01-01 01:00:00     0.0
2018-01-01 05:30:00    10.0
2018-01-01 07:30:00    20.0
2018-01-01 10:00:00     5.0
dtype: float64

MEAN:
2018-01-01 00:00:00     0.0
2018-01-01 05:00:00    15.0
2018-01-01 10:00:00     5.0
Freq: 5H, dtype: float64

More realistic example

In [208]:
dti = pd.date_range('1/1/2018 00:00', periods=60*480, freq='min')
ts = pd.Series(np.sin(dti.hour * 2. * np.pi / 24.), index=dti)

ts = ts + np.random.normal(scale=0.4, size=ts.shape)

ax = ts.plot(figsize=(18, 3))
ax.vlines(pd.date_range('1/1/2018 00:00', periods=480/24, freq='D'), ymin=-2, ymax=2, color="red", linestyle=":", alpha=0.3);
In [209]:
ts.shape
Out[209]:
(28800,)
In [210]:
ts_mean = ts.resample('2H').mean()
ts_median = ts.resample('2H').median()
ts_min = ts.resample('2H').min()
ts_max = ts.resample('2H').max()

ax = ts_mean.plot(figsize=(18, 8), color="red", label="mean", alpha=0.75)
ts_median.plot(ax=ax, color="blue", label="median", alpha=0.75)
ts_min.plot(ax=ax, color="blue", alpha=0.5, style=":", label="min")
ts_max.plot(ax=ax, color="blue", alpha=0.5, style=":", label="max")

plt.fill_between(ts_min.index, ts_min.values, ts_max.values, facecolor='blue', alpha=0.1)

ax.legend()
ax.set_xlabel('Time');
In [211]:
ts_mean.shape
Out[211]:
(240,)

Difference between rolling() and resample()

In [212]:
rolling_window = '6H'
start = '2018-1-1 00:00'
end = '2018-1-4 00:00'
In [213]:
dti = pd.date_range(start=start, end=end, freq='min')
ts = pd.Series(np.sin(dti.hour * 2. * np.pi / 24.), index=dti)
ts = ts + np.random.normal(scale=0.4, size=ts.shape)

ax = ts.plot(figsize=(18, 3))
ax.vlines(pd.date_range(start=start, end=end, freq=rolling_window), ymin=-2, ymax=2, color="red", linestyle=":", alpha=0.5);
In [214]:
ts2 = ts.rolling(rolling_window).mean()   # Rolling window size: 6 hours (rolling_window)

ax = ts2.plot(figsize=(18, 3), color="red", alpha=0.75)
ax.vlines(pd.date_range(start=start, end=end, freq=rolling_window), ymin=-1, ymax=1, color="red", linestyle=":", alpha=0.5);
In [215]:
ts2 = ts.resample(rolling_window).mean()   # Resampling bin size: 6 hours (rolling_window)

ax = ts2.plot(figsize=(18, 3), color="red", alpha=0.75)
ax.vlines(pd.date_range(start=start, end=end, freq=rolling_window), ymin=-1, ymax=1, color="red", linestyle=":", alpha=0.5);
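The difference can also be checked on the shapes: a time-based rolling() keeps one value per original sample, while resample() keeps one value per bin. A minimal sketch with a hypothetical minute-frequency series:

```python
import numpy as np
import pandas as pd

dti = pd.date_range('2018-01-01 00:00', '2018-01-04 00:00', freq='min')
ts = pd.Series(np.arange(len(dti), dtype=float), index=dti)

rolled = ts.rolling('6h').mean()      # one output value per original sample
resampled = ts.resample('6h').mean()  # one output value per 6-hour bin

print(len(ts), len(rolled), len(resampled))
```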

Group by

Basic example

In [216]:
dti = pd.DatetimeIndex(['2018-1-1 00:00', '2018-1-1 12:00', '2018-1-2 00:00', '2018-1-2 12:00'])
ts = pd.Series([1., 0., 2., 1.], index=dti)

print(ts)

ax = ts.plot(figsize=(18, 3))

dti = pd.DatetimeIndex(['2018-1-1 00:00', '2018-1-2 00:00'])
ax.vlines(dti, ymin=0, ymax=2, color="red", linestyle=":", alpha=0.3);
2018-01-01 00:00:00    1.0
2018-01-01 12:00:00    0.0
2018-01-02 00:00:00    2.0
2018-01-02 12:00:00    1.0
dtype: float64
In [217]:
ts_mean = ts.groupby(ts.index.time).mean()

print(ts_mean)

ax = ts_mean.plot(figsize=(10, 4), color="red", label="mean", alpha=0.75)
ax.legend()
ax.set_xlabel('Time');
00:00:00    1.5
12:00:00    0.5
dtype: float64

Basic example of wrong usage

In [218]:
dti = pd.DatetimeIndex(['2018-1-1 00:00', '2018-1-1 12:00', '2018-1-2 00:31', '2018-1-2 12:25'])  # Here timestamps are not aligned (non-constant frequency)
ts = pd.Series([1., 0., 2., 1.], index=dti)

print(ts)

ax = ts.plot(figsize=(18, 3));
2018-01-01 00:00:00    1.0
2018-01-01 12:00:00    0.0
2018-01-02 00:31:00    2.0
2018-01-02 12:25:00    1.0
dtype: float64
In [219]:
ts_mean = ts.groupby(ts.index.time).mean()

print(ts_mean)

ax = ts_mean.plot(figsize=(10, 4), color="red", label="mean", alpha=0.75)
ax.legend()
ax.set_xlabel('Time');
00:00:00    1.0
00:31:00    2.0
12:00:00    0.0
12:25:00    1.0
dtype: float64
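One way to fix this wrong usage is to round the index before grouping, so that off-grid timestamps fall back onto the grid. A sketch:

```python
import pandas as pd

dti = pd.DatetimeIndex(['2018-01-01 00:00', '2018-01-01 12:00',
                        '2018-01-02 00:31', '2018-01-02 12:25'])
ts = pd.Series([1., 0., 2., 1.], index=dti)

# Round each timestamp to the nearest hour, then group by time of day
ts_mean = ts.groupby(ts.index.round('h').time).mean()
print(ts_mean)
```

Note that 00:31 rounds up to 01:00, so rounding to a coarser frequency (or using resample) may be preferable when the jitter is large.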

More realistic example

In [220]:
dti = pd.date_range('1/1/2018 00:00', periods=960, freq='h')
ts = pd.Series(np.sin(dti.hour * 2. * np.pi / 24.), index=dti)

ts = ts + np.random.normal(scale=0.4, size=ts.shape)

ax = ts.plot(figsize=(18, 3))
ax.vlines(pd.date_range('1/1/2018 00:00', periods=960/24, freq='D'), ymin=-2, ymax=2, color="red", linestyle=":", alpha=0.3);
In [221]:
ts_mean = ts.groupby(ts.index.time).mean()
ts_median = ts.groupby(ts.index.time).median()
ts_quartile_1 = ts.groupby(ts.index.time).quantile(0.25)
ts_quartile_3 = ts.groupby(ts.index.time).quantile(0.75)

ax = ts_mean.plot(figsize=(14, 8), color="red", label="mean", alpha=0.75)
ts_median.plot(ax=ax, color="blue", label="median", alpha=0.75)
ts_quartile_1.plot(ax=ax, color="blue", alpha=0.5, style=":", label="1st quartile")
ts_quartile_3.plot(ax=ax, color="blue", alpha=0.5, style=":", label="3rd quartile")

plt.fill_between(ts_quartile_1.index, ts_quartile_1.values, ts_quartile_3.values, facecolor='blue', alpha=0.1)

ax.legend()
ax.set_xlabel('Time');

With Periods

In [222]:
dti = pd.period_range('1/1/2018 00:00', periods=960, freq='h')
ts = pd.Series(np.sin(dti.hour * 2. * np.pi / 24.), index=dti)

ts = ts + np.random.normal(scale=0.4, size=ts.shape)

ax = ts.plot(figsize=(18, 3))
ax.vlines(pd.date_range('1/1/2018 00:00', periods=960/24, freq='D'), ymin=-2, ymax=2, color="red", linestyle=":", alpha=0.3);
In [223]:
ts_mean = ts.groupby(ts.index.start_time.time).mean()                 # Note the ".start_time" here
ts_median = ts.groupby(ts.index.start_time.time).median()             # Note the ".start_time" here
ts_quartile_1 = ts.groupby(ts.index.start_time.time).quantile(0.25)   # Note the ".start_time" here
ts_quartile_3 = ts.groupby(ts.index.start_time.time).quantile(0.75)   # Note the ".start_time" here

ax = ts_mean.plot(figsize=(14, 8), color="red", label="mean", alpha=0.75)
ts_median.plot(ax=ax, color="blue", label="median", alpha=0.75)
ts_quartile_1.plot(ax=ax, color="blue", alpha=0.5, style=":", label="1st quartile")
ts_quartile_3.plot(ax=ax, color="blue", alpha=0.5, style=":", label="3rd quartile")

plt.fill_between(ts_quartile_1.index, ts_quartile_1.values, ts_quartile_3.values, facecolor='blue', alpha=0.1)

ax.legend()
ax.set_xlabel('Time');

Round

In [224]:
dti = pd.DatetimeIndex(['2018-1-1 00:00', '2018-1-1 12:00', '2018-1-2 00:31', '2018-1-2 12:25'])  # Here timestamps are not aligned (non-constant frequency)
ts = pd.Series([1., 0., 2., 1.], index=dti)

print(ts)
2018-01-01 00:00:00    1.0
2018-01-01 12:00:00    0.0
2018-01-02 00:31:00    2.0
2018-01-02 12:25:00    1.0
dtype: float64
In [225]:
ts.index.round('H')
Out[225]:
DatetimeIndex(['2018-01-01 00:00:00', '2018-01-01 12:00:00',
               '2018-01-02 01:00:00', '2018-01-02 12:00:00'],
              dtype='datetime64[ns]', freq=None)
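Besides round(), a DatetimeIndex also provides floor() and ceil(), which always round down or up. A sketch:

```python
import pandas as pd

dti = pd.DatetimeIndex(['2018-01-02 00:31', '2018-01-02 12:25'])

# floor() and ceil() always round down / up, unlike round()
floored = dti.floor('h')
ceiled = dti.ceil('h')
print(floored)
print(ceiled)
```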

Count

TODO: is this the cleanest way to do it?

In [226]:
dti = pd.DatetimeIndex(['2018-1-1 00:00', '2018-1-1 06:45', '2018-1-1 12:00',
                        '2018-1-3 00:00', '2018-1-3 06:00'])

s = pd.Series(np.ones(dti.shape), index=dti)
In [227]:
#dti.groupby(dti.date)     # it works but it returns a dictionary...
In [228]:
s.groupby(dti.date).count().plot.bar(color="blue", alpha=0.5);
In [229]:
s.resample('1d').count().plot.bar(color="blue", alpha=0.5);

Plot

In [230]:
#help(df.plot)

Line plot

In [231]:
x = np.arange(0, 6, 0.1)
y1 = np.cos(x)
y2 = np.sin(x)
Y = np.array([y1, y2]).T

df = pd.DataFrame(Y,
                  columns=['cos(x)', 'sin(x)'],
                  index=x)
df.iloc[:10]
Out[231]:
cos(x) sin(x)
0.0 1.000000 0.000000
0.1 0.995004 0.099833
0.2 0.980067 0.198669
0.3 0.955336 0.295520
0.4 0.921061 0.389418
0.5 0.877583 0.479426
0.6 0.825336 0.564642
0.7 0.764842 0.644218
0.8 0.696707 0.717356
0.9 0.621610 0.783327
In [232]:
df.plot(legend=True)
Out[232]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5f237c4ac8>

or

In [233]:
df.plot.line(legend=True)
Out[233]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5f237bfc18>

Steps

In [234]:
df = pd.DataFrame(np.random.randn(36, 2))

df.plot(drawstyle="steps", linewidth=2)
df.plot(drawstyle="steps-post", linewidth=2);

Bar plot

In [235]:
x = np.arange(0, 6, 0.5)
y1 = np.cos(x)
y2 = np.sin(x)
Y = np.array([y1, y2]).T

df = pd.DataFrame(Y,
                  columns=['cos(x)', 'sin(x)'],
                  index=x)
df
Out[235]:
cos(x) sin(x)
0.0 1.000000 0.000000
0.5 0.877583 0.479426
1.0 0.540302 0.841471
1.5 0.070737 0.997495
2.0 -0.416147 0.909297
2.5 -0.801144 0.598472
3.0 -0.989992 0.141120
3.5 -0.936457 -0.350783
4.0 -0.653644 -0.756802
4.5 -0.210796 -0.977530
5.0 0.283662 -0.958924
5.5 0.708670 -0.705540

Vertical

In [236]:
df.plot.bar(legend=True)
Out[236]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5f29607748>
In [237]:
df.plot.bar(legend=True, stacked=True)
Out[237]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5f293add30>

Horizontal

In [238]:
df.plot.barh(legend=True)
Out[238]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5f2966f9b0>

Histogram

In [239]:
x1 = np.random.normal(size=(10000))
x2 = np.random.normal(loc=3, scale=2, size=(10000))
X = np.array([x1, x2]).T

df = pd.DataFrame(X, columns=[r'$\mathcal{N}(0,1)$', r'$\mathcal{N}(3,2)$'])

df.plot.hist(alpha=0.5, bins=100, legend=True);

To normalize the $y$ axis, use density=True:

In [240]:
df.plot.hist(alpha=0.5, bins=100, legend=True, density=True);

Box plot

In [241]:
x1 = np.random.normal(size=(10000))
x2 = np.random.normal(loc=3, scale=2, size=(10000))
X = np.array([x1, x2]).T

df = pd.DataFrame(X, columns=[r'$\mathcal{N}(0,1)$', r'$\mathcal{N}(3,2)$'])

df.plot.box()
Out[241]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5f291a9f60>

Hexbin plot

In [242]:
df = pd.DataFrame(np.random.randn(1000, 2), columns=['a', 'b'])
df['b'] = df['b'] + np.arange(1000)
df.plot.hexbin(x='a', y='b', gridsize=25)
Out[242]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5f23615b00>

Kernel Density Estimation (KDE) plot

In [243]:
x1 = np.random.normal(size=(10000))
x2 = np.random.normal(loc=3, scale=2, size=(10000))
X = np.array([x1, x2]).T

df = pd.DataFrame(X, columns=[r'$\mathcal{N}(0,1)$', r'$\mathcal{N}(3,2)$'])

df.plot.kde()
Out[243]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5f23594278>

Area plot

In [244]:
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])

df.plot.area()
Out[244]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5f21c001d0>

Pie chart

In [245]:
x = np.random.randint(low=0, high=6, size=(50))

df = pd.DataFrame(x, columns=["A"])
df.A.value_counts()
Out[245]:
1    15
2     8
5     7
3     7
0     7
4     6
Name: A, dtype: int64
In [246]:
df.A.value_counts().plot.pie(y="A")
Out[246]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5f21b7b860>

Scatter plot

In [247]:
x1 = np.random.normal(size=(10000))
x2 = np.random.normal(loc=3, scale=2, size=(10000))
X = np.array([x1, x2]).T

df = pd.DataFrame(X, columns=[r'$\mathcal{N}(0,1)$', r'$\mathcal{N}(3,2)$'])

df.plot.scatter(x=r'$\mathcal{N}(0,1)$',
                y=r'$\mathcal{N}(3,2)$',
                alpha=0.2)
Out[247]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5f21b4a4a8>