Week 1B: Time Series: Objects in Python and Visualization

Time Series

A time series is a sequence of observations taken sequentially in time.

Cross-sectional data

   ID  calories
0   1       420
1   2       380
2   3       390

Observations that come from different individuals or groups at a single point in time.

Time series data

   Year  Sales
0  2019    490
1  2020    980
2  2021    260

A set of observations, along with some information about what times those observations were recorded.


Cross-sectional data

import pandas as pd
data = {
  "ID": [1, 2, 3],
  "calories": [420, 380, 390]
}

#load data into a DataFrame object:
dfc = pd.DataFrame(data)
dfc
   ID  calories
0   1       420
1   2       380
2   3       390

Time series data

data = {
  "Year": [2019, 2020, 2021],
  "Sales": [490, 980, 260]
}

#load data into a DataFrame object:
dft = pd.DataFrame(data)
dft
   Year  Sales
0  2019    490
1  2020    980
2  2021    260


Cross-sectional data

dfc.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   ID        3 non-null      int64
 1   calories  3 non-null      int64
dtypes: int64(2)
memory usage: 176.0 bytes

Time series data

dft.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Year    3 non-null      int64
 1   Sales   3 non-null      int64
dtypes: int64(2)
memory usage: 176.0 bytes
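
The two info() summaries are structurally identical: to pandas, Year is just another int64 column. As a minimal sketch (the name dft_indexed is illustrative and not from the slides), the years can be parsed into dates and used as the index, which is what lets pandas treat the frame as a time series:

```python
import pandas as pd

dft = pd.DataFrame({"Year": [2019, 2020, 2021], "Sales": [490, 980, 260]})

# Parse the integer years as dates; each becomes 1 January of that year
years = pd.to_datetime(dft["Year"].astype(str), format="%Y")

# Use the parsed dates as the index (dft_indexed is a hypothetical name)
dft_indexed = dft.set_index(years)[["Sales"]]
print(type(dft_indexed.index).__name__)
```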

Necessary packages

import pandas as pd
import numpy as np
import datetime

Read AirPassenger

airpassenger = pd.read_csv('AirPassengers.csv')
airpassenger
       Month  #Passengers
0    1949-01          112
1    1949-02          118
2    1949-03          132
3    1949-04          129
4    1949-05          121
...      ...          ...
139  1960-08          606
140  1960-09          508
141  1960-10          461
142  1960-11          390
143  1960-12          432

144 rows × 2 columns

AirPassenger dataset

airpassenger.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144 entries, 0 to 143
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Month        144 non-null    object
 1   #Passengers  144 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ KB

Data Visualization

# Import only the plotnine names used in these slides
from plotnine import ggplot, aes, geom_line, geom_point, geom_boxplot
ggplot(airpassenger, aes(x='Month', y='#Passengers'))+geom_line()

Convert to Date and Time

# Parse the 'YYYY-MM' strings; each month is mapped to its first day
airpassenger['Month'] = pd.to_datetime(airpassenger['Month'])
airpassenger.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144 entries, 0 to 143
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Month        144 non-null    datetime64[ns]
 1   #Passengers  144 non-null    int64         
dtypes: datetime64[ns](1), int64(1)
memory usage: 2.4 KB
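
pd.to_datetime infers the layout of the strings when no format is given. A minimal sketch of the same conversion with an explicit format, assuming the 'YYYY-MM' layout shown above; the frame sample is a small stand-in for the first rows of the data, not the real file:

```python
import pandas as pd

# Small stand-in for the first rows of the AirPassengers data
sample = pd.DataFrame({
    "Month": ["1949-01", "1949-02", "1949-03"],
    "#Passengers": [112, 118, 132],
})

# An explicit format documents the expected layout of the strings
sample["Month"] = pd.to_datetime(sample["Month"], format="%Y-%m")
print(sample.dtypes)
```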

Data Visualization

ggplot(airpassenger, aes(x='Month', y='#Passengers'))+geom_line()

Data Visualization

ggplot(airpassenger, aes(x='Month', y='#Passengers'))+geom_line()+geom_point()

Split date into month and year

airpassenger['year'] = airpassenger['Month'].dt.year
airpassenger['month'] = airpassenger['Month'].dt.month

Split date into month and year (cont.)

airpassenger
          Month  #Passengers  year  month
0    1949-01-01          112  1949      1
1    1949-02-01          118  1949      2
2    1949-03-01          132  1949      3
3    1949-04-01          129  1949      4
4    1949-05-01          121  1949      5
...         ...          ...   ...    ...
139  1960-08-01          606  1960      8
140  1960-09-01          508  1960      9
141  1960-10-01          461  1960     10
142  1960-11-01          390  1960     11
143  1960-12-01          432  1960     12

144 rows × 4 columns
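
The .dt accessor exposes more than year and month. A short sketch on three hand-picked dates (months and parts are illustrative names) showing a few other components that can be useful when grouping by season:

```python
import pandas as pd

# Three example dates in the same 'YYYY-MM' layout as the data
months = pd.Series(pd.to_datetime(["1949-01", "1949-05", "1960-12"], format="%Y-%m"))

parts = pd.DataFrame({
    "year": months.dt.year,
    "month": months.dt.month,
    "quarter": months.dt.quarter,          # 1 to 4
    "month_name": months.dt.month_name(),  # English names by default
})
print(parts)
```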

Time Series Patterns


Trend: a long-term increase or decrease in the data.


Seasonal: a seasonal pattern exists when a series is influenced by seasonal factors (e.g., the quarter of the year, the month, or the day of the week). Seasonality is always of a fixed and known period. Hence, seasonal time series are sometimes called periodic time series.

The period is unchanging and associated with some aspect of the calendar.

Time Series Patterns (cont)


Cyclic: a cyclic pattern exists when the data exhibit rises and falls that are not of a fixed period. The duration of these fluctuations is usually at least 2 years. In general, the average length of cycles is longer than the length of a seasonal pattern, and the magnitude of cycles tends to be more variable than the magnitude of seasonal patterns.
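
To make the first two definitions concrete, here is a minimal sketch that builds a synthetic monthly series from a linear trend plus a component that repeats every 12 months. All numbers are made up for illustration, and a cyclic component is deliberately left out because it has no fixed period:

```python
import numpy as np
import pandas as pd

# Four years of month-start time stamps
idx = pd.date_range("2000-01-01", periods=48, freq="MS")
t = np.arange(48)

trend = 100 + 2.0 * t                       # steady long-term increase
seasonal = 10 * np.sin(2 * np.pi * t / 12)  # repeats exactly every 12 months

series = pd.Series(trend + seasonal, index=idx)
print(series.head())
```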

Example: trend

Example: seasonal

Example: multiple seasonality

Example: Trend + Seasonal


Cyclic + Seasonal

Frequency of a time series: Seasonal periods

Seasonal plots

ggplot(airpassenger, aes(x='month', y='#Passengers', color='year'))+geom_point()

Seasonal plots

ggplot(airpassenger, aes(x='month', y='#Passengers', color='factor(year)'))+geom_point()

Seasonal plots

ggplot(airpassenger, aes(x='month', y='#Passengers', color='factor(year)'))+geom_line()

Seasonal plots

ggplot(airpassenger, aes(x='month', y='#Passengers', color='factor(year)'))+geom_line() + geom_point() 

Seasonal plots

ggplot(airpassenger, aes(x='month', y='#Passengers', color='factor(month)'))+ geom_boxplot() 

Seasonal plots

ggplot(airpassenger, aes(x='month', y='#Passengers', color='factor(month)'))+ geom_point()+ geom_boxplot() 

Seasonal plots

ggplot(airpassenger, aes(x='month', y='#Passengers', color='factor(month)'))+ geom_point()+ geom_boxplot(alpha=0.5) 

Yearly variation

ggplot(airpassenger, aes(x='year', y='#Passengers', color='factor(year)'))+ geom_point()+ geom_boxplot(alpha=0.5) 
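
The seasonal and yearly plots have a simple numerical counterpart: a month-by-year table and per-month averages. The sketch below uses a synthetic series rather than the AirPassengers file, and the names df, seasonal_table and monthly_mean are illustrative:

```python
import numpy as np
import pandas as pd

# Synthetic monthly data: linear trend plus a 12-month seasonal component
idx = pd.date_range("2000-01-01", periods=36, freq="MS")
t = np.arange(36)
df = pd.DataFrame({"Month": idx,
                   "value": 100 + 2 * t + 10 * np.sin(2 * np.pi * t / 12)})
df["year"] = df["Month"].dt.year
df["month"] = df["Month"].dt.month

# Rows = month, columns = year: one column per line of a seasonal plot
seasonal_table = df.pivot_table(index="month", columns="year", values="value")

# Average level of each calendar month across years
monthly_mean = df.groupby("month")["value"].mean()
print(seasonal_table.round(1))
```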

pandas Time Series: index by time

Index - Yearly

Method 1

index1 = pd.DatetimeIndex(['2012', '2013', '2014', '2015', '2016'])
data1 = pd.Series([123, 39, 78, 52, 110], index=index1)
2012-01-01    123
2013-01-01     39
2014-01-01     78
2015-01-01     52
2016-01-01    110
dtype: int64
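
One reason to index by time is that rows can then be selected by date. A minimal sketch using the same series as above (subset and single are illustrative names):

```python
import pandas as pd

index1 = pd.DatetimeIndex(['2012', '2013', '2014', '2015', '2016'])
data1 = pd.Series([123, 39, 78, 52, 110], index=index1)

# Slice by year strings; both endpoints are included
subset = data1.loc['2014':'2015']

# Look up one observation by its exact time stamp
single = data1.loc[pd.Timestamp('2013-01-01')]
print(subset)
print(single)
```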

Index - Yearly (cont.)

Method 2

freq='AS' for start of year

index2 = pd.date_range("2012", periods=5, freq='AS')
DatetimeIndex(['2012-01-01', '2013-01-01', '2014-01-01', '2015-01-01',
               '2016-01-01'],
              dtype='datetime64[ns]', freq='AS-JAN')
data2 = pd.Series([123, 39, 78, 52, 110], index=index2)
2012-01-01    123
2013-01-01     39
2014-01-01     78
2015-01-01     52
2016-01-01    110
Freq: AS-JAN, dtype: int64

Index - Yearly (cont.)

Method 3

freq='A' for end of year

index3 = pd.date_range("2012", periods=5, freq='A')
DatetimeIndex(['2012-12-31', '2013-12-31', '2014-12-31', '2015-12-31',
               '2016-12-31'],
              dtype='datetime64[ns]', freq='A-DEC')
data3 = pd.Series([123, 39, 78, 52, 110], index=index3)
2012-12-31    123
2013-12-31     39
2014-12-31     78
2015-12-31     52
2016-12-31    110
Freq: A-DEC, dtype: int64

Index - Yearly (cont.)

Method 4

Annual indexing with arbitrary month

index4 = pd.date_range("2012", periods=5, freq='AS-NOV')
DatetimeIndex(['2012-11-01', '2013-11-01', '2014-11-01', '2015-11-01',
               '2016-11-01'],
              dtype='datetime64[ns]', freq='AS-NOV')
data4 = pd.Series([123, 39, 78, 52, 110], index=index4)
2012-11-01    123
2013-11-01     39
2014-11-01     78
2015-11-01     52
2016-11-01    110
Freq: AS-NOV, dtype: int64
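
Depending on the installed pandas version, the string aliases 'A' and 'AS' may emit a deprecation warning, since the alias spellings have changed between releases. A sketch that sidesteps the strings by passing offset objects instead; the variable names are illustrative:

```python
import pandas as pd

# Offset objects avoid depending on a particular alias spelling
index_start = pd.date_range("2012", periods=5, freq=pd.offsets.YearBegin())
index_end = pd.date_range("2012", periods=5, freq=pd.offsets.YearEnd())

# Year start anchored to November instead of January
index_nov = pd.date_range("2012", periods=5, freq=pd.offsets.YearBegin(month=11))

print(index_start[0], index_end[0], index_nov[0])
```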

Index - Yearly (cont.)

index = pd.period_range('2012-01', periods=8, freq='A')
PeriodIndex(['2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019'], dtype='period[A-DEC]', freq='A-DEC')

Index - Monthly

Method 1

index = pd.period_range('2022-01', periods=8, freq='M')
PeriodIndex(['2022-01', '2022-02', '2022-03', '2022-04', '2022-05', '2022-06',
             '2022-07', '2022-08'],
            dtype='period[M]', freq='M')

Method 2

index = pd.period_range(start='2022-01-01', end='2022-08-02', freq='M')
PeriodIndex(['2022-01', '2022-02', '2022-03', '2022-04', '2022-05', '2022-06',
             '2022-07', '2022-08'],
            dtype='period[M]', freq='M')
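
A PeriodIndex labels whole intervals (here, months), while a DatetimeIndex labels instants. A minimal sketch of converting between the two, with illustrative variable names:

```python
import pandas as pd

periods = pd.period_range('2022-01', periods=3, freq='M')

# Each period spans a whole month; to_timestamp() picks its first instant
starts = periods.to_timestamp()

# Going back from time stamps to monthly periods
back = starts.to_period('M')

print(periods[1].start_time, periods[1].end_time)
```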

Index - Quarterly

index = pd.period_range('2022-01', periods=8, freq='Q')
PeriodIndex(['2022Q1', '2022Q2', '2022Q3', '2022Q4', '2023Q1', '2023Q2',
             '2023Q3', '2023Q4'],
            dtype='period[Q-DEC]', freq='Q-DEC')

Index - Daily

index = pd.period_range('2022-01-01', periods=8, freq='D')
PeriodIndex(['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
             '2022-01-05', '2022-01-06', '2022-01-07', '2022-01-08'],
            dtype='period[D]', freq='D')

Index - Hourly

Range of hourly timestamps

pd.period_range('2022-01', periods=8, freq='H')
PeriodIndex(['2022-01-01 00:00', '2022-01-01 01:00', '2022-01-01 02:00',
             '2022-01-01 03:00', '2022-01-01 04:00', '2022-01-01 05:00',
             '2022-01-01 06:00', '2022-01-01 07:00'],
            dtype='period[H]', freq='H')
pd.date_range('2022-01', periods=8, freq='H')
DatetimeIndex(['2022-01-01 00:00:00', '2022-01-01 01:00:00',
               '2022-01-01 02:00:00', '2022-01-01 03:00:00',
               '2022-01-01 04:00:00', '2022-01-01 05:00:00',
               '2022-01-01 06:00:00', '2022-01-01 07:00:00'],
              dtype='datetime64[ns]', freq='H')

Sequence of durations increasing by an hour

pd.timedelta_range(0, periods=10, freq='H')
TimedeltaIndex(['0 days 00:00:00', '0 days 01:00:00', '0 days 02:00:00',
                '0 days 03:00:00', '0 days 04:00:00', '0 days 05:00:00',
                '0 days 06:00:00', '0 days 07:00:00', '0 days 08:00:00',
                '0 days 09:00:00'],
               dtype='timedelta64[ns]', freq='H')
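
Durations become time stamps once they are added to a starting point. A short sketch, passing the frequency as a pd.Timedelta object rather than the 'H' string, whose spelling may differ between pandas versions:

```python
import pandas as pd

# Four durations: 0, 1, 2 and 3 hours
steps = pd.timedelta_range(start="0 days", periods=4, freq=pd.Timedelta(hours=1))

# Adding the durations to a time stamp gives a DatetimeIndex
stamps = pd.Timestamp("2022-01-01") + steps
print(stamps)
```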

Define multiple frequencies
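
The content of this slide is not preserved here. One plausible reading, offered as an assumption, is a step that combines several units, such as 2 hours 30 minutes. A sketch using a pd.Timedelta as the frequency:

```python
import pandas as pd

# A step that mixes two units: 2 hours and 30 minutes (illustrative choice)
step = pd.Timedelta(hours=2, minutes=30)

durations = pd.timedelta_range(start="0 days", periods=4, freq=step)
stamps = pd.date_range("2022-01-01", periods=4, freq=step)
print(durations)
print(stamps)
```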

Next lesson



ACF plot

import pandas as pd
from matplotlib import pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf
# Select the relevant columns and index by Month
data = airpassenger[['Month', '#Passengers']].set_index(['Month'])
data
            #Passengers
Month                  
1949-01-01          112
1949-02-01          118
1949-03-01          132
1949-04-01          129
1949-05-01          121
...                 ...
1960-08-01          606
1960-09-01          508
1960-10-01          461
1960-11-01          390
1960-12-01          432

144 rows × 1 columns

ACF plot

data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 144 entries, 1949-01-01 to 1960-12-01
Data columns (total 1 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   #Passengers  144 non-null    int64
dtypes: int64(1)
memory usage: 2.2 KB
# Plot the ACF for lags 0 to 50 (via statsmodels)
plot_acf(data, lags=50)
plt.show()
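
Each bar in an ACF plot is a sample autocorrelation at one lag. As a hand-rolled sketch of that quantity (sample_acf is an illustrative helper, not the statsmodels implementation), applied to a synthetic series with period 12:

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelations r_0, ..., r_max_lag of a 1-D sequence."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    # r_k: lagged cross-products of the centered series over the lag-0 sum
    return np.array([
        np.sum(centered[:len(x) - k] * centered[k:]) / denom
        for k in range(max_lag + 1)
    ])

# Synthetic series that repeats every 12 observations
t = np.arange(120)
x = np.sin(2 * np.pi * t / 12)
r = sample_acf(x, 24)
print(r[[0, 6, 12]].round(2))
```

For a seasonal series like this one, the autocorrelation is strongly positive at the seasonal lag (12) and strongly negative at half the period (6).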






## Time series forecasting

Training and Test Set
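
The content of this slide is not preserved here. A common convention, stated as an assumption, is to split chronologically: the most recent observations are held out as the test set and everything before them is used for training. A minimal sketch with placeholder values and a hypothetical horizon of 6 months:

```python
import numpy as np
import pandas as pd

# Placeholder monthly series (values are not real data)
idx = pd.date_range("1949-01-01", periods=24, freq="MS")
y = pd.Series(np.arange(24), index=idx)

horizon = 6  # hypothetical number of held-out observations

# Chronological split: no shuffling, the test set is the most recent part
train, test = y.iloc[:-horizon], y.iloc[-horizon:]
print(train.index.max(), test.index.min())
```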

