Week 1B: Time Series: Objects in Python and Visualization

Time Series

A time series is a sequence of observations taken sequentially in time.

Cross-sectional data

ID calories
0 1 420
1 2 380
2 3 390

Observations that come from different individuals or groups at a single point in time.

Time series data

Year Sales
0 2019 490
1 2020 980
2 2021 260

A set of observations, along with some information about what times those observations were recorded.

DateTime

Cross-sectional data

import pandas as pd
data = {
  "ID": [1, 2, 3],
  "calories": [420, 380, 390]
}

# Load data into a DataFrame object
dfc = pd.DataFrame(data)
dfc
ID calories
0 1 420
1 2 380
2 3 390

Time series data

data = {
  "Year": [2019, 2020, 2021],
  "Sales": [490, 980, 260]
}

# Load data into a DataFrame object
dft = pd.DataFrame(data)
dft
Year Sales
0 2019 490
1 2020 980
2 2021 260

DateTime

Cross-sectional data

dfc.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   ID        3 non-null      int64
 1   calories  3 non-null      int64
dtypes: int64(2)
memory usage: 176.0 bytes

Time series data

dft.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Year    3 non-null      int64
 1   Sales   3 non-null      int64
dtypes: int64(2)
memory usage: 176.0 bytes
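
Both tables report int64 columns, so pandas has no way of knowing that Year carries time information. The following is a minimal sketch of making that explicit, assuming each year should map to 1 January of that year. The column name Date and the variable dft_indexed are introduced here for illustration only.

```python
import pandas as pd

dft = pd.DataFrame({"Year": [2019, 2020, 2021], "Sales": [490, 980, 260]})

# Year is stored as plain integers; parse it as a calendar year.
# Casting to str first keeps the parsing step explicit.
dft["Date"] = pd.to_datetime(dft["Year"].astype(str), format="%Y")

# Use the parsed dates as the index so rows are addressed by time.
dft_indexed = dft.set_index("Date")
print(dft_indexed.index)
```

After this step, info() would report a datetime column instead of a second integer column.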

Necessary packages

import pandas as pd
import numpy as np
import datetime

Read the AirPassengers data

airpassenger = pd.read_csv('AirPassengers.csv')
airpassenger
Month #Passengers
0 1949-01 112
1 1949-02 118
2 1949-03 132
3 1949-04 129
4 1949-05 121
... ... ...
139 1960-08 606
140 1960-09 508
141 1960-10 461
142 1960-11 390
143 1960-12 432

144 rows × 2 columns

AirPassengers dataset

airpassenger.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144 entries, 0 to 143
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Month        144 non-null    object
 1   #Passengers  144 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ KB

Data Visualization

from plotnine import ggplot, aes, geom_line, geom_point, geom_boxplot
# Month is still a string (object dtype) at this point
ggplot(airpassenger, aes(x='Month', y='#Passengers'))+geom_line()
<ggplot: (-9223372036479770161)>

Convert to Date and Time

# Parse the 'Month' strings (e.g. '1949-01') into datetime64 values
airpassenger['Month'] = pd.to_datetime(airpassenger['Month'])
airpassenger.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144 entries, 0 to 143
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Month        144 non-null    datetime64[ns]
 1   #Passengers  144 non-null    int64         
dtypes: datetime64[ns](1), int64(1)
memory usage: 2.4 KB
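
When no format is given, pd.to_datetime infers one from the strings. A small sketch of the same conversion with the format stated explicitly, using only the first three rows shown earlier; the frame name sample is hypothetical and stands in for the full CSV.

```python
import pandas as pd

# Three rows copied from the head of the AirPassengers table above.
sample = pd.DataFrame({
    "Month": ["1949-01", "1949-02", "1949-03"],
    "#Passengers": [112, 118, 132],
})

# "%Y-%m" = four-digit year, dash, two-digit month.
# The missing day defaults to the first of the month.
sample["Month"] = pd.to_datetime(sample["Month"], format="%Y-%m")
print(sample.dtypes)
```

Stating the format makes the intent visible and fails loudly if a row does not match the expected pattern.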

Data Visualization

ggplot(airpassenger, aes(x='Month', y='#Passengers'))+geom_line()
<ggplot: (-9223372036479774183)>

Data Visualization

ggplot(airpassenger, aes(x='Month', y='#Passengers'))+geom_line()+geom_point()
<ggplot: (-9223372036479770771)>

Split date into month and year

airpassenger['year'] = airpassenger['Month'].dt.year
airpassenger['month'] = airpassenger['Month'].dt.month

Split date into month and year (cont.)

airpassenger
Month #Passengers year month
0 1949-01-01 112 1949 1
1 1949-02-01 118 1949 2
2 1949-03-01 132 1949 3
3 1949-04-01 129 1949 4
4 1949-05-01 121 1949 5
... ... ... ... ...
139 1960-08-01 606 1960 8
140 1960-09-01 508 1960 9
141 1960-10-01 461 1960 10
142 1960-11-01 390 1960 11
143 1960-12-01 432 1960 12

144 rows × 4 columns
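
The .dt accessor exposes other calendar fields in the same way as year and month. A short sketch on a three-row stand-in frame (the name sample and the new column names are assumptions made for this example):

```python
import pandas as pd

sample = pd.DataFrame({
    "Month": pd.to_datetime(["1949-01-01", "1949-04-01", "1960-12-01"]),
    "#Passengers": [112, 129, 432],
})

# Quarter of the year (1 to 4) and the English month name.
sample["quarter"] = sample["Month"].dt.quarter
sample["month_name"] = sample["Month"].dt.month_name()
print(sample)
```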

Time Series Patterns

Trend

Long-term increase or decrease in the data.

Seasonal

A seasonal pattern exists when a series is influenced by seasonal factors (e.g., the quarter of the year, the month, or day of the week). Seasonality is always of a fixed and known period. Hence, seasonal time series are sometimes called periodic time series.

Period is unchanging and associated with some aspect of the calendar.

Time Series Patterns (cont)

Cyclic

A cyclic pattern exists when data exhibit rises and falls that are not of fixed period. The duration of these fluctuations is usually of at least 2 years. In general, the average length of cycles is longer than the length of a seasonal pattern, and the magnitude of cycles tends to be more variable than the magnitude of seasonal patterns.
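
To make the trend and seasonal definitions concrete, here is a sketch that builds a synthetic monthly series from a linear trend plus a component with a fixed 12-month period. All numbers are made up for illustration. A cyclic component is not sketched, because by definition its period is not fixed.

```python
import numpy as np
import pandas as pd

# Four years of synthetic monthly data (values are illustrative only).
months = pd.date_range("2000-01-01", periods=48, freq="MS")
t = np.arange(48)

trend = 100 + 2.0 * t                        # long-term increase
seasonal = 10 * np.sin(2 * np.pi * t / 12)   # repeats every 12 months

series = pd.Series(trend + seasonal, index=months)
print(series.head(12))
```

The seasonal array takes the same values in every year, which is what "fixed and known period" means here; the trend is what makes later years sit higher than earlier ones.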

Example: trend

Example: seasonal

Example: multiple seasonality

Example: Trend + Seasonal

Cyclic

Cyclic + Seasonal

Frequency of a time series: Seasonal periods

Seasonal plots

ggplot(airpassenger, aes(x='month', y='#Passengers', color='year'))+geom_point()
<ggplot: (-9223372036478791978)>

Seasonal plots

ggplot(airpassenger, aes(x='month', y='#Passengers', color='factor(year)'))+geom_point()
<ggplot: (-9223372036582484245)>

Seasonal plots

ggplot(airpassenger, aes(x='month', y='#Passengers', color='factor(year)'))+geom_line()
<ggplot: (376120896)>

Seasonal plots

ggplot(airpassenger, aes(x='month', y='#Passengers', color='factor(year)'))+geom_line() + geom_point() 
<ggplot: (-9223372036479294681)>

Seasonal plots

ggplot(airpassenger, aes(x='month', y='#Passengers', color='factor(month)'))+ geom_boxplot() 
<ggplot: (-9223372036479434317)>

Seasonal plots

ggplot(airpassenger, aes(x='month', y='#Passengers', color='factor(month)'))+ geom_point()+ geom_boxplot() 
<ggplot: (-9223372036480369867)>

Seasonal plots

ggplot(airpassenger, aes(x='month', y='#Passengers', color='factor(month)'))+ geom_point()+ geom_boxplot(alpha=0.5) 
<ggplot: (-9223372036477430072)>

Yearly variation

ggplot(airpassenger, aes(x='year', y='#Passengers', color='factor(year)'))+ geom_point()+ geom_boxplot(alpha=0.5) 
<ggplot: (377342307)>
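
The box plots summarise each year's distribution graphically. The same information can be tabulated with groupby, and a year-by-month table gives a numeric view of the seasonal plots. This sketch uses synthetic data (not the AirPassengers values), and the column name value is hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: three years of monthly values with trend + seasonality.
months = pd.date_range("1949-01-01", periods=36, freq="MS")
t = np.arange(36)
df = pd.DataFrame({
    "Month": months,
    "value": 100 + 2 * t + 10 * np.sin(2 * np.pi * t / 12),
})
df["year"] = df["Month"].dt.year
df["month"] = df["Month"].dt.month

# Per-year summary, the numbers behind a yearly box plot.
yearly = df.groupby("year")["value"].agg(["min", "median", "max"])

# One row per year, one column per month.
wide = df.pivot(index="year", columns="month", values="value")
print(yearly)
print(wide.round(1))
```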

pandas Time Series: index by time

Index - Yearly

Method 1

index1 = pd.DatetimeIndex(['2012', '2013', '2014', '2015', '2016'])
data1 = pd.Series([123, 39, 78, 52, 110], index=index1)
data1
2012-01-01    123
2013-01-01     39
2014-01-01     78
2015-01-01     52
2016-01-01    110
dtype: int64

Index - Yearly (cont.)

Method 2

freq='AS': year-start frequency

index2 = pd.date_range("2012", periods=5, freq='AS')
index2
DatetimeIndex(['2012-01-01', '2013-01-01', '2014-01-01', '2015-01-01',
               '2016-01-01'],
              dtype='datetime64[ns]', freq='AS-JAN')
data2 = pd.Series([123, 39, 78, 52, 110], index=index2)
data2
2012-01-01    123
2013-01-01     39
2014-01-01     78
2015-01-01     52
2016-01-01    110
Freq: AS-JAN, dtype: int64

Index - Yearly (cont.)

Method 3

freq='A': year-end frequency

index3 = pd.date_range("2012", periods=5, freq='A')
index3
DatetimeIndex(['2012-12-31', '2013-12-31', '2014-12-31', '2015-12-31',
               '2016-12-31'],
              dtype='datetime64[ns]', freq='A-DEC')
data3 = pd.Series([123, 39, 78, 52, 110], index=index3)
data3
2012-12-31    123
2013-12-31     39
2014-12-31     78
2015-12-31     52
2016-12-31    110
Freq: A-DEC, dtype: int64

Index - Yearly (cont.)

Method 4

freq='AS-NOV': year-start frequency anchored to a chosen month (here November)

index4 = pd.date_range("2012", periods=5, freq='AS-NOV')
index4
DatetimeIndex(['2012-11-01', '2013-11-01', '2014-11-01', '2015-11-01',
               '2016-11-01'],
              dtype='datetime64[ns]', freq='AS-NOV')
data4 = pd.Series([123, 39, 78, 52, 110], index=index4)
data4
2012-11-01    123
2013-11-01     39
2014-11-01     78
2015-11-01     52
2016-11-01    110
Freq: AS-NOV, dtype: int64
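
A note on aliases: recent pandas releases (2.2 and later, as far as I am aware) deprecate the 'A' and 'AS' spellings and emit a warning when they are used. The sketch below repeats Method 2 with 'YS', which to my knowledge is accepted by both older and newer versions. The variable names index_ys and data_ys are introduced here.

```python
import pandas as pd

# Year-start frequency using the 'YS' alias instead of 'AS'.
index_ys = pd.date_range("2012", periods=5, freq="YS")
data_ys = pd.Series([123, 39, 78, 52, 110], index=index_ys)
print(data_ys)
```

The resulting dates are the same 1 January timestamps as in Method 2; only the alias (and the freq label in the printed output) differs.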

Index - Yearly (cont.)

index = pd.period_range('2012-01', periods=8, freq='A')
index
PeriodIndex(['2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019'], dtype='period[A-DEC]', freq='A-DEC')

Index - Monthly

Method 1

index = pd.period_range('2022-01', periods=8, freq='M')
index
PeriodIndex(['2022-01', '2022-02', '2022-03', '2022-04', '2022-05', '2022-06',
             '2022-07', '2022-08'],
            dtype='period[M]', freq='M')

Method 2

index = pd.period_range(start='2022-01-01', end='2022-08-02', freq='M')
index
PeriodIndex(['2022-01', '2022-02', '2022-03', '2022-04', '2022-05', '2022-06',
             '2022-07', '2022-08'],
            dtype='period[M]', freq='M')
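
A PeriodIndex labels whole intervals (a month), while a DatetimeIndex labels instants (the first moment of that month). A brief sketch of converting between the two; the names stamps, periods and back are assumptions of this example.

```python
import pandas as pd

# Month-start timestamps for January to August 2022.
stamps = pd.date_range("2022-01-01", periods=8, freq="MS")

# Timestamps -> monthly periods.
periods = stamps.to_period("M")

# Periods -> timestamps at the start of each period.
back = periods.to_timestamp()

print(periods)
print(back)
```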

Index - Quarterly

index = pd.period_range('2022-01', periods=8, freq='Q')
index
PeriodIndex(['2022Q1', '2022Q2', '2022Q3', '2022Q4', '2023Q1', '2023Q2',
             '2023Q3', '2023Q4'],
            dtype='period[Q-DEC]', freq='Q-DEC')

Index - Daily

index = pd.period_range('2022-01-01', periods=8, freq='D')
index
PeriodIndex(['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
             '2022-01-05', '2022-01-06', '2022-01-07', '2022-01-08'],
            dtype='period[D]', freq='D')

Index - Hourly

Range of hourly periods (period_range) and hourly timestamps (date_range)

pd.period_range('2022-01', periods=8, freq='H')
PeriodIndex(['2022-01-01 00:00', '2022-01-01 01:00', '2022-01-01 02:00',
             '2022-01-01 03:00', '2022-01-01 04:00', '2022-01-01 05:00',
             '2022-01-01 06:00', '2022-01-01 07:00'],
            dtype='period[H]', freq='H')
pd.date_range('2022-01', periods=8, freq='H')
DatetimeIndex(['2022-01-01 00:00:00', '2022-01-01 01:00:00',
               '2022-01-01 02:00:00', '2022-01-01 03:00:00',
               '2022-01-01 04:00:00', '2022-01-01 05:00:00',
               '2022-01-01 06:00:00', '2022-01-01 07:00:00'],
              dtype='datetime64[ns]', freq='H')

Sequence of durations increasing by an hour

pd.timedelta_range(0, periods=10, freq='H')
TimedeltaIndex(['0 days 00:00:00', '0 days 01:00:00', '0 days 02:00:00',
                '0 days 03:00:00', '0 days 04:00:00', '0 days 05:00:00',
                '0 days 06:00:00', '0 days 07:00:00', '0 days 08:00:00',
                '0 days 09:00:00'],
               dtype='timedelta64[ns]', freq='H')

Define multiple frequencies
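
One plausible reading of this slide, following the handbook chapter listed in the references, is that frequency codes can be combined into a single step size. A minimal sketch with a step of 2 hours 30 minutes; older material writes the same step as '2H30T', and the lowercase spelling used here is the one newer pandas releases prefer.

```python
import pandas as pd

# Durations increasing in steps of 2 hours 30 minutes.
steps = pd.timedelta_range(0, periods=4, freq="2h30min")
print(steps)
```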

Next lesson

Correlation

Autocorrelation

ACF plot

import pandas as pd
from matplotlib import pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf
# Select relevant data, index by Date
data = airpassenger[['Month', '#Passengers']].set_index(['Month'])
data
#Passengers
Month
1949-01-01 112
1949-02-01 118
1949-03-01 132
1949-04-01 129
1949-05-01 121
... ...
1960-08-01 606
1960-09-01 508
1960-10-01 461
1960-11-01 390
1960-12-01 432

144 rows × 1 columns

ACF plot

data.info()
# Plot the ACF (via statsmodels) for the first 50 lags
plot_acf(data, lags=50)
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 144 entries, 1949-01-01 to 1960-12-01
Data columns (total 1 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   #Passengers  144 non-null    int64
dtypes: int64(1)
memory usage: 2.2 KB
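
To show what an ACF plot is drawing, here is a hand-rolled sketch of the sample autocorrelation at lags 0 to max_lag, applied to a synthetic series with a 12-step period. The function name sample_acf is hypothetical. It is meant as an illustration of the usual estimator, not as a replacement for the statsmodels routine.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation r_k for k = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()              # deviations from the mean
    denom = np.sum(d ** 2)        # lag-0 sum of squares
    n = len(x)
    # r_k = sum_t d[t] * d[t + k] / sum_t d[t]^2
    return np.array([np.sum(d[: n - k] * d[k:]) / denom
                     for k in range(max_lag + 1)])

# Synthetic series that repeats every 12 steps.
t = np.arange(120)
x = np.sin(2 * np.pi * t / 12)

r = sample_acf(x, 24)
print(np.round(r[[0, 6, 12]], 2))
```

For a series with a 12-step period, the autocorrelation is strongly positive at lag 12 and strongly negative at lag 6, which is the kind of repeating pattern to look for in the ACF of seasonal data.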

ACF

Time series forecasting

Training and Test Set
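
For time series, a training/test split is normally chronological: the test set is the most recent stretch of the data, not a random sample. A minimal sketch under that assumption, using placeholder values on a monthly index of the same length as AirPassengers. The choice of 24 test months is arbitrary and made only for this example.

```python
import numpy as np
import pandas as pd

# Placeholder series: 144 monthly observations (values are not real data).
months = pd.date_range("1949-01-01", periods=144, freq="MS")
y = pd.Series(np.arange(144, dtype=float), index=months)

n_test = 24                  # hold out the last two years (arbitrary choice)
train = y.iloc[:-n_test]     # everything before the hold-out window
test = y.iloc[-n_test:]      # the most recent n_test observations

print(train.index.max(), test.index.min())
```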

Simple time series forecasting technique

References

https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html