Week 1A: Data Visualization: Grammar of Graphics

Grammar

English

  • Noun
  • Article
  • Adjective
  • Verb
  • Adverb
  • Preposition

Graphics

Grammar - Example

English

The little monkey hangs confidently by a branch.

Graphics

<ggplot: (383084815)>

Grammar - Example

English

Article: The

Adjective: little

Noun: monkey

Verb: hangs

Adverb: confidently

Preposition: by

Article: a

Noun: branch

Graphics

Graphics - Grammar components

<ggplot: (383478448)>

import pandas as pd
import numpy as np

from plotnine import *
from plotnine.data import *

%matplotlib inline
(
    ggplot(economics, aes(x='date', y='uempmed')) 
    + geom_line() 
)

geom_line

<ggplot: (-9223372036471287173)>

import pandas as pd
import numpy as np

from plotnine import *
from plotnine.data import *

%matplotlib inline
(
    ggplot(economics, aes(x='date', y='uempmed')) 
    + geom_line() 
)

geom_point

<ggplot: (-9223372036470940265)>

import pandas as pd
import numpy as np

from plotnine import *
from plotnine.data import *

%matplotlib inline
(
    ggplot(economics, aes(x='date', y='uempmed')) 
    + geom_point() # scatter plot
    + labs(x='date', y='median duration of unemployment, in weeks')
)

Making your first plot with plotnine

Data

Data: the data to be plotted

Packages

import pandas as pd
import plotnine

from plotnine import *
from plotnine.data import *

Dataset: economics

economics.head(3)
date pce pop psavert uempmed unemploy
0 1967-07-01 507.4 198712 12.5 4.5 2944
1 1967-08-01 510.5 198911 12.5 4.7 2945
2 1967-09-01 516.3 199113 11.7 4.6 2958

Dataset: economics

economics.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 574 entries, 0 to 573
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   date      574 non-null    datetime64[ns]
 1   pce       574 non-null    float64       
 2   pop       574 non-null    int64         
 3   psavert   574 non-null    float64       
 4   uempmed   574 non-null    float64       
 5   unemploy  574 non-null    int64         
dtypes: datetime64[ns](1), float64(3), int64(2)
memory usage: 27.0 KB

Dataset: economics

economics['year'] = economics['date'].dt.year
economics['month'] = economics['date'].dt.month

Dataset: economics

economics.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 574 entries, 0 to 573
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   date      574 non-null    datetime64[ns]
 1   pce       574 non-null    float64       
 2   pop       574 non-null    int64         
 3   psavert   574 non-null    float64       
 4   uempmed   574 non-null    float64       
 5   unemploy  574 non-null    int64         
 6   year      574 non-null    int64         
 7   month     574 non-null    int64         
dtypes: datetime64[ns](1), float64(3), int64(4)
memory usage: 36.0 KB

Dataset: economics

economics.head(3)
date pce pop psavert uempmed unemploy year month
0 1967-07-01 507.4 198712 12.5 4.5 2944 1967 7
1 1967-08-01 510.5 198911 12.5 4.7 2945 1967 8
2 1967-09-01 516.3 199113 11.7 4.6 2958 1967 9

Tidy data

  • Every column is a variable.

  • Every row is an observation.

  • Every cell is a single value.

Tidy data - Example
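
The three tidy-data rules can be illustrated with a small reshaping sketch. The table below is hypothetical (made-up countries and counts), and `to_tidy` is an illustrative helper written in plain Python so that it runs without extra packages. In the "wide" form the year is hidden in the column names; in the tidy form every column is a variable and every row is one observation.

```python
# Hypothetical "wide" table: one row per country, one column per year.
# The values are made up for illustration only.
wide = [
    {"country": "A", "1999": 10, "2000": 12},
    {"country": "B", "1999": 20, "2000": 25},
]


def to_tidy(rows, id_key, var_name, value_name):
    """Turn every non-id column into its own (variable, value) row."""
    tidy_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_key:
                continue  # the id column is repeated on each new row instead
            tidy_rows.append(
                {id_key: row[id_key], var_name: column, value_name: value}
            )
    return tidy_rows


tidy = to_tidy(wide, id_key="country", var_name="year", value_name="cases")
for record in tidy:
    print(record)
```

With pandas, `DataFrame.melt` performs the same kind of reshaping on a data frame.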

Grammar of Graphics - Plot 1 with economics

Data

ggplot(data=economics)
<ggplot: (383844195)>

Aesthetics: mapping variables

Aesthetic means “something you can see”.

  • position (i.e., on the x and y axes)

  • color (“outside” color)

  • fill (“inside” color)

  • shape (of points)

Aesthetic: position

from plotnine.data import mtcars

ggplot(mtcars, aes('wt', 'mpg')) + geom_point()
<ggplot: (-9223372036470736247)>

Aesthetic: color

ggplot(mtcars, aes('wt', 'mpg', color='factor(cyl)')) + geom_point()
<ggplot: (383119922)>

Aesthetic: shape

ggplot(mtcars, aes('wt', 'mpg', shape='factor(cyl)')) + geom_point()
<ggplot: (384289123)>

Aesthetic: size

ggplot(mtcars, aes('wt', 'mpg', size='factor(cyl)')) + geom_point()
<ggplot: (-9223372036470265845)>

Data + Aesthetics

ggplot(economics, aes(x='date', y='uempmed'))
<ggplot: (-9223372036470514358)>

Geometrics

Actual marks we put on a plot

Data + Aesthetics + Geometrics

ggplot(economics, aes(x='date', y='uempmed')) + geom_point()
<ggplot: (-9223372036469989302)>

Data + Aesthetics + Geometrics

ggplot(economics, aes(x='date', y='uempmed')) + geom_point(alpha=0.5)
<ggplot: (-9223372036469942633)>

Data + Aesthetics + Geometrics

ggplot(economics, aes(x='date', y='uempmed')) + geom_point(size=0.3)
<ggplot: (-9223372036469989246)>

Data + Aesthetics + Geometrics

ggplot(economics, aes(x='date', y='uempmed')) + geom_line()
<ggplot: (-9223372036469721351)>

Data + Aesthetics + Geometrics

ggplot(economics, aes(x='date', y='uempmed')) + geom_line() + geom_point(size=0.3)
<ggplot: (-9223372036469621511)>
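
Why `+` can stack a line layer and a point layer is easier to see with a toy model. The sketch below is not plotnine's implementation: `ToyPlot` and its `draw` method are hypothetical names used only to illustrate the idea that each `+` returns a new plot with one more layer, and that layers are drawn in the order they were added.

```python
class ToyPlot:
    """Hypothetical stand-in for a plot object; not plotnine's actual class."""

    def __init__(self, data, mapping, layers=()):
        self.data = data
        self.mapping = mapping
        self.layers = tuple(layers)

    def __add__(self, layer):
        # Return a new plot so the original stays unchanged.
        return ToyPlot(self.data, self.mapping, self.layers + (layer,))

    def draw(self):
        # Layers are rendered in the order they were added.
        return [f"draw {layer} using {self.mapping}" for layer in self.layers]


base = ToyPlot(data="economics", mapping={"x": "date", "y": "uempmed"})
plot = base + "line" + "point"

for step in plot.draw():
    print(step)
```

Because the points are added after the line in this toy, they are drawn on top of it, which matches the order of the layers in the plot above.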

Data + Aesthetics + Geometrics

ggplot(economics, aes(x='date', y='uempmed')) + geom_line() + geom_point(size=0.3, colour="blue")
<ggplot: (-9223372036469501724)>

Geoms

source: https://nbisweden.github.io/RaukR-2019/ggplot/presentation/ggplot_presentation.html#17

Grammar of Graphics - Plot 2 with mpg

Dataset

mpg
manufacturer model displ year cyl trans drv cty hwy fl class
0 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
1 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
2 audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
3 audi a4 2.0 2008 4 auto(av) f 21 30 p compact
4 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
... ... ... ... ... ... ... ... ... ... ... ...
229 volkswagen passat 2.0 2008 4 auto(s6) f 19 28 p midsize
230 volkswagen passat 2.0 2008 4 manual(m6) f 21 29 p midsize
231 volkswagen passat 2.8 1999 6 auto(l5) f 16 26 p midsize
232 volkswagen passat 2.8 1999 6 manual(m5) f 18 26 p midsize
233 volkswagen passat 3.6 2008 6 auto(s6) f 17 26 p midsize

234 rows × 11 columns

Dataset: variable types

mpg.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 234 entries, 0 to 233
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   manufacturer  234 non-null    category
 1   model         234 non-null    category
 2   displ         234 non-null    float64 
 3   year          234 non-null    int64   
 4   cyl           234 non-null    int64   
 5   trans         234 non-null    category
 6   drv           234 non-null    category
 7   cty           234 non-null    int64   
 8   hwy           234 non-null    int64   
 9   fl            234 non-null    category
 10  class         234 non-null    category
dtypes: category(6), float64(1), int64(4)
memory usage: 14.0 KB

Data

ggplot(mpg)
<ggplot: (385284458)>

Data + Aesthetics

ggplot(mpg, aes(x='displ', y='hwy'))
<ggplot: (385284528)>

displ - a car’s engine size, in litres.

hwy - a car’s fuel efficiency on the highway, in miles per gallon (mpg)

Data + Aesthetics + Geom

ggplot(mpg, aes(x='displ', y='hwy')) + geom_point()
<ggplot: (385599443)>

Facets: small multiples

Subplots that each display one subset of the data.
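
As a rough sketch of what "one subset per subplot" means, the rows can be grouped by the faceting variable before anything is drawn. The five rows below are copied from the mpg table shown earlier; the grouping code is plain Python for illustration and is not how plotnine is implemented.

```python
from collections import defaultdict

# (displ, hwy, class) for a few rows of the mpg table shown earlier.
rows = [
    {"displ": 1.8, "hwy": 29, "class": "compact"},
    {"displ": 2.0, "hwy": 31, "class": "compact"},
    {"displ": 2.8, "hwy": 26, "class": "compact"},
    {"displ": 2.0, "hwy": 28, "class": "midsize"},
    {"displ": 3.6, "hwy": 26, "class": "midsize"},
]

# One list of (x, y) points per value of the faceting variable.
panels = defaultdict(list)
for row in rows:
    panels[row["class"]].append((row["displ"], row["hwy"]))

for name, points in panels.items():
    print(name, points)
```

Each key of `panels` corresponds to one small panel, and every observation lands in exactly one panel.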

Data + Aesthetics + Geom + Facets

ggplot(mpg, aes(x='displ', y='hwy')) + geom_point() + facet_wrap("class", nrow=2)
<ggplot: (-9223372036469377946)>

Statistics

Data + Aesthetics + Geom + Facets + Statistics

ggplot(mpg, aes(x='displ', y='hwy')) + geom_point() + facet_wrap("class", nrow=2) + stat_smooth(method="lm")
<ggplot: (385913671)>
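
A statistics layer computes new values from the data before drawing. With `method="lm"` the smoother is a straight line fitted by linear regression. The sketch below fits such a line by ordinary least squares in plain Python, using the first five (displ, hwy) pairs from the mpg table shown earlier. `fit_line` is an illustrative helper, not a plotnine function, and no confidence band is computed here.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# First five rows of mpg: engine size (litres) and highway miles per gallon.
displ = [1.8, 1.8, 2.0, 2.0, 2.8]
hwy = [29, 29, 31, 30, 26]

slope, intercept = fit_line(displ, hwy)
print(f"hwy ~ {intercept:.2f} + {slope:.2f} * displ")
```

The slope is negative for these rows: larger engines go with lower highway mileage. In the faceted plot, a line of this kind is fitted separately within each panel.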

Coordinate

Data + Aesthetics + Geometrics + Facets + Statistics + Coordinate

ggplot(mpg, aes(x='displ', y='hwy')) + geom_point() + facet_wrap("class", nrow=2) + stat_smooth(method="lm") + coord_flip()
<ggplot: (385251876)>

Theme

Data + Aesthetics + Geometrics + Facets + Statistics + Coordinate + Theme

ggplot(mpg, aes(x='displ', y='hwy')) + geom_point() + facet_wrap("class", nrow=2) + stat_smooth(method="lm") + coord_flip() + theme_dark()
<ggplot: (383847323)>

Scale

Data + Aesthetics + Geometrics + Scale

ggplot(mpg, aes(x='displ', y='hwy', color='class')) + geom_point() 
<ggplot: (383875883)>

Data + Aesthetics + Geometrics + Scale

ggplot(mpg, aes(x='displ', y='hwy', color='class')) + geom_point() + scale_color_brewer()
<ggplot: (383894269)>

Data + Aesthetics + Geometrics + Scale

ggplot(mpg, aes(x='displ', y='hwy', color='class')) + geom_point() + scale_color_manual(values=['blue', 'red', 'green', 'orange', 'purple', 'brown', 'gray'])
<ggplot: (-9223372036468854114)>
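
A scale decides how data values are turned into visual values. For a manual color scale this amounts to pairing each level of the variable with one color, so it needs at least as many colors as there are levels. The sketch below is a toy version in plain Python: `manual_scale` is a hypothetical helper, and raising an error on too few colors is this sketch's own choice; how plotnine itself reacts may differ between versions.

```python
def manual_scale(levels, values):
    """Pair each distinct level with one color, in order."""
    if len(values) < len(levels):
        raise ValueError(
            f"{len(levels)} colors needed, only {len(values)} given"
        )
    return dict(zip(levels, values))


# The seven values of `class` in the mpg dataset.
levels = ["2seater", "compact", "midsize", "minivan",
          "pickup", "subcompact", "suv"]
colors = ["blue", "red", "green", "orange", "purple", "brown", "gray"]

mapping = manual_scale(levels, colors)
for level, color in mapping.items():
    print(level, "->", color)
```

Since `class` has seven levels, a list of three colors would leave four classes without a color.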

Your Turn

Visualize the AirPassengers dataset.

Dataset: available at https://thiyanga-spatiotemporal.netlify.app/posts/data/