Plotting data

ggplot, part 1

Roland Krause

MADS6

Tuesday, 5 November 2024

Introduction

Learning objectives

  • Learn and apply the basic grammar of graphics
  • Understand how it is implemented in ggplot2
  • Input data structures as data.frame/tibble
  • Mapping columns to display features (aesthetics)
  • Types of graphics (geometries)

Followed by a second lecture

Introduction

ggplot2

  • Stands for grammar of graphics plot v2
  • Inspired by Leland Wilkinson's work on the grammar of graphics (2005).

Graphs are split into layers

  • Separate axis, curve(s), labels.
  • 3 elements are required:
    • data,
    • aesthetics,
    • geometry \(\geqslant 1\)

Creating a plot

Data

A  B   C   D
2  3   4   a
1  2   1   a
4  5   15  b
9  10  80  b

Aesthetics function

x = A

y = C

shape = D

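Scaling to physical units, where \(\operatorname{range}(A) = \max(A) - \min(A)\)

\(x = \frac{A - \min(A)}{\operatorname{range}(A)} \times \text{width}\)

\(y = \frac{C - \min(C)}{\operatorname{range}(C)} \times \text{height}\)

\(\text{shape} = f_{s}(D)\)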

Geometry

Scatter plot

Data drawn as points

Mapped data

x    y    shape
25   11   circle
0    0    circle
75   53   square
200  300  square
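
The mapped values in the table can be reproduced by hand. The sketch below assumes a drawing area 200 units wide and 300 units high (values inferred from the mapped table, not stated on the slide) and uses a hypothetical lookup vector `shape_lookup` to stand in for \(f_s\).

```r
# Toy data from the slide
toy <- data.frame(A = c(2, 1, 4, 9),
                  B = c(3, 2, 5, 10),
                  C = c(4, 1, 15, 80),
                  D = c("a", "a", "b", "b"))

# Assumed plot dimensions (inferred from the mapped table)
width  <- 200
height <- 300

# Rescale a numeric column to the interval [0, size]
rescale_to <- function(v, size) (v - min(v)) / diff(range(v)) * size

# Hypothetical shape function f_s: one shape per level of D
shape_lookup <- c(a = "circle", b = "square")

mapped <- data.frame(x = round(rescale_to(toy$A, width)),
                     y = round(rescale_to(toy$C, height)),
                     shape = unname(shape_lookup[toy$D]))
mapped
```

Column B is never mapped, so it plays no role in the drawing.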

Result

What if we want to split into panels circles and squares?

Faceting

Split by shape, aka trellis or lattice plots

Redundancy

  • shape and facets provide the same information.
  • The shape aesthetic is free for another variable.
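
As a rough sketch, the toy example could be written in ggplot2 as follows. The data frame `toy` is a hypothetical re-creation of the table shown earlier; `facet_wrap()` does the split into panels.

```r
library(ggplot2)

# Hypothetical re-creation of the toy table
toy <- data.frame(A = c(2, 1, 4, 9),
                  C = c(4, 1, 15, 80),
                  D = c("a", "a", "b", "b"))

# One panel per level of D; shape still carries D here, which is the
# redundancy mentioned above
p <- ggplot(toy) +
  aes(x = A, y = C, shape = D) +
  geom_point(size = 3) +
  facet_wrap(vars(D))
p
```

Dropping `shape = D` from `aes()` would leave the panels unchanged and free the shape for another variable.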

Familiar country shape and data

Combining layers

All ggplot layers are functions

library(ggplot2)
swiss |>
  ggplot() +
  aes(x = Education, 
      y = Examination) +
  geom_point() +
  scale_colour_brewer()

Warning

  • ggplot2 layers are combined with +!
  • The magrittr pipe %>% or the base pipe |> will not work between layers!
  • This introduces a break in the workflow.

Using the pipe between layers raises an explicit error

  ggplot(swiss) |> 
  aes(x = Education, 
      y = Examination) |>
  geom_point() +
  scale_colour_brewer()
Error:
! Cannot add <ggproto> objects together.
ℹ Did you forget to add this object to a <ggplot> object?
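
A small sketch of where each operator belongs, using the built-in swiss data: the pipe hands the data to ggplot(), and everything after that is added with +. The object names `p` and `p_points` are only illustrative.

```r
library(ggplot2)

# The pipe passes the data to ggplot(); from there on, use +
p <- swiss |>
  ggplot() +
  aes(x = Education, y = Examination)

# p is a plot object without any geometry yet
length(p$layers)

# Adding a layer returns a new plot object; p itself is unchanged
p_points <- p + geom_point()
length(p_points$layers)
```

Storing intermediate plots this way makes it easy to try several geometries on the same base.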

Palmer penguins

install.packages("palmerpenguins")
library(palmerpenguins)

Capabilities in ggplot

penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Capabilities in ggplot

ggplot(data = penguins)

Capabilities in ggplot

ggplot(data = penguins) +
  aes(x = flipper_length_mm, 
      y = body_mass_g, 
      color = sex)

Capabilities in ggplot

ggplot(data = penguins) +
  aes(x = flipper_length_mm, 
      y = body_mass_g, 
      color = sex) +
  geom_point()

Capabilities in ggplot

ggplot(data = penguins) +
  aes(x = flipper_length_mm, 
      y = body_mass_g, 
      color = sex) +
  geom_point() +
  scale_color_manual(values = c("darkorange", "cyan4"), 
                     na.translate = FALSE) 

Capabilities in ggplot

ggplot(data = penguins) +
  aes(x = flipper_length_mm,
      y = body_mass_g, 
      color = sex) +
  geom_point() +
  scale_color_manual(values = c("darkorange", "cyan4"), 
                     na.translate = FALSE) +
  labs(title = "Penguin flipper and body mass")

Capabilities in ggplot

ggplot(data = penguins) +
  aes(x = flipper_length_mm,
      y = body_mass_g, 
      color = sex) +
  geom_point() +
  scale_color_manual(values = c("darkorange", "cyan4"), 
                     na.translate = FALSE) +
  labs(title = "Penguin flipper and body mass",
       x = "Flipper length (mm)",
       y = "Body mass (g)")

Capabilities in ggplot

description <- "Dimensions for male/female Adelie, Chinstrap and Gentoo Penguins\nat Palmer Station LTER"

Capabilities in ggplot

ggplot(data = penguins) +
  aes(x = flipper_length_mm,
      y = body_mass_g, 
      color = sex) +
  geom_point() +
  scale_color_manual(values = c("darkorange", "cyan4"), 
                     na.translate = FALSE) +
  labs(title = "Penguin flipper and body mass",
       caption = "Horst AM, Hill AP, Gorman KB (2020)",
       subtitle = description, 
       x = "Flipper length (mm)",
       y = "Body mass (g)")

Capabilities in ggplot

ggplot(data = penguins) +
  aes(x = flipper_length_mm,
      y = body_mass_g, 
      color = sex) +
  geom_point() +
  scale_color_manual(values = c("darkorange", "cyan4"), 
                     na.translate = FALSE) +
  labs(title = "Penguin flipper and body mass",
       caption = "Horst AM, Hill AP, Gorman KB (2020)",
       subtitle = description, 
       x = "Flipper length (mm)",
       y = "Body mass (g)",
       color = "Penguin sex")

Capabilities in ggplot

ggplot(data = penguins) +
  aes(x = flipper_length_mm,
      y = body_mass_g,
      color = sex) +
  geom_point() +
  scale_color_manual(values = c("darkorange", "cyan4"), na.translate = FALSE) +
  labs(title = "Penguin flipper and body mass",
       caption = "Horst AM, Hill AP, Gorman KB (2020)",
       subtitle = description,
       x = "Flipper length (mm)",
       y = "Body mass (g)",
       color = "Penguin sex") +
  theme(plot.subtitle = element_text(size = 13),
        axis.title   = element_text(size = 11)) 

Capabilities in ggplot

ggplot(data = penguins) +
  aes(x = flipper_length_mm,
      y = body_mass_g,
      color = sex) +
  geom_point() +
  scale_color_manual(values = c("darkorange", "cyan4"), na.translate = FALSE) +
  labs(title = "Penguin flipper and body mass",
       caption = "Horst AM, Hill AP, Gorman KB (2020)",
       subtitle = description,
       x = "Flipper length (mm)",
       y = "Body mass (g)",
       color = "Penguin sex") +
  theme(plot.subtitle = element_text(size = 13),
        axis.title   = element_text(size = 11)) +
  theme(legend.position = "bottom",
        legend.background = element_rect(fill = "white", color = NA)) 

Capabilities in ggplot

ggplot(data = penguins) +
  aes(x = flipper_length_mm, y = body_mass_g, color = sex) +
  geom_point() +
  scale_color_manual(values = c("darkorange", "cyan4"), na.translate = FALSE) +
  labs(title = "Penguin flipper and body mass",
       caption = "Horst AM, Hill AP, Gorman KB (2020)",
       subtitle = description,
       x = "Flipper length (mm)",
       y = "Body mass (g)",
       color = "Penguin sex") +
  theme(plot.subtitle = element_text(size = 13),
        axis.title   = element_text(size = 11)) +
  theme(legend.position = "bottom",
        legend.background = element_rect(fill = "white", color = NA)) +
  theme(plot.caption = element_text(hjust = 0, face = "italic"),
        plot.caption.position = "plot") +
  facet_wrap(vars(species))

Capabilities in ggplot

ggplot(data = penguins) +
  aes(x = flipper_length_mm, y = body_mass_g, color = sex) +
  geom_point() +
  scale_color_manual(values = c("darkorange", "cyan4"), na.translate = FALSE) +
    labs(title = "Penguin flipper and body mass",
       caption = "Horst AM, Hill AP, Gorman KB (2020)",
       subtitle = description,
       x = "Flipper length (mm)",
       y = "Body mass (g)",
       color = "Penguin sex") +
  theme(plot.subtitle = element_text(size = 13),
        axis.title   = element_text(size = 11)) +
  theme(legend.position = "bottom",
        legend.background = element_rect(fill = "white", color = NA)) +
  theme(plot.caption = element_text(hjust = 0, face = "italic"),
        plot.caption.position = "plot") +
  facet_wrap(vars(species)) +
  scale_x_continuous(guide = guide_axis(n.dodge = 2)) +
  scale_y_continuous(labels = scales::label_comma())

Geometric objects define the plot type to be drawn

geom_point()

geom_line()

geom_bar()

geom_violin()

geom_histogram()

geom_density()

Layers

Core layers

Other layers

They are always present; a plot works without specifying them because they have sensible defaults:

  • Theme is theme_grey
  • Coordinate is cartesian
  • Statistic is identity
  • Facets are disabled

Three layers are sufficient

  • Data
  • Aesthetics: mapping data to plot components
  • Geometry: at least one
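
To make the defaults visible, the sketch below writes the same scatter plot twice: once in its minimal form and once with the default statistic, position, coordinate system, facets and theme spelled out. The object names are illustrative only.

```r
library(ggplot2)
library(palmerpenguins)

# Minimal form: data, aesthetics, one geometry
p_short <- ggplot(penguins) +
  aes(x = bill_length_mm, y = bill_depth_mm) +
  geom_point()

# Same plot with the defaults written explicitly
p_long <- ggplot(penguins) +
  aes(x = bill_length_mm, y = bill_depth_mm) +
  geom_point(stat = "identity", position = "identity") +
  coord_cartesian() +
  facet_null() +
  theme_grey()
```

Both objects should draw the same figure; the long form is only useful as a reminder of what ggplot2 fills in.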

Your first plot

library(palmerpenguins)
library(ggplot2)
ggplot(data = penguins) +
  geom_point(mapping = aes(x = bill_length_mm,
                           y = bill_depth_mm,
                           colour = species))

My first plot

ggplot(data = penguins) +
  geom_point(mapping = aes(x = bill_length_mm,
                           y = bill_depth_mm,
                           colour = "green"))

Mapping aesthetics

Requirements

  • aes() maps data columns (variables) to aesthetics
  • Specific geometries (geom) have different expectations:
    • univariate: one of x or y (y for flipped axes)
    • bivariate: x and y, as in a scatterplot
  • Continuous or discrete variables
    • Continuous for color ➡️ gradient
    • Discrete for color ➡️ qualitative palette

geom_point() requires both x and y coordinates

ggplot(penguins) +
  geom_point(aes(x = bill_length_mm,
                 y = bill_depth_mm))
  • Same as previous slide
  • Without colour mapping

Unmapped parameters

  • geom_point() accepts additional arguments such as the colour
  • Set them to a fixed value, without mapping
ggplot(penguins) +
  geom_point(aes(x = bill_length_mm,
                 y = bill_depth_mm),
             colour = "black")

Important

Parameters defined outside the aesthetics aes() are applied to all data.

Mapped parameters

Mapped parameters require two conditions:

  • Being defined inside the aesthetics aes()
  • Refer to one of the data columns (here, a deliberate mistake: penguins has no country column)
ggplot(penguins) +
  geom_point(aes(x = bill_length_mm,
                 y = bill_depth_mm,
                 colour = country))
Error in FUN(X[[i]], ...): object 'country' not found
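  • Passing the unknown column as a string has a different effect: the string is treated as a constant, so all points share one colour and the legend shows "country":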
ggplot(penguins) +
  geom_point(aes(x = bill_length_mm,
                 y = bill_depth_mm,
                 colour = "country"))
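
One way to check what happened is to look at the data ggplot2 computed for the layer. This is a sketch using `layer_data()` from ggplot2; the object name `p` is illustrative.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(penguins) +
  geom_point(aes(x = bill_length_mm,
                 y = bill_depth_mm,
                 colour = "country"))

# Inspect the computed data of the first layer:
# every row received the same colour
unique(layer_data(p)$colour)
```

A single colour value for all rows confirms that the string was recycled as a constant rather than looked up as a column.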

Mapping rules

In aes()

refer to a valid column.

Mapping aesthetics correctly

In aes() and refer to a data column

ggplot(penguins) +
  geom_bar(aes(y = species,
               fill = sex))

species and sex are 2 valid columns in penguins

Advantages:

  • The legend 🟥/🟦 for free
  • Missing data are highlighted in grey ⬜️
  • Using y axis for categories eases reading

Why no string for mapping?

  • Could we pass an expression?
  • Which penguins are above 4 kg?
  • Use body_mass_g > 4000, which returns a logical vector, to find out
ggplot(penguins) +
  geom_bar(aes(y = species,
               fill = body_mass_g > 4000))

The expression was evaluated in the context of the penguins data.
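
The counts behind those stacked bars can be reproduced outside ggplot2. The sketch below evaluates the same expression by hand with base R; `heavy` and `tab` are illustrative names.

```r
library(palmerpenguins)

# Same expression as in aes(), evaluated by hand
heavy <- penguins$body_mass_g > 4000

# Counts behind the stacked bars; NA marks penguins without a recorded mass
tab <- table(species = penguins$species, heavy = heavy, useNA = "ifany")
tab
```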

Inheritance of arguments across layers

Compare the code and the results

ggplot(penguins,
       aes(x = bill_length_mm,
           y = bill_depth_mm)) +
  geom_point(
    aes(colour = species)) +
  geom_smooth(method = "lm", formula = 'y ~ x')

ggplot(penguins,
       aes(x = bill_length_mm,
           y = bill_depth_mm,
           colour = species)) +
  geom_point() +
  geom_smooth(method = "lm", formula = 'y ~ x')

Note

  • aesthetics in ggplot() are passed on to all geometries.

Note

  • aesthetics in geom_*() are specific to that layer (and can override inherited ones)
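
As a sketch of overriding, an inherited aesthetic can be removed in a single layer by setting it to NULL there. Here colour is mapped globally, but the smoothing layer drops it and fits one line to all penguins.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(penguins,
            aes(x = bill_length_mm,
                y = bill_depth_mm,
                colour = species)) +
  geom_point() +
  # colour = NULL removes the inherited mapping for this layer only,
  # so a single regression line is fitted to all penguins
  geom_smooth(aes(colour = NULL), method = "lm", formula = y ~ x)
p
```

The points keep their species colours because the override applies to the smoothing layer alone.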

Simpson’s paradox

Statistical correlation depending on stratification.

ggplot(penguins,
       aes(x = bill_length_mm,
           y = bill_depth_mm)) +
  geom_point(
    aes(colour = species)) +
  geom_smooth(method = "lm", formula = 'y ~ x')

Complete data set: \(R^2 = 0.055\)

ggplot(penguins,
       aes(x = bill_length_mm,
           y = bill_depth_mm,
           colour = species)) +
  geom_point() +
  geom_smooth(method = "lm", formula = 'y ~ x')

\(R^2\) per species:

Adelie  Gentoo  Chinstrap
0.153   0.414   0.427
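
These coefficients can be recomputed with plain lm() calls. The sketch below assumes the same model as the smoothing layer, bill_depth_mm ~ bill_length_mm; the helper `r_squared()` is a hypothetical name.

```r
library(palmerpenguins)

# R squared of the linear model bill_depth_mm ~ bill_length_mm
r_squared <- function(df) {
  summary(lm(bill_depth_mm ~ bill_length_mm, data = df))$r.squared
}

# Whole data set, ignoring species
r2_all <- r_squared(penguins)

# One model per species
r2_species <- sapply(split(penguins, penguins$species), r_squared)

round(r2_all, 3)
round(r2_species, 3)
```

The pooled fit explains little variance, while each species on its own shows a clearer relationship, which is the paradox the two plots illustrate.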

Your turn!

  • Use the classroom practical.
  • Install the palmerpenguins package if you haven’t yet.
  • Use the penguins data set and plot bill_length_mm, bill_depth_mm and species.
  • Map the variable island to the shape aesthetic.
  • Add a regression line using a linear model.
  • Draw all dots (circles / triangles / squares) with:
    • A size of 5
    • A transparency of 30% (alpha = 0.7)

Goal


Joining observations

library(tibble)
library(tidyr)

set.seed(212)
# tidyr::crossing() generates all combinations of x and g
tib <- tibble(crossing(x = letters[1:4], 
                       g = factor(1:2)), 
              y = rnorm(8))

Suppose we want to connect the dots by colour

tib
# A tibble: 8 × 3
  x     g          y
  <chr> <fct>  <dbl>
1 a     1     -0.239
2 a     2      0.677
3 b     1     -2.44 
4 b     2      1.24 
5 c     1     -0.327
6 c     2      0.154
7 d     1      1.04 
8 d     2     -0.780

Warning

Should be the job of geom_line()

Invisible aesthetic: grouping

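# x is discrete, so each x/g combination forms its own group: no lines are drawn
ggplot(tib, aes(x, y, colour = g)) +
  geom_line() + 
  geom_point(size = 4)

# group = g: one line per level of g
ggplot(tib, aes(x, y, colour = g)) +
  geom_line(aes(group = g)) +  
  geom_point(size = 4)

# group = 1: a single line through all observations
ggplot(tib, aes(x, y, colour = g)) +
  geom_line(aes(group = 1)) +
  geom_point(size = 4)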
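
The grouping can be inspected directly in the computed layer data. The sketch below rebuilds a comparable toy data set with base R (`toy` and `n_groups()` are hypothetical names) and counts the groups ggplot2 derives with and without an explicit group aesthetic.

```r
library(ggplot2)

# Comparable toy data built with base R: 4 x values crossed with 2 groups
set.seed(212)
toy <- expand.grid(g = factor(1:2), x = letters[1:4],
                   stringsAsFactors = FALSE)
toy$y <- rnorm(8)

p_default <- ggplot(toy, aes(x, y, colour = g)) + geom_line()
p_grouped <- ggplot(toy, aes(x, y, colour = g)) + geom_line(aes(group = g))

# Number of distinct groups in the first layer of a plot
n_groups <- function(p) length(unique(layer_data(p)$group))

n_groups(p_default)  # one group per combination of x and g
n_groups(p_grouped)  # one group per level of g
```

With one observation per group there is nothing to connect, which is why the first plot shows points only.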

Labels

ggplot(penguins,
       aes(x = bill_length_mm,
           y = bill_depth_mm,
           shape = island,
           colour = species)) +
  geom_point() +
  geom_smooth(method = "lm",
              formula = 'y ~ x') +
  labs(title = "Bill ratios of Palmer penguins",
       caption = "Horst AM, Hill AP, Gorman KB (2020)",
       subtitle = "Split per species / island",
       shape = "Islands",
       x = "culmen length (mm)",
       y = "culmen depth (mm)")

Statistics / geometries are interchangeable

ggplot(penguins) +
  geom_bar(aes(y = species))

Collective aesthetics

  • Feels more natural since visual
  • But just a preference
  • Most code in the wild uses geom_bar()

ggplot(penguins) +
  stat_count(aes(y = species))

Let ggplot2 do the stat for you

stat = "count" can be omitted since it is the default for geom_bar()

ggplot(penguins, aes(x = species)) +
  geom_bar(stat = "count")

stat_count acts on the mapped var like dplyr::count()

count(penguins, species)
# A tibble: 3 × 2
  species       n
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124

Count yourself: geom_col()

Change the stat for counts or …

count(penguins, species) |>
  ggplot(aes(x = species, y = n)) +
  geom_bar(stat = "identity")

use geom_col(), whose default stat is identity

count(penguins, species) |>
  ggplot(aes(x = species, y = n)) +
  geom_col() 

The after_stat() function allows computation, like proportions

Classic counting

ggplot(penguins, aes(y = species)) +
  # after_stat() replaces stat(), which is deprecated since ggplot2 3.4.0
  geom_bar(aes(x = after_stat(count)))

See the list of computed variables in the help pages

ggplot(penguins, aes(y = species)) +
  geom_bar(aes(x = after_stat(count / sum(count)))) +
  scale_x_continuous(labels = scales::label_percent())

  • Now compute proportions
  • Bonus: get x scale in % using scales

Flexibility in the aesthetics for flipping axes

geom_bar() requires x OR y

penguins |>
  # horizontal brings readability
  ggplot(aes(y = species)) +
  geom_bar()

Cleanup plot

penguins |>
  ggplot(aes(y = species)) +
  geom_bar() +
  labs(y = NULL) +
  scale_x_continuous(expand = c(0, NA))

Annoying to see those 3 bars out of order

Reorder the categorical variable (forcats)

Using the function fct_infreq()

library(forcats)
penguins |>
  ggplot(aes(y = fct_infreq(species) |> fct_rev())) + 
  geom_bar() +
  scale_x_continuous(expand = c(0, NA)) +
  labs(title = "Palmer penguins species",
       y = NULL) +
  theme_minimal(14) +
  theme(panel.ontop = TRUE,
        # better to hide the horizontal grid lines
        panel.grid.major.y = element_blank())
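
To see what the two forcats calls do, the factor levels can be printed at each step. A minimal sketch; `by_freq` is an illustrative name.

```r
library(forcats)
library(palmerpenguins)

# Original level order (alphabetical)
levels(penguins$species)

# Levels ordered by decreasing frequency
by_freq <- fct_infreq(penguins$species)
levels(by_freq)

# Reversed, so the most frequent level comes last; on a y axis the last
# level is drawn at the top
levels(fct_rev(by_freq))
```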

Geometries catalogue

Histograms

penguins |>
ggplot(aes(x = body_mass_g,
           fill = species)) +
  geom_histogram(bins = 35,
                 alpha = 0.7, 
                 position = "identity")

  • The default number of bins is 30; when bins is not set, ggplot2 prints a message about it
  • The default position is stack; here we overlay with position = "identity" and use transparency

Density plots

penguins |>
ggplot(aes(x = body_mass_g,
           fill = species,
           colour = species)) +
  geom_density(alpha = 0.7)

  • Use both colour and fill mapped to the same variable for cosmetic purposes

Overlay density and histogram

Naive approach: scale issue

ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(bins = 30) +
  geom_density(colour = "red")

Solution: scale the histogram to the density

ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(bins = 30,
                 # after_stat() replaces the deprecated stat()
                 aes(y = after_stat(density))) +
  geom_density(colour = "red")
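
Why the two layers now share a scale: on the density scale, the histogram bars have a total area of 1, just like the area under a density curve. The sketch below checks this on the computed bar data; it uses after_stat(), the current spelling of stat(), and illustrative names `bars` and `area`.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(bins = 30, aes(y = after_stat(density)))

# Bar data computed by the binning statistic
# (rows with missing body mass are dropped with a warning)
bars <- layer_data(p)

# Total bar area: sum of height times width
area <- sum(bars$density * (bars$xmax - bars$xmin))
area
```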

Barplot, categories

Default: position = "stack"

ggplot(penguins) +
  geom_bar(aes(y = species, 
               fill = island))

Dodging island: side by side

ggplot(penguins) +
  geom_bar(aes(y = species, fill = island),
           position = "dodge")

But the total width per species is preserved, so a lone bar is wider

Preserve single bar

ggplot(penguins) +
  geom_bar(aes(y = species,
           fill = island),
           position = position_dodge2(preserve = "single")) +
  labs(y = NULL)

Stacked barchart for proportions

penguins |> 
  drop_na(sex) |> # from tidyr
  ggplot() +
  geom_bar(aes(y = species,
               fill = sex),
           position = "fill") +
  scale_x_continuous(labels = scales::label_percent(),
                     position = "top",
                     expand = c(0, 0)) +
  labs(x = NULL, y = NULL) +
  theme_classic(16)

Boxplot, a continuous y by a categorical x

ggplot(penguins) +
  geom_boxplot(aes(y = body_mass_g,
                   x = species))

Note

geom_boxplot() detects that:

  • body_mass_g is continuous
  • species is categorical/discrete

Boxplot, dodging by default

penguins |>
  tidyr::drop_na() |> # drops rows with any missing value
  ggplot() +
  geom_boxplot(aes(y = body_mass_g,
                   x = species,
                   fill = sex))

Filter out NA to avoid an additional category for missing data.

Better: violin and jitter

Show the data

penguins |> 
  dplyr::filter(!is.na(sex)) |>
  # define aes here for both geometries
  ggplot(aes(y = body_mass_g,
             x = species,
             fill = sex,
             # for violin contours and dots
             colour = sex)) +
  # very transparent filling
  geom_violin(alpha = 0.1, trim = FALSE) +
  geom_point(position = position_jitterdodge(dodge.width = 0.9),
             alpha = 0.5,
             # don't need dots in legend
             show.legend = FALSE)

Coding mistake

What is wrong with this code?

Think about inherited aesthetics!

penguins |>
  ggplot() +
  geom_point(aes(x = bill_length_mm, 
                 y = body_mass_g)) +
  geom_smooth(method = "lm")
Error in `geom_smooth()`:
! Problem while computing stat.
ℹ Error occurred in the 2nd layer.
Caused by error in `compute_layer()`:
! `stat_smooth()` requires the following missing aesthetics: x and y.

Inheritance of aesthetics in main ggplot()

penguins |>
  ggplot(aes(x = bill_length_mm, 
             y = body_mass_g)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Before we stop

You learned to:

  • Understand graphics as a language
  • Embrace the layer system
  • Link data columns to aesthetics
  • Discover geometries

Acknowledgments

Further reading

Thank you for your attention!