Exploratory data analysis

Amounts, small multiples, heatmaps

Roland Krause

MADS6

2026-03-11

Introduction

Learning objectives

Aims of exploratory data analysis
Finding the right visualisation for amounts
Trends vs amounts
Small multiples with facets

Material

Exploratory data analysis in R for Data Science (Wickham)
Chapter 6 in Fundamentals of Data Visualization by Claus O. Wilke

What are we visualising?

Amounts

x-y relationships

Trends

Proportions

Distributions

Uncertainty

Some ggplot tricks

Set the themes globally

library(tidyverse)  # ggplot2, dplyr, tidyr, readr and forcats are used below

theme_set(theme_bw(base_size = 20))

Set a theme for exploration

Set your preferred theme early! Don’t copy-paste theme code from plot to plot; move it to the start of the chunk.

Data for the coming examples: Brazil Rain Forest Loss

Code

brazil_url <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-04-06/brazil_loss.csv"
brazil_loss <- read_csv(brazil_url, 
                        col_select = -c(entity, code),
                        col_types = list(year = col_character()))

pivot_longer(brazil_loss,
           cols = commercial_crops:small_scale_clearing,
           names_to = "reasons",
           values_to = "area_ha") -> brazil_loss_long_complete

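# year was read as character, so compare against a string;
# the outer parentheses print the assigned result
(brazil_loss_long_complete |> 
    filter(year == "2003")  -> brazil_loss_long)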

# A tibble: 11 × 3
   year  reasons                         area_ha
   <chr> <chr>                             <dbl>
 1 2003  commercial_crops                 550000
 2 2003  flooding_due_to_dams                  0
 3 2003  natural_disturbances              35000
 4 2003  pasture                         2761000
 5 2003  selective_logging                149000
 6 2003  fire                              44000
 7 2003  mining                                0
 8 2003  other_infrastructure               9000
 9 2003  roads                             35000
10 2003  tree_plantations_including_palm   26000
11 2003  small_scale_clearing             358000

Visualising amounts

Standard geometry: bar chart

geom_bar() counts the rows in each category (needs only one position aesthetic, x or y).
geom_col() draws columns of a given height (needs two aesthetics, x and y).
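
A minimal sketch of the difference, using three rows copied from the 2003 table above. The object names (`loss_2003`, `p_bar`, `p_col`) are made up for this illustration. Because the table has one row per reason, `geom_bar()` should give every bar the height 1, while `geom_col()` uses the values in `area_ha`.

```r
library(ggplot2)
library(tibble)

# Three rows copied from the 2003 table: one row per reason
loss_2003 <- tribble(
  ~reasons,           ~area_ha,
  "commercial_crops",   550000,
  "fire",                44000,
  "pasture",           2761000
)

# geom_bar() only gets x and counts the rows per category
p_bar <- ggplot(loss_2003, aes(x = reasons)) +
  geom_bar()

# geom_col() gets x and y and draws the values as they are
p_col <- ggplot(loss_2003, aes(x = reasons, y = area_ha)) +
  geom_col()

# layer_data() returns what ggplot2 computed for the first layer
bar_counts  <- layer_data(p_bar)$count
col_heights <- layer_data(p_col)$y
```

Looking at `bar_counts` and `col_heights` is a quick way to check which of the two geoms fits a given table.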

Code

brazil_loss_long |>
  ggplot(aes(x = reasons, y = area_ha)) +
  geom_col() +
  labs(title = "Loss of Brazilian rain forest in 2003")

Visualising amounts

Make the axis readable by rotating the labels

Code

brazil_loss_long |>
  ggplot(aes(x = reasons, y = area_ha)) +
  geom_col(fill = "red") +
  guides(x = guide_axis(angle = 45))

Flip x and y to make labels readable!

Code

brazil_loss_long |>
  ggplot(aes(
    x = area_ha, y = reasons)) +
  geom_col()

Sorting the chart

Code

brazil_loss_long |>
  ggplot(aes(y = fct_reorder(reasons, area_ha), 
             x = area_ha)) +
  geom_col()
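
What `fct_reorder()` changes is the order of the factor levels, not the data. A small sketch with four values taken from the 2003 table (the vector names are only for illustration):

```r
library(forcats)

reasons <- c("commercial_crops", "fire", "pasture", "selective_logging")
area_ha <- c(550000, 44000, 2761000, 149000)

# Without reordering, the levels are alphabetical
default_levels <- levels(factor(reasons))

# fct_reorder() sorts the levels by area_ha, smallest first;
# ggplot2 puts the first level at the bottom of a discrete y-axis
reasons_sorted <- fct_reorder(reasons, area_ha)
sorted_levels  <- levels(reasons_sorted)
```

With the smallest value as first level, the largest bar ends up at the top of the flipped chart.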

Technical column names

In exploratory analysis, bare column names or even function calls as axis titles are fine, as in this example.

For a final, presented product, consider changing them to a more readable form.
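
One possible way to get there, as a sketch only: build a label column from the technical names and set the axis titles with `labs()`. The column `reason_label` and the axis title text are assumptions made for this example, not part of the dataset.

```r
library(ggplot2)
library(dplyr)
library(stringr)
library(forcats)
library(tibble)

# Three rows copied from the 2003 table
loss_2003 <- tribble(
  ~reasons,                          ~area_ha,
  "pasture",                          2761000,
  "commercial_crops",                  550000,
  "tree_plantations_including_palm",    26000
)

loss_2003_labelled <- loss_2003 |>
  # Hypothetical helper column: underscores to spaces, first letter upper case
  mutate(reason_label = str_to_sentence(str_replace_all(reasons, "_", " ")))

p_readable <- loss_2003_labelled |>
  ggplot(aes(x = area_ha, y = fct_reorder(reason_label, area_ha))) +
  geom_col() +
  # Replace the technical axis titles
  labs(x = "Forest loss (ha)", y = NULL)
```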

Summarizing

Code

brazil_loss_long |>
  ggplot(aes(y = fct_lump_n(reasons,
                            n = 5,
                            w = area_ha),
             x = area_ha)) +
  geom_col()
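
To see what the lumping does to the data, the same call can be run outside the plot. A sketch with the eleven rows of the 2003 table; `loss_2003`, `reasons_lumped` and `total_ha` are names chosen for this example.

```r
library(dplyr)
library(forcats)
library(tibble)

# The 2003 table from above
loss_2003 <- tribble(
  ~reasons,                          ~area_ha,
  "commercial_crops",                  550000,
  "flooding_due_to_dams",                   0,
  "natural_disturbances",               35000,
  "pasture",                          2761000,
  "selective_logging",                 149000,
  "fire",                               44000,
  "mining",                                 0,
  "other_infrastructure",                9000,
  "roads",                              35000,
  "tree_plantations_including_palm",    26000,
  "small_scale_clearing",              358000
)

lumped <- loss_2003 |>
  # Keep the five reasons with the largest area, collapse the rest into "Other"
  mutate(reasons_lumped = fct_lump_n(reasons, n = 5, w = area_ha)) |>
  # Sum the area per remaining level
  count(reasons_lumped, wt = area_ha, name = "total_ha")

lumped
```

Without `w = area_ha` the function would rank the reasons by number of rows, which is 1 for every reason in this table.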

Sorting the summarized

Code

brazil_loss_long |>
  ggplot(aes(
    y =
      # Collapse all but the five reasons with the largest area
      fct_lump_n(reasons, n = 5, w = area_ha) |>
      # Sort the levels by total area, largest first
      fct_infreq(w = area_ha) |>
      # Reverse the order so the largest bar is drawn at the top
      fct_rev(),
    x = area_ha,
    fill = reasons
  )) +
  geom_col() +
  labs(y = NULL) +
  theme(legend.position = "bottom")

The legend still lists all reasons, because fill maps the original, unlumped reasons column.

Sorting the lumped

Code

brazil_loss_long |>
  mutate(reasons_fct =
      # Collapse all but the five reasons with the largest area
      fct_lump_n(reasons, n = 5, w = area_ha) |>
      # Sort the levels by total area, largest first
      fct_infreq(w = area_ha) |>
      # Reverse the order so the largest bar is drawn at the top
      fct_rev()) |>
  ggplot(aes(
    y = reasons_fct,
    x = area_ha,
    fill = reasons_fct
  )) +
  geom_col() +
  labs(y = NULL) +
  theme(legend.position = "bottom")
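
The resulting level order can be inspected without plotting. This sketch repeats the three steps on plain vectors holding the 2003 values; it assumes a forcats version in which `fct_infreq()` has the `w` argument (1.0.0 or later, as far as I know).

```r
library(forcats)

# Reasons and areas from the 2003 table
reasons <- c("commercial_crops", "flooding_due_to_dams", "natural_disturbances",
             "pasture", "selective_logging", "fire", "mining",
             "other_infrastructure", "roads",
             "tree_plantations_including_palm", "small_scale_clearing")
area_ha <- c(550000, 0, 35000, 2761000, 149000, 44000, 0,
             9000, 35000, 26000, 358000)

reasons_fct <- fct_lump_n(reasons, n = 5, w = area_ha) |>  # lump small reasons
  fct_infreq(w = area_ha) |>                               # largest area first
  fct_rev()                                                # reverse for the y-axis

level_order <- levels(reasons_fct)
```

Note that "Other" is sorted by its summed area like any other level, so it does not automatically end up last.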

From amounts to trends

Loss of pasture over the years

Basic plot

Code

brazil_loss |> 
  ggplot(aes(x = year, y = pasture)) +
  geom_col(fill = "darkgreen")

Cutting the y-axis is cheating!

Code

brazil_loss |> 
  ggplot(aes(x = year, y = pasture)) +
  geom_col(fill = "darkgreen") +
  # Zooms in on 500,000 to 3,000,000 ha, so the bars no longer start at zero
  coord_cartesian(ylim = c(500000, 3000000))
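
Why this is cheating can be put into numbers. The sketch below compares the true ratio of two pasture values from this dataset (2003 and 2012) with the ratio of the bar lengths that remain visible. It assumes the axis starts exactly at the lower limit, as it would with `expand = FALSE`; with the default padding the distortion is smaller but still there.

```r
pasture_2003 <- 2761000
pasture_2012 <- 546000
axis_start   <- 500000  # lower limit passed to coord_cartesian()

# Ratio of the actual values
true_ratio <- pasture_2003 / pasture_2012

# Ratio of the bar lengths that remain visible above the cut
drawn_ratio <- (pasture_2003 - axis_start) / (pasture_2012 - axis_start)

round(c(true = true_ratio, drawn = drawn_ratio), 1)
```

The 2003 value is about 5 times the 2012 value, but the truncated bars suggest a factor of roughly 49.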

Points vs bars for amounts

Points are preferable to bars when the axis is truncated

Code

brazil_loss |>
  ggplot(aes(x = year, y = pasture)) +
  geom_point(color = "darkgreen", size = 3)

Lines to plot trends

brazil_loss |>
  ggplot(aes(x = year, y = pasture)) +
  geom_point(color = "darkgreen", size = 3) +
  geom_line(aes(group = 1), color = "darkgreen", alpha = 0.5)
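
The `group = 1` is needed because `year` was read as character: on a discrete x-axis ggplot2 puts every year into its own group, and a line needs at least two points per group. A sketch of the two ways around this, using the pasture values of this dataset (the name `pasture_loss` is an assumption for the example):

```r
library(ggplot2)
library(tibble)

pasture_loss <- tibble(
  year    = as.character(2001:2013),
  pasture = c(1520000, 2568000, 2761000, 2564000, 2665000, 1861000, 1577000,
              1345000, 847000, 616000, 738000, 546000, 695000)
)

# Default grouping on a character x: one group per year, no line can be drawn
p_default <- ggplot(pasture_loss, aes(x = year, y = pasture)) +
  geom_line()

# Option 1: keep year as character and put all rows into one group
p_group <- ggplot(pasture_loss, aes(x = year, y = pasture)) +
  geom_line(aes(group = 1))

# Option 2: convert year to a number, so x is continuous and no group is needed
p_numeric <- ggplot(pasture_loss, aes(x = as.integer(year), y = pasture)) +
  geom_line()

n_groups_default <- length(unique(layer_data(p_default)$group))
n_groups_fixed   <- length(unique(layer_data(p_group)$group))
n_groups_numeric <- length(unique(layer_data(p_numeric)$group))
```

Option 2 also gives a continuous axis, which changes how breaks and labels are chosen.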

Area plot

Code

brazil_loss |>
  ggplot(aes(x = year, y = pasture, group = 1)) +
  geom_point(color = "darkgreen", 
             size = 1) +
  geom_line(color = "darkgreen", 
            alpha = 0.4) +
  geom_area(fill = "darkgreen", 
            alpha = 0.2)

Combining amounts and trends

Note

Alpha (transparency) is a simple tool to emphasize particular plot elements while staying in the same color scheme.

Snake plot

brazil_loss |>
  # Sort the years by pasture loss (w = weights), largest at the top
  ggplot(aes(y = fct_rev(fct_infreq(year, w = pasture)), x = pasture)) +
  geom_point(color = "red", size = 3)

Not ideal with years (ordinal category, technically numeric)

year	pasture
2001	1520000
2002	2568000
2003	2761000
2004	2564000
2005	2665000
2006	1861000
2007	1577000
2008	1345000
2009	847000
2010	616000
2011	738000
2012	546000
2013	695000

Grouped bar charts

Code

brazil_loss |>
  pivot_longer(
    cols = c(fire, small_scale_clearing,
      selective_logging, commercial_crops),
    names_to = "reasons",
    values_to = "area_ha" ) |>
  # year is a character column: convert it before the numeric comparison
  filter(as.integer(year) > 2008) -> brazil_loss_long_select

ggplot(brazil_loss_long_select) +
  geom_col(aes(x = year,
               y = area_ha,
               fill = reasons))

Stacked area

Code

ggplot(brazil_loss_long_select) +
  geom_area(aes(x = year,
               y = area_ha,
               fill = reasons,
               group = reasons))

Simple dodged?

Code

ggplot(brazil_loss_long_select) +
  geom_col(aes(x = year,
               y = area_ha,
               fill = reasons),
           position = "dodge") +
  theme(legend.position = "bottom")

Not fantastic.

Grouped bar charts - rotate?

Code

ggplot(brazil_loss_long_select) +
  geom_col(aes(x = area_ha, 
               y = reasons, 
               fill = year), 
           position = "dodge") +
  theme(legend.position = "bottom")

Meh.

Grouped bar charts - better use x?

Code

ggplot(brazil_loss_long_select) +
  geom_col(aes(x = reasons, 
               y = area_ha, 
               fill = year), 
           position = "dodge") +
  theme(legend.position = "right") +
  guides(x = guide_axis(angle = 90))

Ack!

Facets to the rescue

Code

ggplot(brazil_loss_long_select) +
  geom_col(aes(x = year,
               y = area_ha, 
               fill = reasons)) +
  facet_wrap(vars(reasons)) +
  theme(legend.position = "none")

Summary for stacked bar charts

Summary using line charts

Exploring a single column

Code

brazil_loss |> 
  ggplot(aes(x = pasture)) +
  geom_histogram()

Code

brazil_loss |> 
  ggplot(aes(x = pasture)) +
  geom_histogram(binwidth = 500000)
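
Called without arguments, `geom_histogram()` falls back to 30 bins and prints a message asking for a better value; with only 13 yearly values most of those bins stay empty. The sketch below shows how the bins for `binwidth = 500000` can be inspected, using the pasture values of this dataset (`pasture_loss` and `bins` are names chosen for the example).

```r
library(ggplot2)
library(tibble)

pasture_loss <- tibble(
  pasture = c(1520000, 2568000, 2761000, 2564000, 2665000, 1861000, 1577000,
              1345000, 847000, 616000, 738000, 546000, 695000)
)

p_hist <- ggplot(pasture_loss, aes(x = pasture)) +
  geom_histogram(binwidth = 500000)

# Bin limits and counts as computed by ggplot2 for the histogram layer
bins <- layer_data(p_hist)[, c("xmin", "xmax", "count")]
bins
```

Trying a few bin widths and comparing the resulting tables is a cheap way to see how much the picture depends on that choice.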

Complete set

Code

brazil_loss_long_complete |>
  ggplot(aes(x = year, y = area_ha, 
             fill = fct_lump_n(reasons, n = 5, w = area_ha))) +
  geom_col() +
  labs(title = "The five main reasons for deforestation in Brazil",
       fill = NULL) +
  guides(x = guide_axis(n.dodge = 2))

From exploratory plots to “products”

Code

brazil_loss_long_complete |>
  ggplot(aes(x = year, y = area_ha, 
             fill = fct_lump_n(reasons, n = 5, w = area_ha))) +
  geom_col() +
  labs(title = "The five main reasons for deforestation in Brazil",
       subtitle = "Area in hectares per year",
       fill = NULL,
       x = NULL,
       y = NULL) +
  facet_wrap(vars(fct_lump_n(reasons, n = 5, w = area_ha)), nrow = 3) +
  theme(legend.position = "none") +
  guides(x = guide_axis(n.dodge = 2)) +
  scale_y_continuous(labels = scales::comma)

Small multiples

No year label (self-explanatory)
Moved the y-axis label to the subtitle
Dodged the labels rather than rotating them
Abbreviating the year might be less readable

Desired improvements still missing
- Color scheme
- Names of reasons

Heat map

Code

brazil_loss_long_complete |>
  ggplot(aes(x = year, y = reasons)) +
  geom_tile(aes(fill = area_ha))

Heatmap improved

Dodged labels, square-root transformation and viridis color scheme

Code

brazil_loss_long_complete |>
  ggplot(aes(x = year, y = reasons)) +
  geom_tile(aes(fill = area_ha)) +
  scale_fill_continuous(trans = "sqrt", type = "viridis") +
  guides(x = guide_axis(n.dodge = 3))

Heatmap improvements

Showing only every third year label and using a log transformation

Code

brazil_loss_long_complete |>
  ggplot(aes(x = year, y = reasons)) +
  geom_tile(aes(fill = area_ha)) +
  scale_fill_continuous(trans = "log10", type = "viridis") +
  scale_x_discrete(
    # year is a character column, so the breaks are given as strings
    breaks = as.character(seq(from = 2001, to = 2013, by = 3))
  )
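
One thing to check before choosing between the two transformations: the data contain zeros (for example mining in 2003). A short sketch of what each transformation does to three values from the 2003 table; the vector names are only for illustration.

```r
# Three values from the 2003 table, including a zero
area_ha <- c(mining = 0, other_infrastructure = 9000, pasture = 2761000)

sqrt_scaled  <- sqrt(area_ha)   # zero stays zero
log10_scaled <- log10(area_ha)  # zero becomes -Inf

sqrt_scaled
log10_scaled
```

On the log scale the zero cells have no finite position, so ggplot2 will usually warn and draw those tiles in the NA color. The square root keeps them on the scale.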

Thank you for your attention!