Exploratory data analysis

Amounts, small multiples, heatmaps

Roland Krause

MADS6

Tuesday, 26 November 2024

Introduction

Learning objectives

  • Aims of exploratory data analysis

  • Finding the right visualisation for amounts

  • Trends vs amounts

  • Small multiples with facets

Material

What are we visualising?

Amounts

x-y relationships

Proportions

Distributions

Uncertainty

Some ggplot tricks

Set the themes globally

theme_set(theme_bw(base_size = 20))

Set a theme for exploration

Set your preferred theme early! Don't copy-paste theme code into every plot; set it once at the start of the document.

Data for the coming examples: Brazil Rain Forest Loss

Code
# readr, tidyr, dplyr, forcats and ggplot2 are all loaded with the tidyverse
library(tidyverse)

brazil_url <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-04-06/brazil_loss.csv"
brazil_loss <- read_csv(brazil_url, 
                        col_select = -c(entity, code),
                        col_types = list(year = col_character()))

pivot_longer(brazil_loss,
             cols = commercial_crops:small_scale_clearing,
             names_to = "reasons",
             values_to = "area_ha") -> brazil_loss_long_complete

(brazil_loss_long_complete |> 
    filter(year == 2003)  -> brazil_loss_long)
# A tibble: 11 × 3
   year  reasons                         area_ha
   <chr> <chr>                             <dbl>
 1 2003  commercial_crops                 550000
 2 2003  flooding_due_to_dams                  0
 3 2003  natural_disturbances              35000
 4 2003  pasture                         2761000
 5 2003  selective_logging                149000
 6 2003  fire                              44000
 7 2003  mining                                0
 8 2003  other_infrastructure               9000
 9 2003  roads                             35000
10 2003  tree_plantations_including_palm   26000
11 2003  small_scale_clearing             358000

Visualising amounts

Standard geometry: bar chart

  • geom_bar() counts items (takes one positional aesthetic, x or y).
  • geom_col() draws columns of a given height (takes two positional aesthetics, x and y).
Code
brazil_loss_long |>

  ggplot(aes(x = reasons, y = area_ha)) +
    geom_col() +
  labs(title = "Loss of Brazilian rain forest in 2003")
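
The difference between the two geoms can be checked on a toy table. This is a minimal sketch; the `toy` data frame and its values are made up for illustration and are not part of the Brazil data.

```r
library(ggplot2)

# Hypothetical toy data: five rows, two kinds of fruit
toy <- data.frame(
  fruit  = c("apple", "apple", "pear", "pear", "pear"),
  weight = c(10, 20, 5, 5, 5)
)

# geom_bar(): only x is mapped, the bar height is the number of rows per fruit
p_count <- ggplot(toy, aes(x = fruit)) +
  geom_bar()

# geom_col(): x and y are mapped, the bar height is the stacked sum of y
p_sum <- ggplot(toy, aes(x = fruit, y = weight)) +
  geom_col()

# Heights as computed by ggplot2
layer_data(p_count)$count      # 2 3
max(layer_data(p_sum)$ymax)    # 30 (apple: 10 + 20)
```

The Brazil data already hold one area value per reason, so geom_col() is the natural choice there.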

Visualising amounts

Make the axis readable by rotating the labels

Code
brazil_loss_long |>

  ggplot(aes(x = reasons, y = area_ha)) +
    geom_col(fill = "red") +
  guides(x =  guide_axis(angle = 45))  

Flip x and y to make labels readable!

Code
brazil_loss_long |>
  ggplot(aes(
    x = area_ha, y = reasons)) +
  geom_col() 

Sorting the chart

Code
brazil_loss_long |>

  ggplot(aes(y = fct_reorder(reasons, area_ha), 
             x = area_ha)) +
    geom_col() 

Technical column names

In exploratory analysis, bare column names or even function calls are acceptable as axis labels, as in this example.

For a final, presented product, consider changing them to a more readable form.
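
One possible way to get readable labels is sketched below. It assumes that replacing underscores and capitalising the first word is good enough; the `reasons_label` column and the axis title are illustrative choices, and the three rows are copied from the 2003 table shown earlier.

```r
library(dplyr)
library(stringr)
library(forcats)
library(ggplot2)

# Three rows copied from the 2003 table
loss_2003 <- data.frame(
  reasons = c("commercial_crops", "small_scale_clearing",
              "tree_plantations_including_palm"),
  area_ha = c(550000, 358000, 26000)
)

# Turn technical column values into readable labels
labelled <- loss_2003 |>
  mutate(reasons_label = reasons |>
           str_replace_all("_", " ") |>
           str_to_sentence())

labelled$reasons_label

# Use the readable labels and explicit axis titles in the plot
p_labelled <- labelled |>
  ggplot(aes(x = area_ha, y = fct_reorder(reasons_label, area_ha))) +
  geom_col() +
  labs(x = "Area lost (ha)", y = NULL)
```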

Summarizing

Code
brazil_loss_long |>
  ggplot(aes(y = fct_lump_n(reasons,
                            n = 5,
                            w = area_ha),
             x = area_ha)) +
  geom_col()

Sorting the summarized

Code
brazil_loss_long |>
  ggplot(aes(
    y =
      # Collapsing all but the five largest reasons (weighted by area)
      fct_lump_n(reasons, n = 5, w = area_ha) |>
      # Sorting by area
      fct_infreq(w = area_ha) |>
      # Reverse the sorting
      fct_rev(),
    x = area_ha,
    fill = reasons
  )) +
  geom_col() +
  labs(y = NULL) +
  theme(legend.position = "bottom")

Legend with all reasons

Sorting the lumped

Code
brazil_loss_long |>
  mutate(reasons_fct =       
      # Collapsing all but the five largest reasons (weighted by area)
      fct_lump_n(reasons, n = 5, w = area_ha) |>
      # Sorting by area
      fct_infreq(w = area_ha) |>
      # Reverse the sorting
      fct_rev() ) |> 
  ggplot(aes(
    y = reasons_fct,
    x = area_ha,
    fill = reasons_fct
  )) +
  geom_col() +
  labs(y = NULL) +
  theme(legend.position = "bottom")

Loss of pasture over the years

Basic plot

Code
brazil_loss |> 
  ggplot(aes(x = year, 
             y = pasture)) +
  geom_col(fill = "darkgreen")

Cutting the y-axis is cheating!

Code
brazil_loss |> 
  ggplot(aes( x = year, y = pasture)) +
  geom_col(fill = "darkgreen") +
  coord_cartesian(ylim = c(500000, 3000000))
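
The example uses coord_cartesian() rather than scale limits to cut the axis, and the two behave differently for bars. The sketch below contrasts them on three pasture values copied from these slides; the object names are illustrative.

```r
library(ggplot2)

# Pasture loss for three years, copied from the data shown in these slides
pasture <- data.frame(
  year    = c("2001", "2002", "2003"),
  pasture = c(1520000, 2568000, 2761000)
)

base_plot <- ggplot(pasture, aes(x = year, y = pasture)) +
  geom_col(fill = "darkgreen")

# Zooming: all data are kept and only the visible window changes,
# so the bars are drawn but their zero baseline is hidden
p_zoom <- base_plot +
  coord_cartesian(ylim = c(500000, 3000000))

# Scale limits: values outside the limits are replaced by NA before drawing.
# Every bar starts at 0, which lies outside the limits, so the bars disappear
# and ggplot2 warns about removed rows.
p_limits <- base_plot +
  scale_y_continuous(limits = c(500000, 3000000))
```

Either way, the bar lengths no longer represent the amounts, which is why the truncated bar chart is misleading.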

Points vs bars for amounts

Points are preferable to bars when the axis is truncated

Code
brazil_loss |>
  ggplot(aes(x = year, y = pasture)) +
  geom_point(color = "darkgreen", size = 3)

Area plot

Code
brazil_loss |>
  ggplot(aes(x = year, y = pasture, group = 1)) +
  geom_point(color = "darkgreen", 
             size = 1) +
  geom_line(color = "darkgreen", 
            alpha = 0.4) +
  geom_area(fill = "darkgreen", 
            alpha = 0.2)

Combining amounts and trends

Note

Alpha (transparency) is a simple tool to emphasize particular plot elements while staying within the same color scheme.

Area plot

Snake plot

brazil_loss |>
  # Order the years by pasture loss (w = weights), largest value on top
  ggplot(aes(y = fct_rev(fct_infreq(year, w = pasture)), x = pasture)) +
  geom_point(color = "red", size = 3)

Not ideal with years (ordinal category, technically numeric)

year  pasture
2001  1520000
2002  2568000
2003  2761000
2004  2564000
2005  2665000
2006  1861000
2007  1577000
2008  1345000
2009   847000
2010   616000
2011   738000
2012   546000
2013   695000

Grouped bar charts

Code
brazil_loss |>
  pivot_longer(
    cols = c(fire, small_scale_clearing,
      selective_logging, commercial_crops),
    names_to = "reasons",
    values_to = "area_ha" ) |>
  # year was read as character, so convert it explicitly before comparing
  filter(as.integer(year) > 2008) -> brazil_loss_long_select

ggplot(brazil_loss_long_select) +
  geom_col(aes(x = year,
               y = area_ha,
               fill = reasons))

Stacked area

Code
ggplot(brazil_loss_long_select) +
  geom_area(aes(x = year,
                y = area_ha,
                fill = reasons,
                group = reasons))

Simple dodged?

Code
ggplot(brazil_loss_long_select) +
  geom_col(aes(x = year,
               y = area_ha,
               fill = reasons),
           position = "dodge") +
  theme(legend.position = "bottom")

Not fantastic.

Grouped bar charts - rotate?

Code
ggplot(brazil_loss_long_select ) +
  geom_col(aes(x = area_ha, 
               y = reasons, 
               fill = year), 
           position = "dodge") +
    theme(legend.position = "bottom")

Meh.

Grouped bar charts - better use x?

Code
ggplot(brazil_loss_long_select ) +
  geom_col(aes(x = reasons, 
               y = area_ha, 
               fill = year), 
           position = "dodge") +
    theme(legend.position = "right") +
  guides(x =  guide_axis(angle = 90)) 

Ack!

Facets to the rescue

Code
ggplot(brazil_loss_long_select) +
  geom_col(aes(x = year,
               y = area_ha, 
               fill = reasons)) +
  facet_wrap(vars(reasons)) +
  theme(legend.position = "none")

Summary for stacked bar charts

Summary using line charts

Exploring a single column

Code
brazil_loss |> 
  ggplot(aes(x = pasture)) +
  geom_histogram()

Code
brazil_loss |> 
  ggplot(aes(x = pasture)) +
  geom_histogram(binwidth = 500000)

Complete set

Code
brazil_loss_long_complete |>
  ggplot(aes(x = year, y = area_ha, 
             fill = fct_lump(reasons, n = 5, w = area_ha))) +
  geom_col() +
  labs(title = "The five main reasons for deforestation in Brazil",
       fill = NULL) +
    guides(x = guide_axis(n.dodge = 2))

From exploratory plots to “products”

Code
brazil_loss_long_complete |>
  ggplot(aes(x = year, y = area_ha, 
             fill = fct_lump(reasons, n = 5, w = area_ha))) +
  geom_col() +
  labs(title = "The five main reasons for deforestation in Brazil",
       subtitle = "Area in hectares per year",
       fill = NULL,
       x = NULL,
       y = NULL) +
  facet_wrap(vars(fct_lump(reasons, n = 5, w = area_ha)), nrow = 3) +
  theme(legend.position = "none")  +
  guides(x = guide_axis(n.dodge = 2)) +
  scale_y_continuous(labels = scales::comma)

Small multiples

  • No year label (self-explanatory)
  • Changed y label to subtitle
  • Dodged labels rather than rotating them
  • Abbreviating the year might be less readable
  • Desired improvements missing
    • Color scheme
    • Names of reasons

Heat map

Code
brazil_loss_long_complete |>
  ggplot(aes(x = year, y = reasons)) +
  geom_tile(aes(fill = area_ha))

Heatmap improved

Dodged labels, square-root transformation and viridis color scheme

Code
brazil_loss_long_complete |>
  ggplot(aes(x = year, y = reasons)) +
  geom_tile(aes(fill = area_ha)) +
  scale_fill_continuous(trans = "sqrt", type = "viridis") +
  guides(x = guide_axis(n.dodge = 3))

Heatmap improvements

Suppressing some year labels and using a log transformation

Code
brazil_loss_long_complete |>
  ggplot(aes(x = year, y = reasons)) +
  geom_tile(aes(fill = area_ha)) +
  scale_fill_continuous(trans = "log10", type = "viridis") +
  scale_x_discrete(
    breaks = seq(from = 2001, to = 2013, by = 3)
  )
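
One caveat with the log transformation: some reasons are recorded as 0 ha (mining and flooding due to dams in 2003, for example), and the logarithm of zero is not finite. Such tiles fall outside the color gradient and are typically drawn in the scale's grey `na.value` color. The sketch below shows the effect on three values copied from the 2003 table. Converting zeros to NA is only one possible way to make this explicit, and an assumption of this example rather than part of the original slides.

```r
library(dplyr)

# Three values copied from the 2003 table; mining is recorded as 0 ha
loss <- data.frame(
  reasons = c("pasture", "fire", "mining"),
  area_ha = c(2761000, 44000, 0)
)

# The log of zero is -Inf, so it cannot be placed on a log10 color scale
log10(loss$area_ha)

# Possible workaround: mark zeros as missing before plotting
loss_na <- loss |>
  mutate(area_ha = na_if(area_ha, 0))

loss_na
```

The square-root transformation used on the previous slide does not have this problem, because the square root of zero is zero.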

Thank you for your attention!