Getting started with R

Exercises for Day 1

Author

Tyler McInnes

Published

November 26, 2025

1 Setting up a new project

1.1 Create a new project for exercises

Note

This project will be used for this exercise only. Give it a name that reflects this. Remember the conventions for naming style.

I strongly recommend not using an LLM for this exercise. Figuring this out on your own is one of the most critical lessons for the day.

Download this template to the project folder.

1.2 Set up

For this exercise we will need two packages: tidyverse and nycflights13. Load the two packages. What is different between loading tidyverse and nycflights13?

Remember the importance of documentation: use hash symbols to create a section in the .qmd file, use code chunks, document your work.

The nycflights13 package will quietly (in the background, not visible in your Environment) load a data object called “flights”. Use one of the functions we have learned today to view the flights data.

2 Exploratory analysis

A very common situation you will find yourself in during research: you get some data, but you don’t know what format it is in, what it’s meant to look like, or what questions you can ask of it.

Use functions we have encountered today to answer the following questions:

2.1 How much data is in flights?

How would you measure data in a table?

2.2 What type of data is stored?

Look at the column names of the data (colnames()). What do dep_time, arr_time, and air_time mean? With these three variables we can carry out a simple but important test - can you identify what that test is?

  • We can perform a similarly important test using just dep_time and arr_time.


2.3 Hypothesis generation

Given the data variables in flights, propose two hypotheses that we could test with this dataset.

3 Exploring values and distributions of numerical variables

3.1 Explore categorical variables

Instead of passing all the data in an object to a function we can specify a single variable with $. Use flights$variable_name to pass a single variable to a function.

e.g.,

flights$carrier |> unique()
 [1] "UA" "AA" "B6" "DL" "EV" "MQ" "US" "WN" "VX" "FL" "AS" "9E" "F9" "HA" "YV"
[16] "OO"
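The same pattern can be tried on a small made-up table first. This is a minimal sketch: `toy_flights` and its values are hypothetical and are not part of nycflights13. It assumes dplyr (loaded as part of tidyverse) is installed, since that is where n_distinct() comes from.

```r
library(dplyr)  # provides n_distinct()

# Hypothetical toy data, not taken from nycflights13
toy_flights <- data.frame(
  carrier  = c("UA", "AA", "UA", "DL"),
  distance = c(1400, 1089, 719, 762)
)

# $ pulls a single column out of the table as a vector
carriers <- toy_flights$carrier

# unique() returns each distinct value once
unique_carriers <- unique(carriers)

# n_distinct() counts how many distinct values there are
carrier_count <- n_distinct(carriers)

print(unique_carriers)
print(carrier_count)
```

Here unique() answers "which values are there?" while n_distinct() answers "how many are there?".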

Use the functions unique() and n_distinct() to answer:

3.2 What are the locations flights are leaving from?

3.3 How many destinations are these flights leaving for?

3.4 Use the functions range(), mean(), and median() to gauge the level of departure delays

Note

We will need to add na.rm = TRUE as an argument within range(), mean(), and median().

If we do not pass na.rm = TRUE as an argument, the functions will detect the presence of NAs in the data and return NA.
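The effect of na.rm can be seen on a short made-up vector. This is a minimal sketch: `toy_delays` and its values are hypothetical, chosen only so that one value is missing.

```r
# Hypothetical delays in minutes; one value is missing
toy_delays <- c(-3, 12, NA, 45, 0)

# Without na.rm = TRUE the missing value propagates to the result
mean_with_na <- mean(toy_delays)

# With na.rm = TRUE the NA is dropped before the calculation
mean_no_na   <- mean(toy_delays, na.rm = TRUE)
median_no_na <- median(toy_delays, na.rm = TRUE)
range_no_na  <- range(toy_delays, na.rm = TRUE)

print(mean_with_na)  # NA
print(mean_no_na)    # 13.5
print(median_no_na)  # 6
print(range_no_na)   # -3 45
```

Note that na.rm = TRUE silently drops the missing values, so it is worth knowing how many there are before relying on the result.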

4 Exploratory analysis with ggplot

4.1 Use ggplot to look at associations between two variables.

4.2 Can you use ggplot to test the hypotheses you generated above?

Note: My hypothesis

Here’s a plot to test my hypothesis. Can you identify my hypothesis?

ggplot(flights,
       aes(x = dep_time,
           y = dep_delay)) +
  geom_point()
Warning: Removed 8255 rows containing missing values or values outside the scale range
(`geom_point()`).

4.3 Extra message

Depending on your choice of variables, your plot (like the plot above) may produce a message that looks like this:

Warning: Removed 8255 rows containing missing values or values outside the scale range
(`geom_point()`).

What type of message is this? What does this message mean? What steps, if any, do you recommend taking in response?