Project
Finding a data set
The subject does not need to be scientific; a “commercial data analysis question” or data you are personally invested in is just as suitable.
You can use resources like Kaggle or Tidy Tuesday to find suitable data sets. Ideally, you use two sources and intersect or compare their data to create something unique.
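Intersecting two sources usually means joining them on a column they share. The following is a minimal sketch using only the Python standard library; the function name `intersect_on_key`, the column names and the toy values are all made up for illustration and do not come from any real data set.

```python
def intersect_on_key(rows_a, rows_b, key):
    """Inner-join two lists of dict rows on a shared key column.

    Only rows whose key value appears in both sources are kept.
    """
    index_b = {row[key]: row for row in rows_b}
    joined = []
    for row in rows_a:
        match = index_b.get(row[key])
        if match is not None:
            # Merge the columns of both sources into one row.
            joined.append({**row, **match})
    return joined


# Toy rows standing in for two sources (hypothetical values).
sales = [
    {"region": "A", "sales": 10},
    {"region": "B", "sales": 20},
    {"region": "C", "sales": 30},
]
visitors = [
    {"region": "B", "visitors": 200},
    {"region": "C", "visitors": 150},
    {"region": "D", "visitors": 50},
]

combined = intersect_on_key(sales, visitors, "region")

# A derived column that neither source offers on its own.
for row in combined:
    row["sales_per_visitor"] = row["sales"] / row["visitors"]
```

In R, a join from dplyr would play the same role; the point is that the combined table supports questions neither source can answer alone.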
You are free to choose a data set that you have already worked on, but you have to include a statement about the source of the data set and any help you have received.
Ideally, finding an interesting angle for the exploratory data analysis should not take much time.
Exploratory plots
Exploratory plots demonstrate aspects of the data sets such as distributions, ratios and other common statistics that can be used to describe data. Model building can be employed but is not required.
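Before plotting, it can help to compute the numbers a plot is meant to show. The sketch below uses only Python's standard `statistics` and `collections` modules; the helper names `describe` and `category_shares` are hypothetical, and any plotting library can take the results from there.

```python
import statistics
from collections import Counter


def describe(values):
    """Return common descriptive statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def category_shares(labels):
    """Return the share (ratio) of each category among all labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy inputs for illustration only.
summary = describe([3, 5, 5, 6, 8, 9])
shares = category_shares(["a", "a", "b", "c"])
```

The `summary` dictionary corresponds to what a histogram or box plot would display, and `shares` to what a bar chart of proportions would display.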
Final products
The final visualisation product can be either a poster, a presentation or a website.
It should contain a section on the exploration of data and a “final product” that is publication ready.
There should be three major plotting efforts, such as a
- multi-panel figure,
- complex chart,
- Shiny application,
- flowchart or
- scientific or technical illustration that is not created in code (at most one).
You should also include a table summarizing some of the data.
You do not need to supply an in-depth written analysis, but figures and tables require captions.

Submission
You need to submit a shared Git repository that can be run on the instructor’s machine. You can choose any programmable graphics library, e.g. matplotlib or ggplot2, and use any programming language as needed. R and ggplot2 are the choices best supported by the instructors.
The repository should contain at least two documents: one demonstrating the exploratory work and one for the final product.
Ensure that data loading is implemented in a reproducible way. Do not share data if it can be readily accessed online.
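One way to make data loading reproducible without committing the data is to download it on first use, cache it locally and verify a checksum. The following is a minimal sketch using the Python standard library; the function name `load_data`, the cache path and the URL in the usage comment are assumptions for illustration, not part of the course requirements.

```python
import hashlib
import urllib.request
from pathlib import Path


def load_data(url, cache_path, expected_sha256=None):
    """Return the bytes at `url`, downloading only if no cached copy exists.

    If `expected_sha256` is given, the content is checked against it so
    that a silently changed upstream file is noticed.
    """
    cache_path = Path(cache_path)
    if not cache_path.exists():
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        with urllib.request.urlopen(url) as response:
            cache_path.write_bytes(response.read())

    data = cache_path.read_bytes()
    if expected_sha256 is not None:
        actual = hashlib.sha256(data).hexdigest()
        if actual != expected_sha256:
            raise ValueError(
                f"checksum mismatch for {cache_path}: got {actual}"
            )
    return data


# Usage with a placeholder URL (hypothetical, replace with the real source):
# raw = load_data("https://example.org/data.csv", "data/raw/data.csv")
```

Listing the cache directory in `.gitignore` keeps the downloaded files out of the repository while the loading code stays runnable on another machine.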
Learners are free to include code from others, but any such code needs to be marked with its source. It will not be evaluated and does not count towards the three efforts. Learners will explain their code during the course.
Data resources
- Kaggle - Machine learning data sets and code
- Tidy Tuesday - A community with weekly data sets and their analyses
- Project Tycho - Analysis-ready health data