Exercises: Introduction to ggplot2

Load the tidyverse packages into your environment.

library(tidyverse)

Exercise 1

Weight chart

Load the data from the weight_chart.txt file.

weight_chart <- read_delim("ggplot_data_files/weight_chart.txt")
head(weight_chart)

Draw a scatterplot (using geom_point) of the Age vs Weight. When defining your aesthetics the Age will be the x and Weight will be the y.

weight_chart |>
  ggplot(aes(x = Age, y = Weight)) +
  geom_point()

Colour all of the points blue2 by putting a fixed aesthetic into geom_point(), and give them a size of 3.

weight_chart |>
  ggplot(aes(x = Age, y = Weight)) +
  geom_point(col = "blue2", size = 3)

You will see that an obvious relationship exists between the two variables. Change the geometry to geom_line() to see another way to represent this plot.

weight_chart |>
  ggplot(aes(x = Age, y = Weight)) +
  geom_line(col="blue2", linewidth=3)

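Combine the two plots by adding both a geom_line() and a geom_point() geometry to show both the individual points and the overall trend. Layers are drawn in the order they are added, so the first version draws the line over the points and the second draws the points over the line.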

weight_chart |>
  ggplot(aes(x = Age, y = Weight)) +
  geom_point(col="blue", size=5) +
  geom_line(linewidth = 2)
  
weight_chart |>
  ggplot(aes(x = Age, y = Weight)) +
  geom_line(linewidth = 2) +
  geom_point(col="blue", size=5)

Chromosome position data

Load the data for the chromosome_position_data.txt file.

chr_pos <- read_delim("ggplot_data_files/chromosome_position_data.txt")
chr_pos

Use pivot_longer() to put the data into tidy format by combining the three data columns together.
The options to pivot_longer() will be:

The columns to restructure: cols = Mut1:WT
[Optional] The name of the new names column: names_to = "sample"
[Optional] The name of the values column: values_to = "value"

chr_pos <- chr_pos |>
  pivot_longer(cols = Mut1:WT, names_to = "sample")

chr_pos
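As a minimal sketch of what the reshaping does, here is pivot_longer() applied to a small made-up table. The toy_wide tibble and its values are invented for illustration and are not taken from chromosome_position_data.txt. values_to is written out explicitly here; leaving it out gives the same column name because "value" is the default.

```r
library(tidyverse)

# Made-up values, used only to show the reshaping
toy_wide <- tibble(
  Position = c(100, 200),
  Mut1     = c(1.5, 2.5),
  Mut2     = c(0.5, 1.0),
  WT       = c(3.0, 4.0)
)

# Each row of toy_wide becomes three rows, one per sample column
toy_long <- toy_wide |>
  pivot_longer(cols = Mut1:WT, names_to = "sample", values_to = "value")

toy_long
```

The two wide rows become six long rows, with the old column names stored in sample and the measurements in value.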

Draw a line graph with geom_line(), plotting the position (x = Position) against the value (y = value).
Split the samples by colour (colour = sample).
Use the linewidth attribute in geom_line() to make the lines slightly thicker than their default width.

chr_pos |>
  ggplot(aes(x = Position, y = value, colour = sample)) +
  geom_line(linewidth = 1)

If you have time…

Load in the genomes.csv file and use the separate function to turn the Groups column into Domain, Kingdom and Class based on a semicolon delimiter.

genomes <- read_delim("ggplot_data_files/genomes.csv")
genomes

genomes <- genomes |>
  separate(
    col  = Groups, 
    into = c("Domain", "Kingdom", "Class"), 
    sep  = ";"
  ) 

genomes
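To see the splitting in isolation, the sketch below runs separate() on two invented rows. The Organism column and the Groups strings are made up; only the Groups column name and the three target names come from the exercise. It also shows separate_wider_delim(), which newer versions of tidyr (1.3.0 onwards) offer as the replacement for separate(); whether it is available depends on the installed tidyr version.

```r
library(tidyverse)

# Invented rows, for illustration only
toy_genomes <- tibble(
  Organism = c("org_a", "org_b"),
  Groups   = c("Eukaryota;Fungi;Ascomycetes",
               "Bacteria;Proteobacteria;Gammaproteobacteria")
)

# Same call as in the exercise
toy_split <- toy_genomes |>
  separate(col = Groups, into = c("Domain", "Kingdom", "Class"), sep = ";")

# Alternative in tidyr >= 1.3.0
toy_split_wider <- toy_genomes |>
  separate_wider_delim(
    cols  = Groups,
    delim = ";",
    names = c("Domain", "Kingdom", "Class")
  )

toy_split
```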

Plot a point graph of log10(Size) vs Chromosomes and colour it by Domain.

genomes |>
  ggplot(aes(x = log10(Size), y = Chromosomes, colour = Domain)) +
  geom_point()

Exercise 2

Load the data from small_file.txt using read_delim()

small <- read_delim("ggplot_data_files/small_file.txt")
head(small)

Plot out a barplot of the lengths of each sample from category A.

Start by filtering the data to keep only the Category A samples
- small %>% filter(Category == "A")
Pass this filtered tibble to ggplot
Your x aesthetic will be Sample and your y will be Length
Since the value in the data is the bar height you need to use geom_col()

small %>%
  filter(Category == "A") %>%
  ggplot(aes(x = Sample, y = Length)) +
  geom_col()

Plot out a barplot (using geom_bar()) of the mean length for each category in small_file.txt.
You will need to set stat="summary", fun="mean" in geom_bar() so it plots the mean value.

small %>%
  ggplot(aes(x =  Category, y = Length)) +
  geom_bar(stat = "summary", fun = "mean")
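Setting stat = "summary" asks ggplot to calculate the mean for you. A rough equivalent, sketched here on made-up numbers (toy_small is not the real small_file.txt data), is to calculate the means yourself with summarise() and plot them with geom_col().

```r
library(tidyverse)

# Made-up lengths for two categories
toy_small <- tibble(
  Category = c("A", "A", "B", "B"),
  Length   = c(10, 20, 30, 50)
)

# One row per category holding the mean length
toy_means <- toy_small |>
  group_by(Category) |>
  summarise(mean_length = mean(Length))

# The bar heights are now the values in the data, so geom_col() is used
p_means <- toy_means |>
  ggplot(aes(x = Category, y = mean_length)) +
  geom_col()

p_means
```

Both routes should give bars of the same height; the precomputed version just makes the summary values visible as a table.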

Add a call to geom_jitter() to the last plot so you can also see the individual points

Colour the points by Category and decrease the width of the jitter columns to get better separation. Make sure height is set to 0.
If you don’t want to see the legend for the points then you can set show.legend=FALSE in geom_jitter().
If the legend is still present, it can be completely removed by adding another layer + theme(legend.position="none")

small %>%
  ggplot(aes(x = Category, y = Length, colour = Category)) +
  geom_bar(stat = "summary", fun = "mean") +
  geom_jitter(height = 0, width = 0.2, show.legend = FALSE) +
  theme(legend.position = "none")

Load the data from expression.txt using read_delim().

expression <- read_delim("ggplot_data_files/expression.txt")
head(expression)

Plot out the distribution of Expression values in this data. You can try both geom_histogram() and geom_density(). Try changing the colour and fill parameters to make the plot look prettier. In geom_histogram() try changing the binwidth parameter to alter the resolution of the distribution.

expression %>%
  ggplot(aes(Expression)) +
  geom_histogram(fill = "seagreen", colour = "blue2")

expression %>%
  ggplot(aes(Expression)) +
  geom_histogram(fill = "seagreen", colour = "blue2", binwidth = 0.4)

expression %>%
  ggplot(aes(Expression)) +
  geom_density(fill = "seagreen", colour = "blue2")

Load the data from cancer_stats.csv using read_delim().

cancer <- read_delim("ggplot_data_files/cancer_stats.csv")
cancer

Plot a barplot (geom_col()) of the number of Male deaths for all Sites.

Make sure you let the RStudio auto-complete help you to fill in the Male Deaths column name so you get the correct backtick quotes around it. You won’t be able to show all of the categories so just show the first 5.

cancer %>%
  slice(1:5) %>%
  ggplot(aes(x = Site, y = `Male Deaths`)) +
  geom_col()

If you have time…

Create a new variable in the data loaded from Child_Variants.csv using mutate() and if_else(). The value should be "Good" if QUAL == 200, otherwise it should be "Bad".

child <- read_delim("ggplot_data_files/Child_Variants.csv")

child <- child |>
  mutate(good_bad = if_else(QUAL == 200, "Good", "Bad"))

head(child)
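A small sketch of how if_else() behaves, using three invented QUAL values rather than the real file. Note that a missing QUAL gives a missing result rather than "Bad".

```r
library(tidyverse)

# Invented quality scores, including one missing value
toy_child <- tibble(QUAL = c(200, 150, NA))

toy_child <- toy_child |>
  mutate(good_bad = if_else(QUAL == 200, "Good", "Bad"))

toy_child
```

If missing values should be labelled as well, if_else() has a missing argument for that, for example if_else(QUAL == 200, "Good", "Bad", missing = "Bad").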

Plot out a violin plot, using geom_violin() of the MutantReads for the new Good or Bad category. It looks better on a log scale.

child %>%
  ggplot(aes(x = good_bad, y = MutantReads)) +
  geom_violin(fill="seagreen", colour="blue2")

child %>%
  ggplot(aes(x = good_bad, y = log2(MutantReads))) +
  geom_violin(fill="seagreen", colour="blue2")

Exercise 3

Use theme_set to set your ggplot theme to be theme_bw with a base_size of 12. Replot one of your earlier plots to see how its appearance changes.

theme_set(theme_bw(base_size = 12))
expression |>
  ggplot(aes(Expression)) +
  geom_density(fill = "seagreen", colour = "blue2")

In the cancer barplot you did in exercise 2 you had to exclude sites because you couldn’t show them on the x axis. Use the coord_flip transformation to switch the x and y axes so you can remove the slice function which restricted you to 5 sites, and show all of the sites again.

cancer |>
  ggplot(aes(x = Site, y = `Male Deaths`)) +
  geom_col() +
  coord_flip()

We could reorder the results by the data to make the plot clearer and remove the female-only cancers.

cancer |>
  drop_na(`Male Deaths`) |>
  ggplot(aes(x = reorder(Site, `Male Deaths`), y = `Male Deaths`)) +
  geom_col() +
  xlab("Site") +
  coord_flip()
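reorder() works by turning its first argument into a factor whose levels are sorted by the second argument. A minimal sketch with three invented sites and counts (not the real cancer_stats.csv values):

```r
# Invented site names and counts, for illustration only
sites  <- c("Lung", "Skin", "Colon")
deaths <- c(300, 50, 120)

# Levels are ordered from the smallest count to the largest
ordered_sites <- reorder(sites, deaths)

levels(ordered_sites)
# "Skin"  "Colon" "Lung"
```

ggplot places the first level nearest the origin, so after coord_flip() the smallest bar sits at the bottom and the largest at the top.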

Load the data from brain_bodyweight.txt

brain_body <- read_delim("ggplot_data_files/brain_bodyweight.txt")
head(brain_body)

Plot a scatterplot of the brain against the body
Change the axis labels (xlab and ylab) to say Brainweight (g) and Bodyweight (kg) and add a suitable title (ggtitle).
Use a method to display the data as log transformed on the plot (Options are: use log scale axes, log transform when creating the aesthetic mapping or mutate the data before passing it to ggplot)

brain_body |>
  mutate(Category = factor(Category, levels = c("Domesticated", "Wild", "Extinct"))) |>
  ggplot(aes(x = brain, y = body, colour = Category)) +
  geom_point(size = 4)+
  ggtitle("Brain vs Body weight")+
  xlab("Brainweight (g)") +
  ylab("Bodyweight (kg)") +
  scale_y_log10() +
  scale_x_log10() +
  scale_colour_brewer(palette = "Set1")
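The answer above uses log scale axes. The three options listed in the question can be sketched side by side on made-up numbers (toy_bb is invented; only the brain and body column names come from the exercise):

```r
library(tidyverse)

# Made-up weights spanning several orders of magnitude
toy_bb <- tibble(
  brain = c(1, 10, 100, 1000),
  body  = c(0.1, 1, 10, 100)
)

# Option 1: log scale axes (axis labels stay in the original units)
p_scale <- toy_bb |>
  ggplot(aes(x = brain, y = body)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10()

# Option 2: transform inside the aesthetic mapping
p_aes <- toy_bb |>
  ggplot(aes(x = log10(brain), y = log10(body))) +
  geom_point()

# Option 3: mutate the data before passing it to ggplot
p_mutate <- toy_bb |>
  mutate(log_brain = log10(brain), log_body = log10(body)) |>
  ggplot(aes(x = log_brain, y = log_body)) +
  geom_point()
```

The points land in the same relative positions in all three. The difference is in the axes: option 1 labels them with the original values, while options 2 and 3 label them with the log10 values.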

If you have time…

Create a barplot of the brainweight of all species, coloured by their bodyweight. Use a custom colour scheme for the colouring of the bars. You will again need to use a log scale for the brain and bodyweight.

brain_body %>%
  ggplot(aes(x=Species, y=brain, fill=log(body))) +
  geom_col() +
  coord_flip() +
  scale_fill_gradientn(colours=c("purple", "blue2", "green2","yellow", "red2"))

Exercise 4: Summary Overlays

treatments <- read_delim("ggplot_data_files/treatments.csv")
treatments

Plot a stripchart of the four conditions using geom_jitter()

treatments |>
  ggplot(aes(x = Sample, y = Measure)) +
  geom_jitter()

We need to make sure we’ve set height = 0 and we might also want to reduce the jitter width.

treatments |>
  ggplot(aes(x = Sample, y = Measure)) +
  geom_jitter(height = 0, width = 0.2)

Overlay a boxplot of the same data along with the raw points.

treatments |>
  ggplot(aes(x = Sample, y = Measure)) +
  geom_jitter(height = 0, width = 0.2) +
  geom_boxplot()

That could do with some improvements

Adjust the size and width of spread of the points in geom_jitter() to something sensible
Adjust the linewidth of the lines in the boxplot
Make sure geom_boxplot() is drawn first so you can see everything
Try colouring the points by the condition to see if it’s any clearer.

set.seed(99)
treatments |>
  ggplot(aes(x = Sample, y = Measure, colour = Sample)) +
  geom_boxplot(linewidth = 1, width = 0.6) +
  geom_jitter(height = 0, width = 0.3, size = 2)

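Outliers from a boxplot are shown as points. This means that for the TGX-221 sample, we’re showing the data point twice. We can set outliers = FALSE in geom_boxplot() to avoid this (the outliers argument was added in ggplot2 3.5.0; in older versions outlier.shape = NA hides the points instead).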

set.seed(99)
treatments |>
  ggplot(aes(x = Sample, y = Measure, colour = Sample)) +
  geom_boxplot(linewidth = 1, width = 0.6, outliers = FALSE) +
  geom_jitter(height = 0, width = 0.3, size = 2)
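To check which values a boxplot would flag, the sketch below applies the default rule by hand: a point is an outlier if it lies more than 1.5 times the interquartile range beyond the upper or lower quartile. The numbers are invented and are not the TGX-221 measurements.

```r
# Invented measurements with one extreme value
x <- c(10, 11, 12, 13, 14, 40)

# Lower and upper quartiles and the interquartile range
q   <- unname(quantile(x, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Default boxplot rule: beyond 1.5 * IQR from the quartiles
is_outlier <- x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr

x[is_outlier]
# 40
```

Any value picked out this way is drawn once by geom_boxplot() and again by geom_jitter() unless the boxplot outliers are switched off.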

If we want to change the boxplot colour back to a single colour we can set that as a fixed aesthetic.

set.seed(99)
treatments |>
  ggplot(aes(x = Sample, y = Measure, colour = Sample)) +
  geom_boxplot(linewidth = 1, width = 0.6, colour = "grey2", outliers = FALSE) +
  geom_jitter(height = 0, width = 0.3, size = 3) +
  theme(legend.position = "none")

Plot the same data as a barplot with errorbars for the SEM.

Use a geom_bar for the barplot with stat="summary" and then use stat_summary with a geometry of errorbar with the default mean_se values.

treatments |>
  ggplot(aes(x = Sample, y = Measure, colour = Sample)) +
  geom_bar(stat="summary") +
  stat_summary(geom="errorbar", width = 0.4, linewidth = 1.2) +
  geom_jitter(height = 0, width = 0.3, size = 2) +
  theme(legend.position = "none")
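When no summary function is given, stat = "summary" and stat_summary() fall back to mean_se(), which returns the mean plus and minus one standard error. A quick check against a hand calculation, using invented numbers:

```r
library(tidyverse)

# Invented measurements
x <- c(2, 4, 4, 6, 9)

# Standard error of the mean, calculated by hand
sem     <- sd(x) / sqrt(length(x))
by_hand <- c(mean(x), mean(x) - sem, mean(x) + sem)

# The same three values (y, ymin, ymax) from ggplot2
from_ggplot <- mean_se(x)

from_ggplot
```

The y column is the bar height and ymin / ymax are the ends of the error bar.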

If you have time…

Precompute the values

treatments_summary <- treatments %>%
  group_by(Sample) %>%
  summarise(mean = mean(Measure), stdev = sd(Measure))

treatments_summary

treatments_summary %>%
  ggplot(aes(x = Sample, y = mean, ymin = mean-stdev, ymax = mean+stdev)) +
  geom_col(fill = "#9AC1BA", colour = "#083F5E", linewidth = 2) +
  geom_errorbar(linewidth = 2, colour = "#083F5E", width = 0.3) +
  theme_classic()

Replot the stripchart, overlaying a bar for the mean

treatments |>
  ggplot(aes(x = Sample, y = Measure, colour=Sample)) +
  geom_jitter(height=0, width=0.2, show.legend = FALSE) +
  stat_summary(
    geom = "errorbar", 
    fun = mean, 
    fun.max = mean, 
    fun.min = mean,
    width = 0.4, 
    linewidth = 1, 
    col = "black"
  )

Exercise 5

Load the data in up_down_expression.txt

up_down <- read_delim("ggplot_data_files/up_down_expression.txt")
up_down

Plot out a scatterplot of Condition1 vs Condition2 coloured by State

up_down |>
  ggplot(aes(x = Condition1, y = Condition2, colour = State)) +
  geom_point()

Change the colouring using scale_colour_manual so that up is red, unchanging is grey and down is blue

up_down |>
  ggplot(aes(x = Condition1, y = Condition2, colour = State)) +
  geom_point() +
  scale_colour_manual(values = c("blue", "grey", "red"))
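An unnamed vector of colours is matched to the State values in alphabetical order, which is why blue comes first above. A named vector makes the pairing explicit and does not depend on the order. This is a sketch on three invented points; it assumes the State values are spelled exactly "up", "unchanging" and "down".

```r
library(tidyverse)

# Three invented genes, one per state
toy_states <- tibble(
  Condition1 = c(1, 2, 3),
  Condition2 = c(3, 2, 1),
  State      = c("up", "unchanging", "down")
)

# Names must match the values in State exactly
state_colours <- c(up = "red", unchanging = "grey", down = "blue")

p_states <- toy_states |>
  ggplot(aes(x = Condition1, y = Condition2, colour = State)) +
  geom_point() +
  scale_colour_manual(values = state_colours)

p_states
```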

Add text labels to the genes COL1A2, TCL1B, SPTSSB and SULF2. Filter the full dataset using up_down |> filter( Gene %in% c("COL1A2", "etc" )) and save the result. The gene names must match the case used in the data exactly.

genes_of_interest <- c("COL1A2", "TCL1B", "SPTSSB", "SULF2")

up_down_interesting <- up_down |>
  filter(Gene %in% genes_of_interest)

Pass the filtered dataset to the data option of geom_text. Make sure you added label = Gene to your aesthetic mappings.
Colour the labels black and use hjust=1.2 to position them to be readable away from the actual points, or use geom_text_repel from the ggrepel package to adjust the positions automatically.

Use geom_abline to put a null line across the diagonal (slope=1, intercept=0)

library(ggrepel)
up_down |>
  ggplot(aes(x = Condition1, y = Condition2, colour = State, label = Gene)) +
  geom_point(size = 1.5) +
  scale_colour_manual(values = c("blue", "grey", "red")) +
  theme(legend.position = "none") +
  geom_abline(slope = 1, intercept = 0, colour="darkgrey", linewidth=1) +
  geom_text_repel(data = up_down_interesting, col = "black", box.padding = 1)

Load the data from DownloadFestival.csv using read_csv().

festival <- read_csv("ggplot_data_files/DownloadFestival.csv")

head(festival)

Draw a stripchart of cleanliness for males and females separately.

festival %>%
  ggplot(aes(x = gender, y = cleanliness)) +
  geom_jitter(height = 0, width = 0.3, alpha = 0.5)

Use facet_grid(cols = vars(day)) to split the plot based on the day of the festival to see the effect this has on the data

Make some additions to the plot
- Colour the male samples red and the female samples blue
- Add a line to show the mean by using a stat summary

festival %>%
  ggplot(aes(x = gender, y = cleanliness, colour = gender)) +
  geom_jitter(height = 0, width = 0.3, alpha = 0.5, stroke = NA, show.legend = FALSE) +
  scale_colour_manual(values = c("blue2", "red2")) +
  stat_summary(
    geom      = "errorbar", 
    fun       = mean, 
    fun.max   = mean, 
    fun.min   = mean, 
    colour    = "darkgrey", 
    linewidth = 2
  ) +
  facet_grid(cols = vars(day))

If you have time…

Add a new column called attendance to the data to say how many days people attended the festival. To do this you will need to:

Group by the person
Use count to get a count of how many times each person occurred
Use right_join to merge the counts back into the original data
Rename the n column to attendance

festival_attendance <- festival %>%
  group_by(person) %>%
  count() %>%
  ungroup() %>%
  right_join(festival) %>%
  rename(attendance = n)
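The counting and joining steps can be followed on a small invented table. The person, day and cleanliness column names come from the exercise, but the values below are made up. The sketch names the count column directly with the name argument of count(), and also shows add_count(), which does the count and the merge in one step.

```r
library(tidyverse)

# Invented records: p1 attended 3 days, p2 attended 1, p3 attended 2
toy_festival <- tibble(
  person      = c("p1", "p1", "p1", "p2", "p3", "p3"),
  day         = c(1, 2, 3, 1, 1, 2),
  cleanliness = c(2.5, 1.5, 1.0, 3.0, 2.0, 1.2)
)

# Count rows per person, then merge the counts back onto every row
toy_attendance <- toy_festival |>
  count(person, name = "attendance") |>
  right_join(toy_festival, by = "person")

# Shortcut: add the per-person count as a new column in one step
toy_add_count <- toy_festival |>
  add_count(person, name = "attendance")

toy_attendance
```

Both versions give every row the number of days its person attended; only the column order differs.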

Redraw the plot faceting by both attendance and day

festival_attendance %>%
  ggplot(aes(x = gender, y = cleanliness, colour = gender)) +
  geom_jitter(height = 0, width = 0.3, alpha = 0.5, stroke = NA, show.legend = FALSE) +
  scale_colour_manual(values = c("blue2", "red2")) +
  stat_summary(
    geom      = "errorbar", 
    fun       = mean, 
    fun.max   = mean, 
    fun.min   = mean, 
    colour    = "darkgrey", 
    linewidth = 2
  ) +
  facet_grid(cols = vars(day), rows = vars(attendance))