---
title: "ggplot2_exercise_answers"
description: Worked answers for the exercises in the ggplot course
author: Simon Andrews, Laura Biggins
author-affiliation: Babraham Institute
title-block-banner: true
format: 
  html:
    toc: true
    df-print: paged
editor: visual
embed-resources: true
---

# Exercises: Introduction to ggplot2

Load the tidyverse packages into your environment.

```{r}
#| include: FALSE
library(knitr)
knitr::opts_chunk$set(message = FALSE, warning = FALSE, fig.width= 6, fig.height= 4)
```

```{r}
library(tidyverse)
```

## Exercise 1

### Weight chart

Load the data from the `weight_chart.txt` file.

```{r}
#| message: FALSE
weight_chart <- read_delim("ggplot_data_files/weight_chart.txt")
head(weight_chart)
```

Draw a scatterplot (using `geom_point()`) of `Age` vs `Weight`. When defining your aesthetics, `Age` will be the x and `Weight` will be the y.

```{r}
weight_chart |>
  ggplot(aes(x = Age, y = Weight)) +
  geom_point()
```

Colour all of the points `blue2` by putting a fixed aesthetic into `geom_point()`, and give them a size of 3.

```{r}
weight_chart |>
  ggplot(aes(x = Age, y = Weight)) +
  geom_point(col = "blue2", size = 3)
```

You will see that an obvious relationship exists between the two variables. Change the geometry to `geom_line()` to see another way to represent this plot.

```{r}
weight_chart |>
  ggplot(aes(x = Age, y = Weight)) +
  geom_line(col="blue2", linewidth=3)
```

Combine the two plots by adding both a `geom_line()` and a `geom_point()` geometry to show both the individual points and the overall trend.

```{r}
#| label: fig-weight-chart
#| fig-cap:
#|   - "line on top of points"
#|   - "line behind"
#| layout-ncol: 2
#| column: page
#| fig-width: 3.5
#| fig-height: 2.5

weight_chart |>
  ggplot(aes(x = Age, y = Weight)) +
  geom_point(col="blue", size=5) +
  geom_line(linewidth = 2)
  
weight_chart |>
  ggplot(aes(x = Age, y = Weight)) +
  geom_line(linewidth = 2) +
  geom_point(col="blue", size=5)
```

### Chromosome position data

Load the data from the `chromosome_position_data.txt` file.

```{r}
chr_pos <- read_delim("ggplot_data_files/chromosome_position_data.txt")
chr_pos
```

Use `pivot_longer()` to put the data into tidy format by combining the three data columns together.\
The options to `pivot_longer()` will be:

-   The columns to restructure: `cols = Mut1:WT`

-   \[Optional\] The name of the new names column: `names_to = "sample"`

-   \[Optional\] The name of the values column: `values_to = "value"`

```{r}
chr_pos <- chr_pos |>
  pivot_longer(cols = Mut1:WT, names_to = "sample")

chr_pos
```
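
To see what `pivot_longer()` does to the shape of the data, here is a minimal sketch on a made-up three-row table. The tibble `toy_wide` and its values are invented for illustration; only the column names mirror the exercise.

```r
library(tidyverse)

# Hypothetical wide table: one row per position, one column per sample
toy_wide <- tibble(
  Position = c(100, 200, 300),
  Mut1     = c(1.2, 2.4, 0.8),
  Mut2     = c(1.0, 2.9, 0.7),
  WT       = c(0.5, 0.6, 0.4)
)

# Stack the three sample columns into name/value pairs
toy_long <- toy_wide |>
  pivot_longer(cols = Mut1:WT, names_to = "sample", values_to = "value")

toy_long
```

Each of the 3 original rows becomes 3 rows (one per sample), so the result has 9 rows and the columns `Position`, `sample` and `value`.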

Draw a line graph `geom_line()` to plot the position `x = Position` against the value `y = value`.\
Split the samples by colour `colour = sample`.\
Use the `linewidth` attribute in `geom_line()` to make the lines slightly thicker than their default width.

```{r}
chr_pos |>
  ggplot(aes(x = Position, y = value, colour = sample)) +
  geom_line(linewidth = 1)
```

### If you have time...

Load in the `genomes.csv` file and use the `separate()` function to turn the `Groups` column into `Domain`, `Kingdom` and `Class` based on a semicolon delimiter.

```{r}
genomes <- read_delim("ggplot_data_files/genomes.csv")
genomes
```

```{r}
genomes <- genomes |>
  separate(
    col  = Groups, 
    into = c("Domain", "Kingdom", "Class"), 
    sep  = ";"
  ) 

genomes
```
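
In tidyr 1.3.0 and later, `separate()` is marked as superseded in favour of `separate_wider_delim()`. The following is a minimal sketch of the equivalent call on a made-up two-row table; `toy_genomes` and its contents are invented for illustration and are not taken from `genomes.csv`.

```r
library(tidyverse)

# Hypothetical data with a semicolon-delimited Groups column
toy_genomes <- tibble(
  Organism = c("Organism A", "Organism B"),
  Groups   = c("Eukaryota;Animals;Mammals", "Bacteria;Proteobacteria;Gammaproteobacteria")
)

# Split Groups into three named columns at each semicolon
toy_split <- toy_genomes |>
  separate_wider_delim(
    cols  = Groups,
    delim = ";",
    names = c("Domain", "Kingdom", "Class")
  )

toy_split
```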

Plot a point graph of `log10(Size)` vs `Chromosomes` and colour it by `Domain`.

```{r}
genomes |>
  ggplot(aes(x = log10(Size), y = Chromosomes, colour = Domain)) +
  geom_point()
```

## Exercise 2

Load the data from `small_file.txt` using `read_delim()`

```{r}
small <- read_delim("ggplot_data_files/small_file.txt")
head(small)
```

Plot out a barplot of the lengths of each sample from category A.

-   Start by filtering the data to keep only Category A samples

    -   `small %>% filter(Category == "A")`

-   Pass this filtered tibble to `ggplot`

-   Your x aesthetic will be `Sample` and your y will be `Length`

-   Since the value in the data is the bar height you need to use `geom_col()`

```{r}
small %>%
  filter(Category == "A") %>%
  ggplot(aes(x = Sample, y = Length)) +
  geom_col()
```

Plot out a barplot (using `geom_bar()`) of the mean length for each category in `small_file.txt`.\
You will need to set `stat="summary", fun="mean"` in `geom_bar()` so it plots the mean value.

```{r}
small %>%
  ggplot(aes(x =  Category, y = Length)) +
  geom_bar(stat = "summary", fun = "mean")
```
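
As a rough illustration of what `stat = "summary", fun = "mean"` is doing, the sketch below computes the per-category means explicitly and then plots them with `geom_col()`. The `toy_lengths` data are invented; the column names are assumed to match the exercise file.

```r
library(tidyverse)

# Hypothetical measurements in two categories
toy_lengths <- tibble(
  Category = c("A", "A", "B", "B"),
  Length   = c(10, 20, 30, 50)
)

# Compute the per-category mean explicitly...
toy_means <- toy_lengths |>
  group_by(Category) |>
  summarise(mean_length = mean(Length))

toy_means

# ...then draw the precomputed heights with geom_col()
toy_means |>
  ggplot(aes(x = Category, y = mean_length)) +
  geom_col()
```

Both approaches should give bars of the same height; letting `geom_bar()` do the summary simply saves the intermediate table.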

Add a call to `geom_jitter()` to the last plot so you can also see the individual points.

-   Colour the points by `Category` and decrease the width of the jitter columns to get better separation. Make sure `height` is set to 0.

-   If you don't want to see the legend for the points then you can set `show.legend = FALSE` in `geom_jitter()`.

-   If the legend is still present, it can be completely removed by adding another layer: `+ theme(legend.position = "none")`.

```{r}
small %>%
  ggplot(aes(x = Category, y = Length, colour = Category)) +
  geom_bar(stat = "summary", fun = "mean") +
  geom_jitter(height = 0, width = 0.2, show.legend = FALSE) +
  theme(legend.position = "none")
```

Load the data from `expression.txt` using `read_delim()`.

```{r}
expression <- read_delim("ggplot_data_files/expression.txt")
head(expression)
```

Plot out the distribution of Expression values in this data. You can try both `geom_histogram()` and `geom_density()`. Try changing the colour and fill parameters to make the plot look prettier. In `geom_histogram()` try changing the `binwidth` parameter to alter the resolution of the distribution.

```{r}
#| fig-cap:
#|   - "histogram"
#|   - "decreasing the binwidth"
#| layout-ncol: 2
#| column: page
#| fig-width: 3.5
#| fig-height: 2.5

expression %>%
  ggplot(aes(Expression)) +
  geom_histogram(fill = "seagreen", colour = "blue2")

expression %>%
  ggplot(aes(Expression)) +
  geom_histogram(fill = "seagreen", colour = "blue2", binwidth = 0.4)
```

```{r}
exp_plot <- expression %>%
  ggplot(aes(Expression)) +
  geom_density(fill = "seagreen", colour = "blue2")

# Print the saved plot object so the density plot is actually drawn
exp_plot
```

Load the data from `cancer_stats.csv` using `read_delim()`.

```{r}
cancer <- read_delim("ggplot_data_files/cancer_stats.csv")
cancer
```

Plot a barplot (`geom_col()`) of the number of Male deaths for all Sites.

Make sure you let the RStudio auto-complete help you to fill in the Male Deaths column name so you get the correct backtick quotes around it. You won't be able to show all of the categories so just show the first 5.

```{r}
cancer %>%
  slice(1:5) %>%
  ggplot(aes(x = Site, y = `Male Deaths`)) +
  geom_col()
```

### If you have time...

Load `Child_Variants.csv` into a variable called `child` and create a new column, `good_bad`, using `mutate()` and `if_else()`. The value should be `Good` if `QUAL == 200`, otherwise it should be `Bad`.

```{r}
child <- read_delim("ggplot_data_files/Child_Variants.csv")

child <- child |>
  mutate(good_bad = if_else(QUAL == 200, "Good", "Bad"))

head(child)
```

Plot out a violin plot, using `geom_violin()` of the `MutantReads` for the new Good or Bad category. It looks better on a log scale.

```{r}
#| fig-cap:
#|   - "linear values"
#|   - "log2 transformed"
#| layout-ncol: 2
#| column: page
#| fig-width: 3.5
#| fig-height: 2.5

child %>%
  ggplot(aes(x = good_bad, y = MutantReads)) +
  geom_violin(fill="seagreen", colour="blue2")

child %>%
  ggplot(aes(x = good_bad, y = log2(MutantReads))) +
  geom_violin(fill="seagreen", colour="blue2")
```

## Exercise 3

Use `theme_set` to set your ggplot theme to be `theme_bw` with a `base_size` of 12. Replot one of your earlier plots to see how its appearance changes.

```{r}
theme_set(theme_bw(base_size = 12))
expression |>
  ggplot(aes(Expression)) +
  geom_density(fill = "seagreen", colour = "blue2")

```

In the cancer barplot from Exercise 2 you had to exclude sites because you couldn’t show them all on the x axis. Use the `coord_flip()` transformation to switch the x and y axes. You can then remove the `slice()` call which restricted you to 5 sites, and show all of the sites again.

```{r}
#| fig-width: 4
#| fig-height: 5
cancer |>
  ggplot(aes(x = Site, y = `Male Deaths`)) +
  geom_col() +
  coord_flip()
```

We could reorder the results by the data to make the plot clearer and remove the female-only cancers.

```{r}
#| fig-width: 5
#| fig-height: 5
cancer |>
  drop_na(`Male Deaths`) |>
  ggplot(aes(x = reorder(Site, `Male Deaths`), y = `Male Deaths`)) +
  geom_col() +
  xlab("Site") +
  coord_flip()
```

Load the data from `brain_bodyweight.txt`.

```{r}
brain_body <- read_delim("ggplot_data_files/brain_bodyweight.txt")
head(brain_body)
```

-   Plot a scatterplot of the brain against the body

-   Change the axis labels (xlab and ylab) to say Brainweight (g) and Bodyweight (kg) and add a suitable title (ggtitle).

-   Use a method to display the data as log transformed on the plot (Options are: use log scale axes, log transform when creating the aesthetic mapping or mutate the data before passing it to ggplot)

```{r}
#| fig-width: 7
#| fig-height: 5

brain_body |>
  mutate(Category = factor(Category, levels = c("Domesticated", "Wild", "Extinct"))) %>%
  ggplot(aes(x = brain, y = body, colour = Category)) +
  geom_point(size = 4)+
  ggtitle("Brain vs Body weight")+
  xlab("Brainweight (g)") +
  ylab("Bodyweight (kg)") +
  scale_y_log10() +
  scale_x_log10() +
  scale_colour_brewer(palette = "Set1")
```
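
The three log-transformation options listed above can be sketched side by side. This is a minimal illustration on invented data (`toy_weights` is hypothetical); the answer above uses the first option.

```r
library(tidyverse)

# Hypothetical brain/body weights spanning several orders of magnitude
toy_weights <- tibble(
  brain = c(1, 10, 100, 1000),
  body  = c(0.1, 1, 10, 1000)
)

# Option 1: log scale axes (tick labels stay in the original units)
toy_weights |>
  ggplot(aes(x = brain, y = body)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10()

# Option 2: transform inside the aesthetic mapping (axes show log10 values)
toy_weights |>
  ggplot(aes(x = log10(brain), y = log10(body))) +
  geom_point()

# Option 3: mutate the data before passing it to ggplot
toy_logged <- toy_weights |>
  mutate(log_brain = log10(brain), log_body = log10(body))

toy_logged |>
  ggplot(aes(x = log_brain, y = log_body)) +
  geom_point()
```

All three place the points in the same relative positions. The difference is in the axis labelling: only the log scale axes keep the original units on the ticks.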

### If you have time...

Create a barplot of the brainweight of all species, coloured by their bodyweight. Use a custom colour scheme for the colouring of the bars. You will again need to use a log scale for the brain and bodyweight.

```{r}
#| fig-width: 5
#| fig-height: 5
brain_body %>%
  ggplot(aes(x=Species, y=brain, fill=log(body))) +
  geom_col() +
  coord_flip() +
  scale_fill_gradientn(colours=c("purple", "blue2", "green2","yellow", "red2"))
```

## Exercise 4: Summary Overlays

```{r}
treatments <- read_delim("ggplot_data_files/treatments.csv")
treatments
```

Plot a stripchart of the four conditions using `geom_jitter()`.

```{r}
treatments |>
  ggplot(aes(x = Sample, y = Measure)) +
  geom_jitter()
```

We need to make sure we've set `height = 0` and we might also want to reduce the jitter width.

```{r}
treatments |>
  ggplot(aes(x = Sample, y = Measure)) +
  geom_jitter(height = 0, width = 0.2)
```

Overlay a boxplot of the same data along with the raw points.  

```{r}
treatments |>
  ggplot(aes(x = Sample, y = Measure)) +
  geom_jitter(height = 0, width = 0.2) +
  geom_boxplot()
```

That could do with some improvements:

-   Adjust the size and width of spread of the points in `geom_jitter()` to something sensible

-   Adjust the `linewidth` of the lines in the boxplot

-   Make sure `geom_boxplot()` is drawn first so you can see everything

-   Try colouring the points by the condition to see if it's any clearer.

```{r}
set.seed(99)
treatments |>
  ggplot(aes(x = Sample, y = Measure, colour = Sample)) +
  geom_boxplot(linewidth = 1, width = 0.6) +
  geom_jitter(height = 0, width = 0.3, size = 2) 
```

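Outliers from a boxplot are shown as points. This means that for the TGX-221 sample, we're showing the data point twice. We can set `outliers = FALSE` in `geom_boxplot()` to avoid this (this argument requires ggplot2 3.5.0 or later; in older versions `outlier.shape = NA` hides the outlier points).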

```{r}
set.seed(99)
treatments |>
  ggplot(aes(x = Sample, y = Measure, colour = Sample)) +
  geom_boxplot(linewidth = 1, width = 0.6, outliers = FALSE) +
  geom_jitter(height = 0, width = 0.3, size = 2) 
```

If we want to change the boxplot colour back to a single colour we can set that as a fixed aesthetic.

```{r}
set.seed(99)
treatments |>
  ggplot(aes(x = Sample, y = Measure, colour = Sample)) +
  geom_boxplot(linewidth = 1, width = 0.6, colour = "grey2", outliers = FALSE) +
  geom_jitter(height = 0, width = 0.3, size = 3) +
  theme(legend.position = "none")
```

Plot the same data as a barplot with errorbars for the SEM.

Use `geom_bar()` for the barplot with `stat = "summary"`, and then use `stat_summary()` with a geometry of `errorbar` and the default `mean_se` values.

```{r}
treatments |>
  ggplot(aes(x = Sample, y = Measure, colour = Sample)) +
  geom_bar(stat="summary") +
  stat_summary(geom="errorbar", width = 0.4, linewidth = 1.2) +
  geom_jitter(height = 0, width = 0.3, size = 2) +
  theme(legend.position = "none")
```
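
To make the default explicit: `mean_se()` returns the mean together with the mean minus and plus one standard error. The sketch below checks this by hand on a made-up vector (`toy_measure` is invented for illustration).

```r
library(tidyverse)

# Hypothetical measurements for one sample
toy_measure <- c(4, 6, 8, 10)

# SEM = standard deviation / sqrt(number of observations)
toy_sem <- sd(toy_measure) / sqrt(length(toy_measure))

# mean_se() returns the mean (y) and the mean -/+ one SEM (ymin, ymax)
toy_summary <- mean_se(toy_measure)
toy_summary

# The same limits computed by hand
c(mean(toy_measure) - toy_sem, mean(toy_measure) + toy_sem)
```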

### If you have time...

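Precompute the mean and standard deviation for each sample, then plot the bars and error bars from the summarised values.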

```{r}
treatments_summary <- treatments %>%
  group_by(Sample) %>%
  summarise(mean = mean(Measure),stdev = sd(Measure))

treatments_summary
```

```{r}
treatments_summary %>%
  ggplot(aes(x = Sample, y = mean, ymin = mean-stdev, ymax = mean+stdev)) +
  geom_col(fill = "#9AC1BA", colour = "#083F5E", linewidth = 2) +
  geom_errorbar(linewidth = 2, colour = "#083F5E", width = 0.3) +
  theme_classic()
```

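Replot the stripchart, overlaying a bar for the mean. Setting `fun`, `fun.min` and `fun.max` all to `mean` collapses the error bar into a single horizontal line at the mean.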

```{r}
treatments |>
  ggplot(aes(x = Sample, y = Measure, colour=Sample)) +
  geom_jitter(height=0, width=0.2, show.legend = FALSE) +
  stat_summary(
    geom = "errorbar", 
    fun = mean, 
    fun.max = mean, 
    fun.min = mean,
    width = 0.4, 
    linewidth = 1, 
    col = "black"
  ) 

```

## Exercise 5

Load the data in `up_down_expression.txt`

```{r}
up_down <- read_delim("ggplot_data_files/up_down_expression.txt")
up_down
```

Plot out a scatterplot of `Condition1` vs `Condition2` coloured by `State`.

```{r}
up_down |>
  ggplot(aes(x = Condition1, y = Condition2, colour = State)) +
  geom_point()
```

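Change the colouring using `scale_colour_manual()` so that up is red, unchanging is grey and down is blue. Unnamed values are assigned to the `State` levels in order, which for character data is alphabetical.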

```{r}
up_down |>
  ggplot(aes(x = Condition1, y = Condition2, colour = State)) +
  geom_point() +
  scale_colour_manual(values = c("blue", "grey", "red"))
```
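
Relying on the order of the levels can be fragile. A sketch of a more explicit alternative is to pass a named vector, so each colour is tied to a value of `State` by name. The `toy_states` data are invented, and the labels `"up"`, `"down"` and `"unchanging"` are assumed to match the values in the exercise file.

```r
library(tidyverse)

# Hypothetical data; the State labels are assumed to match the exercise file
toy_states <- tibble(
  Condition1 = c(1, 2, 3),
  Condition2 = c(3, 1, 2),
  State      = c("up", "down", "unchanging")
)

# Naming each colour ties it to a State value regardless of level order
state_colours <- c(down = "blue", unchanging = "grey", up = "red")

toy_plot <- toy_states |>
  ggplot(aes(x = Condition1, y = Condition2, colour = State)) +
  geom_point() +
  scale_colour_manual(values = state_colours)

toy_plot
```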

Add text labels to the genes COL1A2, TCL1B, SPTSSB and SULF2. Gene names are matched exactly, so the case matters.
Filter the full dataset using `up_down |> filter(Gene %in% c("COL1A2", "etc"))` and save the result.

```{r}
genes_of_interest <- c("COL1A2", "TCL1B", "SPTSSB", "SULF2")

up_down_interesting <- up_down |>
  filter(Gene %in% genes_of_interest)
```

Pass the filtered dataset to the `data` option of `geom_text()`.  Make sure you add `label = Gene` to your aesthetic mappings.  
Colour the labels black and use `hjust = 1.2` to position them so they are readable away from the actual points, or use `geom_text_repel()` from the `ggrepel` package to adjust the positions automatically.

Use `geom_abline()` to put a null line across the diagonal `(slope = 1, intercept = 0)`.

```{r}
library(ggrepel)
up_down |>
  ggplot(aes(x = Condition1, y = Condition2, colour = State, label = Gene)) +
  geom_point(size = 1.5) +
  scale_colour_manual(values = c("blue", "grey", "red")) +
  theme(legend.position = "none") +
  geom_abline(slope = 1, intercept = 0, colour="darkgrey", linewidth=1) +
  geom_text_repel(data = up_down_interesting, col = "black", box.padding = 1)
```

Load the data from `DownloadFestival.csv` using `read_csv()`.

```{r}
festival <- read_csv("ggplot_data_files/DownloadFestival.csv") 

head(festival)
```

Draw a stripchart of cleanliness for males and females separately.

```{r}
festival %>%
  ggplot(aes(x = gender, y = cleanliness)) +
  geom_jitter(height = 0, width = 0.3, alpha = 0.5)
  
```

Use `facet_grid(cols = vars(day))` to split the plot based on the day of the festival to see the effect this has on the data.

Make some additions to the plot:

-   Colour the male samples red and the female samples blue

-   Add a line to show the mean by using a stat summary

```{r}
festival %>%
  ggplot(aes(x = gender, y = cleanliness, colour = gender)) +
  geom_jitter(height = 0, width = 0.3, alpha = 0.5, stroke = NA, show.legend = FALSE) +
  scale_colour_manual(values = c("blue2", "red2")) +
  stat_summary(
    geom      = "errorbar", 
    fun       = mean, 
    fun.max   = mean, 
    fun.min   = mean, 
    colour    = "darkgrey", 
    linewidth = 2
  ) +
  facet_grid(cols = vars(day))
```

### If you have time...

Add a new column called `attendance` to the data to say how many days each person attended the festival. To do this you will need to:

1.  Group by the person
2.  Use `count()` to get a count of how many times each person occurred
3.  Use `right_join()` to merge the counts back into the original data
4.  Rename the `n` column to `attendance`

```{r}
festival_attendance <- festival %>%
  group_by(person) %>%
  count() %>%
  ungroup() %>%
  right_join(festival) %>%
  rename(attendance = n) 
```
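
As an aside, dplyr's `add_count()` can do the same count, join and rename in one step. This is a sketch of that alternative on invented data (`toy_festival` is hypothetical), not a replacement for the steps the exercise asks for.

```r
library(tidyverse)

# Hypothetical festival records: person 1 attended two days, person 2 one day
toy_festival <- tibble(
  person      = c(1, 1, 2),
  day         = c("day1", "day2", "day1"),
  cleanliness = c(2.5, 1.5, 3.0)
)

# add_count() appends the per-person row count without a separate join
toy_attendance <- toy_festival |>
  add_count(person, name = "attendance")

toy_attendance
```

The original rows and their order are kept, with `attendance` added as a new last column.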

Redraw the plot faceting by both attendance and day

```{r}
#| fig-width: 6
#| fig-height: 6
festival_attendance %>%
  ggplot(aes(x = gender, y = cleanliness, colour = gender)) +
  geom_jitter(height = 0, width = 0.3, alpha = 0.5, stroke = NA, show.legend = FALSE) +
  scale_colour_manual(values = c("blue2", "red2")) +
  stat_summary(
    geom      = "errorbar", 
    fun       = mean, 
    fun.max   = mean, 
    fun.min   = mean, 
    colour    = "darkgrey", 
    linewidth = 2
  ) +
  facet_grid(cols = vars(day), rows = vars(attendance))
```