Exercises: Introduction to ggplot2
Load the tidyverse packages into your environment.
library(tidyverse)
Exercise 1
Weight chart
Load the data from the weight_chart.txt file.
weight_chart <- read_delim("ggplot_data_files/weight_chart.txt")
head(weight_chart)
Draw a scatterplot (using geom_point) of Age vs Weight. When defining your aesthetics, Age will be the x and Weight will be the y.
weight_chart |>
  ggplot(aes(x = Age, y = Weight)) +
  geom_point()
Colour all of the points blue2 by putting a fixed aesthetic into geom_point(), and give them a size of 3.
weight_chart |>
  ggplot(aes(x = Age, y = Weight)) +
  geom_point(col = "blue2", size = 3)
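The difference between a fixed aesthetic and a mapped one can be sketched on a small made-up tibble. The `toy` data and its `Group` column are assumptions for illustration and are not part of weight_chart.txt. A value set outside aes() applies to every point, whereas a column named inside aes() is passed through a colour scale.

```r
library(ggplot2)
library(dplyr)

# Made-up data for illustration only
toy <- tibble(
  Age    = c(1, 2, 3, 4),
  Weight = c(4.5, 5.6, 6.4, 7.0),
  Group  = c("a", "a", "b", "b")
)

# Fixed aesthetic: set outside aes(), so every point is blue2 with size 3
p_fixed <- toy |>
  ggplot(aes(x = Age, y = Weight)) +
  geom_point(colour = "blue2", size = 3)

# Mapped aesthetic: set inside aes(), so colour varies with the Group column
p_mapped <- toy |>
  ggplot(aes(x = Age, y = Weight, colour = Group)) +
  geom_point(size = 3)

p_fixed
p_mapped
```

Only the mapped version produces a legend, because only there does the colour carry information from the data.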
You will see that an obvious relationship exists between the two variables. Change the geometry to geom_line() to see another way to represent this plot.
weight_chart |>
  ggplot(aes(x = Age, y = Weight)) +
  geom_line(col = "blue2", linewidth = 3)
Combine the two plots by adding both a geom_line() and a geom_point() geometry to show both the individual points and the overall trend.
weight_chart |>
  ggplot(aes(x = Age, y = Weight)) +
  geom_point(col = "blue", size = 5) +
  geom_line(linewidth = 2)
Layers are drawn in the order they are added, so adding geom_point() last draws the points on top of the line.
weight_chart |>
  ggplot(aes(x = Age, y = Weight)) +
  geom_line(linewidth = 2) +
  geom_point(col = "blue", size = 5)
Chromosome position data
Load the data from the chromosome_position_data.txt file.
chr_pos <- read_delim("ggplot_data_files/chromosome_position_data.txt")
chr_pos
Use pivot_longer() to put the data into tidy format by combining the three data columns together.
The options to pivot_longer() will be:
- The columns to restructure: cols = Mut1:WT
- [Optional] The name of the new names column: names_to = "sample"
- [Optional] The name of the values column: values_to = "value"
chr_pos <- chr_pos |>
  pivot_longer(cols = Mut1:WT, names_to = "sample")
chr_pos
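What pivot_longer() does to the shape of the data is easier to see on a tiny table. This is a minimal sketch on made-up numbers; only the column names (Position, Mut1, Mut2, WT) are borrowed from the exercise, and the values are invented. The default for values_to is already "value", which is why the solution above can leave it out.

```r
library(dplyr)
library(tidyr)

# Made-up wide data: one row per position, one column per sample
wide <- tibble(
  Position = c(100, 200),
  Mut1     = c(1.2, 1.5),
  Mut2     = c(0.8, 0.9),
  WT       = c(2.1, 2.4)
)

# Stack the three sample columns into a names column and a values column
long <- wide |>
  pivot_longer(cols = Mut1:WT, names_to = "sample", values_to = "value")

long
```

Each original row becomes three rows, one per sample, which is the layout ggplot needs in order to map sample to colour.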
Draw a line graph (geom_line()) to plot the position (x = Position) against the value (y = value).
Split the samples by colour (colour = sample).
Use the linewidth attribute in geom_line() to make the lines slightly thicker than their default width.
chr_pos |>
  ggplot(aes(x = Position, y = value, colour = sample)) +
  geom_line(linewidth = 1)
If you have time…
Load in the genomes.csv file and use the separate function to turn the Groups column into Domain, Kingdom and Class based on a semicolon delimiter.
genomes <- read_delim("ggplot_data_files/genomes.csv")
genomes
genomes <- genomes |>
  separate(
    col = Groups,
    into = c("Domain", "Kingdom", "Class"),
    sep = ";"
  )
genomes
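As a small illustration of the splitting step, here is a sketch on a made-up tibble. The `Organism` column and the placeholder strings are assumptions and do not come from genomes.csv. By default separate() drops the original column once it has been split.

```r
library(dplyr)
library(tidyr)

# Made-up data with a semicolon-delimited Groups column
toy_genomes <- tibble(
  Organism = c("organism_1", "organism_2"),
  Groups   = c("DomainA;KingdomA;ClassA", "DomainB;KingdomB;ClassB")
)

# Split Groups into three new columns at each semicolon
split_genomes <- toy_genomes |>
  separate(
    col = Groups,
    into = c("Domain", "Kingdom", "Class"),
    sep = ";"
  )

split_genomes
```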
Plot a point graph of log10(Size) vs Chromosomes and colour it by Domain.
genomes |>
  ggplot(aes(x = log10(Size), y = Chromosomes, colour = Domain)) +
  geom_point()
Exercise 2
Load the data from small_file.txt using read_delim().
small <- read_delim("ggplot_data_files/small_file.txt")
head(small)
Plot out a barplot of the lengths of each sample from category A.
- Start by filtering the data to keep only the Category A samples: small %>% filter(Category == "A")
- Pass this filtered tibble to ggplot
- Your x aesthetic will be Sample and your y will be Length
- Since the value in the data is the bar height you need to use geom_col()
small %>%
  filter(Category == "A") %>%
  ggplot(aes(x = Sample, y = Length)) +
  geom_col()
Plot out a barplot (using geom_bar()) of the mean length for each category in small_file.txt.
You will need to set stat="summary", fun="mean" in geom_bar() so it plots the mean value.
small %>%
  ggplot(aes(x = Category, y = Length)) +
  geom_bar(stat = "summary", fun = "mean")
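To make clear what stat = "summary" is doing, the sketch below draws the same bars two ways on made-up data (the `toy` tibble and its values are assumptions): once letting ggplot compute the means, and once computing them first with group_by() and summarise() and drawing the result with geom_col().

```r
library(ggplot2)
library(dplyr)

# Made-up data: two measurements per category
toy <- tibble(
  Category = c("A", "A", "B", "B"),
  Length   = c(10, 20, 30, 50)
)

# Let ggplot2 compute the mean per Category when drawing
p_summary <- toy |>
  ggplot(aes(x = Category, y = Length)) +
  geom_bar(stat = "summary", fun = "mean")

# Compute the same means beforehand, then draw them as-is with geom_col()
toy_means <- toy |>
  group_by(Category) |>
  summarise(mean_length = mean(Length))

p_precomputed <- toy_means |>
  ggplot(aes(x = Category, y = mean_length)) +
  geom_col()

toy_means
```

Both plots show bars of the same height. The summary approach keeps the raw data in the plot, which is what allows individual points to be overlaid later.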
Add a call to geom_jitter() to the last plot so you can also see the individual points
- Colour the points by Category and decrease the width of the jitter columns to get better separation. Make sure height is set to 0.
- If you don’t want to see the legend for the points then you can set show.legend=FALSE in geom_jitter().
- If the legend is still present, it can be completely removed by adding another layer + theme(legend.position="none")
small %>%
  ggplot(aes(x = Category, y = Length, colour = Category)) +
  geom_bar(stat = "summary", fun = "mean") +
  geom_jitter(height = 0, width = 0.2, show.legend = FALSE) +
  theme(legend.position = "none")
Load the data from expression.txt using read_delim().
expression <- read_delim("ggplot_data_files/expression.txt")
head(expression)
Plot out the distribution of Expression values in this data. You can try both geom_histogram() and geom_density(). Try changing the colour and fill parameters to make the plot look prettier. In geom_histogram() try changing the binwidth parameter to alter the resolution of the distribution.
expression %>%
  ggplot(aes(Expression)) +
  geom_histogram(fill = "seagreen", colour = "blue2")
expression %>%
  ggplot(aes(Expression)) +
  geom_histogram(fill = "seagreen", colour = "blue2", binwidth = 0.4)
exp_plot <- expression %>%
  ggplot(aes(Expression)) +
  geom_density(fill = "seagreen", colour = "blue2")
Load the data from cancer_stats.csv using read_delim().
cancer <- read_delim("ggplot_data_files/cancer_stats.csv")
cancer
Plot a barplot (geom_col()) of the number of Male deaths for all Sites.
Make sure you let the RStudio auto-complete help you to fill in the Male Deaths column name so you get the correct backtick quotes around it. You won’t be able to show all of the categories so just show the first 5.
cancer %>%
  slice(1:5) %>%
  ggplot(aes(x = Site, y = `Male Deaths`)) +
  geom_col()
If you have time…
Load Child_Variants.csv into child and create a new variable in it using mutate and if_else. The value should be Good if QUAL == 200, otherwise it should be Bad.
child <- read_delim("ggplot_data_files/Child_Variants.csv")
child <- child |>
  mutate(good_bad = if_else(QUAL == 200, "Good", "Bad"))
head(child)
Plot out a violin plot, using geom_violin(), of the MutantReads for the new Good or Bad category. It looks better on a log scale.
child %>%
  ggplot(aes(x = good_bad, y = MutantReads)) +
  geom_violin(fill = "seagreen", colour = "blue2")
child %>%
  ggplot(aes(x = good_bad, y = log2(MutantReads))) +
  geom_violin(fill = "seagreen", colour = "blue2")
Exercise 3
Use theme_set to set your ggplot theme to be theme_bw with a base_size of 12. Replot one of your earlier plots to see how its appearance changes.
theme_set(theme_bw(base_size = 12))
expression |>
  ggplot(aes(Expression)) +
  geom_density(fill = "seagreen", colour = "blue2")
In the cancer barplot you did in Exercise 2 you had to exclude sites because you couldn’t show them on the x axis. Use the coord_flip transformation to switch the x and y axes so you can remove the slice function which restricted you to 5 sites, and show all of the sites again.
cancer |>
  ggplot(aes(x = Site, y = `Male Deaths`)) +
  geom_col() +
  coord_flip()
We could reorder the results by the data to make the plot clearer and remove the female-only cancers.
cancer |>
  drop_na(`Male Deaths`) |>
  ggplot(aes(x = reorder(Site, `Male Deaths`), y = `Male Deaths`)) +
  geom_col() +
  xlab("Site") +
  coord_flip()
Load the data from brain_bodyweight.txt
brain_body <- read_delim("ggplot_data_files/brain_bodyweight.txt")
head(brain_body)
Plot a scatterplot of the brain against the body
Change the axis labels (xlab and ylab) to say Brainweight (g) and Bodyweight (kg) and add a suitable title (ggtitle).
Use a method to display the data as log transformed on the plot (Options are: use log scale axes, log transform when creating the aesthetic mapping or mutate the data before passing it to ggplot)
brain_body |>
  mutate(Category = factor(Category, levels = c("Domesticated", "Wild", "Extinct"))) %>%
  ggplot(aes(x = brain, y = body, colour = Category)) +
  geom_point(size = 4) +
  ggtitle("Brain vs Body weight") +
  xlab("Brainweight (g)") +
  ylab("Bodyweight (kg)") +
  scale_y_log10() +
  scale_x_log10() +
  scale_colour_brewer(palette = "Set1")
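The three ways of log transforming mentioned above can be compared side by side. This is a minimal sketch on made-up values (the `toy` tibble and the `log_brain` and `log_body` column names are assumptions). All three put the points in the same positions; they differ in what the axes show.

```r
library(ggplot2)
library(dplyr)

# Made-up data spanning several orders of magnitude
toy <- tibble(
  brain = c(1, 10, 100),
  body  = c(0.1, 1, 10)
)

# Option 1: log scale axes; tick labels stay in the original units
p_scale <- toy |>
  ggplot(aes(x = brain, y = body)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10()

# Option 2: transform inside the aesthetic mapping; axes show log10 values
p_aes <- toy |>
  ggplot(aes(x = log10(brain), y = log10(body))) +
  geom_point()

# Option 3: mutate the data first, then plot the new columns
p_mutate <- toy |>
  mutate(log_brain = log10(brain), log_body = log10(body)) |>
  ggplot(aes(x = log_brain, y = log_body)) +
  geom_point()

p_scale
```

Log scale axes are often the easiest to read, because the tick labels remain in grams and kilograms rather than in log units.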
If you have time…
Create a barplot of the brainweight of all species, coloured by their bodyweight. Use a custom colour scheme for the colouring of the bars. You will again need to use a log scale for the brain and bodyweight.
brain_body %>%
  ggplot(aes(x = Species, y = brain, fill = log(body))) +
  geom_col() +
  coord_flip() +
  scale_fill_gradientn(colours = c("purple", "blue2", "green2", "yellow", "red2"))
Exercise 4: Summary Overlays
treatments <- read_delim("ggplot_data_files/treatments.csv")
treatments
Plot a stripchart of the four conditions using geom_jitter()
treatments |>
  ggplot(aes(x = Sample, y = Measure)) +
  geom_jitter()
We need to make sure we’ve set height = 0 and we might also want to reduce the jitter width.
treatments |>
  ggplot(aes(x = Sample, y = Measure)) +
  geom_jitter(height = 0, width = 0.2)
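Why height = 0 matters can be checked by looking at the positions ggplot actually draws. The sketch below uses made-up data (the `toy` tibble is an assumption) and layer_data() to pull out the jittered coordinates: the y values are left exactly as measured, and only x is nudged, by at most the chosen width.

```r
library(ggplot2)
library(dplyr)

# Made-up measurements for two samples
toy <- tibble(
  Sample  = c("A", "A", "B", "B"),
  Measure = c(1.0, 2.0, 3.0, 4.0)
)

set.seed(1)  # jitter is random, so fix the seed for a repeatable plot
p_jitter <- toy |>
  ggplot(aes(x = Sample, y = Measure)) +
  geom_jitter(height = 0, width = 0.2)

# Extract the coordinates that will be drawn for the first layer
jittered <- layer_data(p_jitter)
jittered[, c("x", "y")]
```

Leaving height at its default would also move the points vertically, so they would no longer sit at their measured values.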
Overlay a boxplot of the same data along with the raw points.
treatments |>
  ggplot(aes(x = Sample, y = Measure)) +
  geom_jitter(height = 0, width = 0.2) +
  geom_boxplot()
That could do with some improvements
- Adjust the size and width of spread of the points in geom_jitter() to something sensible
- Adjust the linewidth of the lines in the boxplot
- Make sure geom_boxplot() is drawn first so you can see everything
- Try colouring the points by the condition to see if it’s any clearer.
set.seed(99)
treatments |>
  ggplot(aes(x = Sample, y = Measure, colour = Sample)) +
  geom_boxplot(linewidth = 1, width = 0.6) +
  geom_jitter(height = 0, width = 0.3, size = 2)
Outliers from a boxplot are shown as points. This means that for the TGX-221 sample, we’re showing the data point twice. We can set outliers = FALSE in geom_boxplot() to avoid this.
#set.seed(99)
treatments |>
  ggplot(aes(x = Sample, y = Measure, colour = Sample)) +
  geom_boxplot(linewidth = 1, width = 0.6, outliers = FALSE) +
  geom_jitter(height = 0, width = 0.3, size = 2)
If we want to change the boxplot colour back to a single colour we can set that as a fixed aesthetic.
set.seed(99)
treatments |>
  ggplot(aes(x = Sample, y = Measure, colour = Sample)) +
  geom_boxplot(linewidth = 1, width = 0.6, colour = "grey2", outliers = FALSE) +
  geom_jitter(height = 0, width = 0.3, size = 3) +
  theme(legend.position = "none")
Plot the same data as a barplot with errorbars for the SEM.
Use a geom_bar for the barplot with stat="summary" and then use stat_summary with a geometry of errorbar with the default mean_se values.
treatments |>
  ggplot(aes(x = Sample, y = Measure, colour = Sample)) +
  geom_bar(stat = "summary") +
  stat_summary(geom = "errorbar", width = 0.4, linewidth = 1.2) +
  geom_jitter(height = 0, width = 0.3, size = 2) +
  theme(legend.position = "none")
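The default summary used here, mean_se(), can be called directly to see what the bar and error bar represent. This is a small sketch on a made-up vector (the values in `x` are assumptions): it returns the mean as y, with ymin and ymax one standard error either side.

```r
library(ggplot2)

# Made-up measurements
x <- c(2, 4, 6, 8)

# What stat_summary() computes by default for each group
se_summary <- mean_se(x)
se_summary

# The same limits worked out by hand: mean plus or minus sd / sqrt(n)
sem <- sd(x) / sqrt(length(x))
c(mean(x) - sem, mean(x) + sem)
```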
If you have time…
Precompute the values
treatments_summary <- treatments %>%
  group_by(Sample) %>%
  summarise(mean = mean(Measure), stdev = sd(Measure))
treatments_summary
treatments_summary %>%
  ggplot(aes(x = Sample, y = mean, ymin = mean - stdev, ymax = mean + stdev)) +
  geom_col(fill = "#9AC1BA", color = "#083F5E", linewidth = 2) +
  geom_errorbar(linewidth = 2, colour = "#083F5E", width = 0.3) +
  theme_classic()
Replot the stripchart, overlaying a bar for the mean
treatments |>
  ggplot(aes(x = Sample, y = Measure, colour = Sample)) +
  geom_jitter(height = 0, width = 0.2, show.legend = FALSE) +
  stat_summary(
    geom = "errorbar",
    fun = mean,
    fun.max = mean,
    fun.min = mean,
    width = 0.4,
    linewidth = 1,
    col = "black"
  )
Exercise 5
Load the data in up_down_expression.txt
up_down <- read_delim("ggplot_data_files/up_down_expression.txt")
up_down
Plot out a scatterplot of Condition1 vs Condition2 coloured by State
up_down |>
  ggplot(aes(x = Condition1, y = Condition2, colour = State)) +
  geom_point()
Change the colouring using scale_colour_manual so that up is red, unchanging is grey and down is blue
up_down |>
  ggplot(aes(x = Condition1, y = Condition2, colour = State)) +
  geom_point() +
  scale_colour_manual(values = c("blue", "grey", "red"))
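An unnamed values vector is matched to the categories in their level order, which for a text column is alphabetical (down, unchanging, up), so blue, grey, red works only because of that ordering. A named vector makes the pairing explicit. The sketch below uses made-up data and assumes the State labels are spelled exactly "up", "unchanging" and "down"; check the real file before copying the names.

```r
library(ggplot2)
library(dplyr)

# Made-up data; the State labels are assumed, not taken from the real file
toy <- tibble(
  Condition1 = c(1, 2, 3),
  Condition2 = c(3, 2, 1),
  State      = c("up", "unchanging", "down")
)

# Names tie each colour to a category, regardless of level order
state_colours <- c(up = "red", unchanging = "grey", down = "blue")

p_named <- toy |>
  ggplot(aes(x = Condition1, y = Condition2, colour = State)) +
  geom_point() +
  scale_colour_manual(values = state_colours)

p_named
```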
Add text labels to the genes COL1A2, TCL1B, SPTSSB, SULF2. Filter the full dataset using up_down |> filter(Gene %in% c("COL1A2", "etc")) and save the result.
genes_of_interest <- c("COL1A2", "TCL1B", "SPTSSB", "SULF2")
up_down_interesting <- up_down |>
  filter(Gene %in% genes_of_interest)
Pass the filtered dataset to the data option of geom_text. Make sure you add label = Gene to your aesthetic mappings.
Colour the labels black and use hjust=1.2 to position them to be readable away from the actual points, or use geom_text_repel from the ggrepel package to adjust the positions automatically.
Use geom_abline to put a null line across the diagonal (slope=1, intercept=0)
library(ggrepel)
up_down |>
  ggplot(aes(x = Condition1, y = Condition2, colour = State, label = Gene)) +
  geom_point(size = 1.5) +
  scale_colour_manual(values = c("blue", "grey", "red")) +
  theme(legend.position = "none") +
  geom_abline(slope = 1, intercept = 0, colour = "darkgrey", linewidth = 1) +
  geom_text_repel(data = up_down_interesting, col = "black", box.padding = 1)
festival <- read_csv("ggplot_data_files/DownloadFestival.csv")
head(festival)
Draw a stripchart of cleanliness for males and females separately.
festival %>%
  ggplot(aes(x = gender, y = cleanliness)) +
  geom_jitter(height = 0, width = 0.3, alpha = 0.5)
Use facet_grid(cols = vars(day)) to split the plot based on the day of the festival to see the effect this has on the data.
Make some additions to the plot
- Colour the male samples red and the female samples blue
- Add a line to show the mean by using a stat summary
festival %>%
  ggplot(aes(x = gender, y = cleanliness, colour = gender)) +
  geom_jitter(height = 0, width = 0.3, alpha = 0.5, stroke = NA, show.legend = FALSE) +
  scale_colour_manual(values = c("blue2", "red2")) +
  stat_summary(
    geom = "errorbar",
    fun = mean,
    fun.max = mean,
    fun.min = mean,
    colour = "darkgrey",
    linewidth = 2
  ) +
  facet_grid(cols = vars(day))
If you have time…
Add a new column called attendance to the data to say how many days people attended the festival. To do this you will need to:
- Group by the person
- Use count to get a count of how many times each person occurred
- Use right_join to merge the counts back into the original data
- Rename the n column to attendance
festival_attendance <- festival %>%
  group_by(person) %>%
  count() %>%
  ungroup() %>%
  right_join(festival) %>%
  rename(attendance = n)
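The count-then-join steps can be followed on a tiny made-up table. The `toy_festival` tibble and its values are assumptions; only the column names `person`, `day` and `cleanliness` are borrowed from the exercise. For comparison the sketch also shows dplyr's add_count(), which appears to reach the same result in one step; treat it as an alternative to try rather than the expected answer.

```r
library(dplyr)

# Made-up data: person 1 attended three days, person 2 attended one
toy_festival <- tibble(
  person      = c(1, 1, 1, 2),
  day         = c(1, 2, 3, 1),
  cleanliness = c(2.5, 1.8, 1.1, 3.0)
)

# Count rows per person, then merge the counts back onto every row
with_join <- toy_festival |>
  count(person) |>
  right_join(toy_festival, by = "person") |>
  rename(attendance = n)

# Alternative: add the per-person count as a new column directly
with_add_count <- toy_festival |>
  add_count(person, name = "attendance")

with_join
```

Giving by = "person" explicitly avoids the message that the join prints when it has to guess the shared column.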
Redraw the plot faceting by both attendance and day
festival_attendance %>%
  ggplot(aes(x = gender, y = cleanliness, colour = gender)) +
  geom_jitter(height = 0, width = 0.3, alpha = 0.5, stroke = NA, show.legend = FALSE) +
  scale_colour_manual(values = c("blue2", "red2")) +
  stat_summary(
    geom = "errorbar",
    fun = mean,
    fun.max = mean,
    fun.min = mean,
    colour = "darkgrey",
    linewidth = 2
  ) +
  facet_grid(cols = vars(day), rows = vars(attendance))