This document provides worked answers for all of the exercises in the Advanced R with Tidyverse course.

Exercise 1

Reading data into tibbles

First we are going to load the main tidyverse library. This will cause the loading of the main data loading and manipulation packages from the tidyverse

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Now we can import our first file. The small_file.txt file is a tab delimited file, so we load it with read_delim().

small_file <- read_delim("Advanced_R_Data/small_file.txt")

Error: Could not guess the delimiter.

Use `vroom(delim =)` to specify one explicitly.

This fails initially because of the presence of header lines in the file which means that readr can’t work out the structure. We can see this if we look at the file in a text editor.

We can fix the file import by telling it to ignore comments, either setting skip=2 to skip the first two lines or by using the comment argument.

small_file <- read_delim(
  "Advanced_R_Data/small_file.txt", 
  comment = "#"
)

Rows: 40 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): Sample, Category
dbl (1): Length

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

small_file

The data in the Sample and Category columns has been stored as Character type. If we want to force the Category column to be a factor then we use the col_types argument. We only need to specify the column we want to change.

small_file <- read_delim(
  "Advanced_R_Data/small_file.txt",
  comment="#",
  col_types = "?f?"
)

small_file <- read_delim(
  "Advanced_R_Data/small_file.txt",
  comment="#",
  col_types = "cfn"  # we could specify each column if we wanted to
) 

small_file

Now we can see that Category is a factor, which doesn’t have any immediate effect but can be useful later when plotting.

Now we can load the larger Child_Variants.csv file.

child_variants <- read_delim("Advanced_R_Data/Child_Variants.csv")

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

Rows: 25822 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): CHR, dbSNP, REF, ALT, GENE, ENST
dbl (5): POS, QUAL, MutantReads, COVERAGE, MutantReadPercent

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We do get errors reported.

problems(child_variants)

If we look at the rows where the problems are reported, we can see that there are some NA values for Coverage. We haven’t yet seen how to select specific rows but we could scroll to these lines in the viewer.

Row numbers

If you have tried extracting the problem rows using the indices in the table you may have noticed that the row numbers are off by one. In the tibble it’s row numbers 9937 and 13379-13381 that contain the NA values. The header line gets counted in the row numbers shown by the problems function.

We’ll cover the slice function shortly - it’s been used in the code below to extract the relevant rows.

child_variants |>
  slice(c(9937, 13379:13381)) |>
  select(-(1:7))

If you have time…

Create a geom_point() plot from the child_variants data plotting the chromosome against the coverage.

child_variants |>
  ggplot(aes(x=CHR, y=COVERAGE)) +
  geom_point()

Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_point()`).

We get a warning about the four NA values, but we know about those and can ignore this error for now.

Exercise 2 - Filtering and Selecting with dplyr

We’re going to read in the genomes.csv file.

genomes <- read_delim("Advanced_R_Data/genomes.csv")

Rows: 47211 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Organism, Groups
dbl (5): Size, Chromosomes, Organelles, Plasmids, Assemblies

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(genomes)

Make a list of the 5 organisms with the largest genomes

(Arrange genomes by descending Size and then extract the first 5 rows with slice())

genomes %>%
  arrange(desc(Size)) %>%
  slice(1:5)

Now we’re going to do some basic filtering.

Find the organisms with more than 40 chromosomes.

We can sort by Organism after filtering, though this dataset is already organised alphabetically so it doesn’t actually change the order if we skip this step.

We can select a couple of columns to make the output easier to see.

genomes |>
  filter(Chromosomes > 40) |>
  arrange(Organism) |>
  select(Organism, Chromosomes)

Do any organisms containing a plasmid also have more than one chromosome?

genomes %>%
  filter(Plasmids > 0) %>%
  filter(Chromosomes > 1) %>%
  select(Organism, Chromosomes, Plasmids)

Make a version of the data containing only the columns from Chromosomes onwards

As always with select, there are different ways to do the selection. Here we’re using the last_col() select helper.

genomes %>%
  select(Chromosomes:last_col())

Move the `Size` column to the front of the tibble

genomes %>%
  select(Size, everything())

The relocate() function

We haven’t covered this, but the dplyr function specifically designed for changing column positions is relocate(). In this case, there are no advantages to using is over the select() function, but it provides more options for doing more complex relocations.

genomes |>
  relocate(Size, .before = 1)

Select the columns which start with “O”

genomes %>%
  select(starts_with("O"))

How many different groups are there?

There are a few different ways of answering this question. We’re deduplicating based on group and then piping the result to nrow().

genomes %>%
  distinct(Groups) %>%
  nrow()

[1] 310

If you have time…

Plot a scatterplot of the number of chromosomes vs the genome size but showing only one organism for each chromosome number.
When deduplicating, keep only the smallest genome for each chromosome number.
Exclude organisms with no listed chromosomes.

To make this clearer we can select only the columns we’re using. This isn’t necessary (the rest of the code will still work if this isn’t done), but it makes it a little easier to see what’s going on whilst developing the chain of operations.

In the final plot the linear scaling of chromosome size wasn’t all that informative so we can use a log scale which shows the overall trend more clearly.

genomes %>%
  select(Chromosomes, Size) %>%
  arrange(Size) %>%
  filter(Chromosomes > 0) %>%
  distinct(Chromosomes, .keep_all = TRUE) %>%
  ggplot(aes(x=Chromosomes, y=log(Size))) +
  geom_point()

Exercise 3 - More clever filtering

We’ll load in the cancer_stats.csv file.

cancer <- read_delim("Advanced_R_Data/cancer_stats.csv")

Rows: 36 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Class, Site
dbl (4): Male Cases, Female Cases, Male Deaths, Female Deaths

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(cancer)

For which Digestive System cancer types are there more female cases than male cases?

Since these two tests are combined with an AND logic we can either put them in a single filter statement or we could chain two filter statements together.

This is the first place where we have column names with spaces in them, and therefore where they need to be quoted. We can’t put them in single or double quotes otherwise they’ll be treated as text, so we use backticks. By far the easiest way to get the quoting right is to select from the drop down list in RStudio which will handle this for you.

cancer %>%
  filter(Class=="Digestive System" & `Female Cases` > `Male Cases`)

Which cancer types have NA values for either male or female?

cancer %>%
  filter(is.na(`Male Cases`))

cancer %>%
  filter(is.na(`Female Cases`))

That seems to make sense.

Which cancer type has the best survival rate for males?

In reality it might be better to add a survival column to the data and then filter on that, but since we haven’t seen how to do that yet we’ll settle for doing it directly in the filter.

cancer %>%
  select(Class, Site, `Male Cases`, `Male Deaths`) %>%
  arrange(`Male Deaths`/`Male Cases`) %>%
  slice(1)

Which Sites have `acute` in their name?

cancer %>%
  filter(str_detect(Site, "acute"))

Out of tongue, kidney, breast and pancreas, which is classified as a soft tissue cancer?

We can do this by initially making a vector of the things we want to test.

tissues_to_test <- c("tongue", "kidney", "breast", "pancreas")

Now we can find those tissues in the whole dataset.

cancer %>%
  filter(Site %in% tissues_to_test)

But there’s a problem - where’s tongue?
It’s not there because it’s spelled with a capital T in the original data.
How could we do the filtering to ignore case?
We could transform the data to lower case as we do the test. This preserves the original case in the result, but allows for case insensitive matches.

cancer %>%
  filter(tolower(Site) %in% tissues_to_test)

Now we can go ahead and find the answer to the original question

cancer %>%
  filter(tolower(Site) %in% tissues_to_test) %>%
  filter(Class == "soft tissue")

Find sites with 4 or fewer letters in their name

For this we’ll need to use a transformation of the text to its length when filtering. We can either do this with the core nchar function, or the stringr str_length function.

cancer %>%
  filter(nchar(Site) <= 4 )

cancer %>%
  filter(str_length(Site) <= 4 )

Extended explanation

The filter function requires an expression that evaluates to a logical vector i.e. a set of TRUE/FALSE values. Sometimes the logic is simple to comprehend e.g. filter(height > 12) and other times the expression may require more thought.

Often with R code it can be helpful to run a small section of the code to understand what is actually happening. We’ll use the code below from the exercise as an example.

cancer %>%
  filter(nchar(Site) <= 4)

With the code above we cannot just isolate and run the expression within the filter function nchar(Site) <= 4 as Site is not a standalone variable and we’ll get an error.

nchar(Site) <= 4

Error: object 'Site' not found

We could however create a dummy variable so that we could use this approach to rationalise about the code. The pull() function extracts a single column. It is very similar to using $ in base R.

my_site <- cancer |>   # tidyverse version
  pull(Site)

my_site <- cancer$Site # base R equivalent of the code above

my_site

 [1] "Tongue"              "Mouth"               "Pharynx"            
 [4] "Eosophagus"          "Stomach"             "Small intestine"    
 [7] "Colon"               "liver"               "gall"               
[10] "pancreas"            "larynx"              "lung"               
[13] "bone"                "heart"               "skin"               
[16] "breast"              "bladder"             "kidney"             
[19] "ureter"              "brain"               "thyroid"            
[22] "other endocrine"     "hodgkin"             "non-hodgkin"        
[25] "acute lymphocytic"   "chronic lymphocytic" "acute myeloid"      
[28] "chronic myeloid"     "cervix"              "uterine corpus"     
[31] "Ovary"               "Vulva"               "female genital"     
[34] "Prostate"            "Testis"              "male genital"

We now have a variable called my_site that is a character vector containing the data from the Site column in the cancer dataset, and we can use this in our nchar or str_length functions

nchar(my_site) <= 4

 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
[13]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

This produces a logical vector. When we use this inside a filter function, only the TRUE values (and all the data on their row) will be kept.

cancer %>%
  filter(nchar(Site) <= 4 )

Find all sites whose name ends with a “y”.

Again we can either use the core endsWith function, or the stringr str_sub function.

cancer %>%
  filter(endsWith(Site, "y"))

cancer %>%
  filter(str_sub(Site, start = -1) == "y")

Extended explanation

We can use the same approach to think about the next part of the exercise where we are trying to find all sites that end with a “y”.
This is the solution from the exercise.

cancer %>%
  filter(endsWith(Site, "y"))

If we extract the expression from the filter function and replace Site with the my_site vector again, we can see that the endsWith function produces a logical vector.

endsWith(my_site, "y")

 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE

The other option shown is to use the str_sub function.

cancer %>%
  filter(str_sub(Site, start = -1) == "y")

We can break the code down even further here. Firstly, we have the str_sub function which is extracting the last letter for each site (-1 is used to extract the final character).

str_sub(my_site, start = -1)

 [1] "e" "h" "x" "s" "h" "e" "n" "r" "l" "s" "x" "g" "e" "t" "n" "t" "r" "y" "r"
[20] "n" "d" "e" "n" "n" "c" "c" "d" "d" "x" "s" "y" "a" "l" "e" "s" "l"

We then use == to test whether the elements of this vector equal “y”.

str_sub(my_site, start = -1) == "y"

 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE

This is the logical vector that the filter function uses to cut down the rows.

cancer %>%
  filter(str_sub(Site, start = -1) == "y")

One last note - to quickly find out how many TRUE values are in a logical vector, the sum function can be used - the number of TRUE values are summed up.

sum(str_sub(my_site, start = -1) == "y")

[1] 2

Exercise 4 - Restructuring data into ‘tidy’ format

We are going to look through a few different files to see how they can be restructured. In each case we can ask the following questions:

How many types of measurement are there?
Does the same type of measurement appear in more than one column? If so, how can we put them all in one column?
Are there any combined categorical variables which should be split apart?
Are there any repeated annotation values which would be better split off into another table?
Does the restructured data contain any NA values which would be better off being removed?

tidy_data1

tidy1 <- read_delim("Advanced_R_Data/tidy_data1.csv")

Rows: 17 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (4): DMSO, TGX-221, PI103, Akt1

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

tidy1

This dataset has only one type of measurement, but it is split over 4 columns. The sample names are all OK already and there is no additional annotation so there’s nothing else to do once we’ve combined the whole dataset into a key/value pair.

For the cols argument we need to select all columns. We could do this in a number of ways;

DMSO:Akt1
1:last_col()

..but we’ll use the simpler everything() function to get the lot in one go.

tidy1 <- tidy1 %>%
  pivot_longer(
    cols = everything(), 
    names_to  = "Sample", 
    values_to = "Value"
  ) 

tidy1

Although this is better, if we look towards the end of the file we can see that there are still NA values in there.

tail(tidy1)

We can use a filter to remove these.

tidy1 %>%
  filter(!is.na(Value))

Another function for removing the NA values (that we haven’t actually seen before) is drop_na(). By default this will remove any rows with NA values in. Alternatively, a column can be specified and any rows of data with NAs in that column will be removed.

tidy1 %>%
  drop_na()

tidy_data2

tidy2 <- read_delim("Advanced_R_Data/tidy_data2.csv") 
tidy2

Here we have a bunch of data columns of the same type which we need to gather. We need to leave the rest of the columns alone. It’s probably quicker to specify the columns we want to gather here rather than the ones we don’t.

tidy2 <- tidy2 %>%
  pivot_longer(cols=A:E, names_to = "sample", values_to = "value") 

head(tidy2)

If this was real data then we’d want to separate out all of the annotation information into a separate tibble, so we could have the ID, Chr, Start, End, Strand in one table, and just keep the ID in the main data tibble. We could then recombine these where we needed the combined information and leave them separate otherwise.

tidy2_annotation <- tidy2 %>%
  select(ID:Strand) %>%
  distinct(ID, .keep_all = TRUE)

head(tidy2_annotation)

tidy2 <- tidy2 %>%
  select(-(Chr:Strand)) 

head(tidy2)

We could have done the splitting of the tibbles at the start. We get the same result but with slightly less code.

tidy2 <- read_delim("Advanced_R_Data/tidy_data2.csv")

Rows: 12 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): ID, Chr, Strand
dbl (7): Start, End, A, B, C, D, E

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(tidy2)

tidy2_annotation <- tidy2 |>
  select(ID:Strand)

head(tidy2_annotation)

tidied2 <- tidy2 |>
  select(-(Chr:Strand)) |>
  pivot_longer(cols=A:E, names_to = "sample", values_to = "value") 

head(tidied2)

tidy_data3

tidy3 <- read_delim("Advanced_R_Data/tidy_data3.csv")

Rows: 174 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Probe_ID, Symbol
dbl (6): WT_1, WT_2, WT_3, KO_1, KO_2, KO_3

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

tidy3

Here we have a lot of different measures but they are all of the same type so we need to gather them all together with pivot_longer. To select the columns to gather I’m going to use a select helper starts_with. I could have used a range WT_1:KO_3 but wanted to show a way you could do fuzzy matching to account for less well structured data.

Because our set of columns needs to be a single vector I combined the WT and KO columns with c()

tidy3 %>%
  pivot_longer(
    cols = c(starts_with("WT"), starts_with("KO")),
    names_to  = "sample",
    values_to = "value")

We still have a problem here in that the genotype (WT/KO) and the replicate number are combined, so we want to separate these apart using a split operation.

tidy3 <- tidy3 %>%
  pivot_longer(
    cols = c(starts_with("WT"), starts_with("KO")),
    names_to  = "sample",
    values_to = "value"
  ) %>%
  separate(
    sample, 
    into = c("genotype","replicate_number"),
    sep = "_"
  ) 

tidy3

For a final level of tidying we could remove the symbol column to a separate tibble since it is synonymous with the Probe_ID column and we only need to keep one of these in the main data to act as an identifier.

If you have time…

In your genomes data split the Groups column into 3 new columns based on the semi-colon delimiter. Call the new columns “Domain”, “Kingdom” and “Class”.

genomes <- read_delim("Advanced_R_Data/genomes.csv")

Rows: 47211 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Organism, Groups
dbl (5): Size, Chromosomes, Organelles, Plasmids, Assemblies

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

genomes <- genomes |>
  separate(
    col  = Groups, 
    into = c("Domain","Kingdom","Class"),
    sep  = ";"
  )

Remove any species with a single quote in their name, or which has a Kingdom or Class of ‘Other’

genomes |>
  filter(! str_detect(Organism, "'")) %>%
  filter(! (Kingdom == "Other" | Class == "Other"))

Exercise 5 - Adding or Creating New Data

Create new columns

In the cancer data generate new variables for cases and deaths which sum up the male and female values.
Overwrite the original cancer variable with the expanded version.

cancer <- cancer %>%
  mutate(
    Deaths = `Male Deaths` + `Female Deaths`, 
    Cases  = `Male Cases` + `Female Cases`
  ) 

cancer

Replace NAs

In child_variants we had some COVERAGE values which said >1000 and because these aren’t numeric they got turned into NA. we can now turn them back into a fixed value of 1000.

child_variants <- child_variants %>%
  mutate(COVERAGE = replace_na(COVERAGE,1000)) 

child_variants %>%
  arrange(desc(COVERAGE)) %>%
  head()

Create a new `Type` column

In child_variants create a new column called Type which has a value of “SNP” if both REF and ALT are only 1 letter and “INDEL” for all other cases.

child_variants <- child_variants %>%
  mutate(
    Type = if_else(
      str_length(REF) == 1 & str_length(ALT) == 1,
      "SNP", 
      "INDEL"
    )
  ) 

child_variants

If you have time…

Change the Type column in child_variants so that it can contain 3 values:

SNP if ALT and REF are the same length
INSERTION if ALT is longer than REF
DELETION if REF is longer than ALT

child_variants <- child_variants %>%
  mutate(
    Type = case_when(
      nchar(REF) == 1 & nchar(ALT) == 1 ~ "SNP",
      nchar(REF) > nchar(ALT)           ~ "DELETION",
      nchar(REF) < nchar(ALT)           ~ "INSERTION"
    )
  )

child_variants

Exercise 6 - Grouping and Summarising

Find the mean length per Category for the small_file data. One nasty gotcha with this is that if you want to use a column in more than one calculation (e.g. here we want to use Length for both the mean and sd calculations), then you can’t save to a column with the same name otherwise R gets confused. That’s why we call the output column MeanLength and not just Length.

small_file %>%
  group_by(Category) %>%
    summarise(
      MeanLength = mean(Length),
      SD = sd(Length)
    )

In child_variants find genes which have at least 3 novel SNPs in them and calculate their average COVERAGE.

child_variants %>% 
  filter(dbSNP == ".") %>%
  filter(Type == "SNP") %>%
  group_by(GENE) %>%
  summarise(COVERAGE = mean(COVERAGE), COUNT = n()) %>%
  filter(COUNT > 2) %>%
  arrange(desc(COVERAGE))

If you have time…

In tidy2 get the mean value for each combination of chromosome and category (A,B,C and D).

tidy2 <- read_delim("Advanced_R_Data/tidy_data2.csv")

Rows: 12 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): ID, Chr, Strand
dbl (7): Start, End, A, B, C, D, E

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

tidy2 %>%
  select(Chr, A:E) %>%
  pivot_longer(cols=A:E, names_to = "Sample",values_to = "Value") %>%
  group_by(Sample, Chr) %>%
  summarise(Value=mean(Value))

`summarise()` has grouped output by 'Sample'. You can override using the
`.groups` argument.

Find which cancer type has the closest incidence rate between males and females

cancer %>%
  mutate(mfdiff=abs(`Male Cases` - `Female Cases`)) %>%
  select(Site, `Male Cases`, `Female Cases`, mfdiff) %>%
  arrange(mfdiff) %>%
  slice(1)

Find which cancer has the largest discrepancy in survival rates between males and females (excluding ones which only one sex can get)

cancer %>%
  filter(!(is.na(`Male Cases`) | is.na(`Female Cases`))) %>%
  mutate(
    MaleRate = `Male Deaths`/`Male Cases`, 
    FemaleRate = `Female Deaths`/`Female Cases`, 
    rateDiff = abs(MaleRate-FemaleRate)
  ) %>%
  select(Class, Site, MaleRate:rateDiff) %>%
  arrange(desc(rateDiff)) %>%
  slice(1)

For each class of cancer find out which site has the best overall survival rate.

For this we can calculate the survival rate for all sites and order the data by that. We can then group it and do a slice on the data. Since the data is grouped we’ll get a sliced value for each group, and not just one for the whole dataset.

cancer %>%
  mutate(survival = (Cases - Deaths)/Cases) %>%
  arrange(desc(survival)) %>%
  group_by(Class) %>%
  slice(1)

In child variants find out which gene on each chromosome has the highest number of variants

child_variants %>%
  group_by(CHR, GENE) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  group_by(CHR) %>%
  slice(1)

`summarise()` has grouped output by 'CHR'. You can override using the `.groups`
argument.

Exercise 7 - Extending and Joining

read_delim("Advanced_R_Data/dna_methylation.csv") -> dna.methylation

Rows: 24 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Gene, Sample, Group, State
dbl (1): Count

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

dna.methylation

We need to get the percentage methylation for each gene. The easiest way to do this is to put the counts into two separate columns, one for meth and one for unmeth, and then use mutate to combine these into a single percentage methylation value.

dna.methylation %>%
  pivot_wider(names_from = State, values_from = Count) %>%
  mutate(percent_meth=(100*Meth/(Meth+Unmeth))) -> dna.methylation

dna.methylation

Now to get the mean methylation per group per gene.

dna.methylation %>%
  group_by(Gene, Group) %>%
  summarise(mean_meth=mean(percent_meth)) -> per.condition.dna.meth

`summarise()` has grouped output by 'Gene'. You can override using the
`.groups` argument.

per.condition.dna.meth

Now we can join this to the annotation for these genes.

read_delim("Advanced_R_Data/methylation_annotation.txt") -> methylation.annotation

Rows: 2 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): Gene_stable_ID, Gene_name
dbl (3): Chromosome, Gene_start, Gene_end

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

methylation.annotation

per.condition.dna.meth %>%
  rename(Gene_name = Gene) %>%
  left_join(methylation.annotation)

Joining with `by = join_by(Gene_name)`

If you have time…

In child make a sorted list (from most frequent to least frequent) of all of the different observed SNP mutations (REF > ALT combinations). You will need to build a new column from the combination of REF and ALT (plus a delimiter). You can use the unite function to achieve this.

child_variants %>% 
  select(ALT,REF) %>%
  unite(col = "variant", ALT, REF, sep = ">") %>%
  group_by(variant) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

A slightly different way of achieving the same thing, using mutate and str_c instead of unite.

child_variants %>% 
  select(ALT,REF) %>%
  mutate(variant = str_c(ALT, ">", REF)) %>%
  group_by(variant) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

Load the small_file.txt file. Generate a new column called normalized_length which is the difference between the existing length column and the shortest length for the category that measurement came from. To do this you will need to:

Group by Category
Summarise to generate a min_length column containing the minimum length for each category
Use a right_join back to the original data to transfer these values onto the full dataset
Use mutate to subtract the min_length from the length column to create the normalized_length value
Once you have the values calculated you can plot out the category against the normalized length values.

small <- read_delim("Advanced_R_Data/small_file.txt", comment = "#")

Rows: 40 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): Sample, Category
dbl (1): Length

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

small_norm <- small %>%
  group_by(Category) %>%
  summarise(
    min_length = min(Length)
  ) %>%
  ungroup() %>%
  right_join(small) %>%
  mutate(normalised_length = Length-min_length)

Joining with `by = join_by(Category)`

small_norm

small_norm %>%
  ggplot(aes(x = Category, y = normalised_length)) +
  geom_col()

Exercise 8 - Custom Functions

Write a function which calculates the lowest quality for the variants in child_variants in a specific gene. Have it take the gene name as its only argument and output a one line tibble with just the GENE and QUAL values in it. Use it to find the lowest quality for the AGRN1 gene.

lowest.gene.q <- function(tbl, gene) {
   tbl %>%
    filter(GENE == gene) %>%
    arrange(QUAL) %>%
    slice(1) %>%
    select(GENE, QUAL)
}

lowest.gene.q(child_variants, "AGRN")

child_variants |>
  lowest.gene.q("AGRN")

Modify the script so that you can now specify the column for which you want to find the lowest value. Have this default to QUAL but show that it can also work for MutantReads or COVERAGE.

lowest.q <- function(tbl, col = QUAL) {
  tbl %>%
    arrange({{ col }}) %>%
    slice(1)
}

lowest.q <- function(tbl, gene, col = QUAL) {
   tbl %>%
    arrange({{ col }}) %>%
    filter(GENE == gene) %>%
    arrange(QUAL) %>%
    slice(1) %>%
    select(GENE, {{ col }})
}



child_variants %>% 
  lowest.q(gene = "AGRN")

child_variants %>% 
  lowest.q(col = COVERAGE, gene = "AGRN")

Exercise 1

If you have time…

Exercise 2 - Filtering and Selecting with dplyr

Make a list of the 5 organisms with the largest genomes

Find the organisms with more than 40 chromosomes.

Do any organisms containing a plasmid also have more than one chromosome?

Make a version of the data containing only the columns from Chromosomes onwards

Move the Size column to the front of the tibble

Select the columns which start with “O”

How many different groups are there?

If you have time…

Exercise 3 - More clever filtering

For which Digestive System cancer types are there more female cases than male cases?

Which cancer types have NA values for either male or female?

Which cancer type has the best survival rate for males?

Which Sites have acute in their name?

Out of tongue, kidney, breast and pancreas, which is classified as a soft tissue cancer?

Find sites with 4 or fewer letters in their name

Find all sites whose name ends with a “y”.

Exercise 4 - Restructuring data into ‘tidy’ format

tidy_data1

tidy_data2

tidy_data3

If you have time…

Exercise 5 - Adding or Creating New Data

Create new columns

Replace NAs

Create a new Type column

If you have time…

Exercise 6 - Grouping and Summarising

If you have time…

Exercise 7 - Extending and Joining

If you have time…

Exercise 8 - Custom Functions

Move the `Size` column to the front of the tibble

Which Sites have `acute` in their name?

Create a new `Type` column