This document provides worked answers for all of the exercises in the Introduction to R with Tidyverse course.

We start by doing some simple calculations.

`31 * 78`

`## [1] 2418`

`697 / 41`

`## [1] 17`

We next look at how to assign data to named variables and then use those variables in calculations.

We make assignments using arrows and they can point to the right or the left depending on the ordering of our data and variable name.

```
39 -> x
y <- 22
```

We can then use these in calculations instead of re-entering the data.

`x - y`

`## [1] 17`

We can also save the results directly into a new variable.

```
x - y -> z
z
```

`## [1] 17`

We can also store text. The difference with text is that we need to indicate to R that this isn't something it should try to understand. We do this by putting it into quotes.

`"simon" -> my.name`

We can use the `nchar` function to find out how many characters my name has in it.

`nchar(my.name)`

`## [1] 5`

We can also use the `substr` function to get the first letter of my name.

`substr(my.name,start=1,stop=1)`

`## [1] "s"`

We're going to manually make some vectors to work with. For the first one there is no pattern to the numbers, so we're going to make it completely manually with the `c()` function.

`c(2,5,8,12,16) -> some.numbers`

For the second one we're making an integer series, so we can use the colon notation to enter this more quickly.

`5:9 -> number.range`

Now we can do some maths using the two vectors.

`some.numbers - number.range`

`## [1] -3 -1 1 4 7`

Because the two vectors are the same length, the values at equivalent positions are matched together. The final answer is therefore:

`(2-5), (5-6), (8-7), (12-8), (16-9)`

We're going to use some functions which return vectors and then use the subsetting functionality on them.

First we're going to make a numerical sequence with the `seq` function.

```
seq(
from=2,
by=3,
length.out = 100
) -> number.series
number.series
```

```
## [1] 2 5 8 11 14 17 20 23 26 29 32 35 38 41 44 47 50 53
## [19] 56 59 62 65 68 71 74 77 80 83 86 89 92 95 98 101 104 107
## [37] 110 113 116 119 122 125 128 131 134 137 140 143 146 149 152 155 158 161
## [55] 164 167 170 173 176 179 182 185 188 191 194 197 200 203 206 209 212 215
## [73] 218 221 224 227 230 233 236 239 242 245 248 251 254 257 260 263 266 269
## [91] 272 275 278 281 284 287 290 293 296 299
```

We now want to extract the values at positions 5, 10, 15 and 20. This means that we need a vector with these values in it. It's short enough that we can just enter these manually, but we can also see that it's a mathematical progression, so we could also use `seq` to create this.

`c(5,10,15,20)`

`## [1] 5 10 15 20`

`seq(from=5, by=5, to=20)`

`## [1] 5 10 15 20`

We can now use either of these methods to select the corresponding values at those positions in the `number.series` data structure.

`number.series[c(5,10,15,20)]`

`## [1] 14 29 44 59`

`number.series[seq(from=5,by=5, to=20)]`

`## [1] 14 29 44 59`

Finally we're going to extract all values from positions 10 to 30. For this we'll use the colon operator as we did in the last exercise, but now it's inside a selector.

`number.series[10:30]`

`## [1] 29 32 35 38 41 44 47 50 53 56 59 62 65 68 71 74 77 80 83 86 89`

Since R is a language built around data manipulation and statistics, we can use some of its built-in statistical functions.

We can use `rnorm` to generate a sampled set of values from a normal distribution.

`rnorm(20) -> normal.numbers`

Note that if you run this multiple times you'll get slightly different results.
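If you need the sampling to be reproducible, you can fix the random number generator's seed with `set.seed` before drawing; a brief sketch:

```
# The same seed always produces the same draws
set.seed(42)
first.draw <- rnorm(5)
set.seed(42)
second.draw <- rnorm(5)
identical(first.draw, second.draw)
```

The last line returns `TRUE`: resetting the seed replays the same sequence of random numbers.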

We can now use the `t.test` function to test whether this vector of numbers has a mean which is significantly different from zero.

`t.test(normal.numbers)`

```
##
## One Sample t-test
##
## data: normal.numbers
## t = 1.1001, df = 19, p-value = 0.285
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.1674367 0.5384588
## sample estimates:
## mean of x
## 0.1855111
```

Not surprisingly, it isn't significantly different.

If we do the same thing again but this time use a distribution with a mean of 1 we should see a difference in the statistical results we get.

`t.test(rnorm(20, mean=1))`

```
##
## One Sample t-test
##
## data: rnorm(20, mean = 1)
## t = 6.363, df = 19, p-value = 4.191e-06
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.6841746 1.3549094
## sample estimates:
## mean of x
## 1.019542
```

This time the result is significant.

To median center the values in `number.series` we simply calculate the median value and then subtract it from each of the values in the vector.

`number.series - median(number.series)`

```
## [1] -148.5 -145.5 -142.5 -139.5 -136.5 -133.5 -130.5 -127.5 -124.5 -121.5
## [11] -118.5 -115.5 -112.5 -109.5 -106.5 -103.5 -100.5 -97.5 -94.5 -91.5
## [21] -88.5 -85.5 -82.5 -79.5 -76.5 -73.5 -70.5 -67.5 -64.5 -61.5
## [31] -58.5 -55.5 -52.5 -49.5 -46.5 -43.5 -40.5 -37.5 -34.5 -31.5
## [41] -28.5 -25.5 -22.5 -19.5 -16.5 -13.5 -10.5 -7.5 -4.5 -1.5
## [51] 1.5 4.5 7.5 10.5 13.5 16.5 19.5 22.5 25.5 28.5
## [61] 31.5 34.5 37.5 40.5 43.5 46.5 49.5 52.5 55.5 58.5
## [71] 61.5 64.5 67.5 70.5 73.5 76.5 79.5 82.5 85.5 88.5
## [81] 91.5 94.5 97.5 100.5 103.5 106.5 109.5 112.5 115.5 118.5
## [91] 121.5 124.5 127.5 130.5 133.5 136.5 139.5 142.5 145.5 148.5
```

If we are sampling from two distributions with only a 1% difference in their means, how many observations do we need before we can detect them as significantly different?

Let's try a few different sample sizes to see.

```
samples <- 100
rnorm(samples,mean=10,sd=2) -> data1
rnorm(samples,mean=10.1,sd=2) -> data2
t.test(data1,data2)
```

```
##
## Welch Two Sample t-test
##
## data: data1 and data2
## t = 0.281, df = 197.65, p-value = 0.779
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.5131564 0.6836997
## sample estimates:
## mean of x mean of y
## 10.15606 10.07079
```

```
samples <- 500
rnorm(samples,mean=10,sd=2) -> data1
rnorm(samples,mean=10.1,sd=2) -> data2
t.test(data1,data2)
```

```
##
## Welch Two Sample t-test
##
## data: data1 and data2
## t = -0.72221, df = 993.57, p-value = 0.4703
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.3476367 0.1605922
## sample estimates:
## mean of x mean of y
## 10.04706 10.14058
```

```
samples <- 1000
rnorm(samples,mean=10,sd=2) -> data1
rnorm(samples,mean=10.1,sd=2) -> data2
t.test(data1,data2)
```

```
##
## Welch Two Sample t-test
##
## data: data1 and data2
## t = -2.5475, df = 1997.2, p-value = 0.01092
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.39747003 -0.05169393
## sample estimates:
## mean of x mean of y
## 9.971158 10.195740
```

```
samples <- 5000
rnorm(samples,mean=10,sd=2) -> data1
rnorm(samples,mean=10.1,sd=2) -> data2
t.test(data1,data2)
```

```
##
## Welch Two Sample t-test
##
## data: data1 and data2
## t = -4.3498, df = 9997.9, p-value = 1.376e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.25087714 -0.09500668
## sample estimates:
## mean of x mean of y
## 9.962487 10.135429
```

It's only really when we get up close to 5000 samples that we can reliably detect such a small difference. The answers will be different every time, since `rnorm` involves a random component.
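Rather than finding this by trial and error, base R's `power.t.test` can estimate the required sample size directly. A sketch for the scenario above (difference in means of 0.1, standard deviation of 2, aiming for 80% power at the conventional 0.05 significance level):

```
# How many observations per group do we need to detect a
# mean difference of 0.1 (sd = 2) with 80% power?
power.t.test(delta = 0.1, sd = 2, power = 0.8, sig.level = 0.05)
```

This suggests on the order of a few thousand observations per group, consistent with significance only becoming reliable at the largest sample sizes we tried above.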

We're going to read some data from a file straight into R. To do this we're going to use the tidyverse `read_` functions, so we need to load the tidyverse into our script.

`library(tidyverse)`

`## -- Attaching packages --------------------------------------------------------------------- tidyverse 1.3.0 --`

```
## v ggplot2 3.3.0 v purrr 0.3.4
## v tibble 3.0.1 v dplyr 0.8.5
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
```

```
## -- Conflicts ------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
```

We'll start off by reading in a small file.

`read_tsv("small_file.txt") -> small`

```
## Parsed with column specification:
## cols(
## Sample = col_character(),
## Length = col_double(),
## Category = col_character()
## )
```

`small`

Note that the only relevant name from now on is `small`, which is the name we saved the data under. The original file name is irrelevant once the data is loaded.

We can see that Sample and Category have the 'character' type because they are text. Length has the 'double' type because it is a number.
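You can also query the column types programmatically with `typeof`. A minimal sketch using a made-up tibble with the same column layout (not the real `small_file.txt` data):

```
library(tibble)

# Hypothetical data in the same shape as small_file.txt:
# two text columns and one numeric column
example <- tibble(
  Sample = c("A1", "B2"),
  Length = c(45, 82),
  Category = c("treated", "control")
)

sapply(example, typeof)
```

This reports "character" for the text columns and "double" for the numeric one, matching the types shown when the file was parsed.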

We want to find the median of the log2 transformed lengths.

To start with we need to extract the Length column using the `$` notation.

`small$Length`

```
## [1] 45 82 81 56 96 85 65 96 60 62 80 63 50 64 43 98 78 53 100
## [20] 79 84 68 99 65 55 98 56 83 81 69 50 72 54 56 87 84 80 68
## [39] 95 93
```

Now we can log2 transform this.

`log2(small$Length)`

```
## [1] 5.491853 6.357552 6.339850 5.807355 6.584963 6.409391 6.022368 6.584963
## [9] 5.906891 5.954196 6.321928 5.977280 5.643856 6.000000 5.426265 6.614710
## [17] 6.285402 5.727920 6.643856 6.303781 6.392317 6.087463 6.629357 6.022368
## [25] 5.781360 6.614710 5.807355 6.375039 6.339850 6.108524 5.643856 6.169925
## [33] 5.754888 5.807355 6.442943 6.392317 6.321928 6.087463 6.569856 6.539159
```

Finally we can find the median of this.

`median(log2(small$Length))`

`## [1] 6.227664`

The second file we're going to read is a CSV file of variant data. We therefore need to use `read_csv` to read it in.

`read_csv("Child_Variants.csv") -> child`

```
## Parsed with column specification:
## cols(
## CHR = col_character(),
## POS = col_double(),
## dbSNP = col_character(),
## REF = col_character(),
## ALT = col_character(),
## QUAL = col_double(),
## GENE = col_character(),
## ENST = col_character(),
## MutantReads = col_double(),
## COVERAGE = col_double(),
## MutantReadPercent = col_double()
## )
```

`child`