---
title: "Statistic Analysis of Protein Level Data"
output:
  html_document:
    df_print: paged
---

# Introduction

After the exploration stage we have a resonable picture of what's going on in our data.  We want to finish up by doing some statistics to back up what we saw during the exploration.

```{r}
library(tidyverse)
theme_set(theme_bw())
library(tidyheatmaps)
library(rstatix)
library(limma)
```

# Loding data

We're going to load in the data we previous got from the exploration, after the imputation of missing values using a random forest model.

```{r}
read_delim("random_forest_imputed_data.tsv", name_repair = "minimal") -> data
head(data)
```

# Statistics

We're going to look at a few different types of statistical test.

## Anova

We'll start with a broad question which is which genes can we separate with a comparison of all 4 of the conditions.  This says nothing about where they change, just whether they are consistent across all conditions.

We perform this test for every protein separately.

```{r}
data |>
  group_by(`Majority protein IDs`) |>
  anova_test(logIntensity ~ Condition) |>
  ungroup() -> anova_results

anova_results |>
  as_tibble() |>
  mutate(fdr = p.adjust(p, method="fdr")) |>
  select(`Majority protein IDs`,DFn,DFd,fdr) -> anova_results

head(anova_results)
```

```{r}
anova_results |>
  mutate(significant = if_else(fdr < 0.05, "Significant","Not Significant")) |>
  group_by(significant) |>
  count()
```

Because there is such a massive difference between the Total and TAP conditions then almost every protein is significant by this metric.

## T-test
For a simpler pairwise comparison we can just a simple t-test to compare two conditions.  In this case we'll compare the Total samples in Q1 and Q2

```{r}
data |>
  filter (str_detect(Experiment,"ProtTot")) |>
  group_by(`Majority protein IDs`) |>
  t_test(logIntensity ~ Condition) |>
  as_tibble() |>
  mutate(fdr = p.adjust(p, method="fdr")) |>
  mutate(significant = if_else(fdr < 0.05, "Significant","Not Significant")) |>
  select(`Majority protein IDs`, group1,group2, statistic, fdr,significant) |>
  arrange(fdr) -> q1_vs_q2_t_test

head(q1_vs_q2_t_test)
```

Let's see how many hits were significant from this

```{r}
q1_vs_q2_t_test |>
  group_by(significant) |>
  count()
```

A much smaller number of hits from this.

### Plot hits as heatmap

Let's look at these hits as a heatmap

```{r fig.width=5, fig.height=8}
q1_vs_q2_t_test |>
  filter(fdr<0.05) |>
  pull(`Majority protein IDs`) -> q1_vs_q2_significant_proteins


data |>
  filter (str_detect(Experiment,"ProtTot")) |>
  filter(`Majority protein IDs` %in% q1_vs_q2_significant_proteins) |>
  tidy_heatmap(
    rows=`Majority protein IDs`,
    columns=Experiment,
    values = logIntensity,
    annotation_col = Condition,
    scale="row",
    cluster_rows = TRUE,
    cluster_cols=TRUE,
    cutree_cols = 2,
    cutree_rows = 2
  )

```

Those all look very sensible.

### Volcano plot

Proteomics people do love a volcano plot, so let's draw one for this data.  The t-test result will give us the significance but for a volcano plot we also need a difference so we're going to need to calculate that separately.

```{r}
data |>
  filter (str_detect(Experiment,"ProtTot")) |>
  group_by(`Majority protein IDs`, Condition) |>
  summarise(logIntensity = mean(logIntensity)) |>
  ungroup() |>
  pivot_wider(
    names_from=Condition,
    values_from=logIntensity
  ) |>
  mutate(Q1_Q2_difference = Q2_ProtTot - Q1_ProtTot) |>
  select(`Majority protein IDs`,Q1_Q2_difference) |>
  right_join(q1_vs_q2_t_test) -> q1_vs_q2_t_test

head(q1_vs_q2_t_test)
```

Now we have both the fdr and the difference we can plot this out.

```{r}
q1_vs_q2_t_test |>
  ggplot(aes(x=Q1_Q2_difference, y=fdr, colour=significant)) +
  geom_point() +
  scale_y_log10() +
  scale_y_reverse() +
  scale_colour_manual(values=c("grey","red2")) +
  coord_cartesian(xlim=c(-1,1))
```

### Highlighting on scatterplot

```{r}
data |>
  filter (str_detect(Experiment,"ProtTot")) |>
  group_by(`Majority protein IDs`, Condition) |>
  summarise(logIntensity = mean(logIntensity)) |>
  ungroup() |>
  pivot_wider(
    names_from=Condition,
    values_from=logIntensity
  ) |>
  left_join(q1_vs_q2_t_test |> select(`Majority protein IDs`,significant)) |>
  arrange(significant) |>
  ggplot(aes(x=Q1_ProtTot, y=Q2_ProtTot, colour=significant)) +
  geom_point() +
  scale_colour_manual(values=c("grey","red2"))

```

## Moderated T-test (LIMMA)

The power of a t-test is limited by the number of replicates, and specifically how accurately you can estimate the variance between biological replicates.

LIMMA is an improved version of this type of test, where you can improve the estimate of variance by sharing information between other "similar" proteins.  It works where you can find a way to group proteins which should behave similarly together.  

By default LIMMA assumes that the magnitude of your measurement is a good predictor of variance (as it is for microarrays or RNA-Seq).  In proteomics it's a bit more complicated since if you're analysing at the protein level it depends how your summarisation from peptides to proteins was done.  In some cases expression level predicts variance, but in others it might be better predicted by the number of peptides which were combined in the summary (if you have that information).

### Check intensity vs variance

We can plot out the relationship between intensity and variance to see if this is going to be helpful here.

```{r}
data |>
  filter (Condition == "Q1_ProtTot") |>
  group_by(`Majority protein IDs`) |>
  summarise(
    variance = sd(logIntensity),
    magnitude = mean(logIntensity)
  ) |>
  ungroup() |>
  filter(magnitude>6.5 & magnitude < 9) |>
  ggplot(aes(x=magnitude,y=variance)) +
  geom_point(colour="grey", size=1) +
  geom_smooth()
```

In this instance we can see that there is no linkage between magnitude of measurement and variance, so LIMMA will not give us biologically meaningfully better results.

If we had number of summarised peptides as a predictor then we could test that as a different factor to consider when running the model.

We're going to go through the process of running LIMMA, but as shown above the results aren't going to be meaningfully better than a simple t-test.

### LIMMA stats

We need to make up a limma dataset from the expression values in the data, and a simple design to specify how we want to compare them.  You can build more complex designs but we're going to use the simplest version here.

```{r}
data |>
  filter(str_detect(Condition,"ProtTot")) |>
  select(`Majority protein IDs`,Experiment,logIntensity) |>
  pivot_wider(
    names_from=Experiment,
    values_from=logIntensity
  ) |>
  column_to_rownames("Majority protein IDs") -> limma_data

head(limma_data)
```

```{r}
tibble(
  Experiment = colnames(limma_data),
  Control = 1,
  Q1_vs_Q2 = if_else(startsWith(Experiment,"Q1"),1,0)
) |>
  column_to_rownames("Experiment") -> limma_design

limma_design
```

```{r}
lmFit(limma_data,limma_design) -> limma_fit
eBayes(limma_fit) -> limma_fit

topTable(limma_fit,coef="Q1_vs_Q2", adjust="BH", number = nrow(limma_data)) |>
  as_tibble(rownames="Majority Protein IDs") -> limma_results

head(limma_results)
```

### Count hits

```{r}
limma_results |>
    mutate(significant = if_else(adj.P.Val<0.05, "Significant","Not Significant"))  -> limma_results

limma_results |>
  group_by(significant) |>
  count()
```

A much higher proportion of hits are found as significant by LIMMA.  Given the lack of linkage between variance and magnitude of value it's arguable whether this represents a genuine improvement.

Let's look at the results.


### Plot volcano plot

```{r}
limma_results |>
  mutate(significant = if_else(adj.P.Val<0.05, "Significant","Not Significant")) |>
  ggplot(aes(x=logFC, y=adj.P.Val, colour=significant)) +
  geom_point() +
  scale_y_log10() +
  scale_y_reverse() +
  coord_cartesian(xlim=c(-1,1)) +
  scale_colour_manual(values=c("grey","red2"))
```

### Plot heatmap

```{r fig.width=5, fig.height=8}
limma_results |>
  filter(adj.P.Val < 0.05) |>
  pull(`Majority Protein IDs`) -> q1_vs_q2_limma_significant_proteins


data |>
  filter (str_detect(Experiment,"ProtTot")) |>
  filter(`Majority protein IDs` %in% q1_vs_q2_limma_significant_proteins) |>
  tidy_heatmap(
    rows=`Majority protein IDs`,
    columns=Experiment,
    values = logIntensity,
    annotation_col = Condition,
    scale="row",
    cluster_rows = TRUE,
    cluster_cols=TRUE,
    cutree_cols = 2,
    cutree_rows = 2,
    show_rownames = FALSE
  )
```

You can see the results for this are much noisier than the straight T-Test.  Some of the significant hits show high variability between the replicates.