---
title: "Exploration of Protein Level Data"
output:
  html_document:
    df_print: paged
---

# Introduction

After looking at the PSM data we're now going to move to the protein level.  We will again look at some QC, but we'll also clean up the data, fix the normalisation and impute some missing values.  We'll then do some basic visualisation of the different experimental conditions.

```{r}
library(tidyverse)
theme_set(theme_bw())
library(tidyheatmaps)
library(missForest)
```

```{r}
read_delim("proteinGroups.txt", name_repair = "minimal") -> data
```

# Formatting

We're going to do a little bit of manipulation of the data.  At the moment we have the `Experiment` field containing the sample name provided when the search was run.  In other instances we might just get the raw file name reported.

We want to add a column with the conditions in it.  In this case it's just part of the Experiment name, but in other data you might need to import and merge in a metadata table if this isn't encoded in the sample name itself.

## Restrucuring

This data comes in a wide format, so we're going to pick out only the columns we want and are going to restructure into long format.  We'll also need to separate out the Experiments and conditions

The fields we are going to keep are:

1. The Protein ID - Majority protein IDs is simpler than the full list and is unique
2. Gene names - not every protein will have a name, and the names may not be unique
3. Q-Value - the likelihood of the match being incorrect
4. Reverse - does this protein come from the decoy set?
5. Potential contaminant - is this a known contaminant?
6. The LFQ intensities - there are multiple intensity values reported - you need to be sure which is the correct one to use for downstream analysis.


```{r}
data |>
  select(
    `Majority protein IDs`, 
    `Gene names`,
    `Q-value`,
    `Reverse`,
    `Potential contaminant`,
    starts_with("LFQ Intensity ")     
  ) -> data
```



## Add conditions
Now we can restructure and annotate with sample names and conditions.  In this case we can add condition because the Experiment names contain the condition, but if your sample names are not as structured as this you'd need to load in a separate data file to detail the mapping of sample (Experiment) names to other experimental metadata and merge that into the results.

```{r}
data |>
  pivot_longer(
    cols=starts_with("LFQ intensity"),
    names_to="Experiment",
    values_to="Intensity"
  ) |>
  mutate (
    Experiment = str_replace(Experiment, "LFQ intensity ","")
  ) |>
  mutate(Condition = str_replace(Experiment,"_Rep.*","")) |>
  select(Experiment, Condition, everything()) -> data
```

## Log transform

All intensities only really make sense on a log scale so we'll do that transformation now.

```{r}
data |>
  mutate(logIntensity=log10(Intensity)) -> data
```



# Quality Control

Now we're going to look at a set of metrics which will give us an idea of how well the search performed and how well matched the different runs were with each other.

## Real vs reversed hits

Let's see how many reversed PSMs came through

```{r}
data |>
  group_by(Experiment,Condition,Reverse) |>
  count() |>
  mutate(Reverse = !is.na(Reverse)) |>
  ggplot(aes(x=Experiment, y=n, fill=Reverse)) +
  geom_col(position="dodge") +
  coord_flip() +
  facet_wrap(vars(Condition), scale="free_y")
```

There are only a very tiny number of reverse hits which made it through.  We'll remove these from further analysis.

```{r}
data |>
  filter(is.na(Reverse)) -> data
```

## Contaminants

```{r}
data |>
  group_by(Experiment,Condition, `Potential contaminant`) |>
  count() |>
  ungroup() |>
  mutate(`Potential contaminant` = if_else(is.na(`Potential contaminant`),"Yes","No")) |>
  ggplot(aes(x=Experiment, y=n, fill=`Potential contaminant`)) +
  geom_col(position="dodge") +
  coord_flip() +
  facet_wrap(vars(Condition), scale="free_y")
```

Very few contaminants.  We'll remove them.

```{r}
data |>
  filter(is.na(`Potential contaminant`)) -> data
```



## Q-Value

The Q-Value is the ratio of reverse to forward hit strengths.  The lower the better.  Here we're transforming these onto a negative log scale, so higher values are better, and are distributed more nicely to plot.

```{r fig.width=8, fig.height=5}
data |>
  ggplot(aes(x=Experiment, y=-10*log10(`Q-value`), fill=Condition)) +
  geom_violin() +
  coord_flip()
```

We can see a small number of poor hits, which we can remove.

```{r}
data |>
  filter(-10*log10(`Q-value`) > 10) -> data
```


## Missing values

In this particular file we don't have any missing values - there is a numerical value for every protein in every sample
```{r}
data |>
  group_by(`Majority protein IDs`) |> 
  count() |>
  ungroup() |>
  distinct(n)
```

However, this doesn't mean that the data was actually measured properly.  This file just switched out any unobserved values for a numerical zero value, which isn't a very helpful way to do imputation.  Let's look at the number of missing values per sample.

### Missing values per sample
```{r}
data |>
  group_by(Experiment,Condition) |>
  summarise(
    missing_values = sum(Intensity == 0)
  ) |>
  ungroup() |>
  ggplot(aes(x=Experiment, y=missing_values, fill=Condition)) +
  geom_col() +
  coord_flip()
```

So consistently higher numbers of missing values in the TAP samples, which would make sense since those are the enriched samples.

### Missing values per protein

Let's look at this the other way and see how many proteins have different numbers of missing values

```{r}
data |>
  group_by(`Majority protein IDs`) |>
  summarise(
    missing_values = sum(Intensity == 0)
  ) |>
  ungroup() |>
  group_by(missing_values) |>
  count(name="Number_of_Proteins") |>
  ungroup() |>
  ggplot(aes(x=factor(missing_values), y=Number_of_Proteins)) +
  geom_col()
```

So there are quite a lot of proteins with no missing values.  A big spike at 6 and 9 which probably links to different conditions, and then another large peak at 12 which is proteins which are listed but have no real measurements for any sample.

Let's just check which samples have the missing values for the 6 and 9 groups.

```{r fig.width=3, fig.height=6}
data |>
  group_by(`Majority protein IDs`) |>
  filter(sum(Intensity==0) == 6) |>
  ungroup() |>
  mutate(present=if_else(Intensity==0,0,1)) |>
  tidy_heatmap(
    rows=`Majority protein IDs`,
    columns=Experiment,
    values=present,
    show_rownames = FALSE,
    cluster_rows = TRUE,
    cluster_cols = FALSE,
    treeheight_row = 0,
    colors = c("white","black"),
    main = "Presence (black) for 6 missing samples"
  )
  
```

For the 6 samples it's mostly proteins which are present in total but absent in TAP - but there are a few which are the other way around, and a few just in Q2.

```{r fig.width=3, fig.height=6}
data |>
  group_by(`Majority protein IDs`) |>
  filter(sum(Intensity==0) == 9) |>
  ungroup() |>
  mutate(present=if_else(Intensity==0,0,1)) |>
  tidy_heatmap(
    rows=`Majority protein IDs`,
    columns=Experiment,
    values=present,
    show_rownames = FALSE,
    cluster_rows = TRUE,
    cluster_cols = FALSE,
    treeheight_row = 0,
    colors = c("white","black"),
    main = "Presence (black) for 9 missing samples"
  )
  
```

Again, proteins mostly only found in one or other of the total samples.

We can get rid of any proteins which aren't measured in at least 3 samples.

```{r}
data |>
  group_by(`Majority protein IDs`) |>
  filter(sum(Intensity!=0)>=3) |>
  ungroup() -> data
```



## Distribution of Signal Intensity

We can look at the LFQ signal intensity across samples.

The dynamic range of the raw intensity values is such that it only really makes sense to look at these values on a log scale.

```{r}
data |>
  ggplot(aes(x=Experiment, y=logIntensity, fill=Condition)) +
  geom_violin() +
  coord_flip()
```

Here I've not included the zero measures, so we're just looking at the intensity of what we actually measured.  We can see that the samples are largely comparable, but that the TAP samples have an extra set of high intensity proteins.  This could be the set of actually enriched proteins.  We won't change this yet.


## Pairwise comparisons
Intensity distributions only tell part of a story.  It's really helpful to look at the pairwise scatterplots of all samples against each other to get a better idea of how the data actually works.

We'll use a small function to plot one of the comparisons and then use that to plot them all.

```{r fig.width=20, fig.height=20}

pairwise_scatterplot <- function(data1,data2,data) {

  data |>
    filter(Experiment %in% c(data1,data2)) |>
    select(Experiment,`Majority protein IDs`,logIntensity) |>
    pivot_wider(
      names_from=Experiment,
      values_from=logIntensity
    ) |>
    ggplot(aes(x=.data[[data1]],y=.data[[data2]])) +
    geom_point(size=0.5) +
    geom_abline(slope=1, intercept=0, colour="blue", linewidth=1) +
    xlab(NULL)+ylab(NULL) -> plot
  
  return(plot)
}

tibble(data1 = unique(data$Experiment), data2=unique(data$Experiment)) |>
  complete(data1,data2) |>
  arrange(data1,data2) -> all_sample_combinations

map2(all_sample_combinations$data1,all_sample_combinations$data2, pairwise_scatterplot, data) -> all_scatterplots

gridExtra::grid.arrange(grobs=all_scatterplots, nrow=length(unique(data$Experiment)))

```


We can see that despite the shift in intensity distributions that the pairwise scatterplots aren't too badly distributed.  For some of the samples we can see a bit of an offset but it's not obvious that changing the normalisation would be beneficial in this instance.


# Imputation
Because we have such a high number of missing values, and because these tie to the conditions we're looking at it would be good to fill these in with values which will behave in a sensible way.  At the moment they're just down as numerical zeroes, which isn't great.  We could use a simpler method to do this, or we could use a more robust imputation tool.


## Simple Imputation
The simple version of the imputation is to just substitue the missing values with the lowest observed value from the same dataset.  That gives a more relalistic limit of detection value rather than zero which is way below the lowest real value.

```{r}
data |>
  group_by(Experiment) |>
  mutate(Intensity2 = replace(Intensity, Intensity==0, min(Intensity[Intensity != 0]))) |>
  mutate(logIntensity = log10(Intensity2)) -> data_simpleimpute
```

Let's see what this does to the scatterplots

```{r fig.width=20, fig.height=20}
map2(all_sample_combinations$data1,all_sample_combinations$data2, pairwise_scatterplot, data_simpleimpute) -> all_scatterplots_simpleimpute

gridExtra::grid.arrange(grobs=all_scatterplots_simpleimpute, nrow=length(unique(data$Experiment)))

```

You can now see the extra line of points on the axis, representing the missing values, however they all end up with identical values in each sample which is not going to be ideal for any downstream statistics.

## Random Forest Imputation

A more complex imputation method is to use a random forest model to impute the missing values.

```{r}
data |>
  select(`Majority protein IDs`,Experiment,Intensity) |>
  mutate(
    Intensity=replace(Intensity, Intensity==0, NA)
  ) |>
  pivot_wider(
    names_from=Experiment,
    values_from=Intensity
  ) |>
  column_to_rownames(var="Majority protein IDs") |>
  missForest()  -> data_forestimpute

data_forestimpute$ximp |>
  as_tibble(rownames="Majority protein IDs") -> data_forestimpute

head(data_forestimpute)
```

We can now put this back into the original format

```{r}
data_forestimpute |>
  pivot_longer(
    cols=-`Majority protein IDs`,
    names_to="Experiment",
    values_to="Intensity"
  ) |>
  left_join(data |> distinct(Experiment,Condition)) |>
  mutate(logIntensity =log10(Intensity)) -> data_forestimpute
```


Let's see what this does to the scatterplots

```{r fig.width=20, fig.height=20}
map2(all_sample_combinations$data1,all_sample_combinations$data2, pairwise_scatterplot, data_forestimpute) -> all_scatterplots_forestimpute

gridExtra::grid.arrange(grobs=all_scatterplots_forestimpute, nrow=length(unique(data$Experiment)))

```

We don't see the same outgroups as before but the concern would be whether the imputed values make any sense.

Let's look at a comparison of a sample looking at the two different imputation methods to see how different they look.  We'll use an enriched sample as those will be the most controversial

```{r fig.width=6, fig.height=8}
data_forestimpute |>
  select(-Intensity) |>
  rename(forestIntensity = logIntensity) |>
  inner_join(
    data_simpleimpute |>
      select(`Majority protein IDs`,Experiment, logIntensity) |>
      rename(simpleIntensity=logIntensity)
  ) |>
  mutate(imputed = if_else(forestIntensity==simpleIntensity,"Measured","Imputed")) |>
  ggplot(aes(x=imputed, y=forestIntensity, fill=Condition)) +
  geom_violin(show.legend = FALSE) +
  facet_wrap(vars(Experiment), ncol = 3)

```

This makes some sense.  The imputed TAP values are lower than those for the total. They're all higher than the simple impute values though, which would be the lowest observed value in the measured.

# Clustering samples

## PCA

An easy way to make comparisons between samples is to cluster them together and see if a structure emerges.

Here we'll use a PCA plot to compare the samples.  We're only going to use proteins measured in all samples as PCA won't cope well with missing values.

```{r fig.width=7, fig.height=6}
data_forestimpute |>
  group_by(`Majority protein IDs`) |>
  filter(all(is.finite(logIntensity))) |>
  ungroup() |>
  select(`Majority protein IDs`,Experiment,logIntensity) |>
  pivot_wider(
    names_from=Experiment,
    values_from=logIntensity
  ) |>
  column_to_rownames("Majority protein IDs") |>
  t() |>
  prcomp() -> pca_result

pca_result$x |>
  as_tibble(rownames="Experiment") -> pca_result

pca_result |>
  left_join(data |> distinct(Experiment,Condition)) |>
  ggplot(aes(x=PC1, y=PC2, colour=Condition)) +
  geom_point(size=4)
```

We can see that PC1 separates the TAP from the ProtTot and PC2 separates Q1 from Q2.  The replicates in each condition cluster very nicely.


# Condition scatterplots

We can make average signal from each condition and then plot a scatterplot of this.  Again, for now, we're only using proteins present in all samples.

```{r}
data_forestimpute |>
  group_by(`Majority protein IDs`,Condition) |>
  summarise(
    logIntensity = mean(logIntensity)
  ) |>
  pivot_wider(
    names_from=Condition,
    values_from=logIntensity
  ) -> data_per_condition

head(data_per_condition)
```

Now we can plot out some of the comparisons

```{r fig.width=5, fig.height=5}
data_per_condition |>
  ggplot(aes(x=Q1_ProtTot, y=Q1_TAP)) +
  geom_point() +
  geom_abline(slope=1, intecept=0, colour="blue", linewidth=1)
```

Now we can plot out some of the comparisons

```{r fig.width=5, fig.height=5}
data_per_condition |>
  ggplot(aes(x=Q2_ProtTot, y=Q2_TAP)) +
  geom_point() +
  geom_abline(slope=1, intecept=0, colour="blue", linewidth=1)
```

Now we can plot out some of the comparisons

```{r fig.width=5, fig.height=5}
data_per_condition |>
  ggplot(aes(x=Q1_ProtTot, y=Q2_ProtTot)) +
  geom_point() +
  geom_abline(slope=1, intecept=0, colour="blue", linewidth=1)
```

```{r fig.width=5, fig.height=5}
data_per_condition |>
  ggplot(aes(x=Q1_TAP, y=Q2_TAP)) +
  geom_point() +
  geom_abline(slope=1, intecept=0, colour="blue", linewidth=1)
```

This gives us a pretty nice view of the data.  We can see that the Total comparison between Q1 and Q2 is really nice, as is the TAP comparison.

The comparisons between total and TAP is much messier, but looks very similar between Q1 and Q2.