---
title: "Quality Control of PSM Level Data"
output:
  html_document:
    df_print: paged
---

# Introduction

Here we are going to look at the properties of PSM level data - the lowest level of search result from a database search with the mass spectra.

The exact structure of the files you're using will vary - here we're showing examples based on output from MaxQuant, but other search programs will have similar fields in them.

From the large collection of fields in the output the ones we will be examining here will be:

1. The measured intensity value (3D peak area in this instance)
2. The PSM match metrics (PEP value and Match Score)
3. The number of missed cleavages in each match
4. A way to identify contaminant sequences
5. The retention time of peaks for each PSM
6. The mass accuracy measurement
7. Whether the peaks were directly quantitated or based on a match between runs
8. For matched peaks the difference to the expected retention time

Not all reports will have all of these values, and the way they are shown will also vary.  You'll have to look through the contents and documentation for the search program you're using to see how these values are encoded in your data.


```{r}
library(tidyverse)
theme_set(theme_bw())
```

```{r}
read_delim("evidence.txt", name_repair = "minimal") -> data

colnames(data)
```

# Formatting

We're going to do a little bit of manipulation of the data.  At the moment we have the `Experiment` field containing the sample name provided when the search was run.  In other instances we might just get the raw file name reported.

## Add conditions

We want to add a column with the conditions in it.  In this case it's just part of the Experiment name, but in other data you might need to import and merge in a metadata table if this isn't encoded in the sample name itself.


```{r}
data |>
  distinct(Experiment)
```

```{r}
data |>
  mutate(condition = str_replace(Experiment,"_Rep.*","")) -> data
```


# Quality Control

Now we're going to look at a set of metrics which will give us an idea of how well the search performed and how well matched the different runs were with each other.

## Real vs reversed hits

The PSMs here are generated from both a real and decoy database, where the sequences are reversed.  There should be no real hits from the reversed sequences, but some will always come through.

```{r}
data |>
  group_by(Experiment,condition,Reverse) |>
  count() |>
  mutate(Reverse = !is.na(Reverse)) |>
  ggplot(aes(x=Experiment, y=n, fill=Reverse)) +
  geom_col(position="dodge") +
  coord_flip() +
  facet_wrap(vars(condition), scale="free_y")
```

There are only a very tiny number of reverse hits which made it through.  We'll remove these from further analysis.

```{r}
data |>
  filter(is.na(Reverse)) -> data
```



## PEP Distribution

We'll start by looking at the quality of the PSM matches.  These are reported by two values in MaxQuant - the PEP value which is the probability that the reported match is incorrect - so lower is better, and the Match Score which is the ratio of the score for the best real match divided by the score for the best decoy match.

In MaxQuant we only see PSMs to real sequences reported.  In other search programs you may also see matches to decoy sequences reported and there will be a field to indicate this.  Looking at the distribution of real and decoy matches can be interesting, but you need to remove the decoys before any analysis.

```{r fig.width=8, fig.height=4}
data |>
  filter(-10*log10(PEP) < 500) |>
  ggplot(aes(x=-10*log10(PEP), group=Experiment, colour=condition)) +
  geom_density() +
  geom_vline(xintercept = 20)
```

We've transformed the p-values to PHRED scores, where larger is better, and I've put a line at PHRED=20 which is 99% confidence.  We can see that most of our matches fall above this but that there are a small number of matches with scores up to p=0.1, which is the basline cutoff.

## Score Distribution

```{r fig.width=8, fig.height=4}
data |>
  filter(Score < 400) |>
  ggplot(aes(x=Score, group=Experiment, colour=condition)) +
  geom_density() +
  scale_x_continuous(breaks=seq(from=0, by=20, to=400))
```

These are the distributions for scores in the andromeda program (in the case of MaxQuant). Other search tools will use a different type of score.


## Incomplete digestion

We are looking at typsin digests, and we would hope that the proteins were mostly completely digested to individual tryptic fragments.  The search will also look for matches to fragments where one or more cleavage sites were missed.

We can look at the number of missed cleavages in the PSMs in different samples.

```{r fig.width=8, fig.height=6}
data |>
  ggplot(aes(x=`Missed cleavages`, fill=condition)) +
  geom_bar() +
  coord_flip() +
  facet_wrap(vars(`Experiment`), ncol=3)
```

Whilst we can see that matches to peptides with no missed cleavages are the most common we can also see a fairly high number of mismatches occurring.  These do at least seem to be happening at a similar level in all of the samples so if there is an issue then it is affecting all samples equally.

## Number of PSMs and Contamination level

We can see how many PSMs there are in each sample.  We'll split these based on the presence of contaminants. How you identify contaminants will differ between outputs.  Sometimes (as in this case) there will be a specific column which labels PSMs as contaminants, but often you'll find something in the protein name (eg name starts with "CRAP_").

```{r fig.width=7, fig.height=6}
data |>
  group_by(`Potential contaminant`) |>
  slice(1:4) |>
  select(Sequence,`Potential contaminant`,everything())

data |>
  mutate(`Potential contaminant` = if_else(is.na(`Potential contaminant`),"Normal","Contaminant")) |>
  group_by(Experiment,condition,`Potential contaminant`) |>
  count() |>
  ungroup() |>
  ggplot(aes(x=`Potential contaminant`, y=n, fill=`Potential contaminant`)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(vars(Experiment))
```

We can see that the TAP samples consistently have fewer PSMs than the ProtTot samples.  In a standard experimental design this would be a cause for concern, but in this instance the TAP samples are an enriched subset, so we'd expect a loss of both total protein level and diversity within the population.

We've also looked at the level of contamination in the samples and we can see that this is universally low so the signal to noise ratio for real data vs contaminants in this data is high.

## Distribution of Signal Intensity

Signal intensity for this type of label free quantitation is the area under the 3D MS1 peak.  Higher intensity will result from a more abundant peptide, but also from higher amounts of protein in the sample overall so the signal here will be a combination of these two factors.

The dynamic range of the raw intensity values is such that it only really makes sense to look at these values on a log scale.

```{r}
data |>
  ggplot(aes(x=Experiment, y=log10(Intensity), fill=condition)) +
  geom_violin() +
  coord_flip()
```

We can see some variation in the ranges of values.  Again we see that the TAP samples show lower total signal than the ProtTot samples, which would be expected.

## Signal intensity over time

Our data is collected from an elution from a chromotography column.  We would like to see that the column is effectively separating the peptides over time and feeding a steady stream of peptides into the mass spec.  We can therefore look at the rate at which PSMs are identified relative to the elution time from the column.

Initially we can look at the intensity values over time.  These shouldn't vary systematically with time.

```{r fig.width=12, fig.height=12}
data |>
  ggplot(aes(x=`Retention time`, y=log2(Intensity), colour=condition)) +
  geom_point(show.legend = FALSE, size=0.5) +
  geom_smooth(colour="black") +
  facet_wrap(vars(Experiment))
```

We can generally see an even spread of intensities over time. Some of the sames show a bit of a spike around 80 mins in which isn't ideal, but isn't bad enough to not use the run.

## PSMs over time

The previous plot looked at the intensity values from the PSM matches over time.  We can also look at the raw number of PSMs being found each minute.  We're going to bin the data into 1 minute chunks and count the number of PSMs in each chunk in each sample.

```{r}
data |>
  mutate(`Retention time` = as.integer(`Retention time`)) |>
  group_by(`Retention time`, Experiment, condition) |>
  count(name="Num_PSMS") |>
  ggplot(aes(x=`Retention time`, y=Num_PSMS, colour=condition)) +
  geom_line(show.legend = FALSE, linewidth=1) +
  facet_wrap(vars(Experiment))
```

There will always be some noise at the start of the run when the first peptides appear.  Thereafter the rate is fairly steady.  Again we see a bit of a peak around 80 mins in which is more pronounced in some samples than others.  It might be worth making a note of the samples which do and don't show this peak so that if we later see an effect which links to these samples specifically we might treat that with some caution.

## Retention time alignment

For the subset of PSMs which we find in all samples we can see how well their retention times match.  For each PSM we'll calculate the deviation of the retention times from the mean across all samples.

A PSM is defined by the combination of sequence, modifications and charge.

```{r}
data |>
  unite(PSM,Sequence,Modifications, Charge,sep=":") |>
  group_by(PSM) |>
  filter(n() == 12 & length(unique(Experiment))==12) |>
  group_by(PSM) |>
  mutate(
    average_time = mean(`Retention time`),
    Retention_Diff = `Retention time`-average_time
  )-> retention_times

retention_times |>
  ggplot(
    aes(
      x=average_time,
      y=Retention_Diff, 
      group=Experiment, 
      colour=condition 
    )
  ) +
  geom_line()

```

We can see that the Q1 and Q2 batches have different elution trajectories to each other.  we can see that Q2TAP has a problem in PSMs before about 25 mins which might cause problems in peak matching.  All samples seem to be more unreliable above about 80 mins.

If this was particularly poor then we could speak to our Mass Spec facility.  We could also potentially filter the PSMs and re-summarise at the protein level if we wanted to only use the clean data.



## Mass Accuracy

Mass Spectrometers are very accurate in their mass determination, but this requires continual calibration, otherwise the reported m/z values will drift.  The searches will allow a certain amount of mass inaccuracy to be tolerated and will estimate this per sample (since it should be roughly constant for the whole sample).  Here we can see how much the reported masses drifted from the true calculated peptide masses.  We want these peaks to be tight and ideally centered on or around zero.

```{r}
data |>
  ggplot(aes(x=Experiment, y=`Mass error [ppm]`, fill=condition)) +
  geom_violin() +
  geom_hline(yintercept = 0) +
  coord_flip() 
```


## Matched vs Directly Measured

When identifying a peak we can either measure it directly based on an MS2 fragmentation pattern, or, if we're doing label free quantitation and we're doing multiple runs allowing for cross matching, we might quantify a peptide in one sample based on an MS2 identification in another and the identification of a matched MS1 peak.

Here we can look at the proportion of PSMs which have been identified in different ways.


```{r}
data |>
  group_by(Experiment, condition, Type) |>
  count() |>
  ggplot(aes(x=Experiment, y=n, fill=Type)) +
  geom_col() +
  coord_flip() +
  facet_grid(rows=vars(condition), scale="free_y")
```

We're only going to use the MULTI-MSMS (proper MS2 identification and MS1 quantitation), and MULTI-MATCH (transferred MS2 identification and matched MS1 quantitation).

We can see that the proportion of direct and matched PSMs does vary.  Lots of the matches are MS2/MS1, but in some samples (eg Q2_TAP) the proportion of transferred identifications is notably higher.


## Retention time matching

When we do matching of MS1 values between runs to increase the number of quantitations we can measure the difference between the observed and expected retention times.  When running the search you can say how consistent you expect your elutions to be.  Using a tighter window lowers the chance of a spurious match and increases the speed of the search, but if the window is set too narrowly then the true match may be missed.

```{r fig.width=10, fig.height=4}
data |>
  ggplot(aes(x=Experiment, y=`Retention time calibration`, colour=condition)) +
  geom_boxplot(outliers = FALSE) +
  coord_flip() +
  facet_grid(cols=vars(Type))
```

We can see that the Q1 samples all have very consistent retention times.  The Q2 samples are offset by about -5 minutes, which is well within the search range, but does suggest that these were likely a different run batch which introduced some variability.