--- title: "Statistic Analysis of Protein Level Data" output: html_document: df_print: paged --- # Introduction After the exploration stage we have a resonable picture of what's going on in our data. We want to finish up by doing some statistics to back up what we saw during the exploration. ```{r} library(tidyverse) theme_set(theme_bw()) library(tidyheatmaps) library(rstatix) library(limma) ``` # Loding data We're going to load in the data we previous got from the exploration, after the imputation of missing values using a random forest model. ```{r} read_delim("random_forest_imputed_data.tsv", name_repair = "minimal") -> data head(data) ``` # Statistics We're going to look at a few different types of statistical test. ## Anova We'll start with a broad question which is which genes can we separate with a comparison of all 4 of the conditions. This says nothing about where they change, just whether they are consistent across all conditions. We perform this test for every protein separately. ```{r} data |> group_by(`Majority protein IDs`) |> anova_test(logIntensity ~ Condition) |> ungroup() -> anova_results anova_results |> as_tibble() |> mutate(fdr = p.adjust(p, method="fdr")) |> select(`Majority protein IDs`,DFn,DFd,fdr) -> anova_results head(anova_results) ``` ```{r} anova_results |> mutate(significant = if_else(fdr < 0.05, "Significant","Not Significant")) |> group_by(significant) |> count() ``` Because there is such a massive difference between the Total and TAP conditions then almost every protein is significant by this metric. ## T-test For a simpler pairwise comparison we can just a simple t-test to compare two conditions. In this case we'll compare the Total samples in Q1 and Q2 ```{r} data |> filter (str_detect(Experiment,"ProtTot")) |> group_by(`Majority protein IDs`) |> t_test(logIntensity ~ Condition) |> as_tibble() |> mutate(fdr = p.adjust(p, method="fdr")) |> mutate(significant = if_else(fdr < 0.05, "Significant","Not Significant")) |> select(`Majority protein IDs`, group1,group2, statistic, fdr,significant) |> arrange(fdr) -> q1_vs_q2_t_test head(q1_vs_q2_t_test) ``` Let's see how many hits were significant from this ```{r} q1_vs_q2_t_test |> group_by(significant) |> count() ``` A much smaller number of hits from this. ### Plot hits as heatmap Let's look at these hits as a heatmap ```{r fig.width=5, fig.height=8} q1_vs_q2_t_test |> filter(fdr<0.05) |> pull(`Majority protein IDs`) -> q1_vs_q2_significant_proteins data |> filter (str_detect(Experiment,"ProtTot")) |> filter(`Majority protein IDs` %in% q1_vs_q2_significant_proteins) |> tidy_heatmap( rows=`Majority protein IDs`, columns=Experiment, values = logIntensity, annotation_col = Condition, scale="row", cluster_rows = TRUE, cluster_cols=TRUE, cutree_cols = 2, cutree_rows = 2 ) ``` Those all look very sensible. ### Volcano plot Proteomics people do love a volcano plot, so let's draw one for this data. The t-test result will give us the significance but for a volcano plot we also need a difference so we're going to need to calculate that separately. ```{r} data |> filter (str_detect(Experiment,"ProtTot")) |> group_by(`Majority protein IDs`, Condition) |> summarise(logIntensity = mean(logIntensity)) |> ungroup() |> pivot_wider( names_from=Condition, values_from=logIntensity ) |> mutate(Q1_Q2_difference = Q2_ProtTot - Q1_ProtTot) |> select(`Majority protein IDs`,Q1_Q2_difference) |> right_join(q1_vs_q2_t_test) -> q1_vs_q2_t_test head(q1_vs_q2_t_test) ``` Now we have both the fdr and the difference we can plot this out. ```{r} q1_vs_q2_t_test |> ggplot(aes(x=Q1_Q2_difference, y=fdr, colour=significant)) + geom_point() + scale_y_log10() + scale_y_reverse() + scale_colour_manual(values=c("grey","red2")) + coord_cartesian(xlim=c(-1,1)) ``` ### Highlighting on scatterplot ```{r} data |> filter (str_detect(Experiment,"ProtTot")) |> group_by(`Majority protein IDs`, Condition) |> summarise(logIntensity = mean(logIntensity)) |> ungroup() |> pivot_wider( names_from=Condition, values_from=logIntensity ) |> left_join(q1_vs_q2_t_test |> select(`Majority protein IDs`,significant)) |> arrange(significant) |> ggplot(aes(x=Q1_ProtTot, y=Q2_ProtTot, colour=significant)) + geom_point() + scale_colour_manual(values=c("grey","red2")) ``` ## Moderated T-test (LIMMA) The power of a t-test is limited by the number of replicates, and specifically how accurately you can estimate the variance between biological replicates. LIMMA is an improved version of this type of test, where you can improve the estimate of variance by sharing information between other "similar" proteins. It works where you can find a way to group proteins which should behave similarly together. By default LIMMA assumes that the magnitude of your measurement is a good predictor of variance (as it is for microarrays or RNA-Seq). In proteomics it's a bit more complicated since if you're analysing at the protein level it depends how your summarisation from peptides to proteins was done. In some cases expression level predicts variance, but in others it might be better predicted by the number of peptides which were combined in the summary (if you have that information). ### Check intensity vs variance We can plot out the relationship between intensity and variance to see if this is going to be helpful here. ```{r} data |> filter (Condition == "Q1_ProtTot") |> group_by(`Majority protein IDs`) |> summarise( variance = sd(logIntensity), magnitude = mean(logIntensity) ) |> ungroup() |> filter(magnitude>6.5 & magnitude < 9) |> ggplot(aes(x=magnitude,y=variance)) + geom_point(colour="grey", size=1) + geom_smooth() ``` In this instance we can see that there is no linkage between magnitude of measurement and variance, so LIMMA will not give us biologically meaningfully better results. If we had number of summarised peptides as a predictor then we could test that as a different factor to consider when running the model. We're going to go through the process of running LIMMA, but as shown above the results aren't going to be meaningfully better than a simple t-test. ### LIMMA stats We need to make up a limma dataset from the expression values in the data, and a simple design to specify how we want to compare them. You can build more complex designs but we're going to use the simplest version here. ```{r} data |> filter(str_detect(Condition,"ProtTot")) |> select(`Majority protein IDs`,Experiment,logIntensity) |> pivot_wider( names_from=Experiment, values_from=logIntensity ) |> column_to_rownames("Majority protein IDs") -> limma_data head(limma_data) ``` ```{r} tibble( Experiment = colnames(limma_data), Control = 1, Q1_vs_Q2 = if_else(startsWith(Experiment,"Q1"),1,0) ) |> column_to_rownames("Experiment") -> limma_design limma_design ``` ```{r} lmFit(limma_data,limma_design) -> limma_fit eBayes(limma_fit) -> limma_fit topTable(limma_fit,coef="Q1_vs_Q2", adjust="BH", number = nrow(limma_data)) |> as_tibble(rownames="Majority Protein IDs") -> limma_results head(limma_results) ``` ### Count hits ```{r} limma_results |> mutate(significant = if_else(adj.P.Val<0.05, "Significant","Not Significant")) -> limma_results limma_results |> group_by(significant) |> count() ``` A much higher proportion of hits are found as significant by LIMMA. Given the lack of linkage between variance and magnitude of value it's arguable whether this represents a genuine improvement. Let's look at the results. ### Plot volcano plot ```{r} limma_results |> mutate(significant = if_else(adj.P.Val<0.05, "Significant","Not Significant")) |> ggplot(aes(x=logFC, y=adj.P.Val, colour=significant)) + geom_point() + scale_y_log10() + scale_y_reverse() + coord_cartesian(xlim=c(-1,1)) + scale_colour_manual(values=c("grey","red2")) ``` ### Plot heatmap ```{r fig.width=5, fig.height=8} limma_results |> filter(adj.P.Val < 0.05) |> pull(`Majority Protein IDs`) -> q1_vs_q2_limma_significant_proteins data |> filter (str_detect(Experiment,"ProtTot")) |> filter(`Majority protein IDs` %in% q1_vs_q2_limma_significant_proteins) |> tidy_heatmap( rows=`Majority protein IDs`, columns=Experiment, values = logIntensity, annotation_col = Condition, scale="row", cluster_rows = TRUE, cluster_cols=TRUE, cutree_cols = 2, cutree_rows = 2, show_rownames = FALSE ) ``` You can see the results for this are much noisier than the straight T-Test. Some of the significant hits show high variability between the replicates.