Model-based Analysis of Ubiquitin Proteasome System

Wubing Zhang, Collin Tokheim, X. Shirley Liu1

1Dana-Farber Cancer Institute

3 October 2021

Abstract

Advances in proteomic profiling have enabled the study of protein regulation and degradation in diseases such as cancer. However, there are major computational challenges, such as how to perform quality control (QC) and normalization, how to identify differential protein abundance at multiple-levels of resolution (e.g. protein complexes), and how to integrate data with other omic technologies. Here, we developed a computational analysis pipeline Model-based Analysis of the Ubiquitin-Proteasome System using R (rMAUPS), which performs computational analysis of proteomics data efficiently and effectively. rMAUPS comprises four significant modules, including quality control (QC), imputation, differential analysis, and integrative analysis.

Package

rMAUPS - 0.1.1

1 Install required packages
2 LC-MS/MS dataset
3 Quick start
- 3.1 Metadata configuration
- 3.2 Run the pipeline
4 Step-by-step analysis of proteomics
5 Session info

1 Install required packages

Installing the package in a new environment may take a long time. If the installation fails, please post a new issue here.

1.1 Prerequisites

To install rMAUPS, you have to first install conda following the document.

You should also install r, r-recommended. If you have R installed before, you need to ensure libgfortran, libnetcdf and libxml2 are also installed in your conda environment. Besides, it seems important to have r-data.table and r-rcpparmadillo installed through conda before biocmanager installs dependencies (such as DESeq2).

$ conda install -c r r r-markdown r-recommended
$ conda install -c anaconda libgfortran libnetcdf libxml2
$ conda install -c conda-forge pandoc r-data.table r-rcpparmadillo

1.2 Installation dependencies and rMAUPS in R

install.packages(c("devtools", "BiocManager"), repos = "https://cloud.r-project.org")
# Install dependencies
BiocManager::install(c("ggpubr", "metap", "ggrepel", "GSVA", "DESeq2", "limma", "impute", "biomaRt", "msigdbr", "BiocStyle", "msmsTests"))
# Install rMAUPS from github
devtools::install_github("WubingZhang/rMAUPS")

1.3 load required packages

The environment should be OK if you can load the required packages successfully.

library(ggplot2)
library(rMAUPS)

2 LC-MS/MS dataset

The rMAUPS package includes two real LC-MS/MS data files ending with “export_proteins.txt”, which are exported from the Proteome Discoverer software. Here, we will take the two datasets as an example to describe how to analyze the data using the rMAUPS pipeline. Before running the pipeline, the data can be preprocessed into a tidy format using the function normalizeProteomeDiscoverer.

List the two files as follow:

datapath = system.file("extdata", package = "rMAUPS")
list.files(datapath, pattern = "export_proteins")

## [1] "experiment1_export_proteins.txt" "experiment2_export_proteins.txt"

Preprocess all the datasets in the folder

normalizeProteomeDiscoverer(datapath, output = "./", log2 = TRUE)

## Oct-03-2021 23:21:03 Processing /Library/Frameworks/R.framework/Versions/4.0/Resources/library/rMAUPS/extdata/experiment1_export_proteins.txt

## Oct-03-2021 23:21:03 Processing /Library/Frameworks/R.framework/Versions/4.0/Resources/library/rMAUPS/extdata/experiment2_export_proteins.txt

## [1] TRUE

Process one dataset

normdata = normalizeProteomeDiscoverer(file.path(datapath, "experiment1_export_proteins.txt"), log2 = TRUE, return = TRUE)

## Oct-03-2021 23:21:04 Processing /Library/Frameworks/R.framework/Versions/4.0/Resources/library/rMAUPS/extdata/experiment1_export_proteins.txt

head(normdata)

##         PSMs F1.126.DMSO F1.127N.DMSO F1.127C.DMSO F1.128N.MLN4924
## AHNAK   1411   7.2505030    7.6678385    7.6800153       7.4199202
## FLNB     557  -0.3907115   -0.1707519   -0.1218003      -0.1352324
## DYNC1H1  410   5.0411237    4.7732085    4.6884241       5.1122591
## AHNAK2   349   5.0226038    5.1310190    5.3911218       4.9934428
## FLNC     504   5.3129906    5.3637808    5.2985534       5.4937850
## SPTAN1   353  -2.3213765   -2.3777361   -2.1824982      -2.5368991
##         F1.128C.MLN4924 F1.129N.MLN4924 F1.129C.CSN5i.R F1.130N.CSN5i.R
## AHNAK         7.0438406       7.1233083       7.3832827      7.57337436
## FLNB         -0.4055006      -0.2957409      -0.1762043     -0.07383365
## DYNC1H1       6.0084579       5.3129985       5.2337983      4.91458815
## AHNAK2        4.6431495       4.8088745       5.0144311      5.24339380
## FLNC          5.2117384       5.3215400       5.4332368      5.43162251
## SPTAN1       -2.4802570      -2.3631797      -2.3115984     -2.31050604
##         F1.130C.CSN5i.R
## AHNAK         7.1545078
## FLNB         -0.2216058
## DYNC1H1       5.2747704
## AHNAK2        4.8052963
## FLNC          5.4902761
## SPTAN1       -2.4415415

3 Quick start

After preprocessing the datasets, you can run rMAUPS pipeline quickly. The pipeline requires a metadata, which configs the path to the datasets, list of samples and their experimental conditions, and design matrix of the comparisons.

3.1 Metadata configuration

rMAUPS includes a metadata as an example, you can read the file metadata.csv and check the format of the metadata.

metadata = read.csv(file.path(datapath, "metadata.csv"))
head(metadata)

##                 Experiment          Sample Condition Comparison_MLN4924
## 1 experiment1_normdata.csv     F1.126.DMSO      DMSO                  0
## 2 experiment1_normdata.csv    F1.127N.DMSO      DMSO                  0
## 3 experiment1_normdata.csv    F1.127C.DMSO      DMSO                  0
## 4 experiment1_normdata.csv F1.128N.MLN4924   MLN4924                  1
## 5 experiment1_normdata.csv F1.128C.MLN4924   MLN4924                  1
## 6 experiment1_normdata.csv F1.129N.MLN4924   MLN4924                  1
##   Comparison_CSN5i
## 1                0
## 2                0
## 3                0
## 4               NA
## 5               NA
## 6               NA

3.2 Run the pipeline

After configuring the metadata, it’s ready to run the pipeline using one-line command.

MAUPSr(metadata, outdir = "analysis/")
## Or
MAUPSr(system.file("extdata", "metadata.csv", package = "rMAUPS"), outdir = "analysis/")

To better display all the results, we developed a mini shiny app, which includes all the rMAUPS results in a webpage. You can open it by using function view.

view(outdir = "analysis/")

Input the path to rMAUPS results, e.g. “analysis/” here, click submit, then all the figure results will be loaded on the webpage. It take seconds to load all the figures, please be patient after clicking submit.

4 Step-by-step analysis of proteomics

Besides the quick run of rMAUPS pipeline, you can also perform step-by-step analysis using functions in rMAUPS. You can perform quality control using the function ProteomicsQC, normalize the proteomics data using normalizeProteomics, impute the missing values using imputeNA, perform differetial analysis using DEAnalyze, and test the differential abundance of protein complexes or pathways using DeComplex.

4.1 Example dataset

To give an example about the quality control and imputation, we randomly assigned 10% values to be NA in the data.

data = as.matrix(normdata[,-1])
meta = metadata[metadata$Experiment=="experiment1_normdata.csv", -1]
rownames(meta) = meta[,1]
simulated = data
idx = sample(1:length(simulated), round(0.1*length(simulated)))
simulated[idx] = NA

4.2 Data quality control

The distribution of protein abundances across samples.
The principal component analysis.
Pairwise correlation of proteins within the same complexes
The number of detected genes / missing values in each sample.

qc = ProteomicsQC(simulated, condition = meta[colnames(data), 2], proj.name = "TestQC")

## Warning in data.table::melt(beta, id = NULL): The melt generic in data.table
## has been passed a data.frame and will attempt to redirect to the relevant
## reshape2 method; please note that reshape2 is deprecated, and this redirection
## is now deprecated as well. To continue using melt methods from reshape2 while
## both libraries are attached, e.g. melt.list, you can prepend the namespace like
## reshape2::melt(beta). In the next version, this warning will become an error.

## Warning in data.table::melt(beta, id = NULL): The melt generic in data.table
## has been passed a data.frame and will attempt to redirect to the relevant
## reshape2 method; please note that reshape2 is deprecated, and this redirection
## is now deprecated as well. To continue using melt methods from reshape2 while
## both libraries are attached, e.g. melt.list, you can prepend the namespace like
## reshape2::melt(beta). In the next version, this warning will become an error.

## Warning in if (!class(expr) %in% c("data.frame", "matrix")) {: the condition has
## length > 1 and only the first element will be used

## Warning in melt(beta, id = NULL): The melt generic in data.table has been
## passed a matrix and will attempt to redirect to the relevant reshape2
## method; please note that reshape2 is deprecated, and this redirection is now
## deprecated as well. To continue using melt methods from reshape2 while both
## libraries are attached, e.g. melt.list, you can prepend the namespace like
## reshape2::melt(beta). In the next version, this warning will become an error.

## Warning in data.table::melt(beta, id = NULL): The melt generic in data.table
## has been passed a data.frame and will attempt to redirect to the relevant
## reshape2 method; please note that reshape2 is deprecated, and this redirection
## is now deprecated as well. To continue using melt methods from reshape2 while
## both libraries are attached, e.g. melt.list, you can prepend the namespace like
## reshape2::melt(beta). In the next version, this warning will become an error.

qc$p1

qc$p2

qc$p4

qc$p5

qc$p6

qc$p7

4.3 Data normalization

rMAUPS provides a function, normalizeProteomics, to normalize proteomics data. You can easily normalize the data using multiple optional methods, such as median normalization, median ratio normalization, z-score normalization, quatile normalization and loess normalization.

normalized = normalizeProteomics(simulated, norm = "median", log2 = FALSE)

4.4 Imputation

4.4.1 Imputation using KNN

imputed = imputeNA(normalized)

## Cluster size 7436 broken into 3743 3693 
## Done cluster 3743 
## Done cluster 3693

plot(imputed[idx], data[idx])

After normalization and imputation, you can perform the quality control analysis again.

4.5 Differential expression analysis

rMAUPS provides a integrated function DEAnalyze to perform differential expression analysis for both RNA-seq data and proteomics data. For RNA-seq data, type = "RNAseq", method = "DESeq2" is recommended; for label-free proteomics, type = "msms", method = "msms.edgeR" is suggested; for isobaric labeling-based relative quantification of prpteomics, type = "msms", method = "limma" is preferred.

deres = DEAnalyze(data, meta[,-1], type = "msms", method = "limma")
## Visualize the results
deres$logFDR = log10(deres$padj)
ScatterView(deres, x = "log2FC", y = "logFDR", 
            x_cut = c(-0.5,0.5), y_cut = -2, 
            groups = c("bottomleft", "bottomright"), top = 5)

4.6 Differential expression of pathways/complexes

For proteomics data analysis, the protein complex level analysis is informative. So we design a function DeComplex to combine the differential abundance of proteins into differential level of protein complexes or biological pathways.

res = DeComplex(deres)
head(res$deComplex)

##                                                                                                             Description
## C5_BP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS                           10 FORMYLTETRAHYDROFOLATE METABOLIC PROCESS
## C5_BP_2_OXOGLUTARATE_METABOLIC_PROCESS                                                 2 OXOGLUTARATE METABOLIC PROCESS
## C5_BP_2FE_2S_CLUSTER_ASSEMBLY                                                                   2FE 2S CLUSTER ASSEMBLY
## C5_BP_3_PHOSPHOADENOSINE_5_PHOSPHOSULFATE_BIOSYNTHETIC_PROCESS 3 PHOSPHOADENOSINE 5 PHOSPHOSULFATE BIOSYNTHETIC PROCESS
## C5_BP_3_PHOSPHOADENOSINE_5_PHOSPHOSULFATE_METABOLIC_PROCESS       3 PHOSPHOADENOSINE 5 PHOSPHOSULFATE METABOLIC PROCESS
## C5_BP_3_UTR_MEDIATED_MRNA_DESTABILIZATION                                           3 UTR MEDIATED MRNA DESTABILIZATION
##                                                                    Zscore
## C5_BP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS              -0.4336778
## C5_BP_2_OXOGLUTARATE_METABOLIC_PROCESS                         -0.7379380
## C5_BP_2FE_2S_CLUSTER_ASSEMBLY                                  -0.2658648
## C5_BP_3_PHOSPHOADENOSINE_5_PHOSPHOSULFATE_BIOSYNTHETIC_PROCESS -0.1294687
## C5_BP_3_PHOSPHOADENOSINE_5_PHOSPHOSULFATE_METABOLIC_PROCESS    -0.2998943
## C5_BP_3_UTR_MEDIATED_MRNA_DESTABILIZATION                      -0.1804021
##                                                                      pvalue
## C5_BP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS              0.0057651683
## C5_BP_2_OXOGLUTARATE_METABOLIC_PROCESS                         0.0028090347
## C5_BP_2FE_2S_CLUSTER_ASSEMBLY                                  0.2335643368
## C5_BP_3_PHOSPHOADENOSINE_5_PHOSPHOSULFATE_BIOSYNTHETIC_PROCESS 0.0279594520
## C5_BP_3_PHOSPHOADENOSINE_5_PHOSPHOSULFATE_METABOLIC_PROCESS    0.0051424322
## C5_BP_3_UTR_MEDIATED_MRNA_DESTABILIZATION                      0.0003135665
##                                                                     logP
## C5_BP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS              2.2391880
## C5_BP_2_OXOGLUTARATE_METABOLIC_PROCESS                         2.5514429
## C5_BP_2FE_2S_CLUSTER_ASSEMBLY                                  0.6315935
## C5_BP_3_PHOSPHOADENOSINE_5_PHOSPHOSULFATE_BIOSYNTHETIC_PROCESS 1.5534713
## C5_BP_3_PHOSPHOADENOSINE_5_PHOSPHOSULFATE_METABOLIC_PROCESS    2.2888314
## C5_BP_3_UTR_MEDIATED_MRNA_DESTABILIZATION                      3.5036703

res$gobp.p

res$reactome.p

res$gocc.p

res$corum.p

5 Session info

## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS  10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] rMAUPS_0.1.1     ggplot2_3.3.2    BiocStyle_2.16.1
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.3.1                backports_1.1.10           
##   [3] BiocFileCache_1.12.1        sn_1.6-2                   
##   [5] plyr_1.8.6                  GSEABase_1.50.1            
##   [7] splines_4.0.2               BiocParallel_1.22.0        
##   [9] TH.data_1.0-10              GenomeInfoDb_1.24.2        
##  [11] digest_0.6.26               foreach_1.5.1              
##  [13] htmltools_0.5.0             magick_2.5.0               
##  [15] magrittr_1.5                memoise_1.1.0              
##  [17] doParallel_1.0.16           openxlsx_4.2.2             
##  [19] limma_3.44.3                annotate_1.66.0            
##  [21] msmsTests_1.26.0            matrixStats_0.57.0         
##  [23] sandwich_3.0-0              askpass_1.1                
##  [25] prettyunits_1.1.1           colorspace_1.4-1           
##  [27] msmsEDA_1.26.0              blob_1.2.1                 
##  [29] rappdirs_0.3.1              ggrepel_0.8.2              
##  [31] haven_2.3.1                 rbibutils_1.3              
##  [33] xfun_0.18                   dplyr_1.0.2                
##  [35] crayon_1.3.4                RCurl_1.98-1.2             
##  [37] graph_1.66.0                genefilter_1.70.0          
##  [39] impute_1.62.0               zoo_1.8-8                  
##  [41] survival_3.2-7              iterators_1.0.13           
##  [43] glue_1.4.2                  gtable_0.3.0               
##  [45] zlibbioc_1.34.0             XVector_0.28.0             
##  [47] DelayedArray_0.14.1         car_3.0-10                 
##  [49] BiocGenerics_0.34.0         msigdbr_7.2.1              
##  [51] abind_1.4-5                 scales_1.1.1               
##  [53] vsn_3.56.0                  mvtnorm_1.1-1              
##  [55] DBI_1.1.0                   edgeR_3.30.3               
##  [57] rstatix_0.6.0               Rcpp_1.0.5                 
##  [59] plotrix_3.7-8               metap_1.4                  
##  [61] mzR_2.22.0                  xtable_1.8-4               
##  [63] progress_1.2.2              tmvnsim_1.0-2              
##  [65] foreign_0.8-80              bit_4.0.4                  
##  [67] preprocessCore_1.50.0       stats4_4.0.2               
##  [69] GSVA_1.36.3                 httr_1.4.2                 
##  [71] gplots_3.1.0                RColorBrewer_1.1-2         
##  [73] TFisher_0.2.0               ellipsis_0.3.1             
##  [75] farver_2.0.3                pkgconfig_2.0.3            
##  [77] XML_3.99-0.5                dbplyr_1.4.4               
##  [79] locfit_1.5-9.4              labeling_0.4.2             
##  [81] reshape2_1.4.4              tidyselect_1.1.0           
##  [83] rlang_0.4.8                 later_1.1.0.1              
##  [85] AnnotationDbi_1.50.3        munsell_0.5.0              
##  [87] cellranger_1.1.0            tools_4.0.2                
##  [89] generics_0.1.0              RSQLite_2.2.1              
##  [91] mathjaxr_1.0-1              broom_0.7.2                
##  [93] evaluate_0.14               stringr_1.4.0              
##  [95] fastmap_1.0.1               mzID_1.26.0                
##  [97] yaml_2.2.1                  knitr_1.30                 
##  [99] bit64_4.0.5                 zip_2.1.1                  
## [101] caTools_1.18.0              purrr_0.3.4                
## [103] ncdf4_1.17                  mime_0.9                   
## [105] xml2_1.3.2                  biomaRt_2.44.4             
## [107] compiler_4.0.2              shinythemes_1.1.2          
## [109] curl_4.3                    affyio_1.58.0              
## [111] ggsignif_0.6.0              tibble_3.0.4               
## [113] geneplotter_1.66.0          stringi_1.5.3              
## [115] forcats_0.5.0               MSnbase_2.14.2             
## [117] lattice_0.20-41             ProtGenerics_1.20.0        
## [119] Matrix_1.2-18               multtest_2.44.0            
## [121] vctrs_0.3.4                 mutoss_0.1-12              
## [123] pillar_1.4.6                lifecycle_0.2.0            
## [125] BiocManager_1.30.10         Rdpack_2.0                 
## [127] MALDIquant_1.19.3           data.table_1.13.0          
## [129] bitops_1.0-6                gbRd_0.4-11                
## [131] qvalue_2.20.0               httpuv_1.5.4               
## [133] GenomicRanges_1.40.0        R6_2.4.1                   
## [135] pcaMethods_1.80.0           affy_1.66.0                
## [137] bookdown_0.21               promises_1.1.1             
## [139] KernSmooth_2.23-17          rio_0.5.16                 
## [141] IRanges_2.22.2              codetools_0.2-16           
## [143] gtools_3.8.2                MASS_7.3-53                
## [145] assertthat_0.2.1            SummarizedExperiment_1.18.2
## [147] openssl_1.4.3               DESeq2_1.28.1              
## [149] withr_2.3.0                 mnormt_2.0.2               
## [151] multcomp_1.4-14             S4Vectors_0.26.1           
## [153] GenomeInfoDbData_1.2.3      parallel_4.0.2             
## [155] hms_0.5.3                   grid_4.0.2                 
## [157] tidyr_1.1.2                 rmarkdown_2.4              
## [159] carData_3.0-4               ggpubr_0.4.0               
## [161] numDeriv_2016.8-1.1         Biobase_2.48.0             
## [163] shiny_1.5.0