Advances in proteomic profiling have enabled the study of protein regulation and degradation in diseases such as cancer. However, there are major computational challenges, such as how to perform quality control (QC) and normalization, how to identify differential protein abundance at multiple-levels of resolution (e.g. protein complexes), and how to integrate data with other omic technologies. Here, we developed a computational analysis pipeline Model-based Analysis of the Ubiquitin-Proteasome System using R (rMAUPS), which performs computational analysis of proteomics data efficiently and effectively. rMAUPS comprises four significant modules, including quality control (QC), imputation, differential analysis, and integrative analysis.
rMAUPS - 0.1.1
Installing the package in a new environment may take a long time. If the installation fails, please post a new issue here.
To install rMAUPS, you have to first install conda following the document.
You should also install r, r-recommended. If you have R installed before, you need to ensure libgfortran, libnetcdf and libxml2 are also installed in your conda environment. Besides, it seems important to have r-data.table and r-rcpparmadillo installed through conda before biocmanager installs dependencies (such as DESeq2).
$ conda install -c r r r-markdown r-recommended
$ conda install -c anaconda libgfortran libnetcdf libxml2
$ conda install -c conda-forge pandoc r-data.table r-rcpparmadillo
install.packages(c("devtools", "BiocManager"), repos = "https://cloud.r-project.org")
# Install dependencies
BiocManager::install(c("ggpubr", "metap", "ggrepel", "GSVA", "DESeq2", "limma", "impute", "biomaRt", "msigdbr", "BiocStyle", "msmsTests"))
# Install rMAUPS from github
devtools::install_github("WubingZhang/rMAUPS")
The environment should be OK if you can load the required packages successfully.
library(ggplot2)
library(rMAUPS)
The rMAUPS package includes two real LC-MS/MS data files ending with “export_proteins.txt”, which are exported from the Proteome Discoverer software. Here, we will take the two datasets as an example to describe how to analyze the data using the rMAUPS pipeline. Before running the pipeline, the data can be preprocessed into a tidy format using the function normalizeProteomeDiscoverer
.
datapath = system.file("extdata", package = "rMAUPS")
list.files(datapath, pattern = "export_proteins")
## [1] "experiment1_export_proteins.txt" "experiment2_export_proteins.txt"
normalizeProteomeDiscoverer(datapath, output = "./", log2 = TRUE)
## Oct-03-2021 23:21:03 Processing /Library/Frameworks/R.framework/Versions/4.0/Resources/library/rMAUPS/extdata/experiment1_export_proteins.txt
## Oct-03-2021 23:21:03 Processing /Library/Frameworks/R.framework/Versions/4.0/Resources/library/rMAUPS/extdata/experiment2_export_proteins.txt
## [1] TRUE
normdata = normalizeProteomeDiscoverer(file.path(datapath, "experiment1_export_proteins.txt"), log2 = TRUE, return = TRUE)
## Oct-03-2021 23:21:04 Processing /Library/Frameworks/R.framework/Versions/4.0/Resources/library/rMAUPS/extdata/experiment1_export_proteins.txt
head(normdata)
## PSMs F1.126.DMSO F1.127N.DMSO F1.127C.DMSO F1.128N.MLN4924
## AHNAK 1411 7.2505030 7.6678385 7.6800153 7.4199202
## FLNB 557 -0.3907115 -0.1707519 -0.1218003 -0.1352324
## DYNC1H1 410 5.0411237 4.7732085 4.6884241 5.1122591
## AHNAK2 349 5.0226038 5.1310190 5.3911218 4.9934428
## FLNC 504 5.3129906 5.3637808 5.2985534 5.4937850
## SPTAN1 353 -2.3213765 -2.3777361 -2.1824982 -2.5368991
## F1.128C.MLN4924 F1.129N.MLN4924 F1.129C.CSN5i.R F1.130N.CSN5i.R
## AHNAK 7.0438406 7.1233083 7.3832827 7.57337436
## FLNB -0.4055006 -0.2957409 -0.1762043 -0.07383365
## DYNC1H1 6.0084579 5.3129985 5.2337983 4.91458815
## AHNAK2 4.6431495 4.8088745 5.0144311 5.24339380
## FLNC 5.2117384 5.3215400 5.4332368 5.43162251
## SPTAN1 -2.4802570 -2.3631797 -2.3115984 -2.31050604
## F1.130C.CSN5i.R
## AHNAK 7.1545078
## FLNB -0.2216058
## DYNC1H1 5.2747704
## AHNAK2 4.8052963
## FLNC 5.4902761
## SPTAN1 -2.4415415
After preprocessing the datasets, you can run rMAUPS pipeline quickly. The pipeline requires a metadata, which configs the path to the datasets, list of samples and their experimental conditions, and design matrix of the comparisons.
rMAUPS includes a metadata as an example, you can read the file metadata.csv
and check the format of the metadata.
metadata = read.csv(file.path(datapath, "metadata.csv"))
head(metadata)
## Experiment Sample Condition Comparison_MLN4924
## 1 experiment1_normdata.csv F1.126.DMSO DMSO 0
## 2 experiment1_normdata.csv F1.127N.DMSO DMSO 0
## 3 experiment1_normdata.csv F1.127C.DMSO DMSO 0
## 4 experiment1_normdata.csv F1.128N.MLN4924 MLN4924 1
## 5 experiment1_normdata.csv F1.128C.MLN4924 MLN4924 1
## 6 experiment1_normdata.csv F1.129N.MLN4924 MLN4924 1
## Comparison_CSN5i
## 1 0
## 2 0
## 3 0
## 4 NA
## 5 NA
## 6 NA
After configuring the metadata, it’s ready to run the pipeline using one-line command.
MAUPSr(metadata, outdir = "analysis/")
## Or
MAUPSr(system.file("extdata", "metadata.csv", package = "rMAUPS"), outdir = "analysis/")
To better display all the results, we developed a mini shiny app, which includes all the rMAUPS results in a webpage. You can open it by using function view
.
view(outdir = "analysis/")
Input the path to rMAUPS results, e.g. “analysis/” here, click submit
, then all the figure results will be loaded on the webpage. It take seconds to load all the figures, please be patient after clicking submit
.
Besides the quick run of rMAUPS pipeline, you can also perform step-by-step analysis using functions in rMAUPS. You can perform quality control using the function ProteomicsQC
, normalize the proteomics data using normalizeProteomics
, impute the missing values using imputeNA
, perform differetial analysis using DEAnalyze
, and test the differential abundance of protein complexes or pathways using DeComplex
.
To give an example about the quality control and imputation, we randomly assigned 10% values to be NA in the data.
data = as.matrix(normdata[,-1])
meta = metadata[metadata$Experiment=="experiment1_normdata.csv", -1]
rownames(meta) = meta[,1]
simulated = data
idx = sample(1:length(simulated), round(0.1*length(simulated)))
simulated[idx] = NA
qc = ProteomicsQC(simulated, condition = meta[colnames(data), 2], proj.name = "TestQC")
## Warning in data.table::melt(beta, id = NULL): The melt generic in data.table
## has been passed a data.frame and will attempt to redirect to the relevant
## reshape2 method; please note that reshape2 is deprecated, and this redirection
## is now deprecated as well. To continue using melt methods from reshape2 while
## both libraries are attached, e.g. melt.list, you can prepend the namespace like
## reshape2::melt(beta). In the next version, this warning will become an error.
## Warning in data.table::melt(beta, id = NULL): The melt generic in data.table
## has been passed a data.frame and will attempt to redirect to the relevant
## reshape2 method; please note that reshape2 is deprecated, and this redirection
## is now deprecated as well. To continue using melt methods from reshape2 while
## both libraries are attached, e.g. melt.list, you can prepend the namespace like
## reshape2::melt(beta). In the next version, this warning will become an error.
## Warning in if (!class(expr) %in% c("data.frame", "matrix")) {: the condition has
## length > 1 and only the first element will be used
## Warning in melt(beta, id = NULL): The melt generic in data.table has been
## passed a matrix and will attempt to redirect to the relevant reshape2
## method; please note that reshape2 is deprecated, and this redirection is now
## deprecated as well. To continue using melt methods from reshape2 while both
## libraries are attached, e.g. melt.list, you can prepend the namespace like
## reshape2::melt(beta). In the next version, this warning will become an error.
## Warning in data.table::melt(beta, id = NULL): The melt generic in data.table
## has been passed a data.frame and will attempt to redirect to the relevant
## reshape2 method; please note that reshape2 is deprecated, and this redirection
## is now deprecated as well. To continue using melt methods from reshape2 while
## both libraries are attached, e.g. melt.list, you can prepend the namespace like
## reshape2::melt(beta). In the next version, this warning will become an error.
qc$p1
qc$p2
qc$p4
qc$p5
qc$p6
qc$p7
rMAUPS provides a function, normalizeProteomics
, to normalize proteomics data. You can easily normalize the data using multiple optional methods, such as median normalization, median ratio normalization, z-score normalization, quatile normalization and loess normalization.
normalized = normalizeProteomics(simulated, norm = "median", log2 = FALSE)
imputed = imputeNA(normalized)
## Cluster size 7436 broken into 3743 3693
## Done cluster 3743
## Done cluster 3693
plot(imputed[idx], data[idx])
After normalization and imputation, you can perform the quality control analysis again.
rMAUPS provides a integrated function DEAnalyze
to perform differential expression analysis for both RNA-seq data and proteomics data. For RNA-seq data, type = "RNAseq", method = "DESeq2"
is recommended; for label-free proteomics, type = "msms", method = "msms.edgeR"
is suggested; for isobaric labeling-based relative quantification of prpteomics, type = "msms", method = "limma"
is preferred.
deres = DEAnalyze(data, meta[,-1], type = "msms", method = "limma")
## Visualize the results
deres$logFDR = log10(deres$padj)
ScatterView(deres, x = "log2FC", y = "logFDR",
x_cut = c(-0.5,0.5), y_cut = -2,
groups = c("bottomleft", "bottomright"), top = 5)
For proteomics data analysis, the protein complex level analysis is informative. So we design a function DeComplex to combine the differential abundance of proteins into differential level of protein complexes or biological pathways.
res = DeComplex(deres)
head(res$deComplex)
## Description
## C5_BP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS 10 FORMYLTETRAHYDROFOLATE METABOLIC PROCESS
## C5_BP_2_OXOGLUTARATE_METABOLIC_PROCESS 2 OXOGLUTARATE METABOLIC PROCESS
## C5_BP_2FE_2S_CLUSTER_ASSEMBLY 2FE 2S CLUSTER ASSEMBLY
## C5_BP_3_PHOSPHOADENOSINE_5_PHOSPHOSULFATE_BIOSYNTHETIC_PROCESS 3 PHOSPHOADENOSINE 5 PHOSPHOSULFATE BIOSYNTHETIC PROCESS
## C5_BP_3_PHOSPHOADENOSINE_5_PHOSPHOSULFATE_METABOLIC_PROCESS 3 PHOSPHOADENOSINE 5 PHOSPHOSULFATE METABOLIC PROCESS
## C5_BP_3_UTR_MEDIATED_MRNA_DESTABILIZATION 3 UTR MEDIATED MRNA DESTABILIZATION
## Zscore
## C5_BP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS -0.4336778
## C5_BP_2_OXOGLUTARATE_METABOLIC_PROCESS -0.7379380
## C5_BP_2FE_2S_CLUSTER_ASSEMBLY -0.2658648
## C5_BP_3_PHOSPHOADENOSINE_5_PHOSPHOSULFATE_BIOSYNTHETIC_PROCESS -0.1294687
## C5_BP_3_PHOSPHOADENOSINE_5_PHOSPHOSULFATE_METABOLIC_PROCESS -0.2998943
## C5_BP_3_UTR_MEDIATED_MRNA_DESTABILIZATION -0.1804021
## pvalue
## C5_BP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS 0.0057651683
## C5_BP_2_OXOGLUTARATE_METABOLIC_PROCESS 0.0028090347
## C5_BP_2FE_2S_CLUSTER_ASSEMBLY 0.2335643368
## C5_BP_3_PHOSPHOADENOSINE_5_PHOSPHOSULFATE_BIOSYNTHETIC_PROCESS 0.0279594520
## C5_BP_3_PHOSPHOADENOSINE_5_PHOSPHOSULFATE_METABOLIC_PROCESS 0.0051424322
## C5_BP_3_UTR_MEDIATED_MRNA_DESTABILIZATION 0.0003135665
## logP
## C5_BP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS 2.2391880
## C5_BP_2_OXOGLUTARATE_METABOLIC_PROCESS 2.5514429
## C5_BP_2FE_2S_CLUSTER_ASSEMBLY 0.6315935
## C5_BP_3_PHOSPHOADENOSINE_5_PHOSPHOSULFATE_BIOSYNTHETIC_PROCESS 1.5534713
## C5_BP_3_PHOSPHOADENOSINE_5_PHOSPHOSULFATE_METABOLIC_PROCESS 2.2888314
## C5_BP_3_UTR_MEDIATED_MRNA_DESTABILIZATION 3.5036703
res$gobp.p
res$reactome.p
res$gocc.p
res$corum.p
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] rMAUPS_0.1.1 ggplot2_3.3.2 BiocStyle_2.16.1
##
## loaded via a namespace (and not attached):
## [1] readxl_1.3.1 backports_1.1.10
## [3] BiocFileCache_1.12.1 sn_1.6-2
## [5] plyr_1.8.6 GSEABase_1.50.1
## [7] splines_4.0.2 BiocParallel_1.22.0
## [9] TH.data_1.0-10 GenomeInfoDb_1.24.2
## [11] digest_0.6.26 foreach_1.5.1
## [13] htmltools_0.5.0 magick_2.5.0
## [15] magrittr_1.5 memoise_1.1.0
## [17] doParallel_1.0.16 openxlsx_4.2.2
## [19] limma_3.44.3 annotate_1.66.0
## [21] msmsTests_1.26.0 matrixStats_0.57.0
## [23] sandwich_3.0-0 askpass_1.1
## [25] prettyunits_1.1.1 colorspace_1.4-1
## [27] msmsEDA_1.26.0 blob_1.2.1
## [29] rappdirs_0.3.1 ggrepel_0.8.2
## [31] haven_2.3.1 rbibutils_1.3
## [33] xfun_0.18 dplyr_1.0.2
## [35] crayon_1.3.4 RCurl_1.98-1.2
## [37] graph_1.66.0 genefilter_1.70.0
## [39] impute_1.62.0 zoo_1.8-8
## [41] survival_3.2-7 iterators_1.0.13
## [43] glue_1.4.2 gtable_0.3.0
## [45] zlibbioc_1.34.0 XVector_0.28.0
## [47] DelayedArray_0.14.1 car_3.0-10
## [49] BiocGenerics_0.34.0 msigdbr_7.2.1
## [51] abind_1.4-5 scales_1.1.1
## [53] vsn_3.56.0 mvtnorm_1.1-1
## [55] DBI_1.1.0 edgeR_3.30.3
## [57] rstatix_0.6.0 Rcpp_1.0.5
## [59] plotrix_3.7-8 metap_1.4
## [61] mzR_2.22.0 xtable_1.8-4
## [63] progress_1.2.2 tmvnsim_1.0-2
## [65] foreign_0.8-80 bit_4.0.4
## [67] preprocessCore_1.50.0 stats4_4.0.2
## [69] GSVA_1.36.3 httr_1.4.2
## [71] gplots_3.1.0 RColorBrewer_1.1-2
## [73] TFisher_0.2.0 ellipsis_0.3.1
## [75] farver_2.0.3 pkgconfig_2.0.3
## [77] XML_3.99-0.5 dbplyr_1.4.4
## [79] locfit_1.5-9.4 labeling_0.4.2
## [81] reshape2_1.4.4 tidyselect_1.1.0
## [83] rlang_0.4.8 later_1.1.0.1
## [85] AnnotationDbi_1.50.3 munsell_0.5.0
## [87] cellranger_1.1.0 tools_4.0.2
## [89] generics_0.1.0 RSQLite_2.2.1
## [91] mathjaxr_1.0-1 broom_0.7.2
## [93] evaluate_0.14 stringr_1.4.0
## [95] fastmap_1.0.1 mzID_1.26.0
## [97] yaml_2.2.1 knitr_1.30
## [99] bit64_4.0.5 zip_2.1.1
## [101] caTools_1.18.0 purrr_0.3.4
## [103] ncdf4_1.17 mime_0.9
## [105] xml2_1.3.2 biomaRt_2.44.4
## [107] compiler_4.0.2 shinythemes_1.1.2
## [109] curl_4.3 affyio_1.58.0
## [111] ggsignif_0.6.0 tibble_3.0.4
## [113] geneplotter_1.66.0 stringi_1.5.3
## [115] forcats_0.5.0 MSnbase_2.14.2
## [117] lattice_0.20-41 ProtGenerics_1.20.0
## [119] Matrix_1.2-18 multtest_2.44.0
## [121] vctrs_0.3.4 mutoss_0.1-12
## [123] pillar_1.4.6 lifecycle_0.2.0
## [125] BiocManager_1.30.10 Rdpack_2.0
## [127] MALDIquant_1.19.3 data.table_1.13.0
## [129] bitops_1.0-6 gbRd_0.4-11
## [131] qvalue_2.20.0 httpuv_1.5.4
## [133] GenomicRanges_1.40.0 R6_2.4.1
## [135] pcaMethods_1.80.0 affy_1.66.0
## [137] bookdown_0.21 promises_1.1.1
## [139] KernSmooth_2.23-17 rio_0.5.16
## [141] IRanges_2.22.2 codetools_0.2-16
## [143] gtools_3.8.2 MASS_7.3-53
## [145] assertthat_0.2.1 SummarizedExperiment_1.18.2
## [147] openssl_1.4.3 DESeq2_1.28.1
## [149] withr_2.3.0 mnormt_2.0.2
## [151] multcomp_1.4-14 S4Vectors_0.26.1
## [153] GenomeInfoDbData_1.2.3 parallel_4.0.2
## [155] hms_0.5.3 grid_4.0.2
## [157] tidyr_1.1.2 rmarkdown_2.4
## [159] carData_3.0-4 ggpubr_0.4.0
## [161] numDeriv_2016.8-1.1 Biobase_2.48.0
## [163] shiny_1.5.0