Hi all,
In trying to perform a downstream analysis of human RNA-seq with WGCNA (with part of this dataset), I was trying to extract counts after importation of Salmon quants with tximeta
. As I wanted to implement advice from this paper , my goal was to extract counts with minimal normalization, to then perform the upper quartile normalization recommanded in the aforementioned paper.
As I've seen in the tximport
vignette that it is not recommended to pass the original counts without an offset, I used the option countsFromAbundance = "scaledTPM"
to implement the bias corrected counts without an offset solution, instead of the default option.
However, I wanted to be sure (very basically) that the two options were quite close, as they are in the F1000 paper in real datasets (if I'm not mistaken, there, the default option corresponds to simplesum, the "scaledTPM"
to scaledTPM). I thus calculated the two options (code below), and then inspected the correlation.
# summarise to Gene
se <- tximeta(coldata)
gse <- summarizeToGene(se)
gse_scaled <- summarizeToGene(se, countsFromAbundance = "scaledTPM")
raw_counts <- assay(gse, "counts")
counts_scaled <- assay(gse_scaled, "counts")
# Those two matrix have the same structure
all(colnames(raw_counts) == colnames(counts_scaled))
all(rownames(raw_counts) == rownames(counts_scaled))
# Indices just to be sure
diag(cor(raw_counts,
counts_scaled[rownames(raw_counts), colnames(raw_counts)]))
# Output :
#
# SRR7798788 SRR7798789 SRR7798790 SRR7798791 SRR7798792 SRR7798793 SRR7798794
# 0.6237471 0.6316028 0.6290649 0.6309516 0.6269719 0.6345217 0.6351552
# SRR7798795 SRR7798796
# 0.6314391 0.6204099
As you can see, the result is not what I (wrongly?) expected.
I also wanted to check if this was somehow linked to a wrong usage of tximeta
, so I used tximport
and got the exact same results. I then used the option "lengthScaledTPM"
, and got results closer to my expectations :
# Same code as before except the option changed
# Output :
# SRR7798788 SRR7798789 SRR7798790 SRR7798791 SRR7798792 SRR7798793 SRR7798794
# 0.9999231 0.9998466 0.9999702 0.9999487 0.9999728 0.9999512 0.9997092
# SRR7798795 SRR7798796
# 0.9999280 0.9998970
I don't understand the big difference in correlation : even though the samples could exhibit differential transcript usage, I didn't expect it to be that important, since all the samples are from the same cell line.
> sessionInfo()
R version 4.0.5 (2021-03-31)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora 34 (Workstation Edition)
Matrix products: default
BLAS/LAPACK: /usr/lib64/libflexiblas.so.3.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] tximport_1.18.0 ensembldb_2.14.1
[3] AnnotationFilter_1.14.0 GenomicFeatures_1.42.3
[5] AnnotationDbi_1.52.0 AnnotationHub_2.22.1
[7] BiocFileCache_1.14.0 dbplyr_2.1.1
[9] SummarizedExperiment_1.20.0 Biobase_2.50.0
[11] GenomicRanges_1.42.0 GenomeInfoDb_1.26.7
[13] IRanges_2.24.1 S4Vectors_0.28.1
[15] BiocGenerics_0.36.1 MatrixGenerics_1.2.1
[17] matrixStats_0.60.0 magrittr_2.0.1
[19] tximeta_1.8.5
loaded via a namespace (and not attached):
[1] colorspace_2.0-2 ellipsis_0.3.2
[3] dynamicTreeCut_1.63-1 htmlTable_2.2.1
[5] XVector_0.30.0 base64enc_0.1-3
[7] rstudioapi_0.13 DT_0.18
[9] bit64_4.0.5 interactiveDisplayBase_1.28.0
[11] fansi_0.5.0 xml2_1.3.2
[13] codetools_0.2-18 splines_4.0.5
[15] doParallel_1.0.16 cachem_1.0.5
[17] impute_1.64.0 knitr_1.33
[19] jsonlite_1.7.2 Formula_1.2-4
[21] Rsamtools_2.6.0 WGCNA_1.70-3
[23] cluster_2.1.2 GO.db_3.12.1
[25] png_0.1-7 shiny_1.6.0
[27] readr_2.0.0 BiocManager_1.30.16
[29] compiler_4.0.5 httr_1.4.2
[31] backports_1.2.1 lazyeval_0.2.2
[33] assertthat_0.2.1 Matrix_1.3-4
[35] fastmap_1.1.0 later_1.2.0
[37] htmltools_0.5.1.1 prettyunits_1.1.1
[39] tools_4.0.5 gtable_0.3.0
[41] glue_1.4.2 GenomeInfoDbData_1.2.4
[43] dplyr_1.0.7 rappdirs_0.3.3
[45] Rcpp_1.0.7 vctrs_0.3.8
[47] Biostrings_2.58.0 preprocessCore_1.52.1
[49] rtracklayer_1.50.0 iterators_1.0.13
[51] xfun_0.24 fastcluster_1.2.3
[53] stringr_1.4.0 mime_0.11
[55] lifecycle_1.0.0 XML_3.99-0.6
[57] zlibbioc_1.36.0 scales_1.1.1
[59] vroom_1.5.3 ProtGenerics_1.22.0
[61] hms_1.1.0 promises_1.2.0.1
[63] RColorBrewer_1.1-2 yaml_2.2.1
[65] curl_4.3.2 memoise_2.0.0
[67] gridExtra_2.3 ggplot2_3.3.5
[69] biomaRt_2.46.3 rpart_4.1-15
[71] latticeExtra_0.6-29 stringi_1.7.3
[73] RSQLite_2.2.7 BiocVersion_3.12.0
[75] foreach_1.5.1 checkmate_2.0.0
[77] BiocParallel_1.24.1 rlang_0.4.11
[79] pkgconfig_2.0.3 bitops_1.0-7
[81] lattice_0.20-44 purrr_0.3.4
[83] GenomicAlignments_1.26.0 htmlwidgets_1.5.3
[85] bit_4.0.4 tidyselect_1.1.1
[87] R6_2.5.0 generics_0.1.0
[89] Hmisc_4.5-0 DelayedArray_0.16.3
[91] DBI_1.1.1 withr_2.4.2
[93] pillar_1.6.2 foreign_0.8-81
[95] survival_3.2-11 RCurl_1.98-1.3
[97] nnet_7.3-16 tibble_3.1.3
[99] crayon_1.4.1 utf8_1.2.2
[101] tzdb_0.1.2 jpeg_0.1-9
[103] progress_1.2.2 grid_4.0.5
[105] data.table_1.14.0 blob_1.2.2
[107] digest_0.6.27 xtable_1.8-4
[109] httpuv_1.6.1 openssl_1.4.4
[111] munsell_0.5.0 askpass_1.1
Thank you very much! I think I got confused between gene lengths and transcript lengths.