Question

tximeta countsFromAbundance for downstream analysis

0

Entering edit mode

paulimer • 0

@02e21485

Last seen 2.3 years ago

France

Hi all,

In trying to perform a downstream analysis of human RNA-seq with WGCNA (with part of this dataset), I was trying to extract counts after importation of Salmon quants with tximeta. As I wanted to implement advice from this paper , my goal was to extract counts with minimal normalization, to then perform the upper quartile normalization recommanded in the aforementioned paper.

As I've seen in the tximport vignette that it is not recommended to pass the original counts without an offset, I used the option countsFromAbundance = "scaledTPM" to implement the bias corrected counts without an offset solution, instead of the default option. However, I wanted to be sure (very basically) that the two options were quite close, as they are in the F1000 paper in real datasets (if I'm not mistaken, there, the default option corresponds to simplesum, the "scaledTPM" to scaledTPM). I thus calculated the two options (code below), and then inspected the correlation.


# summarise to Gene
se <- tximeta(coldata)
gse <- summarizeToGene(se)
gse_scaled <- summarizeToGene(se, countsFromAbundance = "scaledTPM")

raw_counts <- assay(gse, "counts")
counts_scaled <- assay(gse_scaled, "counts")

# Those two matrix have the same structure
all(colnames(raw_counts) == colnames(counts_scaled))
all(rownames(raw_counts) == rownames(counts_scaled))

# Indices just to be sure
diag(cor(raw_counts,
         counts_scaled[rownames(raw_counts), colnames(raw_counts)]))

# Output : 
# 
# SRR7798788 SRR7798789 SRR7798790 SRR7798791 SRR7798792 SRR7798793 SRR7798794 
# 0.6237471  0.6316028  0.6290649  0.6309516  0.6269719  0.6345217  0.6351552 
# SRR7798795 SRR7798796 
# 0.6314391  0.6204099

As you can see, the result is not what I (wrongly?) expected.

I also wanted to check if this was somehow linked to a wrong usage of tximeta, so I used tximport and got the exact same results. I then used the option "lengthScaledTPM", and got results closer to my expectations :

# Same code as before except the option changed
# Output :
# SRR7798788 SRR7798789 SRR7798790 SRR7798791 SRR7798792 SRR7798793 SRR7798794 
# 0.9999231  0.9998466  0.9999702  0.9999487  0.9999728  0.9999512  0.9997092 
# SRR7798795 SRR7798796 
# 0.9999280  0.9998970

I don't understand the big difference in correlation : even though the samples could exhibit differential transcript usage, I didn't expect it to be that important, since all the samples are from the same cell line.

> sessionInfo()
R version 4.0.5 (2021-03-31)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora 34 (Workstation Edition)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libflexiblas.so.3.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] tximport_1.18.0             ensembldb_2.14.1           
 [3] AnnotationFilter_1.14.0     GenomicFeatures_1.42.3     
 [5] AnnotationDbi_1.52.0        AnnotationHub_2.22.1       
 [7] BiocFileCache_1.14.0        dbplyr_2.1.1               
 [9] SummarizedExperiment_1.20.0 Biobase_2.50.0             
[11] GenomicRanges_1.42.0        GenomeInfoDb_1.26.7        
[13] IRanges_2.24.1              S4Vectors_0.28.1           
[15] BiocGenerics_0.36.1         MatrixGenerics_1.2.1       
[17] matrixStats_0.60.0          magrittr_2.0.1             
[19] tximeta_1.8.5              

loaded via a namespace (and not attached):
  [1] colorspace_2.0-2              ellipsis_0.3.2               
  [3] dynamicTreeCut_1.63-1         htmlTable_2.2.1              
  [5] XVector_0.30.0                base64enc_0.1-3              
  [7] rstudioapi_0.13               DT_0.18                      
  [9] bit64_4.0.5                   interactiveDisplayBase_1.28.0
 [11] fansi_0.5.0                   xml2_1.3.2                   
 [13] codetools_0.2-18              splines_4.0.5                
 [15] doParallel_1.0.16             cachem_1.0.5                 
 [17] impute_1.64.0                 knitr_1.33                   
 [19] jsonlite_1.7.2                Formula_1.2-4                
 [21] Rsamtools_2.6.0               WGCNA_1.70-3                 
 [23] cluster_2.1.2                 GO.db_3.12.1                 
 [25] png_0.1-7                     shiny_1.6.0                  
 [27] readr_2.0.0                   BiocManager_1.30.16          
 [29] compiler_4.0.5                httr_1.4.2                   
 [31] backports_1.2.1               lazyeval_0.2.2               
 [33] assertthat_0.2.1              Matrix_1.3-4                 
 [35] fastmap_1.1.0                 later_1.2.0                  
 [37] htmltools_0.5.1.1             prettyunits_1.1.1            
 [39] tools_4.0.5                   gtable_0.3.0                 
 [41] glue_1.4.2                    GenomeInfoDbData_1.2.4       
 [43] dplyr_1.0.7                   rappdirs_0.3.3               
 [45] Rcpp_1.0.7                    vctrs_0.3.8                  
 [47] Biostrings_2.58.0             preprocessCore_1.52.1        
 [49] rtracklayer_1.50.0            iterators_1.0.13             
 [51] xfun_0.24                     fastcluster_1.2.3            
 [53] stringr_1.4.0                 mime_0.11                    
 [55] lifecycle_1.0.0               XML_3.99-0.6                 
 [57] zlibbioc_1.36.0               scales_1.1.1                 
 [59] vroom_1.5.3                   ProtGenerics_1.22.0          
 [61] hms_1.1.0                     promises_1.2.0.1             
 [63] RColorBrewer_1.1-2            yaml_2.2.1                   
 [65] curl_4.3.2                    memoise_2.0.0                
 [67] gridExtra_2.3                 ggplot2_3.3.5                
 [69] biomaRt_2.46.3                rpart_4.1-15                 
 [71] latticeExtra_0.6-29           stringi_1.7.3                
 [73] RSQLite_2.2.7                 BiocVersion_3.12.0           
 [75] foreach_1.5.1                 checkmate_2.0.0              
 [77] BiocParallel_1.24.1           rlang_0.4.11                 
 [79] pkgconfig_2.0.3               bitops_1.0-7                 
 [81] lattice_0.20-44               purrr_0.3.4                  
 [83] GenomicAlignments_1.26.0      htmlwidgets_1.5.3            
 [85] bit_4.0.4                     tidyselect_1.1.1             
 [87] R6_2.5.0                      generics_0.1.0               
 [89] Hmisc_4.5-0                   DelayedArray_0.16.3          
 [91] DBI_1.1.1                     withr_2.4.2                  
 [93] pillar_1.6.2                  foreign_0.8-81               
 [95] survival_3.2-11               RCurl_1.98-1.3               
 [97] nnet_7.3-16                   tibble_3.1.3                 
 [99] crayon_1.4.1                  utf8_1.2.2                   
[101] tzdb_0.1.2                    jpeg_0.1-9                   
[103] progress_1.2.2                grid_4.0.5                   
[105] data.table_1.14.0             blob_1.2.2                   
[107] digest_0.6.27                 xtable_1.8-4                 
[109] httpuv_1.6.1                  openssl_1.4.4                
[111] munsell_0.5.0                 askpass_1.1

tximeta tximport • 1.7k views

ADD COMMENT • link 3.7 years ago paulimer • 0

score 1 · Answer 1 · 2021-08-03

The counts will not be closely correlated to the TPM across genes, because the gene length is included in the gene count but not in the TPM. I'm not sure what you're looking for with that correlation analysis. The important thing is that, either via (1) counts with offset in a GLM, and (2) use of scaledTPM or lengthScaledTPM without an offset in a GLM or LM, both ways will control for changes to gene length across samples per gene when making comparisons across samples (e.g. calculation of LFC).

The difference between your analysis and the figure you link to, is that you are looking at correlation of expression across genes, while the analysis from the paper is looking at correlation of log2 fold change across genes.