I used Salmon for quasi-alignment-based quantification of my RNAseq data (mouse, mm10, raw fastq-files). I imported the generated quant.sf data files into R for use with edgeR, following the manual here: https://bioconductor.org/packages/devel/bioc/vignettes/tximport/inst/doc/tximport.html
Here is my code:
# Importing the files with tximport txi.salmon <- tximport(files, type = "salmon", tx2gene = tx2gene) head(txi.salmon$counts) #edgeR import cts <- txi.salmon$counts rownames(cts) <- gsub(rownames(cts), pattern = "\\.[0-9]+$", replacement = "") # Remove transcript version .** normMat <- txi.salmon$length normMat <- normMat/exp(rowMeans(log(normMat))) o <- log(calcNormFactors(cts/normMat)) + log(colSums(cts/normMat)) y <- DGEList(cts) y$offset <- t(t(log(normMat)) + o) # y is now ready for estimate dispersion functions see edgeR User's Guide
My question is now whether I can proceed with the generated y object as I would usually do, namely in the following way:
# Pre-filtering edgeR y$samples keep <- rowSums(cpm(y) > 1) >= 2 # Outfilter low-count/unexpressed genes (more than 1 CPM in at least 2 libraries; smallest group size is 3) summary(keep) y <- y[keep, keep.lib.sizes=FALSE] # Re-calculate library sizes after count reductions y <- calcNormFactors(y, method = "TMM") # Calculate the library normalization factors y <- estimateDisp(y, design, robust = TRUE) fit <- glmQLFit(y, design, robust = TRUE)
Is it correct to filter in this way and calculate the library normalization factors as I do here? Does edgeR "know" to use the y$offset information that was put in?
I am unsure as calcNormFactors was already called in the tximport part before.
sessionInfo() R version 3.4.2 (2017-09-28) Platform: x86_64-pc-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core) Matrix products: default BLAS/LAPACK: /scicore/soft/apps/OpenBLAS/0.2.13-GCC-4.8.4-LAPACK-3.5.0/lib/libopenblas_prescottp-r0.2.13.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base other attached packages: [1] RSQLite_2.0 STRINGdb_1.18.0 rafalib_1.0.0 edgeR_3.20.9 [5] limma_3.34.9 biomaRt_2.34.2 tximport_1.6.0 GenomicFeatures_1.30.0 [9] AnnotationDbi_1.40.0 Biobase_2.38.0 GenomicRanges_1.30.0 GenomeInfoDb_1.14.0 [13] IRanges_2.12.0 S4Vectors_0.16.0 BiocGenerics_0.24.0 tximportData_1.6.0 loaded via a namespace (and not attached): [1] Rcpp_0.12.13 locfit_1.5-9.1 lattice_0.20-35 [4] prettyunits_1.0.2 png_0.1-7 Rsamtools_1.30.0 [7] Biostrings_2.46.0 gtools_3.5.0 assertthat_0.2.0 [10] digest_0.6.12 chron_2.3-51 R6_2.2.2 [13] plyr_1.8.4 sqldf_0.4-11 httr_1.3.1 [16] gplots_3.0.1 zlibbioc_1.24.0 rlang_0.1.4 [19] progress_1.1.2 curl_3.0 gdata_2.18.0 [22] blob_1.1.0 Matrix_1.2-12 hash_2.2.6 [25] gsubfn_0.7 proto_1.0.0 splines_3.4.2 [28] RMySQL_0.10.13 BiocParallel_1.12.0 statmod_1.4.30 [31] readr_1.1.1 stringr_1.2.0 igraph_1.1.2 [34] RCurl_1.95-4.8 bit_1.1-12 DelayedArray_0.4.1 [37] compiler_3.4.2 rtracklayer_1.38.0 pkgconfig_2.0.1 [40] tcltk_3.4.2 SummarizedExperiment_1.8.0 tibble_1.3.4 [43] GenomeInfoDbData_0.99.1 matrixStats_0.52.2 XML_3.98-1.9 [46] GenomicAlignments_1.14.1 bitops_1.0-6 grid_3.4.2 [49] DBI_0.7 magrittr_1.5 KernSmooth_2.23-15 [52] stringi_1.1.6 XVector_0.18.0 rjson_0.2.15 [55] RColorBrewer_1.1-2 tools_3.4.2 bit64_0.9-7 [58] hms_0.4.2 plotrix_3.6-6 yaml_2.1.14 [61] caTools_1.17.1 memoise_1.1.0
Thanks, Aaron. Would I use
scaleOffset()
like this:Or should I do this:
Are you sure scaleOffset() is necessary here, as I am unsure whether the calculations in the tximport bit are taking care of this already:
The latter call
scaleOffset(y, offset)
would be sufficient.As for your second question: possibly not strictly necessary, as it seems that the addition of the o serves to scale the offsets already. However, using
scaleOffset
does no harm and protects your code against future changes to the name of theoffset
field.Perfect, thanks a lot.
Thanks Aaron, I can take a look and if it seems like we're doing the same thing outside of an edgeR function, I'll just swap it out for scaleOffset.
I think this function came into edgeR in Bioc 3.4 and the tximport vignette code is from Bioc 3.3, but I didn't realize that I could swap it out until now.
I don't think it does the same thing, as
scaleOffset
will add/subtract a constant value per-row so that the row-average of the offsets is equal to the average log-library size. The tximport-related code above will alter the offset for each library with+ o
, so it's not a like-for-like replacement for that step.Nonetheless, I'd still recommend using
scaleOffset
over a direct assignment to$offset
, if for no other reason than the fact that users don't have to remember whether to use$offset
or$offsets
.Ok so then I’ll just update how we insert the thing we had already so it’s more edgeR API friendly. Thanks