Tximport of quasi-alignment-based RNAseq quantification (Salmon) for use with edgeR
2
0
Entering edit mode
@martinweihrauch-13664
Last seen 6.6 years ago

I used Salmon for quasi-alignment-based quantification of my RNAseq data (mouse, mm10, raw fastq-files). I imported the generated quant.sf data files into R for use with edgeR, following the manual here: https://bioconductor.org/packages/devel/bioc/vignettes/tximport/inst/doc/tximport.html

Here is my code: 

# Importing the files with tximport

txi.salmon <- tximport(files, type = "salmon", tx2gene = tx2gene)

head(txi.salmon$counts)


#edgeR import
cts <- txi.salmon$counts
rownames(cts) <- gsub(rownames(cts), pattern = "\\.[0-9]+$", replacement = "") # Remove transcript version .**
normMat <- txi.salmon$length
normMat <- normMat/exp(rowMeans(log(normMat)))
o <- log(calcNormFactors(cts/normMat)) + log(colSums(cts/normMat))
y <- DGEList(cts)
y$offset <- t(t(log(normMat)) + o)
# y is now ready for estimate dispersion functions see edgeR User's Guide

 

My question is now whether I can proceed with the generated y object as I would usually do, namely in the following way:

 

# Pre-filtering edgeR

y$samples
keep <- rowSums(cpm(y) > 1) >= 2 # Outfilter low-count/unexpressed genes (more than 1 CPM in at least 2 libraries; smallest group size is 3)
summary(keep)
y <- y[keep, keep.lib.sizes=FALSE] # Re-calculate library sizes after count reductions
y <- calcNormFactors(y, method = "TMM") # Calculate the library normalization factors

y <- estimateDisp(y, design, robust = TRUE)
fit <- glmQLFit(y, design, robust = TRUE)

 

Is it correct to filter in this way and calculate the library normalization factors as I do here? Does edgeR "know" to use the y$offset information that was put in?

I am unsure as calcNormFactors was already called in the tximport part before. 

 

 

 

 

sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /scicore/soft/apps/OpenBLAS/0.2.13-GCC-4.8.4-LAPACK-3.5.0/lib/libopenblas_prescottp-r0.2.13.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] RSQLite_2.0            STRINGdb_1.18.0        rafalib_1.0.0          edgeR_3.20.9          
 [5] limma_3.34.9           biomaRt_2.34.2         tximport_1.6.0         GenomicFeatures_1.30.0
 [9] AnnotationDbi_1.40.0   Biobase_2.38.0         GenomicRanges_1.30.0   GenomeInfoDb_1.14.0   
[13] IRanges_2.12.0         S4Vectors_0.16.0       BiocGenerics_0.24.0    tximportData_1.6.0    

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.13               locfit_1.5-9.1             lattice_0.20-35           
 [4] prettyunits_1.0.2          png_0.1-7                  Rsamtools_1.30.0          
 [7] Biostrings_2.46.0          gtools_3.5.0               assertthat_0.2.0          
[10] digest_0.6.12              chron_2.3-51               R6_2.2.2                  
[13] plyr_1.8.4                 sqldf_0.4-11               httr_1.3.1                
[16] gplots_3.0.1               zlibbioc_1.24.0            rlang_0.1.4               
[19] progress_1.1.2             curl_3.0                   gdata_2.18.0              
[22] blob_1.1.0                 Matrix_1.2-12              hash_2.2.6                
[25] gsubfn_0.7                 proto_1.0.0                splines_3.4.2             
[28] RMySQL_0.10.13             BiocParallel_1.12.0        statmod_1.4.30            
[31] readr_1.1.1                stringr_1.2.0              igraph_1.1.2              
[34] RCurl_1.95-4.8             bit_1.1-12                 DelayedArray_0.4.1        
[37] compiler_3.4.2             rtracklayer_1.38.0         pkgconfig_2.0.1           
[40] tcltk_3.4.2                SummarizedExperiment_1.8.0 tibble_1.3.4              
[43] GenomeInfoDbData_0.99.1    matrixStats_0.52.2         XML_3.98-1.9              
[46] GenomicAlignments_1.14.1   bitops_1.0-6               grid_3.4.2                
[49] DBI_0.7                    magrittr_1.5               KernSmooth_2.23-15        
[52] stringi_1.1.6              XVector_0.18.0             rjson_0.2.15              
[55] RColorBrewer_1.1-2         tools_3.4.2                bit64_0.9-7               
[58] hms_0.4.2                  plotrix_3.6-6              yaml_2.1.14               
[61] caTools_1.17.1             memoise_1.1.0  
salmon edger tximport rnaseq • 1.5k views
ADD COMMENT
2
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 5 hours ago
The city by the bay

I don't use tximport so I can't speak for that particular section of your code, but otherwise, relevant edgeR functions will automatically detect and use whatever is present in the $offset field. However, we advise using scaleOffset when assigning an offset matrix to a DGEList. This simply adjusts the mean offset per gene so that the offsets are interpretable on the same scale as the log-library sizes, which ensures that the log-fold changes from glmQLFTest are not distorted.

It is also fine to re-compute the normalization factors here. If an offset matrix is available in the DGEList, the normalization factors will not be used to compute the offsets for GLM fitting, so there is no chance of "double normalization".

ADD COMMENT
0
Entering edit mode

Thanks, Aaron. Would I use scaleOffset() like this:

y$offset <- scaleOffset(y = y, offset = t(t(log(normMat)) + o))
# y is now ready for estimate dispersion functions see edgeR User's Guide

 

Or should I do this:

offset <- t(t(log(normMat)) + o)
y <- scaleOffset(y, offset)

Are you sure scaleOffset() is necessary here, as I am unsure whether the calculations in the tximport bit are taking care of this already:

normMat <- txi.salmon$length
normMat <- normMat/exp(rowMeans(log(normMat)))
o <- log(calcNormFactors(cts/normMat)) + log(colSums(cts/normMat))
y <- DGEList(cts)

 

 

ADD REPLY
0
Entering edit mode

The latter call scaleOffset(y, offset) would be sufficient.

As for your second question: possibly not strictly necessary, as it seems that the addition of the o serves to scale the offsets already. However, using scaleOffset does no harm and protects your code against future changes to the name of the offset field.

ADD REPLY
0
Entering edit mode

Perfect, thanks a lot.

ADD REPLY
0
Entering edit mode

Thanks Aaron, I can take a look and if it seems like we're doing the same thing outside of an edgeR function, I'll just swap it out for scaleOffset.

I think this function came into edgeR in Bioc 3.4 and the tximport vignette code is from Bioc 3.3, but I didn't realize that I could swap it out until now.

ADD REPLY
0
Entering edit mode

I don't think it does the same thing, as scaleOffset will add/subtract a constant value per-row so that the row-average of the offsets is equal to the average log-library size. The tximport-related code above will alter the offset for each library with + o, so it's not a like-for-like replacement for that step.

Nonetheless, I'd still recommend using scaleOffset over a direct assignment to $offset, if for no other reason than the fact that users don't have to remember whether to use $offset or $offsets.

ADD REPLY
0
Entering edit mode

Ok so then I’ll just update how we insert the thing we had already so it’s more edgeR API friendly. Thanks

ADD REPLY
0
Entering edit mode
@james-w-macdonald-5106
Last seen 2 days ago
United States

It shouldn't matter if you run calcNormFactors. There is a function in edgeR called getOffset:

> getOffset
function (y)
{
    offset <- y$offset
    lib.size <- y$samples$lib.size
    norm.factors <- y$samples$norm.factors
    if (!is.null(offset)) {
        return(offset)
    }
    else {
        if (!is.null(norm.factors))
            lib.size <- lib.size * norm.factors
        return(log(lib.size))
    }
}
<bytecode: 0x00000000201ae308>
<environment: namespace:edgeR>

And as you can see, if there is an offset matrix already, that will be used. Otherwise an offset will be computed using the library size and normalization factors.

ADD COMMENT
0
Entering edit mode

So that is how it works! Thanks for your help.

ADD REPLY

Login before adding your answer.

Traffic: 886 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6