Hello,
I am preparing for RNA-seq data analysis using DESeq2.
Following the instructions in the vignete, I started by aligning RNA-seq reads to the genome using STAR, and then used the following code to count reads per gene, for only one bam file:
filenames <- c("/path/to/sorted.bam")
bamfiles <- BamFileList(filenames)
txdb <- makeTxDbFromGFF("/path/to/ann.gtf", circ_seqs = character())
ebg <- exonsBy(txdb, by="gene")
se <- summarizeOverlaps(features = ebg, reads = bamfiles, mode = 'Union', singleEnd = TRUE)
Read counts per gene seem very low. For example, when I run sum(assay(se))
, I only get ~2 million reads. This is surprising since the bam file contains ~59 million alignments. I ran Qualimap with the same bam and GTF, which resulted in ~44 million reads mapping to genes, mostly in exons.
Does anyone have an idea what could cause this problem? Or maybe how to debug it, e.g, how can I find out how many reads were discarded and why? How come Rsamtools ends up with only a fraction of the reads mapped to genes found by Qualimap?
Thanks!
I would add that James's answer assumes your RNA-seq libraries were prepared with a stranded protocol, such as Illumina's TruSeq Stranded mRNA, which is the default setting for the argument
ignore.strand=FALSE
in the call tosummarizeOverlaps()
. However, if your RNA-seq libraries were prepared with a non-stranded protocol, such as Illumina's TruSeq RNA Library Prep Kit v2, then you should setignore.strand=TRUE
and you don't need the argumentpreprocess.reads
. You can find more information on usingsummarizeOverlaps()
with single-end RNA-seq data prepared with stranded and non-stranded protocols in this other thread.Thanks a lot! This solved the problem.
I think this argument is necessary not because the alignment was performed with STAR, but because the library is reverse-strand-specific. I've actually tried setting ignore.strand=FALSE, but this didn't solve the problem.