I have single ended RNAseq reads from an allopolyploid organism. This means that I will have groups of 2,3, or even 4 genes with high sequence homology among them. I am sure some reads will map to more than one gene.
So far, my pipeline is
- Map reads to genome using tophat2, with default options (i.e., no -M or -g)
- count using htseq as
htseq-count -m intersection-strict --stranded=no --idattr=Name -f bam tophat/accepted_hits.bam annotation.gff3 > counts - Use counts in DESeq2.
I am not confident that it is differentiating the homologs "correctly". First of all, tophat2 is built on top of bowtie2, and it seems bowtie2 does not map multi-reads uniformly by default. See: https://www.biostars.org/p/55237/
So once the counts matrix is formed, DESeq2 just uses that to get the p-values.
The other option I have is to use RSEMS, but I do not have knowledge of splice variants and DESeq can only take count data, which RSEMS does not provide.
What is the DESeq2 recommended pipeline that handles multi-mapped reads well (i.e., differentiates between genes with homology)?
If I use RSEMS, what should I use for differential expression analysis? They recommend ebseq but I am not sure how it handles batch effects and the like. Their vignette doesn't mention replicates in the design matrix.
Do you actually want to distinguish between expression of these duplicated genes? Would you be happy with merging closely-related paralogs into a single "logical gene" instead? Then you won't have to care about multi-mapping reads, because it won't affect the counting.
I second Ryan's comments. Depending on the nature of the duplicates, unique alignment may be quite difficult with many paralogs, as any particular part of the gene might be duplicated between paralogs. If you enforce unique alignment in Bowtie2 - either by searching for the absence of the "XS" tag or by filtering on the MAPQ scores - you might end up with very low numbers of reads for each paralog.
I do want to differentiate between the paralogs. Or, I want to be able to show what % of reads map to multiple genes and therefore it is imperative that they be collapsed into "logical genes" for the purpose of gene expression. This is the alignment summary from tophat for one condition:
Reads:
Input : 134145314
Mapped : 121434603 (90.5% of input)
of these: 18881757 (15.5%) have multiple alignments (995 have >20)
90.5% overall read mapping rate.
It seems that the number of multi-mapped reads is not very high (or is it?). What does this mean for my pipeline?
Well, whether it's high or not depends on how many genes are affected. If only a number of genes have paralogs, then 15.5% might be a lot if all of the multi-mapping reads are concentrated around those genes.
If you want to quantify the expression of distinct paralogs, then I would suggest that you filter out multi-mapping reads prior to counting. This will ensure that the estimation of the expression of each gene is not being confounded by its paralogs. It may also result in zero counts for some genes where reads cannot be assigned to a single paralog. This is a fundamental problem of short read sequencing technologies, and I don't think it can be easily fixed in silico.
On the other hand, if two genes are so close in sequence that no reads can be accurately assigned to one or the other, you'd probably be justified in resolving the ambiguity by treating them as just a single gene.
What about http://deweylab.biostat.wisc.edu/rsem/ ?
From what I understand, RSEM will give you expected counts for each gene in each library. Despite being non-integer values, these can be naturally accommodated in
edgeR
's GLM framework without any additional work. As to the specifics - you're probably better asking the RSEM authors on how to use it for your application. I've never used RSEM, and this isn't the right forum to discuss it.If I remove all multi-mapped reads, are the estimates of differential expression still valid. If certain regions of a gene are copies, then all reads mapping exlcusively to them will be ignored. Since this will affect all condtions equally, I am assuming that the estimate of log fold change should be ok. Do you think this the correct assessment for DESeq?
Yes, removing all multi-mapped reads will give accurate results, but you will lose some power to detect differential expression for genes with closely-related paralogs. For genes that are almost exactly duplicated along their full length, you may discard almost all reads for those genes and have no statistical power. You should definitely compare your counts with and without discarding multimapped reads to see if there are any problem genes.