First, let me apologize in advance for posting a problem which might prove to be out of the scope of this forum.
I am currently analysing data from a RNA-seq experiment in bacteria:
The experiment involves 60 RNA samples, corresponding to two groups of bacterial strains (commensal (C) and disease-causing (D)); each group consisting of 15 different strains; either strain treated (IND) and not treated with a chemical (CTR). I am interested in differences in expression between groups (D/C) and between conditions (IND/CTR) within groups.
Samples were stabilized with RNA later before RNA isolation, treated with Ribo-Zero to remove rRNA before library prep with stranded Truseq kit and run on an Illumina Miseq (50 bp single reads).
My pipeline includes aligning reads to the respective genome of each strain using bowtie2, counting reads mapped within CDS using htseq-count (stranded= reverse, minaqual=10), and using edgeR for differential expression analysis. Furthermore, I have restricted the analysis to genes which are found exactly once in each strain ("core transcriptome"), and I remove counts lower than 1 read per million in 15 samples. Library sizes are on average 1,103,694 counts after normalization (TMM), and the average normalization factor is 1.05.
The issue becomes clear when I estimate common, trended and tagvise GLM dispersions in edgeR, resulting in unusual MeanVar and BCV plots like these:
I have tried looking at the groups and conditions separately, all three methods of normalization (TMM, upper quantile and relative log), changing prior.df, removing samples with low RNA quality, removing MDS outliers, removing samples with very high expression of known housekeeping genes, etc. to find an explanation. Basically I am now scratching my head, wondering if anyone else has seen similar dispersion trends, and if anyone has any idea if the cause could be biological, technical or with the analysis?
You should give some indication of what design matrix you are using.