We have a time course infection experiment with 12 samples (2 technical replicates and 2 biological replicates per sample). The samples were sequenced on a HiSeq over 2 lanes, 1 x 50 bp reads.
- 24hr_infection1_lane3
- 24hr_infection1_ lane4
- 24hr_infection2_lane3
- 24hr_infection2_lane4
- 24hr_control1_lane3
- 24hr_control1_lane4
- 24hr_control2_lane3
- 24hr_control2_lane4
- 48hr_infection1_lane3
- 48hr_infection1_lane4
- 48hr_infection2_lane3
- 48hr_infection2_lane4
I aligned the samples using Tophat and then converted the .bam files to .sam files and created a counts file using HTSeq for each sample. My questions now are:
- Should the technical replicates be collapsed (add gene counts for technical replicates)?
- How should the design matrix be setup in edgeR?
Thanks very much for your input.
Based on your response, we are thinking that we should redo the alignment using bowtie2 and use the .sam output for creating a counts file using HTSeq for each sample and then collapse technical replicates. Yes, you are correct in assuming that we will compare both time points to the 24 hr control since we do not have an additional control at 48 hrs.
To be clear, if one is only interested in differential gene expression between treatment groups, then the Tuxedo suite tools is not a good choice? Ideally, one would use a different aligner (BWA, Bowtie, STAR, etc) and then use either edgeR or DESeq2?
I am saying that Tophat is probably not the tool you want for the task at hand. It uses bowtie to do the alignment, and then goes back and tries to find junction splicing reads. If you don't care about the junction splicing reads (which are in a different file than your aligned reads), then you are doing an extra step, which takes that much more time, and then not using the data that you took the extra time to generate. I don't think that the accepted-hits.bam file contains the junction-spanning reads, so you may be losing those.
Using bowtie2 for the aligner is helpful because it is a gapped aligner, so you shouldn't lose reads that span exon-exon junctions, which you might lose with bowtie1. IIRC, bwa and star are gapped aligners as well. Subread just soft-clips the junction spanning reads, but that is a distinction without a difference. You will still get the same number of reads per gene (because subread says 'the read aligns to the end of this exon', and the gapped aligners say 'the read aligns to the ends of these two exons', but both result in that read being counted as overlapping a gene).
So to answer your question, yes, ideally you would just use one of the gapped aligners that you mention and then use either edgeR or DESeq2 to test for differential expression.
And I forgot to mention that you can just use bowtie2 (or pretty much any aligner) to align the technical replicates together, so you don't have to collapse after the alignment. You just do something like
or if you have paired end
This is really helpful -- thank you!