Dear all,
I'd like your opinion on the best way to compare two technical replicates.
First, I give you some preliminary information on the experiment. I isolated total RNA from the same sample (mammalian cell line) by usining the same kit (Qiagen) but two different protocols for DNA elimination (using the Qiagen gDNA eliminator column or the Qiagen DNase). Looking at the RNA at the TapeStation, the first protocol led to the complete loss of the RNA 28s subunit. This was weird...so I decided to go deeper.
As the final goal is the gene expression analysis (DE), RNA-seq libraries for these two samples were constructed and sequenced. I obtained a different coverage for the two libraries: 28 M reads and 44 M reads. Most probably due to the flowcell balancing that was not optimal (as declared by the service provider).
Now I'd like to know if the two libraries are similar (theorically they should be identical, but we know that this is impossible!), and which is the best one. I checked the number of expressed genes counting the genes with a minimum of x raw reads, but this is cearly impacted by the coverage. So the 44M llibrary showed a higher number of genes.
Do you have any suggestion to evaluate the similarity between the two samples and to determine which is the best one?
Thank you all Best Marianna
Hi Gordon Smyth ,
thank you for your reply.
I used the following code:
But I got the following error:
It seems impossible with two samples. Isnt'it?
The plotMD is weird. Points should be "packed" around 1, if samples are very similar. isnt'it?
Thank you for your suggestions
Marianna
My suggestion was to fit an edgeR model with just an intercept, not a two-group model as you have tried to do. You need
The MD plot shows that the first sample has fairly high counts for lots of genes that are completely absent in the second sample. More generally, the first sample has similar representation to the second for most genes but reduced representation for sizeable minority of genes compared to the second.
Correction
I should have said simply
design <- matrix(1,2,1)
.Thnk you for your reply, Gordon Smyth
I tried with
but it gives the following error. I tried in several ways but I didn't work. Why?? This is the error:
As far as the plotMD, don't you think that sample 2 has fairly high counts for lots of genes that are completely absent in the sample 1? This should be explained by the different coverage of the two RNAseq libraries (test 1: about 28M reads; test 2: more than 40M reads). Isn't it? I attach the two plots below.
Thank you for your time!
Marianna
As to the design, now I think the code is ok
I got 0.550701, thus BCV is 0.74. Really high....
The result is clear. Inter-replicate dispersion is huge. Replicates are not equivalent, not even close. Second replicate is better in that it gives more complete coverage. The difference is not explainable merely by coverage.
If this has helped you and answered your question, you could consider clicking on the "thumbs up" icon next to my answer above.
Thank you Gordon Smyth
I got the point. Just the last question: how can you say that the difference between the two samples is not explanable merely by coverge? The fact the some genes (most probably those with low expression level), are represented only in sample 2, cannot be attributed to coverage? I mean...it's just a matter of probability to catch a gene. And the higher is the coverage the higer is the probabability. Isn't it?
I have to choose one of the two replicas, to establish the best protocol for RNA isolation. And the doubt is the following: sample 2 is better than sample 1 just because of the coverage, or there are some other characteristics that concurr to attribute to sample 2 the best score?
Thank you again for your valuable advices. Best Marianna
If the difference in coverage was merely due to chance and sequencing depth then the BCV would be equal to 0. The BCV already adjusts for chance variation associated with sequencing. It represents the difference in true expression levels after sequencing variability has been removed, and it is independent of sequencing depth.
It is practically impossible that you can get so many genes highly expressed in one sample and completely absent in the other merely by chance.
Gordon Smyth sorry for annoying you...this is really the last one!
When you say "so many genes", what do you mean?
I counted the number of genes expressed (reads > 10 or >20) after TMM normalization. These are the results:
Sample 1 Raw reads: 28.540.170 Genes >10 reads: 11.499 Genes >20 reads: 10.696
Sample 2 Raw reads: 44.560.925 Genes >10 reads: 13.041 Genes >20 reads: 12.169
Do you think these differences (about 1500 genes) reflect more than a different coverage?
Thank you again
Marianna
Look at the MD plot. Why do you think there are so many points below the 0-line in first MD plot and above the 0-line in the second plot? Especially look at the diagonal line below the first plot, which is made up of genes with count=0 in the first sample. Many of these have very high counts in the second sample, as can be seen from their position in the plot.
And as I have said before, the high BCV tells you rigorously that this imbalance is not due to chance or sampling depth.
I don't think that gives you any useful information. When I said "expressed" I meant literally count > 0. When I said "completely absent" I meant count=0.
I believe that I have given you advice that clearly elucidates the difference between your samples. If you prefer to do your own analysis that is fine. But I will not respond any more in this thread.