I then created a scatterplot matrix to evaluate the normalization and to see whether data points are symmetrical around the x = y line.
I have two questions about my data:
1. Some samples have a pronounced streak of highly-expressed genes. How should I deal with such genes in these samples?
2. One sample shows a large amount of variation, not only in the comparison with samples of another group but also in comparisons of samples from the same group. How would this affect the differential gene expression analysis?
Hi, I would not log [base 2] transform the normalised counts. Please transform via variance-stabilisation (vst()) or regularised log transform (rlog()), and then re-generate the scatterplot.
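A minimal sketch of that suggestion, assuming DESeq2 is installed. The simulated dataset from makeExampleDESeqDataSet() and the object names (dds, vsd, rld) are stand-ins for the poster's own data and are not taken from the thread.

```r
library(DESeq2)

# Simulated stand-in for the real data: 3000 genes, 10 samples (5 per group)
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 3000, m = 10)

# Transform the counts instead of taking log2 of the normalised counts
vsd <- vst(dds, blind = TRUE)   # variance-stabilising transformation
rld <- rlog(dds, blind = TRUE)  # regularised log transformation

# Scatterplot matrix of the first four samples on each transformed scale
pairs(assay(vsd)[, 1:4], pch = ".", main = "vst")
pairs(assay(rld)[, 1:4], pch = ".", main = "rlog")
```

With real data, dds would be the DESeqDataSet built from the raw count matrix, and all samples could be passed to pairs() if the plot remains readable.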
One sample shows a large amount of variation, not only in the comparison with samples of another group but also in comparisons of samples from the same group. How would this affect the differential gene expression analysis?
Genes may be flagged by the Cook's distance test and/or removed at the independent filtering step, and then be assigned a p-value or q-value (adjusted p-value) of NA, respectively. If you could share some of the actual data that you have, that may help.
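One way to see how many genes end up with NA values is to count them in the results table. This is only a sketch on simulated data, assuming DESeq2 is installed; the variable names are illustrative, and the interpretation of each NA pattern follows the DESeq2 vignette's note on p-values set to NA.

```r
library(DESeq2)

# Simulated stand-in for the real data: 3000 genes, 10 samples (5 per group)
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 3000, m = 10)
dds <- DESeq(dds)
res <- results(dds)

# Genes with zero counts in every sample: baseMean is 0 and all statistics are NA
n_all_zero <- sum(res$baseMean == 0)

# Expressed genes with an NA p-value: flagged as count outliers by Cook's distance
n_cooks <- sum(is.na(res$pvalue) & res$baseMean > 0)

# p-value present but adjusted p-value NA: removed by independent filtering
n_indep <- sum(is.na(res$padj) & !is.na(res$pvalue))

cat("all zero:", n_all_zero, " Cook's outliers:", n_cooks,
    " independent filtering:", n_indep, "\n")
```

summary(res) reports similar outlier and low-count tallies in a single call.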
If in doubt, check the very detailed and helpful DESeq2 vignette.
Thank you very much for your reply. I used rlog() and vst() as well. The same pattern as described in the question is observed with these transformations. I tried to attach my graph, but it looks like it's not possible to attach an image here. Should I post my raw counts? There are more than 13000 genes; how can I share them effectively here? Thank you again.
If you have not pre-filtered your raw count matrix for low-count genes, then that could also produce the observed behaviour.
I filtered for a minimum of 10 reads across all samples. Should I try filtering for a larger number? However, what I see is a pronounced streak of highly-expressed genes for one sample, so it involves a large number of genes for that sample. I'd love to share my data here if that is possible; I just don't know how I could do that.
I just uploaded my graphs to my GitHub page:
With log2: https://github.com/Homap/datascience/blob/master/docs/hteosteo.pdf
With rlog(): https://github.com/Homap/datascience/blob/master/docs/hteosteo_rld.pdf
Thanks for sharing.
You mean that each gene just required a minimum of 10 summed reads across all samples? Perhaps using the mean is better, so, mean > 10.
These artifacts in scatter and MA plots are not entirely unusual. I am going to guess that your sample size is small and that the genes represented by those straight lines have 0 values (and / or constant variance) in one or the other of your sample groups.
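To make the difference between the two filters concrete, and to look for genes that are all zero or constant within one group, something like the following base R sketch could be used. The count matrix, gene names, and group labels are made up for illustration.

```r
# Toy raw count matrix: 6 genes x 10 samples (5 per group); values are made up
counts <- rbind(
  geneA = c(0, 0, 0, 0, 0, 120, 95, 130, 110, 101),  # all zero in group 1
  geneB = c(1, 0, 2, 1, 0, 3, 1, 2, 0, 1),           # low everywhere
  geneC = c(50, 61, 48, 55, 70, 52, 58, 49, 66, 60),
  geneD = c(7, 7, 7, 7, 7, 30, 12, 45, 9, 22),       # constant in group 1
  geneE = c(2, 1, 0, 3, 2, 1, 2, 0, 1, 2),           # low everywhere
  geneF = c(300, 280, 310, 295, 305, 0, 0, 0, 0, 0)  # all zero in group 2
)
colnames(counts) <- paste0("s", 1:10)
group <- factor(rep(c("g1", "g2"), each = 5))

keep_sum  <- rowSums(counts) >= 10  # at least 10 reads summed over all samples
keep_mean <- rowMeans(counts) > 10  # mean count above 10

filtered <- counts[keep_mean, , drop = FALSE]

# Per-group checks on the filtered genes: all zeros, or zero variance
all_zero_in_group <- sapply(levels(group), function(g)
  rowSums(filtered[, group == g, drop = FALSE]) == 0)
const_in_group <- sapply(levels(group), function(g)
  apply(filtered[, group == g, drop = FALSE], 1, var) == 0)

suspect <- rownames(filtered)[rowSums(all_zero_in_group | const_in_group) > 0]
print(suspect)
```

In this toy example the summed-reads filter keeps all six genes, whereas the mean filter drops the two genes that are low everywhere. The genes that are zero or constant within a group remain and are the ones listed in suspect.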
Hi Kevin,
Oh, I see, definitely, mean > 10 sounds much more reasonable. Yes, I only have 5 replicates per group. Great suggestion, I will check those genes for their read counts and variance.
Thanks a lot again.
Sure, let me know if you get anywhere with it. I would start by literally outputting (to the terminal) the rows from your data that contain the lowest expression values based on mean.
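As a rough sketch of that first step in base R: the simulated matrix below is only a placeholder for the real raw counts, and the choice of 10 rows is arbitrary.

```r
# Placeholder count matrix: 200 genes x 10 samples of simulated counts
set.seed(42)
counts <- matrix(rnbinom(200 * 10, mu = 50, size = 0.5), nrow = 200,
                 dimnames = list(paste0("gene", 1:200), paste0("s", 1:10)))

# Mean count per gene across all samples
gene_means <- rowMeans(counts)

# The 10 genes with the lowest mean, printed with their per-sample counts
lowest <- counts[head(order(gene_means), 10), ]
print(cbind(lowest, mean = round(gene_means[rownames(lowest)], 1)))
```

With a DESeq2 object, the raw matrix would presumably come from counts(dds) rather than being simulated.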