I am having trouble working out which values I should use to call genes “significantly differentially expressed”.
I have been asked to attempt this using two methods.
Method 1 is the pipeline used in our lab: mapping with HISAT2 -> normalizing with cuffnorm -> getting FPKM values -> computing the log2FC of those and performing a t-test for p-values -> adjusting the p-values with the BH method -> keeping only genes with p-value < .05 and log2FC > 2.
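For reference, the last steps of the lab pipeline described above (log2FC, per-gene t-test, BH adjustment, thresholding) can be sketched in base R. This is only a rough sketch under assumptions: the FPKM matrix is simulated toy data, the pseudocount of 1 and the Welch t-test are choices not stated in the post, and the filter is applied to the adjusted p-value and to the absolute log2FC.

```r
# Toy FPKM matrix: 6 genes x 6 samples (made-up values, for illustration only).
set.seed(1)
fpkm <- matrix(rexp(36, rate = 0.1), nrow = 6,
               dimnames = list(paste0("gene", 1:6),
                               c(paste0("ctrl", 1:3), paste0("trt", 1:3))))
group <- factor(rep(c("ctrl", "trt"), each = 3))

# Work on log2(FPKM + 1); the pseudocount is an assumption to avoid log2(0).
logexpr <- log2(fpkm + 1)

# log2 fold change = difference of the group means on the log2 scale.
log2fc <- rowMeans(logexpr[, group == "trt"]) -
  rowMeans(logexpr[, group == "ctrl"])

# One Welch t-test per gene on the log2 values.
pval <- apply(logexpr, 1, function(x) {
  t.test(x[group == "trt"], x[group == "ctrl"])$p.value
})

# Benjamini-Hochberg adjustment of the p-values.
padj <- p.adjust(pval, method = "BH")

res <- data.frame(log2fc, pval, padj)
significant <- res[res$padj < 0.05 & abs(res$log2fc) > 2, ]
```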
I feel like selecting a change greater than 2 is arbitrary because it doesn’t take into account that even a subtle change in gene expression can have a large effect on biological function. Unless log2FC just means it’s been scaled down so that a noticeable difference must be greater than |2|; in that case, fine, I would make sure they are > |2|.
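To make the scale concrete: log2FC is the base-2 logarithm of the expression ratio, so a cutoff of |log2FC| > 2 corresponds to more than a 4-fold change in either direction. A tiny illustration with made-up ratios:

```r
# Hypothetical treatment/control expression ratios.
fold_change <- c(0.25, 0.5, 1, 2, 4, 8)

# The same ratios on the log2 scale.
log2fc <- log2(fold_change)

# Which ratios would pass a |log2FC| > 2 cutoff?
passes <- abs(log2fc) > 2
data.frame(fold_change, log2fc, passes)
```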
The other method that I was told to use was edgeR. I input raw read counts into edgeR -> calcNormFactors -> estimate dispersion -> filter out lowly expressed genes -> glmQLFit -> glmQLFTest -> and use topTags to see which genes are DE, which I noticed are sorted by FDR value. At the same time I noticed that edgeR reports logFC rather than log2FC. Well, which one am I supposed to use? Wouldn’t the resulting p-values and FDR values change depending on whether they were calculated using log2FC or logFC?
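For reference, the edgeR steps listed above can be sketched as follows. This assumes the edgeR package is installed and uses simulated counts with a hypothetical two-group design; filtering is done before normalization here, which is the order commonly shown in the edgeR documentation rather than the order given in the post. In the topTags table, the logFC column is on the log2 scale and the FDR column holds BH-adjusted p-values by default.

```r
library(edgeR)

# Toy raw counts: 200 genes x 6 samples (simulated, for illustration only).
set.seed(1)
counts <- matrix(rnbinom(200 * 6, mu = 100, size = 10), nrow = 200,
                 dimnames = list(paste0("gene", 1:200), paste0("s", 1:6)))
group <- factor(rep(c("ctrl", "trt"), each = 3))
design <- model.matrix(~ group)

y <- DGEList(counts = counts, group = group)
keep <- filterByExpr(y, design)          # drop lowly expressed genes
y <- y[keep, , keep.lib.sizes = FALSE]
y <- calcNormFactors(y)                  # TMM normalization factors
y <- estimateDisp(y, design)             # estimate dispersions
fit <- glmQLFit(y, design)               # quasi-likelihood GLM fit
qlf <- glmQLFTest(fit, coef = 2)         # test trt vs ctrl

# Full results table, sorted by p-value.
tt <- topTags(qlf, n = Inf)$table
head(tt)
```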
I don’t understand why my lab would use log2FC from FPKM to calculate p-values and then adjust them, or why that is better or worse than edgeR, which uses logFC to calculate p-values and FDR.
My question, I guess, is which of the two methods is better? Should I calculate the log2FC from edgeR’s logFC, perform a t-test, and adjust the p-values myself? Will that give me the Q-value, the adjusted p-value, or the FDR value, or are all three the same thing?
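On the terminology, one thing that can be checked directly in base R is that p.adjust treats "fdr" as an alias for "BH", so a BH-adjusted p-value and an "FDR" value computed this way are the same number. The p-values below are made up for illustration, and this sketch says nothing about other q-value estimators.

```r
# Hypothetical raw p-values for five genes.
p <- c(0.001, 0.008, 0.039, 0.041, 0.20)

# Benjamini-Hochberg adjusted p-values.
padj_bh <- p.adjust(p, method = "BH")

# "fdr" is an alias for "BH" in p.adjust, so the results are identical.
padj_fdr <- p.adjust(p, method = "fdr")
identical(padj_bh, padj_fdr)
```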
I have not found a clear answer anywhere. I have read so much documentation that my brain hurts, and I feel like I’m asking my coworkers the same questions over and over again, only to realize they don’t really understand it themselves. Please do not just refer me to links, because I have gone through many and been left with the same questions. If you don’t want to provide actual input, then please don’t bother commenting unless the link truly explains it perfectly; I would much rather someone attempt to explain it in layman's terms. Thank you in advance.
The F1000Research edgeR QL article is also available as a Bioconductor workflow:
From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline (May 2018)
This version is updated to the current release of Bioconductor and has a few other minor updates or simplifications compared to the 2016 journal article version.
Steve Lianoglou, can you please guide me on doing differential gene expression analysis using machine learning algorithms? I will be using a particle swarm intelligence algorithm (PSO) for feature selection, and after that I will use an SVM to classify cancer-related genes from RNA-seq data.
Can you please tell me what input I should give to the PSO algorithm: p-values, Z-scores, or just the normalized data? I normalized the data using edgeR.
Hi Maryam,
In short: no, I really can't. It sounds like you are at the early stages of a long journey on a project that needs a lot of care (whether it's worthwhile, I'll leave that up to you to decide :-)
If you are asking me how to transform RNA-seq data so that it is "best used" as features for some machine learning algorithm, then the most general / naive answer I can give you is to either:
cpm(dgelist, prior.count = 5, log = TRUE)
; or
These approaches will transform your data into log2 space and try to reduce the higher variance observed at lower levels of expression. You want to do this because most algorithms assume that your data isn't heteroscedastic.
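A minimal sketch of the cpm() transform mentioned above, assuming edgeR is installed. The counts are simulated, and the final transpose reflects an assumption that the downstream ML tool expects samples in rows and genes in columns.

```r
library(edgeR)

# Toy raw counts: 50 genes x 4 samples (simulated, for illustration only).
set.seed(1)
counts <- matrix(rnbinom(50 * 4, mu = 50, size = 5), nrow = 50,
                 dimnames = list(paste0("gene", 1:50), paste0("s", 1:4)))
dgelist <- calcNormFactors(DGEList(counts = counts))

# Moderated log2 counts-per-million; the prior count damps the
# variability of log values for genes with low counts.
logcpm <- cpm(dgelist, prior.count = 5, log = TRUE)

# Samples in rows, genes in columns (assumed layout for the ML step).
features <- t(logcpm)
dim(features)
```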
You will also likely want to remove lowly expressed / low variance features (genes) prior to feeding these data into some ML algorithm.
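One rough way to do this filtering on a log-scale expression matrix, in base R. The matrix is simulated and both cutoffs (min_mean_expr, top_n_variable) are hypothetical values chosen for illustration, not recommendations.

```r
# Toy log-CPM matrix: 100 genes x 6 samples (simulated, for illustration only).
set.seed(1)
logcpm <- matrix(rnorm(100 * 6, mean = 5, sd = 2), nrow = 100,
                 dimnames = list(paste0("gene", 1:100), paste0("s", 1:6)))

# Hypothetical cutoffs; tune them to your own data.
min_mean_expr <- 1     # minimum mean log-CPM for a gene to be kept
top_n_variable <- 20   # number of most variable genes to keep

# Step 1: drop lowly expressed genes.
expressed <- logcpm[rowMeans(logcpm) > min_mean_expr, , drop = FALSE]

# Step 2: keep the most variable of the remaining genes.
gene_var <- apply(expressed, 1, var)
n_keep <- min(top_n_variable, nrow(expressed))
keep <- names(sort(gene_var, decreasing = TRUE))[seq_len(n_keep)]
filtered <- expressed[keep, , drop = FALSE]
dim(filtered)
```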
... and ... that's all I got for you.
Good luck!