I have RNA-Seq data with a paired design: each tissue has a treated and an untreated sample. I used DESeq2's normTransform()
to calculate the pairwise fold changes. So I started with 60 samples, and I now have a fold-change matrix with 30 columns covering all genes.
GeneID  sample1_T  sample1_U  sample2_T  sample2_U  . . .  sample30_U
gene1   n          n          n          n          . . .
gene2   n          n          n          n          . . .
FC Matrix:
GeneID  sample1  sample2  . . . .  sample30
gene1   -3.11    -1.3     . . . .  -0.5
gene2   3.12     1.12     . . . .  0.5
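For reference, here is a minimal sketch of how a per-pair matrix like this could be derived from a log2-scale matrix. It assumes the columns are named sampleN_T / sampleN_U as in the layout above; `log_mat` and its toy values are hypothetical stand-ins for something like `assay(normTransform(dds))`, and this is not necessarily how the matrix in the post was built.

```r
# Toy log2-scale matrix standing in for assay(normTransform(dds)).
# All values and object names here are hypothetical.
set.seed(1)
pairs <- paste0("sample", 1:3)
cols <- as.vector(rbind(paste0(pairs, "_T"), paste0(pairs, "_U")))
log_mat <- matrix(rnorm(2 * length(cols), mean = 8), nrow = 2,
                  dimnames = list(c("gene1", "gene2"), cols))

# Per-pair log2 fold change: treated minus untreated on the log2 scale.
treated   <- log_mat[, paste0(pairs, "_T"), drop = FALSE]
untreated <- log_mat[, paste0(pairs, "_U"), drop = FALSE]
fc <- treated - untreated
colnames(fc) <- pairs
fc
```

Because the input is already on the log2 scale, the subtraction gives a log2 fold change per pair, one column per tissue.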
Now I would like to filter the data to find, for example, genes that are consistently up- or down-regulated in at least 80% (or 60%) of the samples, genes that do not show any consistent up/down-regulation, etc.
I would like to calculate a statistical score that reflects both the number of samples in which a gene is up- or down-regulated and the extent of the fold change, so that I can use that score to filter the matrix.
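To make the idea concrete, here is a rough sketch of one possible descriptive score: the fraction of samples that support the dominant direction, reported alongside the median log2 fold change. The cutoff `lfc_cut`, the toy matrix, and all object names are assumptions for illustration. This is a heuristic filter, not a test statistic with a known null distribution.

```r
# fc: genes x samples matrix of log2 fold changes (toy, hypothetical values).
fc <- rbind(
  gene1 = c(-3.1, -1.3, -2.0, -0.9, -1.5),  # consistently down
  gene2 = c( 3.1,  1.1,  0.1, -0.2,  0.0),  # up in only a few samples
  gene3 = c( 0.2, -0.1,  0.3, -0.3,  0.1)   # no clear direction
)
lfc_cut <- 0.5  # assumed minimum |log2FC| for a sample to count as regulated

# Fraction of samples passing the cutoff in each direction.
frac_up   <- rowMeans(fc >=  lfc_cut)
frac_down <- rowMeans(fc <= -lfc_cut)

# Fraction of samples supporting the dominant direction.
consistency <- pmax(frac_up, frac_down)
direction <- ifelse(consistency == 0, "none",
                    ifelse(frac_up >= frac_down, "up", "down"))

# Extent of change; the median is less sensitive to a few extreme samples.
median_lfc <- apply(fc, 1, median)

summary_tab <- data.frame(direction, consistency, median_lfc)
keep_80 <- rownames(fc)[consistency >= 0.8]  # genes supported by >= 80% of samples
summary_tab
keep_80
```

Changing 0.8 to 0.6 gives the looser filter, and genes with a low `consistency` are the ones without any consistent regulation.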
I did a DE analysis using the paired design and tried to use the adjusted p-value for filtering, but I do not know if I can use it to filter according to my criteria.
I'll just leave a comment instead of an answer, and leave it open for anyone else to answer. I read your post but I don't have any concrete suggestions for you. If you want to derive a new statistic and then estimate its distributional properties, you might want to consider collaborating with a statistician.
I am trying to do this because, in my data, the DE of some genes is supported by only a few samples that show strong regulation. So I am trying to filter the data such that the DE is supported by 90% or 80% of the samples, etc., and I am checking whether there is a way to calculate a score for this.
For example, the heat map of fold changes shows genes (marked with a white box) that have high FC values in only a subset of samples.
Note that the standard significance test is of the null hypothesis (no differences) which will eventually be rejected even if a subset of the pairs show differences.
This is similar to a t-test where you compare, say, rnorm(2*n, 0) with c(rnorm(n, 0), rnorm(n, 1)). The second group is a two-component mixture, so the model is not well specified, but the t-test will eventually reject the null even though half the samples in the second group are distributed identically to the first group. Null hypothesis testing can only get you so far.

It sounds to me like this might be a case for clustering or other unsupervised machine learning techniques. The null hypothesis test p-value only gives a limited amount of information here, whereas the data can be mined in a richer way.
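The t-test comparison described above can be simulated directly. This is a small sketch; the seed and the value of `n` are arbitrary choices, not taken from the post.

```r
set.seed(42)
n <- 200

x <- rnorm(2 * n, 0)              # reference group: all samples from N(0, 1)
y <- c(rnorm(n, 0), rnorm(n, 1))  # half unchanged, half shifted by 1

# The test compares group means only, so with enough samples it rejects
# the null even though half of y comes from the same distribution as x.
p_all <- t.test(x, y)$p.value
p_all

# The unchanged half of y, compared with x on its own.
p_unchanged <- t.test(x, y[1:n])$p.value
p_unchanged
```

The small p-value for the full comparison says only that the means differ; it does not say what fraction of the samples carries the difference.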