Dear all,
I do have a statistical question about the use of limma /edgeR/ DEseq2 in order to assess the differential binding / differential peaks in condition1 versus condition2.
I only have bigwig / bedGraph files of two replicates (condition1) and two replicates (condition2) (I do not have access to bam / fastq files).
Considering these 4 samples, deeptools (MultiWigSummary) (or other tools in BioC) generates a file that contains the average scores per region for each bigWig file.
The file format is :
chr start end s01 s02 s3 s4 s13 s15 s19 s21
1 chr1 858542 864405 0.1214228 0.1374998 0.09338397 1.2103991 0.09819990 0.1288492 0.2059920 0.2368644
2 chr1 867458 869835 0.1994169 0.1824069 0.11373949 0.6112617 0.11488137 0.1379892 0.2067585 0.2780747
From a statistical point of view, is it legitimate to use these average score per region as input into edgeR/ DESe2 /limma in order to assess if a region (chr1 : 858542 - 864405) is differentially bound in condition 1 (samples 01 and 02) and not in condition 2 (samples s3 and s4).
Although I expect that the answer to my question is a huge "no", I still want to ask the question on BioC forum in order to potentially receive any other suggestions / opinions.
How shall I check if these scores follow a negative binomial distribution (or a normal distribution) ?
Could I mathematically transform these average scores in such a way that it fits a NB distribution ?
Is there any other alternative to edgeR/ limma/ DESeq2 (or to T-test) in order to answer the question ?
Thanks a lot,
Bogdan
Hi Yunshun, thank you for your reply. I guess that I could multiply each element of the matrix by 1000 in order to transform those numerical values into integers ?
No, it doesn't make sense to do so. The distribution we assume the data would follow is completely lost during the calculation of the 'average scores'. You really need the raw counts in order to use these packages.
OK, thank you. Would it make sense to use a T-test ?
Generally speaking though, shall I have a list of integers, how shall I verify if those integers fit a NB distribution or a gaussian distribution ?
Which commands in R shall I use ?
Or how can I transform the data in order to make it follow a NB, or a normal distribution (in which case I can use a T-test) ?
As mentioned before, it would be better to have a look at the deeptools pipeline where the 'average score' was introduced, and see how the scores are utilized in the downstream DE analysis. Which test to use depends on how the scores were computed in the first place.
There is no short answer to your general questions (except that integers can't be gaussian distributed).
Hi Yunshun. When you get the chance, would it possible please to mention a few websites that I could use in order to learn how to verify in R if a list of numbers (that may represent gene expression levels, peak intensities, chromatin looping strength, etc) fit a NB distribution or a gaussian distribution (or any other distribution, Poisson, exponential, etc )? Thanks !
Hi Yunshun and Bogdan, I think that it is incorrect to say that limma cannot be used for this type of data. Only DESeq2 and edgeR expect counts (and therefore have the NB assumption). My understanding is that limma, having been built to use with (nearly) continuous micro-array data, can handle continuous data just fine.