Hi there,
I am here to get advice regarding the cooks filter for a comparison 2 x 3. I already read posts in several forums but i still got doubts because i need to do a "manual" filter since the cooks filter don´t apply here in my situation.
(DESeq2 user guide states: "The results function automatically flags genes which contain a Cook’s distance above a cutoff for samples which have 3 or more replicates. The p values and adjusted p values for these genes are set to NA. At least 3 replicates are required for flagging, as it is difficult to judge which sample might be an outlier with only 2 replicates. This filtering can be turned off with results(dds,cooksCutoff=FALSE).")
I dont feel confortable enough to use a "manual" filter without sharing here and someone give a input of this...
My dataset is
SampleID CellType Generation CellType_Gen 28A Ctrl 1 Ctrl_1 29A Ctrl 1 Ctrl_1 30A Ctrl 1 Ctrl_1 28C Ctrl 500 Ctrl_500 29C Ctrl 500 Ctrl_500 31A Mut 1 Mut_1 32A Mut 1 Mut_1 33A Mut 1 Mut_1 34A Mut 1 Mut_1 37A Mut 1 Mut_1 41A Mut 1 Mut_1 34C Mut 500 Mut_500 37C Mut 500 Mut_500 38C Mut 500 Mut_500 41C Mut 500 Mut_500 45C Mut 500 Mut_500 68C Mut 500 Mut_500
Its important to state i had 3 for Ctrl_500 but one i removed because is a outlier.
And what i did was this:
dds<-DESeqDataSetFromMatrix(countData = table, colData = data, design= ~ CellType_Gen)
dds<-DESeq(dds)
results(dds, name=c("Cell_Gen_Ctrl_500_vs_Ctrl_1"), pAdjustMethod="fdr", lfcThreshold = 0.59, alpha=0.05)
But gives me a DEGs that i am not confortable calling DEGs:
baseMean log2FoldChange lfcSE stat pvalue padj Gene1 3914.910978 -2.29767 0.3983925 -4.286401 1.815913e-05 9.350556e-03 Gene2 106.197308 -29.94073 5.6210867 -5.221540 1.774413e-07 1.202515e-04 Gene3 84.828549 -29.83375 5.6220937 -5.201577 1.976048e-07 1.202515e-04 Gene4 586.986077 -12.48934 2.8471245 -4.179423 2.922500e-05 1.397373e-02 Gene5 6036.720262 -30.00000 5.6208094 -5.232342 1.673753e-07 1.202515e-04 Gene6 40.291988 -29.99999 5.1266506 -5.736687 9.654640e-09 1.615704e-05 Gene7 1.974293 -29.99423 4.0568705 -7.248008 4.229450e-13 1.415597e-09 Gene8 1.865582 -29.92969 4.8053043 -6.105688 1.023586e-09 2.283961e-06 Gene9 5.142519 -29.94403 5.6271704 -5.216481 1.823542e-07 1.202515e-04 Gene10 108.487045 -30.00000 5.6212903 -5.231895 1.677811e-07 1.202515e-04 Gene11 188.233450 -29.96725 5.6210734 -5.226271 1.729627e-07 1.202515e-04 Gene12 132.808155 -29.95709 4.0282970 -7.290200 3.094953e-13 1.415597e-09 Gene13 230.582865 28.20695 5.5955986 4.935477 7.995492e-07 4.460152e-04 Gene14 83.802949 30.00000 5.5596687 5.289884 1.223942e-07 1.202515e-04
The raw counts of this DEGs are:
28A 28C 29A 29C 30A 31A 32A 33A 34A 34C 37A 37C 38C 41A 41C 45C 68C Gene1 6129 924 4520 761 3147 3294 8336 8978 8605 2187 4928 2247 2474 6074 2059 1540 1299 Gene2 0 0 614 0 0 0 0 0 0 0 0 0 0 684 0 0 540 Gene3 0 0 147 0 0 205 0 255 0 0 0 278 255 0 246 0 0 Gene4 1441 0 1131 0 1002 869 0 816 792 682 991 617 685 0 587 657 0 Gene5 18177 0 13712 0 0 0 0 0 0 21031 25555 0 0 0 0 29532 0 Gene6 137 0 0 0 102 0 101 106 0 0 0 64 108 0 0 0 65 Gene7 0 0 0 0 5 0 0 0 4 0 15 0 0 0 6 0 4 Gene8 10 0 0 0 0 0 0 0 6 0 0 9 7 0 0 0 0 Gene9 34 0 0 0 0 0 0 0 0 0 0 0 0 43 0 13 0 Gene10 0 0 620 0 0 0 0 0 0 0 0 0 0 1289 0 0 0 Gene11 0 0 0 0 1940 0 0 0 0 0 1488 0 0 0 0 0 0 Gene12 215 0 257 0 0 153 0 193 0 336 331 323 182 0 0 310 0 Gene13 0 0 0 3 0 0 1187 0 0 0 1300 2 1457 0 0 0 0 Gene14 0 913 0 0 0 0 0 0 0 0 0 0 0 0 444 0 0
So, if you look to the samples 28A, 29A, 30A (Ctrl_1) and 28C, 29C (Ctrl_500) there are some DEGs with counts in only one sample. Thats why i need to apply a manual filter to remove these false positives.
I applied:
mcols(dds)$maxCooks<-apply(assays(dds)[["cooks"]],1,max)
And now, my number of DEGs are 11! But i still have got DEGs with counts in one sample. This step removed the Genes 10,11 and 14.
I am asking if there is other approaches that i can do to "fix" this issue! Or i just do the rest of the analysis (functional analysis) with this 11 DEGs.
Thanks in advance. All the best. Andreia
Thanks for your fast reply :)
Today, at lunch i have got a filtering topic conversation with my collegue :) its seems that not exist a standard steps or the best approach to follow... its understandble because the dataset is different everytime for a different sequencing! We need to adjust the filter to the dataset. But, almost of the times i choose to do a filter:
and then apply the DESeq normalization.
So, I did what you suggest and i end up with 5 DEGs ... Gene 1, 3, 4 and 6. Its away better but still wonder if the Gene3 is a false positive or not.
I understand this filter improved the dataset, but looking again the threads about the filtering settings, sometimes you answer like you did to me:
and other times you answer:
i wonder what is the correct one, given that counts is different from the normalized counts.
Also, if i want to use the CellType instead of Celltype_Gen to compare only Ctrl and Mut the correct filter in this case would be:
Thanks in advance, Andreia