Question

Deseq2 cooks filter for 2 replicates x 3 replicates

0

Entering edit mode

andreia ▴ 10

@andreia-23745

Last seen 3.1 years ago

Portugal

Hi there,

I am here to get advice regarding the cooks filter for a comparison 2 x 3. I already read posts in several forums but i still got doubts because i need to do a "manual" filter since the cooks filter don´t apply here in my situation.

(DESeq2 user guide states: "The results function automatically flags genes which contain a Cook’s distance above a cutoff for samples which have 3 or more replicates. The p values and adjusted p values for these genes are set to NA. At least 3 replicates are required for flagging, as it is difficult to judge which sample might be an outlier with only 2 replicates. This filtering can be turned off with results(dds,cooksCutoff=FALSE).")

I dont feel confortable enough to use a "manual" filter without sharing here and someone give a input of this...

My dataset is

 SampleID CellType    Generation  CellType_Gen 
 28A        Ctrl         1          Ctrl_1
 29A        Ctrl         1          Ctrl_1 
 30A        Ctrl         1          Ctrl_1 
 28C        Ctrl       500          Ctrl_500
 29C        Ctrl       500          Ctrl_500
 31A        Mut          1          Mut_1
 32A        Mut          1          Mut_1
 33A        Mut          1          Mut_1
 34A        Mut          1          Mut_1
 37A        Mut          1          Mut_1
 41A        Mut          1          Mut_1
 34C        Mut        500          Mut_500
 37C        Mut        500          Mut_500
 38C        Mut        500          Mut_500
 41C        Mut        500          Mut_500
 45C        Mut        500          Mut_500
 68C        Mut        500          Mut_500

Its important to state i had 3 for Ctrl_500 but one i removed because is a outlier.

And what i did was this:


dds<-DESeqDataSetFromMatrix(countData = table, colData = data, design= ~ CellType_Gen)

dds<-DESeq(dds)

results(dds, name=c("Cell_Gen_Ctrl_500_vs_Ctrl_1"), pAdjustMethod="fdr", lfcThreshold = 0.59, alpha=0.05)

But gives me a DEGs that i am not confortable calling DEGs:

            baseMean           log2FoldChange     lfcSE          stat               pvalue         padj
Gene1     3914.910978          -2.29767        0.3983925     -4.286401        1.815913e-05     9.350556e-03
Gene2     106.197308          -29.94073        5.6210867     -5.221540        1.774413e-07     1.202515e-04
Gene3     84.828549           -29.83375        5.6220937     -5.201577        1.976048e-07     1.202515e-04
Gene4     586.986077          -12.48934        2.8471245     -4.179423        2.922500e-05     1.397373e-02
Gene5     6036.720262         -30.00000        5.6208094     -5.232342        1.673753e-07     1.202515e-04
Gene6     40.291988           -29.99999        5.1266506     -5.736687        9.654640e-09     1.615704e-05
Gene7     1.974293            -29.99423        4.0568705     -7.248008        4.229450e-13     1.415597e-09
Gene8     1.865582            -29.92969        4.8053043     -6.105688        1.023586e-09     2.283961e-06
Gene9     5.142519            -29.94403        5.6271704     -5.216481        1.823542e-07     1.202515e-04
Gene10    108.487045          -30.00000        5.6212903     -5.231895        1.677811e-07     1.202515e-04
Gene11    188.233450          -29.96725        5.6210734     -5.226271        1.729627e-07     1.202515e-04
Gene12    132.808155          -29.95709        4.0282970     -7.290200        3.094953e-13     1.415597e-09
Gene13    230.582865           28.20695        5.5955986      4.935477        7.995492e-07     4.460152e-04
Gene14    83.802949            30.00000        5.5596687      5.289884        1.223942e-07     1.202515e-04

The raw counts of this DEGs are:

          28A 28C  29A 29C  30A  31A  32A  33A  34A  34C  37A  37C  38C  41A  41C  45C  68C
Gene1    6129 924 4520 761 3147 3294 8336 8978 8605 2187 4928 2247 2474 6074 2059 1540 1299
Gene2      0   0  614   0    0    0    0     0   0     0    0    0   0   684   0    0   540
Gene3      0   0  147   0    0   205   0    255  0     0    0  278  255   0   246   0   0        
Gene4    1441  0  1131  0  1002  869   0    816 792  682   991 617  685   0   587  657  0
Gene5    18177 0  13712 0    0    0    0     0   0  21031 25555  0   0    0    0  29532 0
Gene6     137  0   0    0   102   0   101   106  0     0    0   64  108   0    0    0   65
Gene7      0   0   0    0    5    0    0     0   4     0   15    0   0    0    6    0   4
Gene8     10   0   0    0    0    0    0     0   6     0    0    9   7    0    0    0   0
Gene9     34   0   0    0    0    0    0     0   0     0    0    0   0    43   0   13   0
Gene10     0   0  620   0    0    0    0     0   0     0    0    0   0   1289  0    0   0
Gene11     0   0   0    0  1940   0    0     0   0     0  1488   0   0    0    0    0   0
Gene12    215  0  257    0   0   153   0    193  0    336  331 323  182   0    0  310   0
Gene13     0   0   0    3    0    0  1187    0   0     0  1300   2  1457  0    0    0   0
Gene14     0  913  0    0    0    0    0     0   0     0    0   0    0    0   444   0   0

So, if you look to the samples 28A, 29A, 30A (Ctrl_1) and 28C, 29C (Ctrl_500) there are some DEGs with counts in only one sample. Thats why i need to apply a manual filter to remove these false positives.

I applied:

mcols(dds)$maxCooks<-apply(assays(dds)[["cooks"]],1,max)

And now, my number of DEGs are 11! But i still have got DEGs with counts in one sample. This step removed the Genes 10,11 and 14.

I am asking if there is other approaches that i can do to "fix" this issue! Or i just do the rest of the analysis (functional analysis) with this 11 DEGs.

Thanks in advance. All the best. Andreia

DESeq2 Nullcook'svalues 2x3 DEGs • 1.3k views

ADD COMMENT • link 4.1 years ago andreia ▴ 10

score 0 · Answer 1 · 2021-03-25

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 3 days ago

United States

On other threads, what I've recommended when there are many count outliers like the above is to simply pre-filter, at the top of your script (before running DESeq()):

keep <- rowSums(counts(dds) >= 10) >= X
dds <- dds[keep,]

You can put in X as the minimum number of samples with a moderate count, e.g. maybe 6 samples for your design.

ADD COMMENT • link 4.1 years ago Michael Love 43k

0

Entering edit mode

Thanks for your fast reply :)

Today, at lunch i have got a filtering topic conversation with my collegue :) its seems that not exist a standard steps or the best approach to follow... its understandble because the dataset is different everytime for a different sequencing! We need to adjust the filter to the dataset. But, almost of the times i choose to do a filter:

    table <- table[rowSums(table) > 0,]
    max<-apply(table, 1, function(x) max(x))
    table <- table[rowSums(table) != max, ]

and then apply the DESeq normalization.

So, I did what you suggest and i end up with 5 DEGs ... Gene 1, 3, 4 and 6. Its away better but still wonder if the Gene3 is a false positive or not.

I understand this filter improved the dataset, but looking again the threads about the filtering settings, sometimes you answer like you did to me:

    keep <- rowSums(counts(dds) >= X) >= Y
    dds <- dds[keep,]

and other times you answer:

    keep <- rowSums(counts(dds, normalized=TRUE) >= X) >= Y
    dds <- dds[keep,]

i wonder what is the correct one, given that counts is different from the normalized counts.

Also, if i want to use the CellType instead of Celltype_Gen to compare only Ctrl and Mut the correct filter in this case would be:

    keep <- rowSums(counts(dds, normalized=TRUE) >= 10) >= 5
    dds <- dds[keep,]

Thanks in advance, Andreia

ADD REPLY • link 4.1 years ago andreia ▴ 10