Hello,
I have seen various posts regarding Cooks distance for outlier detection, but I couldn't see how to determine what is the actual cuttoff value used? Also, is it on a per gene basis or the same for all genes?
So I am running a Log Ratio Test accounting for batch:
dds.LRT <- DESeq(dds, betaPrior=FALSE, test="LRT", full=~ batch + condition, reduced=~ batch)
I don't see any gene replacement in the output:
estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing
replaced_rows = mcols(dds.LRT)$replace
# replaced_rows is empty
I am interested in the results of a few features defined in "features", some of which are being filtered out as outliers based on Cooks distance (basemean > 0, padj and pvalue = "NA"):
# Get the results of the LRT
res=as.data.frame(results(dds.LRT))
# Get the basemean, pvalue and padj values
LRT_details_df=res[,c("baseMean", "pvalue","padj")]
LRT_details_df[features, ]
baseMean pvalue padj
feat1 93.144247 0.1233287949 0.41052328
feat2 17.379321 0.6224541566 0.82849752
feat3 463.849956 0.0058166253 0.09553337
feat4 8.685396 0.2886092354 0.60136924
feat5 55.207087 0.9526211931 0.98063952
feat6 82.898650 NA NA
feat7 35.431748 NA NA
feat8 145.626899 NA NA
feat9 108.813750 0.0805032677 0.33776712
feat10 124.267515 NA NA
feat11 31.028438 0.0041612469 0.08059552
feat12 118.585676 NA NA
feat13 137.557844 0.2471188499 0.56279998
feat14 18.188999 0.5650030076 0.79687040
feat15 6.319050 0.0001653492 0.01305311
# Get Cooks distance for the features:
cooks_distance_sample = assays(dds.LRT)[["cooks"]][features,]
# For each feature (id), get the maximum cooks distance (out of all the samples)
maxCooks_features <- apply(cooks_distance_sample, 1, max)
flagged = maxCooks_features > mcols(dds.LRT, use.names=TRUE)[features, ]$maxCooks
flagged
feat1 FALSE
feat2 TRUE
feat3 FALSE
feat4 TRUE
feat5 FALSE
feat6 FALSE
feat7 FALSE
feat8 FALSE
feat9 FALSE
feat10 FALSE
feat11 FALSE
feat12 FALSE
feat13 FALSE
feat14 FALSE
feat15 TRUE
I thought maxCooks was the threshold on a feature basis to declare the row as containing an outlier but the rows surpassing these threshold do not correspond to the rows marked as outliers, how can one determine the actual number used to threshold rows as outliers?
Thanks!