Question

DESeq2 - assessing expression of low count genes

0

Entering edit mode

rodrigo.duarte88 ▴ 40

@rodrigoduarte88-16306

Last seen 9 months ago

United States

Hi all

I've used an unpublished pipeline developed by collaborators to obtain raw counts of human endogenous retroviruses (HERVs) in RNA-seq data. I have control (CTRL; n= 279) and cases (n=259) which I want to test for differential HERV expression ("Profile") whilst covarying for the effect of institution where sequencing happened (3 levels), gender (2 levels), age bins (5 levels), RIN (continuous), post-mortem interval (PMI, continuous), and five population covariates per sample (this model was previously used in the analysis of gene expression differences between cases and controls reported using these samples). I additionally included a prefiltering step, as I noticed this substantially improved the dispersion estimates (see dispersion estimates prior to prefiltering ). Note that my initial dds object contained 14,211 rows, but I ended up with only 5,740 after removing HERVs which were very lowly expressed (see code below).

pd_nodups$Institution <- factor(pd_nodups$Institution)
pd_nodups$Profile <- factor(pd_nodups$Profile, levels = c("Control","Condition"))
pd_nodups$Gender <- factor(pd_nodups$Gender)
pd_nodups$Age.bins <- factor(pd_nodups$Age.bins)
dds <- DESeqDataSetFromMatrix(countData = raw_exprs_sub, colData = pd_nodups, design = ~ Institution + Profile + Gender + Age.bins + RIN + PMI.hrs + C1 + C2 + C3 + C4 + C5 + Profile)
keep <- rowSums(counts(dds) >= 10) >= 4
dds <- dds[keep,]
dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds)
dds <- nbinomWaldTest(dds, maxit=500)
res <- results(dds,name="Profile_Condition_vs_Control")
summary(res)

out of 5740 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 106, 1.8%
LFC < 0 (down)     : 96, 1.7%
outliers [1]       : 0, 0%
low counts [2]     : 1669, 29%
(mean count < 4)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

plotDispEsts(dds)                                    # to plot dispersion estimates
plotCounts(dds,"HERV", intgroup = "Profile")          # to plot counts of HERVs of interest

I was a bit concerned because the s don't look exactly like the graph in the DESeq2 tutorial. The shows very modest expression differences between the conditions, which is sort of expected, considering that the condition I am studying is discreet, so I am not expecting massive HERV expression differences between cases and controls at this stage. The expression profile of the actually does not seem so dissimilar from the significant HERV shown above, but this must be DESeq2 trying to avoid false-positives due to very low counts, which is great.

My point is: I want to make sure these findings are correct, as I've never worked with such a complex dataset before, and I have limited knowledge about whether I should or how I could manually filter these data. Therefore I'd appreciate if you could share your experience or concerns.

Thanks a lot!

Supporting information:

My raw count table contains sample IDs as column names, HERV IDs as row names, and integers stating the number of raw counts per HERV per sample.

head(pd_nodups)
sample_id Institution RIN Profile Gender PMI.hrs Age Age.bins C1 C2 C3 C4 C5
Sample1 1 7.5 Cond M 8.9 68 3 -0.0169115 -0.00503876 0.00316163 -0.000208601 -0.00408248
Sample2 1 8.8 CTRL M 12.3 58 3 -0.00416409 -0.0128559 -0.0643876 -0.0166669 0.00954494
Sample3 1 6.7 Cond M 12.1 66 3 -0.0128424 0.00925869 0.0020633 -0.00311339 -0.0159096
Sample4 1 6.4 Cond F 21.2 76 4 -0.0142869 0.00795162 0.00105317 -0.00398493 -0.0150919
Sample5 1 8.2 CTRL F 5.7 82 4 -0.0123858 0.0218645 -0.000390619 -0.00170997 0.00612761
Sample6 1 7.4 CTRL F 4 81 4 0.0755209 -0.000988837 0.00153065 0.00449107 0.00566539

sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                            LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.11.8           DESeq2_1.22.1               SummarizedExperiment_1.12.0 DelayedArray_0.8.0        
[5] BiocParallel_1.16.0         matrixStats_0.54.0          Biobase_2.42.0              GenomicRanges_1.34.0      
[9] GenomeInfoDb_1.18.1         IRanges_2.16.0              S4Vectors_0.20.1            BiocGenerics_0.28.0        

loaded via a namespace (and not attached):
[1] locfit_1.5-9.1         Rcpp_1.0.0             lattice_0.20-38        digest_0.6.18          plyr_1.8.4           
[6] backports_1.1.2        acepack_1.4.1          RSQLite_2.1.1          ggplot2_3.1.0          pillar_1.3.0         
[11] zlibbioc_1.28.0        rlang_0.3.0.1          lazyeval_0.2.1         rstudioapi_0.8         annotate_1.60.0      
[16] blob_1.1.1             rpart_4.1-13           Matrix_1.2-15          checkmate_1.8.5        splines_3.5.1        
[21] geneplotter_1.60.0     stringr_1.3.1          foreign_0.8-71         htmlwidgets_1.3        RCurl_1.95-4.11      
[26] bit_1.1-14             munsell_0.5.0          compiler_3.5.1         base64enc_0.1-3        htmltools_0.3.6      
[31] nnet_7.3-12            tibble_1.4.2           gridExtra_2.3          htmlTable_1.12         GenomeInfoDbData_1.2.0
[36] Hmisc_4.1-1            XML_3.98-1.16          crayon_1.3.4           bitops_1.0-6           grid_3.5.1           
[41] xtable_1.8-3           gtable_0.2.0           DBI_1.0.0              magrittr_1.5           scales_1.0.0         
[46] stringi_1.2.4          XVector_0.22.0         genefilter_1.64.0      latticeExtra_0.6-28    Formula_1.2-3        
[51] RColorBrewer_1.1-2     tools_3.5.1            bit64_0.9-7            survival_2.43-1        AnnotationDbi_1.44.0 
[56] colorspace_1.3-2       cluster_2.0.7-1        memoise_1.1.0          knitr_1.20

deseq2 HERVs low counts prefiltering dispersion estimates • 976 views

ADD COMMENT • link updated 6.3 years ago by Michael Love 43k • written 6.3 years ago by rodrigo.duarte88 ▴ 40

score 1 · Answer 1 · 2018-11-26

1

Entering edit mode

Michael Love 43k

@mikelove

Last seen 3 days ago

United States

The dispersion plot you show looks perfectly fine to me. I would proceed as you are doing, checking that the top genes make sense by looking at plotCounts(), etc. as a diagnostic. Note that with so many samples, you are powered to find very small differences. You can use the lfcThreshold argument in results() to change the null hypothesis if you are interested in larger LFC (see paper and vignette).