Hi all
I've used an unpublished pipeline developed by collaborators to obtain raw counts of human endogenous retroviruses (HERVs) in RNA-seq data. I have control (CTRL; n= 279) and cases (n=259) which I want to test for differential HERV expression ("Profile") whilst covarying for the effect of institution where sequencing happened (3 levels), gender (2 levels), age bins (5 levels), RIN (continuous), post-mortem interval (PMI, continuous), and five population covariates per sample (this model was previously used in the analysis of gene expression differences between cases and controls reported using these samples). I additionally included a prefiltering step, as I noticed this substantially improved the dispersion estimates (see dispersion estimates prior to prefiltering ). Note that my initial dds object contained 14,211 rows, but I ended up with only 5,740 after removing HERVs which were very lowly expressed (see code below).
pd_nodups$Institution <- factor(pd_nodups$Institution) pd_nodups$Profile <- factor(pd_nodups$Profile, levels = c("Control","Condition")) pd_nodups$Gender <- factor(pd_nodups$Gender) pd_nodups$Age.bins <- factor(pd_nodups$Age.bins) dds <- DESeqDataSetFromMatrix(countData = raw_exprs_sub, colData = pd_nodups, design = ~ Institution + Profile + Gender + Age.bins + RIN + PMI.hrs + C1 + C2 + C3 + C4 + C5 + Profile) keep <- rowSums(counts(dds) >= 10) >= 4 dds <- dds[keep,] dds <- estimateSizeFactors(dds) dds <- estimateDispersions(dds) dds <- nbinomWaldTest(dds, maxit=500) res <- results(dds,name="Profile_Condition_vs_Control") summary(res) out of 5740 with nonzero total read count adjusted p-value < 0.1 LFC > 0 (up) : 106, 1.8% LFC < 0 (down) : 96, 1.7% outliers [1] : 0, 0% low counts [2] : 1669, 29% (mean count < 4) [1] see 'cooksCutoff' argument of ?results [2] see 'independentFiltering' argument of ?results plotDispEsts(dds) # to plot dispersion estimates plotCounts(dds,"HERV", intgroup = "Profile") # to plot counts of HERVs of interest
I was a bit concerned because the s don't look exactly like the graph in the DESeq2 tutorial. The shows very modest expression differences between the conditions, which is sort of expected, considering that the condition I am studying is discreet, so I am not expecting massive HERV expression differences between cases and controls at this stage. The expression profile of the actually does not seem so dissimilar from the significant HERV shown above, but this must be DESeq2 trying to avoid false-positives due to very low counts, which is great.
My point is: I want to make sure these findings are correct, as I've never worked with such a complex dataset before, and I have limited knowledge about whether I should or how I could manually filter these data. Therefore I'd appreciate if you could share your experience or concerns.
Thanks a lot!
Supporting information:
My raw count table contains sample IDs as column names, HERV IDs as row names, and integers stating the number of raw counts per HERV per sample.
head(pd_nodups) sample_id Institution RIN Profile Gender PMI.hrs Age Age.bins C1 C2 C3 C4 C5 Sample1 1 7.5 Cond M 8.9 68 3 -0.0169115 -0.00503876 0.00316163 -0.000208601 -0.00408248 Sample2 1 8.8 CTRL M 12.3 58 3 -0.00416409 -0.0128559 -0.0643876 -0.0166669 0.00954494 Sample3 1 6.7 Cond M 12.1 66 3 -0.0128424 0.00925869 0.0020633 -0.00311339 -0.0159096 Sample4 1 6.4 Cond F 21.2 76 4 -0.0142869 0.00795162 0.00105317 -0.00398493 -0.0150919 Sample5 1 8.2 CTRL F 5.7 82 4 -0.0123858 0.0218645 -0.000390619 -0.00170997 0.00612761 Sample6 1 7.4 CTRL F 4 81 4 0.0755209 -0.000988837 0.00153065 0.00449107 0.00566539
sessionInfo() R version 3.5.1 (2018-07-02) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows >= 8 x64 (build 9200) Matrix products: default locale: [1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252 [4] LC_NUMERIC=C LC_TIME=English_United Kingdom.1252 attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] data.table_1.11.8 DESeq2_1.22.1 SummarizedExperiment_1.12.0 DelayedArray_0.8.0 [5] BiocParallel_1.16.0 matrixStats_0.54.0 Biobase_2.42.0 GenomicRanges_1.34.0 [9] GenomeInfoDb_1.18.1 IRanges_2.16.0 S4Vectors_0.20.1 BiocGenerics_0.28.0 loaded via a namespace (and not attached): [1] locfit_1.5-9.1 Rcpp_1.0.0 lattice_0.20-38 digest_0.6.18 plyr_1.8.4 [6] backports_1.1.2 acepack_1.4.1 RSQLite_2.1.1 ggplot2_3.1.0 pillar_1.3.0 [11] zlibbioc_1.28.0 rlang_0.3.0.1 lazyeval_0.2.1 rstudioapi_0.8 annotate_1.60.0 [16] blob_1.1.1 rpart_4.1-13 Matrix_1.2-15 checkmate_1.8.5 splines_3.5.1 [21] geneplotter_1.60.0 stringr_1.3.1 foreign_0.8-71 htmlwidgets_1.3 RCurl_1.95-4.11 [26] bit_1.1-14 munsell_0.5.0 compiler_3.5.1 base64enc_0.1-3 htmltools_0.3.6 [31] nnet_7.3-12 tibble_1.4.2 gridExtra_2.3 htmlTable_1.12 GenomeInfoDbData_1.2.0 [36] Hmisc_4.1-1 XML_3.98-1.16 crayon_1.3.4 bitops_1.0-6 grid_3.5.1 [41] xtable_1.8-3 gtable_0.2.0 DBI_1.0.0 magrittr_1.5 scales_1.0.0 [46] stringi_1.2.4 XVector_0.22.0 genefilter_1.64.0 latticeExtra_0.6-28 Formula_1.2-3 [51] RColorBrewer_1.1-2 tools_3.5.1 bit64_0.9-7 survival_2.43-1 AnnotationDbi_1.44.0 [56] colorspace_1.3-2 cluster_2.0.7-1 memoise_1.1.0 knitr_1.20
Thank you Michael! This is super helpful!