I am trying to run DEXSeq package for 2 conditions, control and treatment.
This is hg19 genome. I have used python scripts to create the matrix. In the matrix I have >500,000 exons which is typical for human genome. Below is the first 20 lines of my matrix:
exon C1A C1B C1C C2A C2B C2C
ENSG00000000003.10:001 862 817 923 495 501 484
ENSG00000000003.10:002 443 410 421 196 220 202
ENSG00000000003.10:003 396 348 372 165 197 178
ENSG00000000003.10:004 350 319 310 138 175 169
ENSG00000000003.10:005 350 332 341 142 181 179
ENSG00000000003.10:006 377 369 397 159 205 181
ENSG00000000003.10:007 362 348 399 197 219 186
ENSG00000000003.10:008 332 312 398 187 198 183
ENSG00000000003.10:009 511 473 564 293 265 236
ENSG00000000003.10:010 42 54 48 61 37 42
ENSG00000000003.10:011 346 308 343 155 157 116
ENSG00000000003.10:012 273 255 265 126 116 92
ENSG00000000003.10:013 26 19 18 20 15 10
ENSG00000000003.10:014 17 9 10 9 10 7
ENSG00000000003.10:015 0 0 0 0 1 0
ENSG00000000005.5:001 0 0 0 0 0 0
ENSG00000000005.5:002 0 0 0 0 0 0
ENSG00000000005.5:003 0 0 0 0 0 0
ENSG00000000005.5:004 0 0 0 0 0 0
Following is the code i am using :
seqdata <- read.table("FINAL_EXON_COUNTS_SS.txt",header=TRUE,sep="\t",row.names = 1)
sampleinfo <- read.table("Sample_Info.txt",header=TRUE,sep="\t",row.name=1)
dxd <- DEXSeqDataSet(seqdata, sampleinfo, design= ~ sample + exon + condition:exon,
featureID = gsub(pattern = "ENSG.*:","",row.names(seqdata)),
groupID = gsub(pattern = ":.*","",row.names(seqdata)) )
dxd = estimateSizeFactors(dxd )
dxd = estimateDispersions( dxd )
I am stuck at estimateDispersions step. It is taking forever to finish this step. Is there a way to speed this up??
This is my sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: OS X El Capitan 10.11.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] statmod_1.4.30 DEXSeq_1.28.3 org.Hs.eg.db_3.7.0
[4] AnnotationDbi_1.44.0 DESeq2_1.22.2 SummarizedExperiment_1.12.0
[7] DelayedArray_0.8.0 BiocParallel_1.16.6 matrixStats_0.54.0
[10] Biobase_2.42.0 GenomicRanges_1.34.0 GenomeInfoDb_1.18.2
[13] IRanges_2.16.0 S4Vectors_0.20.1 BiocGenerics_0.28.0
[16] pheatmap_1.0.12 RColorBrewer_1.1-2 edgeR_3.24.3
[19] limma_3.38.3
loaded via a namespace (and not attached):
[1] httr_1.4.0 bit64_0.9-7 splines_3.5.1 Formula_1.2-3
[5] assertthat_0.2.1 BiocManager_1.30.4 latticeExtra_0.6-28 blob_1.1.1
[9] Rsamtools_1.34.1 GenomeInfoDbData_1.2.0 progress_1.2.0 pillar_1.3.1
[13] RSQLite_2.1.1 backports_1.1.4 lattice_0.20-38 glue_1.3.1
[17] digest_0.6.18 XVector_0.22.0 checkmate_1.9.3 colorspace_1.4-1
[21] htmltools_0.3.6 Matrix_1.2-17 plyr_1.8.4 XML_3.98-1.19
[25] pkgconfig_2.0.2 biomaRt_2.38.0 genefilter_1.64.0 zlibbioc_1.28.0
[29] purrr_0.3.2 xtable_1.8-4 scales_1.0.0 htmlTable_1.13.1
[33] tibble_2.1.1 annotate_1.60.1 ggplot2_3.1.1 nnet_7.3-12
[37] lazyeval_0.2.2 survival_2.44-1.1 magrittr_1.5 crayon_1.3.4
[41] memoise_1.1.0 hwriter_1.3.2 foreign_0.8-71 prettyunits_1.0.2
[45] tools_3.5.1 data.table_1.12.2 hms_0.4.2 stringr_1.4.0
[49] munsell_0.5.0 locfit_1.5-9.1 cluster_2.0.9 Biostrings_2.50.2
[53] compiler_3.5.1 rlang_0.3.4 grid_3.5.1 RCurl_1.95-4.12
[57] rstudioapi_0.10 htmlwidgets_1.3 bitops_1.0-6 base64enc_0.1-3
[61] gtable_0.3.0 DBI_1.0.0 R6_2.4.0 gridExtra_2.3
[65] knitr_1.22 dplyr_0.8.0.1 bit_1.1-14 Hmisc_4.2-0
[69] stringi_1.4.3 Rcpp_1.0.1 geneplotter_1.60.0 rpart_4.1-15
[73] acepack_1.4.1 tidyselect_0.2.5 xfun_0.6
Hope to hear from you soon.
Thanks
HI Michael, Thanks for the reply. I thought that too. But the problem I see in that approach is : What if for a particular gene which has 10 exons, exon1 does not have enough reads(threshold value >10), then this exon according to my filtering will not be present in the initial matrix. Would that be fine since I would like to use plotDexseq function and if exon 1 is missing in my initial matrix, it will also not make in the dxr object? Your thoughts on this? In the above example, ENSG00000000003.10:015 will not be present in the matrix after filtering it. Would that be fine because I would like to make some graphs downstream?
Hi gv,
Last time I ran an analysis with 6 samples for a human genome, it took around an hour for the whole pipeline. When you say "It is taking forever to finish this step.", how much time are you referring to?
Also, please make sure you are using the latest versions of R and Bioconductor (I could not see the your sessionInfo() in your original message).
Hi Alenjandro, I have edited the post with my sessionInfo(). My program did finish overnight though. It took more than 3 hours I guess. I will try to put in parallel option. Also can you tell me about removing exons which has no counts and it's effect downstream such as in making plots? Thanks a lot for your help.
Sure. If you remove these exons from the object, they won't appear on these plots.
However, as Mike mentioned, you could filter genes that are lowly expressed. If a gene is not expressed to begin with, it is not worth testing this gene for differential exon usage. Statistically, you won't have power to detect differential exon usage. Biologically, those genes are probably not relevant in the context of your experiment.
Alejandro
Sure. If you remove these exons from the object, they won't appear on these plots.
However, as Mike mentioned, you could filter genes that are lowly expressed. If a gene is not expressed to begin with, it is not worth testing this gene for differential exon usage. Statistically, you won't have power to detect differential exon usage. Biologically, those genes are probably not relevant in the context of your experiment.
Alejandro