Hello All
I have 4 samples M1 , M2 , W1, W2 . I have done alignment with Ecoli ref genome usng hisat2 on this sample i ran stringtie for abundance estimation Now i am trying to run ball gown on this samples of now i tried this commands:
data_directory <- "/Hisat_stringtie_/string_out/"
bg = ballgown(dataDir=data_directory, samplePattern="string_out_", meas='all')
transcript_fpkm = texpr(bg, 'FPKM')
transcript_cov = texpr(bg, 'cov')
whole_tx_table = texpr(bg, 'all')
exon_mcov = eexpr(bg, 'mcov')
junction_rcount = iexpr(bg)
whole_intron_table = iexpr(bg, 'all')
gene_expression = gexpr(bg)
pData(bg) = data.frame(id=sampleNames(bg), group=rep(c(1,0), each=2))
pheno_data <- pData(bg)
head(pheno_data)
head(bg@dirs)
bg_2 = subset(bg,"rowVars(texpr(bg))>1", genomesubset = TRUE )
stat_results = stattest(bg_2, feature='transcript' ,meas='FPKM', covariate='group')
After running stattest i am getting error:
Error in solve.default(t(mod) %*% mod) :
Lapack routine dgesv: system is exactly singular: U[3,3] = 0
please help me i am not able to understand the error .
Where i am getting wrong
sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 15.10
locale:
[1] LC_CTYPE=en_IN.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_IN.UTF-8 LC_COLLATE=en_IN.UTF-8
[5] LC_MONETARY=en_IN.UTF-8 LC_MESSAGES=en_IN.UTF-8
[7] LC_PAPER=en_IN.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_IN.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils
[7] datasets methods base
other attached packages:
[1] dplyr_0.5.0 genefilter_1.52.1
[3] BiocStyle_1.8.0 ballgown_2.2.0
[5] Rqc_1.4.2 ggplot2_2.2.0
[7] ShortRead_1.28.0 GenomicAlignments_1.6.3
[9] SummarizedExperiment_1.0.2 Biobase_2.30.0
[11] Rsamtools_1.22.0 GenomicRanges_1.22.4
[13] GenomeInfoDb_1.6.3 Biostrings_2.38.4
[15] XVector_0.10.0 IRanges_2.4.8
[17] S4Vectors_0.8.11 BiocGenerics_0.16.1
[19] BiocParallel_1.4.3
loaded via a namespace (and not attached):
[1] splines_3.2.2 GenomicFiles_1.6.2
[3] Formula_1.2-1 shiny_0.14.2
[5] assertthat_0.1 highr_0.6
[7] latticeExtra_0.6-28 BSgenome_1.38.0
[9] RSQLite_1.1 lattice_0.20-34
[11] biovizBase_1.18.0 limma_3.26.9
[13] digest_0.6.10 RColorBrewer_1.1-2
[15] colorspace_1.3-1 htmltools_0.3.5
[17] httpuv_1.3.3 Matrix_1.2-7.1
[19] plyr_1.8.4 XML_3.98-1.5
[21] biomaRt_2.26.1 zlibbioc_1.16.0
[23] xtable_1.8-2 scales_0.4.1
[25] htmlTable_1.7 tibble_1.2
[27] annotate_1.48.0 mgcv_1.8-16
[29] GenomicFeatures_1.22.13 nnet_7.3-12
[31] lazyeval_0.2.0 survival_2.40-1
[33] magrittr_1.5 mime_0.5
[35] evaluate_0.10 memoise_1.0.0
[37] nlme_3.1-128 hwriter_1.3.2
[39] foreign_0.8-67 tools_3.2.2
[41] data.table_1.10.0 stringr_1.1.0
[43] munsell_0.4.3 cluster_2.0.5
[45] AnnotationDbi_1.32.3 lambda.r_1.1.9
[47] futile.logger_1.4.3 grid_3.2.2
[49] RCurl_1.95-4.8 dichromat_2.0-0
[51] VariantAnnotation_1.16.4 labeling_0.3
[53] bitops_1.0-6 gtable_0.2.0
[55] DBI_0.5-1 markdown_0.7.7
[57] reshape2_1.4.2 R6_2.2.0
[59] gridExtra_2.2.1 knitr_1.15.1
[61] rtracklayer_1.30.4 Hmisc_4.0-0
[63] futile.options_1.0.0 stringi_1.1.2
[65] sva_3.18.0 Rcpp_0.12.8
[67] rpart_4.1-10 acepack_1.4.1
Thanks
The output of `pData(bg)` isn't showing up, but I'm assuming you do have two replicates per group (two W's, and two M's). @Jeff if you only have one rep, ballgown throws this error message which isn't what's happening: https://github.com/alyssafrazee/ballgown/blob/master/R/stattest.R#L215
Without seeing the data, my *guess* as to what's going on is that you have 0 variance in one of your groups in one of your genes, i.e., both W's or both M's have equal FPKM values (which is pretty likely with many genes and only two samples per group). It's easy to get 0 variance there, and linear regression then requires dividing by the variance, and you're essentially getting a "divide by 0" error.
If you like, you can work around this by excluding genes with this property (having equal FPKM in either the W or the M group). But in general, it's good to have more than 2 samples per group (as many as possible). For smaller sample sizes, limma might be a good choice (as they use empirical Bayes to share info about variance across genes).