Dear all,
txImport
requires a txdb file with correspondences between transcript names and gene names. In the sample data provided by txImportData
, there is a pre-constructed table.
I was interested in reconstructing it from the sequence files, but I ended up with a discrepancy: a gene (3 transcripts) seem to be present in the files, but not the table (see code below).
So, my questions are:
1/ is this discrepancy a problem with the data or with my reasoning? Is it safe for me to construct the txdb table by reading all files and finding associated gene names?
2/ if it's a problem with the data, will it affect downstream analysis? Should I report it to tximportData
developers and how?
Sorry if this seem like basic questions, I'm pretty new to this field.
# as per the vignette library(tximportData) dir <- system.file("extdata", package = "tximportData") samples <- read.table(file.path(dir, "samples.txt"), header = TRUE) files <- file.path(dir, "kallisto", samples$run, "abundance.tsv") names(files) <- paste0("sample", 1:6) tx2gene <- read.csv(file.path(dir, "tx2gene.csv")) imported <- sort(as.character(tx2gene$TXNAME)) # by manually reading the samples comb <- factor() for(f in files){ aa <- read.table(f, header = TRUE)$target_id comb <- c(comb, levels(aa)) } manually <- unique(sort(comb)) # comparing length(imported) #48006 length(manually) #48009 all.equal(imported, manually) # 9950 mismatches tests <- which(imported != manually) tests[1] #38057 imported[38055:38061] # "NR_001525_1" "NR_001525_2" "NR_001527" "NR_001527_1" "NR_001530" "NR_001530_1" # "NR_001533" manually[38055:38061] # "NR_001525_1" "NR_001525_2" "NR_001526" "NR_001526_1" "NR_001526_2" "NR_001527" # "NR_001527_1" #tximportData version 1.2.0 sessionInfo() R version 3.3.2 (2016-10-31) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Arch Linux locale: [1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C LC_TIME=fr_FR.UTF-8 [4] LC_COLLATE=fr_FR.UTF-8 LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8 [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 GenomicFeatures_1.26.3 [3] AnnotationDbi_1.36.2 tximportData_1.2.0 [5] tximport_1.2.0 BiocInstaller_1.24.0 [7] DESeq2_1.14.1 SummarizedExperiment_1.4.0 [9] Biobase_2.34.0 GenomicRanges_1.26.3 [11] GenomeInfoDb_1.10.3 IRanges_2.8.1 [13] S4Vectors_0.12.1 BiocGenerics_0.20.0 loaded via a namespace (and not attached): [1] Rcpp_0.12.9 locfit_1.5-9.1 lattice_0.20-34 [4] Rsamtools_1.26.1 Biostrings_2.42.1 assertthat_0.1 [7] digest_0.6.12 plyr_1.8.4 backports_1.0.5 [10] acepack_1.4.1 RSQLite_1.1-2 ggplot2_2.2.1 [13] zlibbioc_1.20.0 lazyeval_0.2.0 data.table_1.10.4 [16] annotate_1.52.1 rpart_4.1-10 Matrix_1.2-7.1 [19] checkmate_1.8.2 splines_3.3.2 BiocParallel_1.8.1 [22] geneplotter_1.52.0 stringr_1.2.0 foreign_0.8-67 [25] htmlwidgets_0.8 RCurl_1.95-4.8 biomaRt_2.30.0 [28] munsell_0.4.3 rtracklayer_1.34.2 base64enc_0.1-3 [31] htmltools_0.3.5 nnet_7.3-12 tibble_1.2 [34] gridExtra_2.2.1 htmlTable_1.9 Hmisc_4.0-2 [37] XML_3.98-1.5 GenomicAlignments_1.10.0 bitops_1.0-6 [40] grid_3.3.2 xtable_1.8-2 gtable_0.2.0 [43] DBI_0.5-1 magrittr_1.5 scales_0.4.1 [46] stringi_1.1.2 XVector_0.14.0 genefilter_1.56.0 [49] latticeExtra_0.6-28 Formula_1.2-1 RColorBrewer_1.1-2 [52] tools_3.3.2 survival_2.40-1 colorspace_1.3-2 [55] cluster_2.0.5 memoise_1.0.0 knitr_1.15.1
For my question 1, I didn't notice the function
makeTxDbFromGFF()
fromGenomicFeatures
, that indeed provides another way to generate the tx2gene list.So I'm just left with the second question: out of curiosity, is it going to be a problem, and should I report it someplace?