txImport: a gene seems missing from the sample data
1
0
Entering edit mode
alexlok • 0
@alexlok-12509
Last seen 7.8 years ago

Dear all,

txImport requires a txdb file with correspondences between transcript names and gene names. In the sample data provided by txImportData, there is a pre-constructed table.

I was interested in reconstructing it from the sequence files, but I ended up with a discrepancy: a gene (3 transcripts) seem to be present in the files, but not the table (see code below).

So, my questions are:

1/ is this discrepancy a problem with the data or with my reasoning? Is it safe for me to construct the txdb table by reading all files and finding associated gene names?

2/ if it's a problem with the data, will it affect downstream analysis? Should I report it to tximportData developers and how?

Sorry if this seem like basic questions, I'm pretty new to this field.

 

# as per the vignette
library(tximportData)
dir <- system.file("extdata", package = "tximportData")
samples <- read.table(file.path(dir, "samples.txt"), header = TRUE)
files <- file.path(dir, "kallisto", samples$run, "abundance.tsv")
names(files) <- paste0("sample", 1:6)
tx2gene <- read.csv(file.path(dir, "tx2gene.csv"))

imported <- sort(as.character(tx2gene$TXNAME))

# by manually reading the samples
comb <- factor()
for(f in files){
  aa <- read.table(f, header = TRUE)$target_id
  comb <- c(comb, levels(aa))
}
manually <- unique(sort(comb))

# comparing
length(imported)
#48006
length(manually)
#48009
all.equal(imported, manually)
# 9950 mismatches

tests <- which(imported != manually)
tests[1]
#38057
imported[38055:38061]
# "NR_001525_1" "NR_001525_2" "NR_001527"   "NR_001527_1" "NR_001530"   "NR_001530_1"
# "NR_001533"
manually[38055:38061]
# "NR_001525_1" "NR_001525_2" "NR_001526"   "NR_001526_1" "NR_001526_2" "NR_001527"  
# "NR_001527_1"



#tximportData version 1.2.0
sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux

locale:
 [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C               LC_TIME=fr_FR.UTF-8       
 [4] LC_COLLATE=fr_FR.UTF-8     LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8   
 [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 GenomicFeatures_1.26.3                 
 [3] AnnotationDbi_1.36.2                    tximportData_1.2.0                     
 [5] tximport_1.2.0                          BiocInstaller_1.24.0                   
 [7] DESeq2_1.14.1                           SummarizedExperiment_1.4.0             
 [9] Biobase_2.34.0                          GenomicRanges_1.26.3                   
[11] GenomeInfoDb_1.10.3                     IRanges_2.8.1                          
[13] S4Vectors_0.12.1                        BiocGenerics_0.20.0                    

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.9              locfit_1.5-9.1           lattice_0.20-34         
 [4] Rsamtools_1.26.1         Biostrings_2.42.1        assertthat_0.1          
 [7] digest_0.6.12            plyr_1.8.4               backports_1.0.5         
[10] acepack_1.4.1            RSQLite_1.1-2            ggplot2_2.2.1           
[13] zlibbioc_1.20.0          lazyeval_0.2.0           data.table_1.10.4       
[16] annotate_1.52.1          rpart_4.1-10             Matrix_1.2-7.1          
[19] checkmate_1.8.2          splines_3.3.2            BiocParallel_1.8.1      
[22] geneplotter_1.52.0       stringr_1.2.0            foreign_0.8-67          
[25] htmlwidgets_0.8          RCurl_1.95-4.8           biomaRt_2.30.0          
[28] munsell_0.4.3            rtracklayer_1.34.2       base64enc_0.1-3         
[31] htmltools_0.3.5          nnet_7.3-12              tibble_1.2              
[34] gridExtra_2.2.1          htmlTable_1.9            Hmisc_4.0-2             
[37] XML_3.98-1.5             GenomicAlignments_1.10.0 bitops_1.0-6            
[40] grid_3.3.2               xtable_1.8-2             gtable_0.2.0            
[43] DBI_0.5-1                magrittr_1.5             scales_0.4.1            
[46] stringi_1.1.2            XVector_0.14.0           genefilter_1.56.0       
[49] latticeExtra_0.6-28      Formula_1.2-1            RColorBrewer_1.1-2      
[52] tools_3.3.2              survival_2.40-1          colorspace_1.3-2        
[55] cluster_2.0.5            memoise_1.0.0            knitr_1.15.1 
txImportData txImport • 2.9k views
ADD COMMENT
0
Entering edit mode

 For my question 1, I didn't notice the function makeTxDbFromGFF() from GenomicFeatures, that indeed provides another way to generate the tx2gene list.

So I'm just left with the second question: out of curiosity, is it going to be a problem, and should I report it someplace?

ADD REPLY
3
Entering edit mode
@mikelove
Last seen 1 day ago
United States

hi Alex,

The tx2gene.csv file in tximportData is only for the specific quantification files in tximportData, but is not appropriate in general. I couldn't tell from your post, but are you interested in using tximport on your own data?

If so, you should generate your own tx2gene table for your specific transcriptome. So then, you'd have to know which transcriptome you quantified against. You would then want to go find a GTF file for that transcriptome, and you could use some Bioconductor functions to generate an appropriate tx2gene table.

If you are just interested in the data in tximportData, I believe the gene with three transcripts had an ID of "" (empty string) from the GTF file, which caused problems at one or more places throughout the pipeline, and so these transcripts were removed from the tx2gene table.

ADD COMMENT
0
Entering edit mode

Hi Michael,

Indeed I'm generating my own tx2gene table. My question was really just about the tximportData dataset, and you have perfectly answered it.

Thank you.

ADD REPLY
0
Entering edit mode

Hello, I am also new and I am also learning. If I create my own table tx2gene. Will these missing transcripts in tx2gene cause problems calculating offset? I am using the recommendation for edgeR in https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html#Downstream_DGE_in_Bioconductor

ADD REPLY
0
Entering edit mode

Please don't add a comment on a years-old thread. Instead ask a new question.

ADD REPLY

Login before adding your answer.

Traffic: 530 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6