Hello,
I'm trying to annotate genes in DESeq2 results object with chrom, strand, start, end. Code works fine but many genes have incorrect TSS/TTS annotation on IGV/UCSC browser. If I convert GENEID to ENSEML the problem is solved but not all GENEIDs map, which is suboptimal. How to fix? Thanks in advance!
B_samples <- read.table(file.path(dir, "sample_info.tsv"), header = TRUE)
B_files <- file.path(dir, B_samples$name)
names(B_files) <- file.path(B_samples$sample)
txdb <- (TxDb.Mmusculus.UCSC.mm10.knownGene)
k <- keys(txdb, keytype = "GENEID")
df <- AnnotationDbi::select(txdb, keys = k, keytype = "GENEID", columns = "TXNAME")
tx2gene <- df[, 2:1]
B_txi <- tximport(B_files, type = "kallisto", tx2gene = tx2gene)
rownames(B_samples) <- B_samples$sample
B_ddsTxi <- DESeqDataSetFromTximport(B_txi, colData = B_samples, design = ~sequencer + condition)
B_ddsTxi <- DESeq(B_ddsTxi)
B_res <- results(B_ddsTxi, c("condition", "Y", "X"))
B_res$symbol <- mapIds(Mus.musculus, keys=row.names(B_res), column="SYMBOL", keytype="GENEID", multiVals="first")
B_res$chrom <- mapIds(TxDb.Mmusculus.UCSC.mm10.knownGene, keys=row.names(B_res), column="TXCHROM", keytype="GENEID", multiVals="first")
B_res$strand <- mapIds(TxDb.Mmusculus.UCSC.mm10.knownGene, keys=row.names(B_res), column="TXSTRAND", keytype="GENEID", multiVals="first")
B_res$start <- mapIds(TxDb.Mmusculus.UCSC.mm10.knownGene, keys=row.names(B_res), column="TXSTART", keytype="GENEID", multiVals="first")
B_res$end <- mapIds(TxDb.Mmusculus.UCSC.mm10.knownGene, keys=row.names(B_res), column="TXEND", keytype="GENEID", multiVals="first")
> head(tx2gene)
TXNAME GENEID
1 uc009veu.1 100009600
2 uc033jjg.1 100009600
3 uc012fog.1 100009609
4 uc011xhj.2 100009614
5 uc007inp.3 100009664
6 uc008vqx.2 100012
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS 10.14.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] grid parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] TxDb.Mmusculus.UCSC.mm10.ensGene_3.4.0 BiocInstaller_1.32.1 gdtools_0.2.1
[4] UpSetR_1.4.0 gprofiler2_0.1.9 GeneOverlap_1.18.0
[7] sva_3.30.1 genefilter_1.64.0 mgcv_1.8-31
[10] nlme_3.1-148 rafalib_1.0.0 ggpubr_0.3.0
[13] ggsignif_0.6.0 ashr_2.2-47 gridExtra_2.3
[16] gtools_3.8.2 naturalsort_0.1.3 ggthemes_4.2.0
[19] viridis_0.5.1 viridisLite_0.3.0 RColorBrewer_1.1-2
[22] pheatmap_1.0.12 circlize_0.4.10 ComplexHeatmap_1.20.0
[25] reshape2_1.4.4 scales_1.1.1 forcats_0.5.0
[28] stringr_1.4.0 purrr_0.3.4 tidyr_1.1.0
[31] tibble_3.0.1 tidyverse_1.3.0 dplyr_1.0.0
[34] ggrepel_0.8.2 eulerr_6.1.0 gplots_3.0.3
[37] ggplot2_3.3.1 VennDiagram_1.6.20 futile.logger_1.4.3
[40] Mus.musculus_1.3.1 org.Mm.eg.db_3.7.0 GO.db_3.7.0
[43] OrganismDbi_1.24.0 TxDb.Mmusculus.UCSC.mm10.knownGene_3.4.4 GenomicFeatures_1.34.8
[46] AnnotationDbi_1.44.0 limma_3.38.3 DESeq2_1.22.2
[49] SummarizedExperiment_1.12.0 DelayedArray_0.8.0 BiocParallel_1.16.6
[52] matrixStats_0.56.0 Biobase_2.42.0 GenomicRanges_1.34.0
[55] GenomeInfoDb_1.18.2 IRanges_2.16.0 S4Vectors_0.20.1
[58] BiocGenerics_0.28.0 readr_1.3.1 tximport_1.10.1
Thanks for your reply. I'm using the GENEID/REFSEQ mappings on the UCSC and IGV browser and see disagreements.
Many TSS/TTSs that I get from GENEID and REFSEQ don't correspond to a TSS/TTS at all on UCSC/IGV mm10 browser (~10%), they are in the middle of a gene for example so it's not an alternative start site. If I cross-reference the GENEID/REFSEQ to ENSEMBL (below) and then pull the TSS/TTS then I get coordinates that map correctly both on UCSC/IGV mm10. The problem appears to be the annotation for GENEID/REFSEQ gene models. Seems best option is just to redo everything with ENSEMBL unless you know of a way to fix.
Thanks for your help!
Do you have some examples? It's hard to diagnose, based on a statement that ~10% are off.
I updated R, Bioconductor, and reinstalled all packages and now TXNAME is ENSEMBL style (ENSMUST00000115494.2) instead of UCSC (uc009veu.1) and transcripts don't match.
I'm just going to redo everything with the ENSEMBL index.