Question

Changing Gene ID annotation style - working with Salmon output [NM...] and resulting count files are numeric-only IDs

0

Entering edit mode

holmkn • 0

@holmkn-22769

Last seen 5.2 years ago

I used Salmon generated output to generate read counts for my RNA-seq data, but when I realized I was working with (I think) RefSeq gene annotations, the only txdb variable I was successfully able to generate was using

> # Install the latest version of DEseq2 if (!requireNamespace("BiocManager", quietly = TRUE))  
> install.packages("BiocManager")
> 
> # load the library source("https://bioconductor.org/biocLite.R")
> 
> BiocManager::install("TxDb.Hsapiens.UCSC.hg38.knownGene")
> 
> BiocManager::install("tximport")
> 
> BiocManager::install(c("AnnotationHub", "Homo.sapiens",
>                        "Organism.dplyr",
>                        "TxDb.Hsapiens.UCSC.hg19.knownGene",
>                        "TxDb.Hsapiens.UCSC.hg38.knownGene",
>                        "BSgenome.Hsapiens.UCSC.hg19", "biomaRt",
>                        "TxDb.Athaliana.BioMart.plantsmart22"))
> 
> library(tximport) library(readr) library(dplyr)
> 
> 
> setwd
> ("~/Documents/School/HagermanLab/Data/RNAseq/RedoRNASeqAnalysis/")
> 
> # read in file names  files <- read_csv("samples_FXTAS_nonribo_malesPresCtls.csv")
> 
> 
> ### read in human counts
> 
> #using UCSC and trying to get RefSeq
> install.packages("RMariaDB") 
> library("RMariaDB") 
> library("GenomicFeatures") 
> txdb <-makeTxDbFromUCSC(genome="hg38", tablename="refGene")
> #https://bioconductor.org/packages/devel/bioc/vignettes/GenomicFeatures/inst/doc/GenomicFeatures.pdf
> 
> 
> #next, make keys and tx2gene object using txdb 
> k <- keys(txdb, keytype="TXNAME") 
> tx2gene <- select(txdb, k, "GENEID", "TXNAME")
> 
> data <- tximport(files = files$files, type = "salmon", tx2gene = tx2gene, ignoreTxVersion=TRUE) #ignoreTxVersion=TRUE 
> datacounts <-data$counts
> 
> # change rownames for provenance 
> colnames(datacounts) <- files$sample

However, my count output looks like this:

GeneID 1 10 100 1000 10000 100008586 100008587 100008588

I’m not sure how to match these Gene IDs, especially since I get a few different results depending on the genome reference website I use (ncbi Gene, HGNC, UCSC, Uniprot, etc).

Is there a way to automate these matches? Or can I use a different original genome in the "makeTxDBFromUCSC" tablename line, to get ENSMBL Gene IDs or any other alphanumeric ID?

Using “RefGene” as my tablename input was the only way I could successfully generate the txdb variable - are there other methods I might be missing?

tximport Gene IDs txdb RefSeq Salmon • 2.5k views

ADD COMMENT • link updated 5.2 years ago by Michael Love 43k • written 5.2 years ago by holmkn • 0

score 0 · Answer 1 · 2020-01-27

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 7 days ago

United States

We have created tximeta to automate this task but you need to use a versioned and permalink friendly set of reference transcripts (see tximeta paper for details). Gencode is the best source for a variety of reasons. Versioned and permalink friendly RefSeq sets exist but are not easy to find on the website, and beware of “latest” RefSeq sets which are not stable.

ADD COMMENT • link 5.2 years ago Michael Love 43k

0

Entering edit mode

Thanks Michael! I just tried tximeta and am following the vignette, but am a little lost.

First, if the path to my quant.sf files are each store in my files object under the column "files" [files$files], do you know how I would designate coldata?

coldata <- data.frame(files=file, names="SRR1197474", sample="1",
                      stringsAsFactors=FALSE)

coldata <- data.frame(files = files, names = sample, stringsAsFactors=FALSE) did not work, and neither did ...files = files$files

I'm still working through the rest of the code but stuck at this step.

ADD REPLY • link 5.2 years ago holmkn • 0

0

Entering edit mode

So tximeta will only help here if you use a particular FASTA from RefSeq, because the versioned and stable transcript sets are a bit buried on their website. Can you post the URL you used to download the RefSeq transcripts?

In the case that you didn't use a stable RefSeq transcript set, you need to obtain the GTF or GFF that corresponds to the FASTA you obtained. These files are paired. If they are stable and versioned, then tximeta can help, if not then you need to download both files at the same time. So again, you would need to know the URL of the transcript set so you can go back to the source and obtain the paired GTF or GFF file.

ADD REPLY • link 5.2 years ago Michael Love 43k

0

Entering edit mode

Great thank you.

I was able to get it to work by importing the feature table that corresponded to my RefSeq, and specifying the product accession number and Gene Symbol columns instead of Gene ID.

Then using tximport from the newly created tx2gene file.

I couldn't get tximeta to work, but very good to know if needed in the future. I really appreciate your quick response and aid!

> #find and download tabular file for canine from NCBI (feature_table.txt.gz) download.file(url =
 feat_table <- read_tsv('GCF_000001405.39_GRCh38.p13_feature_table.txt')
> 
> #select relevant columns: product accession and GeneID (in that order)
> 
> feat_table <- dplyr::select(feat_table, feat_table$product_accession, feat_table$symbol, feat_table$GeneID) 
feat_table <- feat_table%>%  
 unique()
> 
> #make sure table is capturing all salmon counts, 
> 
> quant <- read_tsv("SnakemakeSalmonReadcount_TestRun/outputs/quant_hs/PH4ER7_UMI_S23_quant/quant.sf")
> 
> table(quant$Name %in% feat_table$product_accession)
> write.table(feat_table, "test_humantx2gene.tsv", quote = F, row.names
> = F, sep = "\t")

ADD REPLY • link 5.2 years ago holmkn • 0

0

Entering edit mode

FWIW:

Here are the stable RefSeq FASTA permalinks (except the latest build, e.g. GRCh38.p13 which is not stable until another build appears):

https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebratemammalian/Homosapiens/allassemblyversions/

ADD REPLY • link 5.2 years ago Michael Love 43k