I used Salmon generated output to generate read counts for my RNA-seq data, but when I realized I was working with (I think) RefSeq gene annotations, the only txdb variable I was successfully able to generate was using
> # Install the latest version of DEseq2 if (!requireNamespace("BiocManager", quietly = TRUE))
> install.packages("BiocManager")
>
> # load the library source("https://bioconductor.org/biocLite.R")
>
> BiocManager::install("TxDb.Hsapiens.UCSC.hg38.knownGene")
>
> BiocManager::install("tximport")
>
> BiocManager::install(c("AnnotationHub", "Homo.sapiens",
> "Organism.dplyr",
> "TxDb.Hsapiens.UCSC.hg19.knownGene",
> "TxDb.Hsapiens.UCSC.hg38.knownGene",
> "BSgenome.Hsapiens.UCSC.hg19", "biomaRt",
> "TxDb.Athaliana.BioMart.plantsmart22"))
>
> library(tximport) library(readr) library(dplyr)
>
>
> setwd
> ("~/Documents/School/HagermanLab/Data/RNAseq/RedoRNASeqAnalysis/")
>
> # read in file names files <- read_csv("samples_FXTAS_nonribo_malesPresCtls.csv")
>
>
> ### read in human counts
>
> #using UCSC and trying to get RefSeq
> install.packages("RMariaDB")
> library("RMariaDB")
> library("GenomicFeatures")
> txdb <-makeTxDbFromUCSC(genome="hg38", tablename="refGene")
> #https://bioconductor.org/packages/devel/bioc/vignettes/GenomicFeatures/inst/doc/GenomicFeatures.pdf
>
>
> #next, make keys and tx2gene object using txdb
> k <- keys(txdb, keytype="TXNAME")
> tx2gene <- select(txdb, k, "GENEID", "TXNAME")
>
> data <- tximport(files = files$files, type = "salmon", tx2gene = tx2gene, ignoreTxVersion=TRUE) #ignoreTxVersion=TRUE
> datacounts <-data$counts
>
> # change rownames for provenance
> colnames(datacounts) <- files$sample
However, my count output looks like this:
GeneID 1 10 100 1000 10000 100008586 100008587 100008588
I’m not sure how to match these Gene IDs, especially since I get a few different results depending on the genome reference website I use (ncbi Gene, HGNC, UCSC, Uniprot, etc).
Is there a way to automate these matches? Or can I use a different original genome in the "makeTxDBFromUCSC" tablename line, to get ENSMBL Gene IDs or any other alphanumeric ID?
Using “RefGene” as my tablename input was the only way I could successfully generate the txdb variable - are there other methods I might be missing?
Thanks Michael! I just tried tximeta and am following the vignette, but am a little lost.
First, if the path to my quant.sf files are each store in my files object under the column "files" [files$files], do you know how I would designate coldata?
coldata <- data.frame(files = files, names = sample, stringsAsFactors=FALSE) did not work, and neither did ...files = files$files
I'm still working through the rest of the code but stuck at this step.
So tximeta will only help here if you use a particular FASTA from RefSeq, because the versioned and stable transcript sets are a bit buried on their website. Can you post the URL you used to download the RefSeq transcripts?
In the case that you didn't use a stable RefSeq transcript set, you need to obtain the GTF or GFF that corresponds to the FASTA you obtained. These files are paired. If they are stable and versioned, then tximeta can help, if not then you need to download both files at the same time. So again, you would need to know the URL of the transcript set so you can go back to the source and obtain the paired GTF or GFF file.
Great thank you.
I was able to get it to work by importing the feature table that corresponded to my RefSeq, and specifying the product accession number and Gene Symbol columns instead of Gene ID.
Then using tximport from the newly created tx2gene file.
I couldn't get tximeta to work, but very good to know if needed in the future. I really appreciate your quick response and aid!
FWIW:
Here are the stable RefSeq FASTA permalinks (except the latest build, e.g. GRCh38.p13 which is not stable until another build appears):
https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebratemammalian/Homosapiens/allassemblyversions/