question about TranscriptDb


Matthew D. Wilkerson ▴ 20


Hello,

I have a question about the gene_id attribute of TxDb.Hsapiens.UCSC.hg19.knownGene, version 2.80 (latest).

I noticed that some transcripts, such as uc021ums.1, do not have an associated gene_id:

    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
    # the package exports the TxDb object under the package name
    txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
    t <- transcripts(txdb, columns=c("gene_id","tx_id","tx_name","cds_id","cds_name"))
    t[ which(elementMetadata(t)[,"tx_name"]=="uc021ums.1"), ]

I understand that some UCSC genes might not have an Entrez Gene ID associated. I checked this locus and found that the UCSC database currently does have this locus associated with LINGO3:

    hg19.knownGene.name        uc021ums.1
    hg19.knownGene.chrom       chr19
    hg19.knownGene.strand      -
    hg19.knownGene.txStart     2289996
    hg19.knownGene.txEnd       2291775
    hg19.knownGene.cdsStart    2289996
    hg19.knownGene.cdsEnd      2291775
    hg19.knownGene.exonCount   1
    hg19.knownGene.exonStarts  2289996,
    hg19.knownGene.exonEnds    2291775,
    hg19.knownGene.proteinID   P0C6S8
    hg19.knownGene.alignID     uc021ums.1
    hg19.kgXref.kgID           uc021ums.1
    hg19.kgXref.geneSymbol     LINGO3

The kgXref table was last updated 2/5/12.

The Bioconductor package was made on:

    Creation time: 2012-09-10 12:56:25 -0700 (Mon, 10 Sep 2012)

If this date also refers to the date of download, then why is this transcript not affiliated with LINGO3? If not, then what date does knownGene refer to?

Thanks,
Matt

--
Matthew D. Wilkerson, Ph.D.
Lineberger Comprehensive Cancer Center
University of North Carolina at Chapel Hill
http://www.unc.edu/~mwilkers


ADD COMMENT • link updated 12.4 years ago by Marc Carlson ★ 7.2k • written 12.4 years ago by Matthew D. Wilkerson ▴ 20


Marc Carlson ★ 7.2k


Hi Matthew,

Thanks for your detailed exploration of this. After looking more closely, I think the confusion arises because you are looking at the kgXref table, whereas the table that was actually used to attach gene IDs to the TxDb database is knownToLocusLink <http://genome.ucsc.edu/cgi-bin/hgTables?hgsid=316115443&hgta_doSchemaDb=hg19&hgta_doSchemaTable=knownToLocusLink>. Adding to the mayhem, UCSC has apparently decided to allow different values to exist in the latest versions of these two tables.

We chose to use the Entrez Gene IDs as gene identifiers because (unlike gene symbols) they represent a real identifier and can thus be relied on not to have multiple different meanings.

Marc
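The mismatch described above can be illustrated with a small base R sketch. The two data frames are toy stand-ins, not actual UCSC content: the transcript `uc000aaa.1`, the symbol `TOYGENE`, and the ID `"11111"` are made up, and the `name`/`value` column names for `knownToLocusLink` are an assumption. It also assumes, for illustration only, that uc021ums.1 has a symbol in `kgXref` but no row in `knownToLocusLink`.

```r
# Toy stand-ins for the two UCSC tables (hypothetical rows, not real UCSC data).
kgXref <- data.frame(
  kgID       = c("uc021ums.1", "uc000aaa.1"),
  geneSymbol = c("LINGO3", "TOYGENE"),
  stringsAsFactors = FALSE
)
knownToLocusLink <- data.frame(
  name  = "uc000aaa.1",   # assumption: uc021ums.1 has no row in this table
  value = "11111",        # made-up Entrez-style ID
  stringsAsFactors = FALSE
)

# Left join: keep every transcript from kgXref and attach an Entrez ID where one exists.
joined <- merge(kgXref, knownToLocusLink,
                by.x = "kgID", by.y = "name", all.x = TRUE)
names(joined)[names(joined) == "value"] <- "gene_id"

# Transcripts that have a gene symbol but no Entrez Gene ID.
orphans <- joined[is.na(joined$gene_id), ]
print(orphans)
```

Under these assumptions, a transcript can carry a symbol in one table and still come out with a missing gene_id when the other table is the source of the IDs.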

ADD COMMENT • link 12.4 years ago Marc Carlson ★ 7.2k


I have also been bitten by the fact that some transcripts are missing gene IDs. Is it possible to add placeholder gene IDs to these? For example, just assigning them UNKNOWN1, UNKNOWN2, etc.?

ADD REPLY • link 12.4 years ago Ryan C. Thompson ★ 7.9k


Unfortunately, if we did that, there could be all sorts of unfortunate consequences.

By doing this, you would be introducing an arbitrary number of new strings as IDs for all of these orphaned transcripts. And unlike NAs (the traditional way of indicating that data is missing in R), you would get no warnings about any of these when you used them in subsequent analysis. Others could use your new faux IDs to get into all sorts of trouble. It would be even worse because they would be mixed in with real IDs (Entrez Gene IDs), which would lend them a confusing air of authenticity. Downstream users might even mix the faux IDs from different species.

And even if we accepted the risks, we don't have a good way of always grouping the unassigned transcripts, which means that transcripts that are probably from the same gene would be assigned like this:

    unknown1 = tx1 (overlaps with tx2)
    unknown2 = tx2 (overlaps with tx1)
    etc.

So this strategy would also end up implying things that we know are sometimes not going to be true. Meanwhile, these half-wrong unknown transcript assignments would be mixed in with the "real" ones...

I could go on and on, but I am hoping you can see some of what I am concerned about?

Anyhow, you can already discover which genes are associated with transcripts in many other ways. The simplest approach is probably to just use select() like this:

    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
    txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
    # all transcript names serve as lookup keys
    k <- keys(txdb, keytype="TXNAME")
    # transcripts without an Entrez Gene ID come back with NA in GENEID
    res <- select(txdb, keys=k, columns=c("TXNAME","GENEID"), keytype="TXNAME")
    head(res)

Alternatively, you could also do something like this (if you had already called transcripts() as below):

    t <- transcripts(txdb, columns="gene_id")
    as.character(mcols(t)$gene_id)

Marc
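The concern about placeholders can be made concrete with a minimal base R sketch. The table below is a toy example with hypothetical transcript names and IDs, not data from the TxDb package. It assumes tx1 and tx2 are two orphaned transcripts that probably belong to the same gene.

```r
# Toy transcript-to-gene table (hypothetical IDs); two transcripts lack a gene ID.
tx <- data.frame(
  tx_name = c("tx1", "tx2", "tx3", "tx4"),
  gene_id = c(NA, NA, "100", "100"),
  stringsAsFactors = FALSE
)

# With NA, missing values are easy to detect and can be excluded from a gene count.
n_missing  <- sum(is.na(tx$gene_id))
n_genes_na <- length(unique(na.omit(tx$gene_id)))

# With placeholders, each orphan silently becomes its own "gene".
placeholder <- tx$gene_id
placeholder[is.na(placeholder)] <- paste0("UNKNOWN", seq_len(n_missing))
n_genes_placeholder <- length(unique(placeholder))

cat("missing:", n_missing, "\n")
cat("genes (NA kept):", n_genes_na, "\n")
cat("genes (placeholders):", n_genes_placeholder, "\n")
```

In this toy case the placeholder version reports three genes and nothing signals that two of them are invented, whereas the NA version keeps the two unassigned transcripts visibly unassigned.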

ADD REPLY • link 12.4 years ago Marc Carlson ★ 7.2k


Thank you for those other suggestions. I will try some of them out when I get the chance. Incidentally, the problem I ran into with the missing gene IDs was that cummeRbund chokes on cuffdiff results if the GTF file used has some transcripts with no gene ID.
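One possible way to sidestep this, not proposed in the thread itself, is to set aside transcripts without a gene ID before building the annotation file. The sketch below is a minimal base R illustration on a toy table (hypothetical names and IDs). It assumes that dropping such transcripts is acceptable for the analysis, and it does not show the GTF export or any cummeRbund call.

```r
# Toy annotation table (hypothetical rows); NA marks a transcript without a gene ID.
annot <- data.frame(
  tx_name = c("tx1", "tx2", "tx3"),
  gene_id = c("100", NA, "200"),
  stringsAsFactors = FALSE
)

# Split into transcripts that can be exported and those set aside for review.
has_gene   <- !is.na(annot$gene_id)
exportable <- annot[has_gene, ]
set_aside  <- annot[!has_gene, ]

print(exportable)
print(set_aside$tx_name)
```

Keeping the set-aside transcripts in a separate object, rather than discarding them, leaves a record of what was excluded.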

ADD REPLY • link 12.4 years ago Ryan C. Thompson ★ 7.9k
