Fwd: Probe matching using vmatchPDict

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 21 hours ago

Seattle, WA, United States

Hi Ian, I'm putting this discussion back on the mailing list as it might be of interest to others. On 11-08-01 02:00 AM, Ian Henry wrote: > Hi Herve, > > I did some extra digging and the UCSC refGene track for NM_001320 CSNK2B > shows: > > chr6 31633656 31637843 CSNK2B > chr6_cox_hap2 3143276 3147463 CSNK2B > chr6_dbb_hap3 2919235 2923423 CSNK2B > chr6_mcf_hap5 3013336 3017523 CSNK2B > chr6_qbl_hap6 2927299 2931486 CSNK2B > chr6_mann_hap4 2976546 2980733 CSNK2B > chr6_ssto_hap7 2964461 2968648 CSNK2B > > > Which seems to be the results in my table. So the answer appears to be > that the transcripts are on 'unusual chromosomes' could I filter these > away an just keep chr6 easily? > > Thanks so much for your help, > > Ian > > > Begin forwarded message: > >> *From: *Ian Henry <henry at="" mpi-cbg.de="" <mailto:henry="" at="" mpi-="" cbg.de="">> >> *Date: *August 1, 2011 10:31:54 AM GMT+02:00 >> *To: *Hervé Pagès <hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> >> *Subject: **Re: [BioC] Probe matching using vmatchPDict* >> >> Hello Herve, >> >> Thanks for your help, >> >> I got the code running by first using vwhichPDict and then running >> matchPDict on those positive matches. It takes around 15 minutes to >> run on the around 6000 transcripts hit by my probes which isn't bad. >> >> >> getSequenceMatches<-function(transcript) >> { >> probeset_pdict <- PDict(probeset) >> txmatches <- matchPDict(probeset_pdict, transcript[[1]]) >> match <-extractAllMatches(transcript[[1]], txmatches) >> if( length(match)>0 ){ >> matchdf <- as.data.frame(match) >> matchdf$Transcript = names(transcript) >> matchdf$Sequence <- as.character(subseq(rep(transcript[1], >> length(matchdf$Transcript)), start=matchdf$start, end=matchdf$end)) >> return(matchdf) >> } >> } >> >> tx_matches <- vwhichPDict(probeset_pdict, hg19_tx) >> tx_match_indexes <- which(elementLengths(tx_matches)>0) >> >> f <- function(f) getSequenceMatches(hg19_tx[f]) >> a <- lapply(tx_match_indexes,f) >> outputTable<-do.call(rbind,a) >> >> >> I have one small additional question which may or may not be best >> directed at you. If I extract the human hg19 refGene transcript >> database as follows: >> >> > hg19txdb <- makeTranscriptDbFromUCSC(genome = "hg19", tablename = >> "refGene") >> > hg19_tx <- extractTranscriptsFromGenome(Hsapiens, hg19txdb) >> >> This works, but if I explore the object produced to look at transcript >> names: >> >> > t1 <- names(hg19_tx) >> > t1<- data.frame(t1) >> > subset(t1, t1=="NM_001320") >> >> this gives: >> t1 >> 12387 NM_001320 >> 38438 NM_001320 >> 38976 NM_001320 >> 39122 NM_001320 >> 39645 NM_001320 >> 39963 NM_001320 >> 40204 NM_001320 >> >> So there are multiple entries for one transcript. Do you know why this >> is, I guess it is a 'feature' of the UCSC refGene? I suspect they may >> be versions of the same transcript e.g. NM_001320.1, NM_001320.2 >> etc.... but the version number is missing and I guess I'd want the >> current transcript for my purposes, as in my probe position matches >> often I get multiple positions with the same transcript ID. Not sure >> if you can help, but any advice would be greatly appreciated. Yes this a "feature" of the UCSC refGene track. AFAIK the UCSC knownGene track does not contain duplicated transcript IDs. Note that the names you see on the DNAStringSet object returned by extractTranscriptsFromGenome() are exactly the names you get on the GRangesList object returned by exonsBy() when you group the exons by transcripts and use 'use.names=TRUE': > exbytx <- exonsBy(hg19txdb, by="tx", use.names=TRUE) Warning message: In .set.group.names(ans, use.names, txdb, by) : some group names are NAs or duplicated > identical(names(exbytx), names(hg19_tx)) [1] TRUE > table(duplicated(names(exbytx))) FALSE TRUE 37308 3231 Also note the warning about some group names being NAs or duplicated. Here the group names are the transcript names i.e. the top-level names of 'exbytx'. You also get this warning when calling extractTranscriptsFromGenome() on 'hg19txdb', which is not surprising because 'exonsBy( , by="tx", use.names=TRUE)' is called internally. To keep only the transcripts that belong to a pre-defined subset of genomic sequences of interest, you can do: ## Let's say my sequences of interest are the main chromosomes: > mychroms <- paste("chr", c(1:22, "X", "Y", "M"), sep="") ## 'exbytx' contains the sequence that each exon belongs to. ## We use this to infer the sequence that each transcript belongs to. > tx_seqnames <- lapply(seqnames(exbytx), runValue) ## Just to make sure that no transcript contains exons from more ## than 1 sequence: > all(elementLengths(tx_seqnames) == 1) [1] TRUE > tx_seqnames <- unlist(tx_seqnames) > idx <- which(tx_seqnames %in% mychroms) ## Note that, even transcripts on the main chromosomes have ## duplicated names: > table(duplicated(names(exbytx)[idx])) FALSE TRUE 37285 1015 ## Finally, subset 'hg19_tx': > mytx <- hg19_tx[idx] Cheers, H. >> >> Thanks, >> >> Ian >> >> >> >> On Jul 25, 2011, at 7:57 PM, Hervé Pagès wrote: >> >>> Hi Ian, >>> >>> On 11-07-25 08:41 AM, Ian Henry wrote: >>>> Hello, >>>> >>>> I'm trying to match a list of 60mer probes against a transcriptome to >>>> see which probes hit which transcripts. >>>> >>>> I have my 60mer probe list as a DNAStringSet and also as a "PDict" >>>> > probeset <- DNAStringSet(probelist$ProbeSeq) >>>> > probeset_pdict <- PDict(probeset) >>>> >>>> My transcriptome was created as follows: >>>> > zv9txdb <- makeTranscriptDbFromUCSC(genome = "danRer7", tablename = >>>> "ensGene") >>>> > zv9_tx <- extractTranscriptsFromGenome(Drerio, zv9txdb) >>>> >>>> >>>> To find which transcripts are hit by the probes I've used vwhichPDict: >>>> > tx_matches <- vwhichPDict(probeset_pdict, zv9_tx) >>>> which works brilliantly! >>>> >>>> However, I also would like the locations of the matches and so tried: >>>> > tx_locs <- vmatchPDict(probeset_pdict, zv9_tx) >>>> This doesn't work and errors to say: >>>> Error in .local(pdict, subject, max.mismatch, min.mismatch, >>>> with.indels, : >>>> vmatchPDict() is not ready yet, sorry >>>> >>>> Does this just mean it's not yet implemented and is there a >>>> solution/workaround? >>> >>> It is not yet implemented and has been on my TODO list for a long time, >>> sorry... >>> >>> One workaround I can think of: use matchPDict in a loop (you loop over >>> the transcripts). The Drerio transcriptome is not be that big so it >>> might run in descent time. It might help a little bit to reduce the >>> size of the problem by getting rid of the probes and transcripts that >>> don't have hit (based on the result of vwhichPDict). >>> >>> Let me know if that's not fast enough or if you need further help with >>> this. >>> >>> Cheers, >>> H. >>> >>>> >>>> vmatchPDict(probeset_pdict, Drerio) works but I'd really like to match >>>> to the transcriptome rather than the genome. >>>> >>>> Thanks for any advice in advance, >>>> >>>> Ian >>>> >>>> Ian Henry >>>> MPI-CBG Dresden >>>> Pfotenhauerstrasse 108 >>>> 01307 Dresden >>>> Germany >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >>> -- >>> Hervé Pagès >>> >>> Program in Computational Biology >>> Division of Public Health Sciences >>> Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N, M1-B514 >>> P.O. Box 19024 >>> Seattle, WA 98109-1024 >>> >>> E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> >>> Phone: (206) 667-5791 >>> Fax: (206) 667-1319 >> >> Dr. Ian Henry >> Bioinformatics Service Leader >> MPI-CBG Dresden >> Pfotenhauerstrasse 108 >> 01307 Dresden >> Germany >> phone: +49 351 210 2638 >> email: henry at mpi-cbg.de <mailto:henry at="" mpi-cbg.de=""> >> web: http://people.mpi-cbg.de/henry >> >> Check out the Bioinformatics Service wiki site: >> https://wiki.mpi-cbg.de/wiki/bioinfo/ >> > > Dr. Ian Henry > Bioinformatics Service Leader > MPI-CBG Dresden > Pfotenhauerstrasse 108 > 01307 Dresden > Germany > phone: +49 351 210 2638 > email: henry at mpi-cbg.de <mailto:henry at="" mpi-cbg.de=""> > web: http://people.mpi-cbg.de/henry > > Check out the Bioinformatics Service wiki site: > https://wiki.mpi-cbg.de/wiki/bioinfo/ > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

Cancer probe Cancer probe • 1.2k views

ADD COMMENT • link updated 13.7 years ago by Ian Henry ▴ 50 • written 13.7 years ago by Hervé Pagès 16k

0

Entering edit mode

Ian Henry ▴ 50

@ian-henry-4772

Last seen 10.6 years ago

> Hi Ian, > > I'm putting this discussion back on the mailing list as it might > be of interest to others. OK sounds good! I'll report some more of my findings too, to make things more clear. This issue is really due to the way the refGene track is created by the UCSC. They use BLAT and can take multiple strong BLAT alignments to their assembly, which includes these 'unusual chromosomes'. However, it can also be that a transcript maps multiple times to a 'standard chromsome'. If you create a transcriptDB object: > hg19txdb <- makeTranscriptDbFromUCSC(genome = "hg19", tablename = "refGene") then you can explore it: > txdb_data <-as.list(hg19txdb) > txList <-txdb_data$transcripts This gives a transcript table where it is possible to see that one tx_name (NM_????????) is actually not unique but has a tx_id associated with it that is unique. tx_id tx_name tx_chrom tx_strand tx_start tx_end 613 NM_001008221 chr1 + 104198141 104207173 2324 NM_001008221 chr1 + 104292279 104301311 2323 NM_001008221 chr1 - 104230040 104239073 In this case NM_001008221 is a salivary amylase alpha 1A precursor I looked at these positions in the UCSC genome browser to get an idea of what other annotation tracks had to say: chr1:104198141-104207173 has AMY1A,B,C,D mapping in the refseq track but is described as AMY1A by VEGA chr1:104230040 104239073 has AMY1A,B,C,D mapping in the refseq track but is described as AMY1B by VEGA chr1:104292279 104301311 has AMY1A,B,C,D mapping in the refseq track but is described as AMY1C by VEGA I guess the multiple mapping is due to this BLAT sequence similarity. It is also worth pointing out that if you want to get these transcripts out of the TranscriptDB object you would normally do the following: >hg19_tx <- extractTranscriptsFromGenome(Hsapiens, hg19txdb) The object returned though, does not appear to keep the unique tx_ids and so contains multiple copies of some NM_???????s without a reference to their origin (ie tx_id). Anyway, for my purposes I didn't want these multiple mappings and so I decided to just take the transcript with the longest sequence. There is most likely a better way to do this, but this is what I did: > test_tx<-hg19_tx > names<-names(test_tx) > tmpdf<-as.data.frame(names) > tmpdf$width<-width(test_tx) > tmpdf$seq<-as.character(subseq(test_tx)) > tx_unique_df<-ddply(tmpdf, .(names), summarize, width=max(width), seq=seq[which.max(width)]) > tx_unique<-DNAStringSet(tx_unique_df$seq) > names(tx_unique)<-tx_unique_df$names This creates a new DNAStringSet containing just the longest mapping transcript. (although I've not used it for anything yet) I thought I'd include this info for clarification and to hopefully help others. Ian On Aug 3, 2011, at 10:19 AM, Hervé Pagès wrote: > Hi Ian, > > I'm putting this discussion back on the mailing list as it might > be of interest to others. > > On 11-08-01 02:00 AM, Ian Henry wrote: >> Hi Herve, >> >> I did some extra digging and the UCSC refGene track for NM_001320 >> CSNK2B >> shows: >> >> chr6 31633656 31637843 CSNK2B >> chr6_cox_hap2 3143276 3147463 CSNK2B >> chr6_dbb_hap3 2919235 2923423 CSNK2B >> chr6_mcf_hap5 3013336 3017523 CSNK2B >> chr6_qbl_hap6 2927299 2931486 CSNK2B >> chr6_mann_hap4 2976546 2980733 CSNK2B >> chr6_ssto_hap7 2964461 2968648 CSNK2B >> >> >> Which seems to be the results in my table. So the answer appears to >> be >> that the transcripts are on 'unusual chromosomes' could I filter >> these >> away an just keep chr6 easily? >> >> Thanks so much for your help, >> >> Ian >> >> >> Begin forwarded message: >> >>> *From: *Ian Henry <henry@mpi-cbg.de <mailto:henry@mpi-cbg.de="">> >>> *Date: *August 1, 2011 10:31:54 AM GMT+02:00 >>> *To: *Hervé Pagès <hpages@fhcrc.org <mailto:hpages@fhcrc.org="">> >>> *Subject: **Re: [BioC] Probe matching using vmatchPDict* >>> >>> Hello Herve, >>> >>> Thanks for your help, >>> >>> I got the code running by first using vwhichPDict and then running >>> matchPDict on those positive matches. It takes around 15 minutes to >>> run on the around 6000 transcripts hit by my probes which isn't bad. >>> >>> >>> getSequenceMatches<-function(transcript) >>> { >>> probeset_pdict <- PDict(probeset) >>> txmatches <- matchPDict(probeset_pdict, transcript[[1]]) >>> match <-extractAllMatches(transcript[[1]], txmatches) >>> if( length(match)>0 ){ >>> matchdf <- as.data.frame(match) >>> matchdf$Transcript = names(transcript) >>> matchdf$Sequence <- as.character(subseq(rep(transcript[1], >>> length(matchdf$Transcript)), start=matchdf$start, end=matchdf$end)) >>> return(matchdf) >>> } >>> } >>> >>> tx_matches <- vwhichPDict(probeset_pdict, hg19_tx) >>> tx_match_indexes <- which(elementLengths(tx_matches)>0) >>> >>> f <- function(f) getSequenceMatches(hg19_tx[f]) >>> a <- lapply(tx_match_indexes,f) >>> outputTable<-do.call(rbind,a) >>> >>> >>> I have one small additional question which may or may not be best >>> directed at you. If I extract the human hg19 refGene transcript >>> database as follows: >>> >>> > hg19txdb <- makeTranscriptDbFromUCSC(genome = "hg19", tablename = >>> "refGene") >>> > hg19_tx <- extractTranscriptsFromGenome(Hsapiens, hg19txdb) >>> >>> This works, but if I explore the object produced to look at >>> transcript >>> names: >>> >>> > t1 <- names(hg19_tx) >>> > t1<- data.frame(t1) >>> > subset(t1, t1=="NM_001320") >>> >>> this gives: >>> t1 >>> 12387 NM_001320 >>> 38438 NM_001320 >>> 38976 NM_001320 >>> 39122 NM_001320 >>> 39645 NM_001320 >>> 39963 NM_001320 >>> 40204 NM_001320 >>> >>> So there are multiple entries for one transcript. Do you know why >>> this >>> is, I guess it is a 'feature' of the UCSC refGene? I suspect they >>> may >>> be versions of the same transcript e.g. NM_001320.1, NM_001320.2 >>> etc.... but the version number is missing and I guess I'd want the >>> current transcript for my purposes, as in my probe position matches >>> often I get multiple positions with the same transcript ID. Not sure >>> if you can help, but any advice would be greatly appreciated. > > Yes this a "feature" of the UCSC refGene track. AFAIK the UCSC > knownGene track does not contain duplicated transcript IDs. > Note that the names you see on the DNAStringSet object returned > by extractTranscriptsFromGenome() are exactly the names you get > on the GRangesList object returned by exonsBy() when you group > the exons by transcripts and use 'use.names=TRUE': > > > exbytx <- exonsBy(hg19txdb, by="tx", use.names=TRUE) > Warning message: > In .set.group.names(ans, use.names, txdb, by) : > some group names are NAs or duplicated > > > identical(names(exbytx), names(hg19_tx)) > [1] TRUE > > > table(duplicated(names(exbytx))) > > FALSE TRUE > 37308 3231 > > Also note the warning about some group names being NAs or duplicated. > Here the group names are the transcript names i.e. the top-level > names of 'exbytx'. You also get this warning when calling > extractTranscriptsFromGenome() on 'hg19txdb', which is not > surprising because 'exonsBy( , by="tx", use.names=TRUE)' is > called internally. > > To keep only the transcripts that belong to a pre-defined subset > of genomic sequences of interest, you can do: > > ## Let's say my sequences of interest are the main chromosomes: > > mychroms <- paste("chr", c(1:22, "X", "Y", "M"), sep="") > > ## 'exbytx' contains the sequence that each exon belongs to. > ## We use this to infer the sequence that each transcript belongs to. > > tx_seqnames <- lapply(seqnames(exbytx), runValue) > ## Just to make sure that no transcript contains exons from more > ## than 1 sequence: > > all(elementLengths(tx_seqnames) == 1) > [1] TRUE > > tx_seqnames <- unlist(tx_seqnames) > > idx <- which(tx_seqnames %in% mychroms) > > ## Note that, even transcripts on the main chromosomes have > ## duplicated names: > > table(duplicated(names(exbytx)[idx])) > > FALSE TRUE > 37285 1015 > > ## Finally, subset 'hg19_tx': > > mytx <- hg19_tx[idx] > > Cheers, > H. > >>> >>> Thanks, >>> >>> Ian >>> >>> >>> >>> On Jul 25, 2011, at 7:57 PM, Hervé Pagès wrote: >>> >>>> Hi Ian, >>>> >>>> On 11-07-25 08:41 AM, Ian Henry wrote: >>>>> Hello, >>>>> >>>>> I'm trying to match a list of 60mer probes against a >>>>> transcriptome to >>>>> see which probes hit which transcripts. >>>>> >>>>> I have my 60mer probe list as a DNAStringSet and also as a "PDict" >>>>> > probeset <- DNAStringSet(probelist$ProbeSeq) >>>>> > probeset_pdict <- PDict(probeset) >>>>> >>>>> My transcriptome was created as follows: >>>>> > zv9txdb <- makeTranscriptDbFromUCSC(genome = "danRer7", >>>>> tablename = >>>>> "ensGene") >>>>> > zv9_tx <- extractTranscriptsFromGenome(Drerio, zv9txdb) >>>>> >>>>> >>>>> To find which transcripts are hit by the probes I've used >>>>> vwhichPDict: >>>>> > tx_matches <- vwhichPDict(probeset_pdict, zv9_tx) >>>>> which works brilliantly! >>>>> >>>>> However, I also would like the locations of the matches and so >>>>> tried: >>>>> > tx_locs <- vmatchPDict(probeset_pdict, zv9_tx) >>>>> This doesn't work and errors to say: >>>>> Error in .local(pdict, subject, max.mismatch, min.mismatch, >>>>> with.indels, : >>>>> vmatchPDict() is not ready yet, sorry >>>>> >>>>> Does this just mean it's not yet implemented and is there a >>>>> solution/workaround? >>>> >>>> It is not yet implemented and has been on my TODO list for a long >>>> time, >>>> sorry... >>>> >>>> One workaround I can think of: use matchPDict in a loop (you loop >>>> over >>>> the transcripts). The Drerio transcriptome is not be that big so it >>>> might run in descent time. It might help a little bit to reduce the >>>> size of the problem by getting rid of the probes and transcripts >>>> that >>>> don't have hit (based on the result of vwhichPDict). >>>> >>>> Let me know if that's not fast enough or if you need further help >>>> with >>>> this. >>>> >>>> Cheers, >>>> H. >>>> >>>>> >>>>> vmatchPDict(probeset_pdict, Drerio) works but I'd really like to >>>>> match >>>>> to the transcriptome rather than the genome. >>>>> >>>>> Thanks for any advice in advance, >>>>> >>>>> Ian >>>>> >>>>> Ian Henry >>>>> MPI-CBG Dresden >>>>> Pfotenhauerstrasse 108 >>>>> 01307 Dresden >>>>> Germany >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor@r-project.org <mailto:bioconductor@r-project.org> >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> >>>> -- >>>> Hervé Pagès >>>> >>>> Program in Computational Biology >>>> Division of Public Health Sciences >>>> Fred Hutchinson Cancer Research Center >>>> 1100 Fairview Ave. N, M1-B514 >>>> P.O. Box 19024 >>>> Seattle, WA 98109-1024 >>>> >>>> E-mail: hpages@fhcrc.org <mailto:hpages@fhcrc.org> >>>> Phone: (206) 667-5791 >>>> Fax: (206) 667-1319 >>> >>> Dr. Ian Henry >>> Bioinformatics Service Leader >>> MPI-CBG Dresden >>> Pfotenhauerstrasse 108 >>> 01307 Dresden >>> Germany >>> phone: +49 351 210 2638 >>> email: henry@mpi-cbg.de <mailto:henry@mpi-cbg.de> >>> web: http://people.mpi-cbg.de/henry >>> >>> Check out the Bioinformatics Service wiki site: >>> https://wiki.mpi-cbg.de/wiki/bioinfo/ >>> >> >> Dr. Ian Henry >> Bioinformatics Service Leader >> MPI-CBG Dresden >> Pfotenhauerstrasse 108 >> 01307 Dresden >> Germany >> phone: +49 351 210 2638 >> email: henry@mpi-cbg.de <mailto:henry@mpi-cbg.de> >> web: http://people.mpi-cbg.de/henry >> >> Check out the Bioinformatics Service wiki site: >> https://wiki.mpi-cbg.de/wiki/bioinfo/ >> > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 Dr. Ian Henry Bioinformatics Service Leader MPI-CBG Dresden Pfotenhauerstrasse 108 01307 Dresden Germany phone: +49 351 210 2638 email: henry@mpi-cbg.de web: http://people.mpi-cbg.de/henry Check out the Bioinformatics Service wiki site: https://wiki.mpi- cbg.de/wiki/bioinfo/ [[alternative HTML version deleted]]

ADD COMMENT • link 13.7 years ago Ian Henry ▴ 50

0

Entering edit mode

Hi Ian, On 11-08-03 01:58 AM, Ian Henry wrote: >> Hi Ian, >> >> I'm putting this discussion back on the mailing list as it might >> be of interest to others. > > OK sounds good! I'll report some more of my findings too, to make things > more clear. > > This issue is really due to the way the refGene track is created by the > UCSC. They use BLAT and can take multiple strong BLAT alignments to > their assembly, which includes these 'unusual chromosomes'. However, it > can also be that a transcript maps multiple times to a 'standard chromsome'. > > If you create a transcriptDB object: > > hg19txdb <- makeTranscriptDbFromUCSC(genome = "hg19", tablename = > "refGene") > > then you can explore it: > > txdb_data <-as.list(hg19txdb) > > txList <-txdb_data$transcripts > > This gives a transcript table where it is possible to see that one > tx_name (NM_????????) is actually not unique but has a tx_id associated > with it that is unique. > > tx_id tx_name tx_chrom tx_strand tx_start tx_end > 613 NM_001008221 chr1 + 104198141 104207173 > 2324 NM_001008221 chr1 + 104292279 104301311 > 2323 NM_001008221 chr1 - 104230040 104239073 Yes the tx_id col is guaranteed to contain distinct values. Note that those ids are not coming from UCSC: they were assigned to each transcript when the TranscriptDb object was created. This is just standard db design practice: the main tables in the underlying SQLite db of a TranscriptDb object are the transcript, exon and cds tables. And they all have a column that serves as a primary key: tx_id, exon_id and cds_id, respectively. Those "TranscriptDb internal ids" have no meaning outside the TranscriptDb object e.g. you cannot use them to connect transcripts from 2 different TranscriptDb objects. Even creating the same TranscriptDb object twice (with 2 identical calls to makeTranscriptDbFromUCSC()) is not guaranteed to generate the same internal ids. The only purpose of this "TranscriptDb internal id" is to unambiguously identify rows in the 3 main tables, and, in particular, to unambiguously link exons and cds to their corresponding transcript. Note that you can control what names are put on the object returned by most TranscriptDb extractors (e.g. transcripts(), transcriptsBy(), exons(), exonsBy(), extractTranscriptsFromGenome() etc...) with the use.names arg: when 'use.names=FALSE', then the internal ids are used for the names. > > In this case NM_001008221 is a salivary amylase alpha 1A precursor > > I looked at these positions in the UCSC genome browser to get an idea of > what other annotation tracks had to say: > > chr1:104198141-104207173 has AMY1A,B,C,D mapping in the refseq track but > is described as AMY1A by VEGA > chr1:104230040 104239073 has AMY1A,B,C,D mapping in the refseq track but > is described as AMY1B by VEGA > chr1:104292279 104301311 has AMY1A,B,C,D mapping in the refseq track but > is described as AMY1C by VEGA > > I guess the multiple mapping is due to this BLAT sequence similarity. Some details on how the RefSeq RNAs were aligned against the reference genome are provided on the track description page: http://genome.ucsc.edu/cgi- bin/hgTrackUi?hgsid=205474499&c=chr21&g=refGene > > It is also worth pointing out that if you want to get these transcripts > out of the TranscriptDB object you would normally do the following: > >hg19_tx <- extractTranscriptsFromGenome(Hsapiens, hg19txdb) > > The object returned though, does not appear to keep the unique tx_ids > and so contains multiple copies of some NM_???????s without a reference > to their origin (ie tx_id). If you want the tx_id ("TranscriptDb internal ids") instead of the UCSC tx names, call extractTranscriptsFromGenome() with 'use.names=FALSE'. Unlike for the other TranscriptDb extractors, the default here is 'use.names=TRUE'. See ?extractTranscriptsFromGenome for the details. > > Anyway, for my purposes I didn't want these multiple mappings and so I > decided to just take the transcript with the longest sequence. > There is most likely a better way to do this, but this is what I did: > > > test_tx<-hg19_tx > > names<-names(test_tx) > > tmpdf<-as.data.frame(names) > > tmpdf$width<-width(test_tx) > > tmpdf$seq<-as.character(subseq(test_tx)) > > tx_unique_df<-ddply(tmpdf, .(names), summarize, width=max(width), > seq=seq[which.max(width)]) > > tx_unique<-DNAStringSet(tx_unique_df$seq) > > names(tx_unique)<-tx_unique_df$names > > This creates a new DNAStringSet containing just the longest mapping > transcript. (although I've not used it for anything yet) Note that in the case of NM_001008221 (the salivary amylase alpha 1A precursor above), all the transcripts have the same length: > idx <- which(!is.na(match(names(hg19_tx), "NM_001008221"))) > width(hg19_tx)[idx] [1] 1781 1781 1781 Furthermore, the sequences are the same: > hg19_tx[idx] A DNAStringSet instance of length 3 width seq names [1] 1781 CTTCTCAATATCAGCACTGGATT...ATTAAATGCAAATCCGCAAAGCA NM_001008221 [2] 1781 CTTCTCAATATCAGCACTGGATT...ATTAAATGCAAATCCGCAAAGCA NM_001008221 [3] 1781 CTTCTCAATATCAGCACTGGATT...ATTAAATGCAAATCCGCAAAGCA NM_001008221 > duplicated(hg19_tx[idx]) [1] FALSE TRUE TRUE I would actually not be surprised that you see something like this for almost all those transcripts with repeated IDs since they are the result of mapping the same RefSeq RNA. When you map your probes against the transcriptome, it seems unavoidable that you will end up with probes that have multiple hits: some of those ambiguous probes will hit more than 1 transcript *name*, but even probes that hit a single transcript name can hit the reference genome in multiple places just because a single transcript name/sequence can occur at more than 1 location in the reference genome. Cheers, H. > > I thought I'd include this info for clarification and to hopefully help > others. > > Ian > > > > On Aug 3, 2011, at 10:19 AM, Hervé Pagès wrote: > >> Hi Ian, >> >> I'm putting this discussion back on the mailing list as it might >> be of interest to others. >> >> On 11-08-01 02:00 AM, Ian Henry wrote: >>> Hi Herve, >>> >>> I did some extra digging and the UCSC refGene track for NM_001320 CSNK2B >>> shows: >>> >>> chr6 31633656 31637843 CSNK2B >>> chr6_cox_hap2 3143276 3147463 CSNK2B >>> chr6_dbb_hap3 2919235 2923423 CSNK2B >>> chr6_mcf_hap5 3013336 3017523 CSNK2B >>> chr6_qbl_hap6 2927299 2931486 CSNK2B >>> chr6_mann_hap4 2976546 2980733 CSNK2B >>> chr6_ssto_hap7 2964461 2968648 CSNK2B >>> >>> >>> Which seems to be the results in my table. So the answer appears to be >>> that the transcripts are on 'unusual chromosomes' could I filter these >>> away an just keep chr6 easily? >>> >>> Thanks so much for your help, >>> >>> Ian >>> >>> >>> Begin forwarded message: >>> >>>> *From: *Ian Henry <henry at="" mpi-cbg.de="" <mailto:henry="" at="" mpi-="" cbg.de="">> >>>> *Date: *August 1, 2011 10:31:54 AM GMT+02:00 >>>> *To: *Hervé Pagès <hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> >>>> *Subject: **Re: [BioC] Probe matching using vmatchPDict* >>>> >>>> Hello Herve, >>>> >>>> Thanks for your help, >>>> >>>> I got the code running by first using vwhichPDict and then running >>>> matchPDict on those positive matches. It takes around 15 minutes to >>>> run on the around 6000 transcripts hit by my probes which isn't bad. >>>> >>>> >>>> getSequenceMatches<-function(transcript) >>>> { >>>> probeset_pdict <- PDict(probeset) >>>> txmatches <- matchPDict(probeset_pdict, transcript[[1]]) >>>> match <-extractAllMatches(transcript[[1]], txmatches) >>>> if( length(match)>0 ){ >>>> matchdf <- as.data.frame(match) >>>> matchdf$Transcript = names(transcript) >>>> matchdf$Sequence <- as.character(subseq(rep(transcript[1], >>>> length(matchdf$Transcript)), start=matchdf$start, end=matchdf$end)) >>>> return(matchdf) >>>> } >>>> } >>>> >>>> tx_matches <- vwhichPDict(probeset_pdict, hg19_tx) >>>> tx_match_indexes <- which(elementLengths(tx_matches)>0) >>>> >>>> f <- function(f) getSequenceMatches(hg19_tx[f]) >>>> a <- lapply(tx_match_indexes,f) >>>> outputTable<-do.call(rbind,a) >>>> >>>> >>>> I have one small additional question which may or may not be best >>>> directed at you. If I extract the human hg19 refGene transcript >>>> database as follows: >>>> >>>> > hg19txdb <- makeTranscriptDbFromUCSC(genome = "hg19", tablename = >>>> "refGene") >>>> > hg19_tx <- extractTranscriptsFromGenome(Hsapiens, hg19txdb) >>>> >>>> This works, but if I explore the object produced to look at transcript >>>> names: >>>> >>>> > t1 <- names(hg19_tx) >>>> > t1<- data.frame(t1) >>>> > subset(t1, t1=="NM_001320") >>>> >>>> this gives: >>>> t1 >>>> 12387 NM_001320 >>>> 38438 NM_001320 >>>> 38976 NM_001320 >>>> 39122 NM_001320 >>>> 39645 NM_001320 >>>> 39963 NM_001320 >>>> 40204 NM_001320 >>>> >>>> So there are multiple entries for one transcript. Do you know why this >>>> is, I guess it is a 'feature' of the UCSC refGene? I suspect they may >>>> be versions of the same transcript e.g. NM_001320.1, NM_001320.2 >>>> etc.... but the version number is missing and I guess I'd want the >>>> current transcript for my purposes, as in my probe position matches >>>> often I get multiple positions with the same transcript ID. Not sure >>>> if you can help, but any advice would be greatly appreciated. >> >> Yes this a "feature" of the UCSC refGene track. AFAIK the UCSC >> knownGene track does not contain duplicated transcript IDs. >> Note that the names you see on the DNAStringSet object returned >> by extractTranscriptsFromGenome() are exactly the names you get >> on the GRangesList object returned by exonsBy() when you group >> the exons by transcripts and use 'use.names=TRUE': >> >> > exbytx <- exonsBy(hg19txdb, by="tx", use.names=TRUE) >> Warning message: >> In .set.group.names(ans, use.names, txdb, by) : >> some group names are NAs or duplicated >> >> > identical(names(exbytx), names(hg19_tx)) >> [1] TRUE >> >> > table(duplicated(names(exbytx))) >> >> FALSE TRUE >> 37308 3231 >> >> Also note the warning about some group names being NAs or duplicated. >> Here the group names are the transcript names i.e. the top-level >> names of 'exbytx'. You also get this warning when calling >> extractTranscriptsFromGenome() on 'hg19txdb', which is not >> surprising because 'exonsBy( , by="tx", use.names=TRUE)' is >> called internally. >> >> To keep only the transcripts that belong to a pre-defined subset >> of genomic sequences of interest, you can do: >> >> ## Let's say my sequences of interest are the main chromosomes: >> > mychroms <- paste("chr", c(1:22, "X", "Y", "M"), sep="") >> >> ## 'exbytx' contains the sequence that each exon belongs to. >> ## We use this to infer the sequence that each transcript belongs to. >> > tx_seqnames <- lapply(seqnames(exbytx), runValue) >> ## Just to make sure that no transcript contains exons from more >> ## than 1 sequence: >> > all(elementLengths(tx_seqnames) == 1) >> [1] TRUE >> > tx_seqnames <- unlist(tx_seqnames) >> > idx <- which(tx_seqnames %in% mychroms) >> >> ## Note that, even transcripts on the main chromosomes have >> ## duplicated names: >> > table(duplicated(names(exbytx)[idx])) >> >> FALSE TRUE >> 37285 1015 >> >> ## Finally, subset 'hg19_tx': >> > mytx <- hg19_tx[idx] >> >> Cheers, >> H. >> >>>> >>>> Thanks, >>>> >>>> Ian >>>> >>>> >>>> >>>> On Jul 25, 2011, at 7:57 PM, Hervé Pagès wrote: >>>> >>>>> Hi Ian, >>>>> >>>>> On 11-07-25 08:41 AM, Ian Henry wrote: >>>>>> Hello, >>>>>> >>>>>> I'm trying to match a list of 60mer probes against a transcriptome to >>>>>> see which probes hit which transcripts. >>>>>> >>>>>> I have my 60mer probe list as a DNAStringSet and also as a "PDict" >>>>>> > probeset <- DNAStringSet(probelist$ProbeSeq) >>>>>> > probeset_pdict <- PDict(probeset) >>>>>> >>>>>> My transcriptome was created as follows: >>>>>> > zv9txdb <- makeTranscriptDbFromUCSC(genome = "danRer7", tablename = >>>>>> "ensGene") >>>>>> > zv9_tx <- extractTranscriptsFromGenome(Drerio, zv9txdb) >>>>>> >>>>>> >>>>>> To find which transcripts are hit by the probes I've used vwhichPDict: >>>>>> > tx_matches <- vwhichPDict(probeset_pdict, zv9_tx) >>>>>> which works brilliantly! >>>>>> >>>>>> However, I also would like the locations of the matches and so tried: >>>>>> > tx_locs <- vmatchPDict(probeset_pdict, zv9_tx) >>>>>> This doesn't work and errors to say: >>>>>> Error in .local(pdict, subject, max.mismatch, min.mismatch, >>>>>> with.indels, : >>>>>> vmatchPDict() is not ready yet, sorry >>>>>> >>>>>> Does this just mean it's not yet implemented and is there a >>>>>> solution/workaround? >>>>> >>>>> It is not yet implemented and has been on my TODO list for a long time, >>>>> sorry... >>>>> >>>>> One workaround I can think of: use matchPDict in a loop (you loop over >>>>> the transcripts). The Drerio transcriptome is not be that big so it >>>>> might run in descent time. It might help a little bit to reduce the >>>>> size of the problem by getting rid of the probes and transcripts that >>>>> don't have hit (based on the result of vwhichPDict). >>>>> >>>>> Let me know if that's not fast enough or if you need further help with >>>>> this. >>>>> >>>>> Cheers, >>>>> H. >>>>> >>>>>> >>>>>> vmatchPDict(probeset_pdict, Drerio) works but I'd really like to match >>>>>> to the transcriptome rather than the genome. >>>>>> >>>>>> Thanks for any advice in advance, >>>>>> >>>>>> Ian >>>>>> >>>>>> Ian Henry >>>>>> MPI-CBG Dresden >>>>>> Pfotenhauerstrasse 108 >>>>>> 01307 Dresden >>>>>> Germany >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: >>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>>> >>>>> -- >>>>> Hervé Pagès >>>>> >>>>> Program in Computational Biology >>>>> Division of Public Health Sciences >>>>> Fred Hutchinson Cancer Research Center >>>>> 1100 Fairview Ave. N, M1-B514 >>>>> P.O. Box 19024 >>>>> Seattle, WA 98109-1024 >>>>> >>>>> E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> >>>>> Phone: (206) 667-5791 >>>>> Fax: (206) 667-1319 >>>> >>>> Dr. Ian Henry >>>> Bioinformatics Service Leader >>>> MPI-CBG Dresden >>>> Pfotenhauerstrasse 108 >>>> 01307 Dresden >>>> Germany >>>> phone: +49 351 210 2638 >>>> email: henry at mpi-cbg.de <mailto:henry at="" mpi-cbg.de=""> >>>> web: http://people.mpi-cbg.de/henry >>>> >>>> Check out the Bioinformatics Service wiki site: >>>> https://wiki.mpi-cbg.de/wiki/bioinfo/ >>>> >>> >>> Dr. Ian Henry >>> Bioinformatics Service Leader >>> MPI-CBG Dresden >>> Pfotenhauerstrasse 108 >>> 01307 Dresden >>> Germany >>> phone: +49 351 210 2638 >>> email: henry at mpi-cbg.de <mailto:henry at="" mpi-cbg.de=""> >>> web: http://people.mpi-cbg.de/henry >>> >>> Check out the Bioinformatics Service wiki site: >>> https://wiki.mpi-cbg.de/wiki/bioinfo/ >>> >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 > > Dr. Ian Henry > Bioinformatics Service Leader > MPI-CBG Dresden > Pfotenhauerstrasse 108 > 01307 Dresden > Germany > phone: +49 351 210 2638 > email: henry at mpi-cbg.de <mailto:henry at="" mpi-cbg.de=""> > web: http://people.mpi-cbg.de/henry > > Check out the Bioinformatics Service wiki site: > https://wiki.mpi-cbg.de/wiki/bioinfo/ > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 13.7 years ago Hervé Pagès 16k

Login before adding your answer.