Entering edit mode
Hi Ian,
I'm putting this discussion back on the mailing list as it might
be of interest to others.
On 11-08-01 02:00 AM, Ian Henry wrote:
> Hi Herve,
>
> I did some extra digging and the UCSC refGene track for NM_001320
CSNK2B
> shows:
>
> chr6 31633656 31637843 CSNK2B
> chr6_cox_hap2 3143276 3147463 CSNK2B
> chr6_dbb_hap3 2919235 2923423 CSNK2B
> chr6_mcf_hap5 3013336 3017523 CSNK2B
> chr6_qbl_hap6 2927299 2931486 CSNK2B
> chr6_mann_hap4 2976546 2980733 CSNK2B
> chr6_ssto_hap7 2964461 2968648 CSNK2B
>
>
> Which seems to be the results in my table. So the answer appears to
be
> that the transcripts are on 'unusual chromosomes' could I filter
these
> away an just keep chr6 easily?
>
> Thanks so much for your help,
>
> Ian
>
>
> Begin forwarded message:
>
>> *From: *Ian Henry <henry at="" mpi-cbg.de="" <mailto:henry="" at="" mpi-="" cbg.de="">>
>> *Date: *August 1, 2011 10:31:54 AM GMT+02:00
>> *To: *Hervé Pagès <hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">>
>> *Subject: **Re: [BioC] Probe matching using vmatchPDict*
>>
>> Hello Herve,
>>
>> Thanks for your help,
>>
>> I got the code running by first using vwhichPDict and then running
>> matchPDict on those positive matches. It takes around 15 minutes to
>> run on the around 6000 transcripts hit by my probes which isn't
bad.
>>
>>
>> getSequenceMatches<-function(transcript)
>> {
>> probeset_pdict <- PDict(probeset)
>> txmatches <- matchPDict(probeset_pdict, transcript[[1]])
>> match <-extractAllMatches(transcript[[1]], txmatches)
>> if( length(match)>0 ){
>> matchdf <- as.data.frame(match)
>> matchdf$Transcript = names(transcript)
>> matchdf$Sequence <- as.character(subseq(rep(transcript[1],
>> length(matchdf$Transcript)), start=matchdf$start, end=matchdf$end))
>> return(matchdf)
>> }
>> }
>>
>> tx_matches <- vwhichPDict(probeset_pdict, hg19_tx)
>> tx_match_indexes <- which(elementLengths(tx_matches)>0)
>>
>> f <- function(f) getSequenceMatches(hg19_tx[f])
>> a <- lapply(tx_match_indexes,f)
>> outputTable<-do.call(rbind,a)
>>
>>
>> I have one small additional question which may or may not be best
>> directed at you. If I extract the human hg19 refGene transcript
>> database as follows:
>>
>> > hg19txdb <- makeTranscriptDbFromUCSC(genome = "hg19", tablename =
>> "refGene")
>> > hg19_tx <- extractTranscriptsFromGenome(Hsapiens, hg19txdb)
>>
>> This works, but if I explore the object produced to look at
transcript
>> names:
>>
>> > t1 <- names(hg19_tx)
>> > t1<- data.frame(t1)
>> > subset(t1, t1=="NM_001320")
>>
>> this gives:
>> t1
>> 12387 NM_001320
>> 38438 NM_001320
>> 38976 NM_001320
>> 39122 NM_001320
>> 39645 NM_001320
>> 39963 NM_001320
>> 40204 NM_001320
>>
>> So there are multiple entries for one transcript. Do you know why
this
>> is, I guess it is a 'feature' of the UCSC refGene? I suspect they
may
>> be versions of the same transcript e.g. NM_001320.1, NM_001320.2
>> etc.... but the version number is missing and I guess I'd want the
>> current transcript for my purposes, as in my probe position matches
>> often I get multiple positions with the same transcript ID. Not
sure
>> if you can help, but any advice would be greatly appreciated.
Yes this a "feature" of the UCSC refGene track. AFAIK the UCSC
knownGene track does not contain duplicated transcript IDs.
Note that the names you see on the DNAStringSet object returned
by extractTranscriptsFromGenome() are exactly the names you get
on the GRangesList object returned by exonsBy() when you group
the exons by transcripts and use 'use.names=TRUE':
> exbytx <- exonsBy(hg19txdb, by="tx", use.names=TRUE)
Warning message:
In .set.group.names(ans, use.names, txdb, by) :
some group names are NAs or duplicated
> identical(names(exbytx), names(hg19_tx))
[1] TRUE
> table(duplicated(names(exbytx)))
FALSE TRUE
37308 3231
Also note the warning about some group names being NAs or duplicated.
Here the group names are the transcript names i.e. the top-level
names of 'exbytx'. You also get this warning when calling
extractTranscriptsFromGenome() on 'hg19txdb', which is not
surprising because 'exonsBy( , by="tx", use.names=TRUE)' is
called internally.
To keep only the transcripts that belong to a pre-defined subset
of genomic sequences of interest, you can do:
## Let's say my sequences of interest are the main chromosomes:
> mychroms <- paste("chr", c(1:22, "X", "Y", "M"), sep="")
## 'exbytx' contains the sequence that each exon belongs to.
## We use this to infer the sequence that each transcript belongs to.
> tx_seqnames <- lapply(seqnames(exbytx), runValue)
## Just to make sure that no transcript contains exons from more
## than 1 sequence:
> all(elementLengths(tx_seqnames) == 1)
[1] TRUE
> tx_seqnames <- unlist(tx_seqnames)
> idx <- which(tx_seqnames %in% mychroms)
## Note that, even transcripts on the main chromosomes have
## duplicated names:
> table(duplicated(names(exbytx)[idx]))
FALSE TRUE
37285 1015
## Finally, subset 'hg19_tx':
> mytx <- hg19_tx[idx]
Cheers,
H.
>>
>> Thanks,
>>
>> Ian
>>
>>
>>
>> On Jul 25, 2011, at 7:57 PM, Hervé Pagès wrote:
>>
>>> Hi Ian,
>>>
>>> On 11-07-25 08:41 AM, Ian Henry wrote:
>>>> Hello,
>>>>
>>>> I'm trying to match a list of 60mer probes against a
transcriptome to
>>>> see which probes hit which transcripts.
>>>>
>>>> I have my 60mer probe list as a DNAStringSet and also as a
"PDict"
>>>> > probeset <- DNAStringSet(probelist$ProbeSeq)
>>>> > probeset_pdict <- PDict(probeset)
>>>>
>>>> My transcriptome was created as follows:
>>>> > zv9txdb <- makeTranscriptDbFromUCSC(genome = "danRer7",
tablename =
>>>> "ensGene")
>>>> > zv9_tx <- extractTranscriptsFromGenome(Drerio, zv9txdb)
>>>>
>>>>
>>>> To find which transcripts are hit by the probes I've used
vwhichPDict:
>>>> > tx_matches <- vwhichPDict(probeset_pdict, zv9_tx)
>>>> which works brilliantly!
>>>>
>>>> However, I also would like the locations of the matches and so
tried:
>>>> > tx_locs <- vmatchPDict(probeset_pdict, zv9_tx)
>>>> This doesn't work and errors to say:
>>>> Error in .local(pdict, subject, max.mismatch, min.mismatch,
>>>> with.indels, :
>>>> vmatchPDict() is not ready yet, sorry
>>>>
>>>> Does this just mean it's not yet implemented and is there a
>>>> solution/workaround?
>>>
>>> It is not yet implemented and has been on my TODO list for a long
time,
>>> sorry...
>>>
>>> One workaround I can think of: use matchPDict in a loop (you loop
over
>>> the transcripts). The Drerio transcriptome is not be that big so
it
>>> might run in descent time. It might help a little bit to reduce
the
>>> size of the problem by getting rid of the probes and transcripts
that
>>> don't have hit (based on the result of vwhichPDict).
>>>
>>> Let me know if that's not fast enough or if you need further help
with
>>> this.
>>>
>>> Cheers,
>>> H.
>>>
>>>>
>>>> vmatchPDict(probeset_pdict, Drerio) works but I'd really like to
match
>>>> to the transcriptome rather than the genome.
>>>>
>>>> Thanks for any advice in advance,
>>>>
>>>> Ian
>>>>
>>>> Ian Henry
>>>> MPI-CBG Dresden
>>>> Pfotenhauerstrasse 108
>>>> 01307 Dresden
>>>> Germany
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org="">
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org="">
>>> Phone: (206) 667-5791
>>> Fax: (206) 667-1319
>>
>> Dr. Ian Henry
>> Bioinformatics Service Leader
>> MPI-CBG Dresden
>> Pfotenhauerstrasse 108
>> 01307 Dresden
>> Germany
>> phone: +49 351 210 2638
>> email: henry at mpi-cbg.de <mailto:henry at="" mpi-cbg.de="">
>> web: http://people.mpi-cbg.de/henry
>>
>> Check out the Bioinformatics Service wiki site:
>> https://wiki.mpi-cbg.de/wiki/bioinfo/
>>
>
> Dr. Ian Henry
> Bioinformatics Service Leader
> MPI-CBG Dresden
> Pfotenhauerstrasse 108
> 01307 Dresden
> Germany
> phone: +49 351 210 2638
> email: henry at mpi-cbg.de <mailto:henry at="" mpi-cbg.de="">
> web: http://people.mpi-cbg.de/henry
>
> Check out the Bioinformatics Service wiki site:
> https://wiki.mpi-cbg.de/wiki/bioinfo/
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319