vmatchPDict?
David Iles ▴ 130
@david-iles-4487
Last seen 9.9 years ago
Hi,

I need to re-map the probe sequences of the Affymetrix Bovine genome array to a recent draft sequence of the sheep genome (please, don't ask why...). As a first step, I successfully created a new BSgenome package from a seed file, listing individual chromosomes as 'seqnames' and unmapped, and two multiple sequence fasta files as 'mseqnames', as per the forgeBSgenomeDataPkg vignette (see session info below).

When calling the matchPDict() function to map the probe sequences to the + and - strands of individual chromosomes, all went smoothly, but the following error occurred with multiple sequences:

    > runAnConScaff(bt.probes.all, outfile="bt.probes.2.oarv3.1.unmapped.txt")

    Target: strand + of Oar v3.1 sequence unmapped_scaffolds, unmapped_contigs
    >>> Finding all hits in strand + of sequence unmapped_scaffolds ...
    Error in matchPDict(pdict, subject) :
      please use vmatchPDict() when 'subject' is an XStringSet object (multiple sequence)

So, I edited my script to call vmatchPDict() instead, with the following result:

    > runAnConScaff(bt.probes.all, outfile="bt.probes.2.oarv3.1.unmapped.txt")

    Target: strand + of Oar v3.1 sequence unmapped_scaffolds, unmapped_contigs
    >>> Finding all hits in strand + of sequence unmapped_scaffolds ...
    Error in .local(pdict, subject, max.mismatch, min.mismatch, with.indels, :
      vmatchPDict() is not ready yet, sorry

While I can work around this by splitting the multiple sequences into loads of small fasta files, each with a single sequence, I wondered, will the vmatchPDict() function be ready in the not-too-distant future?
Many thanks

Dr David Iles
School of Biology
University of Leeds
Leeds LS2 9JT
d.e.iles at leeds.ac.uk

    > sessionInfo()
    R version 2.15.2 (2012-10-26)
    Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

    locale:
    [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

    attached base packages:
    [1] stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] BSgenome.Oaries.ISGC.Oarv3.1 BSgenome_1.26.1 Biostrings_2.26.2
    [4] GenomicRanges_1.10.5 IRanges_1.16.4 BiocGenerics_0.4.0

    loaded via a namespace (and not attached):
    [1] parallel_2.15.2 stats4_2.15.2 tools_2.15.2
BSgenome probe • 2.5k views
@herve-pages-1542
Last seen 3 days ago
Seattle, WA, United States
Hi David,

On 12/14/2012 03:45 AM, David Iles wrote:
> [...] will the vmatchPDict() function be ready in the not-too-distant future?

Sure. If I remember correctly, I delayed this because (1) it required spending a little bit of time thinking about what kind of container would be most appropriate for storing the result of vmatchPDict() (conceptually something like a list of lists of IRanges objects, or a 2-D ragged array of IRanges objects, or...), and (2) I don't think anybody asked for this before.

In the meantime the workaround, as you figured out, is of course to call matchPDict() in a loop. FWIW, vcountPDict() and vwhichPDict() are implemented.

Cheers,
H.

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
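For readers who land here with the same error, the loop workaround mentioned above might look roughly like the sketch below. It is only an illustration under stated assumptions: the `probes` and `scaffolds` objects and their toy sequences are hypothetical stand-ins for the real probe set and the multiple-sequence object from the BSgenome package, and the result is simply kept as a plain list with one MIndex per sequence. Only the + strand is searched here.

```r
library(Biostrings)

# Hypothetical toy data standing in for the array probes and the
# unmapped scaffolds/contigs (a multi-sequence DNAStringSet).
probes <- DNAStringSet(c(p1 = "ACGT", p2 = "GGCA", p3 = "TTAG"))
scaffolds <- DNAStringSet(c(
  scaffold_1 = "TTACGTGGCAACGT",
  scaffold_2 = "CCCTTAGGGCA"
))

# Preprocess the probes once; all patterns have the same width here.
pdict <- PDict(probes)

# Workaround: call matchPDict() once per sequence in the set.
# scaffolds[[i]] extracts a single DNAString, which matchPDict() accepts.
hits_by_scaffold <- lapply(seq_along(scaffolds), function(i) {
  matchPDict(pdict, scaffolds[[i]])
})
names(hits_by_scaffold) <- names(scaffolds)

# Each list element is an MIndex; [[1]] gives the IRanges of hits for probe p1.
p1_starts <- start(hits_by_scaffold[["scaffold_1"]][[1]])

# vcountPDict() already accepts a set: one row per probe, one column per sequence.
counts <- vcountPDict(pdict, scaffolds)
```

If only the number of hits per probe and sequence is needed, `vcountPDict()` alone avoids the loop entirely; the loop is useful when the hit positions themselves are required.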
Limma and edgeR both use linear models to fit the data. Could anyone explain the differences between these two linear models? I noticed that limma requires normal data and edgeR needs a negative binomial distribution.

Thanks.

Xin
Dear Xin,

the accompanying papers and package vignettes might be a good place to start. The authors of these packages have put a lot of effort into these; you could check them out.

Best wishes
Wolfgang
I hate to say this. However, to be honest, I don't think those papers are very helpful for biologists.
Hi,

I also hate to say this, but if the tutorials are written at a level that you can't make much sense of, then perhaps you should be collaborating with someone who can?

I don't mean that as an insult -- I also had a hard time digesting all of this stuff at first, but there's a non-trivial amount of expertise (stats) you need to get comfortable with before you can hope to *really* make sense of this stuff.

Bench protocols aren't written to be understood by statisticians, you know?

Not that there is any harm in someone trying to learn.

In short, limma has been used to analyze microarray data, and there are assumptions about the characteristics of that data that do not apply to sequencing data. edgeR was built to handle these differences for sequencing data.

There is also "voom" in limma, which apparently can handle sequencing data.

You are approaching the reason why each of them isn't suited for the other with your reference to normal vs. negative binomial distributions. There is a subtlety in what you said, though ... in layman's terms, it's not that they "require" normal (or neg B) data, but rather that they use these distributions as assumptions for aspects of the data in order to correctly model it ... or so. Using one in place of the other can (in some sense) be done, which is my point re: your use of "require", but it would be wrong.

Anyway. There is lots of literature (and tutorials, even on the Bioconductor website from previous workshops) on this -- i.e. the use of NB or Poisson distributions for sequencing data -- and a Google search with those keywords would likely be very fruitful if you are genuinely interested in digging into this further for self-study.

If you just want to analyze data and get results, then just take it all as gospel and use the suggested tools for the data you have at hand ;-)

HTH,
-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
I really wish the answers on this mailing list were related only to my questions. I know I am learning. If you feel the questions appear too stupid, you can simply choose to ignore them. Unnecessary suggestions are not helpful at all.
Hi, Capricy.

For the record, you are asking *volunteers* on a public email list to explain something to you. No one is trying to be patronizing, and I hope you are not taking the answers offered as anything other than an honest attempt to answer your question.

Now, back to your question. Both limma and edgeR fall into a category of models called "Generalized Linear Models". So, if you really want to get up to speed on what the differences are, you can start there. Believe it or not, YouTube has several lectures on GLMs, but Google and Wikipedia may be helpful here also, in addition to the material that Wolfgang suggested.

That said, basically, limma and edgeR differ in the components of the generalized linear model that are used. These differences are, as Steve pointed out, dictated by the characteristics of the data being analyzed. In microarray data, the data are generally approximately normally distributed. With sequencing data, the data are close to a Poisson model, but with a little more variance than expected for a Poisson; this extra variance leads to the use of the negative binomial distribution.

In summary, a generalized linear model can have different components "plugged in", and the choice of these components is dictated by the data under analysis.

Hope that helps.

Sean
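The point above about sequencing counts having more variance than a Poisson model expects can be seen in a small simulation. This is only a sketch with made-up parameters (the mean of 100 and the `size` value of 5 are arbitrary assumptions, not estimates from any real data set), using base R only:

```r
set.seed(1)

n  <- 10000   # number of simulated counts (arbitrary)
mu <- 100     # common mean for both distributions (arbitrary)

# Poisson counts: the variance is expected to equal the mean.
pois_counts <- rpois(n, lambda = mu)

# Negative binomial counts with the same mean; 'size' controls the extra
# variance: var = mu + mu^2 / size, so a smaller size means more overdispersion.
nb_counts <- rnbinom(n, mu = mu, size = 5)

summarise_counts <- function(x) c(mean = mean(x), variance = var(x))

overdispersion <- rbind(
  poisson           = summarise_counts(pois_counts),
  negative_binomial = summarise_counts(nb_counts)
)
print(round(overdispersion, 1))
```

Both rows should show a mean near 100, while the negative binomial variance is far larger than its mean. That gap is the "extra variance" that motivates a negative binomial model for count data, whereas roughly normal microarray intensities are handled with an ordinary linear model.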
Hi, Sean,

Thank you very much for the explanations.

By the way, I know this is a public email list, but I still assume "volunteers" here can act professionally. Something like "If you just want to analyze data and get results, then just take it all as gospel and use the suggested tools for the data you have at hand ;-)" sounds really funny to me. If that were the case, I wouldn't have had to post anything here...

I am here to seek help so I can understand more than just typing the code. I don't have a stats background, but I am trying to improve myself to a level where I can talk to the stats people.