faster way to get differential calls from pileup?
wrighth ▴ 260
@wrighth-3452
Last seen 10.1 years ago
Hi, all; I've got a pair of lanes of exome sequencing data; we've generated pileup files from samtools and we're interested in looking for discordant calls for quality control or SNP discovery. As best I can figure out, the way to do this involves doing a findOverlaps and then programmatically iterating through the match matrix to get the matching positions and check for differences. However, the overlap finding takes several hours, and since we anticipate there being many lanes in the future, I'm curious if there's a faster or better way to go about this sort of process. Thanks...

Hollis Wright
SNP Sequencing PROcess • 1.7k views
@martin-morgan-1513
Last seen 10 weeks ago
United States
On 10/16/2010 11:54 AM, Hollis Wright wrote:
> However, the overlap finding takes several hours, and since we anticipate there being many lanes in the future I'm curious if there's a faster or better way to go about this sort of process.

This sounds like it's taking longer than findOverlaps should take; perhaps you are running out of memory (in which case, process in batches, e.g., by chromosome) or doing something inefficiently. What does your code look like (simplified, if possible)?

Martin
--
Computational Biology, Fred Hutchinson Cancer Research Center
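The per-chromosome batching suggested above can be illustrated with a minimal sketch. It is written in Python rather than R, and it does not use findOverlaps: because each pileup row describes a single base position, it swaps in an exact-match lookup keyed on position within each chromosome. The column layout (chromosome, position, reference base, consensus base in the first four tab-separated fields) is an assumption about consensus-style pileup output, and the function names are hypothetical.

```python
from collections import defaultdict


def calls_by_chromosome(lines):
    """Group consensus calls as {chromosome: {position: base}}.

    Assumes tab-separated rows whose first four fields are
    chromosome, 1-based position, reference base, consensus base.
    """
    calls = defaultdict(dict)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, consensus = fields[0], int(fields[1]), fields[3]
        calls[chrom][pos] = consensus.upper()
    return calls


def discordant_calls(lane_a, lane_b):
    """Return (chrom, pos, call_a, call_b) where the two lanes disagree.

    Only positions covered in both lanes are compared, and the
    comparison runs one chromosome at a time.
    """
    a = calls_by_chromosome(lane_a)
    b = calls_by_chromosome(lane_b)
    out = []
    for chrom in sorted(a.keys() & b.keys()):
        shared = a[chrom].keys() & b[chrom].keys()
        for pos in sorted(shared):
            if a[chrom][pos] != b[chrom][pos]:
                out.append((chrom, pos, a[chrom][pos], b[chrom][pos]))
    return out
```

This sketch still reads both lanes fully into memory before comparing; if memory is the bottleneck, the same lookup could be applied to one chromosome's rows at a time.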
@sean-davis-490
Last seen 7 weeks ago
United States
On Oct 16, 2010 2:55 PM, "Hollis Wright" <wrighth@ohsu.edu> wrote:
> Hi, all; I've got a pair of lanes of exome sequencing data; we've generated pileup files from samtools and we're interested in looking for discordant calls for quality control or snp discovery.

Hi, Hollis. Have you considered converting to VCF format and using some of the VCF tools for this type of thing? With VCF, you get one row per locus with the genotypes for all your samples in that row. Conversion to tab-delimited text is also possible for processing in R. I think Vince Carey was looking into R tools for working with VCF, but I don't know where that work stands.

All that said, several hours for finding overlaps sounds like a long time for a couple of pileup outputs from exome sequencing.

Sean
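To make the "one row per locus" point concrete, here is a minimal Python sketch that scans VCF text and reports loci where two samples carry different genotypes. It relies only on the standard VCF layout (a `#CHROM` header line, a FORMAT column in field 9, sample columns from field 10 onward, and a `GT` key in FORMAT). The function names and the sample names used to call it are hypothetical, and phasing is deliberately ignored so that `0/1` and `1|0` count as the same call.

```python
def normalise_gt(gt):
    """Return an order-independent genotype string, or None if missing."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None  # at least one allele is uncalled
    return "/".join(sorted(alleles))


def discordant_genotypes(vcf_lines, sample_a, sample_b):
    """Return (chrom, pos, gt_a, gt_b) for loci where the samples differ."""
    idx_a = idx_b = None
    out = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]
            idx_a, idx_b = samples.index(sample_a), samples.index(sample_b)
            continue
        if idx_a is None:
            raise ValueError("data line found before the #CHROM header")
        fmt = fields[8].split(":")
        if "GT" not in fmt:
            continue  # no genotype recorded at this locus
        gt_field = fmt.index("GT")
        gt_a = normalise_gt(fields[9 + idx_a].split(":")[gt_field])
        gt_b = normalise_gt(fields[9 + idx_b].split(":")[gt_field])
        if gt_a is None or gt_b is None:
            continue  # cannot compare a missing call
        if gt_a != gt_b:
            out.append((fields[0], int(fields[1]), gt_a, gt_b))
    return out
```

Because every sample sits in the same row, no overlap search is needed at all; a single pass over the file is enough.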
On Sat, Oct 16, 2010 at 5:14 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
> I think Vince Carey was looking into R tools for working with VCF, but I don't know where that work stands.

There is a vcf2sm function in the GGtools "devel" branch (it should get to release, with luck, this Monday). The intention is to take a compressed, tabix-indexed VCF file (as distributed by 1000 Genomes, specifically) and create a snpMatrix snp.matrix instance of genotype calls on a chromosome for all individuals archived in the file. The current code was written a while ago and emphasizes a small footprint, with a naive piping interface that assumes tabix is installed and trivially accessible. There is much more information available in VCF that this code makes no effort to extract. Some discussion of VCF harvesting would be in order at the Heidelberg developer meeting, and comments from interested developers and users are welcome.
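As a rough analogue of the idea described above (genotype calls for all individuals collected into one matrix), the following Python sketch builds a samples-by-SNPs table of alternate-allele counts from VCF text. It is not vcf2sm and does not reproduce the snp.matrix encoding; the function names, the `chrom:pos` SNP labels, and the 0/1/2 coding with `None` for missing calls are all assumptions made for illustration.

```python
def alt_allele_count(gt):
    """Count non-reference alleles in a GT string; None if uncalled."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None
    return sum(1 for allele in alleles if allele != "0")


def genotype_matrix(vcf_lines):
    """Return (samples, snp_ids, rows), with one row of counts per sample."""
    samples, snp_ids, columns = [], [], []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]
            continue
        fmt = fields[8].split(":")
        if "GT" not in fmt:
            continue  # locus without genotype calls
        gt_field = fmt.index("GT")
        snp_ids.append(f"{fields[0]}:{fields[1]}")
        columns.append(
            [alt_allele_count(s.split(":")[gt_field]) for s in fields[9:]]
        )
    # Transpose from one list per SNP to one list per sample.
    rows = [list(row) for row in zip(*columns)]
    return samples, snp_ids, rows
```

With a table like this, discordance between any two lanes reduces to comparing two rows element by element.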
I forgot to mention that vcf2sm was used to construct the data elements of the ceu1kg experimental data package: approximately 8 million SNP calls on each of 60 individuals, with expression data on 41 individuals from the GENEVAR archive. ceu1kg also includes GRanges instances annotating the locations and names of the SNPs.