faster way to get differential calls from pileup?


wrighth ▴ 260

@wrighth-3452

Last seen 10.2 years ago

Hi, all; I've got a pair of lanes of exome sequencing data; we've generated pileup files from samtools and we're interested in looking for discordant calls for quality control or SNP discovery. As best I can figure out, the way to do this involves doing a findOverlaps and then programmatically iterating through the match matrix to get the matching positions and check for differences. However, the overlap finding takes several hours, and since we anticipate there being many lanes in the future, I'm curious if there's a faster or better way to go about this sort of process. Thanks...

Hollis Wright

SNP Sequencing PROcess • 1.7k views

ADD COMMENT • link updated 14.1 years ago by Sean Davis 21k • written 14.1 years ago by wrighth ▴ 260


Martin Morgan 25k

@martin-morgan-1513

Last seen 4 months ago

United States

On 10/16/2010 11:54 AM, Hollis Wright wrote:

> However, the overlap finding takes several hours, and since we anticipate there being many lanes in the future I'm curious if there's a faster or better way to go about this sort of process.

This sounds like it's taking longer than findOverlaps should be taking; perhaps you are running out of memory (so process in batches, e.g., by chromosome) or doing something inefficiently. What does your code look like (simplified, if possible)?

Martin

--
Computational Biology, Fred Hutchinson Cancer Research Center
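As a rough illustration of the "process in batches, e.g., by chromosome" idea, here is a minimal Python sketch. It is not the code from the original question, and it swaps findOverlaps for an exact-position lookup, which may be enough when every pileup row describes a single base. The record layout `(chrom, pos, call)` and the function names `group_by_chrom` and `discordant_calls` are assumptions made for the example; real pileup rows would need to be parsed into that shape first.

```python
from collections import defaultdict


def group_by_chrom(records):
    """Index (chrom, pos, call) records as {chrom: {pos: call}}."""
    by_chrom = defaultdict(dict)
    for chrom, pos, call in records:
        by_chrom[chrom][pos] = call
    return by_chrom


def discordant_calls(lane_a, lane_b):
    """Return (chrom, pos, call_a, call_b) for positions present in both
    lanes whose calls differ, working through one chromosome at a time."""
    a = group_by_chrom(lane_a)
    b = group_by_chrom(lane_b)
    out = []
    # Only chromosomes seen in both lanes can hold comparable positions.
    for chrom in sorted(a.keys() & b.keys()):
        calls_a, calls_b = a[chrom], b[chrom]
        # Set intersection of positions replaces the interval overlap step.
        for pos in sorted(calls_a.keys() & calls_b.keys()):
            if calls_a[pos] != calls_b[pos]:
                out.append((chrom, pos, calls_a[pos], calls_b[pos]))
    return out


# Hypothetical toy data: two lanes with simplified consensus calls.
lane1 = [("chr1", 100, "A"), ("chr1", 200, "G"), ("chr2", 50, "T")]
lane2 = [("chr1", 100, "A"), ("chr1", 200, "R"), ("chr2", 75, "C")]
result = discordant_calls(lane1, lane2)
print(result)
```

The dictionary lookup makes each comparison a constant-time operation, so the cost grows with the number of covered positions rather than with the number of candidate pairs. To actually bound memory, the same loop could be fed one chromosome's records at a time instead of indexing whole lanes up front.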

ADD COMMENT • link 14.1 years ago Martin Morgan 25k


Sean Davis 21k

@sean-davis-490

Last seen 3 months ago

United States

On Oct 16, 2010 2:55 PM, "Hollis Wright" <wrighth@ohsu.edu> wrote:

> However, the overlap finding takes several hours, and since we anticipate there being many lanes in the future I'm curious if there's a faster or better way to go about this sort of process.

Hi, Hollis. Have you considered converting to VCF format and using some of the VCF tools for this type of thing? With VCF, you get one row per locus with the genotypes for all your samples in that row. Conversion to tab-delimited text is also possible for processing in R. I think Vince Carey was looking into R tools for working with VCF, but I don't know where that work stands.

All that said, several hours for finding overlaps sounds like a long time for a couple of pileup outputs from exome sequencing.

Sean
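To make the "one row per locus with the genotypes for all your samples" point concrete, here is a small Python sketch that scans VCF-style text and reports loci where the called genotypes disagree. It is an illustration only, not one of the VCF tools mentioned above. It assumes the standard tab-delimited VCF layout (eight fixed columns, then FORMAT, then one column per sample); the sample names `lane1` and `lane2`, the toy records, and the function names are hypothetical.

```python
import re


def genotype(sample_field, format_keys):
    """Pull the GT entry out of a sample column and normalise it to a
    sorted allele tuple, so that 0/1, 1/0 and 0|1 compare equal."""
    values = dict(zip(format_keys, sample_field.split(":")))
    gt = values.get("GT", ".")
    return tuple(sorted(re.split(r"[/|]", gt)))


def discordant_loci(vcf_lines):
    """Return (chrom, pos, genotypes) for rows where called samples differ."""
    out = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and the column header
        fields = line.rstrip("\n").split("\t")
        chrom, pos = fields[0], int(fields[1])
        format_keys = fields[8].split(":")
        gts = [genotype(s, format_keys) for s in fields[9:]]
        # Ignore samples with a missing call ('.') when comparing.
        called = [g for g in gts if "." not in g]
        if len(set(called)) > 1:
            out.append((chrom, pos, gts))
    return out


# Hypothetical two-sample VCF content, one sample per lane.
vcf_text = [
    "##fileformat=VCFv4.0",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tlane1\tlane2",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:20\t0/1:18",
    "chr1\t200\t.\tC\tT\t60\tPASS\t.\tGT:DP\t0/1:25\t1/1:30",
    "chr2\t50\t.\tG\tA\t40\tPASS\t.\tGT:DP\t./.:0\t0/1:12",
]
discordant = discordant_loci(vcf_text)
print(discordant)
```

Because every sample's call for a locus sits on the same row, no position matching between files is needed, and the scan stays a single pass however many lanes are added as extra sample columns.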

ADD COMMENT • link 14.1 years ago Sean Davis 21k


On Sat, Oct 16, 2010 at 5:14 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:

> I think Vince Carey was looking into R tools for working with VCF, but I don't know where that work stands.

There is a vcf2sm function in the GGtools "devel" branch (it should get to release, with luck, this Monday). The intention is to take a compressed, tabix-indexed VCF file (as distributed by 1000 Genomes, specifically) and create a snpMatrix snp.matrix instance for genotype calls on a chromosome for all individuals archived in a file. The current code was written a while ago and emphasizes a small footprint, with a naive piping interface that assumes tabix is installed and trivially accessible. There is much more information available in VCF that this code makes no effort to extract. Some discussion of VCF harvesting would be in order at the Heidelberg developer meeting, and comments from interested developers/users are welcome.

ADD REPLY • link 14.1 years ago Vincent J. Carey, Jr. 6.7k


I forgot to mention that vcf2sm was used to construct the data elements of the ceu1kg experimental data package: approximately 8 million SNP calls on each of 60 individuals, with expression data on 41 individuals from the GENEVAR archive. ceu1kg also includes GRanges instances annotating the locations and names of the SNPs.

ADD REPLY • link 14.1 years ago Vincent J. Carey, Jr. 6.7k
