Comparing DNAStringSetLists
3
0
Entering edit mode
@vince-s-buffalo-4618
Last seen 10.2 years ago
United States
Hi All, I have two vectors of alleles stored as DNAStringSetLists. For each element in both lists, I need to find the length of the intersecting set. Using mapply() and intersect() take too long, as does sapply(dna.set.list, as.character) (and then using mclapply or lapply to find intersect on characters). Is there a fast way to do this? I have vectors ~12 million rows long. Vince -- Vince Buffalo Ross-Ibarra Lab www.rilab.org) Plant Sciences, UC Davis [[alternative HTML version deleted]]
• 1.2k views
ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 4 months ago
United States
On 10/15/2013 04:16 PM, Vince S. Buffalo wrote: > Hi All, > > I have two vectors of alleles stored as DNAStringSetLists. For each element > in both lists, I need to find the length of the intersecting set. Using > mapply() and intersect() take too long, as does sapply(dna.set.list, > as.character) (and then using mclapply or lapply to find intersect on > characters). Is there a fast way to do this? I have vectors ~12 million > rows long. For a couple of hacky solutions, maybe create an index i1 into one of the lists l1 i1 <- rep(seq_along(l1), elementLengths(l1)) then create artificial alleles that are tagged by the element id x1 <- paste0(unlist(l1), i1) x2 <- paste0(unlist(l2), rep(seq_along(l2), elementLengths(l2))) and count how many of x1 are in x2, grouped by i1 tabulate(i1[x1 %in% x2]) This seems to be faster than sum(as(l1, "CharacterList"), %in% as(l2, "CharacterList")) (the x1 or as() could be surrounded by unique() if the elements are not already). Martin > > Vince > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
ADD COMMENT
0
Entering edit mode
@steve-lianoglou-2771
Last seen 21 months ago
United States
Hi Vince, On Tue, Oct 15, 2013 at 4:16 PM, Vince S. Buffalo <vsbuffalo at="" gmail.com=""> wrote: > Hi All, > > I have two vectors of alleles stored as DNAStringSetLists. For each element > in both lists, I need to find the length of the intersecting set. Using > mapply() and intersect() take too long, as does sapply(dna.set.list, > as.character) (and then using mclapply or lapply to find intersect on > characters). Is there a fast way to do this? I have vectors ~12 million > rows long. Perhaps data.table can help here, but I'm having a hard time understanding the data you have and the output you want. Could you provide small test-set sized dataset that we can poke at? Thanks, -steve -- Steve Lianoglou Computational Biologist Bioinformatics and Computational Biology Genentech
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 1 hour ago
Seattle, WA, United States
Hi Vince, Sounds like maybe you have a use case for a pintersect method for DNAStringSetList objects: setMethod("pintersect", c("DNAStringSetList", "DNAStringSetList"), function(x, y, ...) { if (length(x) != length(y)) stop("'x' and 'y' must have the same length") ux <- unlist(x, use.name=FALSE) uy <- unlist(y, use.name=FALSE) string_id <- selfmatch(c(ux, uy)) ux_id <- string_id[seq_along(ux)] uy_id <- string_id[seq_along(uy) + length(ux)] ux_group <- rep.int(seq_along(x), elementLengths(x)) uy_group <- rep.int(seq_along(y), elementLengths(y)) m <- IRanges:::matchIntegerPairs(ux_group, ux_id, uy_group, uy_id) keep_idx <- which(!is.na(m)) ux <- ux[keep_idx] ux_group <- ux_group[keep_idx] ux_id <- ux_id[keep_idx] sm <- IRanges:::selfmatchIntegerPairs(ux_group, ux_id) keep_idx <- sm == seq_along(sm) ux <- ux[keep_idx] ux_group <- ux_group[keep_idx] ans_skeleton <- PartitioningByEnd(ux_group, NG=length(x)) relist(ux, ans_skeleton) } ) Then: > alleles1 <- DNAStringSetList("A", c("C", "A"), c("G", "A", "T"), c("T", "G")) > alleles2 <- DNAStringSetList(c("T", "A", "G"), c("A", "G"), "C", c("G", "T")) > pintersect(alleles1, alleles2) DNAStringSetList of length 4 [[1]] A [[2]] A [[3]] A DNAStringSet instance of length 0 [[4]] T G Should take about 20 sec. on your 12-million long vectors. Then use elementLengths() on this result to get the lengths of the intersecting sets. HTH, H. On 10/15/2013 04:16 PM, Vince S. Buffalo wrote: > Hi All, > > I have two vectors of alleles stored as DNAStringSetLists. For each element > in both lists, I need to find the length of the intersecting set. Using > mapply() and intersect() take too long, as does sapply(dna.set.list, > as.character) (and then using mclapply or lapply to find intersect on > characters). Is there a fast way to do this? I have vectors ~12 million > rows long. > > Vince > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD COMMENT
0
Entering edit mode
Neat. I was just going to suggest using matchIntegerPairs(). You could have used togroup(x) for this right? ux_group <- rep.int(seq_along(x), elementLengths(x)) On Tue, Oct 15, 2013 at 10:45 PM, Hervé Pagès <hpages@fhcrc.org> wrote: > Hi Vince, > > Sounds like maybe you have a use case for a pintersect method for > DNAStringSetList objects: > > setMethod("pintersect", c("DNAStringSetList", "DNAStringSetList"), > function(x, y, ...) > { > if (length(x) != length(y)) > stop("'x' and 'y' must have the same length") > > ux <- unlist(x, use.name=FALSE) > uy <- unlist(y, use.name=FALSE) > string_id <- selfmatch(c(ux, uy)) > ux_id <- string_id[seq_along(ux)] > uy_id <- string_id[seq_along(uy) + length(ux)] > ux_group <- rep.int(seq_along(x), elementLengths(x)) > uy_group <- rep.int(seq_along(y), elementLengths(y)) > m <- IRanges:::matchIntegerPairs(**ux_group, ux_id, uy_group, > uy_id) > keep_idx <- which(!is.na(m)) > ux <- ux[keep_idx] > ux_group <- ux_group[keep_idx] > ux_id <- ux_id[keep_idx] > sm <- IRanges:::**selfmatchIntegerPairs(ux_**group, ux_id) > keep_idx <- sm == seq_along(sm) > ux <- ux[keep_idx] > ux_group <- ux_group[keep_idx] > ans_skeleton <- PartitioningByEnd(ux_group, NG=length(x)) > relist(ux, ans_skeleton) > } > ) > > Then: > > > alleles1 <- DNAStringSetList("A", c("C", "A"), c("G", "A", "T"), > c("T", "G")) > > > alleles2 <- DNAStringSetList(c("T", "A", "G"), c("A", "G"), "C", > c("G", "T")) > > > pintersect(alleles1, alleles2) > DNAStringSetList of length 4 > [[1]] A > [[2]] A > [[3]] A DNAStringSet instance of length 0 > [[4]] T G > > Should take about 20 sec. on your 12-million long vectors. > > Then use elementLengths() on this result to get the lengths of the > intersecting sets. > > HTH, > H. > > > > On 10/15/2013 04:16 PM, Vince S. Buffalo wrote: > >> Hi All, >> >> I have two vectors of alleles stored as DNAStringSetLists. For each >> element >> in both lists, I need to find the length of the intersecting set. Using >> mapply() and intersect() take too long, as does sapply(dna.set.list, >> as.character) (and then using mclapply or lapply to find intersect on >> characters). Is there a fast way to do this? I have vectors ~12 million >> rows long. >> >> Vince >> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > > ______________________________**_________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.et="" hz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: http://news.gmane.org/gmane.** > science.biology.informatics.**conductor<http: news.gmane.org="" gmane.="" science.biology.informatics.conductor=""> > [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
And I guess this also works for CharacterList, and a bit easier for IntegerList. On Wed, Oct 16, 2013 at 6:27 AM, Michael Lawrence <michafla@gene.com> wrote: > Neat. I was just going to suggest using matchIntegerPairs(). > > You could have used togroup(x) for this right? > ux_group <- rep.int(seq_along(x), elementLengths(x)) > > > On Tue, Oct 15, 2013 at 10:45 PM, Hervé Pagès <hpages@fhcrc.org> wrote: > >> Hi Vince, >> >> Sounds like maybe you have a use case for a pintersect method for >> DNAStringSetList objects: >> >> setMethod("pintersect", c("DNAStringSetList", "DNAStringSetList"), >> function(x, y, ...) >> { >> if (length(x) != length(y)) >> stop("'x' and 'y' must have the same length") >> >> ux <- unlist(x, use.name=FALSE) >> uy <- unlist(y, use.name=FALSE) >> string_id <- selfmatch(c(ux, uy)) >> ux_id <- string_id[seq_along(ux)] >> uy_id <- string_id[seq_along(uy) + length(ux)] >> ux_group <- rep.int(seq_along(x), elementLengths(x)) >> uy_group <- rep.int(seq_along(y), elementLengths(y)) >> m <- IRanges:::matchIntegerPairs(**ux_group, ux_id, uy_group, >> uy_id) >> keep_idx <- which(!is.na(m)) >> ux <- ux[keep_idx] >> ux_group <- ux_group[keep_idx] >> ux_id <- ux_id[keep_idx] >> sm <- IRanges:::**selfmatchIntegerPairs(ux_**group, ux_id) >> keep_idx <- sm == seq_along(sm) >> ux <- ux[keep_idx] >> ux_group <- ux_group[keep_idx] >> ans_skeleton <- PartitioningByEnd(ux_group, NG=length(x)) >> relist(ux, ans_skeleton) >> } >> ) >> >> Then: >> >> > alleles1 <- DNAStringSetList("A", c("C", "A"), c("G", "A", "T"), >> c("T", "G")) >> >> > alleles2 <- DNAStringSetList(c("T", "A", "G"), c("A", "G"), "C", >> c("G", "T")) >> >> > pintersect(alleles1, alleles2) >> DNAStringSetList of length 4 >> [[1]] A >> [[2]] A >> [[3]] A DNAStringSet instance of length 0 >> [[4]] T G >> >> Should take about 20 sec. on your 12-million long vectors. >> >> Then use elementLengths() on this result to get the lengths of the >> intersecting sets. >> >> HTH, >> H. >> >> >> >> On 10/15/2013 04:16 PM, Vince S. Buffalo wrote: >> >>> Hi All, >>> >>> I have two vectors of alleles stored as DNAStringSetLists. For each >>> element >>> in both lists, I need to find the length of the intersecting set. Using >>> mapply() and intersect() take too long, as does sapply(dna.set.list, >>> as.character) (and then using mclapply or lapply to find intersect on >>> characters). Is there a fast way to do this? I have vectors ~12 million >>> rows long. >>> >>> Vince >>> >>> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages@fhcrc.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> > > [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
Right, this is actually quite generic and would work on any CompressedList derivative with an elementType that supports c(), [, and match() (by default sefmatch() will just call match()). So it should go to IRanges rather than Biostrings. The other parallel set operations (punion, psetdiff) could be implemented in the same manner. Added to my TODO list. Yes, togroup(x) does rep.int(seq_along(x), elementLengths(x)) but it's cleaner to use the former. H. On 10/16/2013 06:31 AM, Michael Lawrence wrote: > And I guess this also works for CharacterList, and a bit easier for > IntegerList. > > > On Wed, Oct 16, 2013 at 6:27 AM, Michael Lawrence <michafla at="" gene.com=""> <mailto:michafla at="" gene.com="">> wrote: > > Neat. I was just going to suggest using matchIntegerPairs(). > > You could have used togroup(x) for this right? > ux_group <- rep.int <http: rep.int="">(seq_along(x), elementLengths(x)) > > > On Tue, Oct 15, 2013 at 10:45 PM, Hervé Pagès <hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org="">> wrote: > > Hi Vince, > > Sounds like maybe you have a use case for a pintersect method for > DNAStringSetList objects: > > setMethod("pintersect", c("DNAStringSetList", > "DNAStringSetList"), > function(x, y, ...) > { > if (length(x) != length(y)) > stop("'x' and 'y' must have the same length") > > ux <- unlist(x, use.name <http: use.name="">=FALSE) > uy <- unlist(y, use.name <http: use.name="">=FALSE) > string_id <- selfmatch(c(ux, uy)) > ux_id <- string_id[seq_along(ux)] > uy_id <- string_id[seq_along(uy) + length(ux)] > ux_group <- rep.int <http: rep.int="">(seq_along(x), > elementLengths(x)) > uy_group <- rep.int <http: rep.int="">(seq_along(y), > elementLengths(y)) > m <- IRanges:::matchIntegerPairs(__ux_group, ux_id, > uy_group, uy_id) > keep_idx <- which(!is.na <http: is.na="">(m)) > ux <- ux[keep_idx] > ux_group <- ux_group[keep_idx] > ux_id <- ux_id[keep_idx] > sm <- IRanges:::__selfmatchIntegerPairs(ux___group, ux_id) > keep_idx <- sm == seq_along(sm) > ux <- ux[keep_idx] > ux_group <- ux_group[keep_idx] > ans_skeleton <- PartitioningByEnd(ux_group, NG=length(x)) > relist(ux, ans_skeleton) > } > ) > > Then: > > > alleles1 <- DNAStringSetList("A", c("C", "A"), c("G", "A", > "T"), c("T", "G")) > > > alleles2 <- DNAStringSetList(c("T", "A", "G"), c("A", "G"), > "C", c("G", "T")) > > > pintersect(alleles1, alleles2) > DNAStringSetList of length 4 > [[1]] A > [[2]] A > [[3]] A DNAStringSet instance of length 0 > [[4]] T G > > Should take about 20 sec. on your 12-million long vectors. > > Then use elementLengths() on this result to get the lengths of the > intersecting sets. > > HTH, > H. > > > > On 10/15/2013 04:16 PM, Vince S. Buffalo wrote: > > Hi All, > > I have two vectors of alleles stored as DNAStringSetLists. > For each element > in both lists, I need to find the length of the intersecting > set. Using > mapply() and intersect() take too long, as does > sapply(dna.set.list, > as.character) (and then using mclapply or lapply to find > intersect on > characters). Is there a fast way to do this? I have vectors > ~12 million > rows long. > > Vince > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > Phone: (206) 667-5791 <tel:%28206%29%20667-5791> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319> > > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD REPLY
0
Entering edit mode
Hi All, Thank you all so much for your excellent and prompt replies. I've used Hervé's solution and it worked wonderfully — it took less than 1 minute. Bioconductor's power and speed never ceases to amaze me. Thanks, Vince On Wed, Oct 16, 2013 at 9:20 AM, Hervé Pagès <hpages@fhcrc.org> wrote: > Right, this is actually quite generic and would work on any > CompressedList derivative with an elementType that supports c(), [, > and match() (by default sefmatch() will just call match()). So it > should go to IRanges rather than Biostrings. The other parallel set > operations (punion, psetdiff) could be implemented in the same manner. > Added to my TODO list. > > Yes, togroup(x) does rep.int(seq_along(x), elementLengths(x)) but > it's cleaner to use the former. > > H. > > > On 10/16/2013 06:31 AM, Michael Lawrence wrote: > >> And I guess this also works for CharacterList, and a bit easier for >> IntegerList. >> >> >> On Wed, Oct 16, 2013 at 6:27 AM, Michael Lawrence <michafla@gene.com>> <mailto:michafla@gene.com>> wrote: >> >> Neat. I was just going to suggest using matchIntegerPairs(). >> >> You could have used togroup(x) for this right? >> ux_group <- rep.int <http: rep.int="">(seq_along(x), elementLengths(x)) >> >> >> >> On Tue, Oct 15, 2013 at 10:45 PM, Hervé Pagès <hpages@fhcrc.org>> <mailto:hpages@fhcrc.org>> wrote: >> >> Hi Vince, >> >> Sounds like maybe you have a use case for a pintersect method for >> DNAStringSetList objects: >> >> setMethod("pintersect", c("DNAStringSetList", >> "DNAStringSetList"), >> function(x, y, ...) >> { >> if (length(x) != length(y)) >> stop("'x' and 'y' must have the same length") >> >> ux <- unlist(x, use.name <http: use.name="">=FALSE) >> uy <- unlist(y, use.name <http: use.name="">=FALSE) >> >> string_id <- selfmatch(c(ux, uy)) >> ux_id <- string_id[seq_along(ux)] >> uy_id <- string_id[seq_along(uy) + length(ux)] >> ux_group <- rep.int <http: rep.int="">(seq_along(x), >> elementLengths(x)) >> uy_group <- rep.int <http: rep.int="">(seq_along(y), >> elementLengths(y)) >> m <- IRanges:::matchIntegerPairs(__**ux_group, ux_id, >> uy_group, uy_id) >> keep_idx <- which(!is.na <http: is.na="">(m)) >> >> ux <- ux[keep_idx] >> ux_group <- ux_group[keep_idx] >> ux_id <- ux_id[keep_idx] >> sm <- IRanges:::__**selfmatchIntegerPairs(ux___**group, >> ux_id) >> >> keep_idx <- sm == seq_along(sm) >> ux <- ux[keep_idx] >> ux_group <- ux_group[keep_idx] >> ans_skeleton <- PartitioningByEnd(ux_group, NG=length(x)) >> relist(ux, ans_skeleton) >> } >> ) >> >> Then: >> >> > alleles1 <- DNAStringSetList("A", c("C", "A"), c("G", "A", >> "T"), c("T", "G")) >> >> > alleles2 <- DNAStringSetList(c("T", "A", "G"), c("A", "G"), >> "C", c("G", "T")) >> >> > pintersect(alleles1, alleles2) >> DNAStringSetList of length 4 >> [[1]] A >> [[2]] A >> [[3]] A DNAStringSet instance of length 0 >> [[4]] T G >> >> Should take about 20 sec. on your 12-million long vectors. >> >> Then use elementLengths() on this result to get the lengths of the >> intersecting sets. >> >> HTH, >> H. >> >> >> >> On 10/15/2013 04:16 PM, Vince S. Buffalo wrote: >> >> Hi All, >> >> I have two vectors of alleles stored as DNAStringSetLists. >> For each element >> in both lists, I need to find the length of the intersecting >> set. Using >> mapply() and intersect() take too long, as does >> sapply(dna.set.list, >> as.character) (and then using mclapply or lapply to find >> intersect on >> characters). Is there a fast way to do this? I have vectors >> ~12 million >> rows long. >> >> Vince >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages@fhcrc.org <mailto:hpages@fhcrc.org> >> Phone: (206) 667-5791 <tel:%28206%29%20667-5791> >> Fax: (206) 667-1319 <tel:%28206%29%20667-1319> >> >> >> ______________________________**___________________ >> Bioconductor mailing list >> Bioconductor@r-project.org <mailto:bioconductor@r-**project.org<bioconductor@r-project.org> >> > >> https://stat.ethz.ch/mailman/_**_listinfo/bioconductor<http s:="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> >> >> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor<https="" :="" stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> > >> Search the archives: >> http://news.gmane.org/gmane.__**science.biology.informatics.__** >> conductor<http: news.gmane.org="" gmane.__science.biology.informatics="" .__conductor=""> >> <http: news.gmane.org="" gmane.**science.biology.informatics.**="">> conductor<http: news.gmane.org="" gmane.science.biology.informatics.c="" onductor=""> >> > >> >> >> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > -- Vince Buffalo Ross-Ibarra Lab www.rilab.org) Plant Sciences, UC Davis [[alternative HTML version deleted]]
ADD REPLY

Login before adding your answer.

Traffic: 881 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6