Group millions of the same DNA sequences?
5
0
Entering edit mode
Xiaohui Wu ▴ 280
@xiaohui-wu-4141
Last seen 10.2 years ago
Hi all, I have millions like 100M DNA reads each of which is ~150nt, some of them are duplicate. Is there any way to group the same sequences into one and count the number, like unique() function in R, but with the occurrence of read and also more efficient? Also, if I want to cluster these 100M reads based on their similarity, like editor distance or some distance <=2, is there some function or package can be used? Thank you! Xiaohui
• 2.3k views
ADD COMMENT
0
Entering edit mode
@moshe-olshansky-2329
Last seen 10.2 years ago
You can use the aggregate function, i.e. if X is a vector of your sequences, i.e. X[i] is a character containing your i-th sequence (i=1,2,...,100M) then do y <- aggregate(X,list(X),length) then y is a two columns data.frame with column 1 containing the sequence and column 2 containing the count. If this is too slow in R, use sort -u on Unix (like Wei suggested) to get the sequences sorted and unique and then write a simple C program which runs over all your sequences and adds 1 to count[i] where i is this sequence's index in the unique and sorted list (you can use binary search to get that i). --- On Tue, 16/11/10, Xiaohui Wu <wux3 at="" muohio.edu=""> wrote: > From: Xiaohui Wu <wux3 at="" muohio.edu=""> > Subject: [BioC] Group millions of the same DNA sequences? > To: "bioconductor at stat.math.ethz.ch" <bioconductor at="" stat.math.ethz.ch=""> > Received: Tuesday, 16 November, 2010, 9:46 PM > Hi all, > > I have millions like 100M DNA reads each of which is > ~150nt, some of them are duplicate. Is there any way to > group the same sequences into one and count the number, like > unique() function in R, but with the occurrence of read and > also more efficient? > Also, if I want to cluster these 100M? reads based on > their similarity, like editor distance or some distance > <=2, is there some function or package can be used? > Thank you! > > Xiaohui > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 4 months ago
United States
On 11/16/2010 02:46 AM, Xiaohui Wu wrote: > Hi all, > > I have millions like 100M DNA reads each of which is ~150nt, some of them are duplicate. Is there any way to group the same sequences into one and count the number, like unique() function in R, but with the occurrence of read and also more efficient? dna = Biostrings::DNAStringSet(<reads>) ShortRead::tables(dna, Inf)[["top"]] for this; also selectMethod(tables, "DNAStringSet") to see the code if the format (named list) is not to your liking. Martin > Also, if I want to cluster these 100M reads based on their similarity, like editor distance or some distance <=2, is there some function or package can be used? > Thank you! > > Xiaohui > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793
ADD COMMENT
0
Entering edit mode
Wei Shi ★ 3.6k
@wei-shi-2183
Last seen 3 months ago
Australia/Melbourne/Olivia Newton-John …
Dear Xiaohui: The unix command sort (which is also available on Mac) can group your sequences. But I guess you will have to write some code to count the numbers for each distinct sequence. I have written some C code to do the similar thing and I will be happy to share them. But I am not sure if your data format is the same as mine. I am not aware of any R functions which can cluster strings. I image it will be extremely slow if there are such functions. Hope this helps. Cheers, Wei On Nov 16, 2010, at 9:46 PM, Xiaohui Wu wrote: > Hi all, > > I have millions like 100M DNA reads each of which is ~150nt, some of them are duplicate. Is there any way to group the same sequences into one and count the number, like unique() function in R, but with the occurrence of read and also more efficient? > Also, if I want to cluster these 100M reads based on their similarity, like editor distance or some distance <=2, is there some function or package can be used? > Thank you! > > Xiaohui > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}}
ADD COMMENT
0
Entering edit mode
Dear Xiaohul Might be better to use CAP3 (contig assembly program) rather than R (or anything else) which is also freeware, unless you have a lot of SSR's. If you have a lot of SSR's nothing will help. Cheers Bob ________________________________________ From: bioconductor-bounces@stat.math.ethz.ch [bioconductor- bounces@stat.math.ethz.ch] On Behalf Of Wei Shi [shi@wehi.edu.au] Sent: Tuesday, November 16, 2010 4:54 PM To: Xiaohui Wu Cc: bioconductor at stat.math.ethz.ch Subject: Re: [BioC] Group millions of the same DNA sequences? Dear Xiaohui: The unix command sort (which is also available on Mac) can group your sequences. But I guess you will have to write some code to count the numbers for each distinct sequence. I have written some C code to do the similar thing and I will be happy to share them. But I am not sure if your data format is the same as mine. I am not aware of any R functions which can cluster strings. I image it will be extremely slow if there are such functions. Hope this helps. Cheers, Wei On Nov 16, 2010, at 9:46 PM, Xiaohui Wu wrote: > Hi all, > > I have millions like 100M DNA reads each of which is ~150nt, some of them are duplicate. Is there any way to group the same sequences into one and count the number, like unique() function in R, but with the occurrence of read and also more efficient? > Also, if I want to cluster these 100M reads based on their similarity, like editor distance or some distance <=2, is there some function or package can be used? > Thank you! > > Xiaohui > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:9}}
ADD REPLY
0
Entering edit mode
Sorry if I'm missing the point, but ... For the easier question with edit distance zero, see ?rle, as in > reads <- c("ACG", "TTT", "ACG") > rle(sort(reads)) Run Length Encoding lengths: int [1:2] 2 1 values : chr [1:2] "ACG" "TTT" Beware that, without the sort, rle won't be productive. A cheap alternative, from IRanges, that doesn't yield quite as much information: > high2low(reads) [1] NA NA 1 I think of high2low as a gross abstraction of PDict, from Biostrings, as in > D <- PDict(c("ACG", "TTT", "ACG")) > D at dups0@high2low [1] NA NA 1 For the harder question, you might explore this: PDict(your_reads, max.mismatch=2) -Harris On Nov 16, 2010, at 4:54 PM, Wei Shi wrote: > Dear Xiaohui: > > The unix command sort (which is also available on Mac) can group > your sequences. But I guess you will have to write some code to > count the numbers for each distinct sequence. I have written some C > code to do the similar thing and I will be happy to share them. But > I am not sure if your data format is the same as mine. > > I am not aware of any R functions which can cluster strings. I > image it will be extremely slow if there are such functions. > > Hope this helps. > > Cheers, > Wei > > On Nov 16, 2010, at 9:46 PM, Xiaohui Wu wrote: > >> Hi all, >> >> I have millions like 100M DNA reads each of which is ~150nt, some >> of them are duplicate. Is there any way to group the same >> sequences into one and count the number, like unique() function in >> R, but with the occurrence of read and also more efficient? >> Also, if I want to cluster these 100M reads based on their >> similarity, like editor distance or some distance <=2, is there >> some function or package can be used? >> Thank you! >> >> Xiaohui >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/ >> gmane.science.biology.informatics.conductor > > > ______________________________________________________________________ > The information in this email is confidential and intend... > {{dropped:6}} > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/ > gmane.science.biology.informatics.conductor
ADD REPLY
0
Entering edit mode
Xiaohui Wu ▴ 280
@xiaohui-wu-4141
Last seen 10.2 years ago
Hi Wei, Thank you for your reply! I'll be very appreciated if you could send me your C code for reference. Xiaohui ------------------------------------------------------------- ????Wei Shi ?????2010-11-17 05:55:04 ????Wu, Xiaohui Ms. ???bioconductor at stat.math.ethz.ch ???Re: [BioC] Group millions of the same DNA sequences? Dear Xiaohui: The unix command sort (which is also available on Mac) can group your sequences. But I guess you will have to write some code to count the numbers for each distinct sequence. I have written some C code to do the similar thing and I will be happy to share them. But I am not sure if your data format is the same as mine. I am not aware of any R functions which can cluster strings. I image it will be extremely slow if there are such functions. Hope this helps. Cheers, Wei On Nov 16, 2010, at 9:46 PM, Xiaohui Wu wrote: > Hi all, > > I have millions like 100M DNA reads each of which is ~150nt, some of them are duplicate. Is there any way to group the same sequences into one and count the number, like unique() function in R, but with the occurrence of read and also more efficient? > Also, if I want to cluster these 100M reads based on their similarity, like editor distance or some distance <=2, is there some function or package can be used? > Thank you! > > Xiaohui > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ______________________________________________________________________ The information in this email is confidential and intended solely for the addressee. You must not disclose, forward, print or use it without the permission of the sender. ______________________________________________________________________ .
ADD COMMENT
0
Entering edit mode
sort -u | uniq -c will do the counting for you. but as these are sequence data, you should probably be concerned about indels (gaps) as well as mismatches, so the "edit distance" is not a great clustering metric. You could, of course, calculate all-vs-all Needleman-Wunsch pairwise alignments, from which you could count edit operations (counting any indel as one operation, regardless of length), and cluster by some edit operation threshold, but that would be computationally expensive. what is it you're trying to accomplish with these reads? Assembly? If so, just give your reads to an assembler (MIRA, Velvet, etc.) and let the assembler do this clustering for you. -Aaron 2010/11/16 Xiaohui Wu <wux3@muohio.edu> > Hi Wei, > > Thank you for your reply! I'll be very appreciated if you could send me > your C code for reference. > > Xiaohui > > ------------------------------------------------------------- > ·¢¼þÈË£ºWei Shi > ·¢ËÍÈÕÆÚ£º2010-11-17 05:55:04 > ÊÕ¼þÈË£ºWu, Xiaohui Ms. > ³­ËÍ£ºbioconductor@stat.math.ethz.ch > Ö÷Ì⣺Re: [BioC] Group millions of the same DNA sequences? > > Dear Xiaohui: > > The unix command sort (which is also available on Mac) can group > your sequences. But I guess you will have to write some code to count the > numbers for each distinct sequence. I have written some C code to do the > similar thing and I will be happy to share them. But I am not sure if your > data format is the same as mine. > > I am not aware of any R functions which can cluster strings. I image > it will be extremely slow if there are such functions. > > Hope this helps. > > Cheers, > Wei > > On Nov 16, 2010, at 9:46 PM, Xiaohui Wu wrote: > > > Hi all, > > > > I have millions like 100M DNA reads each of which is ~150nt, some of them > are duplicate. Is there any way to group the same sequences into one and > count the number, like unique() function in R, but with the occurrence of > read and also more efficient? > > Also, if I want to cluster these 100M reads based on their similarity, > like editor distance or some distance <=2, is there some function or package > can be used? > > Thank you! > > > > Xiaohui > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@stat.math.ethz.ch > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > ______________________________________________________________________ > The information in this email is confidential and intended solely for the > addressee. > You must not disclose, forward, print or use it without the permission of > the sender. > ______________________________________________________________________ > . > _______________________________________________ > Bioconductor mailing list > Bioconductor@stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
On Nov 17, 2010, at 9:58 AM, Aaron Mackey wrote: > sort -u | uniq -c will do the counting for you. Actually, sort reads | uniq -c > what is it you're trying to accomplish with these reads?... > -Aaron > 2010/11/16 Xiaohui Wu <wux3 at="" muohio.edu=""> > >> Hi Wei, >> >> Thank you for your reply! I'll be very appreciated if you could >> send me >> your C code for reference. >> >> Xiaohui >> >> ------------------------------------------------------------- >> ????????Wei Shi >> ??????????2010-11-17 05:55:04 >> ????????Wu, Xiaohui Ms. >> ? ????bioconductor at stat.math.ethz.ch >> ??????Re: [BioC] Group millions of the same DNA sequences? >> >> Dear Xiaohui: >> >> The unix command sort (which is also available on Mac) can >> group >> your sequences. But I guess you will have to write some code to >> count the >> numbers for each distinct sequence. I have written some C code to >> do the >> similar thing and I will be happy to share them. But I am not sure >> if your >> data format is the same as mine. >> >> I am not aware of any R functions which can cluster >> strings. I image >> it will be extremely slow if there are such functions. >> >> Hope this helps. >> >> Cheers, >> Wei >> >> On Nov 16, 2010, at 9:46 PM, Xiaohui Wu wrote: >> >>> Hi all, >>> >>> I have millions like 100M DNA reads each of which is ~150nt, some >>> of them >> are duplicate. Is there any way to group the same sequences into >> one and >> count the number, like unique() function in R, but with the >> occurrence of >> read and also more efficient? >>> Also, if I want to cluster these 100M reads based on their >>> similarity, >> like editor distance or some distance <=2, is there some function >> or package >> can be used? >>> Thank you! >>> >>> Xiaohui >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> _____________________________________________________________________ >> _ >> The information in this email is confidential and intended solely >> for the >> addressee. >> You must not disclose, forward, print or use it without the >> permission of >> the sender. >> _____________________________________________________________________ >> _ >> . >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/ > gmane.science.biology.informatics.conductor
ADD REPLY
0
Entering edit mode
Xiaohui Wu ▴ 280
@xiaohui-wu-4141
Last seen 10.2 years ago
Thank you Aaron! Till now, sort and uniq may be the easiest way to do this. For clustering, I don't think assembler is suitable for my case. I want to cluster similar reads to get different clusters, each cluster has some reads, and do further analysis. Xiaohui ------------------------------------------------------------- ????Aaron Mackey ?????2010-11-17 22:58:57 ????Wu, Xiaohui Ms. ???bioconductor at stat.math.ethz.ch ???Re: [BioC] Group millions of the same DNA sequences? sort -u | uniq -c will do the counting for you. but as these are sequence data, you should probably be concerned about indels (gaps) as well as mismatches, so the "edit distance" is not a great clustering metric. You could, of course, calculate all-vs-all Needleman-Wunsch pairwise alignments, from which you could count edit operations (counting any indel as one operation, regardless of length), and cluster by some edit operation threshold, but that would be computationally expensive. what is it you're trying to accomplish with these reads? Assembly? If so, just give your reads to an assembler (MIRA, Velvet, etc.) and let the assembler do this clustering for you. -Aaron 2010/11/16 Xiaohui Wu <wux3 at="" muohio.edu<mailto:wux3="" at="" muohio.edu="">> Hi Wei, Thank you for your reply! I'll be very appreciated if you could send me your C code for reference. Xiaohui ------------------------------------------------------------- ????Wei Shi ?????2010-11-17 05:55:04 ????Wu, Xiaohui Ms. ???bioconductor at stat.math.ethz.ch<mailto:bioconductor at="" stat.math.ethz.ch=""> ???Re: [BioC] Group millions of the same DNA sequences? Dear Xiaohui: The unix command sort (which is also available on Mac) can group your sequences. But I guess you will have to write some code to count the numbers for each distinct sequence. I have written some C code to do the similar thing and I will be happy to share them. But I am not sure if your data format is the same as mine. I am not aware of any R functions which can cluster strings. I image it will be extremely slow if there are such functions. Hope this helps. Cheers, Wei On Nov 16, 2010, at 9:46 PM, Xiaohui Wu wrote: > Hi all, > > I have millions like 100M DNA reads each of which is ~150nt, some of them are duplicate. Is there any way to group the same sequences into one and count the number, like unique() function in R, but with the occurrence of read and also more efficient? > Also, if I want to cluster these 100M reads based on their similarity, like editor distance or some distance <=2, is there some function or package can be used? > Thank you! > > Xiaohui > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch<mailto:bioconductor at="" stat.math.ethz.ch=""> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ______________________________________________________________________ The information in this email is confidential and intended solely for the addressee. You must not disclose, forward, print or use it without the permission of the sender. ______________________________________________________________________ . _______________________________________________ Bioconductor mailing list Bioconductor at stat.math.ethz.ch<mailto:bioconductor at="" stat.math.ethz.ch=""> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENT
0
Entering edit mode
On Thu, Nov 18, 2010 at 09:21:07AM +0800, Xiaohui Wu wrote: > Thank you Aaron! Till now, sort and uniq may be the easiest way to do this. > For clustering, I don't think assembler is suitable for my case. I want to > cluster similar reads to get different clusters, each cluster has some reads, > and do further analysis. about the clustering, an approach like Fast approximate hierarchical clustering using similarity heuristics Meelis Kull and Jaak Vilo could be worthwhile. If the similarities obey the metric inequality, it should not be necessary to do all-against-all comparisons. best, Stijn -- Stijn van Dongen >8< -o) O< forename pronunciation: [Stan] EMBL-EBI /\\ Tel: +44-(0)1223-492675 Hinxton, Cambridge, CB10 1SD, UK _\_/ http://micans.org/stijn
ADD REPLY
0
Entering edit mode
you could try CD-HIT. http://bioinformatics.ljcrf.edu/cd-hi/ meant to work with huge number of sequences. regards KM On Thu, Nov 18, 2010 at 2:53 PM, Stijn van Dongen <stijn at="" ebi.ac.uk=""> wrote: > > On Thu, Nov 18, 2010 at 09:21:07AM +0800, Xiaohui Wu wrote: >> Thank you Aaron! ?Till now, sort and uniq may be the easiest way to do this. >> For clustering, I don't think assembler is suitable for my case. I want to >> cluster similar reads to get different clusters, each cluster has some reads, >> and do further analysis. > > about the clustering, an approach like > > ? Fast approximate hierarchical clustering using similarity heuristics > ? Meelis Kull and Jaak Vilo > > could be worthwhile. If the similarities obey the metric inequality, > it should not be necessary to do all-against-all comparisons. > > best, > Stijn > > -- > Stijn van Dongen ? ? ? ? >8< ? ? ? ?-o) ? O< ?forename pronunciation: [Stan] > EMBL-EBI ? ? ? ? ? ? ? ? ? ? ? ? ? ?/\\ ? Tel: +44-(0)1223-492675 > Hinxton, Cambridge, CB10 1SD, UK ? _\_/ ? http://micans.org/stijn > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD REPLY

Login before adding your answer.

Traffic: 766 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6