Rsubread alingner

0

Entering edit mode

Wei Shi ★ 3.6k

@wei-shi-2183

Last seen 2 days ago

Australia/Melbourne

Dear João: The default setting of Rsubread alignment requires at least 3 subreads (16bp substrings extracted from the read) to be mapped to the same location to report a hit. Subreads are extracted at a gap of 2bp from the read sequence, i.e. neighboring subreads are 2bp apart. 3 passes of mapping will be performed for mapping three sets of subreads respectively, which are extracted from different start positions of the read (1st base, 2nd base and 3rd base). Therefore, for your trimmed reads which are 20bp long, there will be only up to 2 subreads extracted from each read and they were used for mapping. However, because the consensus cutoff (minimal number of consensus subreads needed to report a hit) is 3 by default (which I believe is the value used in your analysis), no hits can be reported because of insufficient number of extracted subreads. DNA sequences of 20 bases are highly repetitive in human or mouse genome. But I do not know if this is the case for yeast genome. However with shorter read, you are more likely to get ambiguous mapping locations. I would suggest you to use the full read sequences for mapping to take advantage of all the base information in the reads. When mapping 36bp reads, Rsubread can extract 7 subreads from each read. This gives a lot more power for finding mapping locations in a sensitive and accurate way. You can use a consensus threshold of 2 (TH1=2) to call mapping locations, which we found works quite well. Hope this helps. Cheers, Wei On Apr 21, 2011, at 12:31 AM, João Moura wrote: > Dear Wei Shi, > > Indeed I can run the example of the vignette but not my own trimmed reads. With bowtie I get more than 80% but with RSubread i get 0%. I think I got why. the thing is I'm trimming the reads in R and trying to realign them afterwards. But when I'm triming, I trimmed the quality and DNA strings. So, a line like this: > > @SRR014338.24 GA-EAS46_1_209DH:8:1:752:779 length=36 > > > Will contain "length=36" all the time, even if my reads and length 20. Do you think this might be a problem? Because if I run with untrimmed reads I get some reasonable results. > > I sent you a untrimmed version and a trimmed version. Maybe you can try... > > > Thanks! > > > On Wed, Apr 20, 2011 at 7:02 AM, Wei Shi <shi@wehi.edu.au> wrote: > Dear Joao: > > Can you run the example enclosed with Rsubread package? > > Cheers, > Wei > > On Apr 20, 2011, at 3:23 AM, João Moura wrote: > >> Dear Wei Shi, >> >> I'm trying out the Rsubread and I'm having strange results. That is I firstly built the reference genome, use a fasta file like this: >> >> [joao@goedel genomes]$ grep ">" yeast.fna >> >Scchr01 1 - 230208 >> >Scchr02 1 - 813178 >> >Scchr03 1 - 316617 >> >Scchr04 1 - 1531917 >> >Scchr05 1 - 576869 >> >Scchr06 1 - 270148 >> >Scchr07 1 - 1090946 >> >Scchr08 1 - 562643 >> >Scchr09 1 - 439885 >> >Scchr10 1 - 745667 >> >Scchr11 1 - 666454 >> >Scchr12 1 - 1078175 >> >Scchr13 1 - 924429 >> >Scchr14 1 - 784333 >> >Scchr15 1 - 1091289 >> >Scchr16 1 - 948062 >> >Scmito 1 - 85779 >> >> In R I ran the following code: >> >> ref <- file.path("/home/joao/bowtie-0.12.7/genomes/yeast.fna") >> path<-file.path("/home/joao/bowtie-0.12.7/Rsubread") >> buildindex(basename = file.path(path, "reference_index"), reference = ref) >> reads <- dir("/home/joao/bowtie-0.12.7/reads/trimmed4", patt="fastq$", full=TRUE) >> align(index = file.path(path, "reference_index"), readfile1 = reads[[1]],output_file = file.path(path, "alignResults.SAM")) >> >> 11152683 reads were processed in 121.8 seconds. >> Percentage of successfully mapped reads is 0.00%. >> >> >> >> Completed successfully. >> >> >> If I do the same with bowtie I get more than 70%..What am I doing wrong here? Thanks a lot! >> >> -- >> João Moura > > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:22}}

Alignment Yeast Rsubread Alignment Yeast Rsubread • 2.4k views

ADD COMMENT • link 14.0 years ago Wei Shi ★ 3.6k

0

Entering edit mode

João Moura ▴ 160

@joao-moura-4505

Last seen 10.6 years ago

Dear Wei Shi, First of all, thanks a lot for you help. Maybe I was doing it wrong also with bowtie, but the truth is I'm a master student, thus quite armature in all of this. Well, the reason why I trimmed my reads in first place was that sequence content across all bases is biased (which i think is a common problem in RNAseq). As you can see in the picture I send you, the first 14/15 position are biased (which I think is due to sequencing adapters?) So, the reason I followed was to trim the reads in order to be able to map the 22 bases that otherwise would be unmapped because of the adapters or other artifacts in the first 14/15 of the reads of the reads. But, since you say that 16bp is the minimum requirement for Rsubread, it seems that this problem will be solved using your software, right? Do you know if bowtie also uses this approach of subseting the reads? Because if I trim the reads I get an improvement of ~15% on the alignments. Once again, thank you very much for your help. On Thu, Apr 21, 2011 at 2:21 AM, Wei Shi <shi@wehi.edu.au> wrote: > Dear João: > > The default setting of Rsubread alignment requires at least 3 subreads > (16bp substrings extracted from the read) to be mapped to the same location > to report a hit. Subreads are extracted at a gap of 2bp from the read > sequence, i.e. neighboring subreads are 2bp apart. 3 passes of mapping will > be performed for mapping three sets of subreads respectively, which are > extracted from different start positions of the read (1st base, 2nd base and > 3rd base). > > Therefore, for your trimmed reads which are 20bp long, there will be only > up to 2 subreads extracted from each read and they were used for mapping. > However, because the consensus cutoff (minimal number of consensus subreads > needed to report a hit) is 3 by default (which I believe is the value used > in your analysis), no hits can be reported because of insufficient number of > extracted subreads. > > DNA sequences of 20 bases are highly repetitive in human or mouse genome. > But I do not know if this is the case for yeast genome. However with shorter > read, you are more likely to get ambiguous mapping locations. > > I would suggest you to use the full read sequences for mapping to take > advantage of all the base information in the reads. When mapping 36bp reads, > Rsubread can extract 7 subreads from each read. This gives a lot more power > for finding mapping locations in a sensitive and accurate way. You can use a > consensus threshold of 2 (TH1=2) to call mapping locations, which we found > works quite well. > > Hope this helps. > > Cheers, > Wei > > On Apr 21, 2011, at 12:31 AM, João Moura wrote: > > Dear Wei Shi, > > Indeed I can run the example of the vignette but not my own trimmed reads. > With bowtie I get more than 80% but with RSubread i get 0%. I think I got > why. the thing is I'm trimming the reads in R and trying to realign them > afterwards. But when I'm triming, I trimmed the quality and DNA strings. So, > a line like this: > > @SRR014338.24 GA-EAS46_1_209DH:8:1:752:779 length=36 > > > Will contain "length=36" all the time, even if my reads and length 20. Do > you think this might be a problem? Because if I run with untrimmed reads I > get some reasonable results. > > I sent you a untrimmed version and a trimmed version. Maybe you can try... > > > Thanks! > > > On Wed, Apr 20, 2011 at 7:02 AM, Wei Shi <shi@wehi.edu.au> wrote: > >> Dear Joao: >> >> Can you run the example enclosed with Rsubread package? >> >> Cheers, >> Wei >> >> On Apr 20, 2011, at 3:23 AM, João Moura wrote: >> >> Dear Wei Shi, >> >> I'm trying out the Rsubread and I'm having strange results. That is I >> firstly built the reference genome, use a fasta file like this: >> >> *[joao@goedel genomes]$ grep ">" yeast.fna * >> *>Scchr01 1 - 230208* >> *>Scchr02 1 - 813178* >> *>Scchr03 1 - 316617* >> *>Scchr04 1 - 1531917* >> *>Scchr05 1 - 576869* >> *>Scchr06 1 - 270148* >> *>Scchr07 1 - 1090946* >> *>Scchr08 1 - 562643* >> *>Scchr09 1 - 439885* >> *>Scchr10 1 - 745667* >> *>Scchr11 1 - 666454* >> *>Scchr12 1 - 1078175* >> *>Scchr13 1 - 924429* >> *>Scchr14 1 - 784333* >> *>Scchr15 1 - 1091289* >> *>Scchr16 1 - 948062* >> *>Scmito 1 - 85779* >> * >> * >> >> In R I ran the following code: >> * >> * >> >> *ref <- file.path("/home/joao/bowtie-0.12.7/genomes/yeast.fna")* >> *path<-file.path("/home/joao/bowtie-0.12.7/Rsubread")* >> *buildindex(basename = file.path(path, "reference_index"), reference = >> ref)* >> *reads <- dir("/home/joao/bowtie-0.12.7/reads/trimmed4", patt="fastq$", >> full=TRUE)* >> *align(index = file.path(path, "reference_index"), readfile1 = >> reads[[1]],output_file = file.path(path, "alignResults.SAM"))* >> * >> >> * >> * >> 11152683 reads were processed in 121.8 seconds. >> * >> * >> Percentage of successfully mapped reads is 0.00%. >> * >> * >> >> * >> * >> >> * >> * >> >> * >> * >> Completed successfully. >> * >> >> * >> >> * >> >> If I do the same with bowtie I get more than 70%..What am I doing wrong >> here? Thanks a lot! >> >> -- >> João Moura >> >> >> >> ______________________________________________________________________ >> The information in this email is confidential and intended solely for the >> addressee. >> You must not disclose, forward, print or use it without the permission of >> the sender. >> ______________________________________________________________________ >> > > > > -- > João Moura > <thousand.fastq><untri_thousand.fastq> > > > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:15}}

ADD COMMENT • link 14.0 years ago João Moura ▴ 160

0

Entering edit mode

Joao Reading the composition bias you see in the beginning of the reads, read 1. Hansen, K.D., Brenner, S.E. & Dudoit, S. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res 38, e131 (2010). The reason why you map more reads by trimming them is the standard reason: you have an error rate in your reads and depending on your alignment strategy you will always align more reads by trimming them (until you get to a phase shift where very few reads align). However, by trimming the reads you will loose genomic regions that suddenly become non-unique while the additional reads you succeed in aligning most likely will be randomly distributed across the expressed genes (meaning most of the reads will come from the highly expressed genes). In my opinion, it is not clear that trimming is better (however, sometimes the first couple of bases have high error rates as well in which case you might consider trimming a few bases). Generally, you don't align more reads because you chomp off the bit with a composition bias, only because of error rates. All of this is assuming you are doing RNA-seq using random hexamer priming, which is the standard library prep for Illumina RNA-seq. Kasper On Wed, Apr 20, 2011 at 8:44 PM, Jo?o Moura <palerma at="" gmail.com=""> wrote: > Dear Wei Shi, > > First of all, thanks a lot for you help. Maybe I was doing it wrong also > with bowtie, but the truth is I'm a master student, thus quite armature in > all of this. > > Well, the reason why I trimmed my reads in first place was that sequence > content across all bases is biased (which i think is a common problem in > RNAseq). As you can see in the picture I send you, the first 14/15 position > are biased (which I think is due to sequencing adapters?) So, the reason I > followed was to trim the reads in order to be able to map the 22 bases that > otherwise would be unmapped because of the adapters or other artifacts in > the first 14/15 of the reads of the reads. But, since you say that 16bp is > the minimum requirement for Rsubread, it seems that this problem will be > solved using your software, right? > > Do you know if bowtie also uses this approach of subseting the reads? > Because if I trim the reads I get an improvement of ~15% on the alignments. > > Once again, thank you very much for your help. > > On Thu, Apr 21, 2011 at 2:21 AM, Wei Shi <shi at="" wehi.edu.au=""> wrote: > >> Dear Jo?o: >> >> The default setting of Rsubread alignment requires at least 3 subreads >> (16bp substrings extracted from the read) to be mapped to the same location >> to report a hit. Subreads are extracted at a gap of 2bp from the read >> sequence, i.e. neighboring subreads are 2bp apart. 3 passes of mapping will >> be performed for mapping three sets of subreads respectively, which are >> extracted from different start positions of the read (1st base, 2nd base and >> 3rd base). >> >> Therefore, for your trimmed reads which are 20bp long, there will be only >> up to 2 subreads extracted from each read and they were used for mapping. >> However, because the consensus cutoff (minimal number of consensus subreads >> needed to report a hit) is 3 by default (which I believe is the value used >> in your analysis), no hits can be reported because of insufficient number of >> extracted subreads. >> >> DNA sequences of 20 bases are highly repetitive in human or mouse genome. >> But I do not know if this is the case for yeast genome. However with shorter >> read, you are more likely to get ambiguous mapping locations. >> >> I would suggest you to use the full read sequences for mapping to take >> advantage of all the base information in the reads. When mapping 36bp reads, >> Rsubread can extract 7 subreads from each read. This gives a lot more power >> for finding mapping locations in a sensitive and accurate way. You can use a >> consensus threshold of 2 (TH1=2) to call mapping locations, which we found >> works quite well. >> >> Hope this helps. >> >> Cheers, >> Wei >> >> On Apr 21, 2011, at 12:31 AM, Jo?o Moura wrote: >> >> Dear Wei Shi, >> >> Indeed I can run the example of the vignette but not my own trimmed reads. >> With bowtie I get more than 80% but with RSubread i get 0%. I think I got >> why. the thing is I'm trimming the reads in R and trying to realign them >> afterwards. But when I'm triming, I trimmed the quality and DNA strings. So, >> a line like this: >> >> @SRR014338.24 GA-EAS46_1_209DH:8:1:752:779 length=36 >> >> >> Will contain "length=36" all the time, even if my reads and length 20. Do >> you think this might be a problem? Because if I run with untrimmed reads I >> get some reasonable results. >> >> I sent you a untrimmed version and a trimmed version. Maybe you can try... >> >> >> Thanks! >> >> >> On Wed, Apr 20, 2011 at 7:02 AM, Wei Shi <shi at="" wehi.edu.au=""> wrote: >> >>> Dear Joao: >>> >>> Can you run the example enclosed with Rsubread package? >>> >>> Cheers, >>> Wei >>> >>> On Apr 20, 2011, at 3:23 AM, Jo?o Moura wrote: >>> >>> Dear Wei Shi, >>> >>> I'm trying out the Rsubread and I'm having strange results. That is I >>> firstly built the reference genome, use a fasta file like this: >>> >>> *[joao at goedel genomes]$ grep ">" yeast.fna * >>> *>Scchr01 1 - 230208* >>> *>Scchr02 1 - 813178* >>> *>Scchr03 1 - 316617* >>> *>Scchr04 1 - 1531917* >>> *>Scchr05 1 - 576869* >>> *>Scchr06 1 - 270148* >>> *>Scchr07 1 - 1090946* >>> *>Scchr08 1 - 562643* >>> *>Scchr09 1 - 439885* >>> *>Scchr10 1 - 745667* >>> *>Scchr11 1 - 666454* >>> *>Scchr12 1 - 1078175* >>> *>Scchr13 1 - 924429* >>> *>Scchr14 1 - 784333* >>> *>Scchr15 1 - 1091289* >>> *>Scchr16 1 - 948062* >>> *>Scmito 1 - 85779* >>> * >>> * >>> >>> In R I ran the following code: >>> * >>> * >>> >>> *ref <- file.path("/home/joao/bowtie-0.12.7/genomes/yeast.fna")* >>> *path<-file.path("/home/joao/bowtie-0.12.7/Rsubread")* >>> *buildindex(basename = file.path(path, "reference_index"), reference = >>> ref)* >>> *reads <- dir("/home/joao/bowtie-0.12.7/reads/trimmed4", patt="fastq$", >>> full=TRUE)* >>> *align(index = file.path(path, "reference_index"), readfile1 = >>> reads[[1]],output_file = file.path(path, "alignResults.SAM"))* >>> * >>> >>> * >>> * >>> 11152683 reads were processed in 121.8 seconds. >>> * >>> * >>> Percentage of successfully mapped reads is 0.00%. >>> * >>> * >>> >>> * >>> ?* >>> >>> * >>> * >>> >>> * >>> * >>> Completed successfully. >>> * >>> >>> * >>> >>> * >>> >>> If I do the same with bowtie I get more than 70%..What am I doing wrong >>> here? Thanks a lot! >>> >>> -- >>> Jo?o Moura >>> >>> >>> >>> ______________________________________________________________________ >>> The information in this email is confidential and intended solely for the >>> addressee. >>> You must not disclose, forward, print or use it without the permission of >>> the sender. >>> ______________________________________________________________________ >>> >> >> >> >> -- >> Jo?o Moura >> ?<thousand.fastq><untri_thousand.fastq> >> >> >> >> ______________________________________________________________________ >> The information in this email is confidential and inte...{{dropped:15}} > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD REPLY • link 14.0 years ago Kasper Daniel Hansen ★ 6.5k

0

Entering edit mode

Dear João: You got to be certain whether the first 14 bases of your reads are adapters or not before you can trim them. You can do this by visual inspection on a small set of reads. But a better way will be to extract all first 14 bases from each read and and then look at the frequency of these 14mer sequences. I wrote a C program before to retrieve barcode sequences for each read, which might be useful for your data as well. This program has not been included in Rsubread package yet, but I will be happy to send you a copy of it if you want to use it. Cheers, Wei On Apr 21, 2011, at 10:44 AM, João Moura wrote: > Dear Wei Shi, > > First of all, thanks a lot for you help. Maybe I was doing it wrong also with bowtie, but the truth is I'm a master student, thus quite armature in all of this. > > Well, the reason why I trimmed my reads in first place was that sequence content across all bases is biased (which i think is a common problem in RNAseq). As you can see in the picture I send you, the first 14/15 position are biased (which I think is due to sequencing adapters?) So, the reason I followed was to trim the reads in order to be able to map the 22 bases that otherwise would be unmapped because of the adapters or other artifacts in the first 14/15 of the reads of the reads. But, since you say that 16bp is the minimum requirement for Rsubread, it seems that this problem will be solved using your software, right? > > Do you know if bowtie also uses this approach of subseting the reads? Because if I trim the reads I get an improvement of ~15% on the alignments. > > Once again, thank you very much for your help. > > On Thu, Apr 21, 2011 at 2:21 AM, Wei Shi <shi@wehi.edu.au> wrote: > Dear João: > > The default setting of Rsubread alignment requires at least 3 subreads (16bp substrings extracted from the read) to be mapped to the same location to report a hit. Subreads are extracted at a gap of 2bp from the read sequence, i.e. neighboring subreads are 2bp apart. 3 passes of mapping will be performed for mapping three sets of subreads respectively, which are extracted from different start positions of the read (1st base, 2nd base and 3rd base). > > Therefore, for your trimmed reads which are 20bp long, there will be only up to 2 subreads extracted from each read and they were used for mapping. However, because the consensus cutoff (minimal number of consensus subreads needed to report a hit) is 3 by default (which I believe is the value used in your analysis), no hits can be reported because of insufficient number of extracted subreads. > > DNA sequences of 20 bases are highly repetitive in human or mouse genome. But I do not know if this is the case for yeast genome. However with shorter read, you are more likely to get ambiguous mapping locations. > > I would suggest you to use the full read sequences for mapping to take advantage of all the base information in the reads. When mapping 36bp reads, Rsubread can extract 7 subreads from each read. This gives a lot more power for finding mapping locations in a sensitive and accurate way. You can use a consensus threshold of 2 (TH1=2) to call mapping locations, which we found works quite well. > > Hope this helps. > > Cheers, > Wei > > On Apr 21, 2011, at 12:31 AM, João Moura wrote: > >> Dear Wei Shi, >> >> Indeed I can run the example of the vignette but not my own trimmed reads. With bowtie I get more than 80% but with RSubread i get 0%. I think I got why. the thing is I'm trimming the reads in R and trying to realign them afterwards. But when I'm triming, I trimmed the quality and DNA strings. So, a line like this: >> >> @SRR014338.24 GA-EAS46_1_209DH:8:1:752:779 length=36 >> >> >> Will contain "length=36" all the time, even if my reads and length 20. Do you think this might be a problem? Because if I run with untrimmed reads I get some reasonable results. >> >> I sent you a untrimmed version and a trimmed version. Maybe you can try... >> >> >> Thanks! >> >> >> On Wed, Apr 20, 2011 at 7:02 AM, Wei Shi <shi@wehi.edu.au> wrote: >> Dear Joao: >> >> Can you run the example enclosed with Rsubread package? >> >> Cheers, >> Wei >> >> On Apr 20, 2011, at 3:23 AM, João Moura wrote: >> >>> Dear Wei Shi, >>> >>> I'm trying out the Rsubread and I'm having strange results. That is I firstly built the reference genome, use a fasta file like this: >>> >>> [joao@goedel genomes]$ grep ">" yeast.fna >>> >Scchr01 1 - 230208 >>> >Scchr02 1 - 813178 >>> >Scchr03 1 - 316617 >>> >Scchr04 1 - 1531917 >>> >Scchr05 1 - 576869 >>> >Scchr06 1 - 270148 >>> >Scchr07 1 - 1090946 >>> >Scchr08 1 - 562643 >>> >Scchr09 1 - 439885 >>> >Scchr10 1 - 745667 >>> >Scchr11 1 - 666454 >>> >Scchr12 1 - 1078175 >>> >Scchr13 1 - 924429 >>> >Scchr14 1 - 784333 >>> >Scchr15 1 - 1091289 >>> >Scchr16 1 - 948062 >>> >Scmito 1 - 85779 >>> >>> In R I ran the following code: >>> >>> ref <- file.path("/home/joao/bowtie-0.12.7/genomes/yeast.fna") >>> path<-file.path("/home/joao/bowtie-0.12.7/Rsubread") >>> buildindex(basename = file.path(path, "reference_index"), reference = ref) >>> reads <- dir("/home/joao/bowtie-0.12.7/reads/trimmed4", patt="fastq$", full=TRUE) >>> align(index = file.path(path, "reference_index"), readfile1 = reads[[1]],output_file = file.path(path, "alignResults.SAM")) >>> >>> 11152683 reads were processed in 121.8 seconds. >>> Percentage of successfully mapped reads is 0.00%. >>> >>> >>> >>> Completed successfully. >>> >>> >>> If I do the same with bowtie I get more than 70%..What am I doing wrong here? Thanks a lot! >>> >>> -- >>> João Moura >> >> >> ______________________________________________________________________ >> The information in this email is confidential and intended solely for the addressee. >> You must not disclose, forward, print or use it without the permission of the sender. >> ______________________________________________________________________ >> >> >> >> -- >> João Moura >> <thousand.fastq><untri_thousand.fastq> > > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:21}}

ADD REPLY • link 14.0 years ago Wei Shi ★ 3.6k

0

Entering edit mode

Hello, Thank you all for your advices, I'll have that in count from now on. Once again, thank you very much. Best regards, On Thu, Apr 21, 2011 at 6:37 AM, Wei Shi <shi@wehi.edu.au> wrote: > Dear João: > > You got to be certain whether the first 14 bases of your reads are adapters > or not before you can trim them. You can do this by visual inspection on a > small set of reads. But a better way will be to extract all first 14 bases > from each read and and then look at the frequency of these 14mer sequences. > I wrote a C program before to retrieve barcode sequences for each read, > which might be useful for your data as well. This program has not been > included in Rsubread package yet, but I will be happy to send you a copy of > it if you want to use it. > > Cheers, > Wei > > On Apr 21, 2011, at 10:44 AM, João Moura wrote: > > Dear Wei Shi, > > First of all, thanks a lot for you help. Maybe I was doing it wrong also > with bowtie, but the truth is I'm a master student, thus quite armature in > all of this. > > Well, the reason why I trimmed my reads in first place was that sequence > content across all bases is biased (which i think is a common problem in > RNAseq). As you can see in the picture I send you, the first 14/15 position > are biased (which I think is due to sequencing adapters?) So, the reason I > followed was to trim the reads in order to be able to map the 22 bases that > otherwise would be unmapped because of the adapters or other artifacts in > the first 14/15 of the reads of the reads. But, since you say that 16bp is > the minimum requirement for Rsubread, it seems that this problem will be > solved using your software, right? > > Do you know if bowtie also uses this approach of subseting the reads? > Because if I trim the reads I get an improvement of ~15% on the alignments. > > Once again, thank you very much for your help. > > On Thu, Apr 21, 2011 at 2:21 AM, Wei Shi <shi@wehi.edu.au> wrote: > >> Dear João: >> >> The default setting of Rsubread alignment requires at least 3 subreads >> (16bp substrings extracted from the read) to be mapped to the same location >> to report a hit. Subreads are extracted at a gap of 2bp from the read >> sequence, i.e. neighboring subreads are 2bp apart. 3 passes of mapping will >> be performed for mapping three sets of subreads respectively, which are >> extracted from different start positions of the read (1st base, 2nd base and >> 3rd base). >> >> Therefore, for your trimmed reads which are 20bp long, there will be only >> up to 2 subreads extracted from each read and they were used for mapping. >> However, because the consensus cutoff (minimal number of consensus subreads >> needed to report a hit) is 3 by default (which I believe is the value used >> in your analysis), no hits can be reported because of insufficient number of >> extracted subreads. >> >> DNA sequences of 20 bases are highly repetitive in human or mouse genome. >> But I do not know if this is the case for yeast genome. However with shorter >> read, you are more likely to get ambiguous mapping locations. >> >> I would suggest you to use the full read sequences for mapping to take >> advantage of all the base information in the reads. When mapping 36bp reads, >> Rsubread can extract 7 subreads from each read. This gives a lot more power >> for finding mapping locations in a sensitive and accurate way. You can use a >> consensus threshold of 2 (TH1=2) to call mapping locations, which we found >> works quite well. >> >> Hope this helps. >> >> Cheers, >> Wei >> >> On Apr 21, 2011, at 12:31 AM, João Moura wrote: >> >> Dear Wei Shi, >> >> Indeed I can run the example of the vignette but not my own trimmed reads. >> With bowtie I get more than 80% but with RSubread i get 0%. I think I got >> why. the thing is I'm trimming the reads in R and trying to realign them >> afterwards. But when I'm triming, I trimmed the quality and DNA strings. So, >> a line like this: >> >> @SRR014338.24 GA-EAS46_1_209DH:8:1:752:779 length=36 >> >> >> Will contain "length=36" all the time, even if my reads and length 20. Do >> you think this might be a problem? Because if I run with untrimmed reads I >> get some reasonable results. >> >> I sent you a untrimmed version and a trimmed version. Maybe you can try... >> >> >> Thanks! >> >> >> On Wed, Apr 20, 2011 at 7:02 AM, Wei Shi <shi@wehi.edu.au> wrote: >> >>> Dear Joao: >>> >>> Can you run the example enclosed with Rsubread package? >>> >>> Cheers, >>> Wei >>> >>> On Apr 20, 2011, at 3:23 AM, João Moura wrote: >>> >>> Dear Wei Shi, >>> >>> I'm trying out the Rsubread and I'm having strange results. That is I >>> firstly built the reference genome, use a fasta file like this: >>> >>> *[joao@goedel genomes]$ grep ">" yeast.fna * >>> *>Scchr01 1 - 230208* >>> *>Scchr02 1 - 813178* >>> *>Scchr03 1 - 316617* >>> *>Scchr04 1 - 1531917* >>> *>Scchr05 1 - 576869* >>> *>Scchr06 1 - 270148* >>> *>Scchr07 1 - 1090946* >>> *>Scchr08 1 - 562643* >>> *>Scchr09 1 - 439885* >>> *>Scchr10 1 - 745667* >>> *>Scchr11 1 - 666454* >>> *>Scchr12 1 - 1078175* >>> *>Scchr13 1 - 924429* >>> *>Scchr14 1 - 784333* >>> *>Scchr15 1 - 1091289* >>> *>Scchr16 1 - 948062* >>> *>Scmito 1 - 85779* >>> * >>> * >>> >>> In R I ran the following code: >>> * >>> * >>> >>> *ref <- file.path("/home/joao/bowtie-0.12.7/genomes/yeast.fna")* >>> *path<-file.path("/home/joao/bowtie-0.12.7/Rsubread")* >>> *buildindex(basename = file.path(path, "reference_index"), reference = >>> ref)* >>> *reads <- dir("/home/joao/bowtie-0.12.7/reads/trimmed4", patt="fastq$", >>> full=TRUE)* >>> *align(index = file.path(path, "reference_index"), readfile1 = >>> reads[[1]],output_file = file.path(path, "alignResults.SAM"))* >>> * >>> >>> * >>> * >>> 11152683 reads were processed in 121.8 seconds. >>> * >>> * >>> Percentage of successfully mapped reads is 0.00%. >>> * >>> * >>> >>> * >>> * >>> >>> * >>> * >>> >>> * >>> * >>> Completed successfully. >>> * >>> >>> * >>> >>> * >>> >>> If I do the same with bowtie I get more than 70%..What am I doing wrong >>> here? Thanks a lot! >>> >>> -- >>> João Moura >>> >>> >>> >>> ______________________________________________________________________ >>> The information in this email is confidential and intended solely for the >>> addressee. >>> You must not disclose, forward, print or use it without the permission of >>> the sender. >>> ______________________________________________________________________ >>> >> >> >> >> -- >> João Moura >> <thousand.fastq><untri_thousand.fastq> >> >> >> >> ______________________________________________________________________ >> The information in this email is confidential and intended solely for the >> addressee. >> You must not disclose, forward, print or use it without the permission of >> the sender. >> ______________________________________________________________________ >> > > > > -- > João Moura > > > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:15}}

ADD REPLY • link 14.0 years ago João Moura ▴ 160

0

Entering edit mode

Wei Shi ★ 3.6k

@wei-shi-2183

Last seen 2 days ago

Australia/Melbourne

Hi Jo?o: You should use a lower consensus cutoff when using Rsubread to map your 36bp reads. The default setting of requiring at least 3 consensus subreads should only be used when reads are at least 45 bases long (so that at least 10 subreads can be extracted from each read). You can use a consensus threshold of 2 (TH1=2) to map your data. This should be able to get more reads mapped at a quite low mapping error rate. Attached is the C program for retrieving substrings from each read. Use this command to compile it on a linux computer: > gcc -g -o get_substrings get_substrings.c Typing ./get_substrings will tell you help to use it. It only works with a FASTQ file. The output will be a text file in which each line is a extracted substring. After you got the output file, for example a file called "output.txt", issue the following command to get the frequencies of distinct substrings: > sort output.txt | uniq -c > frequency.txt If your reads include adapter sequence, you should see a 14mer sequence in your "frequency.txt" file which has a very big number. Hope these will work for you, but let me know they don't. Cheers, Wei On Apr 22, 2011, at 4:11 AM, Jo?o Moura wrote: > Dear Wei Shi, > > Is it possible to send me the C code you talked about? > > Just another note, using bowtie with the full reads I get 66% while using Rsubread I get only 55%. Maybe the reason is that bowtie is reporting the alignments saying: > > "reads with at least one reported alignment" > > While maybe Rsubread just gets one hit per read? > > Thanks a lot for your help, > > On Thu, Apr 21, 2011 at 6:37 AM, Wei Shi <shi at="" wehi.edu.au=""> wrote: > Dear Jo?o: > > You got to be certain whether the first 14 bases of your reads are adapters or not before you can trim them. You can do this by visual inspection on a small set of reads. But a better way will be to extract all first 14 bases from each read and and then look at the frequency of these 14mer sequences. I wrote a C program before to retrieve barcode sequences for each read, which might be useful for your data as well. This program has not been included in Rsubread package yet, but I will be happy to send you a copy of it if you want to use it. > > Cheers, > Wei > > On Apr 21, 2011, at 10:44 AM, Jo?o Moura wrote: > >> Dear Wei Shi, >> >> First of all, thanks a lot for you help. Maybe I was doing it wrong also with bowtie, but the truth is I'm a master student, thus quite armature in all of this. >> >> Well, the reason why I trimmed my reads in first place was that sequence content across all bases is biased (which i think is a common problem in RNAseq). As you can see in the picture I send you, the first 14/15 position are biased (which I think is due to sequencing adapters?) So, the reason I followed was to trim the reads in order to be able to map the 22 bases that otherwise would be unmapped because of the adapters or other artifacts in the first 14/15 of the reads of the reads. But, since you say that 16bp is the minimum requirement for Rsubread, it seems that this problem will be solved using your software, right? >> >> Do you know if bowtie also uses this approach of subseting the reads? Because if I trim the reads I get an improvement of ~15% on the alignments. >> >> Once again, thank you very much for your help. >> >> On Thu, Apr 21, 2011 at 2:21 AM, Wei Shi <shi at="" wehi.edu.au=""> wrote: >> Dear Jo?o: >> >> The default setting of Rsubread alignment requires at least 3 subreads (16bp substrings extracted from the read) to be mapped to the same location to report a hit. Subreads are extracted at a gap of 2bp from the read sequence, i.e. neighboring subreads are 2bp apart. 3 passes of mapping will be performed for mapping three sets of subreads respectively, which are extracted from different start positions of the read (1st base, 2nd base and 3rd base). >> >> Therefore, for your trimmed reads which are 20bp long, there will be only up to 2 subreads extracted from each read and they were used for mapping. However, because the consensus cutoff (minimal number of consensus subreads needed to report a hit) is 3 by default (which I believe is the value used in your analysis), no hits can be reported because of insufficient number of extracted subreads. >> >> DNA sequences of 20 bases are highly repetitive in human or mouse genome. But I do not know if this is the case for yeast genome. However with shorter read, you are more likely to get ambiguous mapping locations. >> >> I would suggest you to use the full read sequences for mapping to take advantage of all the base information in the reads. When mapping 36bp reads, Rsubread can extract 7 subreads from each read. This gives a lot more power for finding mapping locations in a sensitive and accurate way. You can use a consensus threshold of 2 (TH1=2) to call mapping locations, which we found works quite well. >> >> Hope this helps. >> >> Cheers, >> Wei >> >> On Apr 21, 2011, at 12:31 AM, Jo?o Moura wrote: >> >>> Dear Wei Shi, >>> >>> Indeed I can run the example of the vignette but not my own trimmed reads. With bowtie I get more than 80% but with RSubread i get 0%. I think I got why. the thing is I'm trimming the reads in R and trying to realign them afterwards. But when I'm triming, I trimmed the quality and DNA strings. So, a line like this: >>> >>> @SRR014338.24 GA-EAS46_1_209DH:8:1:752:779 length=36 >>> >>> >>> Will contain "length=36" all the time, even if my reads and length 20. Do you think this might be a problem? Because if I run with untrimmed reads I get some reasonable results. >>> >>> I sent you a untrimmed version and a trimmed version. Maybe you can try... >>> >>> >>> Thanks! >>> >>> >>> On Wed, Apr 20, 2011 at 7:02 AM, Wei Shi <shi at="" wehi.edu.au=""> wrote: >>> Dear Joao: >>> >>> Can you run the example enclosed with Rsubread package? >>> >>> Cheers, >>> Wei >>> >>> On Apr 20, 2011, at 3:23 AM, Jo?o Moura wrote: >>> >>>> Dear Wei Shi, >>>> >>>> I'm trying out the Rsubread and I'm having strange results. That is I firstly built the reference genome, use a fasta file like this: >>>> >>>> [joao at goedel genomes]$ grep ">" yeast.fna >>>> >Scchr01 1 - 230208 >>>> >Scchr02 1 - 813178 >>>> >Scchr03 1 - 316617 >>>> >Scchr04 1 - 1531917 >>>> >Scchr05 1 - 576869 >>>> >Scchr06 1 - 270148 >>>> >Scchr07 1 - 1090946 >>>> >Scchr08 1 - 562643 >>>> >Scchr09 1 - 439885 >>>> >Scchr10 1 - 745667 >>>> >Scchr11 1 - 666454 >>>> >Scchr12 1 - 1078175 >>>> >Scchr13 1 - 924429 >>>> >Scchr14 1 - 784333 >>>> >Scchr15 1 - 1091289 >>>> >Scchr16 1 - 948062 >>>> >Scmito 1 - 85779 >>>> >>>> In R I ran the following code: >>>> >>>> ref <- file.path("/home/joao/bowtie-0.12.7/genomes/yeast.fna") >>>> path<-file.path("/home/joao/bowtie-0.12.7/Rsubread") >>>> buildindex(basename = file.path(path, "reference_index"), reference = ref) >>>> reads <- dir("/home/joao/bowtie-0.12.7/reads/trimmed4", patt="fastq$", full=TRUE) >>>> align(index = file.path(path, "reference_index"), readfile1 = reads[[1]],output_file = file.path(path, "alignResults.SAM")) >>>> >>>> 11152683 reads were processed in 121.8 seconds. >>>> Percentage of successfully mapped reads is 0.00%. >>>> >>>> >>>> >>>> Completed successfully. >>>> >>>> >>>> If I do the same with bowtie I get more than 70%..What am I doing wrong here? Thanks a lot! >>>> >>>> -- >>>> Jo?o Moura >>> >>> >>> ______________________________________________________________________ >>> The information in this email is confidential and intended solely for the addressee. >>> You must not disclose, forward, print or use it without the permission of the sender. >>> ______________________________________________________________________ >>> >>> >>> >>> -- >>> Jo?o Moura >>> <thousand.fastq><untri_thousand.fastq> >> >> >> ______________________________________________________________________ >> The information in this email is confidential and intended solely for the addressee. >> You must not disclose, forward, print or use it without the permission of the sender. >> ______________________________________________________________________ >> >> >> >> -- >> Jo?o Moura > > > ______________________________________________________________________ > The information in this email is confidential and intended solely for the addressee. > You must not disclose, forward, print or use it without the permission of the sender. > ______________________________________________________________________ > > > > -- > Jo?o Moura ______________________________________________________________________ The information in this email is confidential and intended solely for the addressee. You must not disclose, forward, print or use it without the permission of the sender. ______________________________________________________________________

ADD COMMENT • link 14.0 years ago Wei Shi ★ 3.6k

0

Entering edit mode

Hi Wei Shi, Thanks a lot for your script and help. I have two more questions regarding Rsubread: 1) is it possible to use the align() but using fastq file previously loaded in memory? 2) is it possible to save the alignment file in memory instead of in a file also? Both questions can be answered indirectly - something like a sequence of loading/save procedures before using each one - but since you align() saves and loads the files, maybe there is a better/quicker way of doing it. Thanks a lot! On Fri, Apr 22, 2011 at 8:54 AM, Wei Shi <shi@wehi.edu.au> wrote: > Hi João: > > You should use a lower consensus cutoff when using Rsubread to map your > 36bp reads. The default setting of requiring at least 3 consensus subreads > should only be used when reads are at least 45 bases long (so that at least > 10 subreads can be extracted from each read). You can use a consensus > threshold of 2 (TH1=2) to map your data. This should be able to get more > reads mapped at a quite low mapping error rate. > > Attached is the C program for retrieving substrings from each read. Use > this command to compile it on a linux computer: > > > gcc -g -o get_substrings get_substrings.c > > Typing ./get_substrings will tell you help to use it. It only works with a > FASTQ file. The output will be a text file in which each line is a extracted > substring. > > After you got the output file, for example a file called "output.txt", > issue the following command to get the frequencies of distinct substrings: > > > sort output.txt | uniq -c > frequency.txt > > If your reads include adapter sequence, you should see a 14mer sequence in > your "frequency.txt" file which has a very big number. > > Hope these will work for you, but let me know they don't. > > Cheers, > Wei > > > > > > On Apr 22, 2011, at 4:11 AM, João Moura wrote: > > Dear Wei Shi, > > Is it possible to send me the C code you talked about? > > Just another note, using bowtie with the full reads I get 66% while using > Rsubread I get only 55%. Maybe the reason is that bowtie is reporting the > alignments saying: > > "reads with at least one reported alignment" > > While maybe Rsubread just gets one hit per read? > > Thanks a lot for your help, > > On Thu, Apr 21, 2011 at 6:37 AM, Wei Shi <shi@wehi.edu.au> wrote: > >> Dear João: >> >> You got to be certain whether the first 14 bases of your reads are >> adapters or not before you can trim them. You can do this by visual >> inspection on a small set of reads. But a better way will be to extract all >> first 14 bases from each read and and then look at the frequency of these >> 14mer sequences. I wrote a C program before to retrieve barcode sequences >> for each read, which might be useful for your data as well. This program has >> not been included in Rsubread package yet, but I will be happy to send you a >> copy of it if you want to use it. >> >> Cheers, >> Wei >> >> On Apr 21, 2011, at 10:44 AM, João Moura wrote: >> >> Dear Wei Shi, >> >> First of all, thanks a lot for you help. Maybe I was doing it wrong also >> with bowtie, but the truth is I'm a master student, thus quite armature in >> all of this. >> >> Well, the reason why I trimmed my reads in first place was that sequence >> content across all bases is biased (which i think is a common problem in >> RNAseq). As you can see in the picture I send you, the first 14/15 position >> are biased (which I think is due to sequencing adapters?) So, the reason I >> followed was to trim the reads in order to be able to map the 22 bases that >> otherwise would be unmapped because of the adapters or other artifacts in >> the first 14/15 of the reads of the reads. But, since you say that 16bp is >> the minimum requirement for Rsubread, it seems that this problem will be >> solved using your software, right? >> >> Do you know if bowtie also uses this approach of subseting the reads? >> Because if I trim the reads I get an improvement of ~15% on the alignments. >> >> Once again, thank you very much for your help. >> >> On Thu, Apr 21, 2011 at 2:21 AM, Wei Shi <shi@wehi.edu.au> wrote: >> >>> Dear João: >>> >>> The default setting of Rsubread alignment requires at least 3 subreads >>> (16bp substrings extracted from the read) to be mapped to the same location >>> to report a hit. Subreads are extracted at a gap of 2bp from the read >>> sequence, i.e. neighboring subreads are 2bp apart. 3 passes of mapping will >>> be performed for mapping three sets of subreads respectively, which are >>> extracted from different start positions of the read (1st base, 2nd base and >>> 3rd base). >>> >>> Therefore, for your trimmed reads which are 20bp long, there will be only >>> up to 2 subreads extracted from each read and they were used for mapping. >>> However, because the consensus cutoff (minimal number of consensus subreads >>> needed to report a hit) is 3 by default (which I believe is the value used >>> in your analysis), no hits can be reported because of insufficient number of >>> extracted subreads. >>> >>> DNA sequences of 20 bases are highly repetitive in human or mouse genome. >>> But I do not know if this is the case for yeast genome. However with shorter >>> read, you are more likely to get ambiguous mapping locations. >>> >>> I would suggest you to use the full read sequences for mapping to take >>> advantage of all the base information in the reads. When mapping 36bp reads, >>> Rsubread can extract 7 subreads from each read. This gives a lot more power >>> for finding mapping locations in a sensitive and accurate way. You can use a >>> consensus threshold of 2 (TH1=2) to call mapping locations, which we found >>> works quite well. >>> >>> Hope this helps. >>> >>> Cheers, >>> Wei >>> >>> On Apr 21, 2011, at 12:31 AM, João Moura wrote: >>> >>> Dear Wei Shi, >>> >>> Indeed I can run the example of the vignette but not my own trimmed >>> reads. With bowtie I get more than 80% but with RSubread i get 0%. I think I >>> got why. the thing is I'm trimming the reads in R and trying to realign them >>> afterwards. But when I'm triming, I trimmed the quality and DNA strings. So, >>> a line like this: >>> >>> @SRR014338.24 GA-EAS46_1_209DH:8:1:752:779 length=36 >>> >>> >>> Will contain "length=36" all the time, even if my reads and length 20. Do >>> you think this might be a problem? Because if I run with untrimmed reads I >>> get some reasonable results. >>> >>> I sent you a untrimmed version and a trimmed version. Maybe you can >>> try... >>> >>> >>> Thanks! >>> >>> >>> On Wed, Apr 20, 2011 at 7:02 AM, Wei Shi <shi@wehi.edu.au> wrote: >>> >>>> Dear Joao: >>>> >>>> Can you run the example enclosed with Rsubread package? >>>> >>>> Cheers, >>>> Wei >>>> >>>> On Apr 20, 2011, at 3:23 AM, João Moura wrote: >>>> >>>> Dear Wei Shi, >>>> >>>> I'm trying out the Rsubread and I'm having strange results. That is I >>>> firstly built the reference genome, use a fasta file like this: >>>> >>>> *[joao@goedel genomes]$ grep ">" yeast.fna * >>>> *>Scchr01 1 - 230208* >>>> *>Scchr02 1 - 813178* >>>> *>Scchr03 1 - 316617* >>>> *>Scchr04 1 - 1531917* >>>> *>Scchr05 1 - 576869* >>>> *>Scchr06 1 - 270148* >>>> *>Scchr07 1 - 1090946* >>>> *>Scchr08 1 - 562643* >>>> *>Scchr09 1 - 439885* >>>> *>Scchr10 1 - 745667* >>>> *>Scchr11 1 - 666454* >>>> *>Scchr12 1 - 1078175* >>>> *>Scchr13 1 - 924429* >>>> *>Scchr14 1 - 784333* >>>> *>Scchr15 1 - 1091289* >>>> *>Scchr16 1 - 948062* >>>> *>Scmito 1 - 85779* >>>> * >>>> * >>>> >>>> In R I ran the following code: >>>> * >>>> * >>>> >>>> *ref <- file.path("/home/joao/bowtie-0.12.7/genomes/yeast.fna")* >>>> *path<-file.path("/home/joao/bowtie-0.12.7/Rsubread")* >>>> *buildindex(basename = file.path(path, "reference_index"), reference = >>>> ref)* >>>> *reads <- dir("/home/joao/bowtie-0.12.7/reads/trimmed4", patt="fastq$", >>>> full=TRUE)* >>>> *align(index = file.path(path, "reference_index"), readfile1 = >>>> reads[[1]],output_file = file.path(path, "alignResults.SAM"))* >>>> * >>>> >>>> * >>>> * >>>> 11152683 reads were processed in 121.8 seconds. >>>> * >>>> * >>>> Percentage of successfully mapped reads is 0.00%. >>>> * >>>> * >>>> >>>> * >>>> * >>>> >>>> * >>>> * >>>> >>>> * >>>> * >>>> Completed successfully. >>>> * >>>> >>>> * >>>> >>>> * >>>> >>>> If I do the same with bowtie I get more than 70%..What am I doing wrong >>>> here? Thanks a lot! >>>> >>>> -- >>>> João Moura >>>> >>>> >>>> >>>> ______________________________________________________________________ >>>> The information in this email is confidential and intended solely for >>>> the addressee. >>>> You must not disclose, forward, print or use it without the permission >>>> of the sender. >>>> ______________________________________________________________________ >>>> >>> >>> >>> >>> -- >>> João Moura >>> <thousand.fastq><untri_thousand.fastq> >>> >>> >>> >>> ______________________________________________________________________ >>> The information in this email is confidential and intended solely for the >>> addressee. >>> You must not disclose, forward, print or use it without the permission of >>> the sender. >>> ______________________________________________________________________ >>> >> >> >> >> -- >> João Moura >> >> >> >> ______________________________________________________________________ >> The information in this email is confidential and intended solely for the >> addressee. >> You must not disclose, forward, print or use it without the permission of >> the sender. >> ______________________________________________________________________ >> > > > > -- > João Moura > > > > -- João Moura [[alternative HTML version deleted]]

ADD REPLY • link 14.0 years ago João Moura ▴ 160

0

Entering edit mode

Hi João: Currently, align() only operates on files. Why do you want to use loaded fastq file for mapping? Align() does not load the entire fastq file into memory. Only one single read is present in the memory at any time, so the read data almost have no consumption of memory during mapping. Cheers, Wei On Apr 26, 2011, at 9:47 PM, João Moura wrote: > Hi Wei Shi, > > Thanks a lot for your script and help. > > I have two more questions regarding Rsubread: > > 1) is it possible to use the align() but using fastq file previously loaded in memory? > 2) is it possible to save the alignment file in memory instead of in a file also? > > Both questions can be answered indirectly - something like a sequence of loading/save procedures before using each one - but since you align() saves and loads the files, maybe there is a better/quicker way of doing it. > > Thanks a lot! > > > > On Fri, Apr 22, 2011 at 8:54 AM, Wei Shi <shi@wehi.edu.au> wrote: > Hi João: > > You should use a lower consensus cutoff when using Rsubread to map your 36bp reads. The default setting of requiring at least 3 consensus subreads should only be used when reads are at least 45 bases long (so that at least 10 subreads can be extracted from each read). You can use a consensus threshold of 2 (TH1=2) to map your data. This should be able to get more reads mapped at a quite low mapping error rate. > > Attached is the C program for retrieving substrings from each read. Use this command to compile it on a linux computer: > > > gcc -g -o get_substrings get_substrings.c > > Typing ./get_substrings will tell you help to use it. It only works with a FASTQ file. The output will be a text file in which each line is a extracted substring. > > After you got the output file, for example a file called "output.txt", issue the following command to get the frequencies of distinct substrings: > > > sort output.txt | uniq -c > frequency.txt > > If your reads include adapter sequence, you should see a 14mer sequence in your "frequency.txt" file which has a very big number. > > Hope these will work for you, but let me know they don't. > > Cheers, > Wei > > > > > > On Apr 22, 2011, at 4:11 AM, João Moura wrote: > >> Dear Wei Shi, >> >> Is it possible to send me the C code you talked about? >> >> Just another note, using bowtie with the full reads I get 66% while using Rsubread I get only 55%. Maybe the reason is that bowtie is reporting the alignments saying: >> >> "reads with at least one reported alignment" >> >> While maybe Rsubread just gets one hit per read? >> >> Thanks a lot for your help, >> >> On Thu, Apr 21, 2011 at 6:37 AM, Wei Shi <shi@wehi.edu.au> wrote: >> Dear João: >> >> You got to be certain whether the first 14 bases of your reads are adapters or not before you can trim them. You can do this by visual inspection on a small set of reads. But a better way will be to extract all first 14 bases from each read and and then look at the frequency of these 14mer sequences. I wrote a C program before to retrieve barcode sequences for each read, which might be useful for your data as well. This program has not been included in Rsubread package yet, but I will be happy to send you a copy of it if you want to use it. >> >> Cheers, >> Wei >> >> On Apr 21, 2011, at 10:44 AM, João Moura wrote: >> >>> Dear Wei Shi, >>> >>> First of all, thanks a lot for you help. Maybe I was doing it wrong also with bowtie, but the truth is I'm a master student, thus quite armature in all of this. >>> >>> Well, the reason why I trimmed my reads in first place was that sequence content across all bases is biased (which i think is a common problem in RNAseq). As you can see in the picture I send you, the first 14/15 position are biased (which I think is due to sequencing adapters?) So, the reason I followed was to trim the reads in order to be able to map the 22 bases that otherwise would be unmapped because of the adapters or other artifacts in the first 14/15 of the reads of the reads. But, since you say that 16bp is the minimum requirement for Rsubread, it seems that this problem will be solved using your software, right? >>> >>> Do you know if bowtie also uses this approach of subseting the reads? Because if I trim the reads I get an improvement of ~15% on the alignments. >>> >>> Once again, thank you very much for your help. >>> >>> On Thu, Apr 21, 2011 at 2:21 AM, Wei Shi <shi@wehi.edu.au> wrote: >>> Dear João: >>> >>> The default setting of Rsubread alignment requires at least 3 subreads (16bp substrings extracted from the read) to be mapped to the same location to report a hit. Subreads are extracted at a gap of 2bp from the read sequence, i.e. neighboring subreads are 2bp apart. 3 passes of mapping will be performed for mapping three sets of subreads respectively, which are extracted from different start positions of the read (1st base, 2nd base and 3rd base). >>> >>> Therefore, for your trimmed reads which are 20bp long, there will be only up to 2 subreads extracted from each read and they were used for mapping. However, because the consensus cutoff (minimal number of consensus subreads needed to report a hit) is 3 by default (which I believe is the value used in your analysis), no hits can be reported because of insufficient number of extracted subreads. >>> >>> DNA sequences of 20 bases are highly repetitive in human or mouse genome. But I do not know if this is the case for yeast genome. However with shorter read, you are more likely to get ambiguous mapping locations. >>> >>> I would suggest you to use the full read sequences for mapping to take advantage of all the base information in the reads. When mapping 36bp reads, Rsubread can extract 7 subreads from each read. This gives a lot more power for finding mapping locations in a sensitive and accurate way. You can use a consensus threshold of 2 (TH1=2) to call mapping locations, which we found works quite well. >>> >>> Hope this helps. >>> >>> Cheers, >>> Wei >>> >>> On Apr 21, 2011, at 12:31 AM, João Moura wrote: >>> >>>> Dear Wei Shi, >>>> >>>> Indeed I can run the example of the vignette but not my own trimmed reads. With bowtie I get more than 80% but with RSubread i get 0%. I think I got why. the thing is I'm trimming the reads in R and trying to realign them afterwards. But when I'm triming, I trimmed the quality and DNA strings. So, a line like this: >>>> >>>> @SRR014338.24 GA-EAS46_1_209DH:8:1:752:779 length=36 >>>> >>>> >>>> Will contain "length=36" all the time, even if my reads and length 20. Do you think this might be a problem? Because if I run with untrimmed reads I get some reasonable results. >>>> >>>> I sent you a untrimmed version and a trimmed version. Maybe you can try... >>>> >>>> >>>> Thanks! >>>> >>>> >>>> On Wed, Apr 20, 2011 at 7:02 AM, Wei Shi <shi@wehi.edu.au> wrote: >>>> Dear Joao: >>>> >>>> Can you run the example enclosed with Rsubread package? >>>> >>>> Cheers, >>>> Wei >>>> >>>> On Apr 20, 2011, at 3:23 AM, João Moura wrote: >>>> >>>>> Dear Wei Shi, >>>>> >>>>> I'm trying out the Rsubread and I'm having strange results. That is I firstly built the reference genome, use a fasta file like this: >>>>> >>>>> [joao@goedel genomes]$ grep ">" yeast.fna >>>>> >Scchr01 1 - 230208 >>>>> >Scchr02 1 - 813178 >>>>> >Scchr03 1 - 316617 >>>>> >Scchr04 1 - 1531917 >>>>> >Scchr05 1 - 576869 >>>>> >Scchr06 1 - 270148 >>>>> >Scchr07 1 - 1090946 >>>>> >Scchr08 1 - 562643 >>>>> >Scchr09 1 - 439885 >>>>> >Scchr10 1 - 745667 >>>>> >Scchr11 1 - 666454 >>>>> >Scchr12 1 - 1078175 >>>>> >Scchr13 1 - 924429 >>>>> >Scchr14 1 - 784333 >>>>> >Scchr15 1 - 1091289 >>>>> >Scchr16 1 - 948062 >>>>> >Scmito 1 - 85779 >>>>> >>>>> In R I ran the following code: >>>>> >>>>> ref <- file.path("/home/joao/bowtie-0.12.7/genomes/yeast.fna") >>>>> path<-file.path("/home/joao/bowtie-0.12.7/Rsubread") >>>>> buildindex(basename = file.path(path, "reference_index"), reference = ref) >>>>> reads <- dir("/home/joao/bowtie-0.12.7/reads/trimmed4", patt="fastq$", full=TRUE) >>>>> align(index = file.path(path, "reference_index"), readfile1 = reads[[1]],output_file = file.path(path, "alignResults.SAM")) >>>>> >>>>> 11152683 reads were processed in 121.8 seconds. >>>>> Percentage of successfully mapped reads is 0.00%. >>>>> >>>>> >>>>> >>>>> Completed successfully. >>>>> >>>>> >>>>> If I do the same with bowtie I get more than 70%..What am I doing wrong here? Thanks a lot! >>>>> >>>>> -- >>>>> João Moura >>>> >>>> >>>> ______________________________________________________________________ >>>> The information in this email is confidential and intended solely for the addressee. >>>> You must not disclose, forward, print or use it without the permission of the sender. >>>> ______________________________________________________________________ >>>> >>>> >>>> >>>> -- >>>> João Moura >>>> <thousand.fastq><untri_thousand.fastq> >>> >>> >>> ______________________________________________________________________ >>> The information in this email is confidential and intended solely for the addressee. >>> You must not disclose, forward, print or use it without the permission of the sender. >>> ______________________________________________________________________ >>> >>> >>> >>> -- >>> João Moura >> >> >> ______________________________________________________________________ >> The information in this email is confidential and intended solely for the addressee. >> You must not disclose, forward, print or use it without the permission of the sender. >> ______________________________________________________________________ >> >> >> >> -- >> João Moura > > > > > > -- > João Moura ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:8}}

ADD REPLY • link 14.0 years ago Wei Shi ★ 3.6k

0

Entering edit mode

Hi Wei, Well, the reason I wanted to process preloaded fasta files is because I'm running Rsubread inside a workflow and I already have each fasta loaded. In order to use Rsubread I need to write the file and read from it. Same thing with the SAM files, but anyway actually Rsubread is quite fast. Thanks alot for all! On Wed, Apr 27, 2011 at 2:49 AM, Wei Shi <shi@wehi.edu.au> wrote: > Hi João: > > Currently, align() only operates on files. Why do you want to use loaded > fastq file for mapping? Align() does not load the entire fastq file into > memory. Only one single read is present in the memory at any time, so the > read data almost have no consumption of memory during mapping. > > Cheers, > Wei > > On Apr 26, 2011, at 9:47 PM, João Moura wrote: > > Hi Wei Shi, > > Thanks a lot for your script and help. > > I have two more questions regarding Rsubread: > > 1) is it possible to use the align() but using fastq file previously loaded > in memory? > 2) is it possible to save the alignment file in memory instead of in a file > also? > > Both questions can be answered indirectly - something like a sequence of > loading/save procedures before using each one - but since you align() saves > and loads the files, maybe there is a better/quicker way of doing it. > > Thanks a lot! > > > > On Fri, Apr 22, 2011 at 8:54 AM, Wei Shi <shi@wehi.edu.au> wrote: > >> Hi João: >> >> You should use a lower consensus cutoff when using Rsubread to map your >> 36bp reads. The default setting of requiring at least 3 consensus subreads >> should only be used when reads are at least 45 bases long (so that at least >> 10 subreads can be extracted from each read). You can use a consensus >> threshold of 2 (TH1=2) to map your data. This should be able to get more >> reads mapped at a quite low mapping error rate. >> >> Attached is the C program for retrieving substrings from each read. Use >> this command to compile it on a linux computer: >> >> > gcc -g -o get_substrings get_substrings.c >> >> Typing ./get_substrings will tell you help to use it. It only works with a >> FASTQ file. The output will be a text file in which each line is a extracted >> substring. >> >> After you got the output file, for example a file called "output.txt", >> issue the following command to get the frequencies of distinct substrings: >> >> > sort output.txt | uniq -c > frequency.txt >> >> If your reads include adapter sequence, you should see a 14mer sequence in >> your "frequency.txt" file which has a very big number. >> >> Hope these will work for you, but let me know they don't. >> >> Cheers, >> Wei >> >> >> >> >> >> On Apr 22, 2011, at 4:11 AM, João Moura wrote: >> >> Dear Wei Shi, >> >> Is it possible to send me the C code you talked about? >> >> Just another note, using bowtie with the full reads I get 66% while using >> Rsubread I get only 55%. Maybe the reason is that bowtie is reporting the >> alignments saying: >> >> "reads with at least one reported alignment" >> >> While maybe Rsubread just gets one hit per read? >> >> Thanks a lot for your help, >> >> On Thu, Apr 21, 2011 at 6:37 AM, Wei Shi <shi@wehi.edu.au> wrote: >> >>> Dear João: >>> >>> You got to be certain whether the first 14 bases of your reads are >>> adapters or not before you can trim them. You can do this by visual >>> inspection on a small set of reads. But a better way will be to extract all >>> first 14 bases from each read and and then look at the frequency of these >>> 14mer sequences. I wrote a C program before to retrieve barcode sequences >>> for each read, which might be useful for your data as well. This program has >>> not been included in Rsubread package yet, but I will be happy to send you a >>> copy of it if you want to use it. >>> >>> Cheers, >>> Wei >>> >>> On Apr 21, 2011, at 10:44 AM, João Moura wrote: >>> >>> Dear Wei Shi, >>> >>> First of all, thanks a lot for you help. Maybe I was doing it wrong also >>> with bowtie, but the truth is I'm a master student, thus quite armature in >>> all of this. >>> >>> Well, the reason why I trimmed my reads in first place was that sequence >>> content across all bases is biased (which i think is a common problem in >>> RNAseq). As you can see in the picture I send you, the first 14/15 position >>> are biased (which I think is due to sequencing adapters?) So, the reason I >>> followed was to trim the reads in order to be able to map the 22 bases that >>> otherwise would be unmapped because of the adapters or other artifacts in >>> the first 14/15 of the reads of the reads. But, since you say that 16bp is >>> the minimum requirement for Rsubread, it seems that this problem will be >>> solved using your software, right? >>> >>> Do you know if bowtie also uses this approach of subseting the reads? >>> Because if I trim the reads I get an improvement of ~15% on the alignments. >>> >>> Once again, thank you very much for your help. >>> >>> On Thu, Apr 21, 2011 at 2:21 AM, Wei Shi <shi@wehi.edu.au> wrote: >>> >>>> Dear João: >>>> >>>> The default setting of Rsubread alignment requires at least 3 subreads >>>> (16bp substrings extracted from the read) to be mapped to the same location >>>> to report a hit. Subreads are extracted at a gap of 2bp from the read >>>> sequence, i.e. neighboring subreads are 2bp apart. 3 passes of mapping will >>>> be performed for mapping three sets of subreads respectively, which are >>>> extracted from different start positions of the read (1st base, 2nd base and >>>> 3rd base). >>>> >>>> Therefore, for your trimmed reads which are 20bp long, there will be >>>> only up to 2 subreads extracted from each read and they were used for >>>> mapping. However, because the consensus cutoff (minimal number of consensus >>>> subreads needed to report a hit) is 3 by default (which I believe is the >>>> value used in your analysis), no hits can be reported because of >>>> insufficient number of extracted subreads. >>>> >>>> DNA sequences of 20 bases are highly repetitive in human or mouse >>>> genome. But I do not know if this is the case for yeast genome. However with >>>> shorter read, you are more likely to get ambiguous mapping locations. >>>> >>>> I would suggest you to use the full read sequences for mapping to take >>>> advantage of all the base information in the reads. When mapping 36bp reads, >>>> Rsubread can extract 7 subreads from each read. This gives a lot more power >>>> for finding mapping locations in a sensitive and accurate way. You can use a >>>> consensus threshold of 2 (TH1=2) to call mapping locations, which we found >>>> works quite well. >>>> >>>> Hope this helps. >>>> >>>> Cheers, >>>> Wei >>>> >>>> On Apr 21, 2011, at 12:31 AM, João Moura wrote: >>>> >>>> Dear Wei Shi, >>>> >>>> Indeed I can run the example of the vignette but not my own trimmed >>>> reads. With bowtie I get more than 80% but with RSubread i get 0%. I think I >>>> got why. the thing is I'm trimming the reads in R and trying to realign them >>>> afterwards. But when I'm triming, I trimmed the quality and DNA strings. So, >>>> a line like this: >>>> >>>> @SRR014338.24 GA-EAS46_1_209DH:8:1:752:779 length=36 >>>> >>>> >>>> Will contain "length=36" all the time, even if my reads and length 20. >>>> Do you think this might be a problem? Because if I run with untrimmed reads >>>> I get some reasonable results. >>>> >>>> I sent you a untrimmed version and a trimmed version. Maybe you can >>>> try... >>>> >>>> >>>> Thanks! >>>> >>>> >>>> On Wed, Apr 20, 2011 at 7:02 AM, Wei Shi <shi@wehi.edu.au> wrote: >>>> >>>>> Dear Joao: >>>>> >>>>> Can you run the example enclosed with Rsubread package? >>>>> >>>>> Cheers, >>>>> Wei >>>>> >>>>> On Apr 20, 2011, at 3:23 AM, João Moura wrote: >>>>> >>>>> Dear Wei Shi, >>>>> >>>>> I'm trying out the Rsubread and I'm having strange results. That is I >>>>> firstly built the reference genome, use a fasta file like this: >>>>> >>>>> *[joao@goedel genomes]$ grep ">" yeast.fna * >>>>> *>Scchr01 1 - 230208* >>>>> *>Scchr02 1 - 813178* >>>>> *>Scchr03 1 - 316617* >>>>> *>Scchr04 1 - 1531917* >>>>> *>Scchr05 1 - 576869* >>>>> *>Scchr06 1 - 270148* >>>>> *>Scchr07 1 - 1090946* >>>>> *>Scchr08 1 - 562643* >>>>> *>Scchr09 1 - 439885* >>>>> *>Scchr10 1 - 745667* >>>>> *>Scchr11 1 - 666454* >>>>> *>Scchr12 1 - 1078175* >>>>> *>Scchr13 1 - 924429* >>>>> *>Scchr14 1 - 784333* >>>>> *>Scchr15 1 - 1091289* >>>>> *>Scchr16 1 - 948062* >>>>> *>Scmito 1 - 85779* >>>>> * >>>>> * >>>>> >>>>> In R I ran the following code: >>>>> * >>>>> * >>>>> >>>>> *ref <- file.path("/home/joao/bowtie-0.12.7/genomes/yeast.fna")* >>>>> *path<-file.path("/home/joao/bowtie-0.12.7/Rsubread")* >>>>> *buildindex(basename = file.path(path, "reference_index"), reference = >>>>> ref)* >>>>> *reads <- dir("/home/joao/bowtie-0.12.7/reads/trimmed4", >>>>> patt="fastq$", full=TRUE)* >>>>> *align(index = file.path(path, "reference_index"), readfile1 = >>>>> reads[[1]],output_file = file.path(path, "alignResults.SAM"))* >>>>> * >>>>> >>>>> * >>>>> * >>>>> 11152683 reads were processed in 121.8 seconds. >>>>> * >>>>> * >>>>> Percentage of successfully mapped reads is 0.00%. >>>>> * >>>>> * >>>>> >>>>> * >>>>> * >>>>> >>>>> * >>>>> * >>>>> >>>>> * >>>>> * >>>>> Completed successfully. >>>>> * >>>>> >>>>> * >>>>> >>>>> * >>>>> >>>>> If I do the same with bowtie I get more than 70%..What am I doing wrong >>>>> here? Thanks a lot! >>>>> >>>>> -- >>>>> João Moura >>>>> >>>>> >>>>> >>>>> ______________________________________________________________________ >>>>> The information in this email is confidential and intended solely for >>>>> the addressee. >>>>> You must not disclose, forward, print or use it without the permission >>>>> of the sender. >>>>> ______________________________________________________________________ >>>>> >>>> >>>> >>>> >>>> -- >>>> João Moura >>>> <thousand.fastq><untri_thousand.fastq> >>>> >>>> >>>> >>>> ______________________________________________________________________ >>>> The information in this email is confidential and intended solely for >>>> the addressee. >>>> You must not disclose, forward, print or use it without the permission >>>> of the sender. >>>> ______________________________________________________________________ >>>> >>> >>> >>> >>> -- >>> João Moura >>> >>> >>> >>> ______________________________________________________________________ >>> The information in this email is confidential and intended solely for the >>> addressee. >>> You must not disclose, forward, print or use it without the permission of >>> the sender. >>> ______________________________________________________________________ >>> >> >> >> >> -- >> João Moura >> >> >> >> > > > -- > João Moura > > > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:15}}

ADD REPLY • link 14.0 years ago João Moura ▴ 160

Login before adding your answer.