QuasR special case alignment

0

Entering edit mode

Michael Stadler ▴ 350

@michael-stadler-5887

Last seen 5 months ago

Switzerland

Hi Ugo, I can see one problem in the auxiliaries.txt file: You are referring the same sequence file (seq.fa) twice, so all auxiliary alignments are done twice. The speed of QuasR is primarly defined by the speed of bowtie/SpliceMap. The fact that the alignments against chr1 are fast may be due to only few reads producing hits on chr1, or at last much fewer hits than against the auxiliaries. A way to speed up the analysis would be to do unspliced alignments (spliceAlignment=FALSE), which may not be necessary if your reporter genes in seq.fa are spliced cDNAs. Also, it would help to use more than just 4 CPU cores, if you have additional hardware available to you. Michael On 26.04.2013 14:58, Ugo Borello wrote: > Thank you Michael, > The reporter genes are not part of the genome, so I am aligning my short > reads to the genome (actually chr1only in the following test) and to a fasta > file (seq.fa) containing the sequences of the two reporter genes. > But, while alignment to chr1 was relatively fast, the alignment to the fasta > file it is taking forever. > Here the code: > > library(QuasR) > library(parallel) > > cl <- makeCluster(4) > sampleFile <- "test.txt" > genomeName <- "chr1.fa" > auxFile <- "auxiliaries.txt" > > proj1 <- qAlign(sampleFile, genome= genomeName, auxiliaryFile= auxFile, > splicedAlignment=TRUE, clObj=cl) > > > text.txt is formatted this way: > FileName SampleName > 31omo.fastq Sample1 > > > auxiliaries.txt is formatted this way: > FileName AuxName > seq.fa egfp > seq.fa cre > > > Am I missing something here? Is there a faster way to complete my analysis? > > Thank you in advance for your help > > Ugo > > >> From: Michael Stadler <michael.stadler at="" fmi.ch=""> >> Date: Fri, 26 Apr 2013 08:40:50 +0200 >> To: <bioconductor at="" r-project.org=""> >> Cc: <ugo.borello at="" inserm.fr=""> >> Subject: Re: [BioC] QuasR special case alignment >> >> Dear Ugo, >> >> One way to accomplish would be to put your two reporter genes into a >> fasta file, and use this as your "genome" (see ?qAlign or the vignette >> for examples of genomes in fasta format). >> >> However, I think it would be better to align against the full genome, as >> this will avoid certain problems, such as reads being suboptimally >> aligned to your reporter genes which would align perfectly to some other >> genomic locus. >> >> Assuming that the reporter genes are part of the genome, you could then >> count alignments using qCount() with a GRanges query containing the >> ranges of your reporter genes. >> >> If the reporter genes are not part of the genome, you can provide your >> reporter gene sequences (as a fasta file) to qAlign, and all reads not >> aligning to the genome will be further aligned to them. The format of >> the required auxiliary file is very simple and described in section 5.2 >> of the vignette. >> >> You can open the QuasR vignette using: >> library(QuasR) >> vignette("QuasR-Overview") >> >> Best wishes, >> Michael >> >> On 25.04.2013 16:52, Ugo Borello wrote: >>> Good morning, >>> >>> I am doing RNA-Seq analysis and I would like to perform an exploratory >>> analysis to first verify that 2 reporter genes are expressed in my samples. >>> >>> Could the qAlign function perform an alignment of my short reads with two >>> reporter gene sequences, excluding the genome. >>> It seems, from my first attempt, that the 'genome' parameter is essential; >>> is this true? >>> Anyway how do I have to feed those sequences to qAlign and how do I have to >>> format my auxiliary file/files to see in the alignment statistics the number >>> of mapped reads for each reporter gene sequence separately? >>> >>> Thank you in advance for your help. >>> >>> Ugo >>> >>> P.S. Let me say that QuasR is a fantastic package! >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> > > -- -------------------------------------------- Michael Stadler, PhD Head of Computational Biology Friedrich Miescher Institute Basel (Switzerland) Phone : +41 61 697 6492 Fax : +41 61 697 3976 Mail : michael.stadler at fmi.ch

Alignment genomes QuasR Alignment genomes QuasR • 1.9k views

ADD COMMENT • link updated 12.0 years ago by Ugo Borello ▴ 340 • written 12.0 years ago by Michael Stadler ▴ 350

0

Entering edit mode

Ugo Borello ▴ 340

@ugo-borello-5753

Last seen 6.8 years ago

France

Dear Michael, Regarding the auxiliary alignments, if I understand correctly, having two sequences and because I want to count the alignments separately for each of them, I have to format the auxiliaries.txt file this way: FileName AuxName seq1.fa egfp seq2.fa cre So in this way I will have to copy the two sequences in two different files ( seq1.fa and seq2.fa), correct? Then, I have two more brief questions: 1) is there a special method to save the qProject object? 2) and regarding the matrix obtained using qCount, I wanted to add gene ids. So I ran: > query<- exons(TxDb.Mmusculus.UCSC.mm10.knownGene, columns='gene_id') > geneLevels <- qCount(proj1, query, reportLevel = "gene", clObj=cl) No gene ids as expected. If I run this line before calling qCount names(query)<-mcols(query)$gene_id I get this error message: Error in as.vector(x, mode = "character") : coercing an Atomic List object to an atomic vector is supported only for objects with top-level elements of length <= 1 And when I change it to names(query)<-as.vector(mcols(query)$gene_id) > geneLevels <- qCount(proj2, query, reportLevel = "gene", clObj=cl) Error in reduce(split(query, names(query))[unique(names(query))]) : error in evaluating the argument 'x' in selecting a method for function 'reduce': Error in .nextMethod(x = x, i = i) : subscript contains NAs qCount complains. What is then the correct method to add gene ids to the qCount matrix? And what about adding gene names also? Thank you in advance for your help. Ugo > From: Michael Stadler <michael.stadler at="" fmi.ch=""> > Date: Thu, 2 May 2013 08:38:09 +0200 > To: Ugo Borello <ugo.borello at="" inserm.fr=""> > Cc: <bioconductor at="" r-project.org=""> > Subject: Re: [BioC] QuasR special case alignment > > Hi Ugo, > > I can see one problem in the auxiliaries.txt file: You are referring the > same sequence file (seq.fa) twice, so all auxiliary alignments are done > twice. > > The speed of QuasR is primarly defined by the speed of bowtie/SpliceMap. > The fact that the alignments against chr1 are fast may be due to only > few reads producing hits on chr1, or at last much fewer hits than > against the auxiliaries. > > A way to speed up the analysis would be to do unspliced alignments > (spliceAlignment=FALSE), which may not be necessary if your reporter > genes in seq.fa are spliced cDNAs. Also, it would help to use more than > just 4 CPU cores, if you have additional hardware available to you. > > Michael > > On 26.04.2013 14:58, Ugo Borello wrote: >> Thank you Michael, >> The reporter genes are not part of the genome, so I am aligning my short >> reads to the genome (actually chr1only in the following test) and to a fasta >> file (seq.fa) containing the sequences of the two reporter genes. >> But, while alignment to chr1 was relatively fast, the alignment to the fasta >> file it is taking forever. >> Here the code: >> >> library(QuasR) >> library(parallel) >> >> cl <- makeCluster(4) >> sampleFile <- "test.txt" >> genomeName <- "chr1.fa" >> auxFile <- "auxiliaries.txt" >> >> proj1 <- qAlign(sampleFile, genome= genomeName, auxiliaryFile= auxFile, >> splicedAlignment=TRUE, clObj=cl) >> >> >> text.txt is formatted this way: >> FileName SampleName >> 31omo.fastq Sample1 >> >> >> auxiliaries.txt is formatted this way: >> FileName AuxName >> seq.fa egfp >> seq.fa cre >> >> >> Am I missing something here? Is there a faster way to complete my analysis? >> >> Thank you in advance for your help >> >> Ugo >> >> >>> From: Michael Stadler <michael.stadler at="" fmi.ch=""> >>> Date: Fri, 26 Apr 2013 08:40:50 +0200 >>> To: <bioconductor at="" r-project.org=""> >>> Cc: <ugo.borello at="" inserm.fr=""> >>> Subject: Re: [BioC] QuasR special case alignment >>> >>> Dear Ugo, >>> >>> One way to accomplish would be to put your two reporter genes into a >>> fasta file, and use this as your "genome" (see ?qAlign or the vignette >>> for examples of genomes in fasta format). >>> >>> However, I think it would be better to align against the full genome, as >>> this will avoid certain problems, such as reads being suboptimally >>> aligned to your reporter genes which would align perfectly to some other >>> genomic locus. >>> >>> Assuming that the reporter genes are part of the genome, you could then >>> count alignments using qCount() with a GRanges query containing the >>> ranges of your reporter genes. >>> >>> If the reporter genes are not part of the genome, you can provide your >>> reporter gene sequences (as a fasta file) to qAlign, and all reads not >>> aligning to the genome will be further aligned to them. The format of >>> the required auxiliary file is very simple and described in section 5.2 >>> of the vignette. >>> >>> You can open the QuasR vignette using: >>> library(QuasR) >>> vignette("QuasR-Overview") >>> >>> Best wishes, >>> Michael >>> >>> On 25.04.2013 16:52, Ugo Borello wrote: >>>> Good morning, >>>> >>>> I am doing RNA-Seq analysis and I would like to perform an exploratory >>>> analysis to first verify that 2 reporter genes are expressed in my samples. >>>> >>>> Could the qAlign function perform an alignment of my short reads with two >>>> reporter gene sequences, excluding the genome. >>>> It seems, from my first attempt, that the 'genome' parameter is essential; >>>> is this true? >>>> Anyway how do I have to feed those sequences to qAlign and how do I have to >>>> format my auxiliary file/files to see in the alignment statistics the >>>> number >>>> of mapped reads for each reporter gene sequence separately? >>>> >>>> Thank you in advance for your help. >>>> >>>> Ugo >>>> >>>> P.S. Let me say that QuasR is a fantastic package! >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >> >> > > -- > -------------------------------------------- > Michael Stadler, PhD > Head of Computational Biology > Friedrich Miescher Institute > Basel (Switzerland) > Phone : +41 61 697 6492 > Fax : +41 61 697 3976 > Mail : michael.stadler at fmi.ch > -------------------------------------------- >

ADD COMMENT • link 12.0 years ago Ugo Borello ▴ 340

0

Entering edit mode

Dear Ugo, On 08.05.2013 10:33, Ugo Borello wrote: > Dear Michael, > Regarding the auxiliary alignments, if I understand correctly, > having two sequences and because I want to count the alignments separately > for each of them, I have to format the auxiliaries.txt file this way: > FileName AuxName > seq1.fa egfp > seq2.fa cre > So in this way I will have to copy the two sequences in two different files > ( seq1.fa and seq2.fa), correct? That's one possibility. An alternative would be to keep them in the same fasta library (e.g. seq.fa), and do the counting by using the query sequence name to specify what to count. In that case, the auxiliary file would look like this: FileName AuxName seq.fa myAuxSeqs and the counting could e.g. look like this (assuming that "egfp" is one of the sequence identifier in seq.fa): qCount(proj, GRanges("egfp", IRanges(...))) > Then, I have two more brief questions: > 1) is there a special method to save the qProject object? It is a plain S4 R object that could be saved as other R objects (e.g. using saveRDS). However, as mentioned earlier on the list it is not recommended to save the object and reload them, as you could end up with unconsistent states. For example, one of the bam or fastq files could have been modified in between saving and reloading. It is recommended to always re-create the qProject object with qAlign - it will check for everything to be in order and will only take a second if the alignment files are already existing. > 2) and regarding the matrix obtained using qCount, I wanted to add gene ids. > So I ran: >> query<- exons(TxDb.Mmusculus.UCSC.mm10.knownGene, columns='gene_id') >> geneLevels <- qCount(proj1, query, reportLevel = "gene", clObj=cl) > No gene ids as expected. > If I run this line before calling qCount > names(query)<-mcols(query)$gene_id > I get this error message: > Error in as.vector(x, mode = "character") : coercing an Atomic List object > to an atomic vector is supported only for objects with top-level elements of > length <= 1 > And when I change it to > names(query)<-as.vector(mcols(query)$gene_id) >> geneLevels <- qCount(proj2, query, reportLevel = "gene", clObj=cl) > Error in reduce(split(query, names(query))[unique(names(query))]) : > error in evaluating the argument 'x' in selecting a method for function > 'reduce': Error in .nextMethod(x = x, i = i) : subscript contains NAs > > qCount complains. > > What is then the correct method to add gene ids to the qCount matrix? > And what about adding gene names also? You are mixing up two possible approaches to get gene levels. Either you use: * a GRanges query with gene identifiers as names query <- exons(TxDb.Mmusculus.UCSC.mm10.knownGene, columns='gene_id') names(query) <- as.character(query$gene_id) geneLevels <- qCount(proj1, query) or you use * a TranscriptDb query and set reportLevel: geneLevels <- qCount(proj1, TxDb.Mmusculus.UCSC.mm10.knownGene, reportLevel="gene") Either one of these should give you the same gene levels. Best wishes, Michael

ADD REPLY • link 12.0 years ago Michael Stadler ▴ 350

Login before adding your answer.