EasyRNAseq - error in building count table - Error in unlist
1
0
Entering edit mode
Guest User ★ 13k
@guest-user-4897
Last seen 10.3 years ago
Hello, I have sequencing data of maize which I aligned to the reference sequence (The maize sequence ZmB73_RefGen_v2.tar.gz, downloaded from: # http://ftp.maizesequence.org/current/assembly/) using tophat (v2.0.3), bowtie2 (2.0.0.6) and samtools (0.1.18.0). I want to make a count table with easyRNAseq (1.4.2) to use for EdgeR. Some months ago, I did following analysis with easyRNASeq: chr.map <- data.frame( from=c("chr1","chr2","chr3","chr4","chr5","ch r6","chr7","chr8","chr9","chr10"), to=c("1","2","3","4","5","6","7","8","9","10")) ensembl <- useMart("ENSEMBL_MART_PLANT",dataset="zmays_eg_gene") exon.annotation <- getBM(c("ensembl_gene_id", "ensembl_transcript_id", "chromosome_name", "ensembl_gene_id", "ensembl_exon_id", "exon_chrom_start", "exon_chrom_end"), mart=ensembl, filters="chromosome_name", values=chrm) exon.annotation$chromosome <- paste("chr",exon.annotation$chromosome_name,sep="") exon.range <- RangedData(IRanges( start=exon.annotation$exon_chrom_start, end=exon.annotation$exon_chrom_end), space=exon.annotation$chromosome, strand=exon.annotation$strand, transcript=exon.annotation$ensembl_transcript_id, gene=exon.annotation$ensembl_gene_id, exon=exon.annotation$ensembl_exon_id, universe = "NULL") files <- c( "accepted_hits_12a_sorted.bam", "accepted_hits_13a_sorted.bam",...) rnaSeq <- easyRNASeq(filesDirectory="/EASYRNASEQ/countdata", gapped=F, validity.check=TRUE, chr.map=chr.map, organism="custom", annotationMethod="env", annotationObject=exon.range, count="genes", filenames=files, summarization="geneModels", outputFormat="RNAseq") I got an output gene count table. Now, I wanted to repeat the analysis, but leaving out the non-protein- coding genes, which I wanted to achieve by introducing the "biotype" and filtering by "protein-coding" in the exon.annotation object. To check the analysis i performed it again like some months before and I got an error message, which I do not understand completely: OUTPUT: > head( exon.annotation) ensembl_gene_id strand ensembl_transcript_id chromosome_name 1 GRMZM2G439951 1 GRMZM2G439951_T01 5 2 GRMZM2G094632 1 GRMZM2G094632_T02 5 3 GRMZM2G094632 1 GRMZM2G094632_T01 5 4 GRMZM2G094632 1 GRMZM2G094632_T01 5 5 GRMZM2G095090 1 GRMZM2G095090_T01 5 6 GRMZM2G095094 1 GRMZM2G095094_T01 5 ensembl_gene_id.1 ensembl_exon_id exon_chrom_start exon_chrom_end 1 GRMZM2G439951 GRMZM2G439951_E01 39542856 39544325 2 GRMZM2G094632 GRMZM2G094632_E01 39435878 39436448 3 GRMZM2G094632 GRMZM2G094632_E03 39435878 39436317 4 GRMZM2G094632 GRMZM2G094632_E02 39436730 39437134 5 GRMZM2G095090 GRMZM2G095090_E01 39427195 39427706 6 GRMZM2G095094 GRMZM2G095094_E01 39424253 39427105 gene_biotype chromosome 1 protein_coding chr5 2 protein_coding chr5 3 protein_coding chr5 4 protein_coding chr5 5 protein_coding chr5 6 protein_coding chr5 > head( exon.range) RangedData with 6 rows and 5 value columns across 10 spaces space ranges | strand transcript <factor> <iranges> | <integer> <character> 1 chr1 [301409811, 301409969] | -1 AC185612.3_FGT006 2 chr1 [301362751, 301363304] | -1 GRMZM5G884466_T03 3 chr1 [301353831, 301354022] | -1 GRMZM5G884466_T03 4 chr1 [301353664, 301353745] | -1 GRMZM5G884466_T03 5 chr1 [301353366, 301353557] | -1 GRMZM5G884466_T03 6 chr1 [301353147, 301353260] | -1 GRMZM5G884466_T03 gene exon biotype <character> <character> <character> 1 AC185612.3_FG006 AC185612.3_FG006.exon1 protein_coding 2 GRMZM5G884466 GRMZM5G884466_E12 protein_coding 3 GRMZM5G884466 GRMZM5G884466_E03 protein_coding 4 GRMZM5G884466 GRMZM5G884466_E04 protein_coding 5 GRMZM5G884466 GRMZM5G884466_E08 protein_coding 6 GRMZM5G884466 GRMZM5G884466_E10 protein_coding > > rnaSeq_new <- easyRNASeq(filesDirectory="/projects/irg/grp_stich/pe rsonal_folders/Felix/TranscriptomeProfiling/RNASeq/RNASeq/tuxedo/topha t/EASYRNASEQ/countdata", + gapped=F, + validity.check=TRUE, + chr.map=chr.map, + organism="custom", + annotationMethod="env", + annotationObject=exon.range, + count="genes", + filenames=files, + summarization="geneModels", + outputFormat="RNAseq") Checking arguments... Fetching annotations... Computing gene models... Summarizing counts... Processing accepted_hits_12a_sorted.bam Updating the read length information. The reads are of 95 bp. Warning message: In easyRNASeq(filesDirectory = "/projects/irg/grp_stich/personal_folde rs/Felix/TranscriptomeProfiling/RNASeq/RNASeq/tuxedo/tophat/EASYRNASEQ /countdata", : There are 8770 synthetic exons as determined from your annotation that overlap! This implies that some reads will be counted more than once! Is that really what you want? Error in unlist(aggregate(readCoverage(obj)[names(geneModel(obj))[gm.sel]], : error in evaluating the argument 'x' in selecting a method for function 'unlist': Error in .Call2("Rle_getStartEndRunAndOffset", x, start, end, PACKAGE = "IRanges") : 'x' values larger than vector length 'sum(width)' > Could you help me with this problem? Best, Felix -- Felix Frey Max Planck Institut f??r Pflanzenz??chtungsforschung Carl-von-Linn??-Weg 10 D-50829 K??ln +49 (0) 221-5062 405 -- output of sessionInfo(): > sessionInfo() R version 3.0.1 (2013-05-16) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] easyRNASeq_1.6.0 ShortRead_1.18.0 latticeExtra_0.6-26 [4] RColorBrewer_1.0-5 Rsamtools_1.12.4 DESeq_1.12.1 [7] lattice_0.20-15 locfit_1.5-9.1 BSgenome_1.28.0 [10] GenomicRanges_1.12.5 Biostrings_2.28.0 IRanges_1.18.3 [13] edgeR_3.2.4 limma_3.16.7 biomaRt_2.16.0 [16] Biobase_2.20.1 genomeIntervals_1.16.0 BiocGenerics_0.6.0 [19] intervals_0.14.0 loaded via a namespace (and not attached): [1] AnnotationDbi_1.22.6 DBI_0.2-7 RCurl_1.95-4.1 [4] RSQLite_0.11.4 XML_3.98-1.1 annotate_1.38.0 [7] bitops_1.0-6 genefilter_1.42.0 geneplotter_1.38.0 [10] grid_3.0.1 hwriter_1.3 splines_3.0.1 [13] stats4_3.0.1 survival_2.37-4 tools_3.0.1 [16] xtable_1.7-1 zlibbioc_1.6.0 > -- Sent via the guest posting facility at bioconductor.org.
Sequencing RNASeq edgeR easyRNASeq Sequencing RNASeq edgeR easyRNASeq • 1.9k views
ADD COMMENT
0
Entering edit mode
@delhommeemblde-3232
Last seen 10.3 years ago
Hej Felix! This is strange. Would you be able to share the data offline with me so that I can try to reproduce the issue? If it's possible I would create a drop box folder. Cheers, Nico --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany --------------------------------------------------------------- On Sep 12, 2013, at 4:36 PM, Felix Frey [guest] wrote: > > Hello, > I have sequencing data of maize which I aligned to the reference sequence (The maize sequence ZmB73_RefGen_v2.tar.gz, downloaded from: > # http://ftp.maizesequence.org/current/assembly/) using tophat (v2.0.3), bowtie2 (2.0.0.6) and samtools (0.1.18.0). > I want to make a count table with easyRNAseq (1.4.2) to use for EdgeR. > Some months ago, I did following analysis with easyRNASeq: > > chr.map <- data.frame( from=c("chr1","chr2","chr3","chr4","chr5","c hr6","chr7","chr8","chr9","chr10"), > to=c("1","2","3","4","5","6","7","8","9","10")) > ensembl <- useMart("ENSEMBL_MART_PLANT",dataset="zmays_eg_gene") > exon.annotation <- getBM(c("ensembl_gene_id", > "ensembl_transcript_id", > "chromosome_name", > "ensembl_gene_id", > "ensembl_exon_id", > "exon_chrom_start", > "exon_chrom_end"), > mart=ensembl, > filters="chromosome_name", > values=chrm) > exon.annotation$chromosome <- paste("chr",exon.annotation$chromosome_name,sep="") > exon.range <- RangedData(IRanges( > start=exon.annotation$exon_chrom_start, > end=exon.annotation$exon_chrom_end), > space=exon.annotation$chromosome, > strand=exon.annotation$strand, > transcript=exon.annotation$ensembl_transcript_id, > gene=exon.annotation$ensembl_gene_id, > exon=exon.annotation$ensembl_exon_id, > universe = "NULL") > > files <- c( "accepted_hits_12a_sorted.bam", > "accepted_hits_13a_sorted.bam",...) > > rnaSeq <- easyRNASeq(filesDirectory="/EASYRNASEQ/countdata", > gapped=F, > validity.check=TRUE, > chr.map=chr.map, > organism="custom", > annotationMethod="env", > annotationObject=exon.range, > count="genes", > filenames=files, > summarization="geneModels", > outputFormat="RNAseq") > > I got an output gene count table. > Now, I wanted to repeat the analysis, but leaving out the non- protein-coding genes, which I wanted to achieve by introducing the "biotype" and filtering by "protein-coding" in the exon.annotation object. > To check the analysis i performed it again like some months before and I got an error message, which I do not understand completely: > > > OUTPUT: > >> head( exon.annotation) > ensembl_gene_id strand ensembl_transcript_id chromosome_name > 1 GRMZM2G439951 1 GRMZM2G439951_T01 5 > 2 GRMZM2G094632 1 GRMZM2G094632_T02 5 > 3 GRMZM2G094632 1 GRMZM2G094632_T01 5 > 4 GRMZM2G094632 1 GRMZM2G094632_T01 5 > 5 GRMZM2G095090 1 GRMZM2G095090_T01 5 > 6 GRMZM2G095094 1 GRMZM2G095094_T01 5 > ensembl_gene_id.1 ensembl_exon_id exon_chrom_start exon_chrom_end > 1 GRMZM2G439951 GRMZM2G439951_E01 39542856 39544325 > 2 GRMZM2G094632 GRMZM2G094632_E01 39435878 39436448 > 3 GRMZM2G094632 GRMZM2G094632_E03 39435878 39436317 > 4 GRMZM2G094632 GRMZM2G094632_E02 39436730 39437134 > 5 GRMZM2G095090 GRMZM2G095090_E01 39427195 39427706 > 6 GRMZM2G095094 GRMZM2G095094_E01 39424253 39427105 > gene_biotype chromosome > 1 protein_coding chr5 > 2 protein_coding chr5 > 3 protein_coding chr5 > 4 protein_coding chr5 > 5 protein_coding chr5 > 6 protein_coding chr5 > >> head( exon.range) > RangedData with 6 rows and 5 value columns across 10 spaces > space ranges | strand transcript > <factor> <iranges> | <integer> <character> > 1 chr1 [301409811, 301409969] | -1 AC185612.3_FGT006 > 2 chr1 [301362751, 301363304] | -1 GRMZM5G884466_T03 > 3 chr1 [301353831, 301354022] | -1 GRMZM5G884466_T03 > 4 chr1 [301353664, 301353745] | -1 GRMZM5G884466_T03 > 5 chr1 [301353366, 301353557] | -1 GRMZM5G884466_T03 > 6 chr1 [301353147, 301353260] | -1 GRMZM5G884466_T03 > gene exon biotype > <character> <character> <character> > 1 AC185612.3_FG006 AC185612.3_FG006.exon1 protein_coding > 2 GRMZM5G884466 GRMZM5G884466_E12 protein_coding > 3 GRMZM5G884466 GRMZM5G884466_E03 protein_coding > 4 GRMZM5G884466 GRMZM5G884466_E04 protein_coding > 5 GRMZM5G884466 GRMZM5G884466_E08 protein_coding > 6 GRMZM5G884466 GRMZM5G884466_E10 protein_coding >> > >> rnaSeq_new <- easyRNASeq(filesDirectory="/projects/irg/grp_stich/pe rsonal_folders/Felix/TranscriptomeProfiling/RNASeq/RNASeq/tuxedo/topha t/EASYRNASEQ/countdata", > + gapped=F, > + validity.check=TRUE, > + chr.map=chr.map, > + organism="custom", > + annotationMethod="env", > + annotationObject=exon.range, > + count="genes", > + filenames=files, > + summarization="geneModels", > + outputFormat="RNAseq") > Checking arguments... > Fetching annotations... > Computing gene models... > Summarizing counts... > Processing accepted_hits_12a_sorted.bam > Updating the read length information. > The reads are of 95 bp. > Warning message: > In easyRNASeq(filesDirectory = "/projects/irg/grp_stich/personal_fol ders/Felix/TranscriptomeProfiling/RNASeq/RNASeq/tuxedo/tophat/EASYRNAS EQ/countdata", : > There are 8770 synthetic exons as determined from your annotation that overlap! This implies that some reads will be counted more than once! Is that really what you want? > Error in unlist(aggregate(readCoverage(obj)[names(geneModel(obj))[gm.sel]], : > error in evaluating the argument 'x' in selecting a method for function 'unlist': Error in .Call2("Rle_getStartEndRunAndOffset", x, start, end, PACKAGE = "IRanges") : > 'x' values larger than vector length 'sum(width)' >> > > > > > > Could you help me with this problem? > > Best, > Felix > > -- > Felix Frey > > Max Planck Institut f?r Pflanzenz?chtungsforschung > Carl-von-Linn?-Weg 10 > D-50829 K?ln > +49 (0) 221-5062 405 > > -- output of sessionInfo(): > >> sessionInfo() > R version 3.0.1 (2013-05-16) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] easyRNASeq_1.6.0 ShortRead_1.18.0 latticeExtra_0.6-26 > [4] RColorBrewer_1.0-5 Rsamtools_1.12.4 DESeq_1.12.1 > [7] lattice_0.20-15 locfit_1.5-9.1 BSgenome_1.28.0 > [10] GenomicRanges_1.12.5 Biostrings_2.28.0 IRanges_1.18.3 > [13] edgeR_3.2.4 limma_3.16.7 biomaRt_2.16.0 > [16] Biobase_2.20.1 genomeIntervals_1.16.0 BiocGenerics_0.6.0 > [19] intervals_0.14.0 > > loaded via a namespace (and not attached): > [1] AnnotationDbi_1.22.6 DBI_0.2-7 RCurl_1.95-4.1 > [4] RSQLite_0.11.4 XML_3.98-1.1 annotate_1.38.0 > [7] bitops_1.0-6 genefilter_1.42.0 geneplotter_1.38.0 > [10] grid_3.0.1 hwriter_1.3 splines_3.0.1 > [13] stats4_3.0.1 survival_2.37-4 tools_3.0.1 > [16] xtable_1.7-1 zlibbioc_1.6.0 >> > > > -- > Sent via the guest posting facility at bioconductor.org.
ADD COMMENT
0
Entering edit mode
Thank you for your answer, Nico! I can upload the data. I have 48 .bam files with each about 15GB. I don't know if this is possible in dropbox. If so, I could just upload 2 files, which are also mentionned below in the code. Best, Felix On 12.09.2013 17:38, Nicolas Delhomme wrote: > Hej Felix! > > This is strange. Would you be able to share the data offline with me so that I can try to reproduce the issue? If it's possible I would create a drop box folder. > > Cheers, > > Nico > > --------------------------------------------------------------- > Nicolas Delhomme > > Genome Biology Computational Support > > European Molecular Biology Laboratory > > Tel: +49 6221 387 8310 > Email: nicolas.delhomme at embl.de > Meyerhofstrasse 1 - Postfach 10.2209 > 69102 Heidelberg, Germany > --------------------------------------------------------------- > > > > > > On Sep 12, 2013, at 4:36 PM, Felix Frey [guest] wrote: > >> Hello, >> I have sequencing data of maize which I aligned to the reference sequence (The maize sequence ZmB73_RefGen_v2.tar.gz, downloaded from: >> # http://ftp.maizesequence.org/current/assembly/) using tophat (v2.0.3), bowtie2 (2.0.0.6) and samtools (0.1.18.0). >> I want to make a count table with easyRNAseq (1.4.2) to use for EdgeR. >> Some months ago, I did following analysis with easyRNASeq: >> >> chr.map<- data.frame( from=c("chr1","chr2","chr3","chr4","chr5","c hr6","chr7","chr8","chr9","chr10"), >> to=c("1","2","3","4","5","6","7","8","9","10")) >> ensembl<- useMart("ENSEMBL_MART_PLANT",dataset="zmays_eg_gene") >> exon.annotation<- getBM(c("ensembl_gene_id", >> "ensembl_transcript_id", >> "chromosome_name", >> "ensembl_gene_id", >> "ensembl_exon_id", >> "exon_chrom_start", >> "exon_chrom_end"), >> mart=ensembl, >> filters="chromosome_name", >> values=chrm) >> exon.annotation$chromosome<- paste("chr",exon.annotation$chromosome_name,sep="") >> exon.range<- RangedData(IRanges( >> start=exon.annotation$exon_chrom_start, >> end=exon.annotation$exon_chrom_end), >> space=exon.annotation$chromosome, >> strand=exon.annotation$strand, >> transcript=exon.annotation$ensembl_transcript_id, >> gene=exon.annotation$ensembl_gene_id, >> exon=exon.annotation$ensembl_exon_id, >> universe = "NULL") >> >> files<- c( "accepted_hits_12a_sorted.bam", >> "accepted_hits_13a_sorted.bam",...) >> >> rnaSeq<- easyRNASeq(filesDirectory="/EASYRNASEQ/countdata", >> gapped=F, >> validity.check=TRUE, >> chr.map=chr.map, >> organism="custom", >> annotationMethod="env", >> annotationObject=exon.range, >> count="genes", >> filenames=files, >> summarization="geneModels", >> outputFormat="RNAseq") >> >> I got an output gene count table. >> Now, I wanted to repeat the analysis, but leaving out the non- protein-coding genes, which I wanted to achieve by introducing the "biotype" and filtering by "protein-coding" in the exon.annotation object. >> To check the analysis i performed it again like some months before and I got an error message, which I do not understand completely: >> >> >> OUTPUT: >> >>> head( exon.annotation) >> ensembl_gene_id strand ensembl_transcript_id chromosome_name >> 1 GRMZM2G439951 1 GRMZM2G439951_T01 5 >> 2 GRMZM2G094632 1 GRMZM2G094632_T02 5 >> 3 GRMZM2G094632 1 GRMZM2G094632_T01 5 >> 4 GRMZM2G094632 1 GRMZM2G094632_T01 5 >> 5 GRMZM2G095090 1 GRMZM2G095090_T01 5 >> 6 GRMZM2G095094 1 GRMZM2G095094_T01 5 >> ensembl_gene_id.1 ensembl_exon_id exon_chrom_start exon_chrom_end >> 1 GRMZM2G439951 GRMZM2G439951_E01 39542856 39544325 >> 2 GRMZM2G094632 GRMZM2G094632_E01 39435878 39436448 >> 3 GRMZM2G094632 GRMZM2G094632_E03 39435878 39436317 >> 4 GRMZM2G094632 GRMZM2G094632_E02 39436730 39437134 >> 5 GRMZM2G095090 GRMZM2G095090_E01 39427195 39427706 >> 6 GRMZM2G095094 GRMZM2G095094_E01 39424253 39427105 >> gene_biotype chromosome >> 1 protein_coding chr5 >> 2 protein_coding chr5 >> 3 protein_coding chr5 >> 4 protein_coding chr5 >> 5 protein_coding chr5 >> 6 protein_coding chr5 >> >>> head( exon.range) >> RangedData with 6 rows and 5 value columns across 10 spaces >> space ranges | strand transcript >> <factor> <iranges> |<integer> <character> >> 1 chr1 [301409811, 301409969] | -1 AC185612.3_FGT006 >> 2 chr1 [301362751, 301363304] | -1 GRMZM5G884466_T03 >> 3 chr1 [301353831, 301354022] | -1 GRMZM5G884466_T03 >> 4 chr1 [301353664, 301353745] | -1 GRMZM5G884466_T03 >> 5 chr1 [301353366, 301353557] | -1 GRMZM5G884466_T03 >> 6 chr1 [301353147, 301353260] | -1 GRMZM5G884466_T03 >> gene exon biotype >> <character> <character> <character> >> 1 AC185612.3_FG006 AC185612.3_FG006.exon1 protein_coding >> 2 GRMZM5G884466 GRMZM5G884466_E12 protein_coding >> 3 GRMZM5G884466 GRMZM5G884466_E03 protein_coding >> 4 GRMZM5G884466 GRMZM5G884466_E04 protein_coding >> 5 GRMZM5G884466 GRMZM5G884466_E08 protein_coding >> 6 GRMZM5G884466 GRMZM5G884466_E10 protein_coding >>> rnaSeq_new<- easyRNASeq(filesDirectory="/projects/irg/grp_stich/pe rsonal_folders/Felix/TranscriptomeProfiling/RNASeq/RNASeq/tuxedo/topha t/EASYRNASEQ/countdata", >> + gapped=F, >> + validity.check=TRUE, >> + chr.map=chr.map, >> + organism="custom", >> + annotationMethod="env", >> + annotationObject=exon.range, >> + count="genes", >> + filenames=files, >> + summarization="geneModels", >> + outputFormat="RNAseq") >> Checking arguments... >> Fetching annotations... >> Computing gene models... >> Summarizing counts... >> Processing accepted_hits_12a_sorted.bam >> Updating the read length information. >> The reads are of 95 bp. >> Warning message: >> In easyRNASeq(filesDirectory = "/projects/irg/grp_stich/personal_fo lders/Felix/TranscriptomeProfiling/RNASeq/RNASeq/tuxedo/tophat/EASYRNA SEQ/countdata", : >> There are 8770 synthetic exons as determined from your annotation that overlap! This implies that some reads will be counted more than once! Is that really what you want? >> Error in unlist(aggregate(readCoverage(obj)[names(geneModel(obj))[gm.sel]], : >> error in evaluating the argument 'x' in selecting a method for function 'unlist': Error in .Call2("Rle_getStartEndRunAndOffset", x, start, end, PACKAGE = "IRanges") : >> 'x' values larger than vector length 'sum(width)' >> >> >> >> >> Could you help me with this problem? >> >> Best, >> Felix >> >> -- >> Felix Frey >> >> Max Planck Institut f?r Pflanzenz?chtungsforschung >> Carl-von-Linn?-Weg 10 >> D-50829 K?ln >> +49 (0) 221-5062 405 >> >> -- output of sessionInfo(): >> >>> sessionInfo() >> R version 3.0.1 (2013-05-16) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] C >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets methods >> [8] base >> >> other attached packages: >> [1] easyRNASeq_1.6.0 ShortRead_1.18.0 latticeExtra_0.6-26 >> [4] RColorBrewer_1.0-5 Rsamtools_1.12.4 DESeq_1.12.1 >> [7] lattice_0.20-15 locfit_1.5-9.1 BSgenome_1.28.0 >> [10] GenomicRanges_1.12.5 Biostrings_2.28.0 IRanges_1.18.3 >> [13] edgeR_3.2.4 limma_3.16.7 biomaRt_2.16.0 >> [16] Biobase_2.20.1 genomeIntervals_1.16.0 BiocGenerics_0.6.0 >> [19] intervals_0.14.0 >> >> loaded via a namespace (and not attached): >> [1] AnnotationDbi_1.22.6 DBI_0.2-7 RCurl_1.95-4.1 >> [4] RSQLite_0.11.4 XML_3.98-1.1 annotate_1.38.0 >> [7] bitops_1.0-6 genefilter_1.42.0 geneplotter_1.38.0 >> [10] grid_3.0.1 hwriter_1.3 splines_3.0.1 >> [13] stats4_3.0.1 survival_2.37-4 tools_3.0.1 >> [16] xtable_1.7-1 zlibbioc_1.6.0 >> >> -- >> Sent via the guest posting facility at bioconductor.org. -- Felix Frey AG Stich Max Planck Institut f?r Pflanzenz?chtungsforschung Carl-von-Linn?-Weg 10 D-50829 K?ln +49 (0) 221-5062 405
ADD REPLY
0
Entering edit mode
Hej Felix! I figured out what the error is: Here are the end of the exons located at the extremity of the 10 chromosomes: Browse[1]> sapply(lapply(ranges(geneModel(obj))[gm.sel],range),end) 1 10 2 3 4 5 6 7 8 9 301409969 149550619 237864118 232184249 242011676 217905991 169336130 176810119 175341538 157013685 Here are the chromosome sizes as retrieved from your BAM file: Browse[1]> chrSize(obj) 1 10 2 3 4 5 6 7 8 9 Mt 301354135 150189435 237068873 232140174 241473504 217872852 169174353 176764762 175793759 156750706 569630 Pt Unknown 140384 7140151 As you can see below 8 of the chromosomal annotation - from biomaRt - have exons outside of the chromosomes - sizes derived from the BAM header. Browse[1]> sapply(lapply(ranges(geneModel(obj))[gm.sel],range),end) > chrSize(obj)[1:10] 1 10 2 3 4 5 6 7 8 9 TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE In short, you are using a different version of the genome for the alignment as the one you are retrieving the annotation from. easyRNASeq is obviously not behaving gracefully here, so I'll get a proper check implemented. That won't solve your issue though. What you need to figure out is which version of the genome/annotation to use and consistently doing so for your analyses. At the moment, since it would appear that the annotation have been lifted over a new genome, you would probably end up mis-assigning reads to intronic or extra- genic regions (hence not counting them) or worse assigning them to other genes. The easiest in your case is probably to get the gff3 or gtf file corresponding to the version of the genome you used for the alignments (e.g. from phytozome.org, retrieving the gene_exon.gff3 file) and use that for the easyRNASeq function with the parameter annotationMethod="gff" and annotationFile=[the path to the gff3] file. Btw, I've come to realize that the geneModel generation is painfully slow, luckily the new version due in October should fix that. It will actually introduces a large number of improvements both computationaly and memory wise. That new version will as well provide an easier biomaRt setup, so that you don't have to fetch the annotation yourself. I've actually looked into this a bit more and got it to work using the ggf3 file Zmays_181_gene_exons.gff3 retrieved from phytozome. The code is below with some inline comments. Actually, what I think is of highest interest for you is at the bottom of that code as I've written a work around to avoid the geneModel generation bottleneck I mentioned above. ## libs library(easyRNASeq) files<- c("excerpt_12a_sorted.bam", "excerpt_13a_sorted.bam") ## easyRNASeq v1.6 and lower ## is not very flexible in its gff support ## so we need to convert the gff file ## read the gff3 file and convert it into FlyBase like gff3 <- readGff3("Zmays_181_gene_exons.gff3") ## get the transcript to gene mapping from the mRNA feature transcriptGeneMapping <- data.frame(getGffAttribute(gff3[type(gff3)=="mRNA",],"ID"), getGffAttribute(gff3[type(gff3)=="mRNA",],"Parent")) ## get the exon feature exons <- gff3[type(gff3)=="exon",] ## change their gffAttributes into FlyBase like ## where ID is geneID:exonNumber ## here, we create the new ID by retrieving the ## geneID from the mapping and appending the ## exon number extracted from the original ID ## and we paste that in front of the original ## gffAttribute which ID section has been ## removed exons$gffAttributes <- paste("ID=",transcriptGeneMapping[match(getGffA ttribute(exons,"Parent"), transcriptGeneMapping$ID),"Parent"], ":",sub(".*\\.","",getGffAttribute(exons,"ID")), ";",sub("^ID=.*;P","P",exons$gffAttributes), sep="") ## write down the modified file ## all these functionalities we used to ## play with the gff file are from the ## genomeIntervals package writeGff3(exons,file="Zmays_181_exons.gff3") ## the following now works ## I have removed the validity check ## because your BAM files and ## the above gff file use the same naming ## conventions for the chromosomes rnaSeq_excerpt<- easyRNASeq(filesDirectory=".", annotationMethod="gff", annotationFile="Zmays_181_exons.gff3", count="genes", filenames=files, summarization="geneModels", outputFormat="RNAseq") ## AND HERE is what you could be most interested in ## if you are only interested in gene models ## as the geneModel generation is at the moment ## painfully slow, here is a work around ## were we will flatten the exon structure to have ## a single "synthetic" super-transcript per gene ## get the exon ranges by gene rngList <- split(IRanges(start=exons[,1],end=exons[,2]), transcriptGeneMapping[match(getGffAttribute(exons,"Parent"), transcriptGeneMapping$ID),"Parent"]) ## flatten the gene exons ## this will create synthetic exons if two exons ## of the gene overlap rngList <- reduce(rngList) ## get the gff information ## here we simply duplicate the ## first exon of every gene ## by the number of synthetic exons ## per gene. The content will be updated ## afterwards. syntheticGeneModel <- exons[rep(match(transcriptGeneMapping$ID[match(names(rngList), transcriptGeneMapping$Parent)], getGffAttribute(exons,"Parent")),elementLengths(rngList)),] ## update the coordinates syntheticGeneModel[,1] <- unlist(start(rngList)) syntheticGeneModel[,2] <- unlist(end(rngList)) ## change the source levels(syntheticGeneModel$source) <- "inhouse" ## modify the gffAttributes ## get the exon number for the minus strand exonNumber <- sapply(elementLengths(rngList),":",1) ## reverse them on the plus strand sel <- strand(syntheticGeneModel)[cumsum(elementLengths(rngList))] == "+" exonNumber[sel] <- sapply(exonNumber[sel],rev) ## update the attributes syntheticGeneModel$gffAttributes <- paste("ID=",rep(names(rngList),elementLengths(rngList)), ":",unlist(exonNumber),";Parent=", rep(names(rngList),elementLengths(rngList)),".0",sep="") ## write the gff file containing only one transcript per gene ## this transcript consists of all existing exon loci for that gene writeGff3(syntheticGeneModel,file="Zmays_181_synthetic-gene- models.gff3") ## Now the trick is to count using transcripts ## as we have only one transcrit per gene model ## and circumvent the gene model generation ## altogether geneModel_excerpt<- easyRNASeq(filesDirectory=".", annotationMethod="gff", annotationFile="Zmays_181_synthetic-gene- models.gff3", count="transcripts", filenames=files) ## Note that this is still not the best ## but that's due to the annotation ## if you look at it you'll see that ## the original annotation contains ## exons that are 1bp in size... ## So you might want extra filtering :-) Let me know if can I help you any further, Cheers, Nico sessionInfo() R version 3.0.1 (2013-05-16) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] easyRNASeq_1.6.0 ShortRead_1.18.0 latticeExtra_0.6-26 RColorBrewer_1.0-5 [5] Rsamtools_1.12.4 DESeq_1.12.1 lattice_0.20-23 locfit_1.5-9.1 [9] BSgenome_1.28.0 GenomicRanges_1.12.5 Biostrings_2.28.0 IRanges_1.18.4 [13] edgeR_3.2.4 limma_3.16.8 biomaRt_2.16.0 Biobase_2.20.1 [17] genomeIntervals_1.16.0 BiocGenerics_0.6.0 intervals_0.14.0 loaded via a namespace (and not attached): [1] annotate_1.38.0 AnnotationDbi_1.22.6 bitops_1.0-6 DBI_0.2-7 genefilter_1.42.0 [6] geneplotter_1.38.0 grid_3.0.1 hwriter_1.3 RCurl_1.95-4.1 RSQLite_0.11.4 [11] splines_3.0.1 stats4_3.0.1 survival_2.37-4 tools_3.0.1 XML_3.98-1.1 [16] xtable_1.7-1 zlibbioc_1.6.0 --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany --------------------------------------------------------------- On 13 Sep 2013, at 10:03, Felix Frey wrote: > Thank you for your answer, Nico! > > I can upload the data. I have 48 .bam files with each about 15GB. I don't know if this is possible in dropbox. If so, I could just upload 2 files, which are also mentionned below in the code. > > Best, > > Felix > > On 12.09.2013 17:38, Nicolas Delhomme wrote: >> Hej Felix! >> >> This is strange. Would you be able to share the data offline with me so that I can try to reproduce the issue? If it's possible I would create a drop box folder. >> >> Cheers, >> >> Nico >> >> --------------------------------------------------------------- >> Nicolas Delhomme >> >> Genome Biology Computational Support >> >> European Molecular Biology Laboratory >> >> Tel: +49 6221 387 8310 >> Email: nicolas.delhomme at embl.de >> Meyerhofstrasse 1 - Postfach 10.2209 >> 69102 Heidelberg, Germany >> --------------------------------------------------------------- >> >> >> >> >> >> On Sep 12, 2013, at 4:36 PM, Felix Frey [guest] wrote: >> >>> Hello, >>> I have sequencing data of maize which I aligned to the reference sequence (The maize sequence ZmB73_RefGen_v2.tar.gz, downloaded from: >>> # http://ftp.maizesequence.org/current/assembly/) using tophat (v2.0.3), bowtie2 (2.0.0.6) and samtools (0.1.18.0). >>> I want to make a count table with easyRNAseq (1.4.2) to use for EdgeR. >>> Some months ago, I did following analysis with easyRNASeq: >>> >>> chr.map<- data.frame( from=c("chr1","chr2","chr3","chr4","chr5"," chr6","chr7","chr8","chr9","chr10"), >>> to=c("1","2","3","4","5","6","7","8","9","10")) >>> ensembl<- useMart("ENSEMBL_MART_PLANT",dataset="zmays_eg_gene") >>> exon.annotation<- getBM(c("ensembl_gene_id", >>> "ensembl_transcript_id", >>> "chromosome_name", >>> "ensembl_gene_id", >>> "ensembl_exon_id", >>> "exon_chrom_start", >>> "exon_chrom_end"), >>> mart=ensembl, >>> filters="chromosome_name", >>> values=chrm) >>> exon.annotation$chromosome<- paste("chr",exon.annotation$chromosome_name,sep="") >>> exon.range<- RangedData(IRanges( >>> start=exon.annotation$exon_chrom_start, >>> end=exon.annotation$exon_chrom_end), >>> space=exon.annotation$chromosome, >>> strand=exon.annotation$strand, >>> transcript=exon.annotation$ensembl_transcript_id, >>> gene=exon.annotation$ensembl_gene_id, >>> exon=exon.annotation$ensembl_exon_id, >>> universe = "NULL") >>> >>> files<- c( "accepted_hits_12a_sorted.bam", >>> "accepted_hits_13a_sorted.bam",...) >>> >>> rnaSeq<- easyRNASeq(filesDirectory="/EASYRNASEQ/countdata", >>> gapped=F, >>> validity.check=TRUE, >>> chr.map=chr.map, >>> organism="custom", >>> annotationMethod="env", >>> annotationObject=exon.range, >>> count="genes", >>> filenames=files, >>> summarization="geneModels", >>> outputFormat="RNAseq") >>> >>> I got an output gene count table. >>> Now, I wanted to repeat the analysis, but leaving out the non- protein-coding genes, which I wanted to achieve by introducing the "biotype" and filtering by "protein-coding" in the exon.annotation object. >>> To check the analysis i performed it again like some months before and I got an error message, which I do not understand completely: >>> >>> >>> OUTPUT: >>> >>>> head( exon.annotation) >>> ensembl_gene_id strand ensembl_transcript_id chromosome_name >>> 1 GRMZM2G439951 1 GRMZM2G439951_T01 5 >>> 2 GRMZM2G094632 1 GRMZM2G094632_T02 5 >>> 3 GRMZM2G094632 1 GRMZM2G094632_T01 5 >>> 4 GRMZM2G094632 1 GRMZM2G094632_T01 5 >>> 5 GRMZM2G095090 1 GRMZM2G095090_T01 5 >>> 6 GRMZM2G095094 1 GRMZM2G095094_T01 5 >>> ensembl_gene_id.1 ensembl_exon_id exon_chrom_start exon_chrom_end >>> 1 GRMZM2G439951 GRMZM2G439951_E01 39542856 39544325 >>> 2 GRMZM2G094632 GRMZM2G094632_E01 39435878 39436448 >>> 3 GRMZM2G094632 GRMZM2G094632_E03 39435878 39436317 >>> 4 GRMZM2G094632 GRMZM2G094632_E02 39436730 39437134 >>> 5 GRMZM2G095090 GRMZM2G095090_E01 39427195 39427706 >>> 6 GRMZM2G095094 GRMZM2G095094_E01 39424253 39427105 >>> gene_biotype chromosome >>> 1 protein_coding chr5 >>> 2 protein_coding chr5 >>> 3 protein_coding chr5 >>> 4 protein_coding chr5 >>> 5 protein_coding chr5 >>> 6 protein_coding chr5 >>> >>>> head( exon.range) >>> RangedData with 6 rows and 5 value columns across 10 spaces >>> space ranges | strand transcript >>> <factor> <iranges> |<integer> <character> >>> 1 chr1 [301409811, 301409969] | -1 AC185612.3_FGT006 >>> 2 chr1 [301362751, 301363304] | -1 GRMZM5G884466_T03 >>> 3 chr1 [301353831, 301354022] | -1 GRMZM5G884466_T03 >>> 4 chr1 [301353664, 301353745] | -1 GRMZM5G884466_T03 >>> 5 chr1 [301353366, 301353557] | -1 GRMZM5G884466_T03 >>> 6 chr1 [301353147, 301353260] | -1 GRMZM5G884466_T03 >>> gene exon biotype >>> <character> <character> <character> >>> 1 AC185612.3_FG006 AC185612.3_FG006.exon1 protein_coding >>> 2 GRMZM5G884466 GRMZM5G884466_E12 protein_coding >>> 3 GRMZM5G884466 GRMZM5G884466_E03 protein_coding >>> 4 GRMZM5G884466 GRMZM5G884466_E04 protein_coding >>> 5 GRMZM5G884466 GRMZM5G884466_E08 protein_coding >>> 6 GRMZM5G884466 GRMZM5G884466_E10 protein_coding >>>> rnaSeq_new<- easyRNASeq(filesDirectory="/projects/irg/grp_stich/p ersonal_folders/Felix/TranscriptomeProfiling/RNASeq/RNASeq/tuxedo/toph at/EASYRNASEQ/countdata", >>> + gapped=F, >>> + validity.check=TRUE, >>> + chr.map=chr.map, >>> + organism="custom", >>> + annotationMethod="env", >>> + annotationObject=exon.range, >>> + count="genes", >>> + filenames=files, >>> + summarization="geneModels", >>> + outputFormat="RNAseq") >>> Checking arguments... >>> Fetching annotations... >>> Computing gene models... >>> Summarizing counts... >>> Processing accepted_hits_12a_sorted.bam >>> Updating the read length information. >>> The reads are of 95 bp. >>> Warning message: >>> In easyRNASeq(filesDirectory = "/projects/irg/grp_stich/personal_f olders/Felix/TranscriptomeProfiling/RNASeq/RNASeq/tuxedo/tophat/EASYRN ASEQ/countdata", : >>> There are 8770 synthetic exons as determined from your annotation that overlap! This implies that some reads will be counted more than once! Is that really what you want? >>> Error in unlist(aggregate(readCoverage(obj)[names(geneModel(obj))[gm.sel]], : >>> error in evaluating the argument 'x' in selecting a method for function 'unlist': Error in .Call2("Rle_getStartEndRunAndOffset", x, start, end, PACKAGE = "IRanges") : >>> 'x' values larger than vector length 'sum(width)' >>> >>> >>> >>> >>> Could you help me with this problem? >>> >>> Best, >>> Felix >>> >>> -- >>> Felix Frey >>> >>> Max Planck Institut f?r Pflanzenz?chtungsforschung >>> Carl-von-Linn?-Weg 10 >>> D-50829 K?ln >>> +49 (0) 221-5062 405 >>> >>> -- output of sessionInfo(): >>> >>>> sessionInfo() >>> R version 3.0.1 (2013-05-16) >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> >>> locale: >>> [1] C >>> >>> attached base packages: >>> [1] parallel stats graphics grDevices utils datasets methods >>> [8] base >>> >>> other attached packages: >>> [1] easyRNASeq_1.6.0 ShortRead_1.18.0 latticeExtra_0.6-26 >>> [4] RColorBrewer_1.0-5 Rsamtools_1.12.4 DESeq_1.12.1 >>> [7] lattice_0.20-15 locfit_1.5-9.1 BSgenome_1.28.0 >>> [10] GenomicRanges_1.12.5 Biostrings_2.28.0 IRanges_1.18.3 >>> [13] edgeR_3.2.4 limma_3.16.7 biomaRt_2.16.0 >>> [16] Biobase_2.20.1 genomeIntervals_1.16.0 BiocGenerics_0.6.0 >>> [19] intervals_0.14.0 >>> >>> loaded via a namespace (and not attached): >>> [1] AnnotationDbi_1.22.6 DBI_0.2-7 RCurl_1.95-4.1 >>> [4] RSQLite_0.11.4 XML_3.98-1.1 annotate_1.38.0 >>> [7] bitops_1.0-6 genefilter_1.42.0 geneplotter_1.38.0 >>> [10] grid_3.0.1 hwriter_1.3 splines_3.0.1 >>> [13] stats4_3.0.1 survival_2.37-4 tools_3.0.1 >>> [16] xtable_1.7-1 zlibbioc_1.6.0 >>> >>> -- >>> Sent via the guest posting facility at bioconductor.org. > > > -- > Felix Frey > > AG Stich > Max Planck Institut f?r Pflanzenz?chtungsforschung > Carl-von-Linn?-Weg 10 > D-50829 K?ln > +49 (0) 221-5062 405 >
ADD REPLY
0
Entering edit mode
Hello Nico, thank you very much for your effort! Probably the problem is that I used a different version in my alignment. I'm looking forward for the new version. I executed the code you wrote and it seems to work although I do not understand each step completely! Furthermore I have another question regarding my actual aim for this analysis: The actual reason, why I wanted to repeat this analysis, was that I wanted to include a filtering for protein-coding genes. For this reason I included "gene_biotype" and a filtering for protein_coding in my exon.annotation object: exon.annotation <- getBM(c("ensembl_gene_id", "strand", "ensembl_transcript_id", "chromosome_name", "ensembl_gene_id", "ensembl_exon_id", "exon_chrom_start", "exon_chrom_end", "gene_biotype"), mart=ensembl, filters=c("chromosome_name", "biotype"), values=list(chrm, "protein_coding")) # filter only protein coding and "chrm" in chromosome! # exon.annotation$chromosome <- paste("chr",exon.annotation$chromosome_name,sep="") exon.range <- RangedData(IRanges( start=exon.annotation$exon_chrom_start, end=exon.annotation$exon_chrom_end), space=exon.annotation$chromosome, strand=exon.annotation$strand, transcript=exon.annotation$ensembl_transcript_id, gene=exon.annotation$ensembl_gene_id, exon=exon.annotation$ensembl_exon_id, biotype=exon.annotation$gene_biotype, universe = "NULL") Do you have any idea how to filter out non-protein-coding genes in your work around? Thanks again! Felix On 26.09.2013 12:30, Nicolas Delhomme wrote: > Hej Felix! > > I figured out what the error is: > > Here are the end of the exons located at the extremity of the 10 chromosomes: > > Browse[1]> sapply(lapply(ranges(geneModel(obj))[gm.sel],range),end) > 1 10 2 3 4 5 6 7 8 9 > 301409969 149550619 237864118 232184249 242011676 217905991 169336130 176810119 175341538 157013685 > > Here are the chromosome sizes as retrieved from your BAM file: > > Browse[1]> chrSize(obj) > 1 10 2 3 4 5 6 7 8 9 Mt > 301354135 150189435 237068873 232140174 241473504 217872852 169174353 176764762 175793759 156750706 569630 > Pt Unknown > 140384 7140151 > > As you can see below 8 of the chromosomal annotation - from biomaRt - have exons outside of the chromosomes - sizes derived from the BAM header. > > Browse[1]> sapply(lapply(ranges(geneModel(obj))[gm.sel],range),end)> chrSize(obj)[1:10] > 1 10 2 3 4 5 6 7 8 9 > TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE > > In short, you are using a different version of the genome for the alignment as the one you are retrieving the annotation from. > > easyRNASeq is obviously not behaving gracefully here, so I'll get a proper check implemented. That won't solve your issue though. What you need to figure out is which version of the genome/annotation to use and consistently doing so for your analyses. At the moment, since it would appear that the annotation have been lifted over a new genome, you would probably end up mis-assigning reads to intronic or extra- genic regions (hence not counting them) or worse assigning them to other genes. The easiest in your case is probably to get the gff3 or gtf file corresponding to the version of the genome you used for the alignments (e.g. from phytozome.org, retrieving the gene_exon.gff3 file) and use that for the easyRNASeq function with the parameter annotationMethod="gff" and annotationFile=[the path to the gff3] file. > > Btw, I've come to realize that the geneModel generation is painfully slow, luckily the new version due in October should fix that. It will actually introduces a large number of improvements both computationaly and memory wise. That new version will as well provide an easier biomaRt setup, so that you don't have to fetch the annotation yourself. > > I've actually looked into this a bit more and got it to work using the ggf3 file Zmays_181_gene_exons.gff3 retrieved from phytozome. The code is below with some inline comments. Actually, what I think is of highest interest for you is at the bottom of that code as I've written a work around to avoid the geneModel generation bottleneck I mentioned above. > > ## libs > library(easyRNASeq) > files<- c("excerpt_12a_sorted.bam", > "excerpt_13a_sorted.bam") > > ## easyRNASeq v1.6 and lower > ## is not very flexible in its gff support > ## so we need to convert the gff file > ## read the gff3 file and convert it into FlyBase like > gff3<- readGff3("Zmays_181_gene_exons.gff3") > > ## get the transcript to gene mapping from the mRNA feature > transcriptGeneMapping<- data.frame(getGffAttribute(gff3[type(gff3)=="mRNA",],"ID"), > getGffAttribute(gff3[type(gff3)=="mRNA",],"Parent")) > > ## get the exon feature > exons<- gff3[type(gff3)=="exon",] > > ## change their gffAttributes into FlyBase like > ## where ID is geneID:exonNumber > ## here, we create the new ID by retrieving the > ## geneID from the mapping and appending the > ## exon number extracted from the original ID > ## and we paste that in front of the original > ## gffAttribute which ID section has been > ## removed > exons$gffAttributes<- paste("ID=",transcriptGeneMapping[match(getGff Attribute(exons,"Parent"), > transcriptGeneMapping$ID),"Parent"], > ":",sub(".*\\.","",getGffAttribute(exons,"ID")), > ";",sub("^ID=.*;P","P",exons$gffAttributes), > sep="") > > ## write down the modified file > ## all these functionalities we used to > ## play with the gff file are from the > ## genomeIntervals package > writeGff3(exons,file="Zmays_181_exons.gff3") > > ## the following now works > ## I have removed the validity check > ## because your BAM files and > ## the above gff file use the same naming > ## conventions for the chromosomes > rnaSeq_excerpt<- easyRNASeq(filesDirectory=".", > annotationMethod="gff", > annotationFile="Zmays_181_exons.gff3", > count="genes", > filenames=files, > summarization="geneModels", > outputFormat="RNAseq") > > ## AND HERE is what you could be most interested in > ## if you are only interested in gene models > ## as the geneModel generation is at the moment > ## painfully slow, here is a work around > ## were we will flatten the exon structure to have > ## a single "synthetic" super-transcript per gene > > ## get the exon ranges by gene > rngList<- split(IRanges(start=exons[,1],end=exons[,2]), > transcriptGeneMapping[match(getGffAttribute(exons,"Parent"), > transcriptGeneMapping$ID),"Parent"]) > > ## flatten the gene exons > ## this will create synthetic exons if two exons > ## of the gene overlap > rngList<- reduce(rngList) > > ## get the gff information > ## here we simply duplicate the > ## first exon of every gene > ## by the number of synthetic exons > ## per gene. The content will be updated > ## afterwards. > syntheticGeneModel<- exons[rep(match(transcriptGeneMapping$ID[match(names(rngList), > transcriptGeneMapping$Parent)], > getGffAttribute(exons,"Parent")),elementLengths(rngList)),] > > ## update the coordinates > syntheticGeneModel[,1]<- unlist(start(rngList)) > syntheticGeneModel[,2]<- unlist(end(rngList)) > > ## change the source > levels(syntheticGeneModel$source)<- "inhouse" > > ## modify the gffAttributes > ## get the exon number for the minus strand > exonNumber<- sapply(elementLengths(rngList),":",1) > ## reverse them on the plus strand > sel<- strand(syntheticGeneModel)[cumsum(elementLengths(rngList))] == "+" > exonNumber[sel]<- sapply(exonNumber[sel],rev) > ## update the attributes > syntheticGeneModel$gffAttributes<- paste("ID=",rep(names(rngList),elementLengths(rngList)), > ":",unlist(exonNumber),";Parent=", > rep(names(rngList),elementLengths(rngList)),".0",sep="") > > ## write the gff file containing only one transcript per gene > ## this transcript consists of all existing exon loci for that gene > writeGff3(syntheticGeneModel,file="Zmays_181_synthetic-gene- models.gff3") > > ## Now the trick is to count using transcripts > ## as we have only one transcrit per gene model > ## and circumvent the gene model generation > ## altogether > geneModel_excerpt<- easyRNASeq(filesDirectory=".", > annotationMethod="gff", > annotationFile="Zmays_181_synthetic- gene-models.gff3", > count="transcripts", > filenames=files) > > ## Note that this is still not the best > ## but that's due to the annotation > ## if you look at it you'll see that > ## the original annotation contains > ## exons that are 1bp in size... > ## So you might want extra filtering :-) > > Let me know if can I help you any further, > > Cheers, > > Nico > > sessionInfo() > > R version 3.0.1 (2013-05-16) > Platform: x86_64-apple-darwin10.8.0 (64-bit) > > locale: > [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods base > > other attached packages: > [1] easyRNASeq_1.6.0 ShortRead_1.18.0 latticeExtra_0.6-26 RColorBrewer_1.0-5 > [5] Rsamtools_1.12.4 DESeq_1.12.1 lattice_0.20-23 locfit_1.5-9.1 > [9] BSgenome_1.28.0 GenomicRanges_1.12.5 Biostrings_2.28.0 IRanges_1.18.4 > [13] edgeR_3.2.4 limma_3.16.8 biomaRt_2.16.0 Biobase_2.20.1 > [17] genomeIntervals_1.16.0 BiocGenerics_0.6.0 intervals_0.14.0 > > loaded via a namespace (and not attached): > [1] annotate_1.38.0 AnnotationDbi_1.22.6 bitops_1.0-6 DBI_0.2-7 genefilter_1.42.0 > [6] geneplotter_1.38.0 grid_3.0.1 hwriter_1.3 RCurl_1.95-4.1 RSQLite_0.11.4 > [11] splines_3.0.1 stats4_3.0.1 survival_2.37-4 tools_3.0.1 XML_3.98-1.1 > [16] xtable_1.7-1 zlibbioc_1.6.0 > > --------------------------------------------------------------- > Nicolas Delhomme > > Genome Biology Computational Support > > European Molecular Biology Laboratory > > Tel: +49 6221 387 8310 > Email: nicolas.delhomme at embl.de > Meyerhofstrasse 1 - Postfach 10.2209 > 69102 Heidelberg, Germany > --------------------------------------------------------------- > > > > > > On 13 Sep 2013, at 10:03, Felix Frey wrote: > >> Thank you for your answer, Nico! >> >> I can upload the data. I have 48 .bam files with each about 15GB. I don't know if this is possible in dropbox. If so, I could just upload 2 files, which are also mentionned below in the code. >> >> Best, >> >> Felix >> >> On 12.09.2013 17:38, Nicolas Delhomme wrote: >>> Hej Felix! >>> >>> This is strange. Would you be able to share the data offline with me so that I can try to reproduce the issue? If it's possible I would create a drop box folder. >>> >>> Cheers, >>> >>> Nico >>> >>> --------------------------------------------------------------- >>> Nicolas Delhomme >>> >>> Genome Biology Computational Support >>> >>> European Molecular Biology Laboratory >>> >>> Tel: +49 6221 387 8310 >>> Email: nicolas.delhomme at embl.de >>> Meyerhofstrasse 1 - Postfach 10.2209 >>> 69102 Heidelberg, Germany >>> --------------------------------------------------------------- >>> >>> >>> >>> >>> >>> On Sep 12, 2013, at 4:36 PM, Felix Frey [guest] wrote: >>> >>>> Hello, >>>> I have sequencing data of maize which I aligned to the reference sequence (The maize sequence ZmB73_RefGen_v2.tar.gz, downloaded from: >>>> # http://ftp.maizesequence.org/current/assembly/) using tophat (v2.0.3), bowtie2 (2.0.0.6) and samtools (0.1.18.0). >>>> I want to make a count table with easyRNAseq (1.4.2) to use for EdgeR. >>>> Some months ago, I did following analysis with easyRNASeq: >>>> >>>> chr.map<- data.frame( from=c("chr1","chr2","chr3","chr4","chr5", "chr6","chr7","chr8","chr9","chr10"), >>>> to=c("1","2","3","4","5","6","7","8","9","10")) >>>> ensembl<- useMart("ENSEMBL_MART_PLANT",dataset="zmays_eg_gene") >>>> exon.annotation<- getBM(c("ensembl_gene_id", >>>> "ensembl_transcript_id", >>>> "chromosome_name", >>>> "ensembl_gene_id", >>>> "ensembl_exon_id", >>>> "exon_chrom_start", >>>> "exon_chrom_end"), >>>> mart=ensembl, >>>> filters="chromosome_name", >>>> values=chrm) >>>> exon.annotation$chromosome<- paste("chr",exon.annotation$chromosome_name,sep="") >>>> exon.range<- RangedData(IRanges( >>>> start=exon.annotation$exon_chrom_start, >>>> end=exon.annotation$exon_chrom_end), >>>> space=exon.annotation$chromosome, >>>> strand=exon.annotation$strand, >>>> transcript=exon.annotation$ensembl_transcript_id, >>>> gene=exon.annotation$ensembl_gene_id, >>>> exon=exon.annotation$ensembl_exon_id, >>>> universe = "NULL") >>>> >>>> files<- c( "accepted_hits_12a_sorted.bam", >>>> "accepted_hits_13a_sorted.bam",...) >>>> >>>> rnaSeq<- easyRNASeq(filesDirectory="/EASYRNASEQ/countdata", >>>> gapped=F, >>>> validity.check=TRUE, >>>> chr.map=chr.map, >>>> organism="custom", >>>> annotationMethod="env", >>>> annotationObject=exon.range, >>>> count="genes", >>>> filenames=files, >>>> summarization="geneModels", >>>> outputFormat="RNAseq") >>>> >>>> I got an output gene count table. >>>> Now, I wanted to repeat the analysis, but leaving out the non- protein-coding genes, which I wanted to achieve by introducing the "biotype" and filtering by "protein-coding" in the exon.annotation object. >>>> To check the analysis i performed it again like some months before and I got an error message, which I do not understand completely: >>>> >>>> >>>> OUTPUT: >>>> >>>>> head( exon.annotation) >>>> ensembl_gene_id strand ensembl_transcript_id chromosome_name >>>> 1 GRMZM2G439951 1 GRMZM2G439951_T01 5 >>>> 2 GRMZM2G094632 1 GRMZM2G094632_T02 5 >>>> 3 GRMZM2G094632 1 GRMZM2G094632_T01 5 >>>> 4 GRMZM2G094632 1 GRMZM2G094632_T01 5 >>>> 5 GRMZM2G095090 1 GRMZM2G095090_T01 5 >>>> 6 GRMZM2G095094 1 GRMZM2G095094_T01 5 >>>> ensembl_gene_id.1 ensembl_exon_id exon_chrom_start exon_chrom_end >>>> 1 GRMZM2G439951 GRMZM2G439951_E01 39542856 39544325 >>>> 2 GRMZM2G094632 GRMZM2G094632_E01 39435878 39436448 >>>> 3 GRMZM2G094632 GRMZM2G094632_E03 39435878 39436317 >>>> 4 GRMZM2G094632 GRMZM2G094632_E02 39436730 39437134 >>>> 5 GRMZM2G095090 GRMZM2G095090_E01 39427195 39427706 >>>> 6 GRMZM2G095094 GRMZM2G095094_E01 39424253 39427105 >>>> gene_biotype chromosome >>>> 1 protein_coding chr5 >>>> 2 protein_coding chr5 >>>> 3 protein_coding chr5 >>>> 4 protein_coding chr5 >>>> 5 protein_coding chr5 >>>> 6 protein_coding chr5 >>>> >>>>> head( exon.range) >>>> RangedData with 6 rows and 5 value columns across 10 spaces >>>> space ranges | strand transcript >>>> <factor> <iranges> |<integer> <character> >>>> 1 chr1 [301409811, 301409969] | -1 AC185612.3_FGT006 >>>> 2 chr1 [301362751, 301363304] | -1 GRMZM5G884466_T03 >>>> 3 chr1 [301353831, 301354022] | -1 GRMZM5G884466_T03 >>>> 4 chr1 [301353664, 301353745] | -1 GRMZM5G884466_T03 >>>> 5 chr1 [301353366, 301353557] | -1 GRMZM5G884466_T03 >>>> 6 chr1 [301353147, 301353260] | -1 GRMZM5G884466_T03 >>>> gene exon biotype >>>> <character> <character> <character> >>>> 1 AC185612.3_FG006 AC185612.3_FG006.exon1 protein_coding >>>> 2 GRMZM5G884466 GRMZM5G884466_E12 protein_coding >>>> 3 GRMZM5G884466 GRMZM5G884466_E03 protein_coding >>>> 4 GRMZM5G884466 GRMZM5G884466_E04 protein_coding >>>> 5 GRMZM5G884466 GRMZM5G884466_E08 protein_coding >>>> 6 GRMZM5G884466 GRMZM5G884466_E10 protein_coding >>>>> rnaSeq_new<- easyRNASeq(filesDirectory="/projects/irg/grp_stich/ personal_folders/Felix/TranscriptomeProfiling/RNASeq/RNASeq/tuxedo/top hat/EASYRNASEQ/countdata", >>>> + gapped=F, >>>> + validity.check=TRUE, >>>> + chr.map=chr.map, >>>> + organism="custom", >>>> + annotationMethod="env", >>>> + annotationObject=exon.range, >>>> + count="genes", >>>> + filenames=files, >>>> + summarization="geneModels", >>>> + outputFormat="RNAseq") >>>> Checking arguments... >>>> Fetching annotations... >>>> Computing gene models... >>>> Summarizing counts... >>>> Processing accepted_hits_12a_sorted.bam >>>> Updating the read length information. >>>> The reads are of 95 bp. >>>> Warning message: >>>> In easyRNASeq(filesDirectory = "/projects/irg/grp_stich/personal_ folders/Felix/TranscriptomeProfiling/RNASeq/RNASeq/tuxedo/tophat/EASYR NASEQ/countdata", : >>>> There are 8770 synthetic exons as determined from your annotation that overlap! This implies that some reads will be counted more than once! Is that really what you want? >>>> Error in unlist(aggregate(readCoverage(obj)[names(geneModel(obj))[gm.sel]], : >>>> error in evaluating the argument 'x' in selecting a method for function 'unlist': Error in .Call2("Rle_getStartEndRunAndOffset", x, start, end, PACKAGE = "IRanges") : >>>> 'x' values larger than vector length 'sum(width)' >>>> >>>> >>>> >>>> >>>> Could you help me with this problem? >>>> >>>> Best, >>>> Felix >>>> >>>> -- >>>> Felix Frey >>>> >>>> Max Planck Institut f?r Pflanzenz?chtungsforschung >>>> Carl-von-Linn?-Weg 10 >>>> D-50829 K?ln >>>> +49 (0) 221-5062 405 >>>> >>>> -- output of sessionInfo(): >>>> >>>>> sessionInfo() >>>> R version 3.0.1 (2013-05-16) >>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>> >>>> locale: >>>> [1] C >>>> >>>> attached base packages: >>>> [1] parallel stats graphics grDevices utils datasets methods >>>> [8] base >>>> >>>> other attached packages: >>>> [1] easyRNASeq_1.6.0 ShortRead_1.18.0 latticeExtra_0.6-26 >>>> [4] RColorBrewer_1.0-5 Rsamtools_1.12.4 DESeq_1.12.1 >>>> [7] lattice_0.20-15 locfit_1.5-9.1 BSgenome_1.28.0 >>>> [10] GenomicRanges_1.12.5 Biostrings_2.28.0 IRanges_1.18.3 >>>> [13] edgeR_3.2.4 limma_3.16.7 biomaRt_2.16.0 >>>> [16] Biobase_2.20.1 genomeIntervals_1.16.0 BiocGenerics_0.6.0 >>>> [19] intervals_0.14.0 >>>> >>>> loaded via a namespace (and not attached): >>>> [1] AnnotationDbi_1.22.6 DBI_0.2-7 RCurl_1.95-4.1 >>>> [4] RSQLite_0.11.4 XML_3.98-1.1 annotate_1.38.0 >>>> [7] bitops_1.0-6 genefilter_1.42.0 geneplotter_1.38.0 >>>> [10] grid_3.0.1 hwriter_1.3 splines_3.0.1 >>>> [13] stats4_3.0.1 survival_2.37-4 tools_3.0.1 >>>> [16] xtable_1.7-1 zlibbioc_1.6.0 >>>> >>>> -- >>>> Sent via the guest posting facility at bioconductor.org. >>
ADD REPLY
0
Entering edit mode
Hej Felix! I suppose that you could use the gene_id for that. From what I've seen the biomaRt "ensembl_gene_id" and the "Parent" in the gff file below are similar. You could therefore from your query extract the "protein_coding" only and use these to subset the synthetic gene model gff file to contain only these. E.g. something along those lines should work: rngList <- rngList[names(rngList) %in% unique(exon.annotation$ensembl_gene_id)] You would then get an IRangeList that contains only the protein_coding gene and use that to create your synthetic gene model file. Cheers, Nico --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany --------------------------------------------------------------- On Sep 26, 2013, at 3:12 PM, Felix Frey wrote: > Hello Nico, > thank you very much for your effort! Probably the problem is that I used a different version in my alignment. I'm looking forward for the new version. > I executed the code you wrote and it seems to work although I do not understand each step completely! > > Furthermore I have another question regarding my actual aim for this analysis: > The actual reason, why I wanted to repeat this analysis, was that I wanted to include a filtering for protein-coding genes. For this reason I included "gene_biotype" and a filtering for protein_coding in my exon.annotation object: > > exon.annotation <- getBM(c("ensembl_gene_id", > "strand", > "ensembl_transcript_id", > "chromosome_name", > "ensembl_gene_id", > "ensembl_exon_id", > "exon_chrom_start", > "exon_chrom_end", > "gene_biotype"), > mart=ensembl, > filters=c("chromosome_name", "biotype"), > values=list(chrm, "protein_coding")) # filter only protein coding and "chrm" in chromosome! > # > exon.annotation$chromosome <- paste("chr",exon.annotation$chromosome_name,sep="") > exon.range <- RangedData(IRanges( > start=exon.annotation$exon_chrom_start, > end=exon.annotation$exon_chrom_end), > space=exon.annotation$chromosome, > strand=exon.annotation$strand, > transcript=exon.annotation$ensembl_transcript_id, > gene=exon.annotation$ensembl_gene_id, > exon=exon.annotation$ensembl_exon_id, > biotype=exon.annotation$gene_biotype, > universe = "NULL") > > Do you have any idea how to filter out non-protein-coding genes in your work around? > > Thanks again! > > Felix > > > > > On 26.09.2013 12:30, Nicolas Delhomme wrote: >> Hej Felix! >> >> I figured out what the error is: >> >> Here are the end of the exons located at the extremity of the 10 chromosomes: >> >> Browse[1]> sapply(lapply(ranges(geneModel(obj))[gm.sel],range),end) >> 1 10 2 3 4 5 6 7 8 9 >> 301409969 149550619 237864118 232184249 242011676 217905991 169336130 176810119 175341538 157013685 >> >> Here are the chromosome sizes as retrieved from your BAM file: >> >> Browse[1]> chrSize(obj) >> 1 10 2 3 4 5 6 7 8 9 Mt >> 301354135 150189435 237068873 232140174 241473504 217872852 169174353 176764762 175793759 156750706 569630 >> Pt Unknown >> 140384 7140151 >> >> As you can see below 8 of the chromosomal annotation - from biomaRt - have exons outside of the chromosomes - sizes derived from the BAM header. >> >> Browse[1]> sapply(lapply(ranges(geneModel(obj))[gm.sel],range),end)> chrSize(obj)[1:10] >> 1 10 2 3 4 5 6 7 8 9 >> TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE >> >> In short, you are using a different version of the genome for the alignment as the one you are retrieving the annotation from. >> >> easyRNASeq is obviously not behaving gracefully here, so I'll get a proper check implemented. That won't solve your issue though. What you need to figure out is which version of the genome/annotation to use and consistently doing so for your analyses. At the moment, since it would appear that the annotation have been lifted over a new genome, you would probably end up mis-assigning reads to intronic or extra- genic regions (hence not counting them) or worse assigning them to other genes. The easiest in your case is probably to get the gff3 or gtf file corresponding to the version of the genome you used for the alignments (e.g. from phytozome.org, retrieving the gene_exon.gff3 file) and use that for the easyRNASeq function with the parameter annotationMethod="gff" and annotationFile=[the path to the gff3] file. >> >> Btw, I've come to realize that the geneModel generation is painfully slow, luckily the new version due in October should fix that. It will actually introduces a large number of improvements both computationaly and memory wise. That new version will as well provide an easier biomaRt setup, so that you don't have to fetch the annotation yourself. >> >> I've actually looked into this a bit more and got it to work using the ggf3 file Zmays_181_gene_exons.gff3 retrieved from phytozome. The code is below with some inline comments. Actually, what I think is of highest interest for you is at the bottom of that code as I've written a work around to avoid the geneModel generation bottleneck I mentioned above. >> >> ## libs >> library(easyRNASeq) >> files<- c("excerpt_12a_sorted.bam", >> "excerpt_13a_sorted.bam") >> >> ## easyRNASeq v1.6 and lower >> ## is not very flexible in its gff support >> ## so we need to convert the gff file >> ## read the gff3 file and convert it into FlyBase like >> gff3<- readGff3("Zmays_181_gene_exons.gff3") >> >> ## get the transcript to gene mapping from the mRNA feature >> transcriptGeneMapping<- data.frame(getGffAttribute(gff3[type(gff3)=="mRNA",],"ID"), >> getGffAttribute(gff3[type(gff3)=="mRNA",],"Parent")) >> >> ## get the exon feature >> exons<- gff3[type(gff3)=="exon",] >> >> ## change their gffAttributes into FlyBase like >> ## where ID is geneID:exonNumber >> ## here, we create the new ID by retrieving the >> ## geneID from the mapping and appending the >> ## exon number extracted from the original ID >> ## and we paste that in front of the original >> ## gffAttribute which ID section has been >> ## removed >> exons$gffAttributes<- paste("ID=",transcriptGeneMapping[match(getGf fAttribute(exons,"Parent"), >> transcriptGeneMapping$ID),"Parent"], >> ":",sub(".*\\.","",getGffAttribute(exons,"ID")), >> ";",sub("^ID=.*;P","P",exons$gffAttributes), >> sep="") >> >> ## write down the modified file >> ## all these functionalities we used to >> ## play with the gff file are from the >> ## genomeIntervals package >> writeGff3(exons,file="Zmays_181_exons.gff3") >> >> ## the following now works >> ## I have removed the validity check >> ## because your BAM files and >> ## the above gff file use the same naming >> ## conventions for the chromosomes >> rnaSeq_excerpt<- easyRNASeq(filesDirectory=".", >> annotationMethod="gff", >> annotationFile="Zmays_181_exons.gff3", >> count="genes", >> filenames=files, >> summarization="geneModels", >> outputFormat="RNAseq") >> >> ## AND HERE is what you could be most interested in >> ## if you are only interested in gene models >> ## as the geneModel generation is at the moment >> ## painfully slow, here is a work around >> ## were we will flatten the exon structure to have >> ## a single "synthetic" super-transcript per gene >> >> ## get the exon ranges by gene >> rngList<- split(IRanges(start=exons[,1],end=exons[,2]), >> transcriptGeneMapping[match(getGffAttribute(exons,"Parent"), >> transcriptGeneMapping$ID),"Parent"]) >> >> ## flatten the gene exons >> ## this will create synthetic exons if two exons >> ## of the gene overlap >> rngList<- reduce(rngList) >> >> ## get the gff information >> ## here we simply duplicate the >> ## first exon of every gene >> ## by the number of synthetic exons >> ## per gene. The content will be updated >> ## afterwards. >> syntheticGeneModel<- exons[rep(match(transcriptGeneMapping$ID[match(names(rngList), >> transcriptGeneMapping$Parent)], >> getGffAttribute(exons,"Parent")),elementLengths(rngList)),] >> >> ## update the coordinates >> syntheticGeneModel[,1]<- unlist(start(rngList)) >> syntheticGeneModel[,2]<- unlist(end(rngList)) >> >> ## change the source >> levels(syntheticGeneModel$source)<- "inhouse" >> >> ## modify the gffAttributes >> ## get the exon number for the minus strand >> exonNumber<- sapply(elementLengths(rngList),":",1) >> ## reverse them on the plus strand >> sel<- strand(syntheticGeneModel)[cumsum(elementLengths(rngList))] == "+" >> exonNumber[sel]<- sapply(exonNumber[sel],rev) >> ## update the attributes >> syntheticGeneModel$gffAttributes<- paste("ID=",rep(names(rngList),elementLengths(rngList)), >> ":",unlist(exonNumber),";Parent=", >> rep(names(rngList),elementLengths(rngList)),".0",sep="") >> >> ## write the gff file containing only one transcript per gene >> ## this transcript consists of all existing exon loci for that gene >> writeGff3(syntheticGeneModel,file="Zmays_181_synthetic-gene- models.gff3") >> >> ## Now the trick is to count using transcripts >> ## as we have only one transcrit per gene model >> ## and circumvent the gene model generation >> ## altogether >> geneModel_excerpt<- easyRNASeq(filesDirectory=".", >> annotationMethod="gff", >> annotationFile="Zmays_181_synthetic- gene-models.gff3", >> count="transcripts", >> filenames=files) >> >> ## Note that this is still not the best >> ## but that's due to the annotation >> ## if you look at it you'll see that >> ## the original annotation contains >> ## exons that are 1bp in size... >> ## So you might want extra filtering :-) >> >> Let me know if can I help you any further, >> >> Cheers, >> >> Nico >> >> sessionInfo() >> >> R version 3.0.1 (2013-05-16) >> Platform: x86_64-apple-darwin10.8.0 (64-bit) >> >> locale: >> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] easyRNASeq_1.6.0 ShortRead_1.18.0 latticeExtra_0.6-26 RColorBrewer_1.0-5 >> [5] Rsamtools_1.12.4 DESeq_1.12.1 lattice_0.20-23 locfit_1.5-9.1 >> [9] BSgenome_1.28.0 GenomicRanges_1.12.5 Biostrings_2.28.0 IRanges_1.18.4 >> [13] edgeR_3.2.4 limma_3.16.8 biomaRt_2.16.0 Biobase_2.20.1 >> [17] genomeIntervals_1.16.0 BiocGenerics_0.6.0 intervals_0.14.0 >> >> loaded via a namespace (and not attached): >> [1] annotate_1.38.0 AnnotationDbi_1.22.6 bitops_1.0-6 DBI_0.2-7 genefilter_1.42.0 >> [6] geneplotter_1.38.0 grid_3.0.1 hwriter_1.3 RCurl_1.95-4.1 RSQLite_0.11.4 >> [11] splines_3.0.1 stats4_3.0.1 survival_2.37-4 tools_3.0.1 XML_3.98-1.1 >> [16] xtable_1.7-1 zlibbioc_1.6.0 >> >> --------------------------------------------------------------- >> Nicolas Delhomme >> >> Genome Biology Computational Support >> >> European Molecular Biology Laboratory >> >> Tel: +49 6221 387 8310 >> Email: nicolas.delhomme at embl.de >> Meyerhofstrasse 1 - Postfach 10.2209 >> 69102 Heidelberg, Germany >> --------------------------------------------------------------- >> >> >> >> >> >> On 13 Sep 2013, at 10:03, Felix Frey wrote: >> >>> Thank you for your answer, Nico! >>> >>> I can upload the data. I have 48 .bam files with each about 15GB. I don't know if this is possible in dropbox. If so, I could just upload 2 files, which are also mentionned below in the code. >>> >>> Best, >>> >>> Felix >>> >>> On 12.09.2013 17:38, Nicolas Delhomme wrote: >>>> Hej Felix! >>>> >>>> This is strange. Would you be able to share the data offline with me so that I can try to reproduce the issue? If it's possible I would create a drop box folder. >>>> >>>> Cheers, >>>> >>>> Nico >>>> >>>> --------------------------------------------------------------- >>>> Nicolas Delhomme >>>> >>>> Genome Biology Computational Support >>>> >>>> European Molecular Biology Laboratory >>>> >>>> Tel: +49 6221 387 8310 >>>> Email: nicolas.delhomme at embl.de >>>> Meyerhofstrasse 1 - Postfach 10.2209 >>>> 69102 Heidelberg, Germany >>>> --------------------------------------------------------------- >>>> >>>> >>>> >>>> >>>> >>>> On Sep 12, 2013, at 4:36 PM, Felix Frey [guest] wrote: >>>> >>>>> Hello, >>>>> I have sequencing data of maize which I aligned to the reference sequence (The maize sequence ZmB73_RefGen_v2.tar.gz, downloaded from: >>>>> # http://ftp.maizesequence.org/current/assembly/) using tophat (v2.0.3), bowtie2 (2.0.0.6) and samtools (0.1.18.0). >>>>> I want to make a count table with easyRNAseq (1.4.2) to use for EdgeR. >>>>> Some months ago, I did following analysis with easyRNASeq: >>>>> >>>>> chr.map<- data.frame( from=c("chr1","chr2","chr3","chr4","chr5" ,"chr6","chr7","chr8","chr9","chr10"), >>>>> to=c("1","2","3","4","5","6","7","8","9","10")) >>>>> ensembl<- useMart("ENSEMBL_MART_PLANT",dataset="zmays_eg_gene") >>>>> exon.annotation<- getBM(c("ensembl_gene_id", >>>>> "ensembl_transcript_id", >>>>> "chromosome_name", >>>>> "ensembl_gene_id", >>>>> "ensembl_exon_id", >>>>> "exon_chrom_start", >>>>> "exon_chrom_end"), >>>>> mart=ensembl, >>>>> filters="chromosome_name", >>>>> values=chrm) >>>>> exon.annotation$chromosome<- paste("chr",exon.annotation$chromosome_name,sep="") >>>>> exon.range<- RangedData(IRanges( >>>>> start=exon.annotation$exon_chrom_start, >>>>> end=exon.annotation$exon_chrom_end), >>>>> space=exon.annotation$chromosome, >>>>> strand=exon.annotation$strand, >>>>> transcript=exon.annotation$ensembl_transcript_id, >>>>> gene=exon.annotation$ensembl_gene_id, >>>>> exon=exon.annotation$ensembl_exon_id, >>>>> universe = "NULL") >>>>> >>>>> files<- c( "accepted_hits_12a_sorted.bam", >>>>> "accepted_hits_13a_sorted.bam",...) >>>>> >>>>> rnaSeq<- easyRNASeq(filesDirectory="/EASYRNASEQ/countdata", >>>>> gapped=F, >>>>> validity.check=TRUE, >>>>> chr.map=chr.map, >>>>> organism="custom", >>>>> annotationMethod="env", >>>>> annotationObject=exon.range, >>>>> count="genes", >>>>> filenames=files, >>>>> summarization="geneModels", >>>>> outputFormat="RNAseq") >>>>> >>>>> I got an output gene count table. >>>>> Now, I wanted to repeat the analysis, but leaving out the non- protein-coding genes, which I wanted to achieve by introducing the "biotype" and filtering by "protein-coding" in the exon.annotation object. >>>>> To check the analysis i performed it again like some months before and I got an error message, which I do not understand completely: >>>>> >>>>> >>>>> OUTPUT: >>>>> >>>>>> head( exon.annotation) >>>>> ensembl_gene_id strand ensembl_transcript_id chromosome_name >>>>> 1 GRMZM2G439951 1 GRMZM2G439951_T01 5 >>>>> 2 GRMZM2G094632 1 GRMZM2G094632_T02 5 >>>>> 3 GRMZM2G094632 1 GRMZM2G094632_T01 5 >>>>> 4 GRMZM2G094632 1 GRMZM2G094632_T01 5 >>>>> 5 GRMZM2G095090 1 GRMZM2G095090_T01 5 >>>>> 6 GRMZM2G095094 1 GRMZM2G095094_T01 5 >>>>> ensembl_gene_id.1 ensembl_exon_id exon_chrom_start exon_chrom_end >>>>> 1 GRMZM2G439951 GRMZM2G439951_E01 39542856 39544325 >>>>> 2 GRMZM2G094632 GRMZM2G094632_E01 39435878 39436448 >>>>> 3 GRMZM2G094632 GRMZM2G094632_E03 39435878 39436317 >>>>> 4 GRMZM2G094632 GRMZM2G094632_E02 39436730 39437134 >>>>> 5 GRMZM2G095090 GRMZM2G095090_E01 39427195 39427706 >>>>> 6 GRMZM2G095094 GRMZM2G095094_E01 39424253 39427105 >>>>> gene_biotype chromosome >>>>> 1 protein_coding chr5 >>>>> 2 protein_coding chr5 >>>>> 3 protein_coding chr5 >>>>> 4 protein_coding chr5 >>>>> 5 protein_coding chr5 >>>>> 6 protein_coding chr5 >>>>> >>>>>> head( exon.range) >>>>> RangedData with 6 rows and 5 value columns across 10 spaces >>>>> space ranges | strand transcript >>>>> <factor> <iranges> |<integer> <character> >>>>> 1 chr1 [301409811, 301409969] | -1 AC185612.3_FGT006 >>>>> 2 chr1 [301362751, 301363304] | -1 GRMZM5G884466_T03 >>>>> 3 chr1 [301353831, 301354022] | -1 GRMZM5G884466_T03 >>>>> 4 chr1 [301353664, 301353745] | -1 GRMZM5G884466_T03 >>>>> 5 chr1 [301353366, 301353557] | -1 GRMZM5G884466_T03 >>>>> 6 chr1 [301353147, 301353260] | -1 GRMZM5G884466_T03 >>>>> gene exon biotype >>>>> <character> <character> <character> >>>>> 1 AC185612.3_FG006 AC185612.3_FG006.exon1 protein_coding >>>>> 2 GRMZM5G884466 GRMZM5G884466_E12 protein_coding >>>>> 3 GRMZM5G884466 GRMZM5G884466_E03 protein_coding >>>>> 4 GRMZM5G884466 GRMZM5G884466_E04 protein_coding >>>>> 5 GRMZM5G884466 GRMZM5G884466_E08 protein_coding >>>>> 6 GRMZM5G884466 GRMZM5G884466_E10 protein_coding >>>>>> rnaSeq_new<- easyRNASeq(filesDirectory="/projects/irg/grp_stich /personal_folders/Felix/TranscriptomeProfiling/RNASeq/RNASeq/tuxedo/to phat/EASYRNASEQ/countdata", >>>>> + gapped=F, >>>>> + validity.check=TRUE, >>>>> + chr.map=chr.map, >>>>> + organism="custom", >>>>> + annotationMethod="env", >>>>> + annotationObject=exon.range, >>>>> + count="genes", >>>>> + filenames=files, >>>>> + summarization="geneModels", >>>>> + outputFormat="RNAseq") >>>>> Checking arguments... >>>>> Fetching annotations... >>>>> Computing gene models... >>>>> Summarizing counts... >>>>> Processing accepted_hits_12a_sorted.bam >>>>> Updating the read length information. >>>>> The reads are of 95 bp. >>>>> Warning message: >>>>> In easyRNASeq(filesDirectory = "/projects/irg/grp_stich/personal _folders/Felix/TranscriptomeProfiling/RNASeq/RNASeq/tuxedo/tophat/EASY RNASEQ/countdata", : >>>>> There are 8770 synthetic exons as determined from your annotation that overlap! This implies that some reads will be counted more than once! Is that really what you want? >>>>> Error in unlist(aggregate(readCoverage(obj)[names(geneModel(obj))[gm.sel]], : >>>>> error in evaluating the argument 'x' in selecting a method for function 'unlist': Error in .Call2("Rle_getStartEndRunAndOffset", x, start, end, PACKAGE = "IRanges") : >>>>> 'x' values larger than vector length 'sum(width)' >>>>> >>>>> >>>>> >>>>> >>>>> Could you help me with this problem? >>>>> >>>>> Best, >>>>> Felix >>>>> >>>>> -- >>>>> Felix Frey >>>>> >>>>> Max Planck Institut f?r Pflanzenz?chtungsforschung >>>>> Carl-von-Linn?-Weg 10 >>>>> D-50829 K?ln >>>>> +49 (0) 221-5062 405 >>>>> >>>>> -- output of sessionInfo(): >>>>> >>>>>> sessionInfo() >>>>> R version 3.0.1 (2013-05-16) >>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>> >>>>> locale: >>>>> [1] C >>>>> >>>>> attached base packages: >>>>> [1] parallel stats graphics grDevices utils datasets methods >>>>> [8] base >>>>> >>>>> other attached packages: >>>>> [1] easyRNASeq_1.6.0 ShortRead_1.18.0 latticeExtra_0.6-26 >>>>> [4] RColorBrewer_1.0-5 Rsamtools_1.12.4 DESeq_1.12.1 >>>>> [7] lattice_0.20-15 locfit_1.5-9.1 BSgenome_1.28.0 >>>>> [10] GenomicRanges_1.12.5 Biostrings_2.28.0 IRanges_1.18.3 >>>>> [13] edgeR_3.2.4 limma_3.16.7 biomaRt_2.16.0 >>>>> [16] Biobase_2.20.1 genomeIntervals_1.16.0 BiocGenerics_0.6.0 >>>>> [19] intervals_0.14.0 >>>>> >>>>> loaded via a namespace (and not attached): >>>>> [1] AnnotationDbi_1.22.6 DBI_0.2-7 RCurl_1.95-4.1 >>>>> [4] RSQLite_0.11.4 XML_3.98-1.1 annotate_1.38.0 >>>>> [7] bitops_1.0-6 genefilter_1.42.0 geneplotter_1.38.0 >>>>> [10] grid_3.0.1 hwriter_1.3 splines_3.0.1 >>>>> [13] stats4_3.0.1 survival_2.37-4 tools_3.0.1 >>>>> [16] xtable_1.7-1 zlibbioc_1.6.0 >>>>> >>>>> -- >>>>> Sent via the guest posting facility at bioconductor.org. >>>
ADD REPLY

Login before adding your answer.

Traffic: 282 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6