Annotation files for bacterial genome RNAseq

0

Entering edit mode

José Luis Lavín ▴ 80

@jose-luis-lavin-5797

Last seen 10.2 years ago

Dear list members, I have to analyze bacterial transcriptomic data and I have a doubt about how to proceed. I have downloaded the reference genome FASTA from the NCBI and also a gff file containing the annotation of that reference. I can map the reads to the genome and so on, but when the time comes to generate the table of counts for the Differential Expression (DE) analysis, I have no clear Idea on how to use the gff annotation file to assign reads to the genomic features. I've looked for solutions like HTSeq, but to my understanding this program will generate a table of counts per alignment file (for instance, one table per each bam file) which will require to merge all the independent tables one by one to generate the full table of count for the DE analysis... To sum up; Is there any R package that enable to generate a single Table of counts from multiple BAM files using an annotation gff file (or similar), for a genome that is not included in the UCSC catalog of reference organisms (as is the case of this bacteria I have to analyze)? Thanks in advance JL PD. I came across "Rsubread" package, but... package âRsubreadâ is not available (for R version 3.1.0) > sessionInfo()R version 3.1.0 (2014-04-10) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C [5] LC_TIME=English_United Kingdom.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] BiocInstaller_1.14.2 loaded via a namespace (and not attached): [1] tools_3.1.0 -- -- JosÃ© Luis LavÃn Trueba, Ph.D Genome Analysis Platform CIC bioGUNE Parque TecnolÃ³gico de Bizkaia Building 502, Floor 0 48160 Derio-Spain Tel.: +34 946 572 524 Fax: +34 946 568 732 [[alternative HTML version deleted]]

Alignment Annotation ASSIGN Alignment Annotation ASSIGN • 3.4k views

ADD COMMENT • link updated 10.4 years ago by James W. MacDonald 67k • written 10.4 years ago by José Luis Lavín ▴ 80

0

Entering edit mode

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 1 hour ago

United States

Hi Jose, On 6/20/2014 5:56 AM, Jos? Luis Lav?n wrote: > Dear list members, > > I have to analyze bacterial transcriptomic data and I have a doubt about > how to proceed. > > I have downloaded the reference genome FASTA from the NCBI and also a gff > file containing the annotation of that reference. I can map the reads to > the genome and so on, but when the time comes to generate the table of > counts for the Differential Expression (DE) analysis, I have no clear Idea > on how to use the gff annotation file to assign reads to the genomic > features. > > I've looked for solutions like HTSeq, but to my understanding this program > will generate a table of counts per alignment file (for instance, one table > per each bam file) which will require to merge all the independent tables > one by one to generate the full table of count for the DE analysis... > > To sum up; Is there any R package that enable to generate a single Table of > counts from multiple BAM files using an annotation gff file (or similar), > for a genome that is not included in the UCSC catalog of reference > organisms (as is the case of this bacteria I have to analyze)? As already mentioned, there is the easyRNASeq package. In addition, you could use a combination of 'base' Bioconductor packages to do this. library(GenomicFeatures) tx <- makeTranscriptDbFromGFF(<your gff="" file="">) You might need other arguments; I don't work with prokaryotes as a rule, so cannot advise, but you might need to say something about circular genomes and whatnot. align.to.this <- exonsBy(tx) or align.to.this <- transcriptsBy(tx) or align.to.this <- genes(tx) Again, you need to decide at what level you want to align, using your knowledge of prokaryotic biology to do 'the right thing'. library(Rsamtools) bfl <- BamFileList(<vector of="" bam="" files="">) olaps <- summarizeOverlaps(align.to.this, bfl) Then your counts are in assays(olaps)$counts You would be well served to read the vignettes for GenomicFeatures, Rsamtools, and probably GenomicRanges, IRanges and GenomeInfoDb. Best, Jim > > Thanks in advance > > JL > > PD. I came across "Rsubread" package, but... > > package ???Rsubread??? is not available (for R version 3.1.0) > >> sessionInfo()R version 3.1.0 (2014-04-10) > Platform: x86_64-w64-mingw32/x64 (64-bit) > > locale: > [1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United > Kingdom.1252 > [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C > [5] LC_TIME=English_United Kingdom.1252 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] BiocInstaller_1.14.2 > > loaded via a namespace (and not attached): > [1] tools_3.1.0 > > > > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD COMMENT • link 10.4 years ago James W. MacDonald 67k

0

Entering edit mode

Dear Jim, I've tried your recommendation, and it worked on the main bacterial genome gff, I'm finding problems with plasmid .gff files, while trying to join the gff from the 2 plasmids of the organism. Since the gff are from prokaryotic origin, no exon feature is included in this plasmids gff, so I get en error complaining about that... Here I paste the code (derived from that you wrote) , the error I get and the plasmid gff file (it's only a few lines long). CODE: library(GenomicFeatures) txdb_Abau_p1<- makeTranscriptDbFromGFF(file= "C:/R_data/NC_009083_pAB1.gff", format="gff3", exonRankAttributeName=NA, circ_seqs = DEFAULT_CIRC_SEQS, dataSource="gff file for bacterial plasmids", species="Acinetobacter baumannii", useGenesAsTranscripts=TRUE) if(interactive()) { saveDb(txdb_Abau_p1,file="A_baumanni_main.sqlite") } ERROR: extracting transcript informationExtracting gene IDsextracting transcript informationProcessing splicing information for gff3 file.Error in .prepareGFF3Fragments(data, type = "exon") : No exon information present in gff file HEAD of the GFF file: ##gff-version 3 #!gff-spec-version 1.20 #!processor NCBI annotwriter NC_009084_pAB2 RefSeq region 1 11302 . + . ID=id0;Name=pAB2;Dbxref=ATCC:17978,taxon:400667;gbkey=Src;genome=plasm id;mol_type=genomic DNA;plasmid-name=pAB2;strain=ATCC 17978 NC_009083_pAB1 RefSeq region 1 13408 . + . ID=id0;Name=pAB1;Dbxref=ATCC:17978,taxon:400667;gbkey=Src;genome=plasm id;mol_type=genomic DNA;plasmid-name=pAB1;strain=ATCC 17978 NC_009083_pAB1 RefSeq gene 1 957 . + . ID=gene0_pAB1;Name=A1S_3471;Dbxref=GeneID:4916888;gbkey=Gene;locus_tag =A1S_3471 NC_009083_pAB1 RefSeq CDS 1 957 . + 0 ID=cds0_pAB1;Name=YP_001083084.1;Parent=gene0_pAB1;Dbxref=Genbank:YP_0 01083084.1,GeneID:4916888;gbkey=CDS;product=hypothetical protein;protein_id=YP_001083084.1;transl_table=11 SessionInfo() R version 3.1.0 (2014-04-10) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C [5] LC_TIME=English_United Kingdom.1252 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] GenomicFeatures_1.16.2 AnnotationDbi_1.26.0 Biobase_2.24.0 [4] GenomicRanges_1.16.3 GenomeInfoDb_1.0.2 IRanges_1.22.9 [7] BiocGenerics_0.10.0 loaded via a namespace (and not attached): [1] BatchJobs_1.2 BBmisc_1.6 BiocParallel_0.6.1 [4] biomaRt_2.20.0 Biostrings_2.32.0 bitops_1.0-6 [7] brew_1.0-6 BSgenome_1.32.0 codetools_0.2-8 [10] DBI_0.2-7 digest_0.6.4 fail_1.2 [13] foreach_1.4.2 GenomicAlignments_1.0.1 iterators_1.0.7 [16] plyr_1.8.1 Rcpp_0.11.2 RCurl_1.95-4.1 [19] Rsamtools_1.16.1 RSQLite_0.11.4 rtracklayer_1.24.2 [22] sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2 [25] tools_3.1.0 XML_3.98-1.1 XVector_0.4.0 [28] zlibbioc_1.10.0 2014-06-20 16:07 GMT+02:00 James W. MacDonald <jmacdon@uw.edu>: > Hi Jose, > > > On 6/20/2014 5:56 AM, JosÃ© Luis LavÃn wrote: > >> Dear list members, >> >> I have to analyze bacterial transcriptomic data and I have a doubt about >> how to proceed. >> >> I have downloaded the reference genome FASTA from the NCBI and also a gff >> file containing the annotation of that reference. I can map the reads to >> the genome and so on, but when the time comes to generate the table of >> counts for the Differential Expression (DE) analysis, I have no clear Idea >> on how to use the gff annotation file to assign reads to the genomic >> features. >> >> I've looked for solutions like HTSeq, but to my understanding this program >> will generate a table of counts per alignment file (for instance, one >> table >> per each bam file) which will require to merge all the independent tables >> one by one to generate the full table of count for the DE analysis... >> >> To sum up; Is there any R package that enable to generate a single Table >> of >> counts from multiple BAM files using an annotation gff file (or similar), >> for a genome that is not included in the UCSC catalog of reference >> organisms (as is the case of this bacteria I have to analyze)? >> > > > As already mentioned, there is the easyRNASeq package. In addition, you > could use a combination of 'base' Bioconductor packages to do this. > > library(GenomicFeatures) > tx <- makeTranscriptDbFromGFF(<your gff="" file="">) > > You might need other arguments; I don't work with prokaryotes as a rule, > so cannot advise, but you might need to say something about circular > genomes and whatnot. > > align.to.this <- exonsBy(tx) > or > align.to.this <- transcriptsBy(tx) > or > align.to.this <- genes(tx) > > Again, you need to decide at what level you want to align, using your > knowledge of prokaryotic biology to do 'the right thing'. > > library(Rsamtools) > bfl <- BamFileList(<vector of="" bam="" files="">) > olaps <- summarizeOverlaps(align.to.this, bfl) > > Then your counts are in > > assays(olaps)$counts > > You would be well served to read the vignettes for GenomicFeatures, > Rsamtools, and probably GenomicRanges, IRanges and GenomeInfoDb. > > Best, > > Jim > > > >> Thanks in advance >> >> JL >> >> PD. I came across "Rsubread" package, but... >> >> package Ã¢â¬ËRsubreadÃ¢â¬â¢ is not available (for R version 3.1.0) >> >> sessionInfo()R version 3.1.0 (2014-04-10) >>> >> Platform: x86_64-w64-mingw32/x64 (64-bit) >> >> locale: >> [1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United >> Kingdom.1252 >> [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C >> [5] LC_TIME=English_United Kingdom.1252 >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] BiocInstaller_1.14.2 >> >> loaded via a namespace (and not attached): >> [1] tools_3.1.0 >> >> >> >> >> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane. >> science.biology.informatics.conductor >> >> > -- > James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 > -- -- JosÃ© Luis LavÃn Trueba, Ph.D Genome Analysis Platform CIC bioGUNE Parque TecnolÃ³gico de Bizkaia Building 502, Floor 0 48160 Derio-Spain Tel.: +34 946 572 524 Fax: +34 946 568 732 [[alternative HTML version deleted]]

ADD REPLY • link 10.4 years ago José Luis Lavín ▴ 80

0

Entering edit mode

Hello again, I finally managed to "correct" the gff file enough as to get rid of the duplications on it (I never thought this format could be such a pain, maybe its inexperience with it, but it took me hours to correct the errors) and generate the TranscriptDB from it . But now I get some warnings after creating the TranscriptDB and when I load it there seem to be transcripts on in (created from the use of the genes, I presume) Then I generate a Genomic Ranges List and proceed to summarizeOverlaps. And to my surprise, the Counts Table has 0 read counts for every gene ... Any idea on what did I do wrong? Thanks in advance once more ;) Here's the code and the messages from R: library(GenomicFeatures) library(GenomicAlignments) library(Rsamtools) txdb_Abau <- makeTranscriptDbFromGFF(file= "C:/R_data/A_bau_full.gff", format="gff3", exonRankAttributeName=NA, circ_seqs = c("NC_009085_1","NC_009084_pAB2","NC_009083_pAB1"), dataSource="gff file fron A_baumanii", species="Acinetobaster baumannii", useGenesAsTranscripts=TRUE) if(interactive()) { saveDb(txdb_Abau,file="A_baumanni_full.sqlite") } extracting transcript information Extracting gene IDs extracting transcript information Processing splicing information for gff3 file. Deducing exon rank from relative coordinates provided Prepare the 'metadata' data frame ... metadata: OK Now generating chrominfo from available sequence names. No chromosome length information is available. Warning messages: 1: In .local(con, format, text, ...) : gff-version directive indicates version is 3 , not 3 2: In .deduceExonRankings(exs, format = "gff") : Infering Exon Rankings. If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName 3: In .normargSplicings(splicings, transcripts_tx_id) : no CDS information for this TranscriptDb object txdb<-loadDb("A_baumanni_full2.sqlite") txdb TranscriptDb object: | Db type: TranscriptDb | Supporting package: GenomicFeatures | Data source: gff file for A_baumannii | Organism: Acinetobaster baumannii | miRBase build ID: NA | transcript_nrow: 3469 | exon_nrow: 0 | cds_nrow: 0 | Db created by: GenomicFeatures package from Bioconductor | Creation time: 2014-06-24 11:04:56 +0200 (Tue, 24 Jun 2014) | GenomicFeatures version at creation time: 1.16.2 | RSQLite version at creation time: 0.11.4 | DBSCHEMAVERSION: 1.0 GRList<-transcriptsBy(txdb, by = "gene") #Create a BamFileList of files: bam_files <- list.files(pattern="*.bam") bfl <- BamFileList(bam_files) olaps <- summarizeOverlaps(GRList, bfl) ct<-assays(olaps)$counts head(ct) BIOFILM2_R1_SE_sorted.bam gene0 0 gene0_pAB1 0 gene0_pAB2 0 gene1 0 gene1_pAB1 0 gene1_pAB2 0 > sessionInfo() R version 3.1.0 (2014-04-10) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C [5] LC_TIME=English_United Kingdom.1252 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] GenomicAlignments_1.0.1 BSgenome_1.32.0 Rsamtools_1.16.1 Biostrings_2.32.0 [5] XVector_0.4.0 GenomicFeatures_1.16.2 AnnotationDbi_1.26.0 Biobase_2.24.0 [9] GenomicRanges_1.16.3 GenomeInfoDb_1.0.2 IRanges_1.22.9 BiocGenerics_0.10.0 loaded via a namespace (and not attached): [1] BatchJobs_1.2 BBmisc_1.6 BiocParallel_0.6.1 biomaRt_2.20.0 bitops_1.0-6 brew_1.0-6 [7] codetools_0.2-8 DBI_0.2-7 digest_0.6.4 fail_1.2 foreach_1.4.2 iterators_1.0.7 [13] plyr_1.8.1 Rcpp_0.11.2 RCurl_1.95-4.1 RSQLite_0.11.4 rtracklayer_1.24.2 sendmailR_1.1-2 [19] stats4_3.1.0 stringr_0.6.2 tools_3.1.0 XML_3.98-1.1 zlibbioc_1.10.0 2014-06-23 15:07 GMT+02:00 JosÃ© Luis LavÃn <joluito@gmail.com>: > Dear Jim, > > I've tried your recommendation, and it worked on the main bacterial genome > gff, I'm finding problems with plasmid .gff files, while trying to join the > gff from the 2 plasmids of the organism. Since the gff are from prokaryotic > origin, no exon feature is included in this plasmids gff, so I get en error > complaining about that... > > Here I paste the code (derived from that you wrote) , the error I get and > the plasmid gff file (it's only a few lines long). > > > CODE: > > library(GenomicFeatures) > > txdb_Abau_p1<- makeTranscriptDbFromGFF(file= > "C:/R_data/NC_009083_pAB1.gff", > format="gff3", > exonRankAttributeName=NA, > circ_seqs = DEFAULT_CIRC_SEQS, > dataSource="gff file for bacterial > plasmids", > species="Acinetobacter baumannii", > useGenesAsTranscripts=TRUE) > if(interactive()) { > saveDb(txdb_Abau_p1,file="A_baumanni_main.sqlite") > } > > ERROR: > > extracting transcript informationExtracting gene IDsextracting transcript informationProcessing splicing information for gff3 file.Error in .prepareGFF3Fragments(data, type = "exon") : > No exon information present in gff file > > > > HEAD of the GFF file: > > ##gff-version 3 > #!gff-spec-version 1.20 > #!processor NCBI annotwriter > NC_009084_pAB2 RefSeq region 1 11302 . + . ID=id0;Name=pAB2;Dbxref=ATCC:17978,taxon:400667;gbkey=Src;genome=pla smid;mol_type=genomic DNA;plasmid-name=pAB2;strain=ATCC 17978 > > NC_009083_pAB1 RefSeq region 1 13408 . + . ID=id0;Name=pAB1;Dbxref=ATCC:17978,taxon:400667;gbkey=Src;geno me=plasmid;mol_type=genomic DNA;plasmid-name=pAB1;strain=ATCC 17978 > NC_009083_pAB1 RefSeq gene 1 957 . + . ID=gene0_pAB1;Name=A1S_3471;Dbxref=GeneID:4916888;gbkey=Gene;l ocus_tag=A1S_3471 > > NC_009083_pAB1 RefSeq CDS 1 957 . + 0 ID=cds0_pAB1;Name=YP_001083084.1;Parent=gene0_pAB1;Dbxref=Genb ank:YP_001083084.1,GeneID:4916888;gbkey=CDS;product=hypothetical protein;protein_id=YP_001083084.1;transl_table=11 > > > > SessionInfo() > > R version 3.1.0 (2014-04-10) > Platform: x86_64-w64-mingw32/x64 (64-bit) > > locale: > [1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 > [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C > [5] LC_TIME=English_United Kingdom.1252 > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods base > > other attached packages: > [1] GenomicFeatures_1.16.2 AnnotationDbi_1.26.0 Biobase_2.24.0 > [4] GenomicRanges_1.16.3 GenomeInfoDb_1.0.2 IRanges_1.22.9 > [7] BiocGenerics_0.10.0 > > loaded via a namespace (and not attached): > [1] BatchJobs_1.2 BBmisc_1.6 BiocParallel_0.6.1 > [4] biomaRt_2.20.0 Biostrings_2.32.0 bitops_1.0-6 > [7] brew_1.0-6 BSgenome_1.32.0 codetools_0.2-8 > [10] DBI_0.2-7 digest_0.6.4 fail_1.2 > [13] foreach_1.4.2 GenomicAlignments_1.0.1 iterators_1.0.7 > [16] plyr_1.8.1 Rcpp_0.11.2 RCurl_1.95-4.1 > [19] Rsamtools_1.16.1 RSQLite_0.11.4 rtracklayer_1.24.2 > [22] sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2 > [25] tools_3.1.0 XML_3.98-1.1 XVector_0.4.0 > [28] zlibbioc_1.10.0 > > > > 2014-06-20 16:07 GMT+02:00 James W. MacDonald <jmacdon@uw.edu>: > > Hi Jose, >> >> >> On 6/20/2014 5:56 AM, JosÃ© Luis LavÃn wrote: >> >>> Dear list members, >>> >>> I have to analyze bacterial transcriptomic data and I have a doubt about >>> how to proceed. >>> >>> I have downloaded the reference genome FASTA from the NCBI and also a gff >>> file containing the annotation of that reference. I can map the reads to >>> the genome and so on, but when the time comes to generate the table of >>> counts for the Differential Expression (DE) analysis, I have no clear >>> Idea >>> on how to use the gff annotation file to assign reads to the genomic >>> features. >>> >>> I've looked for solutions like HTSeq, but to my understanding this >>> program >>> will generate a table of counts per alignment file (for instance, one >>> table >>> per each bam file) which will require to merge all the independent tables >>> one by one to generate the full table of count for the DE analysis... >>> >>> To sum up; Is there any R package that enable to generate a single Table >>> of >>> counts from multiple BAM files using an annotation gff file (or similar), >>> for a genome that is not included in the UCSC catalog of reference >>> organisms (as is the case of this bacteria I have to analyze)? >>> >> >> >> As already mentioned, there is the easyRNASeq package. In addition, you >> could use a combination of 'base' Bioconductor packages to do this. >> >> library(GenomicFeatures) >> tx <- makeTranscriptDbFromGFF(<your gff="" file="">) >> >> You might need other arguments; I don't work with prokaryotes as a rule, >> so cannot advise, but you might need to say something about circular >> genomes and whatnot. >> >> align.to.this <- exonsBy(tx) >> or >> align.to.this <- transcriptsBy(tx) >> or >> align.to.this <- genes(tx) >> >> Again, you need to decide at what level you want to align, using your >> knowledge of prokaryotic biology to do 'the right thing'. >> >> library(Rsamtools) >> bfl <- BamFileList(<vector of="" bam="" files="">) >> olaps <- summarizeOverlaps(align.to.this, bfl) >> >> Then your counts are in >> >> assays(olaps)$counts >> >> You would be well served to read the vignettes for GenomicFeatures, >> Rsamtools, and probably GenomicRanges, IRanges and GenomeInfoDb. >> >> Best, >> >> Jim >> >> >> >>> Thanks in advance >>> >>> JL >>> >>> PD. I came across "Rsubread" package, but... >>> >>> package Ã¢â¬ËRsubreadÃ¢â¬â¢ is not available (for R version 3.1.0) >>> >>> sessionInfo()R version 3.1.0 (2014-04-10) >>>> >>> Platform: x86_64-w64-mingw32/x64 (64-bit) >>> >>> locale: >>> [1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United >>> Kingdom.1252 >>> [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C >>> [5] LC_TIME=English_United Kingdom.1252 >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods base >>> >>> other attached packages: >>> [1] BiocInstaller_1.14.2 >>> >>> loaded via a namespace (and not attached): >>> [1] tools_3.1.0 >>> >>> >>> >>> >>> >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane. >>> science.biology.informatics.conductor >>> >>> >> -- >> James W. MacDonald, M.S. >> Biostatistician >> University of Washington >> Environmental and Occupational Health Sciences >> 4225 Roosevelt Way NE, # 100 >> Seattle WA 98105-6099 >> > > > > -- > -- > JosÃ© Luis LavÃn Trueba, Ph.D > > Genome Analysis Platform > CIC bioGUNE > Parque TecnolÃ³gico de Bizkaia > Building 502, Floor 0 > 48160 Derio-Spain > > Tel.: +34 946 572 524 > Fax: +34 946 568 732 > -- -- JosÃ© Luis LavÃn Trueba, Ph.D Genome Analysis Platform CIC bioGUNE Parque TecnolÃ³gico de Bizkaia Building 502, Floor 0 48160 Derio-Spain Tel.: +34 946 572 524 Fax: +34 946 568 732 [[alternative HTML version deleted]]

ADD REPLY • link 10.4 years ago José Luis Lavín ▴ 80

0

Entering edit mode

Hi Jos?, On 6/24/2014 5:53 AM, Jos? Luis Lav?n wrote: > Hello again, > > I finally managed to "correct" the gff file enough as to get rid of the > duplications on it (I never thought this format could be such a pain, > maybe its inexperience with it, but it took me hours to correct the > errors) and generate the TranscriptDB from it . > But now I get some warnings after creating the TranscriptDB and when I > load it there seem to be transcripts on in (created from the use of the > genes, I presume) > Then I generate a Genomic Ranges List and proceed to summarizeOverlaps. > And to my surprise, the Counts Table has 0 read counts for every gene ... > > Any idea on what did I do wrong? Not really. But here are some possibilities. First, I am assuming you aligned against the A. baumannii genome? And you did get some reads that aligned? And the gff file you are using corresponds to the genome you aligned against? Did you align against a genome, or a bunch of contigs? The NCBI website seems to just have contigs for A. baumannii. You have to ensure that the TxDb and whatever you aligned against have the same the same names for the chromosomes/contigs, and use the same counting scheme. As an example, are the genes really called Gene0, Gene1, etc? You can use e.g., samtools (or Rsamtools; not sure if samtools will work on Windows. That is a problematic OS for genomics work, and if you have access to either Linux or MacOS, you might consider switching.) to see where your reads are aligning, which will give you a good hint. Best, Jim > > Thanks in advance once more ;) > > Here's the code and the messages from R: > > > library(GenomicFeatures) > library(GenomicAlignments) > library(Rsamtools) > > txdb_Abau <- makeTranscriptDbFromGFF(file= "C:/R_data/A_bau_full.gff", > format="gff3", > exonRankAttributeName=NA, > circ_seqs = > c("NC_009085_1","NC_009084_pAB2","NC_009083_pAB1"), > dataSource="gff file fron A_baumanii", > species="Acinetobaster baumannii", > useGenesAsTranscripts=TRUE) > if(interactive()) { > saveDb(txdb_Abau,file="A_baumanni_full.sqlite") > } > > > extracting transcript information > Extracting gene IDs > extracting transcript information > Processing splicing information for gff3 file. > Deducing exon rank from relative coordinates provided > Prepare the 'metadata' data frame ... metadata: OK > Now generating chrominfo from available sequence names. No chromosome > length information is available. > Warning messages: > 1: In .local(con, format, text, ...) : > gff-version directive indicates version is 3 > , not 3 > 2: In .deduceExonRankings(exs, format = "gff") : > Infering Exon Rankings. If this is not what you expected, then > please be sure that you have provided a valid attribute for > exonRankAttributeName > 3: In .normargSplicings(splicings, transcripts_tx_id) : > no CDS information for this TranscriptDb object > > txdb<-loadDb("A_baumanni_full2.sqlite") > txdb > > TranscriptDb object: > | Db type: TranscriptDb > | Supporting package: GenomicFeatures > | Data source: gff file for A_baumannii > | Organism: Acinetobaster baumannii > | miRBase build ID: NA > | transcript_nrow: 3469 > | exon_nrow: 0 > | cds_nrow: 0 > | Db created by: GenomicFeatures package from Bioconductor > | Creation time: 2014-06-24 11:04:56 +0200 (Tue, 24 Jun 2014) > | GenomicFeatures version at creation time: 1.16.2 > | RSQLite version at creation time: 0.11.4 > | DBSCHEMAVERSION: 1.0 > > GRList<-transcriptsBy(txdb, by = "gene") > > #Create a BamFileList of files: > bam_files <- list.files(pattern="*.bam") > bfl <- BamFileList(bam_files) > > olaps <- summarizeOverlaps(GRList, bfl) > ct<-assays(olaps)$counts > > > head(ct) > BIOFILM2_R1_SE_sorted.bam > gene0 0 > gene0_pAB1 0 > gene0_pAB2 0 > gene1 0 > gene1_pAB1 0 > gene1_pAB2 0 > > > > sessionInfo() > R version 3.1.0 (2014-04-10) > Platform: x86_64-w64-mingw32/x64 (64-bit) > > locale: > [1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United > Kingdom.1252 > [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C > [5] LC_TIME=English_United Kingdom.1252 > > attached base packages: > [1] parallel stats graphics grDevices utils datasets > methods base > > other attached packages: > [1] GenomicAlignments_1.0.1 BSgenome_1.32.0 > Rsamtools_1.16.1 Biostrings_2.32.0 > [5] XVector_0.4.0 GenomicFeatures_1.16.2 > AnnotationDbi_1.26.0 Biobase_2.24.0 > [9] GenomicRanges_1.16.3 GenomeInfoDb_1.0.2 > IRanges_1.22.9 BiocGenerics_0.10.0 > > loaded via a namespace (and not attached): > [1] BatchJobs_1.2 BBmisc_1.6 BiocParallel_0.6.1 > biomaRt_2.20.0 bitops_1.0-6 brew_1.0-6 > [7] codetools_0.2-8 DBI_0.2-7 digest_0.6.4 > fail_1.2 foreach_1.4.2 iterators_1.0.7 > [13] plyr_1.8.1 Rcpp_0.11.2 RCurl_1.95-4.1 > RSQLite_0.11.4 rtracklayer_1.24.2 sendmailR_1.1-2 > [19] stats4_3.1.0 stringr_0.6.2 tools_3.1.0 > XML_3.98-1.1 zlibbioc_1.10.0 > > > > > 2014-06-23 15:07 GMT+02:00 Jos? Luis Lav?n <joluito at="" gmail.com=""> <mailto:joluito at="" gmail.com="">>: > > Dear Jim, > > I've tried your recommendation, and it worked on the main bacterial > genome gff, I'm finding problems with plasmid .gff files, while > trying to join the gff from the 2 plasmids of the organism. Since > the gff are from prokaryotic origin, no exon feature is included in > this plasmids gff, so I get en error complaining about that... > > Here I paste the code (derived from that you wrote) , the error I > get and the plasmid gff file (it's only a few lines long). > > > CODE: > > library(GenomicFeatures) > > txdb_Abau_p1<- makeTranscriptDbFromGFF(file= > "C:/R_data/NC_009083_pAB1.gff", > format="gff3", > exonRankAttributeName=NA, > circ_seqs = DEFAULT_CIRC_SEQS, > dataSource="gff file for bacterial > plasmids", > species="Acinetobacter baumannii", > useGenesAsTranscripts=TRUE) > if(interactive()) { > saveDb(txdb_Abau_p1,file="A_baumanni_main.sqlite") > } > > ERROR: > > extracting transcript information > Extracting gene IDs > extracting transcript information > Processing splicing information for gff3 file. > Error in .prepareGFF3Fragments(data, type = "exon") : > No exon information present in gff file > > > > HEAD of the GFF file: > > ##gff-version 3 > #!gff-spec-version 1.20 > #!processor NCBI annotwriter > NC_009084_pAB2 RefSeq region 1 11302 . + . ID=id0;Name=pAB2;Dbxref=ATCC:17978,taxon:400667;gbkey=Src;genome=pla smid;mol_type=genomic DNA;plasmid-name=pAB2;strain=ATCC 17978 > > > NC_009083_pAB1 RefSeq region 1 13408 . + . ID=id0;Name=pAB1;Dbxref=ATCC:17978,taxon:400667;gbkey=Src;geno me=plasmid;mol_type=genomic DNA;plasmid-name=pAB1;strain=ATCC 17978 > NC_009083_pAB1 RefSeq gene 1 957 . + . ID=gene0_pAB1;Name=A1S_3471;Dbxref=GeneID:4916888;gbkey=Gene;l ocus_tag=A1S_3471 > > > NC_009083_pAB1 RefSeq CDS 1 957 . + 0 ID=cds0_pAB1;Name=YP_001083084.1;Parent=gene0_pAB1;Dbxref=Genb ank:YP_001083084.1,GeneID:4916888;gbkey=CDS;product=hypothetical protein;protein_id=YP_001083084.1;transl_table=11 > > > > > > SessionInfo() > > R version 3.1.0 (2014-04-10) > Platform: x86_64-w64-mingw32/x64 (64-bit) > > locale: > [1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 > [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C > [5] LC_TIME=English_United Kingdom.1252 > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods base > > other attached packages: > [1] GenomicFeatures_1.16.2 AnnotationDbi_1.26.0 Biobase_2.24.0 > [4] GenomicRanges_1.16.3 GenomeInfoDb_1.0.2 IRanges_1.22.9 > [7] BiocGenerics_0.10.0 > > loaded via a namespace (and not attached): > [1] BatchJobs_1.2 BBmisc_1.6 BiocParallel_0.6.1 > [4] biomaRt_2.20.0 Biostrings_2.32.0 bitops_1.0-6 > [7] brew_1.0-6 BSgenome_1.32.0 codetools_0.2-8 > [10] DBI_0.2-7 digest_0.6.4 fail_1.2 > [13] foreach_1.4.2 GenomicAlignments_1.0.1 iterators_1.0.7 > [16] plyr_1.8.1 Rcpp_0.11.2 RCurl_1.95-4.1 > [19] Rsamtools_1.16.1 RSQLite_0.11.4 rtracklayer_1.24.2 > [22] sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2 > [25] tools_3.1.0 XML_3.98-1.1 XVector_0.4.0 > [28] zlibbioc_1.10.0 > > > > 2014-06-20 16:07 GMT+02:00 James W. MacDonald <jmacdon at="" uw.edu=""> <mailto:jmacdon at="" uw.edu="">>: > > Hi Jose, > > > On 6/20/2014 5:56 AM, Jos? Luis Lav?n wrote: > > Dear list members, > > I have to analyze bacterial transcriptomic data and I have a > doubt about > how to proceed. > > I have downloaded the reference genome FASTA from the NCBI > and also a gff > file containing the annotation of that reference. I can map > the reads to > the genome and so on, but when the time comes to generate > the table of > counts for the Differential Expression (DE) analysis, I have > no clear Idea > on how to use the gff annotation file to assign reads to the > genomic > features. > > I've looked for solutions like HTSeq, but to my > understanding this program > will generate a table of counts per alignment file (for > instance, one table > per each bam file) which will require to merge all the > independent tables > one by one to generate the full table of count for the DE > analysis... > > To sum up; Is there any R package that enable to generate a > single Table of > counts from multiple BAM files using an annotation gff file > (or similar), > for a genome that is not included in the UCSC catalog of > reference > organisms (as is the case of this bacteria I have to analyze)? > > > > As already mentioned, there is the easyRNASeq package. In > addition, you could use a combination of 'base' Bioconductor > packages to do this. > > library(GenomicFeatures) > tx <- makeTranscriptDbFromGFF(<your gff="" file="">) > > You might need other arguments; I don't work with prokaryotes as > a rule, so cannot advise, but you might need to say something > about circular genomes and whatnot. > > align.to.this <- exonsBy(tx) > or > align.to.this <- transcriptsBy(tx) > or > align.to.this <- genes(tx) > > Again, you need to decide at what level you want to align, using > your knowledge of prokaryotic biology to do 'the right thing'. > > library(Rsamtools) > bfl <- BamFileList(<vector of="" bam="" files="">) > olaps <- summarizeOverlapsalign.to <http: align.to="">.__this, bfl) > > Then your counts are in > > assays(olaps)$counts > > You would be well served to read the vignettes for > GenomicFeatures, Rsamtools, and probably GenomicRanges, IRanges > and GenomeInfoDb. > > Best, > > Jim > > > > Thanks in advance > > JL > > PD. I came across "Rsubread" package, but... > > package ???Rsubread??? is not available (for R version 3.1.0) > > sessionInfo()R version 3.1.0 (2014-04-10) > > Platform: x86_64-w64-mingw32/x64 (64-bit) > > locale: > [1] LC_COLLATE=English_United Kingdom.1252 > LC_CTYPE=English_United > Kingdom.1252 > [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C > [5] LC_TIME=English_United Kingdom.1252 > > attached base packages: > [1] stats graphics grDevices utils datasets > methods base > > other attached packages: > [1] BiocInstaller_1.14.2 > > loaded via a namespace (and not attached): > [1] tools_3.1.0 > > > > > > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > > -- > James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 > > > > > -- > -- > Jos? Luis Lav?n Trueba, Ph.D > > Genome Analysis Platform > CIC bioGUNE > Parque Tecnol?gico de Bizkaia > Building 502, Floor 0 > 48160 Derio-Spain > > Tel.: +34 946 572 524 <tel:%2b34%20946%20572%20524> > Fax: +34 946 568 732 <tel:%2b34%20946%20568%20732> > > > > > -- > -- > Jos? Luis Lav?n Trueba, Ph.D > > Genome Analysis Platform > CIC bioGUNE > Parque Tecnol?gico de Bizkaia > Building 502, Floor 0 > 48160 Derio-Spain > > Tel.: +34 946 572 524 > Fax: +34 946 568 732 -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD REPLY • link 10.4 years ago James W. MacDonald 67k

0

Entering edit mode

José Luis Lavín ▴ 80

@jose-luis-lavin-5797

Last seen 10.2 years ago

First of all I want to thank Michael for his swift answer. Then I have a concern about this, Am I wrong or the package "easyRNASeq" is dramatically lacking an update at its bioconductor page. I mean that manual script and vignette are not up to date (at least R complains when I try to use functions such as "easyRNASeq" described on them). Any other alternatives that can be used for the table of count generation task ? 2014-06-20 12:30 GMT+02:00 Michael Dondrup <michael.dondrup@ii.uib.no>: > Dear JosÃ©, > > easyRNAseq can do that. > > http://www.bioconductor.org/packages/release/bioc/html/easyRNASeq.html > > Best > Michael > > On Jun 20, 2014, at 11:56 AM, JosÃ© Luis LavÃn wrote: > > > Dear list members, > > > > I have to analyze bacterial transcriptomic data and I have a doubt about > > how to proceed. > > > > I have downloaded the reference genome FASTA from the NCBI and also a gff > > file containing the annotation of that reference. I can map the reads to > > the genome and so on, but when the time comes to generate the table of > > counts for the Differential Expression (DE) analysis, I have no clear > Idea > > on how to use the gff annotation file to assign reads to the genomic > > features. > > > > I've looked for solutions like HTSeq, but to my understanding this > program > > will generate a table of counts per alignment file (for instance, one > table > > per each bam file) which will require to merge all the independent tables > > one by one to generate the full table of count for the DE analysis... > > > > To sum up; Is there any R package that enable to generate a single Table > of > > counts from multiple BAM files using an annotation gff file (or similar), > > for a genome that is not included in the UCSC catalog of reference > > organisms (as is the case of this bacteria I have to analyze)? > > > > Thanks in advance > > > > JL > > > > PD. I came across "Rsubread" package, but... > > > > package âRsubreadâ is not available (for R version 3.1.0) > > > >> sessionInfo()R version 3.1.0 (2014-04-10) > > Platform: x86_64-w64-mingw32/x64 (64-bit) > > > > locale: > > [1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United > > Kingdom.1252 > > [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C > > [5] LC_TIME=English_United Kingdom.1252 > > > > attached base packages: > > [1] stats graphics grDevices utils datasets methods base > > > > other attached packages: > > [1] BiocInstaller_1.14.2 > > > > loaded via a namespace (and not attached): > > [1] tools_3.1.0 > > > > > > > > > > -- > > -- > > JosÃ© Luis LavÃn Trueba, Ph.D > > > > Genome Analysis Platform > > CIC bioGUNE > > Parque TecnolÃ³gico de Bizkaia > > Building 502, Floor 0 > > 48160 Derio-Spain > > > > Tel.: +34 946 572 524 > > Fax: +34 946 568 732 > > > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- -- JosÃ© Luis LavÃn Trueba, Ph.D Genome Analysis Platform CIC bioGUNE Parque TecnolÃ³gico de Bizkaia Building 502, Floor 0 48160 Derio-Spain Tel.: +34 946 572 524 Fax: +34 946 568 732 [[alternative HTML version deleted]]

ADD COMMENT • link 10.4 years ago José Luis Lavín ▴ 80

0

Entering edit mode

On 06/20/2014 02:27 PM, Jos? Luis Lav?n wrote: > First of all I want to thank Michael for his swift answer. > > Then I have a concern about this, Am I wrong or the package "easyRNASeq" is > dramatically lacking an update at its bioconductor page. I mean that manual > script and vignette are not up to date (at least R complains when I try to > use functions such as "easyRNASeq" described on them). > > Any other alternatives that can be used for the table of count generation > task ? > Have a look at the QuasR http://www.bioconductor.org/packages/release/bioc/html/QuasR.html Regards, Hans-Rudolf > 2014-06-20 12:30 GMT+02:00 Michael Dondrup <michael.dondrup at="" ii.uib.no="">: > >> Dear Jos??, >> >> easyRNAseq can do that. >> >> http://www.bioconductor.org/packages/release/bioc/html/easyRNASeq.html >> >> Best >> Michael >> >> On Jun 20, 2014, at 11:56 AM, Jos?? Luis Lav??n wrote: >> >>> Dear list members, >>> >>> I have to analyze bacterial transcriptomic data and I have a doubt about >>> how to proceed. >>> >>> I have downloaded the reference genome FASTA from the NCBI and also a gff >>> file containing the annotation of that reference. I can map the reads to >>> the genome and so on, but when the time comes to generate the table of >>> counts for the Differential Expression (DE) analysis, I have no clear >> Idea >>> on how to use the gff annotation file to assign reads to the genomic >>> features. >>> >>> I've looked for solutions like HTSeq, but to my understanding this >> program >>> will generate a table of counts per alignment file (for instance, one >> table >>> per each bam file) which will require to merge all the independent tables >>> one by one to generate the full table of count for the DE analysis... >>> >>> To sum up; Is there any R package that enable to generate a single Table >> of >>> counts from multiple BAM files using an annotation gff file (or similar), >>> for a genome that is not included in the UCSC catalog of reference >>> organisms (as is the case of this bacteria I have to analyze)? >>> >>> Thanks in advance >>> >>> JL >>> >>> PD. I came across "Rsubread" package, but... >>> >>> package ???Rsubread??? is not available (for R version 3.1.0) >>> >>>> sessionInfo()R version 3.1.0 (2014-04-10) >>> Platform: x86_64-w64-mingw32/x64 (64-bit) >>> >>> locale: >>> [1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United >>> Kingdom.1252 >>> [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C >>> [5] LC_TIME=English_United Kingdom.1252 >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods base >>> >>> other attached packages: >>> [1] BiocInstaller_1.14.2 >>> >>> loaded via a namespace (and not attached): >>> [1] tools_3.1.0 >>> >>> >>> >>> >>> -- >>> -- >>> Jos?? Luis Lav??n Trueba, Ph.D >>> >>> Genome Analysis Platform >>> CIC bioGUNE >>> Parque Tecnol??gico de Bizkaia >>> Building 502, Floor 0 >>> 48160 Derio-Spain >>> >>> Tel.: +34 946 572 524 >>> Fax: +34 946 568 732 >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD REPLY • link 10.4 years ago Hotz, Hans-Rudolf ▴ 400

0

Entering edit mode

Hi Jose, All the options you have already had from the list work fine. My personal favourite one is to use htseq-count and then DESeq2 in R/Bioconductor. It has a specific function that takes the separate gene-counts table per sample and constructs an experiment-object for you with all the data you need for differential expression. Alternatively, if you have a very complicated design it could be convenient to use limma with the voom function for RNA-Seq data. Jose 2014-06-23 8:49 GMT+02:00 Hans-Rudolf Hotz <hrh@fmi.ch>: > > > On 06/20/2014 02:27 PM, JosÃ© Luis LavÃn wrote: > > First of all I want to thank Michael for his swift answer. > > > > Then I have a concern about this, Am I wrong or the package "easyRNASeq" > is > > dramatically lacking an update at its bioconductor page. I mean that > manual > > script and vignette are not up to date (at least R complains when I try > to > > use functions such as "easyRNASeq" described on them). > > > > Any other alternatives that can be used for the table of count generation > > task ? > > > > Have a look at the QuasR > > http://www.bioconductor.org/packages/release/bioc/html/QuasR.html > > > Regards, Hans-Rudolf > > > > > 2014-06-20 12:30 GMT+02:00 Michael Dondrup <michael.dondrup@ii.uib.no>: > > > >> Dear JosÃÂ©, > >> > >> easyRNAseq can do that. > >> > >> http://www.bioconductor.org/packages/release/bioc/html/easyRNASeq.html > >> > >> Best > >> Michael > >> > >> On Jun 20, 2014, at 11:56 AM, JosÃÂ© Luis LavÃÂn wrote: > >> > >>> Dear list members, > >>> > >>> I have to analyze bacterial transcriptomic data and I have a doubt > about > >>> how to proceed. > >>> > >>> I have downloaded the reference genome FASTA from the NCBI and also a > gff > >>> file containing the annotation of that reference. I can map the reads > to > >>> the genome and so on, but when the time comes to generate the table of > >>> counts for the Differential Expression (DE) analysis, I have no clear > >> Idea > >>> on how to use the gff annotation file to assign reads to the genomic > >>> features. > >>> > >>> I've looked for solutions like HTSeq, but to my understanding this > >> program > >>> will generate a table of counts per alignment file (for instance, one > >> table > >>> per each bam file) which will require to merge all the independent > tables > >>> one by one to generate the full table of count for the DE analysis... > >>> > >>> To sum up; Is there any R package that enable to generate a single > Table > >> of > >>> counts from multiple BAM files using an annotation gff file (or > similar), > >>> for a genome that is not included in the UCSC catalog of reference > >>> organisms (as is the case of this bacteria I have to analyze)? > >>> > >>> Thanks in advance > >>> > >>> JL > >>> > >>> PD. I came across "Rsubread" package, but... > >>> > >>> package Ã¢â¬ËRsubreadÃ¢â¬â¢ is not available (for R version 3.1.0) > >>> > >>>> sessionInfo()R version 3.1.0 (2014-04-10) > >>> Platform: x86_64-w64-mingw32/x64 (64-bit) > >>> > >>> locale: > >>> [1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United > >>> Kingdom.1252 > >>> [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C > >>> [5] LC_TIME=English_United Kingdom.1252 > >>> > >>> attached base packages: > >>> [1] stats graphics grDevices utils datasets methods base > >>> > >>> other attached packages: > >>> [1] BiocInstaller_1.14.2 > >>> > >>> loaded via a namespace (and not attached): > >>> [1] tools_3.1.0 > >>> > >>> > >>> > >>> > >>> -- > >>> -- > >>> JosÃÂ© Luis LavÃÂn Trueba, Ph.D > >>> > >>> Genome Analysis Platform > >>> CIC bioGUNE > >>> Parque TecnolÃÂ³gico de Bizkaia > >>> Building 502, Floor 0 > >>> 48160 Derio-Spain > >>> > >>> Tel.: +34 946 572 524 > >>> Fax: +34 946 568 732 > >>> > >>> [[alternative HTML version deleted]] > >>> > >>> _______________________________________________ > >>> Bioconductor mailing list > >>> Bioconductor@r-project.org > >>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>> Search the archives: > >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >> > >> > > > > > > > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Jose M. Garcia Manteiga PhD Data Analysis in Functional Genomics Center for Translational Genomics and BioInformatics Dibit2-Basilica, 4A3 San Raffaele Scientific Institute Via Olgettina 58, 20132 Milano (MI), Italy Tel: +39-02-2643-9144 e-mail: garciamanteiga.josemanuel@hsr.it [[alternative HTML version deleted]]

ADD REPLY • link 10.4 years ago Jose M Garcia Manteiga ▴ 310

0

Entering edit mode

Thank you very much to the list and specially to all those who answered my question. I'm a little overwhelmed with all this tsumani of information right now, but I expect to be able to use it in a short time. I will test those solutions you offered me and try to determine which is the best for me and come back to this post to comment about my strategy of choice. Thank you very very much, it's always a pleasure to be in contact with the people in this list. Best regards JL 2014-06-23 9:47 GMT+02:00 Jose Garcia <garciamanteiga.josemanuel@hsr.it>: > Hi Jose, > All the options you have already had from the list work fine. My personal > favourite one is to use htseq-count and then DESeq2 in R/Bioconductor. It > has a specific function that takes the separate gene-counts table per > sample and constructs an experiment-object for you with all the data you > need for differential expression. > Alternatively, if you have a very complicated design it could be > convenient to use limma with the voom function for RNA-Seq data. > Jose > > > > 2014-06-23 8:49 GMT+02:00 Hans-Rudolf Hotz <hrh@fmi.ch>: > > >> >> On 06/20/2014 02:27 PM, JosÃ© Luis LavÃn wrote: >> > First of all I want to thank Michael for his swift answer. >> > >> > Then I have a concern about this, Am I wrong or the package >> "easyRNASeq" is >> > dramatically lacking an update at its bioconductor page. I mean that >> manual >> > script and vignette are not up to date (at least R complains when I try >> to >> > use functions such as "easyRNASeq" described on them). >> > >> > Any other alternatives that can be used for the table of count >> generation >> > task ? >> > >> >> Have a look at the QuasR >> >> http://www.bioconductor.org/packages/release/bioc/html/QuasR.html >> >> >> Regards, Hans-Rudolf >> >> >> >> > 2014-06-20 12:30 GMT+02:00 Michael Dondrup <michael.dondrup@ii.uib.no>: >> > >> >> Dear JosÃÂ©, >> >> >> >> easyRNAseq can do that. >> >> >> >> http://www.bioconductor.org/packages/release/bioc/html/easyRNASeq.html >> >> >> >> Best >> >> Michael >> >> >> >> On Jun 20, 2014, at 11:56 AM, JosÃÂ© Luis LavÃÂn wrote: >> >> >> >>> Dear list members, >> >>> >> >>> I have to analyze bacterial transcriptomic data and I have a doubt >> about >> >>> how to proceed. >> >>> >> >>> I have downloaded the reference genome FASTA from the NCBI and also a >> gff >> >>> file containing the annotation of that reference. I can map the reads >> to >> >>> the genome and so on, but when the time comes to generate the table of >> >>> counts for the Differential Expression (DE) analysis, I have no clear >> >> Idea >> >>> on how to use the gff annotation file to assign reads to the genomic >> >>> features. >> >>> >> >>> I've looked for solutions like HTSeq, but to my understanding this >> >> program >> >>> will generate a table of counts per alignment file (for instance, one >> >> table >> >>> per each bam file) which will require to merge all the independent >> tables >> >>> one by one to generate the full table of count for the DE analysis... >> >>> >> >>> To sum up; Is there any R package that enable to generate a single >> Table >> >> of >> >>> counts from multiple BAM files using an annotation gff file (or >> similar), >> >>> for a genome that is not included in the UCSC catalog of reference >> >>> organisms (as is the case of this bacteria I have to analyze)? >> >>> >> >>> Thanks in advance >> >>> >> >>> JL >> >>> >> >>> PD. I came across "Rsubread" package, but... >> >>> >> >>> package Ã¢â¬ËRsubreadÃ¢â¬â¢ is not available (for R version 3.1.0) >> >>> >> >>>> sessionInfo()R version 3.1.0 (2014-04-10) >> >>> Platform: x86_64-w64-mingw32/x64 (64-bit) >> >>> >> >>> locale: >> >>> [1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United >> >>> Kingdom.1252 >> >>> [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C >> >>> [5] LC_TIME=English_United Kingdom.1252 >> >>> >> >>> attached base packages: >> >>> [1] stats graphics grDevices utils datasets methods base >> >>> >> >>> other attached packages: >> >>> [1] BiocInstaller_1.14.2 >> >>> >> >>> loaded via a namespace (and not attached): >> >>> [1] tools_3.1.0 >> >>> >> >>> >> >>> >> >>> >> >>> -- >> >>> -- >> >>> JosÃÂ© Luis LavÃÂn Trueba, Ph.D >> >>> >> >>> Genome Analysis Platform >> >>> CIC bioGUNE >> >>> Parque TecnolÃÂ³gico de Bizkaia >> >>> Building 502, Floor 0 >> >>> 48160 Derio-Spain >> >>> >> >>> Tel.: +34 946 572 524 >> >>> Fax: +34 946 568 732 >> >>> >> >>> [[alternative HTML version deleted]] >> >>> >> >>> _______________________________________________ >> >>> Bioconductor mailing list >> >>> Bioconductor@r-project.org >> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >> >>> Search the archives: >> >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> >> >> > >> > >> > >> > >> > _______________________________________________ >> > Bioconductor mailing list >> > Bioconductor@r-project.org >> > https://stat.ethz.ch/mailman/listinfo/bioconductor >> > Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > > > -- > Jose M. Garcia Manteiga PhD > Data Analysis in Functional Genomics > Center for Translational Genomics and BioInformatics > Dibit2-Basilica, 4A3 > San Raffaele Scientific Institute > Via Olgettina 58, 20132 Milano (MI), Italy > > Tel: +39-02-2643-9144 > e-mail: garciamanteiga.josemanuel@hsr.it > -- -- JosÃ© Luis LavÃn Trueba, Ph.D Genome Analysis Platform CIC bioGUNE Parque TecnolÃ³gico de Bizkaia Building 502, Floor 0 48160 Derio-Spain Tel.: +34 946 572 524 Fax: +34 946 568 732 [[alternative HTML version deleted]]

ADD REPLY • link 10.4 years ago José Luis Lavín ▴ 80

Login before adding your answer.