Does BSgenome.Dmelanogaster.UCSC.dm2 maintain non-coding RNAs?
@patrick-schorderet-6081
United States

I was wondering whether BSgenome.Dmelanogaster.UCSC.dm2 contains non-coding RNAs. If not, does any other Drosophila BSgenome package contain them? Or maybe TxDb.Dmelanogaster.UCSC.dm3.ensGene?

Thanks

The BSgenome.Dmelanogaster.UCSC.dm2 doesn't contain any RNAs. It contains the genomic sequence for that species. There are ways to get non-coding RNAs, but you will first need to tell us exactly what you want.

In other words, 'non-coding RNAs' encompasses a lot of different things. In addition, there are several things you could be interested in (genomic sequence, RNA sequence, genomic location, etc).

Yes, sorry, you are right. Here is what I do: I count RNA-seq reads using the TxDb.Dmelanogaster.UCSC.dm3.ensGene database to find differentially expressed genes (DEGs). However, I would also like to see whether ncRNAs (specifically lincRNAs) are up- or down-regulated.

I hope this makes more sense. Thanks.

@herve-pages-1542
Seattle, WA, United States

Hi Patrick,

lincRNAs for Fly are annotated at Ensembl:

library(GenomicFeatures)
txdb <- makeTxDbFromBiomart(dataset="dmelanogaster_gene_ensembl")
tx <- transcripts(txdb, columns=c("tx_name", "gene_id", "tx_type"))
table(mcols(tx)$tx_type)
#  lincRNA       miRNA   pre_miRNA  protein_coding  pseudogene 
#     2776         304         238           30353         289 
#     rRNA      snoRNA       snRNA            tRNA 
#      147         288          31             314 

For example, to extract lincRNA FBtr0345927:

tx[mcols(tx)$tx_name %in% "FBtr0345927"]
# GRanges object with 1 range and 3 metadata columns:
#       seqnames               ranges strand |     tx_name         gene_id
#          <Rle>            <IRanges>  <Rle> | <character> <CharacterList>
#   [1]        X [22514453, 22514891]      - | FBtr0345927     FBgn0264677
#           tx_type
#       <character>
#   [1]     lincRNA
#   -------
#   seqinfo: 1870 sequences from an unspecified genome

To extract the other lincRNAs linked to the same "gene" as FBtr0345927:

tx[as.logical(mcols(tx)$gene_id %in% "FBgn0264677")]
# GRanges object with 2 ranges and 3 metadata columns:
#       seqnames               ranges strand |     tx_name         gene_id
#          <Rle>            <IRanges>  <Rle> | <character> <CharacterList>
#   [1]        X [22514453, 22514891]      - | FBtr0345927     FBgn0264677
#   [2]        X [22514522, 22514891]      - | FBtr0333773     FBgn0264677
#           tx_type
#       <character>
#   [1]     lincRNA
#   [2]     lincRNA
#   -------
#   seqinfo: 1870 sequences from an unspecified genome
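
For the read-counting use case raised earlier in the thread, one possible next step is to keep only the lincRNA transcripts and pull their exons, grouped by transcript. This is a minimal sketch and not part of the original answer: `lincRNA_exons_by_tx` is a hypothetical helper name, and it assumes a `txdb` built as above with a GenomicFeatures version that provides the `tx_type` column.

```r
library(GenomicFeatures)

# Hypothetical helper: return the exons of all lincRNA transcripts in a TxDb,
# grouped by transcript name. Assumes the TxDb carries a 'tx_type' column.
lincRNA_exons_by_tx <- function(txdb) {
    tx <- transcripts(txdb, columns=c("tx_name", "tx_type"))
    linc_names <- mcols(tx)$tx_name[mcols(tx)$tx_type %in% "lincRNA"]
    # One GRanges of exons per transcript, named by transcript name.
    ex_by_tx <- exonsBy(txdb, by="tx", use.names=TRUE)
    ex_by_tx[names(ex_by_tx) %in% linc_names]
}

# Usage (needs the 'txdb' object created above):
# linc_exons <- lincRNA_exons_by_tx(txdb)
```

The resulting GRangesList should be usable as the feature set for whatever counting step you already run on the protein-coding annotation.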

Note that tx_type is a new column in BioC 3.1 (our upcoming release, based on R 3.2, and scheduled for April 17) so make sure you use that version of BioC (just install R 3.2 and proceed as usual).

H.

 

Hey Hervé,

Is makeTxDbFromBiomart a function that only works with R 3.2? I tried it and got an error :-(

txdb <- makeTxDbFromBiomart(dataset="dmelanogaster_gene_ensembl")
Error: could not find function "makeTxDbFromBiomart"

Here is my sessionInfo():

R version 3.1.3 (2015-03-09)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.2 (Yosemite)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
[1] GenomicFeatures_1.18.7 AnnotationDbi_1.28.2   Biobase_2.26.0        
[4] GenomicRanges_1.18.4   GenomeInfoDb_1.2.5     IRanges_2.0.1         
[7] S4Vectors_0.4.0        BiocGenerics_0.12.1    BiocInstaller_1.16.2  

loaded via a namespace (and not attached):
 [1] base64enc_0.1-2         BatchJobs_1.6           BBmisc_1.9             
 [4] BiocParallel_1.0.3      biomaRt_2.22.0          Biostrings_2.34.1      
 [7] bitops_1.0-6            brew_1.0-6              checkmate_1.5.2        
[10] codetools_0.2-11        DBI_0.3.1               digest_0.6.8           
[13] fail_1.2                foreach_1.4.2           GenomicAlignments_1.2.2
[16] iterators_1.0.7         RCurl_1.95-4.5          Rsamtools_1.18.3       
[19] RSQLite_1.0.0           rtracklayer_1.26.3      sendmailR_1.2-1        
[22] stringr_0.6.2           tools_3.1.3             XML_3.98-1.1           
[25] XVector_0.6.0           zlibbioc_1.12.0        

Thanks!

It is named makeTranscriptDbFromBiomart in BioC 3.0 (current release) but was renamed makeTxDbFromBiomart in BioC 3.1 (the old name still works in BioC 3.1 but is deprecated). But as I said, the tx_type column is new and only available starting with BioC 3.1.
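
If a script has to run under both releases, one option is to look up whichever name the installed GenomicFeatures provides. This is only a sketch for the BioC 3.0/3.1 situation described above; `make_fly_txdb` is a hypothetical wrapper name, and under BioC 3.0 the resulting object will still lack the `tx_type` column.

```r
library(GenomicFeatures)

# Hypothetical wrapper: call the Biomart-based constructor under whichever
# name the installed GenomicFeatures exports.
make_fly_txdb <- function(dataset="dmelanogaster_gene_ensembl") {
    fun <- if (exists("makeTxDbFromBiomart")) {
        "makeTxDbFromBiomart"           # name used from BioC 3.1 on
    } else {
        "makeTranscriptDbFromBiomart"   # name used in BioC 3.0
    }
    do.call(fun, list(dataset=dataset))
}

# Usage (queries Ensembl's Biomart, so it needs network access):
# txdb <- make_fly_txdb()
```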

H.

@james-w-macdonald-5106
United States

That might be a tough one. A quick Google search indicates that people are working on lincRNAs for Drosophila, but I don't know if there is a comprehensive source. Certainly I don't see anything on UCSC. Maybe you can dig up a GFF or BED file somewhere, and convert it to a GRangesList.
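
A rough sketch of that conversion, assuming the annotation comes as a GTF/GFF file with exon features and a `transcript_id` attribute (a BED file would need different grouping). The helper name `gtf_to_exons_by_tx` and the file name in the usage line are placeholders, not something from this thread.

```r
library(rtracklayer)

# Hypothetical helper: import a GTF/GFF file and group its exon features
# by transcript. Assumes a 'type' column containing "exon" and a
# 'transcript_id' attribute, as GTF files normally have.
gtf_to_exons_by_tx <- function(path) {
    gr <- import(path)                  # GRanges, one row per feature
    exons <- gr[gr$type %in% "exon"]    # drop gene/transcript/CDS rows
    split(exons, exons$transcript_id)   # GRangesList, one element per transcript
}

# Usage with a placeholder file name:
# linc_exons <- gtf_to_exons_by_tx("lincRNAs.gtf")
```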

As an example, you could use this.

@patrick-schorderet-6081
United States

OK, great. Thanks for this info, I'll check it out.

Patrick

@johannes-rainer-6987
Italy

Hi Patrick,

With the next BioC release you can also use the ensembldb package to build EnsDb annotation packages (similar to TxDb packages, just tailored for annotations from Ensembl) for Drosophila, based on the GTF files provided by Ensembl. I'm also working on adding the EnsDb classes to AnnotationHub, which would make it much easier to generate such packages.
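
As a rough illustration of that route, the sketch below builds an EnsDb from a downloaded Ensembl GTF file and keeps the lincRNA transcripts. It assumes a released ensembldb and a GTF file whose name follows the Ensembl convention, so that organism, genome build and Ensembl version can be taken from it (otherwise they would have to be passed to `ensDbFromGtf()` explicitly). `lincRNAs_from_gtf` is a hypothetical helper name, and the biotype label "lincRNA" is assumed to match the Biomart output shown earlier in the thread; `table(tx$tx_biotype)` will show which labels your file actually uses.

```r
library(ensembldb)

# Hypothetical helper: build an EnsDb from an Ensembl GTF file and return
# its lincRNA transcripts as a GRanges object.
lincRNAs_from_gtf <- function(gtf_file) {
    # ensDbFromGtf() writes an SQLite file and returns its path.
    db_file <- ensDbFromGtf(gtf=gtf_file)
    edb <- EnsDb(db_file)
    tx <- transcripts(edb, columns=c("tx_id", "gene_id", "tx_biotype"))
    tx[tx$tx_biotype %in% "lincRNA"]
}

# Usage (substitute the path of the GTF file you downloaded from Ensembl):
# linc <- lincRNAs_from_gtf("path/to/ensembl_fly.gtf.gz")
```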

cheers, jo


 

@patrick-schorderet-6081
United States

Thanks Hervé and Johannes,

I just tried the old function (makeTranscriptDbFromBiomart) and it looks like something is going wrong (error message pasted below). I guess the easiest thing will be to wait for the new BioC release. Should this work with the update?

Thanks for the help

Patrick

 

 

txdb <- makeTranscriptDbFromBiomart(dataset="dmelanogaster_gene_ensembl")

Download and preprocess the 'transcripts' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Download and preprocess the 'splicings' data frame ... Error in .stopWithBioMartDataAnomalyReport(bm_result, idx[bad_idx2], id_prefix,  : 
  BioMart data anomaly: in the following transcripts, 
  located on the minus strand, the start of some 3' UTRs 
  (3_utr_start) doesn't match the start of the exon 
  (exon_chrom_start).
  (Showing only the first 6 out of 9 transcripts.)
  1. Transcript FBtr0084081:
       strand rank exon_chrom_start exon_chrom_end ensembl_exon_id
     1     -1    1         21368380       21369062  FBtr0084081-E2
     2     -1    2         21377288       21377399  FBtr0084081-E1
     3     -1    3         21376819       21377076  FBtr0084081-E3
     4     -1    4         21376602       21376741  FBtr0084081-E4
     5     -1    5         21375060       21375912  FBtr0084081-E5
       genomic_coding_start genomic_coding_end 5_utr_start 5_utr_end 3_utr_start
     1             21368871           21369062    21368380  21368870          NA
     2                   NA                 NA          NA        NA    21377288
     3             21376819           21377035          NA        NA    21377036
     4             21376602           21376741          NA        NA          NA
     5             21375060           21375912          NA        NA          NA
       3_utr_end cds_start cds_end cds_length
     1        NA         1     192        521
     2  21377399       193     304        521
     3  21377076       305     521        521
     4        NA        NA      NA        521
     5        NA        NA      NA        521
  2. Transcript FBtr0084084:
       strand rank exon_chrom_start exon_chrom_end ensembl_exon_id
     1     -1    1         21366004       21366338  FBtr0084084-E2
     2     -1    2         21377288       21377399  FBtr0084081-E1
     3     -1    3         21376819       21377076  FBtr0084081-E3
     4     -1    4         21376602       21376741  FBtr0084081-E4
     5     -1    5         21375060       21375912  FBtr0084081-E5
       genomic_coding_start genomic_coding_end 5_utr_start 5_utr_end 3_utr_start
     1             21366294           21366338    21366004  21366293          NA
     2                   NA                 NA          NA        NA    21377288
     3             21376819           21377035          NA        NA    21377036
     4             21376602           21376741          NA        NA          NA
     5             21375060           21375912          NA        NA          NA
       3_utr_end cds_start cds_end cds_length
     1        NA         1      45        374
     2  21377399        46     157        374
     3  21377076       158     374        374
     4        NA        NA      NA        374
     5        NA        NA      NA        374
  3. Transcript FBtr0084085:
       strand rank exon_chrom_start exon_chrom_end ensembl_exon_id
     1     -1    1         21361398       21361610  FBtr0084085-E2
     2     -1    2         21361670       21362138  FBtr0084085-E4
     3     -1    3         21377288       21377399  FBtr0084081-E1
     4     -1    4         21376819       21377076  FBtr0084085-E3
     5     -1    5         21376602       21376741  FBtr0084085-E5
     6     -1    6         21375060       21375912  FBtr0084085-E6
       genomic_coding_start genomic_coding_end 5_utr_start 5_utr_end 3_utr_start
     1                   NA                 NA    21361398  21361610          NA
     2             21361825           21362138    21361670  21361824          NA
     3                   NA                 NA          NA        NA    21377288
     4             21376819           21377035          NA        NA    21377036
     5             21376602           21376741          NA        NA          NA
     6             21375060           21375912          NA        NA          NA
       3_utr_end cds_start cds_end cds_length
     1        NA        NA      NA        643
     2        NA         1     314        643
     3  21377399       315     426        643
     4  21377076       427     643        643
     5        NA        NA      NA        643
     6        NA        NA      NA        643
  4. Transcript FBtr0084082:
       strand rank exon_chrom_start exon_chrom_end ensembl_exon_id
     1     -1    1         21367910       21368238  FBtr0084082-E2
     2     -1    2         21377288       21377399  FBtr0084081-E1
     3     -1    3         21376819       21377076  FBtr0084081-E3
     4     -1    4         21376602       21376741  FBtr0084081-E4
     5     -1    5         21375060       21375912  FBtr0084081-E5
       genomic_coding_start genomic_coding_end 5_utr_start 5_utr_end 3_utr_start
     1             21368215           21368238    21367910  21368214          NA
     2                   NA                 NA          NA        NA    21377288
     3             21376819           21377035          NA        NA    21377036
     4             21376602           21376741          NA        NA          NA
     5             21375060           21375912          NA        NA          NA
       3_utr_end cds_start cds_end cds_length
     1        NA         1      24        353
     2  21377399        25     136        353
     3  21377076       137     353        353
     4        NA        NA      NA        353
     5        NA        NA      NA        353
  5. Transcript FBtr0084083:
       strand rank exon_chrom_start exon_chrom_end ensembl_exon_id
     1     -1    1         21366450       21366744  FBtr0084083-E2
     2     -1    2         21377288       21377399  FBtr0084081-E1
     3     -1    3         21376819       21377076  FBtr0084085-E3
     4     -1    4         21376602       21376741  FBtr0084085-E5
     5     -1    5         21375060       21375912  FBtr0084085-E6
       genomic_coding_start genomic_coding_end 5_utr_start 5_utr_end 3_utr_start
     1             21366710           21366744    21366450  21366709          NA
     2                   NA                 NA          NA        NA    21377288
     3             21376819           21377035          NA        NA    21377036
     4             21376602           21376741          NA        NA          NA
     5             21375060           21375912          NA        NA          NA
       3_utr_end cds_start cds_end cds_length
     1        NA         1      35        364
     2  21377399        36     147        364
     3  21377076       148     364        364
     4        NA        NA      NA        364
     5        NA        NA      NA        364
  6. Transcript FBtr0307759:
       strand rank exon_chrom_start exon_chrom_end ensembl_exon_id
     1     -1    1         21367363       21367688  FBtr0307759-E2
     2     -1    2         21377288       21377399  FBtr0084081-E1
     3     -1    3         21376819       21377076  FBtr0307759-E3
     4     -1    4         21376602       21376741  FBtr0114359-E3
     5     -1    5         21375060       21375912  FBtr0114359-E4
       genomic_coding_start genomic_coding_end 5_utr_start 5_utr_end 3_utr_start
     1             21367679           21367688    21367363  21367678          NA
     2                   NA                 NA          NA        NA    21377288
     3             21376819           21377035          NA        NA    21377036
     4             21376602           21376741          NA        NA          NA
     5             21375060           21375912          NA        NA          NA
       3_utr_end cds_start cds_end cds_length
     1        NA         1      10        339
     2  21377399        11     122        339
     3  21377076       123     339        339
     4        NA        NA      NA        339
     5        NA        NA      NA      
In addition: Warning messages:
1: In assignProvIdsForSuperGroup(seqlevels, "") :
  inaccurate integer conversion in coercion
2: In 3L * nb_ints : NAs produced by integer overflow
3: In matchCircularity(chromlengths$name, circ_seqs) :
  None of the strings in your circ_seqs argument match your seqnames.

@herve-pages-1542
Seattle, WA, United States

Hi Patrick,

Yes, the dmelanogaster_gene_ensembl dataset in the latest release of Ensembl (v79) contains some transcripts that are misrepresented (wrong exon ranking and strand, wrong UTRs). These are detected by the sanity checks that makeTranscriptDbFromBiomart() applies to the incoming data. See the thread "makeTranscriptDbFromBiomart failure from Data Anomaly" for a long version of this story.

Anyway, a patch was applied a couple of weeks ago to GenomicFeatures (1.18.5 in BioC release, and 1.19.35 in BioC devel) to address the issue. The new behavior is that makeTranscriptDbFromBiomart() (renamed makeTxDbFromBiomart() in BioC devel) now drops these problematic transcripts with a warning instead of failing. So please make sure your packages are up-to-date (run biocLite() with no arguments for that).

Thanks,

H.
