biomaRt bug? swapped colnames
dmontaner • 0
@dmontaner-7059
Last seen 8.2 years ago

Dear Steffen

I am downloading some snoRNA sequences using biomaRt and it seems that the column names of the final data.frame are swapped.

I am copying my code below.

Thanks for your package

David

 

> library (biomaRt)

> mart <- useDataset ("hsapiens_gene_ensembl", mart = useMart ("ensembl"))

> mydat <- getBM (c ("ensembl_gene_id", "gene_exon_intron"),
+                 filters = "biotype",
+                 values = "snoRNA",
+                 mart = mart)

> mydat[1:3,]
                                                                                                                         ensembl_gene_id
1                                                                   CAGCCCTAAAATGGAAAAAATTTAAAATTACTTAGACAATGTGATGTCATCAAAGGAACCCTAAGTAA
2                                                      GGGTGGTGATGAGAACCTTGTATTCTTCTGAAGAGAGGTGATGACTTAAAAACCATGCTCAATAGGATTACACTTAGGCCG
3 TCATCAGGTGGGATAATCCTTACCTGTTCCTCGTTTTGGAGGGCAGATAGAACAGGATAATTGGAGTTTGCATGATCCATGATTAATGTCTCTGTGTAATCAGGACTTGCAAACTCTGATTGTTCATATCTGAT
  gene_exon_intron
1  ENSG00000201209
2  ENSG00000200801
3  ENSG00000199713

> sessionInfo ()
R version 3.2.0 (2015-04-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] biomaRt_2.24.0

loaded via a namespace (and not attached):
 [1] IRanges_2.2.1        DBI_0.3.1            parallel_3.2.0      
 [4] RCurl_1.95-4.6       Biobase_2.28.0       AnnotationDbi_1.30.1
 [7] RSQLite_1.0.0        S4Vectors_0.6.0      BiocGenerics_0.14.0 
[10] GenomeInfoDb_1.4.0   stats4_3.2.0         bitops_1.0-6        
[13] XML_3.98-1.1        

biomart • 1.5k views
@steffen-durinck-4894
Last seen 10.2 years ago

Hi David,

That happens with a few query types: the headers are erroneously assigned on the BioMart server side, so it is not fixable from the biomaRt end. You can override the BioMart headers by setting bmHeader=FALSE in the getBM query, and then things should look fine. In that case getBM assumes the columns come back in the same order as in the query (which is mostly true), and the headers are assigned in R rather than on the BioMart server side.

Cheers,

Steffen

dmontaner • 0
@dmontaner-7059
Last seen 8.2 years ago

Thanks for your quick reply Steffen.

I will just rename the columns as you indicate.

Regards

David
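
For anyone renaming the columns by hand, a minimal sketch of a more defensive approach is to label the columns by their content instead of their position. The helper name `fix_swapped_names` and the ID pattern are assumptions made for illustration (they are not part of biomaRt), and the toy data uses shortened sequences rather than a real query result.

```r
# Hypothetical helper (not part of biomaRt): name the two columns of a
# sequence query by inspecting their content rather than their order.
fix_swapped_names <- function(df,
                              id_name    = "ensembl_gene_id",
                              seq_name   = "gene_exon_intron",
                              id_pattern = "^ENSG[0-9]{11}$") {
  stopifnot(is.data.frame(df), ncol(df) == 2)
  # A column counts as the ID column if every non-missing value matches.
  is_id <- vapply(df, function(col) {
    vals <- as.character(col[!is.na(col)])
    length(vals) > 0 && all(grepl(id_pattern, vals))
  }, logical(1))
  if (sum(is_id) != 1) stop("could not identify a unique ID column")
  names(df)[is_id]  <- id_name
  names(df)[!is_id] <- seq_name
  df
}

# Toy data mimicking the swapped result in the question (sequences shortened).
swapped <- data.frame(
  ensembl_gene_id  = c("CAGCCCTAAAATGG", "GGGTGGTGATGAGA"),
  gene_exon_intron = c("ENSG00000201209", "ENSG00000200801"),
  stringsAsFactors = FALSE
)
fixed <- fix_swapped_names(swapped)
```

The default pattern assumes human Ensembl gene IDs; for other species a different `id_pattern` would be needed.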

Thomas Maurel ▴ 800
@thomas-maurel-5295
Last seen 20 months ago
United Kingdom

Hi Steffen and David,

I think that in this case biomaRt returns the wrong header, and as a result bmHeader=TRUE should be used instead of FALSE:

> mydat <- getBM (c ("ensembl_gene_id", "gene_exon_intron"),
+                 filters = "biotype",
+                 values = "snoRNA",
+                 mart = mart, bmHeader=TRUE)
> mydat[1:3,]
                                                                                                                        Unspliced (Gene)
1                                                                   CAGCCCTAAAATGGAAAAAATTTAAAATTACTTAGACAATGTGATGTCATCAAAGGAACCCTAAGTAA
2                                                      GGGTGGTGATGAGAACCTTGTATTCTTCTGAAGAGAGGTGATGACTTAAAAACCATGCTCAATAGGATTACACTTAGGCCG
3 TCATCAGGTGGGATAATCCTTACCTGTTCCTCGTTTTGGAGGGCAGATAGAACAGGATAATTGGAGTTTGCATGATCCATGATTAATGTCTCTGTGTAATCAGGACTTGCAAACTCTGATTGTTCATATCTGAT
  Ensembl Gene ID
1 ENSG00000201209
2 ENSG00000200801
3 ENSG00000199713

Kind Regards,

Thomas
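
With bmHeader=TRUE the columns carry BioMart display names rather than attribute names. One possible way to get back to attribute names is to look the display names up in the attribute table. This is only a sketch: `attrs` below is a hand-made stand-in for what `listAttributes(mart)` returns (its `name` and `description` columns), the descriptions are copied from the output above and may differ between Ensembl releases, and the toy result uses shortened sequences.

```r
# Stand-in for listAttributes(mart); in a live session you would use the
# real table instead of this hand-made one (assumption for illustration).
attrs <- data.frame(
  name        = c("ensembl_gene_id", "gene_exon_intron"),
  description = c("Ensembl Gene ID", "Unspliced (Gene)"),
  stringsAsFactors = FALSE
)

# Toy result shaped like the bmHeader=TRUE output (sequences shortened).
res <- data.frame(
  "Unspliced (Gene)" = c("CAGCCCTAAAATGG", "GGGTGGTGATGAGA"),
  "Ensembl Gene ID"  = c("ENSG00000201209", "ENSG00000200801"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Map each display name to its attribute name; fail loudly if any is unknown.
idx <- match(names(res), attrs$description)
stopifnot(!anyNA(idx))
names(res) <- attrs$name[idx]
```

Because the lookup goes by header text, it does not depend on the order in which the server returns the columns.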


Hi Thomas,

This definitely looks like a case where the columns aren't returned in the same order they're requested, and biomaRt by default assigns the column names incorrectly.  I've got a forked version of the package here (https://github.com/grimbough/biomaRt) which tries to match the correct attribute to each column when you set bmHeader=TRUE:

> mydat <- getBM (c ("ensembl_gene_id", "gene_exon_intron"),
+                                  filters = "biotype",
+                                  values = "snoRNA",
+                                  mart = mart, bmHeader=TRUE)
> mydat[1:3,]
                                                                    gene_exon_intron ensembl_gene_id
1      GCCAGTGATGATTAGATTCAATGGTTGCTGAACATTCAATGTTGAAAAGCATCTAACTTGACTAGGACGGTCTGAGG       AT3G47347
2 TATAATGATGATTAAGTCTAGATGGGAATCTCTCTGATGCACCTTTTAAATTGTTAATGATGTTTGTTTTGTGCCGGTGATG       AT1G74456
3   GCAAATGAAGAATTGATTAATTTATGCTTAACCACTGATGAACAGTGTTGACAAAACATCTCCGCTTATTATCTGATGCC       AT1G75163