Here are two lines of code that create data frames using biomaRt's "getBM" function:
exons7p2 <- getBM(attributes = c("refseq_dna"), filters = "chromosome_name", values = "X", mart = mart.mm)
exons7p3 <- getBM(attributes = c("refseq_dna", "gene_exon_intron"), filters = "chromosome_name", values = "X", mart = mart.mm)
The only difference is that I request one additional attribute in the second call. Yet when I look at the "refseq_dna" columns of the two data frames, their contents are completely different.
head(exons7p2$"refseq_dna") gives RefSeq accession IDs of the "NM_010498" type.
head(exons7p3$"refseq_dna") gives DNA sequences.
Clearly, there is something fundamental I'm misunderstanding about biomaRt. I would appreciate any guidance.
Thanks.
Eric
Below is how I made mart.mm, followed by my sessionInfo():
mart.mm <- useDataset("mmusculus_gene_ensembl", mart = mart.mm)
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.4 (Yosemite)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] org.Mm.eg.db_3.1.2 RSQLite_1.0.0
[3] DBI_0.3.1 TxDb.Mmusculus.UCSC.mm9.knownGene_3.1.2
[5] GenomicFeatures_1.20.5 AnnotationDbi_1.30.1
[7] Biobase_2.28.0 GenomicRanges_1.20.8
[9] GenomeInfoDb_1.4.3 IRanges_2.2.7
[11] S4Vectors_0.6.6 BiocGenerics_0.14.0
[13] biomaRt_2.24.1
loaded via a namespace (and not attached):
[1] XVector_0.8.0 zlibbioc_1.14.0 GenomicAlignments_1.4.1 BiocParallel_1.2.21
[5] tools_3.2.2 lambda.r_1.1.7 futile.logger_1.4.1 rtracklayer_1.28.10
[9] futile.options_1.0.0 bitops_1.0-6 RCurl_1.95-4.7 Biostrings_2.36.4
[13] Rsamtools_1.20.4 XML_3.98-1.3
Hi Thomas,
Perhaps I'm not understanding your answer, but I don't think that is my problem. Below are the two data frames that I made with getBM, as described above. One of them has just "refseq_dna" as an attribute and the other has both "refseq_dna" and "gene_exon_intron":
> names(exons7p2)
[1] "refseq_dna"
> names(exons7p3)
[1] "refseq_dna" "gene_exon_intron"
But when I ask for the first few "refseq_dna" items in the "exons7p2" data frame, I get RefSeq accession IDs, whereas when I ask for the first few "refseq_dna" items in the "exons7p3" data frame (the same column name in both cases, not "gene_exon_intron") I get DNA sequences. (I have listed only the start of the sequence output; much more followed.)
> head(exons7p2$refseq_dna)
[1] "XM_006527846" "NM_010498" "XM_006527845" "NM_001290562" "NM_001290561" "NM_011123"
> head(exons7p3$refseq_dna)
[1] "GTCAGTGCACAACTGCCAACTGGGATGCAGAACACTGCTCACGCCAACCATCCTGAAAGCCAACTATAAAAAGCAGAGAGATACTCTGCACCTTTTCAGTGAGGTCCAGATACCCACAGAGCAGAGACAGTCGCTCACACATGATGAGGGTCATCATCCTCCTGCTCACACTGCATGTGCTAGGCGTCTCCAGTGTGATGAGTCTCAAAAAGAAGGTAGCAGACCTGTGTGGAAGGGGGCTGTATGTGGTGGGCATGTTGGGCAGAGACAAACAGACAGAGAGAGGCTTGGGAGG
So the "refseq_dna" column of one data frame holds one type of data (accession IDs), whereas the column with the same name in the other data frame holds a different type of data (sequences).
Could it be that there is a bug that mixes up attribute names? If I ask for the "gene_exon_intron" items from the data frame that has them, I get accession IDs like those in the "refseq_dna" column of the other data frame (though not the same ones).
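One way to test that hypothesis without trusting the column labels is to classify each column by what it actually contains. Below is a minimal sketch in base R. The helper name looks_like_sequence and the 0.9 threshold are my own assumptions, and the toy data frame is a shortened stand-in for exons7p3, not real query output:

```r
# Hypothetical helper (not part of biomaRt): TRUE when at least
# `min_fraction` of the non-empty values consist only of nucleotide letters.
looks_like_sequence <- function(x, min_fraction = 0.9) {
  x <- as.character(x)                # also handles factor columns
  x <- x[!is.na(x) & nzchar(x)]       # drop missing and empty entries
  if (length(x) == 0) return(FALSE)
  mean(grepl("^[ACGTN]+$", x, ignore.case = TRUE)) >= min_fraction
}

# Toy stand-in for exons7p3: values are shortened and purely illustrative.
toy <- data.frame(
  refseq_dna       = c("GTCAGTGCACAACTGCC", "ATGATGAGGGTCATCAT"),
  gene_exon_intron = c("NM_010498", "XM_006527846"),
  stringsAsFactors = FALSE
)

# Named logical vector: which columns hold sequence-like strings?
is_seq <- sapply(toy, looks_like_sequence)
print(is_seq)
```

If the column flagged as sequence-like in the real exons7p3 is "refseq_dna" rather than "gene_exon_intron", that would be consistent with the two labels having been swapped, although it would not show where the swap happens.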
Thanks.
Eric