Noah Dowell
Hello All,
Problem:
I would like to obtain the genomic sequence upstream (~500 bp) of a
specific bacterial gene, for every bacterial genome that has the gene.
On EcoCyc I see that many (> 100) bacteria have the gene, but I do not
know how to get all of the sequences in a high-throughput manner, so I
planned to use biomaRt to retrieve them and send them to alignment
programs later. I have read through the vignette and tried to get the
function to work with a non-Ensembl Mart, to no avail. I was also
presented with an error (see below) that suggested I report it to the
mailing list. It looks like I would also have to query each of the 249
bacterial genomes in the "bacterial_mart_7" Mart individually (with
getLDS or getBM), which does not seem high-throughput at all. Are there
any other suggestions that would let me take advantage of the large
amount of bacterial genomic data for homology studies?
Thank you for your help.
Noah
Attempted Solution (for a single genome):
> bacGenome = useMart("bacterial_mart_7", dataset = "esc_20_gene")
Checking attributes ... ok
Checking filters ... ok
>
> filters = c("external_gene_id")
>
> attributes = c("external_gene_id","upstream_flank")
>
> values = list(external_gene_id = c("fis"), 500)
> seq = getBM(attributes = attributes, filters = filters, values = values,
+             mart = bacGenome, checkFilters = FALSE)
V1
1 fis
Error in getBM(attributes = attributes, filters = filters, values =
values, :
The query to the BioMart webservice returned an invalid result: the
number of columns in the result table does not equal the number of
attributes in the query. Please report this to the mailing list.
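A minimal sketch of the per-genome loop described above, in case it helps frame the question. The helper names `collect_over_datasets` and `fetch_flank` are hypothetical. It is also an assumption (borrowed from how Ensembl marts behave, not verified for "bacterial_mart_7") that "upstream_flank" is passed as a second filter and that the sequence comes back through an attribute named "gene_flank".

```r
# Apply a per-dataset fetch function to many datasets and stack the results.
# Datasets that raise an error or return no rows are skipped.
collect_over_datasets <- function(datasets, fetch_one) {
  results <- lapply(datasets, function(ds) {
    tryCatch(fetch_one(ds), error = function(e) NULL)
  })
  names(results) <- datasets
  results <- Filter(function(x) is.data.frame(x) && nrow(x) > 0, results)
  if (length(results) == 0) return(data.frame())
  # Tag every row with the dataset it came from before stacking.
  tagged <- Map(function(ds, df) {
    data.frame(dataset = ds, df, stringsAsFactors = FALSE)
  }, names(results), results)
  out <- do.call(rbind, tagged)
  rownames(out) <- NULL
  out
}

# Hypothetical fetcher for one dataset (needs biomaRt and network access).
# ASSUMPTION: filter "upstream_flank" and attribute "gene_flank" exist in
# this mart, as they do in Ensembl marts; other marts may use other names.
fetch_flank <- function(dataset, gene = "fis", flank = 500,
                        biomart = "bacterial_mart_7") {
  mart <- biomaRt::useMart(biomart, dataset = dataset)
  biomaRt::getBM(
    attributes = c("external_gene_id", "gene_flank"),
    filters = c("external_gene_id", "upstream_flank"),
    values = list(gene, flank),
    mart = mart,
    checkFilters = FALSE
  )
}

# Usage (not run here, since it queries the BioMart web service):
# datasets <- biomaRt::listDatasets(biomaRt::useMart("bacterial_mart_7"))$dataset
# flanks <- collect_over_datasets(as.character(datasets), fetch_flank)
```

Passing the fetch function in as an argument keeps the loop independent of the mart, so the same loop would work with getLDS or any other per-genome query.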
> sessionInfo()
R version 2.11.0 (2010-04-22)
i386-apple-darwin9.8.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rtracklayer_1.8.1 RCurl_1.3-1 bitops_1.0-4.1 biomaRt_2.4.0
loaded via a namespace (and not attached):
[1] Biobase_2.8.0 Biostrings_2.16.0 BSgenome_1.16.0 GenomicRanges_1.0.1 IRanges_1.6.0
[6] tools_2.11.0 XML_2.8-1