Hey Karla,
You could first retrieve a vector of all Entrez and Ensembl gene IDs via biomaRt and then use this with EDASeq, as I show here:
1, retrieve all Entrez and Ensembl gene IDs via biomaRt
require('biomaRt')
mart <- useMart('ENSEMBL_MART_ENSEMBL')
mart <- useDataset('hsapiens_gene_ensembl', mart)
lookup <- getBM(
mart = mart,
attributes = c('entrezgene_id', 'ensembl_gene_id', 'gene_biotype'),
uniqueRows = TRUE)
ens_genes <- lookup$ensembl_gene_id[!is.na(lookup$ensembl_gene_id)]
2, use EDASeq to retrieve gene length and GC content
out <- EDASeq::getGeneLengthAndGCContent(
id = ens_genes[1:100],
org = 'hsa',
mode = c('biomart', 'org.db'))
Connecting to BioMart ...
Downloading sequences ...
This may take a few minutes ...
3, check output
head(out)
length gc
ENSG00000210049 71 0.4084507
ENSG00000211459 954 0.4549266
ENSG00000210077 69 0.4202899
ENSG00000210082 1559 0.4278384
ENSG00000209082 75 0.3866667
ENSG00000198888 956 0.4769874
tail(out)
length gc
ENSG00000257171 1525 0.3580328
ENSG00000227447 446 0.4439462
ENSG00000234089 2910 0.5233677
ENSG00000224565 1216 0.4761513
ENSG00000205476 17620 0.5610102
ENSG00000277040 437 0.6475973
The irony here is that EDASeq also uses biomaRt to retrieve the genomic co-ordinates and sequence of each gene, from which it calculates length and GC content, respectively.
It would take a while to run for all Ensembl genes (>60000); so, once done, you may consider saving the output as a R data object, or writing it to a file. You could also filter down to protein coding only via the gene_biotype
column in the lookup
object that I create (above).
Also confirm which ref version you have likely used:
searchDatasets(mart = mart, pattern = 'hsapiens')
dataset description version
78 hsapiens_gene_ensembl Human genes (GRCh38.p13) GRCh38.p13
Kevin