Question

EDASeq gene length normalization

0

Entering edit mode

karlaarz • 0

@karlaruizce30-23403

Last seen 2.7 years ago

Italy

Hello,

I would like to use EDASeq for gene length normalization for RNA-Seq analysis. However, I have a basic question:

According to the manual, the "getGeneLengthAndGCContent" help us retrieve the gene length and GC content of our gene of interest but, how can I use it to download all the genes information instead of some genes for human?

Thanks!

edaseq r rna-seq • 1.9k views

ADD COMMENT • link updated 4.9 years ago by Kevin Blighe ★ 4.0k • written 5.0 years ago by karlaarz • 0

score 1 · Answer 1 · 2020-04-27

Hey Karla,

You could first retrieve a vector of all Entrez and Ensembl gene IDs via biomaRt and then use this with EDASeq, as I show here:

1, retrieve all Entrez and Ensembl gene IDs via biomaRt

require('biomaRt')
mart <- useMart('ENSEMBL_MART_ENSEMBL')
mart <- useDataset('hsapiens_gene_ensembl', mart)
lookup <- getBM(
  mart = mart,
  attributes = c('entrezgene_id', 'ensembl_gene_id', 'gene_biotype'),
  uniqueRows = TRUE)

ens_genes <- lookup$ensembl_gene_id[!is.na(lookup$ensembl_gene_id)]

2, use EDASeq to retrieve gene length and GC content

out <- EDASeq::getGeneLengthAndGCContent(
  id = ens_genes[1:100],
  org = 'hsa',
  mode = c('biomart', 'org.db'))
Connecting to BioMart ...
Downloading sequences ...
This may take a few minutes ...

3, check output

head(out)
                length        gc
ENSG00000210049     71 0.4084507
ENSG00000211459    954 0.4549266
ENSG00000210077     69 0.4202899
ENSG00000210082   1559 0.4278384
ENSG00000209082     75 0.3866667
ENSG00000198888    956 0.4769874

tail(out)
                length        gc
ENSG00000257171   1525 0.3580328
ENSG00000227447    446 0.4439462
ENSG00000234089   2910 0.5233677
ENSG00000224565   1216 0.4761513
ENSG00000205476  17620 0.5610102
ENSG00000277040    437 0.6475973

The irony here is that EDASeq also uses biomaRt to retrieve the genomic co-ordinates and sequence of each gene, from which it calculates length and GC content, respectively.

It would take a while to run for all Ensembl genes (>60000); so, once done, you may consider saving the output as a R data object, or writing it to a file. You could also filter down to protein coding only via the gene_biotype column in the lookup object that I create (above).

Also confirm which ref version you have likely used:

searchDatasets(mart = mart, pattern = 'hsapiens')
                 dataset              description    version
78 hsapiens_gene_ensembl Human genes (GRCh38.p13) GRCh38.p13

Kevin