Question

Correct annotation of DGEList object from edgeR package regarding an RNA-Seq dataset

svlachavas ▴ 840

Dear Community,

I would like to annotate a DGEList object, created with the DGEList function from the edgeR package, with unique gene symbols for its Ensembl identifiers. My approach is the following:

y <- DGEList(counts=assay(coad_clear), group=colData(coad_clear)$definition)

 head(y$counts[1:3,1:3])
                TCGA-3L-AA1B-01A-11R-A37K-07 TCGA-DM-A1D8-01A-11R-A155-07
ENSG00000000003                         7280                        10395
ENSG00000000005                           23                            1
ENSG00000000419                         2065                         4158
                TCGA-AU-6004-01A-11R-1723-07
ENSG00000000003                         2547
ENSG00000000005                           27
ENSG00000000419                         1465

 head(y$samples)
                                           group lib.size norm.factors
TCGA-3L-AA1B-01A-11R-A37K-07 Primary solid Tumor 42553617            1
TCGA-DM-A1D8-01A-11R-A155-07 Primary solid Tumor 60377942            1
TCGA-AU-6004-01A-11R-1723-07 Primary solid Tumor 47402733            1
TCGA-T9-A92H-01A-11R-A37K-07 Primary solid Tumor 46429596            1
TCGA-AA-3663-11A-01R-1723-07 Solid Tissue Normal 35484802            1
TCGA-AA-A01T-01A-21R-A16W-07 Primary solid Tumor 15405325            1

# The approach I followed:

dim(y)
[1] 56963   497

gene.ids <- select(org.Hs.eg.db, keys=rownames(y), keytype="ENSEMBL", columns="SYMBOL")
'select()' returned 1:many mapping between keys and columns

 dim(gene.ids)
[1] 57310     2

head(gene.ids)
          ENSEMBL   SYMBOL
1 ENSG00000000003   TSPAN6
2 ENSG00000000005     TNMD
3 ENSG00000000419     DPM1
4 ENSG00000000457    SCYL3
5 ENSG00000000460 C1orf112
6 ENSG00000000938      FGR

sum(duplicated(gene.ids$ENSEMBL))
[1] 347

gene.ids <- gene.ids[!duplicated(gene.ids$ENSEMBL),] 

identical(gene.ids$ENSEMBL,rownames(y))
[1] TRUE

y$genes <- gene.ids

head(y$genes)
          ENSEMBL   SYMBOL
1 ENSG00000000003   TSPAN6
2 ENSG00000000005     TNMD
3 ENSG00000000419     DPM1
4 ENSG00000000457    SCYL3
5 ENSG00000000460 C1orf112
6 ENSG00000000938      FGR

y2 <- y[!duplicated(y$genes$SYMBOL),]

dim(y2)
[1] 25214   497

I wanted to ask whether there is a more straightforward or more accurate way to perform the above annotation, or whether my implementation has any pitfalls. I have also checked the alternative function mapIds, but it returns a vector rather than a data frame. My aim is to perform downstream differential expression (DE) analysis.

Thank you in advance,

Efstathios

edger DGEList org.hs.eg.db gene annotation • 4.2k views

updated 3.6 years ago by IIart.hubII • 0 • written 7.4 years ago by svlachavas ▴ 840

score 3 · Accepted Answer · 2017-12-03

Aaron Lun ★ 28k

Just use mapIds and save the output in a data.frame:

gene.ids <- mapIds(org.Hs.eg.db, keys=rownames(y),
                   keytype="ENSEMBL", column="SYMBOL")
y$genes <- data.frame(ENSEMBL=rownames(y), SYMBOL=gene.ids)

Done.
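
For readers wondering why this makes the de-duplication on ENSEMBL unnecessary: with its default `multiVals="first"`, `mapIds` returns exactly one value per key, in the order of the keys, so the result already lines up with `rownames(y)`. The base R sketch below imitates that behaviour on made-up data. The `mapping` table, the `keys` vector, the symbol `SYM_B` and the identifier `ENSG_FAKE` are all hypothetical, and `match()` only stands in for the database lookup; it is not a claim about how `mapIds` works internally.

```r
# Toy 1:many mapping table, shaped like the output of select().
# The extra symbol "SYM_B" is invented to create a 1:many case.
mapping <- data.frame(
  ENSEMBL = c("ENSG00000000003", "ENSG00000000005",
              "ENSG00000000005", "ENSG00000000419"),
  SYMBOL  = c("TSPAN6", "TNMD", "SYM_B", "DPM1"),
  stringsAsFactors = FALSE
)

# Keys in the row order of the count matrix; "ENSG_FAKE" has no mapping.
keys <- c("ENSG00000000003", "ENSG00000000005",
          "ENSG00000000419", "ENSG_FAKE")

# match() gives the first matching row for each key (NA if there is none),
# so the result has one entry per key, in key order.
first.symbol <- mapping$SYMBOL[match(keys, mapping$ENSEMBL)]
genes <- data.frame(ENSEMBL = keys, SYMBOL = first.symbol,
                    stringsAsFactors = FALSE)

# Sanity checks one might run before assigning to y$genes.
stopifnot(nrow(genes) == length(keys), identical(genes$ENSEMBL, keys))
print(genes)
```

Identifiers without a symbol come back as `NA`. Since `duplicated()` flags every `NA` after the first as a duplicate, a later filter such as `y[!duplicated(y$genes$SYMBOL),]` would keep only one of the unannotated genes, which may or may not be what is wanted.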

7.4 years ago Aaron Lun ★ 28k

Thanks Aaron for the update. The simpler the better.

7.4 years ago svlachavas ▴ 840

What is "coad_clear" in this case? What data did you assign to it? Could you please share the full code? Or can anyone else answer this question?

3.6 years ago IIart.hubII • 0