svlachavas
Dear Community,
I would like to annotate a DGEList object, created with the DGEList function from the edgeR package, with unique gene symbols for its Ensembl identifiers. My approach is the following:
y <- DGEList(counts=assay(coad_clear), group=colData(coad_clear)$definition)
head(y$counts[1:3,1:3])
                TCGA-3L-AA1B-01A-11R-A37K-07 TCGA-DM-A1D8-01A-11R-A155-07
ENSG00000000003                         7280                        10395
ENSG00000000005                           23                            1
ENSG00000000419                         2065                         4158
                TCGA-AU-6004-01A-11R-1723-07
ENSG00000000003                         2547
ENSG00000000005                           27
ENSG00000000419                         1465
head(y$samples)
                                           group lib.size norm.factors
TCGA-3L-AA1B-01A-11R-A37K-07 Primary solid Tumor 42553617            1
TCGA-DM-A1D8-01A-11R-A155-07 Primary solid Tumor 60377942            1
TCGA-AU-6004-01A-11R-1723-07 Primary solid Tumor 47402733            1
TCGA-T9-A92H-01A-11R-A37K-07 Primary solid Tumor 46429596            1
TCGA-AA-3663-11A-01R-1723-07 Solid Tissue Normal 35484802            1
TCGA-AA-A01T-01A-21R-A16W-07 Primary solid Tumor 15405325            1

# The one approach I followed:
dim(y)
[1] 56963   497
gene.ids <- select(org.Hs.eg.db, rownames(y), keytype="ENSEMBL",column="SYMBOL")
'select()' returned 1:many mapping between keys and columns
dim(gene.ids)
[1] 57310     2
head(gene.ids)
          ENSEMBL   SYMBOL
1 ENSG00000000003   TSPAN6
2 ENSG00000000005     TNMD
3 ENSG00000000419     DPM1
4 ENSG00000000457    SCYL3
5 ENSG00000000460 C1orf112
6 ENSG00000000938      FGR
sum(duplicated(gene.ids$ENSEMBL))
[1] 347
gene.ids <- gene.ids[!duplicated(gene.ids$ENSEMBL),]
identical(gene.ids$ENSEMBL,rownames(y))
[1] TRUE
y$genes <- gene.ids
head(y$genes)
          ENSEMBL   SYMBOL
1 ENSG00000000003   TSPAN6
2 ENSG00000000005     TNMD
3 ENSG00000000419     DPM1
4 ENSG00000000457    SCYL3
5 ENSG00000000460 C1orf112
6 ENSG00000000938      FGR
y2 <- y[!duplicated(y$genes$SYMBOL),]
dim(y2)
[1] 25214   497
Is there a more straightforward or more accurate way to perform this annotation, and does my implementation have any pitfalls? I have also looked at the alternative function mapIds, but it returns a vector rather than a data frame. My aim is to perform downstream differential expression analysis.
Thank you in advance,
Efstathios
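One pitfall worth checking in the approach above: duplicated() counts repeated NA values as duplicates, so filtering on !duplicated(y$genes$SYMBOL) alone keeps a single row with a missing symbol and silently drops every other unannotated gene. The following is only a minimal sketch in base R of one possible way to make both steps explicit. The IDs, symbols and counts are invented toy values, gene.ids stands in for the output of select(), and counts stands in for the count matrix of y; none of this comes from the original data.

```r
# Toy stand-in for the output of select() (hypothetical values):
# ENSG02 maps to two symbols, and ENSG04 is absent from the table.
gene.ids <- data.frame(
  ENSEMBL = c("ENSG01", "ENSG02", "ENSG02", "ENSG03", "ENSG05"),
  SYMBOL  = c("AAA", "BBB", "BBB2", "CCC", "AAA"),
  stringsAsFactors = FALSE
)

# Toy count matrix; its row names play the role of rownames(y).
counts <- matrix(
  1:10, nrow = 5,
  dimnames = list(c("ENSG01", "ENSG02", "ENSG03", "ENSG04", "ENSG05"),
                  c("s1", "s2"))
)

# match() returns the first hit for each row name, so 1:many mappings
# collapse to the first symbol, unmapped IDs become NA, and the result
# is always in the same order as the rows of the count matrix.
idx <- match(rownames(counts), gene.ids$ENSEMBL)
genes <- data.frame(
  ENSEMBL = rownames(counts),
  SYMBOL  = gene.ids$SYMBOL[idx],
  stringsAsFactors = FALSE
)

# Remove unannotated rows explicitly, then remove repeated symbols.
keep <- !is.na(genes$SYMBOL) & !duplicated(genes$SYMBOL)
genes.kept  <- genes[keep, ]
counts.kept <- counts[keep, , drop = FALSE]
```

Because match() aligns the annotation to the row names directly, the identical() check is no longer needed. With the real annotation package, mapIds(org.Hs.eg.db, keys=rownames(y), keytype="ENSEMBL", column="SYMBOL") returns by default a named vector with one symbol per key, which could be placed in a data frame in the same way as SYMBOL above. Whether unannotated genes should be dropped at all is a judgment call for the analysis; the sketch only makes that choice visible.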
Thanks Aaron for the update. The simpler the better.
What is "coad_clear" in this case? What data have you assigned to it? Could you please share the full code? Or can anyone else answer this question?