Question

DESeq2, enrichGO and Ensembl ID

0

Alvin • 0

@3cc02754

Last seen 15 months ago

United Kingdom

Hi, I initially ran the following DESeq2 code:

dds=DESeqDataSet(se,design=~TRAIT)
dds=DESeq(dds)
res=results(dds)

I currently have results from DESeq2 that look like this:

log2 fold change (MLE): TRAIT S vs N 
Wald test p-value: TRAIT S vs N 
DataFrame with 42800 rows and 6 columns
                      baseMean log2FoldChange     lfcSE       stat    pvalue      padj
                     <numeric>      <numeric> <numeric>  <numeric> <numeric> <numeric>
ENSG00000000003.15   4.5568518     -0.1048029 0.5527906 -0.1895889  0.849631        NA
ENSG00000000005.6    0.0802772     -0.1570978 3.0089434 -0.0522103  0.958361        NA
ENSG00000000419.14 152.9563242      0.0437748 0.1139477  0.3841658  0.700856  0.857539
ENSG00000000457.14 271.1873887      0.0873888 0.0923764  0.9460082  0.344144  0.606002
ENSG00000000460.17  55.4021510      0.0671604 0.1729930  0.3882263  0.697849  0.856165

I want to run enrichGO on the above results:

genes_to_test <- rownames(res[res$log2FoldChange>0.5,])
GO_res <- enrichGO(genes_to_test, OrgDb = org.Hs.eg.db, keyType = "ENSEMBL", ont="BP")

However, as you can see, my Ensembl IDs carry a version suffix (a dot followed by a number, e.g. .15), so they do not match the ENSEMBL keys.

What keytype alternative will capture the ID version?
My second question is: how can the initial DESeq2 results be changed so that the rownames are gene names instead of versioned Ensembl IDs?

Many thanks

go DESeq2 • 832 views

ADD COMMENT • link updated 18 hours ago by James W. MacDonald 67k • written 15 months ago by Alvin • 0

score 1 · Answer 1 · 2023-09-21

1

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 14 hours ago

United States

There isn't a keytype alternative. You need the gene IDs without the version suffix.
Strip the suffix from the rownames of your 'se' object, or from the rownames of your DESeqDataSet object.

rownames(se) <- gsub("\\.[0-9]+$", "", rownames(se))
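
As a minimal sketch of what that substitution does, here is the same regular expression applied to the five IDs shown in the question, followed by a check that stripping the version has not produced duplicate rownames. The vector names versioned_ids and stripped_ids are only for illustration.

```r
# The five versioned Ensembl IDs from the results table in the question
versioned_ids <- c("ENSG00000000003.15", "ENSG00000000005.6",
                   "ENSG00000000419.14", "ENSG00000000457.14",
                   "ENSG00000000460.17")

# Remove a trailing ".<digits>" version suffix, leaving the stable gene ID
stripped_ids <- gsub("\\.[0-9]+$", "", versioned_ids)
print(stripped_ids)

# anyDuplicated() returns 0 when every stripped ID is still unique
print(anyDuplicated(stripped_ids))
```

The same gsub() call can be assigned to rownames(dds) instead of rownames(se) if the DESeqDataSet has already been built.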

ADD COMMENT • link 15 months ago James W. MacDonald 67k

0

Hi James, I hope to piggyback on this question of how to deal with Ensembl IDs that carry a version suffix. Unlike the case where converting the ENSG IDs in DESeq2 results to gene symbols gives more than one ENSG ID per gene symbol, my data have Ensembl IDs with a suffix indicating the version. When I ran stripped_ids <- sub("\\.[0-9]+$", "", original_ids), where original_ids <- rownames(ddsHTSeq_SubsetxUnStimRefConditions), there were no duplicated values in stripped_ids. However, a unique Ensembl ID (without the version suffix) still mapped to multiple gene symbols or Entrez IDs, which is expected.

In this case, I think we should treat these entries as separate "genes" (see the post "Filtering read counts matrix: how to deal with duplicated gene symbols, different ENSEMBL ids"). Or is it appropriate to just sum the counts for genes that have different Ensembl IDs but map to the same gene symbol? Thank you for your help.

ADD REPLY • link 21 hours ago Quang ▴ 10

0

I am not sure I understand. You say that a unique Ensembl ID maps to multiple gene symbols. That doesn't mean you have two genes (it's just one Ensembl ID, and one row in your data), but instead that you have one-to-many mapping of that ID to other IDs. If that's the case, I normally just pick one symbol and go with that.
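
A rough sketch of "pick one symbol", using base R on a made-up mapping table: id_map, its column names, and the placeholder IDs and symbols are all hypothetical, not real annotation. With a real annotation package, the multiVals argument of AnnotationDbi's mapIds() (default "first") makes the same kind of choice.

```r
# Hypothetical long-format mapping: one row per (Ensembl ID, symbol) pair.
# ENSG_A maps to two symbols (one-to-many); ENSG_B maps to one.
id_map <- data.frame(
  ensembl = c("ENSG_A", "ENSG_A", "ENSG_B"),
  symbol  = c("SYM1", "SYM2", "SYM3"),
  stringsAsFactors = FALSE
)

# Keep only the first symbol listed for each Ensembl ID
one_symbol <- id_map[!duplicated(id_map$ensembl), ]

# Named lookup vector: Ensembl ID -> chosen symbol
symbol_for <- setNames(one_symbol$symbol, one_symbol$ensembl)
print(symbol_for)
```

The data still have one row per Ensembl ID; only the label attached to each row is being chosen.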

If you are saying the opposite (that TWO different Ensembl IDs are mapping to the same gene symbol), then I would consider that to be two genes. The alternative is to look at each individual case and try to figure out why two (or more) Ensembl IDs are mapping to the same symbol, and make individual decisions for each one. That sounds complex and yet boring at the same time, so I just default to thinking that the Ensembl ID is 'real' and the gene symbols are just simplifications for our biologist friends.
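
If you did want to look at the individual cases, a sketch like the following would list the symbols reached from more than one Ensembl ID. Again, id_map and its contents are placeholders rather than real annotation.

```r
# Hypothetical mapping where two different Ensembl IDs share one symbol
id_map <- data.frame(
  ensembl = c("ENSG_A", "ENSG_B", "ENSG_C"),
  symbol  = c("SYM1", "SYM1", "SYM2"),
  stringsAsFactors = FALSE
)

# Count how many Ensembl IDs point at each symbol
n_ids <- table(id_map$symbol)
shared_symbols <- names(n_ids)[n_ids > 1]

# Rows to inspect case by case; otherwise they stay as separate genes
to_inspect <- id_map[id_map$symbol %in% shared_symbols, ]
print(to_inspect)
```

Nothing here merges or sums counts; the rows stay keyed by Ensembl ID.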

ADD REPLY • link 18 hours ago James W. MacDonald 67k