Hi, I initially ran the following code in DESeq2:
dds=DESeqDataSet(se,design=~TRAIT)
dds=DESeq(dds)
res=results(dds)
My DESeq2 results currently look like this:
log2 fold change (MLE): TRAIT S vs N
Wald test p-value: TRAIT S vs N
DataFrame with 42800 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000000003.15 4.5568518 -0.1048029 0.5527906 -0.1895889 0.849631 NA
ENSG00000000005.6 0.0802772 -0.1570978 3.0089434 -0.0522103 0.958361 NA
ENSG00000000419.14 152.9563242 0.0437748 0.1139477 0.3841658 0.700856 0.857539
ENSG00000000457.14 271.1873887 0.0873888 0.0923764 0.9460082 0.344144 0.606002
ENSG00000000460.17 55.4021510 0.0671604 0.1729930 0.3882263 0.697849 0.856165
I want to run enrichGO on the results above:
genes_to_test <- rownames(res[res$log2FoldChange>0.5,])
GO_res <- enrichGO(genes_to_test, OrgDb = org.Hs.eg.db, keyType = "ENSEMBL", ont="BP")
However, as you can see, my Ensembl IDs carry a version suffix (the ID followed by a dot and a version number, e.g. ENSG00000000003.15), so they do not match the "ENSEMBL" keyType.
Is there an alternative keyType that will accept the versioned IDs?
My second question: how can the initial DESeq2 results be changed so that the row names are gene names instead of versioned Ensembl IDs?
Many thanks
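A minimal sketch of one common workaround, assuming the goal is simply to make the IDs match the unversioned "ENSEMBL" keys: as far as I know org.Hs.eg.db has no keytype for versioned IDs, so the version suffix is stripped before calling enrichGO. The object res_toy and its log2FoldChange values are made up for illustration; with real data the row names of res would be used instead.

```r
# Toy stand-in for the DESeq2 results; the log2FoldChange values are made up.
res_toy <- data.frame(
  log2FoldChange = c(1.2, NA, 0.04, 0.9, -0.7),
  row.names = c("ENSG00000000003.15", "ENSG00000000005.6",
                "ENSG00000000419.14", "ENSG00000000457.14",
                "ENSG00000000460.17")
)

# which() skips rows whose log2FoldChange is NA.
keep <- which(res_toy$log2FoldChange > 0.5)
versioned_ids <- rownames(res_toy)[keep]

# Remove the trailing ".<digits>" version suffix.
genes_to_test <- sub("\\.[0-9]+$", "", versioned_ids)
print(genes_to_test)

# Only attempted when the Bioconductor packages are installed.
if (requireNamespace("clusterProfiler", quietly = TRUE) &&
    requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  GO_res <- clusterProfiler::enrichGO(genes_to_test,
                                      OrgDb = org.Hs.eg.db::org.Hs.eg.db,
                                      keyType = "ENSEMBL", ont = "BP")
}
```

Stripping the suffix on a copy of the IDs, rather than overwriting the row names, keeps the original versioned IDs available if they are needed later.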
Hi James, I hope to piggyback on this question about how to deal with Ensembl IDs that carry a version suffix. In contrast to "DESeq2 results converting ENSG IDs to Gene Symbols, more than one ENSG ID per Gene symbol", my data have Ensembl IDs with a suffix indicating the version. When I ran
original_ids <- rownames(ddsHTSeq_SubsetxUnStimRefConditions)
stripped_ids <- sub("\\.[0-9]+$", "", original_ids)
there were no duplicated values in stripped_ids. However, a unique Ensembl ID (without the version suffix) still mapped to multiple gene symbols or Entrez IDs, which is expected. In this case, I think we should treat these entries as separate "genes" (see "Filtering read counts matrix: how to deal with duplicated gene symbols, different ENSEMBL ids"). Or is it appropriate to just sum the counts for genes that have different Ensembl IDs but map to the same gene symbol? Thank you for your help.
I am not sure I understand. You say that a unique Ensembl ID maps to multiple gene symbols. That doesn't mean you have two genes (it's just one Ensembl ID, and one row in your data), but instead that you have one-to-many mapping of that ID to other IDs. If that's the case, I normally just pick one symbol and go with that.
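As a rough illustration of "pick one symbol", the sketch below keeps the first symbol listed for each Ensembl ID. The mapping table is invented (ENSG_A, SYM1 and so on are placeholders, not real identifiers). The guarded part shows the same idea with AnnotationDbi::mapIds, whose multiVals = "first" option returns one value per key; it only runs if the annotation packages are installed.

```r
# Hypothetical one-to-many mapping table (IDs and symbols are invented).
map <- data.frame(
  ENSEMBL = c("ENSG_A", "ENSG_A", "ENSG_B"),
  SYMBOL  = c("SYM1", "SYM1_ALT", "SYM2"),
  stringsAsFactors = FALSE
)

# Keep the first symbol listed for each Ensembl ID.
ids <- unique(map$ENSEMBL)
first_symbol <- map$SYMBOL[match(ids, map$ENSEMBL)]
names(first_symbol) <- ids
print(first_symbol)

# The same choice with a real annotation package, if available.
if (requireNamespace("AnnotationDbi", quietly = TRUE) &&
    requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  real_ids <- c("ENSG00000000003", "ENSG00000000419")
  sym <- AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db,
                               keys = real_ids, column = "SYMBOL",
                               keytype = "ENSEMBL", multiVals = "first")
}
```

Storing the chosen symbol as an extra column next to the Ensembl ID, instead of replacing the row names, avoids trouble when a symbol is missing or repeated.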
If you are saying the opposite (that TWO different Ensembl IDs are mapping to the same gene symbol), then I would consider that to be two genes. The alternative is to look at each individual case and try to figure out why two (or more) Ensembl IDs are mapping to the same symbol, and make individual decisions for each one. That sounds complex and yet boring at the same time, so I just default to thinking that the Ensembl ID is 'real' and the gene symbols are just simplifications for our biologist friends.
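For the opposite case, a small sketch of how one might find symbols shared by several Ensembl IDs while still keeping each Ensembl ID as its own row. The annotation table and the label column are invented for illustration; nothing here sums or merges counts.

```r
# Hypothetical annotation: two Ensembl IDs share one symbol (invented values).
ann <- data.frame(
  ENSEMBL = c("ENSG_A", "ENSG_B", "ENSG_C"),
  SYMBOL  = c("SYM1", "SYM1", "SYM2"),
  stringsAsFactors = FALSE
)

# Symbols that appear for more than one Ensembl ID.
shared <- unique(ann$SYMBOL[duplicated(ann$SYMBOL)])

# Rows stay keyed by Ensembl ID; the symbol is only a display label,
# made unambiguous by appending the Ensembl ID where the symbol is shared.
ann$label <- ifelse(ann$SYMBOL %in% shared,
                    paste0(ann$SYMBOL, " (", ann$ENSEMBL, ")"),
                    ann$SYMBOL)
print(ann)
```

The list in shared is also a convenient starting point for anyone who does want to inspect the individual cases.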