Is goana function for Gene Ontology enrichement analysis suitable for both edgeR and DESeq2 DEGs lists?
3
1
Entering edit mode
Raito92 ▴ 60
@raito92-20399
Last seen 2.5 years ago
Italy

Hello, In a previous RNA-Seq work, I've successfully used the function goana to perform a Gene Ontology enrichment analysis, as shown in this edgeR quasi-likelihood pipeline.

tr <- glmTreat(fit, contrast=B.LvsP, lfc=log2(1.5))

...

go <- goana(tr, species="Mm")

Particularly, as shown above, the function takes as input a tr object, a list of DEGs, which, in a edgeR pipeline, appears similar to this (just random data to show what the format looks like):

    Length  Symbol  logFC   logCPM  F   PValue  FDR
SMEL4.1_02g001580.1 10924   SMEL4.1_02g001580.1 0.386082756876937   5.47193248924038    25.2797509008095    4.42855378062147e-06    0.0722591857534324
SMEL4.1_07g022510.1 3129    SMEL4.1_07g022510.1 -0.748615816092187  5.36055092214935    23.9132538015929    7.36455549825986e-06    0.0722591857534324
SMEL4.1_08g023290.1 6074    SMEL4.1_08g023290.1 0.652417990788294   5.68631841662533    22.8866172367585    1.08564820963941e-05    0.0722591857534324
SMEL4.1_06g029720.1 4487    SMEL4.1_06g029720.1 3.05216960716702    -1.21675146726858   22.8190148238499    1.11482103848422e-05    0.0722591857534324
SMEL4.1_06g000870.1 9348    SMEL4.1_06g000870.1 -0.769865924141295  7.81712996389281    22.2868032192706    1.36528711320395e-05    0.0722591857534324
SMEL4.1_12g000930.1 11716   SMEL4.1_12g000930.1 -0.528781930515847  3.76724160223338    20.8372219018991    2.39400058202952e-05    0.105587395670412
SMEL4.1_04g022680.1 5765    SMEL4.1_04g022680.1 0.449553768563767   6.33454740738143    20.1535126943511    3.1322861836535e-05 0.118413841825746
SMEL4.1_11g016210.1 4515    SMEL4.1_11g016210.1 -0.626808335105108  5.93739473093748    19.0991859367415    4.76537706101889e-05    0.157632716457179
SMEL4.1_11g022810.1 3534    SMEL4.1_11g022810.1 -0.583940556612912  4.31038909943095    18.3097484326857    6.55191830346608e-05    0.158176465103364

In the past, I've succesfully used tr as imput and even provided an external list of DEGs (a sub-set of them).

But this time I'm interested in using the goana function for the output of a DESeq2 analysis.

The DEG list is quite different, and looks, as shown in the official tutorial, like this (again, an example to show the format):

baseMean    log2FoldChange  lfcSE   stat    pvalue  padj
SMEL4.1_02g001580.1 2366.00298555111    0.364304598757931   0.074243385889273   4.9068963436185 9.25289021738908e-07    0.0076683327676612
SMEL4.1_07g022510.1 2196.50866292512    -0.761717027785759  0.152859394549787   -4.98312210400432   6.25664404354783e-07    0.0076683327676612
SMEL4.1_06g000870.1 12072.4148262247    -0.780492533034506  0.16593702035746    -4.70354675137095   2.55680515074842e-06    0.0105947613434138
SMEL4.1_12g000930.1 727.48576752828 -0.542466292269438  0.11410126216564    -4.75425321309718   1.99181275267645e-06    0.0105947613434138
SMEL4.1_08g023290.1 2745.00816946739    0.642851279859146   0.140000470422468   4.59177942702098    4.39482718044595e-06    0.0145688521031783
SMEL4.1_09g005850.1 5266.90686547789    -0.394356643158581  0.0875154389042871  -4.50613798086389   6.60181671396486e-06    0.0182375186723279
SMEL4.1_04g022680.1 4306.1114111016 0.436780400377881   0.0981091951616122  4.45198230052123    8.50811805120964e-06    0.0201460080998285
SMEL4.1_11g001830.1 1944.89865643966    5.53066904283725    1.26836216149314    4.3604809499571 1.29776880663403e-05    0.0268881474624488
SMEL4.1_11g016210.1 3279.56438989611    -0.635759169045224  0.148173951998916   -4.29062706682666   1.78169304175309e-05    0.032812846852286

Since, apparently, goana (belonging to the package limma) isn't designed specifically for edgeR, I'd like to run goana on a DESeq2 DEGs list provided externally. However, in order to do this, I'd still need to modify my original DESeq2 list so that it looks like an edgeR DEGs list, and I was wondering how to change it. So, trying to figure out a correspondence between the headers of the lists obtained from edgeR and DESeq2... my guess is that goana needs for sure the colums logFC and PValue to run enrichment analysis, which are log2FoldChange and pvalue, respectively, in the DESeq2 object.

I'm not sure which other parameters are necessary. As far as I know, p-adj in DESeq2 is the equivalent of the edgeR's FDR.
logCPM isn't shown at all in DESeq2 ... F and stat are probably both parameters related to the specific DEGs statistical analysis so may not be necessary in the further GO enrichment test...

Should I provide these parameters in the input list for goana? Are they necessary for the analysis?

Am I making wrong assumptions, or forgetting anything important? Is it correct to use goana on DESeq2 data this way? Thanks in advance for your tips!

DESeq2 edgeR goana GO • 4.2k views
ADD COMMENT
2
Entering edit mode
@gordon-smyth
Last seen 5 hours ago
WEHI, Melbourne, Australia

goana will work on fitted model objects from limma or edgeR, see ?goana.DGELRT. It will work on objects created by eBayes, glmTreat, glmLRT, glmQLFTest or exactTest.

goana requires Entrez Gene IDs and does not work for non-model organisms. As James has explained, for new organisms you need to supply your own gene.pathway table and use either kegga or goseq.

ADD COMMENT
0
Entering edit mode

Very interesting and specific.

It isn't a problem for the experiment I'm running right now, but in the future, in case I need to use specifically goana/ kegga(which somehow allows me to use my own 'customized' GO/genes correspondence dataset, that's why I'm sticking to it) in a DESeq2 pipeline, is there any way to obtain such objects? limma for instance isn't a package designed specifically for edgeR. The objects created by glmTreat, glmQLFTest glmLRT to the best of my knowledge are obtained in edgeR pipelines. I have no experience about the other ones you mentioned.

ADD REPLY
0
Entering edit mode

goana and kegga are generic functions and they have specific methods written for limma and edgeR objects. See goana.DGELRT.

ADD REPLY
1
Entering edit mode
@mikelove
Last seen 10 hours ago
United States

See DESeq2 workflow for description of the results columns.

padj is the FDR threshold-able column.

Our measure of expression strength is the mean of normalized counts: baseMean. These are described in the workflow and vignettes. See "Access to all calculated values" in DESeq2 vignette for description.

ADD COMMENT
0
Entering edit mode

Hi, thank you for your answer.

I've tried but apparently the whole thing is harder than just changing columns... The whole tr object is more complicated.

I checked online but I couldn't find any straight way to use goana on DESeq2. How could do I that?

In the past I just prepared an external file to provide a sub-set of DEGs to edgeR (a subpart of tr), imported it and assigned it to tr$table.

I was considering alternative tools, which could also allow me to provide a personal Gene Ontology terms/genes correspondence... do you have any suggestions?

ADD REPLY
1
Entering edit mode

If it requires something downstream of edgeR I would just use edgeR instead. I don't have any alternative tools in mind for personal GO. I use goseq which easily works with DESeq2, but as you say is built around specific organisms.

ADD REPLY
0
Entering edit mode

Thank you, in my work I performed the same DEGs analysis with both DESeq2 and edgeR. The reason was that the comparisons between some samples gave 0 DEGs in edgeR, which sounded a bit suspicious (in my experience edgeR tends to be more stringent). But on DESeq2 the results were virtually the same (just a few units if any at all).

In general, DESeq2 proved to be slightly more sensitive among all samples and always found more DEGs, but the mutual ones had the same FC and similar statistical significance.

So, I'll just switch back to my edgeR DEGs lists.

ADD REPLY
1
Entering edit mode
@james-w-macdonald-5106
Last seen 9 hours ago
United States

Orthogonal to the original question, but you might consider using goseq instead of goana.

ADD COMMENT
0
Entering edit mode

Hello, thanks for your suggestion.

I was also wondering if there is one way to provide a personal Gene Ontology terms/genes correspondence to goseq or any alternative package I could use?

I'm working with a species for which no Gene Ontology database is available (so I can't use services like Panther nor any official one) and I have to use the GO annotation made exactly on the genome provided by our partners.

From goseq's vignette,

In order to link GO categories to genes, goseq uses the organism packages from Bioconductor. These packages are named org.<Genome>.<ID>.db, where <Genome> is a short string identifying the genome and <ID> is a short string identifying the gene identifier. Currently, goseq will automatically retrieve the mapping between GO categories and genes from the relevant package (as long as it is installed) for commonly used genome/ID combinations.

ADD REPLY
2
Entering edit mode

You should also peruse the help pages. You will need two things. First, for nullp, you need a measure of the median gene length (this is the 'bias.data' argument). Since you have done RNA-Seq, I presume you can get that. Second, for goseq itself, you need to provide something for the 'gene2cat' argument, which is either a data.frame containing the gene ID in the first column and the GO terms in the second, or a list with gene IDs as the name of the list and GO IDs as the values.

You show above that you use 'Mm' as the argument for species in your call to goana, which can't possibly work correctly if you have something other than mouse data. Is that just an example? For goana you need to use an OrgDb package, and if you have some not-standard species I don't see how you ever got that to work. You can use kegga with a 'gene.pathway' argument, in which case that would be identical to what you use for 'gene2cat' in goseq.

ADD REPLY
0
Entering edit mode

Thank you for the goseq suggestion!

Yes, above I used 'Mm' just as an example, I was just mentioning the 'tr' object.

As I said above, I need to work on a non-standard species, so no OrgDb package is available in my case.

About kegga, doesn't it perform a KEGG analysis instead?

ADD REPLY
2
Entering edit mode

kegga performs a hypergeometric test using KEGG pathways. It's the same test as used for GO terms, only the groups of genes come from KEGG instead. In other words, you have a set of genes, some of which are in a group and the rest that aren't in the group (where group is either 'in a GO term' or 'in a KEGG pathway'), and you have an indicator for each gene that says if it's significant or not. The test doesn't care why a gene is in one group or another.

But goana is hard coded to get the GO membership using an OrgDb package, and kegga has the facility to allow you to provide the gene -> group mappings. So you can provide kegga with a gene -> GO mapping, and it will happily do the same test that goana would have done, but without the requirement that you get the mappings from an OrgDb package.

ADD REPLY
0
Entering edit mode

Thank you for this suggestion, it's going to be very helpful!

However, how can I provide the customized list to kegga? I checked this page on R Documentation (same one for goana and kegga), but apparently the synthax is the same as goana and in the second line provided it still refer to the species Hs (Homo sapiens).

"kegga"(de, coef = ncol(de), geneid = rownames(de), FDR = 0.05, trend = FALSE, ...)

"kegga"(de, universe = NULL, species = "Hs", species.KEGG = NULL, convert = FALSE, gene.pathway = NULL, pathway.names = NULL, prior.prob = NULL, covariate=NULL, plot=FALSE, ...)
ADD REPLY
1
Entering edit mode

See the help page for the gene.pathway argument

ADD REPLY
2
Entering edit mode

Also there is this

The statistical approach provided here is the same as that
     provided by the goseq package, with one methodological difference
     and a few restrictions. Unlike the goseq package, the gene
     identifiers here must be Entrez Gene IDs and the user is assumed
     to be able to supply gene lengths if necessary. The goseq package
     has additional functionality to convert gene identifiers and to
     provide gene lengths. The only methodological difference is that
     'goana' and 'kegga' computes gene length or abundance bias using
     'tricubeMovingAverage' instead of monotonic regression. While
     'tricubeMovingAverage' does not enforce monotonicity, it has the
     advantage of numerical stability when 'de' contains only a small
     number of genes.

I don't see how you 'supply gene lengths if necessary', and I do know how to use goseq, so I tend to go with what I know.

ADD REPLY
1
Entering edit mode

Gene lengths can be supplied to goana or kegga via the covariate or trend arguments. There's an example at the bottom of the goana help page.

ADD REPLY
1
Entering edit mode

So there is. Thanks for pointing that out!

ADD REPLY
0
Entering edit mode

data.frame linking genes to pathways. First column gives gene IDs, second column gives pathway IDs. By default this is obtained automatically by getGeneKEGGLinks(species.KEGG).

Pathways, or, indeed, as you said, in this case GO terms.

Something similar (just making up numbers)

GO: 000000002 / Gene1 
GO: 000000001 / Gene1
GO: 000000007 / Gene1
GO: 000000001 / Gene2

I'll try this, thank you!

ADD REPLY

Login before adding your answer.

Traffic: 704 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6