Question

Weird results (ribosomal proteins / Y-linked genes) in limma/voom

0

Entering edit mode

blofeld ▴ 10

@blofeld-11309

Last seen 8.3 years ago

Bonn, Germany

Hi everybody,

I am trying to compare gene expression using RNA Seq data from 50 and 50 human tissue samples with two phenotypically similar diseases that may exhibit distinct pathophysiological differences on the transcriptomic level (= my hypothesis).

Several, but not all subjects in the study provided more than one sample, in one case even one from each disease type. Thus, I would want to correct for the subject as a random effect.

Additional (fixed) covariates include gender (disease B has more men that A) and a measure of disease severity (continuous integer values).

In order to be able to incorporate both my fixed and random factors (as a blocking factor), I used limma/voom

design <- model.matrix(~1+ targets$disease + targets$gender + targets$severity) y <- voom(x,design,plot=F, normalize="quantile") corfit <- duplicateCorrelation(y, design, block=targets$Patient) y <- voom(x,design,plot=FALSE, block=targets$Patient, correlation=corfit$consensus, normalize="quantile") fit=lmFit(y, design, block=targets$Patient, correlation=corfit$consensus) fit <- eBayes(fit)

I have also tried TMM normalization, or voomWithQualityWeights.

My problem is that I get a number of DE genes for the disease B vs. A with large FCs (up to 10fold) that are mostly Y-linked, and/or ribosomal proteins. These results do not make sense biologically.

As a comparison, I have used DESeq2 to analyze the same data. Correcting for severity and patient as fixed factors (yes, I know, patient should be random...), I find very different genes to be differentially expressed, with lower FCs. Those make more sense biologically.

I know that DESeq2 is probably not appropriate for my data, and I would like to use limma here. But what am I doing wrong here? Is it a normalization problem?

Many thanks in advance for your thoughts and comments!

limma voom rnaseq • 1.6k views

ADD COMMENT • link updated 8.3 years ago by Gordon Smyth 52k • written 8.3 years ago by blofeld ▴ 10

0

Entering edit mode

You don't show what topTable command you used to get your DE list. Large log-fold changes for Y-linked genes; are you sure you're not testing for the sex effect? Also, everything but severity had better be a factor type.

ADD REPLY • link 8.3 years ago Aaron Lun ★ 28k

0

Entering edit mode

Many thanks for your comment.

You are right, I should have posted that (see below).

> topTable(fit, coef=2)
                ensembl_gene_id ensembl_transcript_id entrezgene hgnc_symbol
ENSG00000129824 ENSG00000129824       ENST00000250784       6192      RPS4Y1
ENSG00000131002 ENSG00000131002       ENST00000407724     246126      TXLNGY
ENSG00000067048 ENSG00000067048       ENST00000360160       8653       DDX3Y
ENSG00000012817 ENSG00000012817       ENST00000469599       8284       KDM5D
ENSG00000114374 ENSG00000114374       ENST00000338981       8287       USP9Y
ENSG00000183878 ENSG00000183878       ENST00000331397       7404         UTY
ENSG00000067646 ENSG00000067646       ENST00000383052       7544         ZFY
ENSG00000099725 ENSG00000099725       ENST00000528056       5616        PRKY
ENSG00000198692 ENSG00000198692       ENST00000361365       9086      EIF1AY
ENSG00000235109 ENSG00000235109       ENST00000481934      64288     ZSCAN31
                                                                                                               description
ENSG00000129824                                       ribosomal protein S4, Y-linked 1 [Source:HGNC Symbol;Acc:HGNC:10425]
ENSG00000131002                                     taxilin gamma pseudogene, Y-linked [Source:HGNC Symbol;Acc:HGNC:18473]
ENSG00000067048                                           DEAD-box helicase 3, Y-linked [Source:HGNC Symbol;Acc:HGNC:2699]
ENSG00000012817                                                  lysine demethylase 5D [Source:HGNC Symbol;Acc:HGNC:11115]
ENSG00000114374                               ubiquitin specific peptidase 9, Y-linked [Source:HGNC Symbol;Acc:HGNC:12633]
ENSG00000183878 ubiquitously transcribed tetratricopeptide repeat containing, Y-linked [Source:HGNC Symbol;Acc:HGNC:12638]
ENSG00000067646                                          zinc finger protein, Y-linked [Source:HGNC Symbol;Acc:HGNC:12870]
ENSG00000099725                                    protein kinase, Y-linked, pseudogene [Source:HGNC Symbol;Acc:HGNC:9444]
ENSG00000198692                   eukaryotic translation initiation factor 1A, Y-linked [Source:HGNC Symbol;Acc:HGNC:3252]
ENSG00000235109                              zinc finger and SCAN domain containing 31 [Source:HGNC Symbol;Acc:HGNC:14097]
                     logFC    AveExpr         t      P.Value    adj.P.Val
ENSG00000129824  4.7201875  1.0601605  6.609595 1.924885e-09 3.699244e-05
ENSG00000131002  4.2981889  1.1965928  6.241831 1.062628e-08 5.891736e-05
ENSG00000067048  4.0875311  1.3380538  6.161889 1.532869e-08 5.891736e-05
ENSG00000012817  4.1857966  1.1529671  6.184375 1.383026e-08 5.891736e-05
ENSG00000114374  4.1592220  1.0849112  6.191685 1.337496e-08 5.891736e-05
ENSG00000183878  3.7892623  1.3358802  5.865323 5.866505e-08 1.879042e-04
ENSG00000067646  3.5861337  0.1536729  5.476375 3.260742e-07 8.952134e-04
ENSG00000099725  2.7008122  0.6064057  5.056551 1.946346e-06 4.675611e-03
ENSG00000198692  3.1083149 -0.5579705  4.857377 4.425510e-06 9.449939e-03
ENSG00000235109 -0.4464802  4.7901975 -4.256346 4.706265e-05 7.537083e-02
                       B
ENSG00000129824 9.944145
ENSG00000131002 8.572458
ENSG00000067048 8.335578
ENSG00000012817 8.285957
ENSG00000114374 8.261634
ENSG00000183878 7.209747
ENSG00000067646 5.140808
ENSG00000099725 3.866204
ENSG00000198692 2.705380
ENSG00000235109 1.858661
> is.factor(targets$disease) 
[1] TRUE
> is.factor(targets$gender)
[1] TRUE
> is.integer(targets$severity)
[1] TRUE

ADD REPLY • link 8.3 years ago blofeld ▴ 10

0

Entering edit mode

I also looked into the gender effects - very similar genes appear, with better p-values.

> topTable(fit, coef=3)
                ensembl_gene_id ensembl_transcript_id entrezgene hgnc_symbol
ENSG00000099725 ENSG00000099725       ENST00000528056       5616        PRKY
ENSG00000067646 ENSG00000067646       ENST00000383052       7544         ZFY
ENSG00000129824 ENSG00000129824       ENST00000250784       6192      RPS4Y1
ENSG00000005889 ENSG00000005889       ENST00000497813       7543         ZFX
ENSG00000233864 ENSG00000233864       ENST00000457658      64595      TTTY15
ENSG00000198692 ENSG00000198692       ENST00000361365       9086      EIF1AY
ENSG00000114374 ENSG00000114374       ENST00000338981       8287       USP9Y
ENSG00000012817 ENSG00000012817       ENST00000469599       8284       KDM5D
ENSG00000131002 ENSG00000131002       ENST00000407724     246126      TXLNGY
ENSG00000067048 ENSG00000067048       ENST00000360160       8653       DDX3Y
                                                                                                     description
ENSG00000099725                          protein kinase, Y-linked, pseudogene [Source:HGNC Symbol;Acc:HGNC:9444]
ENSG00000067646                                zinc finger protein, Y-linked [Source:HGNC Symbol;Acc:HGNC:12870]
ENSG00000129824                             ribosomal protein S4, Y-linked 1 [Source:HGNC Symbol;Acc:HGNC:10425]
ENSG00000005889                                zinc finger protein, X-linked [Source:HGNC Symbol;Acc:HGNC:12869]
ENSG00000233864 testis-specific transcript, Y-linked 15 (non-protein coding) [Source:HGNC Symbol;Acc:HGNC:18567]
ENSG00000198692         eukaryotic translation initiation factor 1A, Y-linked [Source:HGNC Symbol;Acc:HGNC:3252]
ENSG00000114374                     ubiquitin specific peptidase 9, Y-linked [Source:HGNC Symbol;Acc:HGNC:12633]
ENSG00000012817                                        lysine demethylase 5D [Source:HGNC Symbol;Acc:HGNC:11115]
ENSG00000131002                           taxilin gamma pseudogene, Y-linked [Source:HGNC Symbol;Acc:HGNC:18473]
ENSG00000067048                                 DEAD-box helicase 3, Y-linked [Source:HGNC Symbol;Acc:HGNC:2699]
                     logFC    AveExpr         t      P.Value    adj.P.Val
ENSG00000099725  4.5733716  0.6064057  6.815547 7.283925e-10 4.046800e-06
ENSG00000067646  5.6232992  0.1536729  6.784889 8.422936e-10 4.046800e-06
ENSG00000129824  6.6336546  1.0601605  6.586468 2.145456e-09 8.246275e-06
ENSG00000005889 -0.3415792  6.4751894 -6.056646 2.475845e-08 3.660061e-05
ENSG00000233864  4.6152907 -0.7869401  6.852263 6.118939e-10 4.046800e-06
ENSG00000198692  5.0844618 -0.5579705  6.786910 8.342729e-10 4.046800e-06
ENSG00000114374  5.8648622  1.0849112  6.408775 4.915018e-09 1.349383e-05
ENSG00000012817  5.9082179  1.1529671  6.358619 6.201300e-09 1.390369e-05
ENSG00000131002  6.1177434  1.1965928  6.339047 6.788850e-09 1.390369e-05
ENSG00000067048  5.9136206  1.3380538  6.304632 7.958193e-09 1.390369e-05
                       B
ENSG00000099725 9.635516
ENSG00000067646 9.292060
ENSG00000129824 9.117774
ENSG00000005889 8.866058
ENSG00000233864 8.849122
ENSG00000198692 8.827255
ENSG00000114374 8.511719
ENSG00000012817 8.368047
ENSG00000131002 8.302359
ENSG00000067048 8.232692

ADD REPLY • link 8.3 years ago blofeld ▴ 10

score 2 · Answer 1 · 2016-10-09

You're quite right that you can't analyse the data with patient as a fixed effect. If you do that, then the disease effect will be estimated entirely from the one patient who provided a sample for both diseases, and all other patients will be ignored.

You also can't correct for the gender effect simply by including gender as an additive effect because there is apparently an interaction between gender and disease (males having more severe disease). So I'm not surprised that the Y chromosome genes are near the top of your toptable.

There are two possible approaches. Either you can estimate the disease effect for males and females separately, for example by using the model:

gender + gender:disease

Or alternatively, you might filter the sex-linked genes (Y chromosome and Xist) from your analysis and ignore gender in your analysis. We do that frequently in our own in-house analyses when samples are not balanced by gender.

The first approach may be better if the disease very different in males and females. The second approach may be better if the transcriptional profile of the disease at a given severity is much the same in males and females even though it may have arisen somewhat differently.

There also may be complicated interactions between disease and severity, but it's hard to comment on that.

I would also include the options robust=TRUE when you run eBayes().

score 1 · Answer 2 · 2016-10-09

1

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 45 minutes ago

The city by the bay

Have you tried examining the expression profile of some of those genes (e.g., PRKY) across all disease/sex combinations? This can provide a useful sanity check in cases like this.

Apply some simple normalisation, e.g., edgeR's cpm normalisation with a prior of 3.
For a particular gene, make a box/violin/swarm plot (depending on what visualisation method you prefer) of expression values for each disease/sex(/severity) combination.
Compare across combinations to see if the change in expression is genuinely present.

The reported log-fold changes from topTable are quite large, so I would expect that any changes in expression should be quite apparent in the raw data. If that's the case, then it's understandable that limma should report these genes as being DE. That fact that they don't make biological sense is hardly limma's fault, it just works with the data that's been provided. The only potential issues I can forsee are:

You've mislabelled the male/female samples. This is easy enough to check based on XIST expression.
The disease/sex effects are not additive. Switch to a disease/sex interaction model to deal with this.

P.S. Putting patient as a fixed effect is not wrong per se, but in an experimental set-up like this, it means that your disease comparison will be performed using only those patients who contributed samples to both disease types.

ADD COMMENT • link 8.3 years ago Aaron Lun ★ 28k

0

Entering edit mode

Hi Aaron,

no offense, I am a big fan of limma and have used it for many array studies in the past.

I am just puzzled by the results I am getting here, and try to understand what is going on. If it is biology, so be it, but I have the feeling that it is something I am doing wrong with my analysis.

Many thanks for your helpful comments. Indeed, the raw/uncorrected cpms for disease A and B differ, albeit not at all at this proportion (like 10x). This is understandable and expected, since I have more males in the B group - this is why I wanted to correct for gender. I will try to model the interaction, it has been suggested that males have more severe disease than females.

Many thanks!

P.S. Just saw your comment on the subject as fixed factor. So you say that all results I got in DESeq2 were in fact a comparison of two samples A and B from a single individual? I was puzzled, because in those results, all my limma/voom DE genes are reported as unregulated, with an FDR of 0.99. But this would be not very surprising in a case where the comparison is within a single patient.

ADD REPLY • link 8.3 years ago blofeld ▴ 10

0

Entering edit mode

Yes, the fixed effect results are based on a single individual. Obviously you won't get any sex linked genes then.

ADD REPLY • link 8.3 years ago Gordon Smyth 52k