Question

Warning messages from diffSplice in Limma

1

Entering edit mode

willj ▴ 30

@willj-8763

Last seen 7.5 years ago

France

I have data from Affy Human exon arrays 1.0 ST, and I'm using the diffSplice function in Limma to detect differential splicing. I'm getting some warnings (see below) when I call diffSplice and I'd like to know if I should be concerned about them (results from topSplice and plotSplice look fine as far as I can tell).

The experiment has a Healthy group and 3 disease groups, with a design as follows:

> design
   Dis1 Healthy Dis2 Dis3
1    1       0       0   0
2    1       0       0   0
3    1       0       0   0
4    1       0       0   0
5    1       0       0   0
6    1       0       0   0
7    0       1       0   0
8    0       1       0   0
9    0       1       0   0
10   0       1       0   0
11   0       1       0   0
12   0       1       0   0
13   0       0       1   0
14   0       0       1   0
15   0       0       1   0
16   0       0       1   0
17   0       0       1   0
18   0       0       1   0
19   0       0       0   1
20   0       0       0   1
21   0       0       0   1
22   0       0       0   1
23   0       0       0   1
24   0       0       0   1
attr(,"assign")
[1] 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$Conds
[1] "contr.treatment"

When I run diffSplice it gives some warning messages:

> fit <- lmFit(rma.probesets.filt, design)
> dSplice <- diffSplice(fit[,"Dis1"], geneid="geneNames", exonid="start")
Total number of exons:  787188 
Total number of genes:  31123 
Number of genes with 1 exon:  4769 
Mean number of exons in a gene:  25 
Max number of exons in a gene:  504965 
Warning messages:
1: In rowsum.default(exon.stat, geneid, reorder = FALSE) :
  missing values for 'group'
2: In rowsum.default(u2, geneid, reorder = FALSE) :
  missing values for 'group'
3: In rowsum.default(exon.coefficients * u2, geneid, reorder = FALSE) :
  missing values for 'group'
4: In rowsum.default(exon.t^2, geneid, reorder = FALSE) :
  missing values for 'group'

Should I be worried about these warning messages?

Also (perhaps related?) should I be worried that my geneid column has lots of unlabelled probesets that are all being considered by diffSplice as one giant gene with 504,965 exons?

Many thanks for any help/advice.

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] genefilter_1.56.0        pd.huex.1.0.st.v2_3.14.1 DBI_0.5-1                RSQLite_1.1-2           
 [5] oligo_1.38.0             Biostrings_2.42.1        XVector_0.14.0           IRanges_2.8.1           
 [9] S4Vectors_0.12.1         oligoClasses_1.36.0      limma_3.30.8             Biobase_2.34.0          
[13] BiocGenerics_0.20.0      BiocInstaller_1.24.0    

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.9                GenomeInfoDb_1.10.2        bitops_1.0-6               iterators_1.0.8
 [5] tools_3.3.2                zlibbioc_1.20.0            digest_0.6.11              statmod_1.4.27
 [9] bit_1.1-12                 annotate_1.52.1            memoise_1.0.0              preprocessCore_1.36.0
[13] lattice_0.20-34            ff_2.2-13                  Matrix_1.2-7.1             foreach_1.4.3
[17] affxparser_1.46.0          grid_3.3.2                 AnnotationDbi_1.36.2       survival_2.40-1
[21] XML_3.98-1.5               codetools_0.2-15           GenomicRanges_1.26.2       splines_3.3.2
[25] SummarizedExperiment_1.4.0 xtable_1.8-2               RCurl_1.95-4.8             affyio_1.44.0

limma diffSplice • 1.4k views

ADD COMMENT • link updated 7.8 years ago by Gordon Smyth 52k • written 7.8 years ago by willj ▴ 30

score 2 · Answer 1 · 2017-04-19

As you may have already realized, the geneNames column of fit$genes contains missing values. This causes the warnings and can be fixed by removing the rows corresponding to the NA values. Otherwise, the function will consider all NA exons as being part of the same gene, which would be obviously silly.

Also, use of fit[,"Dis1"] doesn't make much sense for a design matrix without an intercept. The "exon-specific log-fold changes" in this setting would just be the log-expression of each exon in the Dis1 group, rather than the log-fold change between two groups (e.g., Dis1 and Healthy). Use contrasts.fit prior to running diffSplice if you want to compare exon usage between groups.

score 2 · Answer 2 · 2017-04-19

Yes, you should be worried by the unlabelled probesets. diffSplice() assumes that you know which gene each exon belongs to. You have 787k exons in total but 505k of these have no gene identifier. So you do need to fix this basic data problem before proceeding. Is there no annotation as to which exons belong to the same gene? As Aaron has advised, you will have to remove the exons that have no annotation.

I am going to edit diffSplice() so that it treats all NA values for geneid as different genes, each with a single exon.

Also, you would find it easier to define the design matrix like this:

Conds <- relevel(Conds, ref="Healthy")
design <- model.matrix(~Conds)

That way, each disease will be compared back to healthy. Then

ds <- diffSplice(fit, ...)

will do the differential splicing analysis for all the diseases at once.