Question

sva + egdeR - differential expression analysis - RNA-seq data

0

Entering edit mode

mrodrigues.fernanda ▴ 10

@mrodriguesfernanda-9433

Last seen 8.0 years ago

University of Illinois, Urbana-Champaign

Dear list,
I am performing an RNA-seq analysis for differential gene expression and I have a question regarding the use of the package sva for the estimation of unknown batch effects.

In the sva vignette, it shows examples of using the package for estimation of surrogate variables and then performing DE analysis using the package limma (I am referring to the section 6 of the sva vignette: "Adjusting for surrogate variables using the limma package")

Is that possible to do the same using the package edgeR instead of limma?

Or is sva not compatible with edgeR?

Sorry if this is a dumb question. I am a little new to the bioinformatics world.

Thank you!!!

sva edgeR • 7.1k views

ADD COMMENT • link 9.3 years ago mrodrigues.fernanda ▴ 10

0

Entering edit mode

mrodrigues.fernanda ▴ 10

@mrodriguesfernanda-9433

Last seen 8.0 years ago

University of Illinois, Urbana-Champaign

Thank you all for your quick responses.
So just to clarify and check if I am doing this right. I can normalize my data (using edgeR in my case) and then provide normalized cpm data to svaseq, right?

I am including my codes below, not sure if I am doing this right. If someone could give me some insight that would be nice. My data include 24 samples (12 of each genotype - QQ and qq - and each sex - F and M, having 6 biological replicates for each group - QQ.F, QQ.M, qq.F an qq.M). Not sure if I included the estimated surrogate variables in the model properly.

Also, I tried using the normal sva function instead of svaseq and it gave me very similar results. I heard however that the svaseq function does not really account for TMM normalization factors, which would be important for my analysis. Could someone clarify if this is true? Thank you!

My codes:

## QC, FILTERING AND NORMALIZATION USING EDGER ##

> d <- readDGE(targets, labels = targets$Label, comment.char = "#", header = TRUE,
columns = c(1,7))

> d <- calcNormFactors(d)

> cpm.values <- cpm(d$counts)

> above1cpm <- rowSums(cpm.values >= 1)

> d.filt <- d[above1cpm >= 6, , keep.lib.sizes = FALSE]

> dim(d.filt)

# [1] 11679 24

> d.filt <- calcNormFactors(d.filt)

> cpm.values.filt < cpm(d.filt$counts)

> head(cpm.values.filt)

(...)
                   Samples
                 Tags                     QQ.M.3        QQ.M.4   QQ.M.5   QQ.M.6
ENSBTAG00000020035   99.33248 182.23360 168.80417 94.63499
ENSBTAG00000011528   48.37818    55.00343    53.30658 62.32472
ENSBTAG00000012594 17.57829    32.66077    20.34709    18.87298
ENSBTAG00000018278 242.30761 244.77718 230.29837 241.66867
ENSBTAG00000021997   13.03218    25.75558    13.58795    19.61574
ENSBTAG00000008490   17.31310    22.38235    19.68511    23.73471

> logCPM <- cpm(d.filt, log = T)

> head(logCPM)

# (...)
# Samples
# Tags QQ.M.5 QQ.M.6
# ENSBTAG00000020035 7.526067 6.557956
# ENSBTAG00000011528 5.863254 5.955458
# ENSBTAG00000012594 4.474122 4.232455
# ENSBTAG00000018278 7.974202 7.910458
# ENSBTAG00000021997 3.891917 4.288119
# ENSBTAG00000008490 4.426424 4.562991

## SVA ##

> Group <- relevel(d.filt$samples$group, ref = "qq.M")
> mod <- model.matrix(~Group)
> colnames(mod)[-1] <- paste0(levels(Group)[-1], "vs", levels(Group)[1])
> head(mod)
#   (Intercept) qq.Fvsqq.M QQ.Fvsqq.M QQ.Mvsqq.M
# 1              1            1                0         0
# 2          1                  1                0                   0
# 3              1                  1            0         0
# 4              1                  1            0         0
# 5          1            1            0         0
# 6          1            1            0         0

> cont.mod <- makeContrasts(QQ.FvsQQ.M = QQ.Fvsqq.M - QQ.Mvsqq.M,
                          qq.Fvsqq.M = qq.Fvsqq.M,
                          QQ.Fvsqq.F = QQ.Fvsqq.M - qq.Fvsqq.M,
                          QQ.Mvsqq.M = QQ.Mvsqq.M,
                          Interact = (QQ.Fvsqq.M - QQ.Mvsqq.M) - (qq.Fvsqq.M),
                          levels = mod)
> cont.mod
# Contrasts
# Levels         QQ.FvsQQ.M qq.Fvsqq.M QQ.Fvsqq.F QQ.Mvsqq.M Interact
#   Intercept           0                  0                  0            0       0
#     qq.Fvsqq.M         0            1           -1            0      -1
#    QQ.Fvsqq.M             1            0        1            0         1
#   QQ.Mvsqq.M            -1                  0            0            1        -1

> svseq = svaseq(cpm.values.filt, mod, mod[,1])
# Number of significant surrogate variables is: 5
# Iteration (out of 5 ):1 2 3 4 5

> head(svseq$sv)
[,1] [,2] [,3] [,4] [,5]
[1,] 0.05741827 -0.06602954 0.20009240 -0.02684960 0.23766642
[2,] 0.26314274 -0.66218564 0.19546081 0.38882480 -0.21056924
[3,] -0.06735629 0.01038846 0.08790005 0.02410775 0.34376335
[4,] 0.10120300 0.10301322 0.24170332 0.15549638 0.06207581
[5,] -0.17679364 -0.09238517 -0.03882105 0.15047943 0.23952769
[6,] 0.07944103 -0.01603642 -0.26944731 -0.16374470 -0.23344724

> mod.sv <- cbind(mod, svseq$sv)
> colnamesmod.sv)[5:9] <- paste0("sv", 1:5)

# (Intercept) qq.Fvsqq. M QQ.Fvsqq.M QQ.Mvsqq.M         sv1      sv2        sv3              sv4          sv5
# 1           1          1            0         0    0.05741827 -0.06602954 0.20009240 -0.02684960 0.23766642
# 2           1                        1            0         0    0.26314274 -0.66218564 0.19546081 0.38882480 -0.21056924
# 3           1            1            0       0   -0.06735629 0.01038846 0.08790005 0.02410775 0.34376335
# 4           1            1                  0       0    0.10120300 0.10301322 0.24170332 0.15549638 0.06207581
# 5           1            1        0       0   -0.17679364 -0.09238517 -0.03882105 0.15047943 0.23952769
# 6           1          1          0           0    0.07944103 -0.01603642 -0.26944731 -0.16374470 -0.23344724

> cont.mod.sv <- makeContrasts(QQ.FvsQQ.M = QQ.Fvsqq.M - QQ.Mvsqq.M,
                             qq.Fvsqq.M = qq.Fvsqq.M,
                             QQ.Fvsqq.F = QQ.Fvsqq.M - qq.Fvsqq.M,
                             QQ.Mvsqq.M = QQ.Mvsqq.M,
                             Interact = (QQ.Fvsqq.M - QQ.Mvsqq.M) - (qq.Fvsqq.M),
                             levels = mod.sv)

> cont.mod.sv
# Contrasts
# Levels             QQ.FvsQQ.M   qq.Fvsqq.M   QQ.Fvsqq.F   QQ.Mvsqq.M   Interact
#        Intercept                   0       0       0       0          0
#     qq.Fvsqq.M                  0         1                  -1         0         -1
#    QQ.Fvsqq.M            1         0         1         0            1
#   QQ.Mvsqq.M           -1           0             0         1             -1
#                  sv1                 0                      0           0         0      0
#    sv2                 0            0                   0         0      0
# sv3                 0        0           0         0        0
#                  sv4                 0          0           0         0          0
# sv5                 0          0           0         0        0

> d.filt.sv <- estimateGLMCommonDisp(d.filt, mod.sv, verbose = TRUE)
# Disp = 0.0268 , BCV = 0.1637 ###not sure if the BCV value is good. I am working on bovine. edgeR recommends
0.1 for identical strains and 0.4 for humans. Not sure if 0.1637 is okay for bovine.

> d.filt.sv <- estimateGLMTrendedDisp(d.filt, mod.sv)
> d.filt.sv <- estimateGLMTagwiseDisp(d.filt, mod.sv)

> plotBCVd.filt.sv)

> fit.edgeR.sv <- glmFitd.filt.sv, mod.sv)

> eR.QQ.FvsQQ.M.sv <- glmLRTfit.edgeR.sv, contrast = cont.mod.sv[ , 1])
> eR.qq.Fvsqq.M.sv <- glmLRTfit.edgeR.sv, contrast = cont.mod.sv[ , 2])
> eR.QQ.Fvsqq.F.sv <- glmLRTfit.edgeR.sv, contrast = cont.mod.sv[ , 3])
> eR.QQ.Mvsqq.M.sv <- glmLRTfit.edgeR.sv, contrast = cont.mod.sv[ , 4])
> eR.Interact.sv <- glmLRTfit.edgeR.sv, contrast = cont.mod.sv[ , 5])

# I am not sure on how to check on sv significance from here or if what I did above is right. Should I have the svs in the model another way? Any insight are welcome! Thank you again

R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu precise (12.04.5 LTS)

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C
[9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] sva_3.14.0           genefilter_1.50.0    mgcv_1.8-10          nlme_3.1-122         rtracklayer_1.28.10
[6] GenomicRanges_1.20.8 GenomeInfoDb_1.4.3   IRanges_2.2.9        S4Vectors_0.6.6      rgl_0.95.1429
[11] edgeR_3.10.5         limma_3.24.15        affycoretools_1.40.5 affy_1.46.1          Biobase_2.28.0
[16] BiocGenerics_0.14.0 WGCNA_1.48           RSQLite_1.0.0        DBI_0.3.1            fastcluster_1.1.16
[21] dynamicTreeCut_1.62

loaded via a namespace (and not attached):
[1] Category_2.34.2           bitops_1.0-6              matrixStats_0.50.1        doParallel_1.0.10
[5] RColorBrewer_1.1-2        gcrma_2.40.0              tools_3.2.2               affyio_1.36.0
[9] KernSmooth_2.23-15        rpart_4.1-10              Hmisc_3.17-1              colorspace_1.2-6
[13] nnet_7.3-11               gridExtra_2.0.0           GGally_0.5.0              DESeq2_1.8.2
[17] bit_1.1-12                preprocessCore_1.30.0     graph_1.46.0              ggbio_1.16.1
[21] caTools_1.17.1            scales_0.3.0              RBGL_1.44.0               stringr_1.0.0
[25] Rsamtools_1.20.5          foreign_0.8-66            R.utils_2.2.0             AnnotationForge_1.10.1
[29] XVector_0.8.0             dichromat_2.0-0           BSgenome_1.36.3           PFAM.db_3.1.2
[33] impute_1.42.0             BiocInstaller_1.18.5      GOstats_2.34.0            hwriter_1.3.2
[37] gtools_3.5.0              BiocParallel_1.2.22       acepack_1.3-3.3           R.oo_1.19.0
[41] VariantAnnotation_1.14.13 RCurl_1.95-4.7            magrittr_1.5              GO.db_3.1.2
[45] Formula_1.2-1             oligoClasses_1.30.0       futile.logger_1.4.1       Matrix_1.2-3
[49] Rcpp_0.12.2               munsell_0.4.2             R.methodsS3_1.7.0         stringi_1.0-1
[53] zlibbioc_1.14.0           gplots_2.17.0             plyr_1.8.3                grid_3.2.2
[57] gdata_2.17.0              ReportingTools_2.8.0      lattice_0.20-33           Biostrings_2.36.4
[61] splines_3.2.2             GenomicFeatures_1.20.6    annotate_1.46.1           locfit_1.5-9.1
[65] knitr_1.11                geneplotter_1.46.0        reshape2_1.4.1            codetools_0.2-14
[69] biomaRt_2.24.1            futile.options_1.0.0      XML_3.98-1.3              RcppArmadillo_0.6.400.2.2
[73] biovizBase_1.16.0         latticeExtra_0.6-26       lambda.r_1.1.7            foreach_1.4.3
[77] gtable_0.1.2              reshape_0.8.5             ggplot2_2.0.0             xtable_1.8-0
[81] ff_2.2-13                 survival_2.38-3           OrganismDbi_1.10.0        iterators_1.0.8
[85] GenomicAlignments_1.4.2   AnnotationDbi_1.30.1      cluster_2.0.3             GSEABase_1.30.2

ADD COMMENT • link 9.3 years ago mrodrigues.fernanda ▴ 10

score 3 · Accepted Answer · 2015-12-29

3

Entering edit mode

Jeff Leek ▴ 650

@jeff-leek-5015

Last seen 4.1 years ago

United States

You can estimate the surrogate variables with the svaseq command, then pass them as covariates to the model matrix for edgeR and you should be ok.

Jeff

ADD COMMENT • link 9.3 years ago Jeff Leek ▴ 650

1

Entering edit mode

Would it make more sense to apply the svaseq command to normalized cpm data rather than the raw counts?

ADD REPLY • link 9.3 years ago Ryan C. Thompson ★ 7.9k

2

Entering edit mode

In our workflow, we suggest running svaseq on normalized counts, as we estimate size factors on raw counts. If raw counts are provided to svaseq, I think something correlated to sequencing depth should be among the first surrogate variables (I think Jeff confirmed this sometime).

ADD REPLY • link 9.3 years ago Michael Love 43k

1

Entering edit mode

Yes this is what I've observed previously, for example in these analyses:

https://github.com/jtleek/svaseq

That if you don't normalize first the first sv is usually representing library size. This is a potential way to normalize but currently unexplored, so I'd stick with normalize, then svaseq like Mike suggests

ADD REPLY • link 9.3 years ago Jeff Leek ▴ 650