can we use sva function to estimate artifacts for rna seq data?
1
0
Entering edit mode
szenitha ▴ 20
@szenitha-10863
Last seen 8.1 years ago

Hello all,

I have a question regarding the sva function. To start with I am analysing a RNA seq dataset. My condition consists of 4 groups(CON_GM,CON_WM,MS_GM and MS_WM). CON is the control and MS is patients affected with multiple sclerosis. GM(grey matter) and WM(white matter) are cell types. I first normalize the data by deseq2 and then visualize the PCA. I can see that the data clusters according to cell type.  I then use svaseq function to search for hidden batch effects. After that I use a function to remove the surrogate variables and then visualize the PCA . I do not use the cleaned data for downstream analysis. I just take the PCA  to decide on the number of surrogates to be added as covariates when I will perform downstream analysis using deseq2. I found the function that removes batch effect from this link https://www.biostars.org/p/121489/  . I also used the function num.sv to get the total number of surrogates. The function returned a value of 18.  Initially I removed only 2-3 surrogates . But I was unable to get 4 different clusters.I found that after removing 14 surrogate variables , I am able to see 4 separate groups. I repeated the same steps but this time I use sva function. When I used sva I only had to remove 6 surrogate variables to get 4 separate groups. As per the documentation it says that the sva function can be used to estimate artifacts from microarray data the svaseq function can be used to estimate artifacts from count-based RNA-sequencing (and other sequencing) data.  I dont understand why I would need to remove more number of surrogate variables while using svaseq to get good clustering. Can we use sva to estimate artifacts for RNA-seq data? 

Code:

#compute null and full model
mod =model.matrix(~merged, data=pheno)#merged is condition
mod0 = model.matrix(~1,data=pheno)

#find the number of surrogate variables
n.sv = num.sv(data,mod,method="leek")


#function that computes surrogate variables
#passing deseq2 normalized data(normalized by size factors) as input

svobj = svaseq(data_normalized, mod, mod0, n.sv=14)

#svobj = sva(data_normalized, mod, mod0, n.sv=6)

####clean function: removes unwanted variation from data##
cleanY = function(y, mod, svs) {
  X = cbind(mod, svs)
  Hat = solve(t(X) %*% X) %*% t(X)
  beta = (Hat %*% t(y))
  rm(Hat)
  gc()
  P = ncol(mod)
  return(y - t(as.matrix(X[,-c(1:P)]) %*% beta[-c(1:P),]))
}

#calling the above function to remove unwanted variation
cleany = cleanY(data_normalized,mod,svobj$sv[,1:14])

#plotpca

> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] sva_3.20.0                 mgcv_1.8-13                nlme_3.1-128              
 [4] ggplot2_2.1.0              rgl_0.95.1441              genefilter_1.54.2         
 [7] DESeq2_1.12.3              SummarizedExperiment_1.2.3 Biobase_2.32.0            
[10] GenomicRanges_1.24.2       RColorBrewer_1.1-2         GenomeInfoDb_1.8.3        
[13] IRanges_2.6.1              S4Vectors_0.10.2           BiocGenerics_0.18.0       
[16] BiocInstaller_1.22.3      

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.6          plyr_1.8.4           XVector_0.12.1       tools_3.3.0         
 [5] zlibbioc_1.18.0      digest_0.6.10        rpart_4.1-10         RSQLite_1.0.0       
 [9] annotate_1.50.0      gtable_0.2.0         lattice_0.20-33      Matrix_1.2-6        
[13] DBI_0.4-1            gridExtra_2.2.1      cluster_2.0.4        locfit_1.5-9.1      
[17] grid_3.3.0           nnet_7.3-12          data.table_1.9.6     AnnotationDbi_1.34.4
[21] XML_3.98-1.4         survival_2.39-5      BiocParallel_1.6.4   foreign_0.8-66      
[25] latticeExtra_0.6-28  Formula_1.2-1        geneplotter_1.50.0   Hmisc_3.17-4        
[29] scales_0.4.0         splines_3.3.0        colorspace_1.2-6     xtable_1.8-2        
[33] labeling_0.3         acepack_1.3-3.3      munsell_0.4.3        chron_2.3-47  
sva svaseq batcheffectcorrection • 2.7k views
ADD COMMENT
0
Entering edit mode
Jeff Leek ▴ 650
@jeff-leek-5015
Last seen 3.7 years ago
United States
Hi sva and svaseq do nearly the same thing, except svaseq performs a log transform before applying sva. If you didn't take the log transform before performing sva you may get different results. If you are using count data we suggest you either perform an appropriate transform of the data (log is only one option here, DESeq2 has more clever transforms you could use) before using the sva function. I hope this helps! Jeff On Mon, Aug 8, 2016 at 11:58 AM szenitha [bioc] <noreply@bioconductor.org> wrote: > Activity on a post you are following on support.bioconductor.org > > User szenitha <https: support.bioconductor.org="" u="" 10863=""/> wrote Question: > can we use sva function to estimate artifacts for rna seq data? > <https: support.bioconductor.org="" p="" 85905=""/>: > > Hello all, > > I have a question regarding the sva function. To start with I am analysing > a RNA seq dataset. My condition consists of 4 groups(CON_GM,CON_WM,MS_GM > and MS_WM). CON is the control and MS is patients affected with multiple > sclerosis. GM(grey matter) and WM(white matter) are cell types. I first > normalize the data by deseq2 and then visualize the PCA. I can see that the > data clusters according to cell type. I then use svaseq function to search > for hidden batch effects. After that I use a function to remove the > surrogate variables and then visualize the PCA . I do not use the cleaned > data for downstream analysis. I just take the PCA to decide on the number > of surrogates to be added as covariates when I will perform downstream > analysis using deseq2. I found the function that removes batch effect from > this link https://www.biostars.org/p/121489/ . I also used the function > num.sv to get the total number of surrogates. The function returned a > value of 18. Initially I removed only 2-3 surrogates . But I was unable to > get 4 different clusters.I found that after removing 14 surrogate variables > , I am able to see 4 separate groups. I repeated the same steps but this > time I use sva function. When I used sva I only had to remove 6 surrogate > variables to get 4 separate groups. As per the documentation it says that > the sva <http: 127.0.0.1:47764="" help="" library="" sva="" help="" sva=""> function can > be used to estimate artifacts from microarray data the svaseq > <http: 127.0.0.1:47764="" help="" library="" sva="" help="" svaseq=""> function can be > used to estimate artifacts from count-based RNA-sequencing (and other > sequencing) data. I dont understand why I would need to remove more number > of surrogate variables while using svaseq to get good clustering. Can we > use sva to estimate artifacts for RNA-seq data? > > Code: > > #compute null and full model > mod =model.matrix(~merged, data=pheno)#merged is condition > mod0 = model.matrix(~1,data=pheno) > > #find the number of surrogate variables > n.sv = num.sv(data,mod,method="leek") > > > #function that computes surrogate variables > #passing deseq2 normalized data(normalized by size factors) as input > > svobj = svaseq(data_normalized, mod, mod0, n.sv=14) > > #svobj = sva(data_normalized, mod, mod0, n.sv=6) > > ####clean function: removes unwanted variation from data## > cleanY = function(y, mod, svs) { > X = cbind(mod, svs) > Hat = solve(t(X) %*% X) %*% t(X) > beta = (Hat %*% t(y)) > rm(Hat) > gc() > P = ncol(mod) > return(y - t(as.matrix(X[,-c(1:P)]) %*% beta[-c(1:P),])) > } > > #calling the above function to remove unwanted variation > cleany = cleanY(data_normalized,mod,svobj$sv[,1:14]) > > #plotpca > > > sessionInfo() > R version 3.3.0 (2016-05-03) > Platform: x86_64-w64-mingw32/x64 (64-bit) > Running under: Windows 7 x64 (build 7601) Service Pack 1 > > locale: > [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 > [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C > [5] LC_TIME=English_United States.1252 > > attached base packages: > [1] parallel stats4 stats graphics grDevices utils datasets methods base > > other attached packages: > [1] sva_3.20.0 mgcv_1.8-13 nlme_3.1-128 > [4] ggplot2_2.1.0 rgl_0.95.1441 genefilter_1.54.2 > [7] DESeq2_1.12.3 SummarizedExperiment_1.2.3 Biobase_2.32.0 > [10] GenomicRanges_1.24.2 RColorBrewer_1.1-2 GenomeInfoDb_1.8.3 > [13] IRanges_2.6.1 S4Vectors_0.10.2 BiocGenerics_0.18.0 > [16] BiocInstaller_1.22.3 > > loaded via a namespace (and not attached): > [1] Rcpp_0.12.6 plyr_1.8.4 XVector_0.12.1 tools_3.3.0 > [5] zlibbioc_1.18.0 digest_0.6.10 rpart_4.1-10 RSQLite_1.0.0 > [9] annotate_1.50.0 gtable_0.2.0 lattice_0.20-33 Matrix_1.2-6 > [13] DBI_0.4-1 gridExtra_2.2.1 cluster_2.0.4 locfit_1.5-9.1 > [17] grid_3.3.0 nnet_7.3-12 data.table_1.9.6 AnnotationDbi_1.34.4 > [21] XML_3.98-1.4 survival_2.39-5 BiocParallel_1.6.4 foreign_0.8-66 > [25] latticeExtra_0.6-28 Formula_1.2-1 geneplotter_1.50.0 Hmisc_3.17-4 > [29] scales_0.4.0 splines_3.3.0 colorspace_1.2-6 xtable_1.8-2 > [33] labeling_0.3 acepack_1.3-3.3 munsell_0.4.3 chron_2.3-47 > > ------------------------------ > > Post tags: sva, svaseq, batcheffectcorrection > > You may reply via email or visit can we use sva function to estimate artifacts for rna seq data? >
ADD COMMENT
0
Entering edit mode

Thank you jeff. Yes I now transformed the data using deseq2 and then used sva. Now I am getting the same result as earlier. Just to confirm . sva needs transformed data like log2 or rlog or voom tranformation as input. svaseq takes count data and then log transforms it. So the difference would be in the transformation we are using in these cases. Therefore I get different results when I take the PCA after batch correction. Did I understand correctly?can we give deseq2 normalized data as input to svaseq since deseq2 normalization only corrects for size factors?

ADD REPLY
0
Entering edit mode
Yes I think you should be able to, but you might confirm with Mike Love, the maintainer of deseq2. On Tue, Aug 9, 2016 at 6:10 AM szenitha [bioc] <noreply@bioconductor.org> wrote: > Activity on a post you are following on support.bioconductor.org > > User szenitha <https: support.bioconductor.org="" u="" 10863=""/> wrote Comment: > can we use sva function to estimate artifacts for rna seq data? > <https: support.bioconductor.org="" p="" 85905="" #85944="">: > > Thank you jeff. Yes I now transformed the data using deseq2 and then used > sva. Now I am getting the same result as svseq. Thank you. Just to confirm > . sva needs transformed data like log2 or rlog or voom tranformation as > input. svaseq takes count data and then log transforms it. can we give > deseq2 normalized data as input to svaseq since deseq2 normalization only > corrects for size factors. > ------------------------------ > > Post tags: sva, svaseq, batcheffectcorrection > > You may reply via email or visit > C: can we use sva function to estimate artifacts for rna seq data? >
ADD REPLY

Login before adding your answer.

Traffic: 470 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6