Question

Annotating Expression Set for GSVA w/ Brainarray Human Gene 1.1 ST Array CDF

0

Entering edit mode

kdyson • 0

@kdyson-15403

Last seen 6.8 years ago

Hello,

I am trying to filter an expression set (GSE85127) for input into GSVA (following the vignette). I'm having problems when I attempt to filter the eset that I think are related to my lack of understanding regarding how to work with the custom CDF files provided by MBNI. Any help is greatly appreciated.

My code:

> library(GEOquery)
> library(genefilter)
> install.packages("http://mbni.org/customcdf/22.0.0/ensg.download/hugene11sthsensgcdf_22.0.0.tar.gz", repos = NULL)

> library(hugene11sthsensgcdf)

> cavalliMB <- getGEO("GSE85217")
> cavalliMB <- cavalliMB[[1]]

> filtered_eset <- nsFilter(cavalliMB, require.entrez=FALSE, remove.dupEntrez=TRUE, var.func=IQR,
                          var.filter=TRUE, var.cutoff=0.5, filterByQuantile=TRUE, feature.exclude="^AFFX")

Error: getAnnMap: package GPL22286 not available

When I set remove.dupEntrez = FALSE the filtering seems to work fine but I get the same Error: getAnnMap: package GPL22286 not available when running gsva.

Alternatively, when I try to read in and normalize with oligo first, I'm getting a different error:

> library(oligo)
> library(hugene11sthsensgcdf)
> CEL.files <- list.files(path = "Data/Raw", pattern = ".CEL", full.names = TRUE)
> raw.data <- read.celfiles(CEL.files, pkgname = "hugene11sthsensgcdf")
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘kind’ for signature ‘"environment"’

I tried to run install.packages("http://mbni.org/customcdf/21.0.0/entrezg.download/pd.hugene11st.hs.entrezg_21.0.0.tar.gz") but got back a warning that this isn't available for R 3.5. I also tried to use read.celfiles(list.celfiles(), pkgname = "pd.hugene11st.hs.entrezg") but got back:

Loading required package: pd.hugene11st.hs.ensg
Attempting to obtain 'pd.hugene11st.hs.ensg' from BioConductor website.
Checking to see if your internet connection works...
Package 'pd.hugene11st.hs.ensg' was not found in the BioConductor repository.
The 'pdInfoBuilder' package can often be used in situations like this.
Error in read.celfiles(CEL.files, pkgname = "pd.hugene11st.hs.ensg") : 
  The annotation package, pd.hugene11st.hs.ensg, could not be loaded.
In addition: Warning message:
In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE,  :
  there is no package called ‘pd.hugene11st.hs.ensg’

Not sure if this is helpful/relevant but in the original publication the methods section states that "hugene11sthsensgcdf (v19.0.0)" CDF was used for annotation.

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] GSVA_1.28.0                RColorBrewer_1.1-2         limma_3.36.1               GSVAdata_1.16.0            hgu95a.db_3.2.3           
 [6] org.Hs.eg.db_3.6.0         GSEABase_1.42.0            graph_1.58.0               annotate_1.58.0            XML_3.98-1.11             
[11] hugene11sthsensgcdf_22.0.0 AnnotationDbi_1.42.1       IRanges_2.14.10            S4Vectors_0.18.2           bindrcpp_0.2.2            
[16] affycoretools_1.52.1       genefilter_1.62.0          GEOquery_2.48.0            Biobase_2.40.0             BiocGenerics_0.26.0       

loaded via a namespace (and not attached):
  [1] backports_1.1.2             GOstats_2.46.0              Hmisc_4.1-1                 plyr_1.8.4                 
  [5] lazyeval_0.2.1              splines_3.5.0               BiocParallel_1.14.1         GenomeInfoDb_1.16.0        
  [9] ggplot2_2.2.1               digest_0.6.15               foreach_1.4.4               BiocInstaller_1.30.0       
 [13] ensembldb_2.4.1             htmltools_0.3.6             GO.db_3.6.0                 gdata_2.18.0               
 [17] magrittr_1.5                checkmate_1.8.5             memoise_1.1.0               BSgenome_1.48.0            
 [21] cluster_2.0.7-1             gcrma_2.52.0                Biostrings_2.48.0           readr_1.1.1                
 [25] matrixStats_0.53.1          R.utils_2.6.0               ggbio_1.28.0                prettyunits_1.0.2          
 [29] colorspace_1.3-2            blob_1.1.1                  dplyr_0.7.5                 RCurl_1.95-4.10            
 [33] bindr_0.1.1                 survival_2.41-3             VariantAnnotation_1.26.0    iterators_1.0.9            
 [37] glue_1.2.0                  gtable_0.2.0                zlibbioc_1.26.0             XVector_0.20.0             
 [41] DelayedArray_0.6.0          Rgraphviz_2.24.0            scales_0.5.0                DBI_1.0.0                  
 [45] GGally_1.4.0                edgeR_3.22.2                Rcpp_0.12.17                xtable_1.8-2               
 [49] progress_1.1.2              htmlTable_1.12              foreign_0.8-70              bit_1.1-14                 
 [53] OrganismDbi_1.22.0          preprocessCore_1.42.0       Formula_1.2-3               AnnotationForge_1.22.0     
 [57] htmlwidgets_1.2             httr_1.3.1                  gplots_3.0.1                acepack_1.4.1              
 [61] ff_2.2-14                   pkgconfig_2.0.1             reshape_0.8.7               R.methodsS3_1.7.1          
 [65] nnet_7.3-12                 locfit_1.5-9.1              later_0.7.2                 tidyselect_0.2.4           
 [69] rlang_0.2.1                 reshape2_1.4.3              munsell_0.4.3               tools_3.5.0                
 [73] RSQLite_2.1.1               stringr_1.3.1               knitr_1.20                  bit64_0.9-7                
 [77] oligoClasses_1.42.0         caTools_1.17.1              purrr_0.2.5                 AnnotationFilter_1.4.0     
 [81] RBGL_1.56.0                 mime_0.5                    R.oo_1.22.0                 xml2_1.2.0                 
 [85] biomaRt_2.36.1              shinythemes_1.1.1           compiler_3.5.0              rstudioapi_0.7             
 [89] curl_3.2                    affyio_1.50.0               PFAM.db_3.6.0               tibble_1.4.2               
 [93] geneplotter_1.58.0          stringi_1.1.7               GenomicFeatures_1.32.0      lattice_0.20-35            
 [97] ProtGenerics_1.12.0         Matrix_1.2-14               pillar_1.2.3                data.table_1.11.4          
[101] bitops_1.0-6                httpuv_1.4.3                rtracklayer_1.40.3          GenomicRanges_1.32.3       
[105] R6_2.2.2                    latticeExtra_0.6-28         affy_1.58.0                 hwriter_1.3.2              
[109] promises_1.0.1              KernSmooth_2.23-15          gridExtra_2.3               codetools_0.2-15           
[113] dichromat_2.0-0             gtools_3.5.0                assertthat_0.2.0            SummarizedExperiment_1.10.1
[117] DESeq2_1.20.0               Category_2.46.0             ReportingTools_2.20.0       GenomicAlignments_1.16.0   
[121] Rsamtools_1.32.0            GenomeInfoDbData_1.1.0      hms_0.4.2                   grid_3.5.0                 
[125] rpart_4.1-13                tidyr_0.8.1                 biovizBase_1.28.0           shiny_1.1.0                
[129] base64enc_0.1-3

bioconductor brainarray genefilter oligo • 2.3k views

ADD COMMENT • link updated 6.9 years ago by James W. MacDonald 68k • written 6.9 years ago by kdyson • 0

0

Entering edit mode

hi, this is not a GSVA-specific issue. The filtering of genes one does for GSVA is the same you would do for a differential expression analysis. The problem you show has to do the particular microarray chip to which the data you are analyzing belongs to and how to analyze it using the oligo package.

ADD REPLY • link 6.9 years ago Robert Castelo ★ 3.4k

0

Entering edit mode

Thank you, Robert. I've modified the post to remove the GSVA tag and added oligo.

ADD REPLY • link 6.9 years ago kdyson • 0

score 0 · Answer 1 · 2018-06-11

0

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 5 hours ago

United States

You are doing a bunch of things that don't make sense. First, the data you get from GEO using getGEO is already summarized using the MBNI re-mapped probesets that you want to use. So there is no reason to re-do that part. In addition, the MBNI re-mapped probesets that the original authors used are based on Entrez Gene, so by definition they all have Entrez Gene identifiers! That's the whole point, after all. In addition, removing duplicate Entrez Gene IDs is nonsensical, because the whole point of the MBNI re-mapped data is to summarize all the probes for each Gene ID into one value. Why do you think there would be duplicates?

There won't be any AFFX probesets either because, well see above.

I am not familiar with GSVA, but you would also be better served to ensure that using a variance filter prior makes sense as well. Robert says you filter as you would do for differential expression, but if you were planning to use limma, you most certainly would not use variability, so you should make sure that you do the right thing.

Also do note that things like nsFilter look at the annotation slot of the ExpressionSet, and by default what you get from GEO is the GPL ID, which is a GEO-specific identifier. You would need to change that to correspond with the annotation package that MBNI supplies for this array.

ADD COMMENT • link 6.9 years ago James W. MacDonald 68k

0

Entering edit mode

Part of the reason for redoing the mappings is for my own edification. It was also an attempt at solving the first error I mentioned getAnnMap: package GPL22286 not available. You mention that I merely need to change the GPL identifier in the annotation slot. I'd greatly appreciate it if you could explain how to do that step exactly.

Also, I thought the paper I cited lists a brain array package that uses Ensembl gene IDs, not Entrez gene IDs. The difference isn't entirely clear to me after searching, but it does seem that they are not the same. I installed the "ensg" version of the 1.1 ST chip, not the "entez" version.

ADD REPLY • link 6.9 years ago kdyson • 0

0

Entering edit mode

I am not sure what you mean by 'redoing the mappings'. In my mind a mapping is usually from a manufacturer's ID to an ID from an annotation service like NCBI or EBI/EMBL. However you might actually mean 'redoing the normalization and summarization steps', in which case you should note that oligo doesn't use CDF packages. It uses pdInfoPackages, which are also available from MBNI.

> install.packages("http://mbni.org/customcdf/21.0.0/entrezg.download/pd.hugene11st.hs.entrezg_21.0.0.tar.gz")
Installing package into 'C:/Users/jmacdon/AppData/Roaming/R/win-library/3.5'
(as 'lib' is unspecified)
inferring 'repos = NULL' from 'pkgs'
trying URL 'http://mbni.org/customcdf/21.0.0/entrezg.download/pd.hugene11st.hs.entrezg_21.0.0.tar.gz'
Content type 'application/x-gzip' length 10916821 bytes (10.4 MB)
downloaded 10.4 MB

* installing *source* package 'pd.hugene11st.hs.entrezg' ...
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
  converting help for package 'pd.hugene11st.hs.entrezg'
    finding HTML links ... done
    pkg                                     html  
    probeSequences                          html  
** building package indices
** testing if installed package can be loaded
*** arch - i386
*** arch - x64
* DONE (pd.hugene11st.hs.entrezg)
In R CMD INSTALL
> library(oligo)
> dat <- read.celfiles(<filenames go here>, pkgname = "pd.hugene11st.hs.entrezg")

> eset <- rma(dat)

Should summarize the data using the Entrez Gene MBNI package. If you want the Ensembl version, you substitute ensg for entrezg.

Anyway, I was trying to (maybe too subtly) get you to learn what MBNI is doing when they re-map the CDFs. That's sort of important to understand, but I am a bit of a purist when it comes to analyzing data, thinking (naively I suppose) that people should be cognizant of what they are doing and the tradeoffs they are making with their choices. So to me it's a critical thing to know A) what MBNI is doing when they make these things, and B) what it means to use Ensembl vs Entrez Gene IDs. These aren't trivialities, and your results depend on the choices you make.

But do note that the probeset IDs the MBNI append to their remapped probesets are simply the Ensembl or Entrez Gene ID, with a trailing "_at". So if you want to annotate the data then you just need to strip the trailing _at off and then use either an EnsemblDb package or the biomaRt package (if you are using Ensembl IDs), or the org.Hs.eg.db package (if you are using Entrez Gene IDs).

To do the annotation you will need to learn about ExpressionSets, particularly the fData slot, as well as the EnsemblDb, biomaRt, and/or AnnotationDbi based packages.

ADD REPLY • link 6.9 years ago James W. MacDonald 68k