I've started using BrainArray's custom CDFs with oligo
and noticed something I cannot explain. I looked at some Q&As -- such as this one: How to use brainarray custom cdf with oligo package? -- but couldn't figure out an answer.
When I use CDFs with ENTREZ or ENSEMBL annotation I noticed read.celfiles
collapses probesets/transcripts/exon rows into each unique ENTREZ or ENSEMBL code. Using a generic GEO dataseries, for example, the following code (assuming all necessary libraries are loaded; see sessioInfo() below)
# Change dir base.dir = "~/Documents/IntegrationTest/" setwd(base.dir) # Reading data data.dir = paste(base.dir,"GSE474",sep="") list1 = list.celfiles(data.dir, full.name=TRUE) data96a = read.celfiles(list1) data96b = read.celfiles(list1, pkgname = "pd.hgu133a.hs.ensg")
produces
> data96a # ExpressionFeatureSet ExpressionFeatureSet (storageMode: lockedEnvironment) assayData: 506944 features, 24 samples element names: exprs protocolData rowNames: GSM3979.CEL GSM3980.CEL ... GSM4002.CEL (24 total) varLabels: exprs dates varMetadata: labelDescription channel phenoData rowNames: GSM3979.CEL GSM3980.CEL ... GSM4002.CEL (24 total) varLabels: index varMetadata: labelDescription channel featureData: none experimentData: use 'experimentData(object)' Annotation: pd.hg.u133a > data96b # GenericFeatureSet GenericFeatureSet (storageMode: lockedEnvironment) assayData: 506944 features, 24 samples element names: exprs protocolData rowNames: GSM3979.CEL GSM3980.CEL ... GSM4002.CEL (24 total) varLabels: exprs dates varMetadata: labelDescription channel phenoData rowNames: GSM3979.CEL GSM3980.CEL ... GSM4002.CEL (24 total) varLabels: index varMetadata: labelDescription channel featureData: none experimentData: use 'experimentData(object)' Annotation: pd.hgu133a.hs.ensg
The first difference can be seen right here: without any custom CDF, read.celfiles
generates an ExpressionFeatureSet
, while with the BrainArray CDF it gives a GenericFeatureSet
. How could I get an ExpressionFeatureSet
with a custom CDF?
Moving on with
eset96a = oligo::rma(data96a) eset96b = oligo::rma(data96b)
one gets
> eset96a ExpressionSet (storageMode: lockedEnvironment) assayData: 22283 features, 24 samples element names: exprs protocolData rowNames: GSM3979.CEL GSM3980.CEL ... GSM4002.CEL (24 total) varLabels: exprs dates varMetadata: labelDescription channel phenoData rowNames: GSM3979.CEL GSM3980.CEL ... GSM4002.CEL (24 total) varLabels: index varMetadata: labelDescription channel featureData: none experimentData: use 'experimentData(object)' Annotation: pd.hg.u133a > eset96b ExpressionSet (storageMode: lockedEnvironment) assayData: 11959 features, 24 samples element names: exprs protocolData rowNames: GSM3979.CEL GSM3980.CEL ... GSM4002.CEL (24 total) varLabels: exprs dates varMetadata: labelDescription channel phenoData rowNames: GSM3979.CEL GSM3980.CEL ... GSM4002.CEL (24 total) varLabels: index varMetadata: labelDescription channel featureData: none experimentData: use 'experimentData(object)' Annotation: pd.hgu133a.hs.ensg
It is clear that oligo is collapsing rows into ENSEMBL gene codes and I don't really mind that it is doing it. I just need to understand HOW it is doing that. Is it taking the average, median, max of all rows/transcripts/exons related to any single gene code?
Any help will be greatly appreciated.
> sessionInfo() R version 3.3.2 (2016-10-31) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 14.04.5 LTS locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 [6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base other attached packages: [1] pd.hgu133a.hs.ensg_0.0.1 hugene11sthsentrezg.db_21.0.0 hugene10sthsentrezg.db_21.0.0 [4] huex10sthsentrezg.db_21.0.0 hgu133a2hsentrezg.db_21.0.0 hgu133plus2hsentrezg.db_21.0.0 [7] hgu133ahsentrezg.db_21.0.0 pd.hugene11st.hs.entrezg_0.0.1 pd.hugene10st.hs.entrezg_0.0.1 [10] pd.huex10st.hs.entrezg_0.0.1 pd.hgu133a2.hs.entrezg_0.0.1 pd.hgu133plus2.hs.entrezg_0.0.1 [13] pd.hgu133a.hs.entrezg_0.0.1 hugene11sttranscriptcluster.db_8.5.0 hugene11stprobeset.db_8.5.0 [16] hugene10sttranscriptcluster.db_8.5.0 hugene10stprobeset.db_8.5.0 huex10sttranscriptcluster.db_8.5.0 [19] huex10stprobeset.db_8.5.0 hgu133a2.db_3.2.3 hgu133plus2.db_3.2.3 [22] hgu133a.db_3.2.3 org.Hs.eg.db_3.4.0 AnnotationDbi_1.36.2 [25] pd.hugene.1.1.st.v1_3.14.1 pd.hugene.1.0.st.v1_3.14.1 pd.huex.1.0.st.v2_3.14.1 [28] pd.hg.u133a.2_3.12.0 pd.hg.u133.plus.2_3.12.0 pd.hg.u133a_3.12.0 [31] affycoretools_1.46.4 CustomCDF_1.0.3 DBI_0.5-1 [34] RSQLite_1.1-2 frma_1.26.0 oligo_1.38.0 [37] Biostrings_2.42.1 XVector_0.14.0 IRanges_2.8.1 [40] S4Vectors_0.12.1 oligoClasses_1.36.0 affy_1.52.0 [43] Biobase_2.34.0 BiocGenerics_0.20.0 loaded via a namespace (and not attached): [1] backports_1.0.5 GOstats_2.40.0 Hmisc_4.0-2 AnnotationHub_2.6.4 [5] plyr_1.8.4 lazyeval_0.2.0 GSEABase_1.36.0 splines_3.3.2 [9] BiocParallel_1.8.1 GenomeInfoDb_1.10.3 ggplot2_2.2.1 digest_0.6.12 [13] foreach_1.4.3 BiocInstaller_1.24.0 ensembldb_1.6.2 htmltools_0.3.5 [17] GO.db_3.4.0 gdata_2.17.0 magrittr_1.5 checkmate_1.8.2 [21] memoise_1.0.0 BSgenome_1.42.0 cluster_2.0.5 gcrma_2.46.0 [25] limma_3.30.11 annotate_1.52.1 R.utils_2.5.0 ggbio_1.22.3 [29] colorspace_1.3-2 RCurl_1.95-4.8 graph_1.52.0 genefilter_1.56.0 [33] survival_2.40-1 VariantAnnotation_1.20.2 iterators_1.0.8 gtable_0.2.0 [37] zlibbioc_1.20.0 scales_0.4.1 GGally_1.3.0 edgeR_3.16.5 [41] Rcpp_0.12.9 xtable_1.8-2 htmlTable_1.9 foreign_0.8-67 [45] bit_1.1-12 OrganismDbi_1.16.0 preprocessCore_1.36.0 Formula_1.2-1 [49] AnnotationForge_1.16.0 htmlwidgets_0.8 httr_1.2.1 gplots_3.0.1 [53] RColorBrewer_1.1-2 acepack_1.4.1 ff_2.2-13 reshape_0.8.6 [57] XML_3.98-1.5 R.methodsS3_1.7.1 nnet_7.3-12 locfit_1.5-9.1 [61] reshape2_1.4.2 munsell_0.4.3 tools_3.3.2 stringr_1.1.0 [65] yaml_2.1.14 knitr_1.15.1 caTools_1.17.1 RBGL_1.50.0 [69] mime_0.5 R.oo_1.21.0 biomaRt_2.30.0 interactiveDisplayBase_1.12.0 [73] affyio_1.44.0 PFAM.db_3.4.0 tibble_1.2 geneplotter_1.52.0 [77] stringi_1.1.2 GenomicFeatures_1.26.2 lattice_0.20-34 Matrix_1.2-8 [81] data.table_1.10.4 bitops_1.0-6 httpuv_1.3.3 rtracklayer_1.34.1 [85] GenomicRanges_1.26.2 R6_2.2.0 latticeExtra_0.6-28 hwriter_1.3.2 [89] KernSmooth_2.23-15 gridExtra_2.2.1 affxparser_1.46.0 codetools_0.2-15 [93] dichromat_2.0-0 MASS_7.3-45 gtools_3.5.0 assertthat_0.1 [97] SummarizedExperiment_1.4.0 DESeq2_1.14.1 Category_2.40.0 ReportingTools_2.14.0 [101] GenomicAlignments_1.10.0 Rsamtools_1.26.1 grid_3.3.2 rpart_4.1-10 [105] biovizBase_1.22.0 shiny_1.0.0 base64enc_0.1-3