Filtering genes with missing Entrez ID using "nsFilter"
1
0
Entering edit mode
hallie.cla • 0
@halliecla-8705
Last seen 7.8 years ago
Sweden

Dear list,

I'm analyzing data obtained using the HuGene1.0-st arrays from affymetrix.

I'm interested in a particular set of genes. Two of these genes are not present in the results if I filter out probes with missing Entrez IDs using nsFilter. However, I think that these two genes (PER1 and CSNK1E) actually have an Entrez ID and should not be filtered out. The only way I obtain them in my results is if I don't apply any kind of filter with nsFilter. Do you think I'm doing something wrong with the annotation? This is the code I'm using and the output of sessionInfo(). Thank you so much for your kind help.

 

exp.data <- read.celfiles(filenames=list.celfiles(full.names=TRUE, dataDir))
phenoData <- read.table(file.path(dataDir ,"pheno_BD_CT.txt"), sep="\t", header=TRUE)
row.names(phenoData) <- sampleNames(exp.data)
pd <- AnnotatedDataFrame(data=phenoData)
phenoData(exp.data) <- pd
eset <- rma(exp.data)
subs <- grep("CT", phenoData(exp.data)$Phenotype)
eset.subs <- eset[,c(subs)]
annotation(eset.subs) <- "hugene10sttranscriptcluster.db"
esetf.subs <- nsFilter(eset.subs, require.entrez=TRUE, remove.dupEntrez=FALSE, var.filter=FALSE)
express <- esetf.subs$eset

> esetf.subs
$eset
ExpressionSet (storageMode: lockedEnvironment)
assayData: 19777 features, 20 samples 
  element names: exprs 
protocolData
  rowNames: 14(1015)_(HuGene-1_0-st-v1).CEL 21(1707)_(HuGene-1_0-st-v1).CEL ...
    9(1168)_(HuGene-1_0-st-v1).CEL (20 total)
  varLabels: exprs dates
  varMetadata: labelDescription channel
phenoData
  rowNames: 14(1015)_(HuGene-1_0-st-v1).CEL 21(1707)_(HuGene-1_0-st-v1).CEL ...
    9(1168)_(HuGene-1_0-st-v1).CEL (20 total)
  varLabels: FileName Response ... Analysis (5 total)
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation: hugene10sttranscriptcluster.db 

$filter.log
$filter.log$numRemoved.ENTREZID
[1] 13520

 

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] pd.hugene.1.0.st.v1_3.14.1           RSQLite_1.0.0                       
 [3] DBI_0.5                              hugene10sttranscriptcluster.db_8.4.0
 [5] org.Hs.eg.db_3.3.0                   AnnotationDbi_1.34.4                
 [7] genefilter_1.54.2                    oligo_1.36.1                        
 [9] Biostrings_2.40.2                    XVector_0.12.1                      
[11] IRanges_2.6.1                        S4Vectors_0.10.3                    
[13] limma_3.28.18                        oligoClasses_1.34.0                 
[15] Biobase_2.32.0                       BiocGenerics_0.18.0                 
[17] mvtnorm_1.0-5                       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.6                BiocInstaller_1.22.3       GenomeInfoDb_1.8.3        
 [4] iterators_1.0.8            tools_3.3.1                zlibbioc_1.18.0           
 [7] bit_1.1-12                 tibble_1.1                 annotate_1.50.0           
[10] preprocessCore_1.34.0      lattice_0.20-33            ff_2.2-13                 
[13] Matrix_1.2-6               foreach_1.4.3              dplyr_0.5.0               
[16] affxparser_1.44.0          grid_3.3.1                 R6_2.1.2                  
[19] XML_3.98-1.4               survival_2.40-1            magrittr_1.5              
[22] codetools_0.2-14           splines_3.3.1              GenomicRanges_1.24.2      
[25] assertthat_0.1             SummarizedExperiment_1.2.3 xtable_1.8-2              
[28] sandwich_2.3-4             zoo_1.7-14                 affyio_1.42.0             

nsfilter affymetrix hugene10sttranscriptcluster.db annotation • 2.0k views
ADD COMMENT
1
Entering edit mode
@james-w-macdonald-5106
Last seen 2 days ago
United States

The short answer is "don't use nsFilter". I wouldn't even recommend filtering based on whether or not a probeset has an Entrez Gene ID, but that's just me. If you wanted to do something like that you could just do

library(affycoretools)
eset <- annotateEset(eset, hugene10sttranscriptcluster.db)
eset.sub <- eset[!is.na(fData(eset)$ENTREZID),]
ADD COMMENT
0
Entering edit mode

Thank you so much James for your answer. 

I wanted to use nsFilter to filter out also probes with duplicated IDs, but also in this case I find my genes of interest filtered out (even if they do not have duplicated probes according to the results I obtain when I do not apply any filter). 

Is there a way to filter out duplicated probes without using nsFilter? 

Also, just if you have time to explain this to me, do you have an idea on why I don't find these two genes in the results if they have an ID?

Thank you so much for your time and help

Claudia

ADD REPLY
0
Entering edit mode

This is what I obtain when using the method you suggested, but unfortunately my genes of interest are still filtered out:

 

> eset.subs <- annotateEset(eset.subs, "hugene10sttranscriptcluster.db")
'select()' returned 1:many mapping between keys and columns
'select()' returned 1:many mapping between keys and columns
'select()' returned 1:many mapping between keys and columns
> eset.subs2 <- eset.subs[!is.na(fData(eset.subs)$ENTREZID),]
> eset.subs2
ExpressionSet (storageMode: lockedEnvironment)
assayData: 22384 features, 20 samples 
  element names: exprs 
protocolData
  rowNames: 14(1015)_(HuGene-1_0-st-v1).CEL 21(1707)_(HuGene-1_0-st-v1).CEL ...
    9(1168)_(HuGene-1_0-st-v1).CEL (20 total)
  varLabels: exprs dates
  varMetadata: labelDescription channel
phenoData
  rowNames: 14(1015)_(HuGene-1_0-st-v1).CEL 21(1707)_(HuGene-1_0-st-v1).CEL ...
    9(1168)_(HuGene-1_0-st-v1).CEL (20 total)
  varLabels: FileName Response ... Analysis (5 total)
  varMetadata: labelDescription
featureData
  featureNames: 7896740 7896742 ... 8180418 (22384 total)
  fvarLabels: PROBEID ENTREZID SYMBOL GENENAME
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation: hugene10sttranscriptcluster.db 

ADD REPLY
0
Entering edit mode

Well, what is happening is essentially this:

> z <- as.data.frame(lapply(c("ENTREZID","SYMBOL"), function(x) mapIds(hugene10sttranscriptcluster.db, keys(hugene10sttranscriptcluster.db), x, "PROBEID")))
'select()' returned 1:many mapping between keys and columns
'select()' returned 1:many mapping between keys and columns
> names(z) <- c("ENTREZID","SYMBOL")
> head(z)
        ENTREZID SYMBOL
7892501     <NA>   <NA>
7892502     <NA>   <NA>
7892503     <NA>   <NA>
7892504     <NA>   <NA>
7892505     <NA>   <NA>
7892506     <NA>   <NA>
> z[z$SYMBOL %in% c("PER1","CSNK1E"),]
        ENTREZID SYMBOL
8012349     5187   PER1
8076056     1454 CSNK1E

So if you take an unfiltered ExpressionSet and annotate it using annotateEset, then I can assure you that both PER1 and CSNK1E will be in there.

ADD REPLY
0
Entering edit mode

Thank you for your answer

ADD REPLY

Login before adding your answer.

Traffic: 581 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6