Question

Filtering genes with missing Entrez ID using "nsFilter"

0

Entering edit mode

hallie.cla • 0

@halliecla-8705

Last seen 7.8 years ago

Sweden

Dear list,

I'm analyzing data obtained using the HuGene1.0-st arrays from affymetrix.

I'm interested in a particular set of genes. Two of these genes are not present in the results if I filter out probes with missing Entrez IDs using nsFilter. However, I think that these two genes (PER1 and CSNK1E) actually have an Entrez ID and should not be filtered out. The only way I obtain them in my results is if I don't apply any kind of filter with nsFilter. Do you think I'm doing something wrong with the annotation? This is the code I'm using and the output of sessionInfo(). Thank you so much for your kind help.

exp.data <- read.celfiles(filenames=list.celfiles(full.names=TRUE, dataDir))
phenoData <- read.table(file.path(dataDir ,"pheno_BD_CT.txt"), sep="\t", header=TRUE)
row.names(phenoData) <- sampleNames(exp.data)
pd <- AnnotatedDataFrame(data=phenoData)
phenoData(exp.data) <- pd
eset <- rma(exp.data)
subs <- grep("CT", phenoData(exp.data)$Phenotype)
eset.subs <- eset[,c(subs)]
annotation(eset.subs) <- "hugene10sttranscriptcluster.db"
esetf.subs <- nsFilter(eset.subs, require.entrez=TRUE, remove.dupEntrez=FALSE, var.filter=FALSE)
express <- esetf.subs$eset

> esetf.subs
$eset
ExpressionSet (storageMode: lockedEnvironment)
assayData: 19777 features, 20 samples
element names: exprs
protocolData
rowNames: 14(1015)_(HuGene-1_0-st-v1).CEL 21(1707)_(HuGene-1_0-st-v1).CEL ...
9(1168)_(HuGene-1_0-st-v1).CEL (20 total)
varLabels: exprs dates
varMetadata: labelDescription channel
phenoData
rowNames: 14(1015)_(HuGene-1_0-st-v1).CEL 21(1707)_(HuGene-1_0-st-v1).CEL ...
9(1168)_(HuGene-1_0-st-v1).CEL (20 total)
varLabels: FileName Response ... Analysis (5 total)
varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation: hugene10sttranscriptcluster.db

$filter.log
$filter.log$numRemoved.ENTREZID
[1] 13520

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] pd.hugene.1.0.st.v1_3.14.1 RSQLite_1.0.0
[3] DBI_0.5 hugene10sttranscriptcluster.db_8.4.0
[5] org.Hs.eg.db_3.3.0 AnnotationDbi_1.34.4
[7] genefilter_1.54.2 oligo_1.36.1
[9] Biostrings_2.40.2 XVector_0.12.1
[11] IRanges_2.6.1 S4Vectors_0.10.3
[13] limma_3.28.18 oligoClasses_1.34.0
[15] Biobase_2.32.0 BiocGenerics_0.18.0
[17] mvtnorm_1.0-5

loaded via a namespace (and not attached):
[1] Rcpp_0.12.6 BiocInstaller_1.22.3 GenomeInfoDb_1.8.3
[4] iterators_1.0.8 tools_3.3.1 zlibbioc_1.18.0
[7] bit_1.1-12 tibble_1.1 annotate_1.50.0
[10] preprocessCore_1.34.0 lattice_0.20-33 ff_2.2-13
[13] Matrix_1.2-6 foreach_1.4.3 dplyr_0.5.0
[16] affxparser_1.44.0 grid_3.3.1 R6_2.1.2
[19] XML_3.98-1.4 survival_2.40-1 magrittr_1.5
[22] codetools_0.2-14 splines_3.3.1 GenomicRanges_1.24.2
[25] assertthat_0.1 SummarizedExperiment_1.2.3 xtable_1.8-2
[28] sandwich_2.3-4 zoo_1.7-14 affyio_1.42.0
>

nsfilter affymetrix hugene10sttranscriptcluster.db annotation • 2.0k views

ADD COMMENT • link updated 7.8 years ago by James W. MacDonald 67k • written 7.8 years ago by hallie.cla • 0

score 1 · Answer 1 · 2017-02-22

1

Entering edit mode

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 10 minutes ago

United States

The short answer is "don't use nsFilter". I wouldn't even recommend filtering based on whether or not a probeset has an Entrez Gene ID, but that's just me. If you wanted to do something like that you could just do

library(affycoretools)
eset <- annotateEset(eset, hugene10sttranscriptcluster.db)
eset.sub <- eset[!is.na(fData(eset)$ENTREZID),]

ADD COMMENT • link 7.8 years ago James W. MacDonald 67k

0

Entering edit mode

Thank you so much James for your answer.

I wanted to use nsFilter to filter out also probes with duplicated IDs, but also in this case I find my genes of interest filtered out (even if they do not have duplicated probes according to the results I obtain when I do not apply any filter).

Is there a way to filter out duplicated probes without using nsFilter?

Also, just if you have time to explain this to me, do you have an idea on why I don't find these two genes in the results if they have an ID?

Thank you so much for your time and help

Claudia

ADD REPLY • link 7.8 years ago hallie.cla • 0

0

Entering edit mode

This is what I obtain when using the method you suggested, but unfortunately my genes of interest are still filtered out:

> eset.subs <- annotateEset(eset.subs, "hugene10sttranscriptcluster.db")
'select()' returned 1:many mapping between keys and columns
'select()' returned 1:many mapping between keys and columns
'select()' returned 1:many mapping between keys and columns
> eset.subs2 <- eset.subs[!is.na(fData(eset.subs)$ENTREZID),]
> eset.subs2
ExpressionSet (storageMode: lockedEnvironment)
assayData: 22384 features, 20 samples
element names: exprs
protocolData
rowNames: 14(1015)_(HuGene-1_0-st-v1).CEL 21(1707)_(HuGene-1_0-st-v1).CEL ...
9(1168)_(HuGene-1_0-st-v1).CEL (20 total)
varLabels: exprs dates
varMetadata: labelDescription channel
phenoData
rowNames: 14(1015)_(HuGene-1_0-st-v1).CEL 21(1707)_(HuGene-1_0-st-v1).CEL ...
9(1168)_(HuGene-1_0-st-v1).CEL (20 total)
varLabels: FileName Response ... Analysis (5 total)
varMetadata: labelDescription
featureData
featureNames: 7896740 7896742 ... 8180418 (22384 total)
fvarLabels: PROBEID ENTREZID SYMBOL GENENAME
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation: hugene10sttranscriptcluster.db
>

ADD REPLY • link 7.8 years ago hallie.cla • 0

0

Entering edit mode

Well, what is happening is essentially this:

> z <- as.data.frame(lapply(c("ENTREZID","SYMBOL"), function(x) mapIds(hugene10sttranscriptcluster.db, keys(hugene10sttranscriptcluster.db), x, "PROBEID")))
'select()' returned 1:many mapping between keys and columns
'select()' returned 1:many mapping between keys and columns
> names(z) <- c("ENTREZID","SYMBOL")
> head(z)
        ENTREZID SYMBOL
7892501     <NA>   <NA>
7892502     <NA>   <NA>
7892503     <NA>   <NA>
7892504     <NA>   <NA>
7892505     <NA>   <NA>
7892506     <NA>   <NA>
> z[z$SYMBOL %in% c("PER1","CSNK1E"),]
        ENTREZID SYMBOL
8012349     5187   PER1
8076056     1454 CSNK1E

So if you take an unfiltered ExpressionSet and annotate it using annotateEset, then I can assure you that both PER1 and CSNK1E will be in there.