Exon probesets as "core" probsets in pd.mogene.2.0.st
1
0
Entering edit mode
@captainfats-7180
Last seen 9.9 years ago
United States

Using Oligo/limma for analysis of the Mouse Gene 2.0 ST array, I noticed a few un annotated  results that lack any gene annotation, upon inspecting some of these IDs on Netaffx they turned out to be the "exon" probesetids  not the transcript_ids. I noticed that for some genes both the Transcript_cluster_id and the individual probeset_ids (exon?) were present in my output. 

ID

symbol

name

Ensembl

adj.P

logFC

17219448

Copa

coatomer...

ENSMUSG00000026553

0.2239127307

0.1570203949

17219450

NA

NA

NA

0.3382192058

0.148945971

17219451

NA

NA

NA

0.2129264336

0.2089252306

17219452

NA

NA

NA

0.3294590243

0.1762376171

17219453

NA

NA

NA

0.0730023042

0.2049750825

17219454

NA

NA

NA

0.4965835277

0.0907620885

Looking into pd.mogene.2.0.st it does appear that this happens a times, where the transcript_cluster_id and the individual exon id are both in the "core" set. Is this an affymetrix annotation issue/feature and does anyone have a suggestion on what to do. 

con <- db(pd.mogene.2.0.st)
head(dbGetQuery(con, "select * from pmfeature inner join 
+ core_mps on pmfeature.fsetid=core_mps.fsetid where 
+ core_mps.meta_fsetid='17219448';"))
      fid   fsetid  atom    x    y meta_fsetid transcript_cluster_id   fsetid
1  216932 17219449 16055  923  134    17219448              17219448 17219449
2 1015902 17219450 16056  341  630    17219448              17219448 17219450
3 2307453 17219451 16057  680 1431    17219448              17219448 17219451
4  164045 17219452 16058 1232  101    17219448              17219448 17219452
5 1692866 17219453 16059  265 1050    17219448              17219448 17219453
6  811295 17219454 16060  458  503    17219448              17219448 17219454

For each fsetid of meta_fesetid 17219448 each of the fsetid (which I presume is an exon probeset) also appears as a core_mps

dbGetQuery(con, "select * from pmfeature inner join 
+ core_mps on pmfeature.fsetid=core_mps.fsetid where 
+ core_mps.meta_fsetid='17219449';")
     fid   fsetid  atom   x   y meta_fsetid transcript_cluster_id   fsetid
1 216932 17219449 16055 923 134    17219449              17219449 17219449

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] pd.mogene.2.0.st_2.14.1 oligo_1.30.0            Biobase_2.26.0          oligoClasses_1.28.0     RSQLite_1.0.0          
 [6] DBI_0.3.1               Biostrings_2.34.1       XVector_0.6.0           IRanges_2.0.1           S4Vectors_0.4.0        
[11] BiocGenerics_0.12.1    

loaded via a namespace (and not attached):
 [1] affxparser_1.38.0     affyio_1.34.0         BiocInstaller_1.16.1  bit_1.1-12            codetools_0.2-9       ff_2.2-13            
 [7] foreach_1.4.2         GenomeInfoDb_1.2.3    GenomicRanges_1.18.3  iterators_1.0.7       preprocessCore_1.28.0 splines_3.1.2        
[13] tools_3.1.2           zlibbioc_1.12.0      
 
 
 

 

oligo pd.mogene.2.0.st • 1.2k views
ADD COMMENT
1
Entering edit mode
@james-w-macdonald-5106
Last seen 10 hours ago
United States

This is one of the fun aspects of the newer generation of Affymetrix arrays; they tend to reuse the exon level probesets as controls when you summarize at the transcript level.

There does however appear to be a problem with the pd.mogene.2.0.st package:

> dbGetQuery(con, "select fsetid, type, meta_fsetid from featureSet inner join core_mps using(fsetid) where meta_fsetid in ('17219448','17219450','17219451','17219452');")
     fsetid type meta_fsetid
1  17219449    1    17219448
2  17219450    1    17219448
3  17219451    1    17219448
4  17219452    1    17219448
5  17219453    1    17219448
6  17219454    1    17219448
7  17219455    1    17219448
8  17219456    1    17219448
9  17219457    1    17219448
10 17219459    1    17219448
11 17219460    1    17219448
12 17219461    1    17219448
13 17219462    1    17219448
14 17219463    1    17219448
15 17219464    1    17219448
16 17219465    1    17219448
17 17219466    1    17219448
18 17219469    1    17219448
19 17219470    1    17219448
20 17219471    1    17219448
21 17219472    1    17219448
22 17219473    1    17219448
23 17219474    1    17219448
24 17219475    1    17219448
25 17219476    1    17219448
26 17219477    1    17219448
27 17219478    1    17219448
28 17219479    1    17219448
29 17219480    1    17219448
30 17219481    1    17219448
31 17219482    1    17219448
32 17219483    1    17219448
33 17219484    1    17219448
34 17219486    1    17219448
35 17219487    1    17219448
36 17219488    1    17219448
37 17219450    1    17219450
38 17219451    1    17219451
39 17219452    1    17219452

But the last three lines there should have a 'type' of 6, rather than 1:

> dbGetQuery(con, "select * from type_dict;")
   type                    type_id
1     1                       main
2     2              control->affx
3     3              control->chip
4     4  control->bgp->antigenomic
5     5      control->bgp->genomic
6     6             normgene->exon
7     7           normgene->intron
8     8   rescue->FLmRNA->unmapped
9     9   control->affx->bac_spike
10   10             oligo_spike_in
11   11            r1_bac_spike_at
12   12 control->affx->polya_spike
13   13        control->affx->ercc
14   14  control->affx->ercc->step

If we look at the netaffxTranscript.rda data that come with this package:

> load(paste0(path.package("pd.mogene.2.0.st"), "/extdata/netaffxTranscript.rda"))
> pd <- pData(netaffxTranscript)
> pd[pd$transcriptclusterid %in% c("17219448","17219450","17219451","17219452"),c(1:2,18)]
         transcriptclusterid probesetid       category
17219448            17219448   17219448           main
17219450            17219450   17219450 normgene->exon
17219451            17219451   17219451 normgene->exon
17219452            17219452   17219452 normgene->exon

Ideally you would be able to excise these control probesets so you don't have this stuff popping up in your top hits, but evidently the normgene->exon controls are slipping through:

> table(dbGetQuery(con, "select type from featureSet;")[,1])

     1      2      4      7      9     12
263551     18     23   5331     18     39
> table(pd[,18])

             control->affx   control->affx->bac_spike
                        18                         18
 control->affx->ercc-spike control->affx->polya_spike
                        92                         39
 control->bgp->antigenomic                       main
                        23                      33793
            normgene->exon           normgene->intron
                      1352                       5331
                  reporter                     rescue
                        82                        558
          rescue->flmrna->                       rrna
                        14                         25

There are evidently lots of control probes that are sneaking through the cracks, so pdInfoBuilder needs some updating to handle these arrays.

 

 

 

ADD COMMENT
0
Entering edit mode

Jim thanks for the insight, looks like a I can do a little cleanup on my own now that I know what to look for.  Yes the new arrays are very fun! 

ADD REPLY

Login before adding your answer.

Traffic: 487 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6