So I have seen this Affymetrix Intronic Normalization Control Probes Differentially Expressed?. Like the OP I'm seeing a large number of intronic sequences in the most differentially expressed genes.
using toptable to get the top 100 results (not correcting for multiple testing) I get between 65 to 81 of the 100 being intronic sequences dependant on which pairwise comparison I'm doing.
As an explanation of the experiment, I'm looking at the effects of an imprinted gene in the offspring and how that regulates gene expression in the mother. So the arrays are from either WT, KO or Transgenic offspring implanted into WT mothers, and the samples come from a number of tissues (though my project is only concerning the maternal pancreas).
I've included my R code, I apologise if its not the prettiest or there are some ugly hacks to make it output results.
From the previously linked thread I see that I could exclude the intronic sequences (which would reduce the amount of multiple testing, so some of the results might become significant that way), but obviously I'd like to work out what I'm doing wrong first. (maximum of 5 tags, so just specifying I'm using the pd.mogene.2.0.st array)
Please note I am aware that I am not selecting significant genes, where I have filtered by P-value (mostly for the GO analysis) it is because I wanted to make sure the script worked properly and that I get some kind of pathway out.
Removed unnecessary code, and tried to follow the bioconductor style guide
## loading libraries ---- run these line by line
source("http://bioconductor.org/biocLite.R")
library(Biobase)
library(oligo) # array data handling
library(limma) # linear models of array data
library(mogene20sttranscriptcluster.db) # array daya annotation
## data input
setwd('/Pancreas_CEL_Files/') # change to directory containing array data
pd <- read.AnnotatedDataFrame("cel.pData.txt", header = TRUE, row.names = 1) # read in phenotype data
celfiles <- read.celfiles(filenames = rownames(pData(pd)), phenoData = pd) # read in data from microarray, incorporating phenotype data
norm <- rma(celfiles) # normalise expression data
groups <- factor(c(rep("WT",4),rep("KO",4),rep("2x",4))) # create groups to compare - this needs to be changed to reflect sample order
## array annotation
gns <- select(mogene20sttranscriptcluster.db, featureNames(norm), c("ENSEMBL","SYMBOL","GENENAME","REFSEQ")) # get annotation data
gns <- gns[!duplicated(gns[,1]),] # remove duplicate entries
## pairwise comparisons & linear modelling
design <- model.matrix(~0+groups)
names <- matrix(c("WTvsKO","WTvs2X","KOvs2X","WTvsKO+2X"), nrow=4, ncol=1) # create list of comparisons for naming files in loop
acontrast <- makeContrasts(
WT_vs_KO = (groupsWT - groupsKO),
WT_vs_2X = (groupsWT - groups2x),
KO_vs_2X = (groupsKO - groups2x),
WT_vs_others = (groupsWT - ((groupsKO + groups2x)/2)),
levels=design) # compare different conditions
lm1 <- lmFit(exprs(norm), design) # create linear model of expression data given design matrix
lm1 <- contrasts.fit(lm1, acontrast) # create linear model factoring the contrasts made
lm1 <- eBayes(lm1) # compute test statistics on differential expression
## loop for pairwise comparisons, heatmaps and GO analysis
for (a in 1:4) {
dat <- toptable(lm1, coef=a, num=100) # Create dataframe of top results
dat["Symbol"] <- gns$SYMBOL[match(rownames(dat), gns$PROBEID)]
dat["GeneName"] <- gns$GENENAME[match(rownames(dat), gns$PROBEID)]
if (a == 1) WTvKO <- dat
if (a == 2) WTv2X <- dat
if (a == 3) KOv2X <- dat
if (a == 4) WTvKO2X <- dat
}
For future reference, you should post a minimal working example that recapitulates the relevant behaviour. There's a lot of code here that's irrelevant to the issue at hand.
Hopefully fixed that now