Intronic sequences are the majority of the top differentially expressed genes
1
0
Entering edit mode
@ben_cossins-8532
Last seen 5.7 years ago
United Kingdom

So I have seen this Affymetrix Intronic Normalization Control Probes Differentially Expressed?. Like the OP I'm seeing a large number of intronic sequences in the most differentially expressed genes.

using toptable to get the top 100 results (not correcting for multiple testing) I get between 65 to 81 of the 100 being intronic sequences dependant on which pairwise comparison I'm doing.

As an explanation of the experiment, I'm looking at the effects of an imprinted gene in the offspring and how that regulates gene expression in the mother. So the arrays are from either WT, KO or Transgenic offspring implanted into WT mothers, and the samples come from a number of tissues (though my project is only concerning the maternal pancreas).

I've included my R code, I apologise if its not the prettiest or there are some ugly hacks to make it output results.

From the previously linked thread I see that I could exclude the intronic sequences (which would reduce the amount of multiple testing, so some of the results might become significant that way), but obviously I'd like to work out what I'm doing wrong first.  (maximum of 5 tags, so just specifying I'm using the pd.mogene.2.0.st array)

 

Please note I am aware that I am not selecting significant genes, where I have filtered by P-value (mostly for the GO analysis) it is because I wanted to make sure the script worked properly and that I get some kind of pathway out.

limma microarray normalization oligo differential gene expression • 1.5k views
ADD COMMENT
0
Entering edit mode

Removed unnecessary code, and tried to follow the bioconductor style guide

 

## loading libraries   ---- run these line by line
source("http://bioconductor.org/biocLite.R")
library(Biobase)
library(oligo)  # array data handling
library(limma)  # linear models of array data
library(mogene20sttranscriptcluster.db)  # array daya annotation

## data input
setwd('/Pancreas_CEL_Files/')  # change to directory containing array data
pd <- read.AnnotatedDataFrame("cel.pData.txt", header = TRUE, row.names = 1) # read in phenotype data
celfiles <- read.celfiles(filenames = rownames(pData(pd)), phenoData = pd)  # read in data from microarray, incorporating phenotype data
norm <- rma(celfiles)  # normalise expression data
groups <- factor(c(rep("WT",4),rep("KO",4),rep("2x",4)))  # create groups to compare - this needs to be changed to reflect sample order


## array annotation
gns <- select(mogene20sttranscriptcluster.db, featureNames(norm), c("ENSEMBL","SYMBOL","GENENAME","REFSEQ"))  # get annotation data
gns <- gns[!duplicated(gns[,1]),]  # remove duplicate entries


## pairwise comparisons & linear modelling
design <- model.matrix(~0+groups)
names <- matrix(c("WTvsKO","WTvs2X","KOvs2X","WTvsKO+2X"), nrow=4, ncol=1)  # create list of comparisons for naming files in loop
acontrast <- makeContrasts(
    WT_vs_KO = (groupsWT - groupsKO),
    WT_vs_2X = (groupsWT - groups2x),
    KO_vs_2X = (groupsKO - groups2x),
    WT_vs_others = (groupsWT - ((groupsKO + groups2x)/2)),
    levels=design)  # compare different conditions
lm1 <- lmFit(exprs(norm), design)  # create linear model of expression data given design matrix
lm1 <- contrasts.fit(lm1, acontrast)  # create linear model factoring the contrasts made
lm1 <- eBayes(lm1)  # compute test statistics on differential expression


## loop for pairwise comparisons, heatmaps and GO analysis
for (a in 1:4) {  
    dat <- toptable(lm1, coef=a, num=100)  # Create dataframe of top results
    dat["Symbol"] <- gns$SYMBOL[match(rownames(dat), gns$PROBEID)]
    dat["GeneName"] <- gns$GENENAME[match(rownames(dat), gns$PROBEID)]
    if (a == 1) WTvKO <- dat
    if (a == 2) WTv2X <- dat
    if (a == 3) KOv2X <- dat
    if (a == 4) WTvKO2X <- dat
}

ADD REPLY
0
Entering edit mode

For future reference, you should post a minimal working example that recapitulates the relevant behaviour. There's a lot of code here that's irrelevant to the issue at hand.

ADD REPLY
0
Entering edit mode

Hopefully fixed that now

ADD REPLY
1
Entering edit mode
@james-w-macdonald-5106
Last seen 1 day ago
United States

Aaron's right - that is a metric ton of irrelevant code, plus when you use = for assignment rather than <-, and 'bad' indenting, it's pretty difficult to scan through to see what you are doing. But do note that this part

### linear modeling
design = model.matrix(~0+groups)
lm1 = lmFit(exprs(norm), design) 
lm1 = eBayes(lm1) 

### pairwise comparisons
names = matrix(c("WTvsKO","WTvs2X","KOvs2X","WTvsKO+2X"), nrow=4, ncol=1)                        # create list of comparisons for naming files in loop
acontrast = makeContrasts(
     WT_vs_KO = (groupsWT - groupsKO),
    WT_vs_2X = (groupsWT - groups2x),
    KO_vs_2X = (groupsKO - groups2x),
    WT_vs_others = (groupsWT - ((groupsKO + groups2x)/2)),
     levels=design)
lm3 = contrasts.fit(lm1, acontrast)

Is not correct. You use contrasts.fit() before eBayes(), not after.

In addition, this part

for (a in 1:4)                                                        # hard coded for 4 comparisons, needs to be changed if the number of comparisons changed
{
    dat=toptable(lm3, coef=a, num=100)                                                # Create dataframe of top results
    dat[,c("Symbol","GeneName")] = NA                                                # create columns for gene symbol and name
    b = length(row.names(dat))                                            # calculate length of table for loop count
    for (n in 1:b)                                                    # loop to add human readable names and gene symbols
    {
        dat[n,6] = gns$SYMBOL[gns$PROBEID == rownames(dat[n,])]                            # add gene symbols    
        dat[n,7] = gns$GENENAME[gns$PROBEID == rownames(dat[n,])]                        # add gene names
    }

Isn't needed. There is a 'genes' list item in the MArrayLM object you are calling lm1, and you can just do

lm1$genes <- gns

and the topTable() results will be correctly annotated. Plus you will make your life easier if you vectorize things rather than using for() loops. Instead of thinking that you are going to add a column of NA values and then iterate through each one, replacing with the correct value, it's better to do something like use match() to create a correctly ordered vector and then add that to the data.frame.

ADD COMMENT
0
Entering edit mode

Thanks for the reply, I've edited the code posted to just the relevant bits and tried to follow the bioconductor style.

 

this code works for creating toptable with annotation, however the majority of my results are still showing up as NA, and when I search them in the affymetrix csv of probe IDs they return "normgene->intron"

ADD REPLY
0
Entering edit mode

Any thoughts?

 

ADD REPLY

Login before adding your answer.

Traffic: 655 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6