Data filtering

0

Entering edit mode

Guest User ★ 13k

@guest-user-4897

Last seen 10.3 years ago

Greetings friends! I seek help with data that I have : 3 time points, 3 genotypes, 3 replicates for each of these = 27 libraries The goal is to find genes that have different time expression profiles amongst 2 or more genotypes. After our 1st round of data analysis, (including TMM normalization), the time course graphs and box plots were so noisy in terms of high std error at each time point, that it was hard to say if expression profile of one genotype was overlapping or distinct from that for the other genotypes! R code attached at bottom of this post. So in short - we now need to employ data filters to check and reduce noise in our data. Some ideas are removing genes that have low expression (count) levels removing genes that have high variance across replicates removing genes that have low variance across time (constitutively expressed genes are biologically less interesting) So my question to you is what stage of my analysis do I employ these filters? On the raw data itself, prior to normalization? Or should I perform the TMM normalization, use the norm factors to transform my data to non-integer normalized counts and then filter (in which case I think I cannot fit them into negative binomial model, right?)


count = read.table("Input.txt", sep="\t", header=T)
#$#$ read in raw count mapped data

f.count = count[apply(count[,-c(1,ncol(count))],1,sum) > 27,]
#$#$ filter ou genes with total read count < 27 across all libraries

f.dat = f.count[,-c(1,ncol(count))]
#$#$ select only read count, not rest of data frame

S = factor(rep(c("gen1","gen2","gen3"),rep(9,3)))
#$#$ define group

Time = factor(rep(rep(c("0","10","20"),rep(3,3)),3))
#$#$ define time

Time.rep = rep(1:3,9)
#$#$ define replicate

Group = paste(S,Time,Time.rep,sep="_")
#$#$ define group_time_replicate

library(edgeR)
#$#$ load edgeR package

f.factor = data.frame(files = names(f.dat), S = S , Time = Time,
lib.size = c(apply(f.dat,2,sum)),norm.factors =
calcNormFactors(as.matrix(f.dat)))
#$#$  make data for edgeR method

count.d = new("DGEList", list(samples = f.factor, counts =
as.matrix(f.dat)))
#$#$  make data for edgeR method

design = model.matrix(~ Time + S)
#$#$  make design data for edgeR method

count.d = calcNormFactors(count.d)
#$#$  Normalize TMM

glmfit.d = glmFit(count.d, design, dispersion = 0.1)
#$#$  Fit the Negative Binomial Gen Lin Models

lrt.count = glmLRT(count.d, glmfit.d)
#$#$  Likelihood ratio tests

result.count = data.frame(f.count, lrt.count$table)
#$#$  combining raw data and results from edgeR

result.count$FDR = p.adjust(result.count$p.value,method="BH")
#$#$  calculating the False Discovery Rate

write.table(result.count,
"edgeR.Medicago_count_WT_Mu3.txt",sep="\t",row.names=F)
#$#$  saving the combined data set

-- output of sessionInfo(): . -- Sent via the guest posting facility at bioconductor.org.

Normalization edgeR Normalization edgeR • 1.8k views

ADD COMMENT • link updated 12.2 years ago by Mark Robinson ▴ 880 • written 12.2 years ago by Guest User ★ 13k

0

Entering edit mode

Mark Robinson ▴ 880

@mark-robinson-4908

Last seen 6.1 years ago

Hi Anand, I've added a few "reactions" below; I hope it can help. > Greetings friends! > > I seek help with data that I have : 3 time points, 3 genotypes, 3 replicates for each of these = 27 libraries > > The goal is to find genes that have different time expression profiles amongst 2 or more genotypes. > After our 1st round of data analysis, (including TMM normalization), the time course graphs and box plots were so noisy in terms of high std error at each time point, that it was hard to say if expression profile of one genotype was overlapping or distinct from that for the other genotypes! R code attached at bottom of this post. What did you actually plot? What did an MDS plot look like? > So in short - we now need to employ data filters to check and reduce noise in our data. Some ideas are > removing genes that have low expression (count) levels > removing genes that have high variance across replicates > removing genes that have low variance across time (constitutively expressed genes are biologically less interesting) I understand the first one (removing low counts) and would recommend it, but the statistics should somewhat take care of highlighting which genes are differential, between the contrast of interest that you specify. So, are your second and third really necessary? > So my question to you is what stage of my analysis do I employ these filters? > On the raw data itself, prior to normalization? > Or should I perform the TMM normalization, use the norm factors to transform my data to non-integer normalized counts and then filter (in which case I think I cannot fit them into negative binomial model, right?) > >


> count = read.table("Input.txt", sep="\t", header=T)
> #$#$ read in raw count mapped data
>
> f.count = count[apply(count[,-c(1,ncol(count))],1,sum) > 27,]
> #$#$ filter ou genes with total read count < 27 across all libraries

I'm not sure where this comes from.  The edgeR manual shows something
similar using the cpm() function as opposed to total read count.  It's
still a somewhat arbitrary cutoff, but see "3.3.4 Normalization and
Filtering":
http://bioconductor.org/packages/2.11/bioc/vignettes/edgeR/inst/doc/ed
geRUsersGuide.pdf


> f.dat = f.count[,-c(1,ncol(count))]
> #$#$ select only read count, not rest of data frame
>
> S = factor(rep(c("gen1","gen2","gen3"),rep(9,3)))
> #$#$ define group
>
> Time = factor(rep(rep(c("0","10","20"),rep(3,3)),3))
> #$#$ define time
>
> Time.rep = rep(1:3,9)
> #$#$ define replicate
>
> Group = paste(S,Time,Time.rep,sep="_")
> #$#$ define group_time_replicate
>
> library(edgeR)
> #$#$ load edgeR package
>
> f.factor = data.frame(files = names(f.dat), S = S , Time = Time,
lib.size = c(apply(f.dat,2,sum)),norm.factors =
calcNormFactors(as.matrix(f.dat)))
> #$#$  make data for edgeR method
>
> count.d = new("DGEList", list(samples = f.factor, counts =
as.matrix(f.dat)))
> #$#$  make data for edgeR method

I *think* your manual construction is ok here, but I would suggest in
practice to just use the standard workflow:

d <- DGEList(counts=<>,group=<>)
d <- d[<mysubset>,]
d <- calcNormFactors(d)

You can always annotate d$samples data.frame afterwards if necessary
... but you don't even need to, since none of this is used in the
(design matrix) construction below.

Note also that, in general, rowSums(.)/colSums(.) is preferred to
apply(.,1,sum)/apply(.,2,sum) ... faster and (for me at least) easier
to read.


> design = model.matrix(~ Time + S)
> #$#$  make design data for edgeR method
>
> count.d = calcNormFactors(count.d)
> #$#$  Normalize TMM

Maybe do your MDS plot here before proceeding?


> glmfit.d = glmFit(count.d, design, dispersion = 0.1)
> #$#$  Fit the Negative Binomial Gen Lin Models

You have replicates and a good number of degrees of freedom.  Why not
use the data to estimate dispersion instead of hard-coding it like
this?  And, where did 0.1 come from?

>
> lrt.count = glmLRT(count.d, glmfit.d)
> #$#$  Likelihood ratio tests

Here, you need to be careful what glmLRT is actually testing.  By
default, this testing the last column of the design matrix (read
?glmLRT).  Is this what you intend?

Hope that helps.

Best regards,
Mark


>
> result.count = data.frame(f.count, lrt.count$table)
> #$#$  combining raw data and results from edgeR
>
> result.count$FDR = p.adjust(result.count$p.value,method="BH")
> #$#$  calculating the False Discovery Rate
>
> write.table(result.count,
"edgeR.Medicago_count_WT_Mu3.txt",sep="\t",row.names=F)
> #$#$  saving the combined data set
>

> > -- output of sessionInfo(): > > . > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ---------- Prof. Dr. Mark Robinson Bioinformatics Institute of Molecular Life Sciences University of Zurich Winterthurerstrasse 190 8057 Zurich Switzerland v: +41 44 635 4848 f: +41 44 635 6898 e: mark.robinson at imls.uzh.ch o: Y11-J-16 w: http://tiny.cc/mrobin ---------- http://www.fgcz.ch/Bioconductor2012

ADD COMMENT • link 12.2 years ago Mark Robinson ▴ 880

0

Entering edit mode

Hi, Just a quick comment regarding the pre-filtering steps: On Wed, Oct 10, 2012 at 6:06 AM, Mark Robinson <mark.robinson at="" imls.uzh.ch=""> wrote: [snip] >> So in short - we now need to employ data filters to check and reduce noise in our data. Some ideas are >> removing genes that have low expression (count) levels >> removing genes that have high variance across replicates >> removing genes that have low variance across time (constitutively expressed genes are biologically less interesting) > > > I understand the first one (removing low counts) and would recommend it, but the statistics should somewhat take care of highlighting which genes are differential, between the contrast of interest that you specify. So, are your second and third really necessary? I usually get a bit wary when I bring "the labels" of my data into consideration when filtering (so, the "within replicate stuff" smells a bit fishy to me). A more rigorous analysis of what you could (and probably shouldn't) do has been published by some familiar names in the bioconductor community: Independent filtering increases detection power for high-throughput experiments http://www.pnas.org/content/107/21/9546.long Which is probably worth reading ... HTH, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD REPLY • link 12.2 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

On Wed, Oct 10, 2012 at 3:06 AM, Mark Robinson <mark.robinson at="" imls.uzh.ch="">wrote: > Hi Anand, > > I've added a few "reactions" below; I hope it can help. > > > > Greetings friends! > > > > I seek help with data that I have : 3 time points, 3 genotypes, 3 > replicates for each of these = 27 libraries > > > > The goal is to find genes that have different time expression profiles > amongst 2 or more genotypes. > > > After our 1st round of data analysis, (including TMM normalization), the > time course graphs and box plots were so noisy in terms of high std error > at each time point, that it was hard to say if expression profile of one > genotype was overlapping or distinct from that for the other genotypes! R > code attached at bottom of this post. > > What did you actually plot? What did an MDS plot look like? > > Hello Mark, Per your advice, we ended up making the MDS plots. The MDS plot is attached for one genotype only, 9 time points and 4 replicates per time point. We have not generated MDS plot for across genotypes data as well - that is our next step. But even now, it looks like there is quite a bit of variability of libraries across replicates. It looks that for each time point we need to remove one or more libraries that are outliers. In order to do that I suppose there are a few different ways to do accomplish this : * * *1.* Remove just one library that is an outlier, like T0.2? * * *2.* Remove entire time points because of the scatter of the reps, like T0.1, T0.2, T0.3 and T0.4 each of which are quite distant from each other on this MDS plot? *3.* Remove an entire replicate and retain others, in our data I think replicate 2 is different from the other three reps, but I dont think this MDS plot shows that, does it? A simple heirarchical clustering of the 9 time points * 4 reps = 36 libraries is attached. Here you can see similar behavior as seen in the MDS plot, though the visualizations are different. How do you reckon we should remove the 'noisy' data, if we should do it at all? Thanks again. - Anand -------------- next part -------------- A non-text attachment was scrubbed... Name: MDS_plots_A17_9timepoints_4repseach.pdf Type: application/pdf Size: 4835 bytes Desc: not available URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20121015="" 93dcdbff="" attachment.pdf=""> -------------- next part -------------- A non-text attachment was scrubbed... Name: heatmap_gini_a17.pdf Type: application/pdf Size: 15307 bytes Desc: not available URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20121015="" 93dcdbff="" attachment-0001.pdf="">

ADD REPLY • link 12.2 years ago AKSR ▴ 20

0

Entering edit mode

Dear Anand, One subtle thing ? from the scale of the plot, my guess is that you ran plotMDS() on a matrix of counts? Note that this could be different (and is very different computationally) from running plotMDS() on a DGEList object, which does a ("normalized") count- specific calculation. Here is an example: library(edgeR) counts <- matrix(rnbinom(6000, size = 1/2, mu = 10),1000,6) counts[1:200,4:6] <- counts[1:200,4:6] + 10 y <- DGEList(counts) cols <- rep(c("black","blue"),each=3) par(mfrow=c(1,2)) plotMDS(y,col=cols) plotMDS(y$counts,col=cols) I'm not sure how much that will change your result, but at least it's on a more appropriate scale to think about the decision to exclude samples and so on. Mark On 16.10.2012, at 00:50, Anand K S Rao wrote: > > > On Wed, Oct 10, 2012 at 3:06 AM, Mark Robinson <mark.robinson at="" imls.uzh.ch=""> wrote: > Hi Anand, > > I've added a few "reactions" below; I hope it can help. > > > > Greetings friends! > > > > I seek help with data that I have : 3 time points, 3 genotypes, 3 replicates for each of these = 27 libraries > > > > The goal is to find genes that have different time expression profiles amongst 2 or more genotypes. > > > After our 1st round of data analysis, (including TMM normalization), the time course graphs and box plots were so noisy in terms of high std error at each time point, that it was hard to say if expression profile of one genotype was overlapping or distinct from that for the other genotypes! R code attached at bottom of this post. > > What did you actually plot? What did an MDS plot look like? > > > > Hello Mark, > > Per your advice, we ended up making the MDS plots. The MDS plot is attached for one genotype only, 9 time points and 4 replicates per time point. > We have not generated MDS plot for across genotypes data as well - that is our next step. > > But even now, it looks like there is quite a bit of variability of libraries across replicates. > > It looks that for each time point we need to remove one or more libraries that are outliers. In order to do that I suppose there are a few different ways to do accomplish this : > > 1. Remove just one library that is an outlier, like T0.2? > > 2. Remove entire time points because of the scatter of the reps, like T0.1, T0.2, T0.3 and T0.4 each of which are quite distant from each other on this MDS plot? > > 3. Remove an entire replicate and retain others, in our data I think replicate 2 is different from the other three reps, but I dont think this MDS plot shows that, does it? A simple heirarchical clustering of the 9 time points * 4 reps = 36 libraries is attached. Here you can see similar behavior as seen in the MDS plot, though the visualizations are different. > > How do you reckon we should remove the 'noisy' data, if we should do it at all? > > Thanks again. > > - Anand > > <mds_plots_a17_9timepoints_4repseach.pdf><heatmap_gini_a17.pdf>

ADD REPLY • link 12.2 years ago Mark Robinson ▴ 880

Login before adding your answer.