replicates and low expression levels

rgentleman ★ 5.5k

On Fri, May 30, 2003 at 05:28:45PM +0100, Crispin Miller wrote:
> Hi,
> Just a quick question about low expression levels on Affy systems - I hope it's not too off-topic; it is about normalisation and data analysis...
>
> I've heard a lot of people advocating that it's a good idea to perform an initial filtering on either Present, Marginal or Absent calls, or on gene-expression levels (so that only genes with an expression > 40, say, after scaling to a TGT of 100 using the MAS5.0 algorithm, are part of the further analysis). Firstly, am I right in thinking that this is to eliminate data that are too close to the background noise level of the system?
>
> I wanted to canvass opinion as to whether people feel we need to do this if we have replicates and are using statistical tests - rather than just fold-changes - to identify 'interesting' genes. Does the statistical testing do this job for us?
>
> Crispin

Hi,

In my opinion you should always do some sort of non-specific filtering. What you have described is one form of it; others include removing genes that show little or no variability across samples. I think of non-specific filtering as filtering without reference to phenotype (of any sort).

There are a number of reasons for doing this, some motivated by the biology and some by the statistics.

First off, especially for Affy, the chip is designed for all tissue types, but a commonly held belief is that only about 40% of the genome is expressed in any specific tissue type. So, for any experiment you will have a pretty large number of probes for genes that are not expressed in the tissue you are looking at.

From a statistical perspective you need to be a little bit cautious if you are going to standardize genes across samples (this is pretty common). If you do not remove those genes that show little variability before standardization then you have just elevated the noise to the same status as the signal (and if the 40% estimate is right then you actually have more noise than signal - not too pleasant).

Using a test statistic (such as a t-test) does not help, since that measures the between-group differences relative to the variation. So if there is very little variation and a small difference in mean, you get an enormous t-statistic and a small p-value. Of course in this case looking at the "fold-change" or the size of the effect will indicate a problem, but not many people check all the things that need checking (and what to check depends on the test that you have just carried out).

It seems to me to be much easier to just filter those genes with no expression or little variation out at the very start. If they don't show any variation across samples they can't help to classify or to cluster (there is no information about any phenotype contained in them).

Robert

> _______________________________________________
> Bioconductor mailing list
> Bioconductor@stat.math.ethz.ch
> https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor

--
Robert Gentleman
Associate Professor, Department of Biostatistics
Harvard School of Public Health
phone: (617) 632-5250, fax: (617) 632-2444, office: M1B20
email: rgentlem@jimmy.harvard.edu
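The point about the t-statistic can be seen on a small toy example. This is only a sketch in base R with made-up log2 expression values for two hypothetical probes ("flat" and "real") in two groups of five samples; the 1 log2-unit effect-size cutoff is an arbitrary assumption, not a recommendation from the thread.

```r
# Two groups of 5 samples each; all values are invented for illustration.
group <- factor(rep(c("A", "B"), each = 5))

# "flat": almost no variation, tiny difference in means (0.05 on log2 scale).
flat <- c(5.00, 5.01, 4.99, 5.00, 5.01,  5.05, 5.06, 5.04, 5.05, 5.06)

# "real": ordinary variation, large difference in means (2 on log2 scale).
real <- c(6.5, 7.2, 7.0, 6.8, 7.5,  8.5, 9.2, 9.0, 8.8, 9.5)

tt_flat <- t.test(flat ~ group)   # Welch two-sample t-test
tt_real <- t.test(real ~ group)

# Effect size: difference in group means (B minus A).
effect <- function(x) mean(x[group == "B"]) - mean(x[group == "A"])

result <- data.frame(
  probe     = c("flat", "real"),
  abs_t     = abs(unname(c(tt_flat$statistic, tt_real$statistic))),
  p_value   = c(tt_flat$p.value, tt_real$p.value),
  log2_diff = c(effect(flat), effect(real))
)
print(result)

# Both probes look "significant"; the flat probe even has the larger |t|.
# Only the effect size separates them.
min_log2_diff <- 1   # hypothetical cutoff
result$probe[abs(result$log2_diff) >= min_log2_diff]
```

Here the test alone would flag both probes, which is why checking the effect size, or filtering low-variation probes beforehand, matters.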

Claire Wilson ▴ 280

Robert Gentleman wrote:
> From a statistical perspective you need to be a little bit cautious
> if you are going to standardize genes across samples (this is pretty
> common). If you do not remove those genes that show little
> variability before standardization then you have just elevated the
> noise to the same status as the signal (and if the 40% estimate is
> right then you actually have more noise than signal - not too
> pleasant).

Hi,

Just to clarify a couple of points. This suggests to me that filtering of genes with low expression is required prior to normalization, and I was just wondering how this is achieved in Bioconductor without the use of Present/Absent calls. And, following on from a later point:

> It seems to me to be much easier to just
> filter those genes with no expression or little variation out at the
> very start.

what would be your filter for no expression or little variation?

Sorry if these questions are a little basic.

Thanks,
Claire

On Mon, Jun 02, 2003 at 11:17:26AM +0100, Claire Wilson wrote:
[much wisdom cut]
> Hi,
>
> Just to clarify a couple of points. This suggests to me that filtering of genes with low expression is required prior to normalization and I was just wondering in Bioconductor how this is achieved without the use of Present/Absent calls

One should keep in mind the assumption behind many of the normalization techniques: "most of the genes are not differentially expressed across the experiments". Filtering before normalization/scaling should be done with that in mind.

In the case of the "affy" package (since you mention P/A), the normalization is performed at the probe level (no need for P/A).

L.

ADD REPLY • link 21.9 years ago Laurent Gautier ★ 2.3k

Stephen Henderson ★ 1.0k

I think you have to normalise prior to filtering. The noise should be a reliable component of the normalisation procedure.

The second point is interesting: how do you select a filter for low-variance, low-expression data? In the first instance, if it's not varying then you might as well filter it, as it is not interesting to your given experiment in any case -- regardless of whether it is noise or low expression!

The question of what constitutes present and absent is more difficult. I would like to see a better example of spike-in data in the literature that really focuses on low expression values (though the GeneLogic and Affy sets are an excellent and appreciated resource for designing expression indices generally).

Stephen

rgentleman ★ 5.5k

On Mon, Jun 02, 2003 at 11:17:26AM +0100, Claire Wilson wrote:
> Just to clarify a couple of points. This suggests to me that filtering of genes with low expression is required prior to normalization and I was just wondering in Bioconductor how this is achieved without the use of Present/Absent calls and following on from a later point
>
> > you have just carried out). It seems to me to be much easier to just
> > filter those genes with no expression or little variation out at the
> > very start.
>
> what would be your filter for no expression or little variation?
>
> Sorry if these questions are a little basic

Nope, they are important (and I'm not sure that they have been well dealt with yet, as the variety of opinion shows).

I explored the gap filter as one way of filtering out those probes that were unlikely to be informative. Otherwise, simply looking for a decent interquartile range could be helpful (decent is of course in the eye of the beholder). See the man pages in genefilter for more info and examples.

As for not expressed, my current thinking is as follows. Suppose that the smallest meaningful group (by phenotype) has k samples in it, e.g. we have 100 ALL samples and 10 have a t(4;11) translocation and all other subgroups of interest are larger. Then I would want to require that some number, like 8 or 9, of the samples had high expression values for the probe. I would definitely not be interested in a probe that had, say, only 3 (of the 100) samples registering an expression value larger than 100 (in Affy terms). I just don't think that there is enough information in it to draw conclusions.

That said, my view changes pretty substantially if the probes are identified with genes that are implicated in some of the basic mechanisms of disease -- then I would poke around a little in any event.

This remains a bit of an art, and there are tradeoffs between including irrelevant probes and excluding relevant ones. As we learn more about the biology (or get more biological meta-data) this will become simpler. For example, suppose that I am studying T-cells and I know that gene X is not normally expressed in these cells. Then in 3 (of my 100) samples gene X is expressed. Well, I'd be pretty interested...

Mileage depends on the conditions, the vehicle and the driver; yours may vary.

Regards,
Robert
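The two non-specific filters described above (enough samples with high expression, and a decent interquartile range) could be sketched roughly as follows. This is a base R stand-in written for illustration, not the genefilter package's own API; the helper names `k_over_A` and `iqr_at_least`, the toy matrix, and the cutoffs (k = 5, A = 100, IQR of 0.5 on the log2 scale) are all assumptions.

```r
# Hypothetical intensities: 4 probes (rows) x 10 samples (columns), Affy-like scale.
expr <- rbind(
  silent   = rep(20, 10),                  # never expressed
  flat     = rep(500, 10),                 # expressed, but constant
  rare     = c(rep(30, 8), 400, 450),      # high in only 2 samples
  subgroup = c(rep(30, 4), rep(600, 6))    # high in a subgroup of 6 samples
)

# Keep a probe if at least k samples have a value above A.
k_over_A <- function(x, k, A) rowSums(x > A) >= k

# Keep a probe if its interquartile range across samples is at least min_iqr.
iqr_at_least <- function(x, min_iqr) apply(x, 1, IQR) >= min_iqr

# k is chosen a little below the size of the smallest subgroup of interest
# (6 samples here), mirroring the "8 or 9 out of 10" reasoning above.
expressed <- k_over_A(expr, k = 5, A = 100)
variable  <- iqr_at_least(log2(expr), min_iqr = 0.5)

keep <- expressed & variable
filtered <- expr[keep, , drop = FALSE]
rownames(filtered)
```

Neither filter looks at phenotype labels, so both are non-specific in the sense used earlier in the thread. A probe like `rare` is dropped automatically, which is exactly the case where biological knowledge might justify a second look.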

Crispin Miller ★ 1.1k

Hi,

In addition, filtering prior to normalisation would need a chip-specific threshold to filter by (otherwise intensity levels between chips would be directly comparable and we wouldn't need to normalise). Presumably, this would be done by computing global statistics and then determining the threshold relative to these... This sounds pretty much like normalisation? :-)

Crispin
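Crispin's observation can be checked on a toy case: a threshold defined relative to a per-chip global statistic selects the same cells as scaling each chip to a common value and applying one fixed threshold. The sketch below assumes the chip median as the global statistic and uses invented intensities; `frac` and `target` are hypothetical parameters.

```r
# Two hypothetical chips measuring the same 5 probes; chip2 is simply
# twice as bright as chip1 (made-up raw intensities).
raw <- cbind(chip1 = c(20, 40, 100, 400, 1000),
             chip2 = c(40, 80, 200, 800, 2000))

chip_median <- apply(raw, 2, median)

# Option 1: chip-specific threshold, a fixed fraction of each chip's median.
frac <- 0.5
keep_raw <- sweep(raw, 2, frac * chip_median, FUN = ">")

# Option 2: scale every chip to a common median, then use a single threshold.
target <- 100
scaled <- sweep(raw, 2, target / chip_median, FUN = "*")
keep_scaled <- scaled > frac * target

# The two procedures keep exactly the same cells.
all(keep_raw == keep_scaled)
```

Since x > frac * median is the same condition as x * target / median > frac * target, choosing a chip-specific threshold this way already amounts to a (scaling) normalisation.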

Stephen Henderson ★ 1.0k

Quite... Quite!

Relating to the next point -- you can get quite a good appreciation of the inherent uncertainty in expression values by using a resampling strategy: replace the real PM values in an AffyBatch object with resampled MM values (after probe-level normalisation of the batch?), using something like

    library(affy)
    ABObject2 <- ABObject
    # draw one resampled MM value for every PM cell
    pm(ABObject2) <- sample(mm(ABObject), size = length(pm(ABObject)), replace = FALSE)

If you then use an expression diagnostic like

    false.eset <- rma(ABObject2)

you can create a distribution of noise-created values -- a useful tool when considering a filter cutoff.
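One way such a noise distribution might be turned into a filter cutoff is to take a high quantile of it. The sketch below uses simulated values as a stand-in for the noise-derived expression values (for instance those held in `false.eset` above); the 95% level and all numbers are assumptions for illustration only.

```r
# Stand-in for noise-only expression values on the log2 scale; in practice
# these would come from the resampled AffyBatch rather than from rnorm().
set.seed(123)
noise_values <- rnorm(5000, mean = 6, sd = 0.4)

# Use a high quantile of the noise distribution as the filter cutoff.
cutoff <- quantile(noise_values, probs = 0.95, names = FALSE)

# Apply it to (equally hypothetical) real expression values.
real_values <- c(probeA = 5.8, probeB = 6.3, probeC = 9.1)
above_noise <- real_values > cutoff
above_noise
```

The quantile level trades off false keeps against false drops, so it is worth inspecting the whole distribution (e.g. with `hist(noise_values)`) rather than relying on a single number.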
