Dear listers,
Does Bioconductor have a toolkit to detect outliers and remove them?
My original dataset looks like this:
            V1       V51       V53        V55       V57
1   -493249600  1.459459 -3.069444  -1.300000  1.935484
2  -1613096495 -1.139269 -5.525281 -16.592593 -1.831978
3   1626196571 -3.500000 -1.011662   2.223881  3.921053
4  -1397009217 -3.571429  1.685714  -1.180297 -6.807692
5   1428659728 -1.405405 -1.469004  -4.779754 -1.033708
6    459853658 -2.158879 -7.510823  -1.085581 -9.382979
7    530182506 -1.431677 -1.336343  -3.126437  4.878788
8   1173842263  1.215385  1.856410  -2.059794 -6.020833
9        28847  2.407895 -2.048889  -1.730337 -1.178947
10 -1961875610  2.864159 -2.301234  -4.733264 -1.172058
V1: internal probe id.
The remaining columns are different samples. Each cell is the fold
change of disease/normal.
A summary of the sample columns (V51, ..., V57) gives the following:
     V51                 V53                  V55                 V57
Min.   :-482.000   Min.   : -55.7342   Min.   :-122.074   Min.   :-14086.750
1st Qu.:  -2.159   1st Qu.:  -1.7312   1st Qu.:  -2.125   1st Qu.:    -1.831
Median :  -1.199   Median :  -1.0416   Median :  -1.200   Median :    -1.080
Mean   :  -0.918   Mean   :   0.1662   Mean   :  -1.027   Mean   :    -1.874
3rd Qu.:   1.441   3rd Qu.:   1.5721   3rd Qu.:   1.419   3rd Qu.:     1.521
Max.   : 198.434   Max.   :1478.1639   Max.   :  95.768   Max.   :   683.519
My question is: is there any package that can detect such outliers
(like -14086.750), remove them, and then compute an "average" for each
gene (instead of each probe)?
Thank you.
Weiwei
--
Weiwei Shi, Ph.D
Research Scientist
GeneGO, Inc.
"Did you always know?"
"No, I did not. But I believed..."
---Matrix III
Dear Weiwei,
The definition of an outlier is not clear-cut, and no data point should
be treated as an outlier unless there is reason to believe it is one.
The simplest way to detect outliers is the 1.5*IQR criterion (flag
values more than 1.5 times the interquartile range below the first
quartile or above the third quartile), for which you can write your own
code (one or two lines). Let me know if there are any other methods to
detect outliers.
Fangxin
--------------------
Fangxin Hong Ph.D.
Plant Biology Laboratory
The Salk Institute
10010 N. Torrey Pines Rd.
La Jolla, CA 92037
E-mail: fhong at salk.edu
(Phone): 858-453-4100 ext 1105
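[Editor's sketch] The 1.5*IQR criterion Fangxin describes can be written in a few lines. The version below is a minimal illustration in Python using only the standard library; the function name iqr_outlier_flags and the choice of the "inclusive" quartile method are assumptions made for this example, and quartile conventions differ slightly between tools. It flags values only; whether a flagged value should be removed is the separate judgment the thread discusses.

```python
import statistics

def iqr_outlier_flags(values, k=1.5):
    """Return one boolean per value: True if it lies outside the IQR fences."""
    # quantiles(..., n=4) returns the three quartile cut points (Q1, median, Q3).
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v < lower or v > upper for v in values]

# The ten V55 values shown in the original post.
v55 = [-1.300000, -16.592593, 2.223881, -1.180297, -4.779754,
       -1.085581, -3.126437, -2.059794, -1.730337, -4.733264]

flags = iqr_outlier_flags(v55)
flagged = [v for v, is_out in zip(v55, flags) if is_out]
print(flagged)  # values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
```

On these ten values the only flagged entry is -16.592593. With thousands of probes per sample the fences would be computed per column in the same way.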
My current approach uses the mahalanobis() distance.
To Sean:
Do you think that example (-14k) is OK?
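[Editor's sketch] For readers unfamiliar with the Mahalanobis approach Weiwei mentions: each probe (row) is scored by its squared distance from the column means, scaled by the sample covariance of the columns, so a row is "far" only relative to how the samples vary together. The code below is a plain Python illustration, not Weiwei's actual code; all function names are assumptions made for this example. Like R's mahalanobis(), it returns squared distances. Applying it to the ten posted rows is purely illustrative.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def covariance_matrix(rows, mu):
    """Sample covariance (denominator n - 1) of the columns."""
    n, p = len(rows), len(mu)
    cov = [[0.0] * p for _ in range(p)]
    for r in rows:
        d = [x - m for x, m in zip(r, mu)]
        for i in range(p):
            for j in range(p):
                cov[i][j] += d[i] * d[j] / (n - 1)
    return cov

def solve(a, b):
    """Solve a @ y = b by Gaussian elimination with partial pivoting."""
    p = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(m[r][c]))
        if abs(m[pivot][c]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, p):
            f = m[r][c] / m[c][c]
            for k in range(c, p + 1):
                m[r][k] -= f * m[c][k]
    y = [0.0] * p
    for i in reversed(range(p)):
        s = sum(m[i][j] * y[j] for j in range(i + 1, p))
        y[i] = (m[i][p] - s) / m[i][i]
    return y

def squared_mahalanobis(rows):
    """Squared Mahalanobis distance of every row from the column means."""
    mu = mean_vector(rows)
    cov = covariance_matrix(rows, mu)
    out = []
    for r in rows:
        d = [x - m for x, m in zip(r, mu)]
        out.append(sum(di * yi for di, yi in zip(d, solve(cov, d))))
    return out

# The ten rows (V51, V53, V55, V57) shown in the original post.
rows = [
    (1.459459, -3.069444, -1.300000, 1.935484),
    (-1.139269, -5.525281, -16.592593, -1.831978),
    (-3.500000, -1.011662, 2.223881, 3.921053),
    (-3.571429, 1.685714, -1.180297, -6.807692),
    (-1.405405, -1.469004, -4.779754, -1.033708),
    (-2.158879, -7.510823, -1.085581, -9.382979),
    (-1.431677, -1.336343, -3.126437, 4.878788),
    (1.215385, 1.856410, -2.059794, -6.020833),
    (2.407895, -2.048889, -1.730337, -1.178947),
    (2.864159, -2.301234, -4.733264, -1.172058),
]
d2 = squared_mahalanobis(rows)
for i in sorted(range(len(rows)), key=lambda i: -d2[i]):
    print(i + 1, round(d2[i], 2))  # row number, squared distance (largest first)
```

Because the mean and covariance are themselves estimated from the data, a single extreme value such as -14086.750 inflates both, which can mask the very point one is trying to find. That is one reason to treat the ranking as a prompt for inspection rather than a removal rule.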
On Sep 19, 2006, at 12:18 PM, Weiwei Shi wrote:
> my current way is using mahalanobis() distance.
>
> to Sean:
> do u think that example: -14k is ok?
That example could be a case of the gene being expressed in one
condition and not being expressed in another. I do not remember where
the data are from (or if you have even described that) or platform
or ..., but I would agree with Sean and say that you do not want to
blindly remove the genes. Note that we are not advising that you
shouldn't remove the gene, just that you should take a careful look
at the data and try to decide what to do.
As Fangxin clearly writes, it is hard to really know what is an
outlier.
Kasper
Thanks for all the suggestions here.
I will proceed without removing those "outliers" first and post an
update with results if necessary.
On 9/19/06, Kasper Daniel Hansen <khansen at stat.berkeley.edu> wrote:
> That example could be a case of the gene being expressed in one
> condition and not being expressed in another. I do not remember where
> the data are from (or if you have even described that) or platform
> or ..., but I would agree with Sean and say that you do not want to
> blindly remove the genes. Note that we are not advising that you
> shouldn't remove the gene, just that you should take a careful look
> at the data and try to decide what to do.
You should really check the original data, not the ratio, and then
decide, rather than blindly choosing to use or remove those extreme
values. As Kasper said, some could well represent genes that show
strong expression in one condition only, either because they become
silenced or activated, and these are potentially very interesting.
Jose
--
Dr. Jose I. de las Heras                    Email: J.delasHeras at ed.ac.uk
The Wellcome Trust Centre for Cell Biology  Phone: +44 (0)131 6513374
Institute for Cell & Molecular Biology      Fax:   +44 (0)131 6507360
Swann Building, Mayfield Road
University of Edinburgh
Edinburgh EH9 3JR
UK
Hi Sean,
Some added info:
I did some pathway analysis and compared the results with and without
those "outliers". My results (validated by domain knowledge, since the
analysis is unsupervised) show that keeping them is better, which
agrees with your suggestion. But the -14k value and some of the other
numbers in the summary in my first email still do not make sense to
me.
weiwei
Some added info:
V1 is the gene id, but each row represents a probe. So there can be
multiple rows with the same V1, since those probes correspond to the
same gene.
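[Editor's sketch] Given that several probe rows can share one V1 gene id, a per-gene "average" amounts to grouping rows by id and summarizing each sample column. The Python sketch below uses the median as the summary, which is an assumption chosen here because it is less sensitive to a single extreme probe than the mean; the function name summarize_by_gene and the gene ids and values are hypothetical, since the ten posted rows all have distinct ids.

```python
from collections import defaultdict
from statistics import median

def summarize_by_gene(probe_rows):
    """Collapse probe-level rows to one row per gene (column-wise median)."""
    grouped = defaultdict(list)
    for gene_id, values in probe_rows:
        grouped[gene_id].append(values)
    # zip(*rows) walks the sample columns of all probes for one gene.
    return {gene: [median(col) for col in zip(*rows)]
            for gene, rows in grouped.items()}

# Hypothetical probe rows: (gene id, fold changes in two samples).
probes = [
    ("geneA", [1.5, -2.0]),
    ("geneA", [1.7, -2.4]),
    ("geneA", [1.6, -500.0]),  # one extreme probe in the second sample
    ("geneB", [-3.0, 2.0]),
]

per_gene = summarize_by_gene(probes)
print(per_gene)
```

Here the extreme -500.0 probe does not dominate the geneA summary, yet nothing has been deleted, so the raw probe values remain available for inspection.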
On 9/19/06 1:02 PM, "Weiwei Shi" <helprhelp at gmail.com> wrote:
> My question is, is there any package which can detect those outliers
> (like -14086.750)and remove them and get an "average" for each gene
> (instead of each probe)?
Hi, Weiwei.
The better option, probably, is to remove questionable data points
BEFORE making a ratio, using good quality control, plots, etc. Extreme
ratios may be biologically very important, so simply removing them is
probably not the best option. I would suggest looking at the two data
values that went into each ratio you think is in question and seeing
if there is an explanation (for example, one probe of the two failed).
Simply removing ratios because they look like outliers potentially
removes your most interesting data.
Sean
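[Editor's sketch] Sean's suggestion of inspecting the two values behind each ratio can be illustrated as follows. This Python sketch makes two assumptions that are not stated in the thread: that the posted values are signed fold changes (a ratio below 1 is reported as -1/ratio, which would explain why no posted value lies between -1 and 1), and that a raw intensity below some background level marks a ratio for review. The function names, the floor of 50.0, and the intensities are all hypothetical.

```python
def signed_fold_change(disease, normal):
    """Assumed encoding: ratio >= 1 is kept, ratio < 1 becomes -1/ratio.

    Assumes strictly positive intensities.
    """
    ratio = disease / normal
    return ratio if ratio >= 1 else -1.0 / ratio

def review_ratios(pairs, floor=50.0):
    """Return (fold_change, needs_review) for each (disease, normal) pair.

    floor is a hypothetical background level; a ratio is marked for review
    when either raw intensity falls below it.
    """
    out = []
    for disease, normal in pairs:
        fc = signed_fold_change(disease, normal)
        out.append((fc, min(disease, normal) < floor))
    return out

# Hypothetical raw intensities (disease, normal); not taken from the thread.
pairs = [(1200.0, 800.0), (2.0, 4000.0), (300.0, 900.0)]

results = review_ratios(pairs)
for fc, needs_review in results:
    print(round(fc, 3), "check raw values" if needs_review else "ok")
```

The second pair shows how a near-zero intensity on one side alone produces a fold change in the thousands. Whether that reflects a failed probe or a gene that is genuinely silenced in one condition cannot be decided from the ratio, which is the point several replies make.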