On Mon, May 31, 2010 at 9:41 AM, Mervi Kinnunen <mervi.kinnunen@wri.fi> wrote:
> Hi,
>
>
>
> Thanks for helping me out. However, I couldn't get the script to work.
> Below is the description. How does the t(sapply(...)) script select the
> minimum p-value? I understand that the split creates a list where each
> occurring geneSymbol is present in a separate data frame. How does the
> script then compare the p-values within each frame and merge the data
> back into a single data frame?
>
>
>
> -Mervi
>
>
>
> > dd <- read.table("Myfile", sep='\t', h=T, as.is=T,
> colClasses=c("character","numeric","numeric","numeric"))
>
> > str(dd)
>
> 'data.frame': 6 obs. of 4 variables:
>
> $ geneSymbol: chr "ABC1" "ABC1" "AB" "ABCD1" ...
>
> $ A : num 12 2 4 15 11 9
>
> $ B : num 44 32 55 25 27 18
>
> $ pvalue : num 1e-02 5e-02 2e-01 5e-03 2e-03 1e-04
>
> > bb<- dd
>
> > bbs <- split(bb,bb[,1])
>
> > d<- t(sapply(bbs, function(x)x[which.min(x$originalpvalue),]))
>
>
There is no column in dd called 'originalpvalue', so your variation must
fail. Use 'pvalue' instead.
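
To answer the "how does it select the minimum" question step by step, here is a minimal sketch. It rebuilds the six rows from the original post as a data frame; the names x, idx, best, bad_col and empty are only illustrative. It also shows why the misspelled column name yields the chr(0)/num(0) entries in the str(d) output quoted next.

```r
# Rebuild the example data from the original post (values copied from the thread).
dd <- data.frame(
  geneSymbol = c("ABC1", "ABC1", "AB", "ABCD1", "ABCD1", "ABCD1"),
  A          = c(12, 2, 4, 15, 11, 9),
  B          = c(44, 32, 55, 25, 27, 18),
  pvalue     = c(0.01, 0.05, 0.2, 0.005, 0.002, 0.0001),
  stringsAsFactors = FALSE
)

# split() returns a named list with one data frame per distinct geneSymbol.
bbs <- split(dd, dd$geneSymbol)

# Within one piece, which.min() gives the position of the smallest p-value,
# and x[idx, ] keeps that whole row, so A and B travel with it.
x    <- bbs[["ABCD1"]]
idx  <- which.min(x$pvalue)
best <- x[idx, ]

# A column name that does not exist evaluates to NULL. which.min(NULL) is
# integer(0), so the row selection returns a data frame with zero rows.
bad_col <- x$originalpvalue
empty   <- x[which.min(bad_col), ]
nrow(empty)
```

sapply() simply applies that same row selection to every element of bbs; no comparison across genes happens, only within each piece.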
> > str(d)
>
> List of 12
>
> $ : chr(0)
>
> $ : chr(0)
>
> $ : chr(0)
>
> $ : num(0)
>
> $ : num(0)
>
> $ : num(0)
>
> $ : num(0)
>
> $ : num(0)
>
> $ : num(0)
>
> $ : num(0)
>
> $ : num(0)
>
> $ : num(0)
>
> - attr(*, "dim")= int [1:2] 3 4
>
> - attr(*, "dimnames")=List of 2
>
> ..$ : chr [1:3] "AB" "ABC1" "ABCD1"
>
> ..$ : chr [1:4] "geneSymbol" "A" "B" "pvalue"
>
> > head(d)
>
> geneSymbol A B pvalue
>
> AB Character,0 Numeric,0 Numeric,0 Numeric,0
>
> ABC1 Character,0 Numeric,0 Numeric,0 Numeric,0
>
> ABCD1 Character,0 Numeric,0 Numeric,0 Numeric,0
>
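
The "Character,0" cells above appear because t(sapply(...)) returns a matrix whose cells are list elements, and here each element has length zero. Even with the correct column name the result is still a list matrix rather than a data frame. If an ordinary data frame is wanted, one possible variation (a sketch, not the only way; pieces, best_rows and result are illustrative names) is to bind the selected rows with rbind:

```r
# Same six rows as in the original post.
dd <- data.frame(
  geneSymbol = c("ABC1", "ABC1", "AB", "ABCD1", "ABCD1", "ABCD1"),
  A          = c(12, 2, 4, 15, 11, 9),
  B          = c(44, 32, 55, 25, 27, 18),
  pvalue     = c(0.01, 0.05, 0.2, 0.005, 0.002, 0.0001),
  stringsAsFactors = FALSE
)

# One data frame per gene symbol.
pieces <- split(dd, dd$geneSymbol)

# Keep the single row with the smallest p-value from each piece.
best_rows <- lapply(pieces, function(x) x[which.min(x$pvalue), ])

# Stack the one-row data frames back into a single data frame.
result <- do.call(rbind, best_rows)
rownames(result) <- NULL  # drop the row names inherited from the list

str(result)
```

Because result is a plain data frame, its columns keep their character and numeric types and can be written out with write.table() as usual.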
> From: Vincent Carey [mailto:stvjc@channing.harvard.edu]
> Sent: 29 May 2010 0:25
> To: mervi.alanne@wri.fi
> Cc: bioconductor@stat.math.ethz.ch
> Subject: Re: [BioC] finding and deleting repeated observations
>
>
>
> suppose you save your data as in the email to a file b.txt -- i ignore
> niceties of delimiter choice
>
> there are many ways of doing it, but here is one possibility
>
> > bb = read.table("b.txt", h=TRUE, colClasses=c("character", "numeric", "numeric", "numeric"))
> > bbs = split(bb, bb[,1])
> > t(sapply(bbs, function(x) x[which.min(x$pvalue),]))
> GeneSymbol A B pvalue
> AB "AB" 4 55 0.2
> ABC1 "ABC1" 12 44 0.01
> ABCD1 "ABCD1" 9 18 1e-04
>
> it does what you ask, but the solution you gave below doesn't seem right
> (picked wrong values of A and B for correct ABC1 candidate?)
>
> On Fri, May 28, 2010 at 1:27 PM, mervi.alanne@wri.fi <mervi.alanne@wri.fi> wrote:
>
> Dear all,
>
> I'm a novice with R and could use some help. How could I find repeated
> observations based on one column and select the one to keep based on
> another column?
>
> In more detail, this is the thing I want to achieve:
> -data.frame has 4 columns GeneSymbol, A, B, pvalue
> -data in column GeneSymbol may be repeated 1-6 times
> -data also contains unique observations
> -Of the repeated obs, keep the obs which has the lowest pvalue
> -Do not discard data from cols A and B
>
> Example input data:
> GeneSymbol A B pvalue
> ABC1 12 44 0.01
> ABC1 2 32 0.05
> AB 4 55 0.2
> ABCD1 15 25 0.005
> ABCD1 11 27 0.002
> ABCD1 9 18 0.0001
>
> I'd like the output to look like this:
> GeneSymbol A B pvalue
> ABC1 2 32 0.01
> AB 4 55 0.2
> ABCD1 9 18 0.0001
>
> Any suggestions?
>
> -Mervi
> Wihuri Research Institute
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor@stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor