Question

unique values for repeated geneIDs

Hernando Martínez ▴ 100

@hernando-martinez-4124

Last seen 10.5 years ago

Hello everyone, my name is Hernando, and I am new to R. I have a small problem that maybe you can help me with. I have been looking through the packages with no success, although it shouldn't be very difficult to solve.

I have a text file containing a list of genes, with expression values for each gene across a set of microarray experiments. Example:

    geneID   sample1   sample2   ....
    gene1    45        58        ....
    gene1    43        63        ....
    gene2    32        21        ....
    ......   .....     ......    ....

In this list, some genes are repeated, but with different values (as in the example). These repetitions come from different probes targeting the same gene.

What I want is a new text file in which each gene appears only once, with one of three possibilities for the expression values of repeated genes:

- Each value (for each column, i.e. each sample) is the average of the original values (in the example, gene1 should be 44 in sample1 and 60.5 in sample2).
- Instead of the average, the median.
- The highest values.

I would prefer the median or the average, but I don't know whether getting the highest values is easier.

I have seen the function "findLargest" in the "genefilter" package, but it works with probes, and my files have already been converted to geneIDs.

I hope you can help me or point me to a function or package to start with.

Many thanks

--
Hernando Martínez Vergara

Microarray Microarray • 1.2k views

ADD COMMENT • link updated 14.7 years ago by Sean Davis 21k • written 14.7 years ago by Hernando Martínez ▴ 100

score 0 · Answer 1 · 2010-06-11

Sean Davis 21k

@sean-davis-490

Last seen 7 days ago

United States

On Fri, Jun 11, 2010 at 8:12 AM, Hernando Martínez wrote:

> I have seen this function: "findLargest" of "genefilter" package, but it
> works with probes and I have already converted files (to geneIDs).

Hi, Hernando. Have a look at the aggregate() function.

Sean

ADD COMMENT • link 14.7 years ago Sean Davis 21k
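As a rough illustration of the aggregate() suggestion above, here is a minimal sketch that applies it to the three rows from the question. The data frame name expr and the result names by_mean, by_median and by_max are made up for this example, and the column names sample1 and sample2 are assumed from the question's layout; this is not code posted by anyone in the thread.

```r
# Toy data mirroring the example rows in the question.
# The object and column names here are illustrative assumptions.
expr <- data.frame(
  geneID  = c("gene1", "gene1", "gene2"),
  sample1 = c(45, 43, 32),
  sample2 = c(58, 63, 21)
)

# Collapse repeated gene IDs: one summary value per gene and per sample column.
by_mean   <- aggregate(expr[, -1], by = list(geneID = expr$geneID), FUN = mean)
by_median <- aggregate(expr[, -1], by = list(geneID = expr$geneID), FUN = median)
by_max    <- aggregate(expr[, -1], by = list(geneID = expr$geneID), FUN = max)

print(by_mean)
```

Note that FUN = max picks the highest value in each sample column independently, so the resulting row for a gene may mix values from different probes. If the goal is to keep the single probe with the highest overall signal, a different approach would be needed.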

Thank you very much, Sean. I have been working with the aggregate() function and it is exactly what I need. However, there is still one painful detail that I cannot get rid of, and I hope you can help me with this too.

I have this text file, which I use as a test:

    A    B    C    D
    d1   2    23   2
    d1   4    22   2
    d1   5    24   2
    d2   10   7    2
    d2   20   8    3
    d1   7    23   2
    d3   2    14   30
    d3   4    14   50
    d2   30   8    4
    d4   12   13   15
    d5   1    5    90
    d2   40   7    3
    d6   34   2    5

If I type:

    > data<-read.table("test.txt",sep="\t")
    > agr<-aggregate(data[2:4], by=list(data$V1), FUN=mean)

I get 21 warning messages and all the values are "NA", including the headers B, C, and D. However, if I remove A, B, C, D from the file and type the same commands, it works perfectly fine and I get what I wanted.

The problem is that the real datasets I need to work with are very large, and it is difficult to remove and re-add the headers without the risk of doing something wrong. Is there any command or parameter that I should pass to the function in order to solve this issue?

Thank you so much,

Hernando

ADD REPLY • link 14.7 years ago Hernando Martínez ▴ 100

Hernando;

> > data<-read.table("test.txt",sep="\t")
> > agr<-aggregate(data[2:4], by=list(data$V1), FUN=mean)
>
> I get 21 warning messages and all the values are "NA", including
> header B, C, and D. However, if I remove A,B,C,D from the previous
> file, and type the same commands, it works perfectly fine, getting
> what I wanted.

Add header=TRUE to the read.table command:

    data<-read.table("test.txt",sep="\t", header=TRUE)
    agr<-aggregate(data[2:4], by=list(data$A), FUN=mean)

Try help(read.table) to learn more about the available options.

Brad

ADD REPLY • link 14.7 years ago Brad Chapman ▴ 20
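To make the header fix above easier to try end to end, here is a hedged sketch that recreates the test file from the thread in a temporary location, shows what goes wrong without header=TRUE, and writes the aggregated result back out as a text file. The names test_file, no_header and out_file, and the write.table settings, are assumptions added for illustration; only the read.table and aggregate calls come from the thread.

```r
# Recreate the tab-separated test file from the thread in a temporary location.
test_file <- tempfile(fileext = ".txt")
writeLines(c(
  "A\tB\tC\tD",
  "d1\t2\t23\t2", "d1\t4\t22\t2", "d1\t5\t24\t2",
  "d2\t10\t7\t2", "d2\t20\t8\t3", "d1\t7\t23\t2",
  "d3\t2\t14\t30", "d3\t4\t14\t50", "d2\t30\t8\t4",
  "d4\t12\t13\t15", "d5\t1\t5\t90", "d2\t40\t7\t3",
  "d6\t34\t2\t5"
), test_file)

# Without header = TRUE the first line is read as a data row, so the columns
# V2..V4 contain the strings "B", "C", "D" and are no longer numeric.
no_header <- read.table(test_file, sep = "\t")
print(sapply(no_header, class))

# With header = TRUE the first line supplies the column names A, B, C, D.
data <- read.table(test_file, sep = "\t", header = TRUE)
agr  <- aggregate(data[2:4], by = list(A = data$A), FUN = mean)
print(agr)

# Write the collapsed table to a new tab-separated text file
# (out_file is a hypothetical output path).
out_file <- tempfile(fileext = ".txt")
write.table(agr, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
```

When the header line is read as data, mean() is applied to non-numeric columns and returns NA with a warning each time. That would be consistent with the 21 warnings reported: 7 groups (d1 to d6 plus the header value "A") times 3 columns.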