annotationTools: character vector clean-up
Guido Hooiveld ★ 4.1k
@guido-hooiveld-2020
Wageningen University, Wageningen, the …
Hi,

I have a simple problem that's driving me nuts... Any hints are appreciated!

I am retrieving the human homologues of rat genes. I use the functions 'getHOMOLOG' and 'listToCharacterVector' from the library annotationTools. Everything is going fine, except for one thing: some rows (genes) contain multiple entries (homologues); for such rows I would like to get rid of all entries except the first one.

Example: for row 18634 I currently have:
[18634] "6173 /// 100529097"

I would like to get rid of everything except the first entry, so to get this:
[18634] "6173"

How can I do this for all relevant rows? Basically, I would like to remove everything positioned after the first number, starting with the space and the three forward slashes.

Thanks,
Guido

library(annotationTools)
library(hugene11stv1hsentrezg.db)
library(ragene11stv1rnentrezg.db)

# Download HomoloGene data from:
# ftp://ftp.ncbi.nih.gov/pub/HomoloGene/current/
homologene <- read.delim("homologene.data.121212.data", header=FALSE) # (date of file manually added to name when saving download)
colnames(homologene) <- c("HomologyGroupID", "TaxonID", "EgID", "Symbol", "ProteinGI", "ProteinAcc")

# Read rat probesets that are on the array as Entrez IDs; this returns a list which is converted to a character vector
# Next the probesets that don't have an EntrezID are removed
rat.eg.array <- mget(ls(ragene11stv1rnentrezgENTREZID), ragene11stv1rnentrezgENTREZID)
rat.eg.array <- listToCharacterVector(rat.eg.array)
rat.eg.array <- rat.eg.array[!is.na(rat.eg.array)]

# Convert rat EG IDs into human (9606) homologs; this returns a list which is converted to a character vector
> rat2human <- getHOMOLOG(rat.eg.array,9606,homologene) #this takes some time
Warning messages:
1: In getHOMOLOG(rat.eg.array, 9606, homologene) :
  One or more gene input gene ID/cluster not found in homologue table
2: In getHOMOLOG(rat.eg.array, 9606, homologene) :
  One or more gene ID/cluster with no target provided in homologue table
> rat2human <- listToCharacterVector(rat2human)
> class(rat2human)
[1] "character"
>
> head(rat2human)
[1] "54552" "80212" "11277" "10663" "199692" "399947"
>
> #example of multiple entries
> rat2human[18634]
[1] "6173 /// 100529097"
>

---------------------------------------------------------
Guido Hooiveld, PhD
Nutrition, Metabolism & Genomics Group
Division of Human Nutrition
Wageningen University
Biotechnion, Bomenweg 2
NL-6703 HD Wageningen
the Netherlands
tel: (+)31 317 485788
fax: (+)31 317 483342
email: guido.hooiveld@wur.nl
internet: http://nutrigene.4t.com
http://scholar.google.com/citations?user=qFHaMnoAAAAJ
http://www.researcherid.com/rid/F-4912-2010
@ryan-c-thompson-5618
Icahn School of Medicine at Mount Sinai…
You can try this:

library(stringr)
x <- str_replace(string=x, pattern=" /// .*$", replacement="")
stopifnot(!any(str_detect(x, "///")))

You might want to adjust the pattern to allow arbitrary spacing rather than just single spaces.

On Thu 28 Feb 2013 02:11:21 PM PST, Hooiveld, Guido wrote:
> Some rows (genes) contain multiple entries (homologues); for such row I would like to get rid of all entries except the first one.
> Example: for row 18634 I currently have:
> [18634] "6173 /// 100529097"
>
> I would like to get rid of everything except the first entry, so to get this:
> [18634] "6173"
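The "arbitrary spacing" adjustment mentioned in this answer could be sketched roughly as follows. This is only an illustration, not the poster's code: it uses base R's sub() instead of stringr, the toy vector x is invented (only the "6173 /// 100529097" entry comes from the question), and the pattern "\\s*///.*$" is an assumption about how the separator might vary.

```r
# Toy input: only the second element is taken from the question;
# the others are made up to exercise different spacing and a missing value.
x <- c("54552", "6173 /// 100529097", "123///456", "789  ///  1 /// 2", NA)

# Remove any whitespace before the first "///", the "///" itself,
# and everything after it. Entries without "///" are left unchanged.
first_only <- sub("\\s*///.*$", "", x)

# Sanity check: no separator should remain.
# grepl() returns FALSE for NA, so missing entries do not trip the check.
stopifnot(!any(grepl("///", first_only, fixed = TRUE)))

print(first_only)
```

Because sub() only replaces the first match and the pattern runs to the end of the string, entries with three or more homologues are also reduced to their first ID.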
Oh, and x = rat2human, of course.

On Thu 28 Feb 2013 02:23:31 PM PST, Ryan C. Thompson wrote:
> You can try this:
>
> library(stringr)
> x <- str_replace(string=x, pattern=" /// .*$", replacement="")
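An alternative that avoids regular-expression replacement is to split each entry on the separator and keep the first piece, which also makes it easy to record which entries had more than one homologue before the extras are dropped. The sketch below is hedged accordingly: rat2human_toy is a hypothetical stand-in for the real rat2human vector, and the names pieces, first_hit and had_multiple are introduced here purely for illustration.

```r
# Hypothetical stand-in for rat2human; the real vector comes from
# listToCharacterVector(getHOMOLOG(...)) as shown in the question.
rat2human_toy <- c("54552", "80212", "6173 /// 100529097", NA)

# Split every entry on the literal separator "///".
pieces <- strsplit(rat2human_toy, "///", fixed = TRUE)

# Keep the first piece of each entry and strip surrounding whitespace.
first_hit <- trimws(vapply(pieces, function(p) p[1], character(1)))

# Flag the entries that originally listed more than one homologue.
had_multiple <- lengths(pieces) > 1

print(first_hit)
print(which(had_multiple))
```

Keeping had_multiple around is a design choice rather than a requirement: it lets you check later how many genes were affected by discarding the additional homologues.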