annotationTools: character vector clean-up


Guido Hooiveld ★ 4.1k

@guido-hooiveld-2020

Last seen 23 days ago

Wageningen University, Wageningen, the …

Hi,

I have a simple problem that's driving me nuts... Any hints are appreciated!

I am retrieving the human homologues of rat genes. I use the functions 'getHOMOLOG' and 'listToCharacterVector' from the library annotationTools. Everything is going fine, except for one thing: some rows (genes) contain multiple entries (homologues); for such rows I would like to get rid of all entries except the first one.

Example: for row 18634 I currently have:

    [18634] "6173 /// 100529097"

I would like to get rid of everything except the first entry, so as to get this:

    [18634] "6173"

How can I do this for all relevant rows? Basically, I would like to remove everything after the first number, starting from the space followed by three forward slashes.

Thanks,
Guido

    library(annotationTools)
    library(hugene11stv1hsentrezg.db)
    library(ragene11stv1rnentrezg.db)

    # Download HomoloGene data from:
    # ftp://ftp.ncbi.nih.gov/pub/HomoloGene/current/
    # (date of file manually added to name when saving download)
    homologene <- read.delim("homologene.data.121212.data", header=FALSE)
    colnames(homologene) <- c("HomologyGroupID", "TaxonID", "EgID", "Symbol", "ProteinGI", "ProteinAcc")

    # Read rat probesets that are on the array as Entrez IDs; this returns a list which is converted to a character vector
    # Next the probesets that don't have an EntrezID are removed
    rat.eg.array <- mget(ls(ragene11stv1rnentrezgENTREZID), ragene11stv1rnentrezgENTREZID)
    rat.eg.array <- listToCharacterVector(rat.eg.array)
    rat.eg.array <- rat.eg.array[!is.na(rat.eg.array)]

    # Convert rat EG IDs into human (9606) homologs; this returns a list which is converted to a character vector
    > rat2human <- getHOMOLOG(rat.eg.array,9606,homologene) #this takes some time
    Warning messages:
    1: In getHOMOLOG(rat.eg.array, 9606, homologene) :
      One or more gene input gene ID/cluster not found in homologue table
    2: In getHOMOLOG(rat.eg.array, 9606, homologene) :
      One or more gene ID/cluster with no target provided in homologue table
    > rat2human <- listToCharacterVector(rat2human)
    > class(rat2human)
    [1] "character"
    >
    > head(rat2human)
    [1] "54552" "80212" "11277" "10663" "199692" "399947"
    >
    > #example of multiple entries
    > rat2human[18634]
    [1] "6173 /// 100529097"
    >

---------------------------------------------------------
Guido Hooiveld, PhD
Nutrition, Metabolism & Genomics Group
Division of Human Nutrition
Wageningen University
Biotechnion, Bomenweg 2
NL-6703 HD Wageningen
the Netherlands
tel: (+)31 317 485788
fax: (+)31 317 483342
email: guido.hooiveld@wur.nl
internet: http://nutrigene.4t.com
http://scholar.google.com/citations?user=qFHaMnoAAAAJ
http://www.researcherid.com/rid/F-4912-2010

convert • 853 views

updated 12.2 years ago by Ryan C. Thompson ★ 7.9k • written 12.2 years ago by Guido Hooiveld ★ 4.1k


Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 6 months ago

Icahn School of Medicine at Mount Sinai…

You can try this:

    library(stringr)
    x <- str_replace(string=x, pattern=" /// .*$", replacement="")
    stopifnot(!any(str_detect(x, "///")))

You might want to adjust the pattern to allow arbitrary spacing rather than just single spaces.

On Thu 28 Feb 2013 02:11:21 PM PST, Hooiveld, Guido wrote:
> Some rows (genes) contain multiple entries (homologues); for such row I would like to get rid of all entries except the first one.
> Example: for row 18634 I currently have:
> [18634] "6173 /// 100529097"
>
> I would like to get rid of everything except the first entry, so to get this:
> [18634] "6173"
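The remark about arbitrary spacing can be sketched as follows. This is a minimal base-R variant, not the stringr code from the answer: it swaps in `sub()` and `grepl()`, and the vector `x` is a made-up stand-in for `rat2human` (only the "6173 /// 100529097" entry comes from the question). It also assumes that missing homologues show up as `NA`, which the check has to tolerate.

```r
# Toy stand-in for rat2human; values are illustrative, not real query results
x <- c("54552", "6173 /// 100529097", NA, "80212  ///  11277 /// 10663")

# Remove everything from the first "///" onwards,
# allowing any amount of whitespace (including none) before it
first_only <- sub("[[:space:]]*///.*$", "", x)

# Check that no separator is left; grepl() returns FALSE for NA,
# so missing entries do not make the check fail
stopifnot(!any(grepl("///", first_only, fixed = TRUE)))

print(first_only)
```

Entries without a separator pass through unchanged, and `NA` stays `NA`, so the result keeps the same length and order as the input.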

12.2 years ago • Ryan C. Thompson ★ 7.9k


Oh, and x = rat2human, of course (i.e. assign x <- rat2human before running the code).

On Thu 28 Feb 2013 02:23:31 PM PST, Ryan C. Thompson wrote:
> You can try this:
>
> library(stringr)
> x <- str_replace(string=x, pattern=" /// .*$", replacement="")
> stopifnot(!any(str_detect(x, "///")))
>
> You might want to adjust the pattern to allow arbitrary spacing rather
> than just single spaces.
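Before discarding the extra homologues it may be worth seeing how many entries are affected. The sketch below is one possible way to do that in base R, using `strsplit()` instead of a replacement pattern. The vector is again a small hypothetical stand-in for `rat2human`, and the names `probe_a` etc. are invented labels, not real probeset IDs.

```r
# Hypothetical stand-in for rat2human with invented names
x <- c(probe_a = "54552", probe_b = "6173 /// 100529097", probe_c = NA)

# Flag the entries that contain more than one homologue
has_multiple <- grepl("///", x, fixed = TRUE)
print(sum(has_multiple))    # number of affected entries
print(which(has_multiple))  # their positions (and names)

# Split on the separator and keep only the first piece of each entry
parts <- strsplit(x, "[[:space:]]*///[[:space:]]*")
first_hit <- vapply(parts, function(p) p[1], character(1))

print(first_hit)
```

Compared with a single `sub()` call this is more verbose, but the intermediate `parts` list keeps the discarded homologues available in case they are needed later.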

12.2 years ago • Ryan C. Thompson ★ 7.9k
