extracting character string


Hari Easwaran ▴ 240


United States

Hi all,

I am working with Agilent microarray data and trying to extract only the accession numbers from the probe annotation output. I have a column describing each probe as follows:

    ref|NM_004564|ref|PET112L:2131|mgc|BC130348:2158
    ref|NM_007266|ref|XAB1:2255|mgc|BC007451:2239
    mgc|BC034752:79
    ref|NM_057094|ref|CRYBA2:-2513|ref|NM_005209:-2519|ref|NM_194302:45605|mirna|hsa-mir-375:5790
    ...

I am trying to extract only the RefSeq IDs (in this case NM_004564, NM_007266, NM_057094, NM_005209, NM_194302, ...) and create a new column with the IDs. I have not been able to figure out how to do this. I tried the function 'strsplit', but it doesn't work. I am new to R/Bioconductor and would appreciate any help.

Thanks.
Hari

Microarray Annotation probe • 973 views

updated 15.7 years ago by Mark Robinson ★ 1.1k • written 15.7 years ago by Hari Easwaran ▴ 240


Mark Robinson ★ 1.1k


Hi Hari.

strsplit() will work, it's just sensitive: the split argument is interpreted as a regular expression, so the | has to be escaped. For starters, you might try:

    > x <- c("ref|NM_004564|ref|PET112L:2131|mgc|BC130348:2158",
    +        "ref|NM_007266|ref|XAB1:2255|mgc|BC007451:2239","mgc|BC034752:79")
    >
    > strsplit(x,"\\|")
    [[1]]
    [1] "ref"           "NM_004564"     "ref"           "PET112L:2131"
    [5] "mgc"           "BC130348:2158"

    [[2]]
    [1] "ref"           "NM_007266"     "ref"           "XAB1:2255"
    [5] "mgc"           "BC007451:2239"

    [[3]]
    [1] "mgc"         "BC034752:79"

And, for extracting the first 2 columns, maybe you'll want to migrate towards something like:

    > t(sapply(x, FUN=function(u) strsplit(u, "\\|")[[1]][1:2], USE.NAMES=FALSE))
         [,1]  [,2]
    [1,] "ref" "NM_004564"
    [2,] "ref" "NM_007266"
    [3,] "mgc" "BC034752:79"

Hope that gets you started.

Cheers,
Mark

On 17/06/2009, at 7:54 AM, Hari Easwaran wrote:
> [...]
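To get from the split pieces to the RefSeq IDs the question asks for, one option is to skip the split and match the IDs directly. This is only a minimal sketch, assuming every ID of interest looks like NM_ followed by digits (other RefSeq prefixes such as NR_ or XM_ would need a broader pattern). The names `probe`, `ids`, `refseq` and `annot` are illustrative and do not come from the thread.

```r
# Example annotation strings taken from the question.
probe <- c("ref|NM_004564|ref|PET112L:2131|mgc|BC130348:2158",
           "ref|NM_007266|ref|XAB1:2255|mgc|BC007451:2239",
           "mgc|BC034752:79",
           "ref|NM_057094|ref|CRYBA2:-2513|ref|NM_005209:-2519|ref|NM_194302:45605|mirna|hsa-mir-375:5790")

# Find every NM_<digits> match in each string.
# The result is a list with one character vector per probe.
ids <- regmatches(probe, gregexpr("NM_[0-9]+", probe))

# Collapse to one string per probe so it fits in a single column.
# Probes without any NM_ ID become NA.
refseq <- vapply(ids,
                 function(v) if (length(v)) paste(v, collapse = ";") else NA_character_,
                 character(1))

# Attach the IDs as a new column next to the original annotation.
annot <- data.frame(probe = probe, refseq = refseq, stringsAsFactors = FALSE)
```

With these four strings, the third probe has no RefSeq ID and ends up as NA, while the fourth carries three IDs joined as "NM_057094;NM_005209;NM_194302". Keeping `ids` as a list instead of collapsing it may be more convenient if the IDs are going to be looked up one by one.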

written 15.7 years ago by Mark Robinson ★ 1.1k


Hi Hari, Mark,

Mark Robinson wrote:
> > strsplit(x,"\\|")
> [...]

Note that it's better here to use strsplit() with fixed=TRUE. Then there is no need to escape the |, and in addition strsplit() will be much faster...

Cheers,
H.
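The fixed=TRUE suggestion can be checked against the escaped version with a short comparison. This is a sketch using the same three strings as the earlier reply; `parts_fixed` and `parts_regex` are illustrative names.

```r
x <- c("ref|NM_004564|ref|PET112L:2131|mgc|BC130348:2158",
       "ref|NM_007266|ref|XAB1:2255|mgc|BC007451:2239",
       "mgc|BC034752:79")

# fixed = TRUE treats "|" as a literal character, not as a regex operator,
# so no escaping is needed.
parts_fixed <- strsplit(x, "|", fixed = TRUE)

# The escaped regular-expression version from the earlier reply.
parts_regex <- strsplit(x, "\\|")

# Both calls should give the same pieces for these strings.
same <- identical(parts_fixed, parts_regex)
```

Any speed difference will depend on the size of the input, so it is worth timing both on the real annotation column (for example with system.time()) before relying on it.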

written 15.7 years ago by Hervé Pagès 16k


Hi Mark and Hervé,

Thanks a lot. I will try that. I was using strsplit(x,"|"), without the backslashes.

Thanks again.

Sincerely,
Hari
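For anyone hitting the same problem, the difference between the unescaped and escaped calls is easy to see on a single string. In a regular expression a bare | is the alternation operator with nothing on either side, so it does not match a literal pipe; in practice strsplit() then tends to break the string into single characters. A small sketch, with `s`, `unescaped` and `escaped` as illustrative names:

```r
s <- "mgc|BC034752:79"

# Unescaped: "|" is read as a regex alternation, not a literal pipe.
unescaped <- strsplit(s, "|")[[1]]

# Escaped: the pipe is matched literally and the string splits into two fields.
escaped <- strsplit(s, "\\|")[[1]]

# Comparing the number of pieces makes the problem obvious.
n_pieces <- c(unescaped = length(unescaped), escaped = length(escaped))
```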

written 15.7 years ago by Hari Easwaran ▴ 240
