Gene names
@narendra-kaushik-1390
I have a gene file in this format, with everything in one column (no spaces at all):

SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B

Is there any way to convert it into this format (four columns) other than manually?

SFTPB    NM_000542.1    4506904    surfactant, pulmonary-associated protein B

Any suggestions?

Narendra

Dr. Narendra Kaushik
School of Biosciences,
University of Cardiff,
Museum Avenue,
Cardiff CF10 3US
Tel: 029 20 875 153
John Zhang ★ 2.9k
@john-zhang-6
try:

    unlist(strsplit(yourString, "\\|"))

Jianhua Zhang
Department of Medical Oncology
Dana-Farber Cancer Institute
44 Binney Street
Boston, MA 02115-6084
@james-w-macdonald-5106
Does

    data.frame(matrix(scan("filename", what = "c", sep = "|", quote = ""), ncol = 4, byrow = TRUE))

do what you want? scan() returns one long character vector of fields, so it is reshaped into four columns (one row per input line) before being turned into a data frame.

Best,

Jim

--
James W. MacDonald
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623
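A related route is read.table(), which can split on "|" while reading. The following is only a sketch: the temporary file stands in for the real gene file, and the column names (symbol, accession, id, description) are made up for illustration, since the original post does not name the fields.

```r
# Hypothetical sample file standing in for the real gene file.
path <- tempfile(fileext = ".txt")
record <- "SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B"
writeLines(rep(record, 2), path)

genes <- read.table(
  path,
  sep = "|",                  # fields are pipe-delimited
  quote = "",                 # treat quotes/apostrophes in descriptions as plain text
  comment.char = "",          # do not treat "#" as the start of a comment
  colClasses = "character",   # keep numeric-looking IDs as text
  col.names = c("symbol", "accession", "id", "description")  # assumed names
)

str(genes)
```

Disabling quote and comment handling is a defensive choice here: free-text descriptions may contain characters that read.table() would otherwise interpret.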
@jdelasherasedacuk-1189
Maybe too obvious, but Excel is very good for this sort of thing. Functions like Search give you the position of a particular character (like "|"), and knowing that, you can select the text to the left or right of it. If you do that consecutively you can separate the fields that way. It'll take a minute.

Jose

--
Dr. Jose I. de las Heras
Email: J.delasHeras at ed.ac.uk
The Wellcome Trust Centre for Cell Biology
Phone: +44 (0)131 6513374
Institute for Cell & Molecular Biology
Fax: +44 (0)131 6507360
Swann Building, Mayfield Road
University of Edinburgh
Edinburgh EH9 3JR
UK
Hi Narendra,

R is also very good for this sort of thing. Have a look at the strsplit function.

    x = readLines("yourfile")
    sp = strsplit(x, split="|", fixed=TRUE)

(see the man page of strsplit; fixed=TRUE is needed because "|" is otherwise interpreted as a regular-expression operator) and from this you can construct e.g. a vector with the j-th column through

    sapply(sp, "[", j)

Cheers
Wolfgang

-------------------------------------
Wolfgang Huber
European Bioinformatics Institute
European Molecular Biology Laboratory
Cambridge CB10 1SD
England
Phone: +44 1223 494642
Fax: +44 1223 494486
Http: www.ebi.ac.uk/huber
-------------------------------------
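To illustrate the column extraction, here is a small sketch that assembles all four columns into a data frame. It assumes the split is done on a literal "|" (fixed = TRUE); the helper getColumn and the column names are hypothetical, and the input simply repeats the one record from the question.

```r
record <- "SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B"
x <- rep(record, 2)   # stands in for readLines("yourfile")

# One character vector of fields per input line.
sp <- strsplit(x, split = "|", fixed = TRUE)

# Hypothetical helper: pull the j-th field out of every line.
getColumn <- function(j) sapply(sp, "[", j)

genes <- data.frame(
  symbol      = getColumn(1),
  accession   = getColumn(2),
  id          = getColumn(3),
  description = getColumn(4),
  stringsAsFactors = FALSE
)

genes
```

Lines with fewer than four fields would yield NA in the missing positions rather than an error, so it may be worth checking lengths(sp) first.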
Seth Falcon ★ 7.4k
@seth-falcon-992
On 6 Nov 2005, christopher.wilkinson at adelaide.edu.au wrote:
> If you want to do this in R, the function you want is strsplit,
> telling it to split on the "|" character. However "|" is special in
> character splitting (regular expressions) so we have to protect it
> with backslashes.

For using strsplit in this way, you can also pass the fixed=TRUE option and then you do not need to do any escaping.

+ seth
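A quick sketch comparing the two spellings on the record from the question. The variable names are illustrative only.

```r
geneName <- "SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B"

# Regular-expression split: the pipe has to be escaped.
escaped <- strsplit(geneName, "\\|")[[1]]

# Literal split: no escaping needed.
literal <- strsplit(geneName, "|", fixed = TRUE)[[1]]

identical(escaped, literal)

# For contrast: an unescaped "|" is read as a regex alternation of two
# empty patterns, so the string is not split into the four intended fields.
unescaped <- strsplit(geneName, "|")[[1]]
length(unescaped)
```

The fixed = TRUE form is usually the easier one to read when the delimiter is a single known character.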
@christopher-wilkinson-309
If you want to do this in R, the function you want is strsplit, telling it to split on the "|" character. However, "|" is special in character splitting (regular expressions), so we have to protect it with backslashes. As a word of advice, look up regular expressions: they are extremely powerful for manipulating strings (?regexp).

    > geneName <- "SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B"
    > strsplit(geneName,"\\|")
    [[1]]
    [1] "SFTPB"       "NM_000542.1"
    [3] "4506904"     "surfactant, pulmonary-associated protein B"

Note that it returns a list, where you probably want a vector or array, so something like

    t(as.matrix(strsplit(geneName,"\\|")[[1]]))

or

    unlist(strsplit(geneName,"\\|"))

will give

    "SFTPB" "NM_000542.1" "4506904" "surfactant, pulmonary-associated protein B"

Now let's assume you have a vector of gene names to be split; you can use the sapply function.

    > geneNames <- rep(geneName,3)
    > geneNamesAsMatrix <- t(sapply(geneNames,function(x){unlist(strsplit(x,"\\|"))}))
    > rownames(geneNamesAsMatrix) <- NULL ## otherwise whole str is the row name
    > geneNamesAsMatrix
         [,1]    [,2]          [,3]      [,4]
    [1,] "SFTPB" "NM_000542.1" "4506904" "surfactant, pulmonary-associated protein B"
    [2,] "SFTPB" "NM_000542.1" "4506904" "surfactant, pulmonary-associated protein B"
    [3,] "SFTPB" "NM_000542.1" "4506904" "surfactant, pulmonary-associated protein B"

Of course you could do this on the command line with perl using something like

    perl -ne 'my @F=split /\|/,$_;print join("\t",@F)' infile > outfile

Cheers
Chris

--
Dr Chris Wilkinson
Senior Research Officer | ARC Research Associate
Child Health Research Institute (CHRI) | Microarray Analysis Group
7th floor, Clarence Rieger Building | Room 121
Women's and Children's Hospital | School of Mathematical Sciences
72 King William Rd, | The University of Adelaide, 5005
North Adelaide, 5006 | CRICOS Provider Number 00123M
Math's Office (Room 121) Ph: 8303 3714
CHRI Office (CR2 52A) Ph: 8161 6363
Christopher.Wilkinson at adelaide.edu.au
http://mag.maths.adelaide.edu.au/crwilkinson.html
Organising Committee Member, 5th Australian Microarray Conference
29th Sept to 1st Oct 2005, Novatel Barossa Valley Resort
http://www.sapmea.asn.au/conventions/microarray/index.html
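For completeness, a rough R counterpart of the file-to-file conversion, writing a tab-delimited file. This is a sketch under assumptions: infile and outfile are temporary stand-ins for real paths, the input repeats the single record from the question, and every line is assumed to have the same number of fields (otherwise rbind would recycle values).

```r
geneName <- "SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B"

# Temporary stand-ins for the real input and output files.
infile  <- tempfile(fileext = ".txt")
outfile <- tempfile(fileext = ".tsv")
writeLines(rep(geneName, 3), infile)

# Split every line on a literal "|" and stack the pieces row by row.
fields     <- strsplit(readLines(infile), "|", fixed = TRUE)
geneMatrix <- do.call(rbind, fields)

# Write tab-separated text without quotes or row/column labels.
write.table(geneMatrix, outfile, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

readLines(outfile)[1]
```

The resulting file opens as four columns in a spreadsheet or with read.delim(outfile, header = FALSE).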