Importing a .txt file with multiple headers into R
a.afshinfard ▴ 10
@aafshinfard-7617
Last seen 9.5 years ago
Iran, Islamic Republic Of

Hi everyone,

I have a massive report from MUMmer covering multiple sequences. The first lines of the file, to show the common format:

> 1  Len = 354
  203757         1         1        35
  122132         1         1        87
  203756         1         1       354
  1              1         1       354
  42364         12         1        89
  203757        37        37        91
> 1 Reverse  Len = 354
> 2  Len = 127
  203754         1         1       127
  2              1         1       127
  122133         1        19        80
  203753         1        19       109

A bigger example: http://m.uploadedit.com/ba3c/1429271308686.txt

All I want to do is import this report into R. The problem is that the report has multiple headers, as you can see, so I can't use read.table(), which only supports files with a single header.

I should mention that the headers are informative (specifically the first number in each header), and I don't want to read the whole file as a string and write a parser to extract the data.

Some of the tables are empty (like "1 Reverse" here), but a "Reverse" table may also contain records.

Is there a common solution?

Thanks

input files read.table maxmatch Mummer • 7.2k views
@martin-morgan-1513
Last seen 5 months ago
United States

I'm not sure whether this is disqualified by your desire not to read the whole file as a string and write a parser. I read the data in

lns = readLines("http://m.uploadedit.com/ba3c/1429271308686.txt")

Then found all the 'header' lines

idx = grepl(">", lns)

I removed the header lines and input the remainder into a data.frame

df = read.table(text=lns[!idx])

Then added a column to the data frame telling me the header line that the row came from. To do this I had to figure out how many times the header line needed to be replicated

wd = diff(c(which(idx), length(idx) + 1)) - 1
df$label = rep(lns[idx], wd)
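
The same steps can be tried on a few lines quoted from the question, without downloading anything. This is only a minimal sketch: the inline `lns` vector stands in for the `readLines()` result, blank lines are dropped up front as a precaution, and `cumsum()` is swapped in for the `diff()`/`rep()` bookkeeping as one possible alternative, since it maps every data line to the most recent header without counting rows per table.

```r
# Sample lines copied from the question; stands in for readLines(<path or URL>)
lns <- c(
  "> 1  Len = 354",
  "  203757         1         1        35",
  "  122132         1         1        87",
  "> 1 Reverse  Len = 354",
  "> 2  Len = 127",
  "  203754         1         1       127",
  "  2              1         1       127"
)

# Drop blank lines first so line positions and table rows stay in step
lns <- lns[nzchar(trimws(lns))]

# Header lines start with ">"
idx <- grepl("^>", lns)

# Running count of headers seen so far, kept for the data lines only;
# empty tables (like "1 Reverse") simply contribute no rows
grp <- cumsum(idx)[!idx]

df <- read.table(text = lns[!idx])
df$label <- lns[idx][grp]
print(df)
```

Each row of `df` then carries the full header line it came from in `label`, which can be split further if only the leading number is needed.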

I'm not sure what a massive file looks like, but the above is probably good enough for anything that'll be convenient to manipulate in some down-stream way. Hope that helps!


Thanks a lot!

Malcolm Cook ★ 1.6k
@malcolm-cook-6293
Last seen 4 months ago
United States

This is a perfect time to dust off your Perl one-liners to pre-process the input so it can be read in a tidy fashion with read.csv.

The R pipe function is helpful here.

Assuming you have curl installed to process your example data:

# header=FALSE: the piped text has no header row, so without it read.csv
# would consume the first record as column names
l<-read.csv(pipe("curl -s  http://m.uploadedit.com/ba3c/1429271308686.txt  |  perl -lane 'BEGIN{$,=qq{,}}; unless(m/^> (?<id>\\d+) (?<strand>.)/) {%v=%+; $v{strand} =~ y/R /-+/; print($v{id},$v{strand}, @F)}'")
            ,header=FALSE
            ,col.names=c('QueryID','Strand','i','j','k','l'))

l

This works quite nicely: it recodes "Reverse" into a standard(ish) +/- strand and produces output such as:

    QueryID Strand      i    j    k    l
1         1      + 203757    1    1   35
2         1      + 122132    1    1   87
3         1      + 203756    1    1  354
4         1      +      1    1    1  354
5         1      +  42364   12    1   89
6         1      + 203757   37   37   91
7         1      + 122132   90   90   38
8         1      +  42364  102   91   37
9         1      + 203757  129  129  168
10        1      +  42364  140  129  212
11        1      + 122132  129  129  212
12        1      + 203757  298  298   43
13        2      + 203754    1    1  127
14        2      +      2    1    1  127
15        2      + 122133    1   19   80
16        2      +      3    1   19  109
17        2      + 203758    1   19  109
18        2      + 203753    1   19  109
19        2      +  42363    1   19   30
20        2      +  42363   32   50   78
21        2      + 203755    1   52   52
22        2      +      4    1   52   52
23        2      + 122133   82  100   28
24        3      + 122133    1    1   80
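
If Perl is not available, the header recoding can also be sketched in base R. This assumes header lines look exactly like the ones in the question (`> <id>  Len = <n>` or `> <id> Reverse  Len = <n>`); `parse_header` is a hypothetical helper name chosen for illustration, not part of any package.

```r
# Hypothetical helper: split MUMmer-style header lines into id, strand, length.
# Assumes the "> <id> [Reverse]  Len = <n>" layout shown in the question.
parse_header <- function(h) {
  pat <- "^> *([^ ]+).* Len *= *([0-9]+) *$"
  data.frame(
    QueryID = sub(pat, "\\1", h),                       # first token after ">"
    Strand  = ifelse(grepl(" Reverse ", h, fixed = TRUE), "-", "+"),
    Len     = as.integer(sub(pat, "\\2", h)),           # number after "Len ="
    stringsAsFactors = FALSE
  )
}

hdr <- c("> 1  Len = 354", "> 1 Reverse  Len = 354", "> 2  Len = 127")
res <- parse_header(hdr)
print(res)
```

QueryID is kept as character here on the assumption that sequence names in other reports may not be numeric; wrap it in `as.integer()` if they always are.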


And what about reading from a file instead of reading the data from the internet?

For example, the data is in a file named Seqs.txt in the current working directory.

Thanks


Instead of calling 'curl' on the URL, call 'cat' on the file path, in your case "./Seqs.txt", like this:

# header=FALSE: the piped text has no header row, so without it read.csv
# would consume the first record as column names
l<-read.csv(pipe("cat ./Seqs.txt  |  perl -lane 'BEGIN{$,=qq{,}}; unless(m/^> (?<id>\\d+) (?<strand>.)/) {%v=%+; $v{strand} =~ y/R /-+/; print($v{id},$v{strand}, @F)}'")
            ,header=FALSE
            ,col.names=c('QueryID','Strand','i','j','k','l'))


All right! Thanks a lot!
