Importing a .txt file with multiple headers into R
a.afshinfard ▴ 10
Last seen 9.1 years ago
Iran, Islamic Republic Of

Hi everyone

I have a massive report from MUMmer covering multiple sequences. Here are the first lines of the file, to show the common format:

> 1  Len = 354
  203757         1         1        35
  122132         1         1        87
  203756         1         1       354
  1              1         1       354
  42364         12         1        89
  203757        37        37        91
> 1 Reverse  Len = 354
> 2  Len = 127
  203754         1         1       127
  2              1         1       127
  122133         1        19        80
  203753         1        19       109

a bigger example :

All I want to do is import this report into R, but the problem is that the report has multiple headers, as you can see, so I can't use read.table(), which only supports files with a single header.

I should mention that the headers are informative (specifically the first number in each header), and I don't want to read the whole file as a string and write a parser to extract the data.

Some of the tables are empty (like "1 Reverse" here), but a "Reverse" table may also contain records.

Is there a common solution?


input files read.table maxmatch Mummer • 6.9k views
Last seen 3 days ago
United States

I'm not sure whether this is ruled out by your wish not to read the whole file as a string and write a parser. I read the data in

lns = readLines("")

Then found all the 'header' lines

idx = grepl(">", lns)

I removed the header lines and input the remainder into a data.frame

df = read.table(text=lns[!idx])

Then added a column to the data frame telling me the header line that the row came from. To do this I had to figure out how many times the header line needed to be replicated

wd = diff(c(which(idx), length(idx) + 1)) - 1
df$label = rep(lns[idx], wd)
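To make the replication step concrete, here is a small sketch that runs the same calculation on the sample lines from the question. The vector `sample_lns` is just that excerpt typed in by hand, and anchoring the pattern as `^>` is an assumption on my part so that only lines starting with `>` count as headers.

```r
# Sample lines copied from the question (an excerpt, not the full report)
sample_lns <- c(
  "> 1  Len = 354",
  "  203757         1         1        35",
  "  122132         1         1        87",
  "  203756         1         1       354",
  "  1              1         1       354",
  "  42364         12         1        89",
  "  203757        37        37        91",
  "> 1 Reverse  Len = 354",
  "> 2  Len = 127",
  "  203754         1         1       127",
  "  2              1         1       127",
  "  122133         1        19        80",
  "  203753         1        19       109"
)

# TRUE for header lines; "^>" only matches a ">" at the start of a line
idx <- grepl("^>", sample_lns)

# Position of each header, plus one sentinel position past the last line
bounds <- c(which(idx), length(idx) + 1)

# Distance between consecutive headers, minus the header itself,
# gives the number of data rows that belong to each header
wd <- diff(bounds) - 1
print(wd)  # 6 0 4: the empty "1 Reverse" table contributes zero rows

# Same steps as above: parse the data rows, then repeat each header wd times
df <- read.table(text = sample_lns[!idx])
df$label <- rep(sample_lns[idx], wd)
```

Because `rep()` accepts a count of zero, empty tables simply drop out of the label column without any special handling.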

I'm not sure what a massive file looks like, but the above is probably good enough for anything that'll be convenient to manipulate in some down-stream way. Hope that helps!
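Since the question says the first number in each header is the informative part, the label column can be split further. The following is only a sketch: `parse_label` is a hypothetical helper name, the output column names are my own choice, and the patterns assume headers look exactly like the ones in the sample (`> <id>  Len = <n>` or `> <id> Reverse  Len = <n>`).

```r
# Hypothetical helper: split header lines into id, strand and length.
# Assumes the header layout shown in the question's sample.
parse_label <- function(label) {
  data.frame(
    # First run of digits after the leading ">"
    QueryID = as.integer(sub("^> *([0-9]+).*$", "\\1", label)),
    # "-" when the header contains the word Reverse, "+" otherwise
    Strand = ifelse(grepl("Reverse", label, fixed = TRUE), "-", "+"),
    # Digits following "Len ="
    Len = as.integer(sub("^.*Len *= *([0-9]+) *$", "\\1", label)),
    stringsAsFactors = FALSE
  )
}

labels <- c("> 1  Len = 354", "> 1 Reverse  Len = 354", "> 2  Len = 127")
parsed <- parse_label(labels)
print(parsed)
```

Applied to `df$label` from the steps above, the result could be attached with `cbind(df, parse_label(df$label))`.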



Thanks a lot!

Malcolm Cook ★ 1.6k
Last seen 9 weeks ago
United States

This is a perfect time to dust off your Perl one-liners to pre-process the input so it can be read in a tidy fashion with read.csv.

The R pipe function is helpful here.

Assuming you have curl installed to process your example data:

l<-read.csv(pipe("curl -s  |  perl -lane 'BEGIN{$,=qq{,}}; unless(m/^> (?<id>\\d+) (?<strand>.)/) {%v=%+; $v{strand} =~ y/R /-+/; print($v{id},$v{strand}, @F)}'"))


This works quite nicely: it recodes Reverse into a standard(ish) +/- strand and produces output like:

    QueryID Strand      i    j    k    l
1         1      + 122132    1    1   87
2         1      + 203756    1    1  354
3         1      +      1    1    1  354
4         1      +  42364   12    1   89
5         1      + 203757   37   37   91
6         1      + 122132   90   90   38
7         1      +  42364  102   91   37
8         1      + 203757  129  129  168
9         1      +  42364  140  129  212
10        1      + 122132  129  129  212
11        1      + 203757  298  298   43
12        2      + 203754    1    1  127
13        2      +      2    1    1  127
14        2      + 122133    1   19   80
15        2      +      3    1   19  109
16        2      + 203758    1   19  109
17        2      + 203753    1   19  109
18        2      +  42363    1   19   30
19        2      +  42363   32   50   78
20        2      + 203755    1   52   52
21        2      +      4    1   52   52
22        2      + 122133   82  100   28
23        3      + 122133    1    1   80


And what about reading from a file instead of reading data from the internet?

For example, the data is in a file named Seqs.txt in the current working directory.



Instead of calling 'curl' on the URL, call 'cat' on the file path, in your case "./Seqs.txt", like this:

l<-read.csv(pipe("cat ./Seqs.txt  |  perl -lane 'BEGIN{$,=qq{,}}; unless(m/^> (?<id>\\d+) (?<strand>.)/) {%v=%+; $v{strand} =~ y/R /-+/; print($v{id},$v{strand}, @F)}'"))
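If Perl or `cat` is not available (on Windows, say), roughly the same table can be built in R alone. This is a sketch of a different technique, not a translation of the one-liner: it uses `cumsum()` over the header flags to assign each data row to the most recent header. The function name `read_mummer` is hypothetical, the column names `i`, `j`, `k`, `l` are borrowed from the output shown above, and it assumes the file starts with a header line and contains at least one data row.

```r
# Hypothetical helper: read a multi-header report from a local file.
# Assumes the first non-blank line is a header and at least one data row exists.
read_mummer <- function(path) {
  lns <- readLines(path)
  lns <- lns[nzchar(trimws(lns))]          # drop blank lines

  is_header <- grepl("^>", lns)
  headers <- lns[is_header]

  # cumsum() counts headers seen so far, so every data row gets the
  # index of the header it sits under; empty tables are skipped naturally
  group <- cumsum(is_header)[!is_header]
  owner <- headers[group]

  body <- read.table(text = lns[!is_header],
                     col.names = c("i", "j", "k", "l"))

  data.frame(
    QueryID = as.integer(sub("^> *([0-9]+).*$", "\\1", owner)),
    Strand = ifelse(grepl("Reverse", owner, fixed = TRUE), "-", "+"),
    body,
    stringsAsFactors = FALSE
  )
}

# Usage on a temporary file holding the sample lines from the question
path <- tempfile(fileext = ".txt")
writeLines(c(
  "> 1  Len = 354",
  "  203757         1         1        35",
  "  122132         1         1        87",
  "  203756         1         1       354",
  "  1              1         1       354",
  "  42364         12         1        89",
  "  203757        37        37        91",
  "> 1 Reverse  Len = 354",
  "> 2  Len = 127",
  "  203754         1         1       127",
  "  2              1         1       127",
  "  122133         1        19        80",
  "  203753         1        19       109"
), path)

res <- read_mummer(path)
print(head(res))
```

For the file in the question, the call would be `read_mummer("./Seqs.txt")`.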


All right! Thanks a lot!

