Hi, I want to use the riboSeqR package on the output of a TopHat mapping but I have problems getting the right format for this file.
My file looks like this:
@HD VN:1.0 SO:coordinate @SQ SN:chr1 LN:249250621 @SQ SN:chr10 LN:135534747 @SQ SN:chr11 LN:135006516 @SQ SN:chr12 LN:133851895 @SQ SN:chr13 LN:115169878 @SQ SN:chr14 LN:107349540 @SQ SN:chr15 LN:102531392 @SQ SN:chr16 LN:90354753 @SQ SN:chr17 LN:81195210 @SQ SN:chr18 LN:78077248 @SQ SN:chr19 LN:59128983 @SQ SN:chr2 LN:243199373 @SQ SN:chr20 LN:63025520 @SQ SN:chr21 LN:48129895 @SQ SN:chr22 LN:51304566 @SQ SN:chr3 LN:198022430 @SQ SN:chr4 LN:191154276 @SQ SN:chr5 LN:180915260 @SQ SN:chr6 LN:171115067 @SQ SN:chr7 LN:159138663 @SQ SN:chr8 LN:146364022 @SQ SN:chr9 LN:141213431 @SQ SN:chrM LN:16569 @SQ SN:chrX LN:155270560 @SQ SN:chrY LN:59373566 @PG ID:TopHat VN:2.0.7 CL:/opt/tophat-2.0.7.Linux_x86_64/tophat --seed 42 -n 2 -m 1 --no-novel-juncs --no-novel-indels --no-coverage-search --segment-length 25 --transcriptome-index /srv/data/indexes/gencode.v19.annotation.basic_only -G /srv/data/indexes/gencode.v19.annotation.basic_only.gtf -o /srv/data2/deepseq/codonusage_TGFb/KR20140905/data/tophat_out//2935_6_R98S__A_CCGTCCC_L001_R1_001 -p 2 /srv/data/indexes/hg19.Ensembl_coordinates /srv/data2/deepseq/codonusage_TGFb/KR20140905/data/clean_fastq/2935_6_R98S__A_CCGTCCC_L001_R1_001.cleaned.fastq.gz 16 chr1 13461 GAAGTAGGCCTCATCCTGACAGGCAGCTGCACCACTGCCTGGCGCTGTGCC 0 chr1 14888 CGGAGACTTAAATACAGGAGGAAAAAGGCAGGACAGAATTACGAGGTGCTG 0 chr1 14888 CGGAGACTTAAATACAGGAGGAAAAAGGCAGGACAGAATTACGAGGTGCTG 0 chr1 14888 CGGAGACTTAAATACAGGAGGAAAAAGGCAGGACAGAATTACGAGGTGCTG 0 chr1 14919 GACAGAATTACGAGGTGCTGGCCCAGGGCGGGCAGCGGCCCTTCCTCCTAC 0 chr1 16438 GCTCTACAGTTTGAAAACCACTATTTTATGAACCAAGTAGAACAAGATATT 0 chr1 17487 CCACTCGTCACCCCCTGGCTCCTGGCCTATGTGCTGTACCTGTGTCTGATG
If I don't eleiminate the first 27 lines with the headers the file can't be read with the readRibodata function (Reading ribosomal files...Error in `[.data.frame`(read.delim(file, as.is = TRUE, header = header),: undefined columns selected).
If I eliminate these first 27 header lines the file is successfully read but then downstream analyses fail.
I guess this is because I don't have transcripts names in the second column but the chromosomes. How can I obtain them from my Tophat mapping file? Or how can I generate it running TopHat again (the Bowtie option –suppress 1,6,7,8 recommended in the manual doesn't seem to exist umnder tophat).
Thanks a lot for your help!
Laura
P.S. Note that the fCs object ( fCs <- frameCounting(riboDat, fastaCDS)) look like this:
> fCs GRanges object with 1864591 ranges and 1 metadata column: seqnames <Rle> [1] ENST00000335137.3|ENSG00000186092.4|OTTHUMG00000001094.1|OTTHUMT00000003223.1|OR4F5-001|OR4F5|918|CDS:1-918| [2] ENST00000423372.3|ENSG00000237683.5|-|-|AL627309.1-201|AL627309.1|2661|UTR5:1-70|CDS:71-850|UTR3:851-2661| [3] ENST00000423372.3|ENSG00000237683.5|-|-|AL627309.1-201|AL627309.1|2661|UTR5:1-70|CDS:71-850|UTR3:851-2661| [4] ENST00000423372.3|ENSG00000237683.5|-|-|AL627309.1-201|AL627309.1|2661|UTR5:1-70|CDS:71-850|UTR3:851-2661| [5] ENST00000423372.3|ENSG00000237683.5|-|-|AL627309.1-201|AL627309.1|2661|UTR5:1-70|CDS:71-850|UTR3:851-2661| ... ... [1864587] ENST00000361567.2|ENSG00000198786.2|-|-|MT-ND5-201|MT-ND5|1812|CDS:1-1812| [1864588] ENST00000361681.2|ENSG00000198695.2|-|-|MT-ND6-201|MT-ND6|525|CDS:1-525| [1864589] ENST00000361789.2|ENSG00000198727.2|-|-|MT-CYB-201|MT-CYB|1141|CDS:1-1141| [1864590] ENST00000361789.2|ENSG00000198727.2|-|-|MT-CYB-201|MT-CYB|1141|CDS:1-1141| [1864591] ENST00000361789.2|ENSG00000198727.2|-|-|MT-CYB-201|MT-CYB|1141|CDS:1-1141| ranges strand | frame <IRanges> <Rle> | <numeric> [1] [ 1, 918] * | 0 [2] [166, 252] * | 0 [3] [298, 594] * | 0 [4] [604, 609] * | 0 [5] [754, 837] * | 0 ... ... ... ... ... [1864587] [1551, 1592] * | 2 [1864588] [ 243, 260] * | 2 [1864589] [ 87, 650] * | 2 [1864590] [ 813, 857] * | 2 [1864591] [1134, 1141] * | 2 ------- seqinfo: 95309 sequences from an unspecified genome; no seqlengths Slot "replicates" [1] WT M Levels: M WT Hits array; dimension 1864591 2 3 6 Unique hits array; dimension 1864591 2 3 6 fCs
And the fs object (fS <- readingFrame(rC = fCs)) looks like this:
> fS 26 27 28 29 30 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 frame.ML 0 0 0 0 0
Why don't you just align using bowtie, rather than tophat? Tophat is intended to find alternate splicing, which IIRC doesn't occur with rRNAs, so I don't see the rationale for using tophat anyway.