Question

Problem Importing STAR fusion data into chimera

0

Entering edit mode

am26 • 0

@am26-6994

Last seen 10.5 years ago

United Kingdom

Hi there,

I'm comparing software for the detection of fusion genes; star, tophat-fusion and defuse.

I've read about the Bioconductor chimera package which would be a convenient way of evaluating the software on an equal footing as the outputs of each tool can be imported into and analysed by the chimera module.

I've been able to import defuse data with no issues but, so far, I've not had any luck with STAR (Chimeric.out.junction) (or Tophat). First it says there aren't enough supporting reads (even though it's the same sample that's been analysed as defuse) and when I lower the value to min.support=2 and run the import command it produced the following error message after about 20-30 mins:

tst<-importFusionData("star","Chimeric.out.no.MT",parallel=TRUE, org="hs", min.support=2)
Error: 17688 errors; first error:
  Error in .getOneSeqFromBSgenomeMultipleSequences(x, name, start, NA, width, : sequence 1 not found

For more information, use bplasterror(). To resume calculation, re-call
  the function and set the argument 'BPRESUME' to TRUE or wrap the
  previous call in bpresume().

First traceback:
  36: importFusionData("star", "Chimeric.out.no.MT", parallel = TRUE,
          org = "hs", min.support = 2)
  35: .starImport(filename, ...)
  34: bplapply(tmp.loc.counts, function(x, z, j, k) .starFset(x, z,
          j, k), z = grHs, j = chr.sym, k = org, BPPARAM = p)
  33: bplapply(tmp.loc.counts, function(x, z, j, k) .starFset(x, z,
          j, k), z = grHs, j = chr.sym, k = org, BPPARAM = p)
  32: mclapply(X = X, FUN = FUN, ..., mc.set.seed = BPPARAM$setSeed,
          mc.silent = !BPPARAM$verbose, mc.cores = bpworkers(BPPARAM),
          mc.cleanup = if (BPPARAM$cleanup) BPPARAM$cleanupSignal else FALSE)
  31: lapply(seq_len(cores), inner.do)
  30: lappl

The command I used was:

tst<-importFusionData("star","Chimeric.out.no.MT",parallel=TRUE, org="hs", min.support=2)

The initial STAR command used to generate the Chimeric.out.junction file was:

STAR --genomeDir ${REF_DIR}/star-index --readFilesIn /13617_3.no.multihit.filt.R2.fastq 13617_3.no.multihit.filt.R1.fastq  --runThreadN 6 --chimSegmentMin 15 --chimJunctionOverhangMin 15

As I'm testing chimera for the first time, I'm assuming I'm diong something basic wrong but was wondering if you could provide any advice?

The Chimeric.output.junction file is very large, would you recommend some initial filtering or processing before attempting to import it?

Thanks

Angela

fusion genes chimera star • 3.3k views

ADD COMMENT • link updated 10.1 years ago by richa_r • 0 • written 10.5 years ago by am26 • 0

score 0 · Answer 1 · 2014-11-07

Hi Angela,

It seems that some of the detected sequences involved in the fusion cannot be extracted from the human genome present in bioconductor.

Which genome version did you used for STAR mapping? It is important that chimera version of genome and the one used for the mapping are the same. In chimera at the present time hg19 is used.

The size of STAR output is bery big and I added the min support spanning reads to reduce the input size. Please try with min.support=10 to start to define the area of the error . ThenI will be pleased if you could share your data to check the chimera input problem.

Cheers

Raf

score 0 · Answer 2 · 2014-11-10

0

Entering edit mode

am26 • 0

@am26-6994

Last seen 10.5 years ago

United Kingdom

Thanks very much for the response.

The genome.fa file I used to generate the STAR index was GRCh37 downloaded from NCBI which should be the same as hg19 shouldn't it? Currently I'm tied to this file as it's used by the group I am working in.

When I try to import the STAR Chimeric.out.junction file using default parameters, I get a message saying something like "not enough supporting reads". When I run an awk statement on the file though and count up the number of times a fusion is seen, many of them (>100) are represented by more than 10 rows in the file. Some of them have over 500 records referring to the same fusion.

I'm going to try running STAR on a difference sample and will test the output of that to see if I'm getting the same issue but I think I probably will if it's a reference sequence problem.

ADD COMMENT • link 10.4 years ago am26 • 0

0

Entering edit mode

There are 2 differences between GRCh37 and hg19:

They don't use the same chromosome naming convention: 1, 2, 3, ... for the former, chr1, chr2, chr3, ... for the latter. Given the error message you get (sequence 1 not found), this is likely what your problem is.
Another difference is that GRCh37 doesn't contain the mitochondrial sequence (MT). It was only added later in GRCh37.p1. However, the MT sequence in GRCh37.p1 is not the same as the chrM sequence in hg19. See http://genome.ucsc.edu/cgi-bin/hgGateway?db=hg19 for the details.

Cheers,

H.

ADD REPLY • link 10.3 years ago Hervé Pagès 16k

0

Entering edit mode

One more thing: it is my understanding that importFusionData() uses getSeq() behind the scene to get the sequences from the hg19 genome. However, getSeq() should have warned you that something is suspicious:

library(BSgenome.Hsapiens.UCSC.hg19)
genome <- BSgenome.Hsapiens.UCSC.hg19
getSeq(genome, GRanges("1", IRanges(11, 50)))
# Error in .getOneSeqFromBSgenomeMultipleSequences(x, name, start, NA, width,  : 
#  sequence 1 not found
# In addition: Warning message:
# In .Seqinfo.mergexy(x, y) :
#   The 2 combined objects have no sequence levels in common. (Use
#   suppressWarnings() to suppress this warning.)

But you didn't get the warning so I'm not sure what's going on. Maybe you removed it from your post? Or your version of the BSgenome software package is old? (latest release version is 1.34.0). Make sure you're using BioC 3.0 (requires R 3.1.z) and that all your packages are up to date (call biocLite() with no args to check).

H.

ADD REPLY • link 10.3 years ago Hervé Pagès 16k

score 0 · Answer 3 · 2015-03-07

Hi

This perhaps is not directly but kind of related to the issue been discussed here.

So, I have used STAR for mapping using hg19 genome reference and Gencode v19 for annotation with the aim of detecting chimeric junctions. I tried importing the STAR output into Chimera package for further filtering. But I realised that the annotation used in Chimera is of UCSC genes. Is there a way to change the annotation used in Chimera to Gencode?