Question

SomaticSignatures mutationContext (char ',') not in lookup table

2

Entering edit mode

tesa.severson ▴ 10

@tesaseverson-8731

Last seen 9.7 years ago

Netherlands

Hello all,

I am trying to use SomaticSignatures to analyze my exome sequencing data. I have a starting table of variants identified (data.frame), rather than metadata and a VRanges object like in the tutorial online (http://www.bioconductor.org/packages/release/bioc/vignettes/SomaticSignatures/inst/doc/SomaticSignatures-vignette.html).

I converted the table using GRanges

file <- read.csv("_all_variants_SomaticSignatures.csv", head=TRUE, stringsAsFactors=TRUE)
file2 <- makeGRangesFromDataFrame(file, keep.extra.columns=TRUE)

And followed the tutorial to make a VRanges object

file3 = unname(subset(file2, Variant_Type %in% "SNV"))
file4 = keepSeqlevels(file3, hsAutosomes())
ex_vr = VRanges(
    seqnames = seqnames(file4),
    ranges = ranges(file4),
    ref = file4$Reference_Allele,
    alt = file4$Tumor_Seq_Allele2,
    sampleNames = file4$Patient_ID,
    seqinfo = seqinfo(file4),
    study = file4$study)


ex_vr

VRanges object with 68095 ranges and 1 metadata column:
          seqnames               ranges strand         ref              alt
             <Rle>            <IRanges> <Rle> <character> <characterOrRle>
      [1]     chr1     [762353, 762353]      +           G                C
      [2]     chr1     [876516, 876516]      +           C                A
      [3]     chr1     [878314, 878314]      +           G                C
      [4]     chr1     [878314, 878314]      +           G                C
      [5]     chr1     [878314, 878314]      +           G                C
      ...      ...                  ...    ...         ...              ...
[68091]    chr22 [51171667, 51171667]      +           G                A
[68092]    chr22 [51171667, 51171667]      +           G                A
[68093]    chr22 [51176616, 51176616]      +           G                A
[68094]    chr22 [51178065, 51178065]      +           G                A
[68095]    chr22 [51219006, 51219006]      +           G                A
              totalDepth       refDepth       altDepth   sampleNames
          <integerOrRle> <integerOrRle> <integerOrRle> <factorOrRle>
      [1]           <NA>           <NA>           <NA>        RTN102
      [2]           <NA>           <NA>           <NA>        RTN126
      [3]           <NA>           <NA>           <NA>        RTN122
      [4]           <NA>           <NA>           <NA>        RTN127
      [5]           <NA>           <NA>           <NA>        RTN045
      ...            ...            ...            ...           ...
[68091]           <NA>           <NA>           <NA>        RTN133
[68092]           <NA>           <NA>           <NA>        RTN045
[68093]           <NA>           <NA>           <NA>        RTN133
[68094]           <NA>           <NA>           <NA>        RTN039
[68095]           <NA>           <NA>           <NA>        RTN024
          softFilterMatrix   |          study
                  <matrix>   |       <factor>
      [1]                    | non-BRCA1-like
      [2]                    |     BRCA1-like
      [3]                    |     BRCA1-like
      [4]                    | non-BRCA1-like
      [5]                    |   BRCA1-mutant
      ...              ... ...            ...
[68091]                    |     BRCA1-like
[68092]                    |   BRCA1-mutant
[68093]                    |     BRCA1-like
[68094]                    |   BRCA1-mutant
[68095]                    |     BRCA1-like
-------
seqinfo: 22 sequences from an unspecified genome; no seqlengths
hardFilters: NULL

I think this looks good and the table looks good

sort(table(ex_vr$study), decreasing=TRUE)

BRCA1-like non-BRCA1-like BRCA1-mutant
45065 15235 7795

But when I try to do mutationContext I get an error

ex_motifs=mutationContext(ex_vr, BSgenome.Hsapiens.UCSC.hg19, unify=TRUE)
Error in .Call2("new_XStringSet_from_CHARACTER", ans_class, ans_elementType,  :
  key 44 (char ',') not in lookup table

Can anyone help me figure out what I've done wrong?

Thanks very much,

Tesa

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.2 (Yosemite)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
 [1] ggdendro_0.1-15                   ggplot2_1.0.1                    
 [3] BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.36.3                  
 [5] rtracklayer_1.28.9                SomaticCancerAlterations_1.4.0   
 [7] SomaticSignatures_2.4.7           Biobase_2.28.0                   
 [9] VariantAnnotation_1.14.13         Rsamtools_1.20.4                 
[11] Biostrings_2.36.4                 XVector_0.8.0                    
[13] GenomicRanges_1.20.6              GenomeInfoDb_1.4.2               
[15] IRanges_2.2.7                     S4Vectors_0.6.4                  
[17] BiocGenerics_0.14.0               BiocInstaller_1.18.4             

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.0             biovizBase_1.16.0       lattice_0.20-33        
 [4] digest_0.6.8            foreach_1.4.2           gridBase_0.4-7         
 [7] plyr_1.8.3              futile.options_1.0.0    acepack_1.3-3.3        
[10] RSQLite_1.0.0           zlibbioc_1.14.0         GenomicFeatures_1.20.3
[13] rpart_4.1-10            labeling_0.3            ggbio_1.16.1           
[16] proto_0.3-10            splines_3.2.2           BiocParallel_1.2.20    
[19] stringr_1.0.0           foreign_0.8-66          RCurl_1.95-4.7         
[22] biomaRt_2.24.0          munsell_0.4.2           proxy_0.4-15           
[25] pkgmaker_0.22           pcaMethods_1.58.0       nnet_7.3-11            
[28] gridExtra_2.0.0         Hmisc_3.16-0            codetools_0.2-14       
[31] XML_3.98-1.3            reshape_0.8.5           GenomicAlignments_1.4.1
[34] MASS_7.3-44             bitops_1.0-6            grid_3.2.2             
[37] RBGL_1.44.0             exomeCopy_1.14.0        xtable_1.7-4           
[40] GGally_0.5.0            gtable_0.1.2            registry_0.3           
[43] DBI_0.3.1               magrittr_1.5            scales_0.3.0           
[46] graph_1.46.0            stringi_0.5-5           reshape2_1.4.1         
[49] doParallel_1.0.8        latticeExtra_0.6-26     futile.logger_1.4.1    
[52] Formula_1.2-1           lambda.r_1.1.7          RColorBrewer_1.1-2     
[55] NMF_0.20.6              iterators_1.0.7         tools_3.2.2            
[58] dichromat_2.0-0         OrganismDbi_1.10.0      rngtools_1.2.4         
[61] survival_2.38-3         AnnotationDbi_1.30.1    colorspace_1.2-6       
[64] cluster_2.0.3

SomaticSignatures mutationContext VRanges • 2.5k views

ADD COMMENT • link updated 9.7 years ago by Julian Gehring ★ 1.3k • written 9.7 years ago by tesa.severson ▴ 10

1

Entering edit mode

Hi Tesa, thank you for the detailed report. I have a few questions that can help us to narrow down the issue:

Can you please rerun your example and run traceback() directly after the error occurs? This can tell us in which of the underlying functions the issue occurs.
Can you rerun the command with unify=FALSE and see if the issue still occurs?
You can try to rerun the command with a subset of the VRanges (e.g. ex_vr[1:1000]). If you do this with different subset, you can most likely narrow it down to a single problematic entry.

In case you can share the data with me, I can also have a look at it.

Regarding (3), running

for(i in seq_along(ex_vr)) {
  v = ex_vr[i]
  tryCatch(mutationContext(v, BSgenome.Hsapiens.UCSC.hg19, unify=TRUE), error = function(e) print(v))
}

will step through each variant individually, and show the ones that cause the error.

ADD REPLY • link 9.7 years ago Julian Gehring ★ 1.3k

0

Entering edit mode

Hi Julian, thanks for your quick reply. I won't give you traceback details because the problem was that there were still some INDELs in my file. I ran your code for stepping though each variant. The 'error' also occurred also with unify=FALSE. The INDELS were the offending lines. But now I have a new, unrelated problem. I still can't get the mutationContext function to work with my data. I can send you my data and you can have a look to see if you can figure out what might be wrong besides the INDELs.

Thanks again,

Tesa

ADD REPLY • link 9.7 years ago tesa.severson ▴ 10

0

Entering edit mode

I'm glad to hear that you could find the issue. Can you give me an example how a "typical offending" InDel looks like in your data? (Either paste it here or send it to me via e-mail). The mutationContext function performs checks for catching InDels, but it seems to miss some in your data. I would like to extend the internal checks, such that they also work seamlessly with your data.

ADD REPLY • link 9.7 years ago Julian Gehring ★ 1.3k

0

Entering edit mode

Hi Julian,

I'm not sure if it is because of the somewhat weird way SnpEff annotates indels. here are a few offenders. I have found an additional one that my code didn't find which was causing the last function problem. I can now use mutationContext when I input the right data ;)

1

211465056

1

*

RCOR3

26155

sequence_feature

INDEL

ACATAGGCCA

ACA

NA

Untested

Somatic

RTN038

304

BRCA1-like

3

157820753

1

*

SHOX2

26155

5_prime_UTR_variant

INDEL

ACAGCA

ACA

NA

Untested

Somatic

RTN039

139

BRCA1-mutant

4

185580420

1

*

PRIMPOL

26155

intron_variant

INDEL

ACATTCA

ACA

NA

Untested

Somatic

RTN097

305

BRCA1-like

4

185580420

1

*

PRIMPOL

26155

intron_variant

INDEL

ACATTCA

ACA

NA

Untested

Somatic

RTN102

305

non-BRCA1-like

4

185580420

1

*

PRIMPOL

26155

intron_variant

INDEL

ACATTCA

ACA

NA

Untested

Somatic

RTN126

305

BRCA1-like

7

12428493

1

*

VWDE

26155

3_prime_UTR_variant

INDEL

ACACTCA

ACA

NA

Untested

Somatic

RTN133

1092

BRCA1-like

14

37717240

1

*

MIPOL1

26155

intron_variant

INDEL

ACATGCACACAGAGACCCA

ACA

NA

Untested

Somatic

RTN039

860

BRCA1-mutant

17

7809187

1

*

CHD3

26155

frameshift_variant

INDEL

ACACCCGTCA

ACA

NA

Untested

Somatic

RTN051

624

BRCA1-like

14

71202835

1

*

MAP3K9

26155

intron_variant

INDEL

ACACCAC

ACAC

NA

Untested

Somatic

RTN005

170

BRCA1-like

14

71202835

1

*

MAP3K9

26155

intron_variant

INDEL

ACACCAC

ACAC

NA

Untested

Somatic

RTN028

170

non-BRCA1-like

14

71202835

1

*

MAP3K9

26155

intron_variant

INDEL

ACACCAC

ACAC

NA

Untested

Somatic

RTN039

170

BRCA1-mutant

Thanks for your help

ADD REPLY • link 9.7 years ago tesa.severson ▴ 10

0

Entering edit mode

Thanks for the data. I'm however not familiar with the output format of SnpEff. For e.g. the first variant, would ACATAGGCCA be the reference and ACA be the alternative allele?

ADD REPLY • link 9.7 years ago Julian Gehring ★ 1.3k

0

Entering edit mode

Yes, that's my interpretation of it. So, the alternative allele is missing 'TAGGCCA'. But there doesn't seem to be any rhyme or reason how long the reference section is in the output. We've actually found it so difficult to interpret we re-map indels if needed with another tool.

ADD REPLY • link 9.7 years ago tesa.severson ▴ 10

0

Entering edit mode

As I mentioned before, the mutationContext function already has some checks that should detect cases like these. I don't fully understand why they did not work here, perhaps due to the representation of these specific indels in the VRanges. Could you please send the a VRanges with the offending entries that you have shown above via e-mail? That should help me diagnosing the issue, since SnpEff is used by quite a number of groups.

ADD REPLY • link 9.7 years ago Julian Gehring ★ 1.3k

score 0 · Answer 1 · 2015-09-03

0

Entering edit mode

tesa.severson ▴ 10

@tesaseverson-8731

Last seen 9.7 years ago

Netherlands

Hi Julian, no problem. I don't have access to the computer I did the work on until later this afternoon, but I'll email it to you then.

Cheers,

Tesa

ADD COMMENT • link 9.7 years ago tesa.severson ▴ 10

score 0 · Answer 2 · 2015-09-04

From the data that you have sent me, the issue appears to be caused by alternative alleles that contain "," characters:

head(alt(vr)[grep(",", alt(vr))])
[1] "TGGGGGGGGG,TGGGGGGGGGGGGGG"                    
[2] "TTCCTCCTCCTCCTCCTCCTCCTC,TTCCTCCTCCTCCTCCTCCTC"
[3] "TTCCTCCTCCTCCTCCTCCTCCTC,TTCCTCCTCCTCCTCCTCCTC"
[4] "TTCCTCCTCCTCCTCCTCCTCCTC,TTCCTCCTCCTCCTCCTCCTC"
[5] "GAA,GA"                                        
[6] "GAA,GA"

You can remove those simply with

vrx = vr[!grepl(",", alt(vr))]

and the analysis works fine. I suspect that different alternative alleles are encoded this way in the "csv" file that you are starting with; however, it does not work with the concept of VRanges. As I mentioned before, I will change the checks of the respective functions, such that you will get a more informative error message in the future.

Update: SomaticSignatures 2.4.8 implements more extensives checks and would provide a more informative error message here. It should become available once the build is completed, which will happen over the weekend.