Question

seqCAT profiler filtering out all variants?

0

Entering edit mode

abe • 0

@abe-18006

Last seen 6.0 years ago

I'm getting the same error message as found in this post when running the create_profile function from seqCAT:

> create_profile(vcf, "SRR7789342", "SRR7789342_profile.txt")
Reading VCF file ...
Creating SNV profile ...
Error in `[<-.data.frame`(`*tmp*`, !grepl("^rs[0-9]+", data$rsID), "rsID",  :
  replacement has 1 row, data has 0
In addition: Warning messages:
1: In .bcfHeaderAsSimpleList(header) :
  duplicate keys in header will be forced to unique rownames
2: In .bcfHeaderAsSimpleList(header) :
  duplicate keys in header will be forced to unique rownames
3: In .bcfHeaderAsSimpleList(header) :
  duplicate keys in header will be forced to unique rownames
4: In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': chrM, chr15_KI270727v1_random, chrY_KI270740v1_random, chrUn_KI270304v1, chrUn_KI270305v1, chrUn_KI270322v1, chrUn_KI270320v1, chrUn_KI270312v1, chrUn_KI270412v1, chrUn_KI270419v1, chrUn_KI270420v1, chrUn_KI270424v1, chrUn_KI270417v1, chrUn_KI270422v1, chrUn_KI270425v1, chrUn_KI270528v1, chrUn_KI270539v1, chrUn_KI270548v1, chrUn_KI270334v1, chrUn_KI270335v1, chrUn_KI270338v1, chrUn_KI270340v1, chrUn_KI270336v1, chrUn_KI270364v1, chrUn_KI270366v1, chrUn_KI270379v1, chrUn_KI270389v1, chrUn_KI270395v1, chrUn_KI270388v1, chrUn_KI270386v1, chrUn_KI270376v1, chrUn_KI270374v1, chrUn_KI270372v1, chrUn_KI270373v1, chrUn_KI270375v1, chrUn_KI270371v1, chrUn_KI270752v1, chrUn_KI270755v1, chr1_KI270762v1_alt, chr1_KI270766v1_alt, chr1_KI270760v1_alt, chr1_GL383518v1_alt, chr1_KI270761v1_alt, chr2_KI270770v1_alt, chr2_KI270773v1_alt, chr2_KI270775v1_alt, chr2_KI270771v1_alt, chr2_KI270776v1_alt, chr2_KI270767v1_alt, chr [... truncated]

It looks like all of my variants are being filtered out? I double-checked the source code for the create_profile function, and I understand that the following filters exist: 1. Remove variants without "PASS" in the FILTER column. 2. Remove variants with depth below default of 10. 3. Remove non-variants, i.e. any row with greater than one char in the REF or ALT columns.

I went back to my .vcf to inspect for these filters, and I found a row that passes all those criteria:

chr1    187485  .       G       A       855.6   PASS    AC=1;AF=0.500;AN=2;BaseQRankSum=-0.656;DP=37;ExcessHet=3.0103;FS=2.005;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=23.12;ReadPosRankSum=2.135;SOR=0.283;ANN=A|synonymous_variant|LOW|FO538757.1|ENSG00000279457|transcript|ENST00000623083.3|protein_coding|7/11|c.771C>T|p.Asn257Asn|771/1687|771/1395|257/464||,A|synonymous_variant|LOW|FO538757.1|ENSG00000279457|transcript|ENST00000623834.3|protein_coding|8/10|c.768C>T|p.Asn256Asn|768/1374|768/1080|256/359||,A|synonymous_variant|LOW|FO538757.1|ENSG00000279457|transcript|ENST00000624735.1|protein_coding|9/15|c.804C>T|p.Asn268Asn|1040/1604|804/1368|268/455||,A|downstream_gene_variant|MODIFIER|FO538757.2|ENSG00000279928|transcript|ENST00000624431.1|protein_coding||c.*3327G>A|||||3327|,A|downstream_gene_variant|MODIFIER|MIR6859-2|ENSG00000273874|transcript|ENST00000612080.1|miRNA||n.*406C>T|||||406|  GT:AD:DP:GQ:PL  0/1:5,32:37:62:863,0,62

So, why would this row be filtered out? I know that create_profile calls filter_annotations to filter for the highest impact variants. Is this row getting filtered out because its impact in marked as LOW?

For reference, I used HaplotypeCaller, VariantFiltration and snpEff to get the final .vcf here: https://drive.google.com/open?id=1vfih6-QWPuczqHYDRhx6ygzzD8EjeRDm.

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS release 6.5 (Final)

Matrix products: default
BLAS: /usr/lib64/R/lib/libRblas.so
LAPACK: /usr/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.iso885915       LC_NUMERIC=C
 [3] LC_TIME=en_US.iso885915        LC_COLLATE=en_US.iso885915
 [5] LC_MONETARY=en_US.iso885915    LC_MESSAGES=en_US.iso885915
 [7] LC_PAPER=en_US.iso885915       LC_NAME=C
 [9] LC_ADDRESS=C                   LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] seqCAT_1.4.1                VariantAnnotation_1.27.9
 [3] Rsamtools_1.33.7            Biostrings_2.49.2
 [5] XVector_0.21.4              SummarizedExperiment_1.11.6
 [7] DelayedArray_0.7.48         BiocParallel_1.15.15
 [9] matrixStats_0.54.0          Biobase_2.41.2
[11] GenomicRanges_1.33.14       GenomeInfoDb_1.17.4
[13] IRanges_2.15.18             S4Vectors_0.19.22
[15] BiocGenerics_0.27.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0               lattice_0.20-35          prettyunits_1.0.2
 [4] ps_1.2.0                 assertthat_0.2.0         rprojroot_1.3-2
 [7] digest_0.6.18            R6_2.3.0                 backports_1.1.2
[10] RSQLite_2.1.1            httr_1.3.1               zlibbioc_1.27.0
[13] rlang_0.3.0.1            GenomicFeatures_1.33.4   progress_1.2.0
[16] curl_3.2                 callr_3.0.0              blob_1.1.1
[19] Matrix_1.2-14            desc_1.2.0               devtools_2.0.1
[22] stringr_1.3.1            RCurl_1.95-4.11          bit_1.1-14
[25] biomaRt_2.37.9           munsell_0.5.0            compiler_3.5.1
[28] rtracklayer_1.41.8       pkgconfig_2.0.2          base64enc_0.1-3
[31] pkgbuild_1.0.2           GenomeInfoDbData_1.2.0   XML_3.98-1.16
[34] crayon_1.3.4             withr_2.1.2              GenomicAlignments_1.17.3
[37] bitops_1.0-6             grid_3.5.1               gtable_0.2.0
[40] DBI_1.0.0                magrittr_1.5             cli_1.0.1
[43] stringi_1.2.4            fs_1.2.6                 remotes_2.0.1
[46] testthat_2.0.1           tools_3.5.1              bit64_0.9-7
[49] glue_1.3.0               BSgenome_1.49.5          hms_0.4.2
[52] processx_3.2.0           pkgload_1.0.1            AnnotationDbi_1.44.0
[55] colorspace_1.3-2         BiocManager_1.30.3       sessioninfo_1.1.0
[58] memoise_1.1.0            usethis_1.4.0

seqCat • 1.0k views

ADD COMMENT • link 6.0 years ago abe • 0

0

Entering edit mode

Hi, abe!

There are a number of things you can check. First, are you using the correct sample names when creating the profile? You can check the sample names with VariantAnnotation::scanVcfHeader(vcf)

Secondly, what type of variant calling have you performed? If you have done joint genotyping that might explain the issue; in that case, the latest development version should fix it (1.5.1 and above). Bioconductor also has its upcoming release at April 30th, so that change will be included in the next release as well.

Having only LOW variant impacts does not mean that they will be filtered out; seqCAT only filters out those variants that are below the highest impact, i.e. any MODIFIER variant impacts would be filtered out here, as they have less impact than LOW. Also, seqCAT does not remove "non-variants", but rather non-SNVs. So any variant that is longer than 1 bp, as you correctly stated, i.e. indels.

I don't have permission to access the Google Drive folder, but if you provide me with access (I have requested it) I can try to look in your file to see if I can find the error as well.

ADD REPLY • link 6.0 years ago erikfas ▴ 30

0

Entering edit mode

I believe that this is a bug fixed in Rhtslib 1.15.6, but I am not exactly sure because your packages seem a mix of release and 'devel' versions. Be sure that BiocManager::valid() returns TRUE.

ADD REPLY • link 6.0 years ago Martin Morgan 25k