Hi all,
I am quite new to R/RStudio and am trying to use it together with VariantAnnotation (Bioconductor) to extract structural variant data and flanking sequences from available VCF and genome (FASTA) files.
Quite recently, VCFs (VCFv4.1, source = sniffles) of over 100 tomato accessions were uploaded to the Solgenomics website. Using these together with the SL4.0 genome FASTA, I would like to extract structural variant data and flanking sequences per tomato accession in a semi-automated way, with output as follows:
>StructuralVariantID1
ACGTTGTCTTCAAGCTAAAGGCTCGTGGAATGAATGCGGC[G/A]GATCTCGGAAAACTTGGAAGATCAACTACTTTGAAAAGT
Eventually, the goal is to use this data for marker design or similar activities.
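To make the target format concrete, here is a minimal base-R sketch of how one such record could be assembled. It is only an illustration under assumptions: the function name `format_flanked_variant`, the `flank` argument, and the toy sequence are made up, the "genome" is a plain character string rather than the SL4.0 FASTA, and nothing is read from the real VCF files.

```r
# Hypothetical helper (not part of any package): build one record of the form
#   >ID
#   leftflank[REF/ALT]rightflank
# `genome` is a plain character string standing in for one chromosome.
# `pos` is the 1-based position of the first REF base, as in a VCF.
format_flanked_variant <- function(genome, id, pos, ref, alt, flank = 5) {
  ref_end <- pos + nchar(ref) - 1

  # Sanity check: the REF allele must match the reference at that position.
  stopifnot(substr(genome, pos, ref_end) == ref)

  # Flanks are clipped at the sequence ends; substr() returns "" when the
  # stop position is smaller than the start position.
  left  <- substr(genome, max(1, pos - flank), pos - 1)
  right <- substr(genome, ref_end + 1, min(nchar(genome), ref_end + flank))

  paste0(">", id, "\n", left, "[", ref, "/", alt, "]", right)
}

# Toy example: a 20-base "chromosome" with a G>A variant at position 11.
toy_genome <- "ACGTTGTCTTGATCTCGGAA"
record <- format_flanked_variant(toy_genome, "StructuralVariantID1",
                                 pos = 11, ref = "G", alt = "A")
cat(record, "\n")
```

Structural variant records from sniffles may carry symbolic ALT alleles such as `<DEL>` rather than literal bases, which this sketch does not handle.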
I have tried various manuals, help pages and forums. However, since I am still a rookie when it comes to R, I find them so dense in information that they are overwhelming. I was therefore hoping someone could point me in the right direction, help me get started on writing the code, and/or provide some explanation.
Thank you very much in advance!
- Willem
Can you provide a link to the specific files that you are working with?
The link for the VCF files: ftp://ftp.solgenomics.net/genomes/tomato100/March022020svlandscape/variants/
Link for the M82 VCF file that I was using: ftp://ftp.solgenomics.net/genomes/tomato100/March022020svlandscape/variants/M82.ont.v1.0.s.vcf.gz
Link for the SL4.0 genome fasta: ftp://ftp.solgenomics.net/genomes/Solanumlycopersicum/assembly/build4.00/