readVcf does not recognize tabix index
0
0
Entering edit mode
naive • 0
@e3519b66
Last seen 3.7 years ago

Hello everyone,

I am trying to read a vcf file with the readVcf command of the package VariantAnnotation with R. Before, I generated a tbi file from my original vcf file using the following commadns in the shell:

  1. extract the markers of the 1st chromosome tabix -h SNPs.vcf.bgz chr1H > chr1.vcf

  2. create tabix index for the vcf file created in step 1 tabix -p vcf chr1.vcf

The next step is executed in R

vcffile = open(VcfFile(file = "chr1.vcf",
                       index = "chr1.vcf.bgz.tbi",
                       yieldSize = 1000))
vcf = readVcf(vcffile)

This returns the following error: Error: scanVcf: scanVcf: scanTabix: [internal] hmm.. this doesn't look like a tabix file, sorry

Can anybody tell the meaning of this error? How can I find out what is wrong with the tabix file?

Thank you!

VariantAnnotation readvcf tabix • 3.7k views
ADD COMMENT
0
Entering edit mode

Maybe try using VariantAnnotation::indexVcf to create the needed index file?

ADD REPLY
0
Entering edit mode

Thank you for your answer! I am completely new to working with vcf files so any hint is valuable for me.

 indexVcf(x = "chr1.vcf.bgz")

returns:

class: VcfFile 
path: chr1.vcf.bgz
index: chr1.vcf.bgz.tbi
isOpen: FALSE 
yieldSize: NA 

and

vcffile = open(VcfFile(file = "chr1.vcf",
                       index = indexVcf(x = "chr1.vcf.bgz" ),
                       yieldSize = 1000))

returns an error:

Error in open.TabixFile(VcfFile(file = "chr1.vcf", index = indexVcf(x = "chr1.vcf.bgz"),  : 
  failed to open index file: chr1.vcf.bgz
In addition: Warning message:
In is.na(index) : is.na() applied to non-(list or vector) of type 'S4'

Maybe there is an error in the vcf file? In the header of the vcf file it says that the fileformat is version 4.0. It seems that the SNPs are sorted according to their physical position...

ADD REPLY
0
Entering edit mode

Does adding the full name with the tbi at the end of the index file change the output?

vcffile = open(VcfFile(file = "chr1.vcf",
                       index = indexVcf(x = "chr1.vcf.bgz.tbi" ),
                       yieldSize = 1000))
ADD REPLY
0
Entering edit mode

It returns

[E::get_intv] Failed to parse TBX_VCF, was wrong -p [type] used?
The offending line was: "TBI?"
Warning message:
In is.na(index) : is.na() applied to non-(list or vector) of type 'S4'
ADD REPLY
0
Entering edit mode

think there are some additional arguments that can be passed when using indexVcf . The function will take the additional arguments found in indexTabix perhaps the format would be useful to pass in? It looks like there may be a specific format option for vcf4. See ?indexTabix in R for more information

ADD REPLY
0
Entering edit mode

OK, this looks better. However, the initial error seems to remain

vcffile = open(VcfFile(file = "chr1.vcf",
                           index = indexTabix(file = "chr1.vcf.bgz", format = "vcf4"),
                           yieldSize = 1000))
    > vcffile
    class: VcfFile 
    path: chr1.vcf
    index: chr1.vcf.bgz.tbi
    isOpen: TRUE 

Next step:

>   vcf = readVcf(vcffile, "chr1H")
Error: scanVcf: scanVcf: scanTabix: [internal] hmm.. this doesn't look like a tabix file, sorry
 path: chr1.vcf
 index: chr1.vcf.bgz.tbi
  path: chr1.vcf
ADD REPLY
0
Entering edit mode

hmm. I'm wondering if the reading and indexing of the original file works? And then filter for chr1 after? There is more information on doing this in the vignettes found on the landing page . Does creating the index on the full SNPs.vcf.bgz then allow you to read in?
Or depending on what you intend to do you might be able to load the vcf without specifying the index?

ADD REPLY
0
Entering edit mode

Initially, this was what I intended to do. However, SNPs.vcf.bgz seems to be too large for indexing. Therefore, I came up with the solution to go for each chromosome by itself.

ADD REPLY
1
Entering edit mode

Would you mind sharing a subset of the vcf file so I can try and debug what is going on (lori.shepherd @ roswellpark.org). So I can narrow it down what the issue might be, does loading the vcf without the index works? VcfFile(file = "chr1.vcf")

ADD REPLY
0
Entering edit mode

Thank you for your offer! I will try to generate a subset. Thank you for your support!

> VcfFile(file = "chr1.vcf")
class: VcfFile 
path: chr1.vcf
index: NA
isOpen: FALSE 
yieldSize: NA
ADD REPLY
0
Entering edit mode

As far as I understand, tabix only works for a smaller data set. CSI index is then recommended, if the number of SNPs excedes a certain threshold.

~/htslib/bin/tabix -C -p vcf SNPs.vcf.bgz

this creates a CSI indexed file. However, it seems that this cannot be read into R with VariantAnnotation...

ADD REPLY
0
Entering edit mode

See Answer: VCF File too large for tabix. Best option to make it usable with R? for further discussion of ingestion of CSI-indexed VCF

ADD REPLY

Login before adding your answer.

Traffic: 852 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6