I have two large SNP data sets stored as vcf.gz files. So far, I found that the tabix index from htslib is a good way to access genomic data that are too large for my RAM. However, it seems that both vcf.gz files are too large even to create a tabix index for. Therefore, htslib recommends creating a CSI index instead.
My question is: how can I access my CSI-indexed data so that I can manipulate them in R to conduct, for example, a GWAS?
We don't expose the CSI functionality of htslib at this time. You mention RAM issues. If I understand correctly, we are dealing with a chromosome with more than 2^29 positions, so tabix TBI indexing cannot work. But I would like to see the error message and the version of tabix. We will look into what is required to support CSI at our end, but it will take some time.
I don't have a solution for the tabix-based behavior that you may be looking for. However, for ingestion of a CSI-indexed VCF, with opportunities for RAM-limited region extraction, the following may help. It uses https://github.com/brentp/cyvcf2.
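Since the code itself isn't reproduced above, here is a minimal sketch of the kind of approach I mean, assuming a hypothetical file large_snps.vcf.gz with an accompanying .csi index; the region string and output file name are placeholders as well:

```python
# Sketch only: file name, region, and output path are placeholders.
import numpy as np
from cyvcf2 import VCF

# cyvcf2 (htslib under the hood) picks up large_snps.vcf.gz.csi automatically.
# gts012=True encodes genotypes as 0=HOM_REF, 1=HET, 2=HOM_ALT, 3=UNKNOWN.
vcf = VCF("large_snps.vcf.gz", gts012=True)
print(len(vcf.samples), "samples")

# Query a single region so only that slice is held in memory at a time.
rows = []
positions = []
for variant in vcf("chr1:1-1000000"):   # region lookup goes through the CSI index
    rows.append(variant.gt_types.copy())  # one genotype code per sample
    positions.append(variant.POS)

geno = np.array(rows, dtype=np.int8)    # variants x samples matrix

# Write a plain-text matrix for downstream use in R.
np.savetxt("chr1_slice_genotypes.txt", geno, fmt="%d")
```

The genotype matrix written at the end can then be read into R (e.g. with read.table() or data.table::fread()) one region at a time, which keeps the memory footprint bounded while you run the GWAS.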