Entering edit mode
Julian Gehring
★
1.3k
@julian-gehring-5818
Last seen 5.8 years ago
Hi Valerie and Herve,
When providing a 'Seqinfo' object as the 'genome' argument to
'readVcf',
would it be possible to not have the requirement of the VCF header and
the seqinfo object to have the same ordering? As the moment
(VariantAnnotation_1.8.2), the import will fail if this is not the
case.
However, this requires the user to know the ordering of the seqnames
in the VCF file beforehand. In the ideal case, the readVcf function
could reorder the seqnames in the way as provided by the 'genome'
argument.
Best wishes
Julian
On 09/28/2013 12:57 AM, Hervé Pagès wrote:
> Hi Julian,
>
> With the latest devel version of GenomicRanges (1.13.48, should
become
> available via biocLite() in the next 24 hours or so), setting the
> seqinfo of a big VCF object should be much faster but your mileage
> may vary. Let us know if you still find it slow enough to justify a
> mechanism for letting the user specify the seqinfo upfront when
using
> readVcf() (e.g. the 'genome' arg could be renamed 'seqinfo' and
accept
> everything it supports now plus a Seqinfo object).
>
> Cheers,
> H.
>
> On 09/24/2013 10:00 AM, Julian Gehring wrote:
>> Hi Valerie,
>>
>>> Thanks for clarifying. No, there is not currently a way to do
this.
>>>
>>> The 'seqinfo' on the rowData(vcf) should not be difficult to
change. Can
>>> you provide more detail as to (1) why you are changing it (did
readVcf()
>>> import something incorrectly or ?) and (2) what operations on the
>>> 'seqinfo' are taking a long time.
>>
>> 'readVcf' did everything in a correct way. I needed to add the
>> information about the genome, circularity, and seqlengths based on
>> external annotitation, since it is not part of the VCF file.
>>
>> I can't tell you at the moment where 'seqinfo <-' spends most of
its
>> time. I'll profile it and get back to you.
>>
>> Thank your for your quick answer and your help!
>>
>> Best wishes
>> Julian
>>
>>
>>> Thanks.
>>> Valerie
>>>
>>>>
>>>> Best wishes
>>>> Julian
>>>>
>>>>
>>>> On 09/24/2013 06:31 PM, Valerie Obenchain wrote:
>>>>> Hi Julian,
>>>>>
>>>>> On 09/24/2013 02:29 AM, Julian Gehring wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Is there a direct way to specifiy the 'seqinfo' of a genome for
the
>>>>>> import of a VCF file using 'readVcf'?
>>>>>
>>>>> I think the question is how to read in a subset of
>>>>> chromosomes/positions
>>>>> from a vcf file without an accompanying tabix index. You can't.
>>>>> readVcf() requires an index when subsets are defined by
>>>>> chromosome/position. However you can read in subsets defined by
INFO
>>>>> and/or GENO fields without an index.
>>>>>
>>>>> Approaches:
>>>>> (1) create index with ?indexTabix and specify 'which' in
ScanVcfParam
>>>>> (2) use ?filterVcf to write out a new file of records of
interest
>>>>>
>>>>>> I'm aware that one can change it
>>>>>> with the 'seqinfo' method afterwards, but for large VCF files
this
>>>>>> can
>>>>>> take a significant amount of time.
>>>>>
>>>>> What operation is taking along time? Subsetting the VCF object
by
>>>>> chromosome?
>>>>>
>>>>> Valerie
>>>>>
>>>>>>
>>>>>> An alternative would be to sneak it in by the 'which'
arguments, such
>>>>>> as:
>>>>>>
>>>>>> readVcf(file, genome, ScanVcfParam(which = as(seq_info,
"GRanges")))
>>>>>>
>>>>>> but this requires the file to be indexed beforehand.
>>>>>>
>>>>>> Best wishes
>>>>>> Julian
>>>>>>
>>>>>
>>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>