hi, I cannot reproduce the problem, your code works fine with my R/BioC installation and my local copy of the 1000 Genomes Project genotype VFC files:
library(VariantAnnotation)
vcf <- readVcf("ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz", genome="hg19")
vcf
class: CollapsedVCF
dim: 1103547 0
rowRanges(vcf):
GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
info(vcf):
DataFrame with 27 columns: CIEND, CIPOS, CS, END, IMPRECISE, MC, MEINFO, MEND, MLEN, MSTART, SVLEN, SVTYPE, TSD, AC, AF, N...
info(header(vcf)):
Number Type Description
CIEND 2 Integer Confidence interval around END for imprecise variants
CIPOS 2 Integer Confidence interval around POS for imprecise variants
CS 1 String Source call set.
END 1 Integer End coordinate of this variant
IMPRECISE 0 Flag Imprecise structural variation
MC . String Merged calls.
MEINFO 4 String Mobile element info of the form NAME,START,END<POLARITY; If there is only 5' OR 3' support f...
MEND 1 Integer Mitochondrial end coordinate of inserted sequence
MLEN 1 Integer Estimated length of mitochondrial insert
MSTART 1 Integer Mitochondrial start coordinate of inserted sequence
SVLEN . Integer SV length. It is only calculated for structural variation MEIs. For other types of SVs; one ...
SVTYPE 1 String Type of structural variant
TSD 1 String Precise Target Site Duplication for bases, if unknown, value will be NULL
AC A Integer Total number of alternate alleles in called genotypes
AF A Float Estimated allele frequency in the range (0,1)
NS 1 Integer Number of samples with data
AN 1 Integer Total number of alleles in called genotypes
EAS_AF A Float Allele frequency in the EAS populations calculated from AC and AN, in the range (0,1)
EUR_AF A Float Allele frequency in the EUR populations calculated from AC and AN, in the range (0,1)
AFR_AF A Float Allele frequency in the AFR populations calculated from AC and AN, in the range (0,1)
AMR_AF A Float Allele frequency in the AMR populations calculated from AC and AN, in the range (0,1)
SAS_AF A Float Allele frequency in the SAS populations calculated from AC and AN, in the range (0,1)
DP 1 Integer Total read depth; only low coverage data were counted towards the DP, exome data were not used
AA 1 String Ancestral Allele. Format: AA|REF|ALT|IndelType. AA: Ancestral allele, REF:Reference Allele, ...
VT . String indicates what type of variant the line represents
EX_TARGET 0 Flag indicates whether a variant is within the exon pull down target boundaries
MULTI_ALLELIC 0 Flag indicates whether a site is multi-allelic
geno(vcf):
SimpleList of length 0:
head(rowRanges(vcf), 3)
GRanges object with 3 ranges and 5 metadata columns:
seqnames ranges strand | paramRangeID REF ALT QUAL FILTER
<Rle> <IRanges> <Rle> | <factor> <DNAStringSet> <CharacterList> <numeric> <character>
rs587697622 22 [16050075, 16050075] * | <NA> A G 100 PASS
rs587755077 22 [16050115, 16050115] * | <NA> G A 100 PASS
rs587654921 22 [16050213, 16050213] * | <NA> C T 100 PASS
-------
seqinfo: 86 sequences from hg19 genome
sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Fedora release 12 (Constantine)
locale:
[1] LC_CTYPE=en_US.UTF8 LC_NUMERIC=C LC_TIME=en_US.UTF8 LC_COLLATE=en_US.UTF8
[5] LC_MONETARY=en_US.UTF8 LC_MESSAGES=en_US.UTF8 LC_PAPER=en_US.UTF8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] VariantAnnotation_1.14.1 Rsamtools_1.20.2 Biostrings_2.36.1 XVector_0.8.0
[5] GenomicRanges_1.20.3 GenomeInfoDb_1.4.0 IRanges_2.2.1 S4Vectors_0.6.0
[9] BiocGenerics_0.14.0 vimcom_1.2-3 setwidth_1.0-3 colorout_1.1-0
loaded via a namespace (and not attached):
[1] AnnotationDbi_1.30.1 zlibbioc_1.14.0 GenomicAlignments_1.4.1 BiocParallel_1.2.1 BSgenome_1.36.0
[6] tools_3.2.0 Biobase_2.28.0 DBI_0.3.1 lambda.r_1.1.7 futile.logger_1.4.1
[11] rtracklayer_1.28.2 futile.options_1.0.0 bitops_1.0-6 RCurl_1.95-4.6 biomaRt_2.24.0
[16] RSQLite_1.0.0 GenomicFeatures_1.20.1 XML_3.98-1.1
the output of your call to traceback() seems incomplete to me. It should show all the intermediate function calls that lead to the error. See the following toy example:
g <- function(x) 2*x
f <- function(x) g(x)
f("hello world\n")
Error in 2 * x : non-numeric argument to binary operator
traceback()
2: g(x) at #1
1: f("hello world\n")
although if the software actually works for me, it is likely that the problem may be related to your local copy of the VCF file. Could you try to download it again and see whether it works then?
cheers,
robert.
rowData works fine. However, other functions provide empty outputs. For example,see output below. I downloaded the package this morning using: source("http://bioconductor.org/biocLite.R")
so it should be up to date.
geno(vcf)
List of length 0
names(0):
info(vcf)[1:4, 1:5]
DataFrame with 4 rows and 5 columns
CIEND CIPOS CS END IMPRECISE
<IntegerList> <IntegerList> <character> <integer> <logical>
1 NA,NA NA,NA NA NA FALSE
2 NA,NA NA,NA NA NA FALSE
3 NA,NA NA,NA NA NA FALSE
4 NA,NA NA,NA NA NA FALSE
biocLite() will download the package version appropriate for your version of R / Bioconductor. If you're using release it will get VariantAnnotation 1.14.1 (Bioconductor 3.1). If you are using devel it will get VariantAnnotation 1.15.9 (Bioconductor 3.2). The output of sessionInfo() let's us see not only the version of the package you're having trouble with but other dependencies as well. Please provide this when you ask a question on the support site.
The dimensions of 'vcf' are
There are 1103547 rows (positions) and no samples. When there are no samples there is no geno() so the empty list is expected. It is also expected that an sapply over the classes in geno() would return an empty list.
The call info(vcf)[1:4, 1:5] returns a DataFrame of the appropriate variables and it looks like your VCF file doesn't have data for these fields for the first 4 variants. That doesn't mean all the info() fields are empty. Have you looked at your raw file to confirm?
I misspoke and it looks like the replacement of 'rowData' with 'rowRanges' was back-ported to release. Version 1.14.1 does expect the use of rowRanges() as an accessor and you should get a warning if you try to use rowData():
If you are using 1.15.9 you should see a defunct message:
You're code doesn't show either warning. I need the output of sessionInfo() to help you further. It would also be useful to see the error from just rowRanges(vcf) without wrapping it in head().
Valerie
Here is the error I get from rowRanges and sessionInfo() output:
rowRanges(vcf)
Error in rowRanges(vcf) : Argument 'x' must be a matrix or a vector.
sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=Italian_Italy.1252 LC_CTYPE=Italian_Italy.1252 LC_MONETARY=Italian_Italy.1252
[4] LC_NUMERIC=C LC_TIME=Italian_Italy.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] matrixStats_0.14.0 VariantAnnotation_1.12.9 Rsamtools_1.18.3 Biostrings_2.34.1
[5] XVector_0.6.0 GenomicRanges_1.18.4 GenomeInfoDb_1.2.5 IRanges_2.0.1
[9] S4Vectors_0.4.0 BiocGenerics_0.12.1
loaded via a namespace (and not attached):
[1] AnnotationDbi_1.28.2 base64enc_0.1-2 BatchJobs_1.6 BBmisc_1.9
[5] Biobase_2.26.0 BiocParallel_1.0.3 biomaRt_2.22.0 bitops_1.0-6
[9] brew_1.0-6 BSgenome_1.34.1 checkmate_1.5.3 codetools_0.2-8
[13] DBI_0.3.1 digest_0.6.8 fail_1.2 foreach_1.4.2
[17] GenomicAlignments_1.2.2 GenomicFeatures_1.18.7 iterators_1.0.7 magrittr_1.5
[21] RCurl_1.95-4.6 RSQLite_1.0.0 rtracklayer_1.26.3 sendmailR_1.2-1
[25] stringi_0.4-1 stringr_1.0.0 tools_3.1.1 XML_3.98-1.1
[29] zlibbioc_1.12.0
Thanks for providing sessionInfo(). Your R/Bioconductor is out of date. The current Bioconductor release and devel use R 3.2. VariantAnnotation 1.12.9 is from 2 release cycles ago and does not support rowRanges() as an accessor. If you look at the man page
there is no mention of rowRanges(), only rowData(). So it makes sense that rowRanges() on a VCF object throws an error.
If you need help updating R/Bioconductor instructions can be found here:
http://www.bioconductor.org/install/
Valerie
Ok thanks. I installed the new version of R and updated Bioconductor. I am wondering if VariantAnnotation is the package I need. It was suggested to me by a user on Biostars. I work mainly with population genetics concepts. I need to calculate Fst values from 1000 genomes VCF files or LD. Are there packages that I can use to do this?
After updating R and VariantAnnotation, I get the following error message:
library(VariantAnnotation)
Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) :
there is no package called ‘XML’
Error: package or namespace load failed for ‘VariantAnnotation’
sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=Italian_Italy.1252 LC_CTYPE=Italian_Italy.1252 LC_MONETARY=Italian_Italy.1252
[4] LC_NUMERIC=C LC_TIME=Italian_Italy.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] matrixStats_0.14.0 Rsamtools_1.20.2 Biostrings_2.36.1 XVector_0.8.0
[5] GenomicRanges_1.20.3 GenomeInfoDb_1.4.0 IRanges_2.2.1 S4Vectors_0.6.0
[9] BiocGenerics_0.14.0 BiocInstaller_1.18.2
loaded via a namespace (and not attached):
[1] bitops_1.0-6 DBI_0.3.1 RSQLite_1.0.0 zlibbioc_1.14.0 tools_3.2.0 Biobase_2.28.0
[7] RCurl_1.95-4.6
For built in fst() and ld() functions you might try snpStats or SNPrelate.
For your case, VariantAnnotation does not have built in fst() and ld() but does have a flexible interface for reading/parsing data and manipulation. Data in a VCF object can be used in downstream analysis in snpRelate and snpStats. If you're interested in a certain region (chromosome or position) or specific INFO fields use ScanVcfParam() when calling readVcf():
Other functions in VariantAnnotation you may find useful:
FYI, a useful way to search for topic-related software and annotation packages is through the BiocViews interface.
http://www.bioconductor.org/packages/release/BiocViews.html#___Software
As for the XML, did you get an error with biocLite("XML")?
Valerie
The function readVcf now works fine. However, I still have problems with
head(rowRange(vcf), 3)
Error in head(rowRange(vcf), 3) :
error in evaluating the argument 'x' in selecting a method for function 'head': Error: could not find function "rowRange"
The error message tells you what's wrong - there is no rowRange() function. You're missing an 's'.
We're back to square one. I used rowRanges() and still get the same error message I had at the beginning and which prompted me to open this thread:
head(rowRanges(vcf), 3)
Error in head(rowRanges(vcf), 3) :
error in evaluating the argument 'x' in selecting a method for function 'head': Error in rowRanges(vcf) : Argument 'x' must be a matrix or a vector.
Sorry if this is tiring your patience I understand, but it's tiring mine too. I really appreciate your help though.
Thanks
hi, notice that this operation works fine with one of the example VCF files in VariantAnnotation:
so, there might be something going on on your particular VCF object. Could you paste the output of the function 'traceback()' calling it right after the error happens? Could you also paste the output of how your VCF object is shown after typing its name ('vcf') on the R shell?
I am using Chr.22 from phase 3 1000 Genomes: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/
> traceback()
1: head(rowRanges(vcf), 3)
> vcf
class: CollapsedVCF
dim: 1103547 0
rowRanges(vcf):
GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
info(vcf):
DataFrame with 27 columns: CIEND, CIPOS, CS, END, IMPRECISE, MC, MEINFO, MEND, MLEN, M...
info(header(vcf)):
Number Type Description
CIEND 2 Integer Confidence interval around END for imprecise variants
CIPOS 2 Integer Confidence interval around POS for imprecise variants
CS 1 String Source call set.
END 1 Integer End coordinate of this variant
IMPRECISE 0 Flag Imprecise structural variation
MC . String Merged calls.
MEINFO 4 String Mobile element info of the form NAME,START,END<POLARITY;...
MEND 1 Integer Mitochondrial end coordinate of inserted sequence
MLEN 1 Integer Estimated length of mitochondrial insert
MSTART 1 Integer Mitochondrial start coordinate of inserted sequence
SVLEN . Integer SV length. It is only calculated for structural variatio...
SVTYPE 1 String Type of structural variant
TSD 1 String Precise Target Site Duplication for bases, if unknown, v...
AC A Integer Total number of alternate alleles in called genotypes
AF A Float Estimated allele frequency in the range (0,1)
NS 1 Integer Number of samples with data
AN 1 Integer Total number of alleles in called genotypes
EAS_AF A Float Allele frequency in the EAS populations calculated from ...
EUR_AF A Float Allele frequency in the EUR populations calculated from ...
AFR_AF A Float Allele frequency in the AFR populations calculated from ...
AMR_AF A Float Allele frequency in the AMR populations calculated from ...
SAS_AF A Float Allele frequency in the SAS populations calculated from ...
DP 1 Integer Total read depth; only low coverage data were counted to...
AA 1 String Ancestral Allele. Format: AA|REF|ALT|IndelType. AA: Ance...
VT . String indicates what type of variant the line represents
EX_TARGET 0 Flag indicates whether a variant is within the exon pull down...
MULTI_ALLELIC 0 Flag indicates whether a site is multi-
hi, since the space for comment replies gets smaller and smaller, I'm gonna jump to an answer below to facilitate reading output.