Subsetting info columns from a VCF and getting rsID for some row names?
1
0
Entering edit mode
@emilymccann-10861
Last seen 8.5 years ago

I am working in R with the VariantAnnotation package and the vcf file from the ExAC consortium, ExAC.r0.3.1.sites.vep.vcf (release 0.3). My aim is to extract simple data frames for the "AC_Adj" and "AN_Adj" info columns. I have used the below code to achieve this:

 

# define paramaters on how to filter the vcf file
AC.adj.param <- ScanVcfParam(info="AC_Adj")
AN.adj.param <- ScanVcfParam(info="AN_Adj")

# load ALL allele counts (AN) and alt allele counts (AC) from exac vcf file 
raw.exac.AC.adj. <- readVcf("/Volumes/Emilly\ 1TB/ExAC.r0.3.1.sites.vep.vcf", "hg19", param=AC.adj.param)
raw.exac.AN.adj. <- readVcf("/Volumes/Emilly\ 1TB/ExAC.r0.3.1.sites.vep.vcf", "hg19", param=AN.adj.param)

# extract just the allele counts and corressponding chr location with allele tags from exac vcf file - in dataframe/s4 class
sclass.exac.AC.adj <- (info(raw.exac.AC.adj.))
sclass.exac.AN.adj <- (info(raw.exac.AN.adj.))

# make the allele counts into a plain data frame - chr positions are only row names and not their own column
df.exac.AC.adj <- as.data.frame(sclass.exac.AC.adj)
df.exac.AN.adj <- as.data.frame(sclass.exac.AN.adj)

# make the chr positions an actual column and remove the row names to make it cleaner
df.exac.AC.adj$chr.position <- row.names(df.exac.AC.adj)
df.exac.AN.adj$chr.position <- row.names(df.exac.AN.adj)

row.names(df.exac.AC.adj) <- NULL
row.names(df.exac.AN.adj) <- NULL

 

However, while most of the row names are correct, some of the row names are returned are rsIDs rather than chr:position.

Essentially, a subset of the data returned looks like this:

   AC_Adj chr.position
70      2      1:69395
71      9      1:69409
72   1985  rs140739101
73      4      1:69438
74      7  rs142004627
75      1      1:69474
76      1      1:69478
77      1      1:69486
78      2      1:69487
79     10      1:69489
80     59  rs150690004

While I would like the returned data frame to look like this:

   AC_Adj chr.position
70      2      1:69395
71      9      1:69409
72   1985   1:69428
73      4      1:69438
74      7      1:69453
75      1      1:69474
76      1      1:69478
77      1      1:69486
78      2      1:69487
79     10      1:69489
80     59      1:69496

I believe there is some error in my code which is introducing the bug but have not been able to identify it and would greatly appreciate any input as to why this is happening. There may also be a more efficient way to achieve my goal that I have overlooked. Many thanks for any help anyone can provide.

R variantannotation vcf scanvcfparam readvcf • 1.8k views
ADD COMMENT
0
Entering edit mode
@michael-lawrence-3846
Last seen 3.0 years ago
United States

The rownames are probably coming from the ID column in the VCF. When missing, the position is used. To get the positions, use the rowRanges() accessor. You can combine the INFO columns with the ranges. 

gr <- rowRanges(raw.exac.AC.adj.)
mcols(gr) <- info(raw.exac.AC.adj.)

An alternative would to use VRanges:

vr <- as(raw.exac.AC.adj., "VRanges")

If you really want a data.frame, you can coerce either of those, but please be aware that Bioconductor is more than a means of converting semantically rich files to semantically poor data structures.

df <- as.data.frame(gr)
df$chr.position <- as.character(gr)

 

 

ADD COMMENT
0
Entering edit mode

Thank you , that very elegantly solves my issue. I will keep that in mind in future.

ADD REPLY

Login before adding your answer.

Traffic: 755 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6