Question

Subsetting info columns from a VCF and getting rsID for some row names?

0

Entering edit mode

emily.mccann • 0

@emilymccann-10861

Last seen 8.8 years ago

I am working in R with the VariantAnnotation package and the vcf file from the ExAC consortium, ExAC.r0.3.1.sites.vep.vcf (release 0.3). My aim is to extract simple data frames for the "AC_Adj" and "AN_Adj" info columns. I have used the below code to achieve this:

# define paramaters on how to filter the vcf file
AC.adj.param <- ScanVcfParam(info="AC_Adj")
AN.adj.param <- ScanVcfParam(info="AN_Adj")

# load ALL allele counts (AN) and alt allele counts (AC) from exac vcf file 
raw.exac.AC.adj. <- readVcf("/Volumes/Emilly\ 1TB/ExAC.r0.3.1.sites.vep.vcf", "hg19", param=AC.adj.param)
raw.exac.AN.adj. <- readVcf("/Volumes/Emilly\ 1TB/ExAC.r0.3.1.sites.vep.vcf", "hg19", param=AN.adj.param)

# extract just the allele counts and corressponding chr location with allele tags from exac vcf file - in dataframe/s4 class
sclass.exac.AC.adj <- (info(raw.exac.AC.adj.))
sclass.exac.AN.adj <- (info(raw.exac.AN.adj.))

# make the allele counts into a plain data frame - chr positions are only row names and not their own column
df.exac.AC.adj <- as.data.frame(sclass.exac.AC.adj)
df.exac.AN.adj <- as.data.frame(sclass.exac.AN.adj)

# make the chr positions an actual column and remove the row names to make it cleaner
df.exac.AC.adj$chr.position <- row.names(df.exac.AC.adj)
df.exac.AN.adj$chr.position <- row.names(df.exac.AN.adj)

row.names(df.exac.AC.adj) <- NULL
row.names(df.exac.AN.adj) <- NULL

However, while most of the row names are correct, some of the row names are returned are rsIDs rather than chr:position.

Essentially, a subset of the data returned looks like this:

AC_Adj chr.position
70 2 1:69395
71 9 1:69409
72 1985 rs140739101
73 4 1:69438
74 7 rs142004627
75 1 1:69474
76 1 1:69478
77 1 1:69486
78 2 1:69487
79 10 1:69489
80 59 rs150690004

While I would like the returned data frame to look like this:

AC_Adj chr.position
70 2 1:69395
71 9 1:69409
72 1985 1:69428
73 4 1:69438
74 7 1:69453
75 1 1:69474
76 1 1:69478
77 1 1:69486
78 2 1:69487
79 10 1:69489
80 59 1:69496

I believe there is some error in my code which is introducing the bug but have not been able to identify it and would greatly appreciate any input as to why this is happening. There may also be a more efficient way to achieve my goal that I have overlooked. Many thanks for any help anyone can provide.

R variantannotation vcf scanvcfparam readvcf • 1.9k views

ADD COMMENT • link updated 8.8 years ago by Michael Lawrence ★ 11k • written 8.8 years ago by emily.mccann • 0

score 0 · Answer 1 · 2016-06-08

0

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 3.3 years ago

United States

The rownames are probably coming from the ID column in the VCF. When missing, the position is used. To get the positions, use the rowRanges() accessor. You can combine the INFO columns with the ranges.

gr <- rowRanges(raw.exac.AC.adj.)
mcols(gr) <- info(raw.exac.AC.adj.)

An alternative would to use VRanges:

vr <- as(raw.exac.AC.adj., "VRanges")

If you really want a data.frame, you can coerce either of those, but please be aware that Bioconductor is more than a means of converting semantically rich files to semantically poor data structures.

df <- as.data.frame(gr)
df$chr.position <- as.character(gr)

ADD COMMENT • link 8.8 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Thank you , that very elegantly solves my issue. I will keep that in mind in future.

ADD REPLY • link 8.8 years ago emily.mccann • 0