SNPRelate handling of missing genotypes
0
0
Entering edit mode
serpalma.v ▴ 60
@serpalmav-8912
Last seen 2.8 years ago
Germany

Hello!

I am trying to understand how SNPRelate operates under the hood when samples have missing values.

Specifically, in my VCF I have 150 samples, split into 6 groups, 25 samples each (for each group, 10 samples were sequenced at 30x and 15 at 5x). When I conduct PCA (snpgdsPCA), I see samples cluster according to their groups, as follows:

enter image description here

However, there is also a substructure within groups according to the sequence coverage (and therefore the missing rate).

This gives the PCA plot a star-like shape, where the 30x-samples form a blob towards the margins of the plot and the 5x-samples form a line towards the center of plot.

enter image description here

There is one group where this blob-line pattern does not occur (black-group). Samples in this group have much less missingness.

I try to replicate the PCA using the prcomp function in R, using the genotype count matrix I extract as follows:

seqVCF2GDS(vcf, "vcf.gds") #convert VCF to gds
genofile <- seqOpen("vcf.gds") # open gds
geno_cts <- read.gdsn(index.gdsn(genofile, "genotype/data")) # get allele counts (separated by chromosomes)

# Since for each SNP, there are two vectors with allele counts, add them up to get:
# homref = 0, het = 1, homalt = 2
# missing (.) encoded as 3, therefore, 6 = ./.
gt_cts_ma <- matrix(nrow = max(dim(geno_cts)), ncol = 150)

for(i in 1:max(dim(geno_cts))){
  gt_cts_ma[i,] <- geno_cts[,,i] %>% apply(2, sum) 
}

colnames(gt_cts_ma) <- read.gdsn(index.gdsn(genofile, "sample.id"))

gt_cts_ma_pca <- prcomp(t(gt_cts_ma))

plot(gt_cts_ma_pca$x[,"PC1"], gt_cts_ma_pca$x[,"PC2"])

...but the plot does not resemble the SNPRelate PCA plot at all.

enter image description here

Looking at the function snpgdsPCA, the function gnrPCA is invoked, but I cannot find what it does anywhere.

Could you please tell me where I could learn more about the detailed steps used by snpgdsPCA in order to figure out what is behind the pattern I observe in my data?

SNPRelate • 1.1k views
ADD COMMENT

Login before adding your answer.

Traffic: 615 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6