NA values in snpgdsDiss dissimilarity matrix
blackgore ▴ 10
@blackgore-3871

Hello,

Within SNPRelate, I have been trying to compute a dissimilarity matrix from input VCF data using the snpgdsDiss function. However, the resulting matrix contains NaN values for a small number of the 80 or so input samples, so I cannot go on to compute a clustering with snpgdsHCluster. The samples have between 1 and 219 variants each, but the affected samples are not exclusively the ones with few variants. Other than removing the affected samples from the study, is there anything else I can do to create a complete dissimilarity matrix?

 

 

library(SNPRelate)
library(gdata)   # provides read.xls

vcf_data <- file.path("VCFSorts", "multisample.vcf")
gds_data <- file.path("VCFSorts", "multisample.gds")

# Convert the VCF to GDS (replacing any stale copy), then open it
if (file.exists(gds_data)) file.remove(gds_data)
snpgdsVCF2GDS(vcf_data, gds_data, method = "biallelic.only")
snpgdsSummary(gds_data)
geno_data <- snpgdsOpen(gds_data)

pop_data <- read.xls("Sample Sheet.xlsx", sheet=1,header=TRUE)
pop_code <- pop_data[["Group"]]
pop_list <- read.gdsn(index.gdsn(geno_data, path="sample.id")) 

# show that the sample order is the same as the population order
print(cbind(pop_data, pop_code, pop_list))


# Run PCA - THIS WORKS FINE
pca <- snpgdsPCA(geno_data, num.thread=8)
pc.percent <- pca$varprop*100
head(round(pc.percent, 2))
 
# make a data.frame
tab <- data.frame(sample.id = pca$sample.id,
                 pop = factor(pop_code)[match(pca$sample.id, pop_list)],
                 EV1 = pca$eigenvect[,1],    # the first eigenvector
                 EV2 = pca$eigenvect[,2],    # the second eigenvector
                 stringsAsFactors = FALSE)
plot(tab$EV2, tab$EV1, pch=16, cex=2, col=as.integer(tab$pop), xlab="eigenvector 2", ylab="eigenvector 1")
legend("topright", legend=levels(tab$pop), pch=15, cex=1.5 , col=1:nlevels(tab$pop))



# Hierarchical Clustering  - FAIL
diss <- snpgdsDiss(geno_data, sample.id=NULL, snp.id=NULL, autosome.only=TRUE, remove.monosnp=TRUE,
                   maf=NaN, missing.rate=NaN, num.thread=6, verbose=TRUE)
hc <- snpgdsHCluster(diss, sample.id=NULL, need.mat=TRUE, hang=0.25)

Error in hclust(as.dist(dist), method = "average") : 
  NA/NaN/Inf in foreign function call (arg 11)

 

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 15.10

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_IE.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_IE.UTF-8   
 [6] LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_IE.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_IE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] SNPRelate_1.4.0 gdsfmt_1.6.2    gdata_2.17.0   

loaded via a namespace (and not attached):
[1] tools_3.2.2  gtools_3.5.0
 
 
zhengx ▴ 30
@zhengx-7950

Are you able to run the Identity-By-State analysis with snpgdsIBS? Are there also missing values in the IBS result?
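A rough sketch of that check: run snpgdsIBS with its defaults and scan the returned matrix for undefined entries. The example file returned by snpgdsExampleFileName() is used here only as a stand-in for multisample.gds, and the names n_nan and bad_samples are illustrative.

```r
library(SNPRelate)

# Assumption: the bundled example GDS file stands in for multisample.gds
geno_data <- snpgdsOpen(snpgdsExampleFileName())

# Pairwise identity-by-state; the result holds sample.id, snp.id and the ibs matrix
ibs <- snpgdsIBS(geno_data, num.thread = 1, verbose = FALSE)

# Count undefined entries and list the samples that have at least one
n_nan <- sum(is.na(ibs$ibs))
bad_samples <- ibs$sample.id[rowSums(is.na(ibs$ibs)) > 0]
print(n_nan)
print(bad_samples)

snpgdsClose(geno_data)
```

If bad_samples comes back empty, the IBS matrix is complete for that file; otherwise it names the samples to inspect further.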


Hello zhengx,

I ran the snpgdsIBS function on the geno_data object above. Like snpgdsDiss, it ran to completion, and yes, there are NaNs in the output, in the same positions in both matrices.
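NaNs at identical positions in both matrices would be consistent with, though this thread does not confirm it, sample pairs that share no SNP typed in both samples, so that a pairwise average becomes 0/0. The toy example below illustrates the idea in base R with an invented genotype matrix and a simple mean absolute difference; it is not SNPRelate's actual formula.

```r
# Toy genotypes (invented): rows = samples, columns = SNPs,
# values = alternate-allele counts, NA = missing call
geno <- rbind(
  s1 = c(0, 1, NA, NA),
  s2 = c(2, 1, NA, NA),
  s3 = c(NA, NA, 1, 2)
)

# Number of SNPs typed in both samples of each pair
typed <- !is.na(geno)
n_shared <- tcrossprod(typed * 1)

# Illustrative dissimilarity: mean absolute difference over jointly typed SNPs.
# With no jointly typed SNPs, the mean of an empty vector is NaN.
pair_diss <- function(a, b) mean(abs(a - b), na.rm = TRUE)

n <- nrow(geno)
d <- matrix(NA_real_, n, n, dimnames = list(rownames(geno), rownames(geno)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- pair_diss(geno[i, ], geno[j, ])
  }
}

# Pairs with an undefined dissimilarity (upper triangle only)
nan_pairs <- which(is.nan(d) & upper.tri(d), arr.ind = TRUE)
print(n_shared)
print(d)
print(nan_pairs)
```

Here s3 shares no typed SNP with s1 or s2, so exactly those two pairs are NaN. The same which(is.nan(...), arr.ind = TRUE) call can be applied to the real diss$diss matrix to list the affected pairs.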


Can you run "snpgdsSampMissRate" to calculate the missing rate per sample? Then you could identify which samples are causing the trouble.
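A minimal sketch of that suggestion, assuming the bundled example file as a stand-in for multisample.gds. The cutoff max_miss is a hypothetical value chosen for illustration, not a recommendation, and a high per-sample missing rate is only a hint: a pair can still lack jointly typed SNPs even when both samples pass the cutoff.

```r
library(SNPRelate)

# Assumption: the bundled example file stands in for multisample.gds
geno_data <- snpgdsOpen(snpgdsExampleFileName())
sample_ids <- read.gdsn(index.gdsn(geno_data, "sample.id"))

# Fraction of missing genotypes per sample, in the same order as sample.id
miss <- snpgdsSampMissRate(geno_data)
print(head(sort(setNames(miss, sample_ids), decreasing = TRUE)))

# Hypothetical cutoff: keep samples with at most 50% missing genotypes
max_miss <- 0.5
keep <- sample_ids[miss <= max_miss]

# Recompute the dissimilarity matrix on the retained samples only
diss <- snpgdsDiss(geno_data, sample.id = keep, num.thread = 1, verbose = FALSE)

# Cluster only if the matrix is complete, since hclust rejects NA/NaN
if (!anyNA(diss$diss)) {
  hc <- snpgdsHCluster(diss, need.mat = TRUE, hang = 0.25)
}

snpgdsClose(geno_data)
```

Passing sample.id = keep restricts the analysis without editing the GDS file, so the cutoff can be adjusted and the step rerun.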
