I have a highly-related set of samples that I am attempting to run GENESIS on for association analysis of a phenotype. When I initially ran pcair on the genodata with my kinship matrix my initial pcair result had only 3 PC's. This appears to be due to the related set containing nearly all individuals and only 3 individuals assorting into the unrelated set. After relaxing my kinship threshold to kin.thresh = 0.077 I was able to generate a pcair structure with 20 PC's, our general goal for analysis. Using pcair in ScanAnnotationDataFrame and attempting to run fitNullMM with resulting ScanAnnot data structure, I receive an error. Here is the full output of running fitNullMM:
nullmod <- fitNullMM(scanData = scanAnnot, outcome = p, covars = c("pc1"), covMatList = covMatList, family=varType)
Reading in Phenotype and Covariate Data...
Fitting Model with 410 Samples
Computing Variance Component Estimates using AIREML Procedure...
Sigma^2_Kin Sigma^2_E logLik RSS
Error in chol.default(Sigma) :
the leading minor of order 402 is not positive definite
I have several questions. 1) Is reducing relaxing the kin.thresh during pcair an appropriate way to deal with the relatedness in these data? 2) Is there a more appropriate way to analyze highly-related individuals for GWAS using GENESIS? -- I would prefer to stick with GENESIS as it is our labs current standard for analysis. 3) What does this error message mean and how do I go about correcting the problem? -- I would appreciate any insight into what is going on here. I'm not even certain what data-structure is causing the problem or where in the process.
Thank you for any help you can provide.
Note: I am using version GENESIS_2.4.0 on R 3.3.1 and am currently unable to update to the current version. (I'm in process of upgrading but I need to run with the current tool-chain right now).
Hello.
1) Regarding the very high levels of relatedness: what estimator are you using for your initial kinship coefficient estimates to run PC-AiR? It's either the case that all of your samples are truly very related to each other (so much so that only 3 mutually unrelated samples can be identified), or it's possible that your initial estimator has overestimated the relatedness, causing you to over filter your unrelated set. We typically use the KING estimator for an initial kinship estimate; it is implemented in the SNPRelate R package that also uses GDS genotype files. I might try the following iterative process:
2) Regarding 20 PCs: I wouldn't feel obligated to use a pre-defined number of PCs for your analysis. I would generally recommend looking at scatter plots, parallel coordinates, and the eigenvalues of the PCA to attempt to determine how many PCs are actually capturing population structure. If you have any reported race or ancestry information, that can be helpful.
3) The error message basically means that the covariance matrix of your phenotype (\Sigma = \sigma^2g*KIN + \sigma^2e*Identity) is non-positive definite; i.e. not invertible. This is occurring in the process of running AI-REML to fit the model. I've seen this come up occasionally in other analyses. There are a couple of things to look at for possible solutions:
Let me know what you find, and I'll be happy to try to help troubleshoot further.
Matt
Thanks for getting back to me so quickly!
1) I'm using KING 2.1.6 for the relatedness estimate but not from within R, at the command line. I do have evidence to suggest that the samples should have a fairly high degree of relatedness; they come from a small island population and the study design is likely to have contributed to a more related sample. I'll try using the work-flow you have suggested here and see if that helps.
2) I agree that the 20 PC's I was aiming for is somewhat arbitrary but I am a little uncomfortable only using three. I have 410 individuals. How many PC's would you recommend in this situation?
3) I'm uncertain whether the covariates are co-linear but I will look into it. I believe I have removed all of the duplicate samples but I will double check that this is the case. I'll try your suggestions and let you know how it goes.
One of my colleagues suggested that I may be filtering too much out of the data-set due to an incorrect assumption I had made in my QC steps. I'm also testing this. Thank you again for your help!
Matt thanks for getting back to me so quickly. I really appreciate all of your help. Unfortunately, I wasn't able to pin down exactly what the problem was.
I went back through all of my QC steps and realized that I had made a couple of errors. When I returned to my GENESIS analysis after redoing the QC on my data everything seems to work. As a secondary note, I'm now getting more reasonable relatedness estimates out of KING, with approximately half of my data in the related set for PC-AiR.
Again, thanks for your help!