Differential expression analysis - additive genetic model (how to implement it)
2
0
Entering edit mode
jlerga • 0
@jlerga-9341
Last seen 9.0 years ago

Hello everyone:

I have a question related to the analysis of those genes that are differentially expressed depending on the genotype, employing different programs such as DESeq(2) or edgeR; specifically, my question is about how I can implement an additive genetic model for these analyses. 

I read the manuals of DESeq and edgeR, and I understand pretty well how to make simple comparisons, but I did not figure out how to apply an additive model.

 

General idea of the experiment

I am studying a specific human polymorphic inversion (that is, a segment of DNA that can be inverted in some individuals). An inversion is, therefore, like an allele, and each individual could be A/A (inverted allele/inverted allele), A/B (inverted/standard), B/B (standard/standard). So, inversions can be treated simply as SNPs or other type of alleles.

I have the genotype of several individuals and also their RNA-seq data. Therefore, I have two tables:

* A table of counts: the expression counts of each gene in each individual

  Individual A Individual B Individual C
Gene A 4 50 30
Gene B 1 24 24
Gene C 34 51 27

... (with more genes and more individuals)

* The design:

Individuals Population Sex Genotype
Individual A European Male A/A
Individual B African Female A/B
Individual C African Male B/B

... (with many more individuals)

 

Objective

We want to discover those genes that are differentially expressed depending on the genotype of the inversion.

 

Problem: Should I use an additive model?

I guess that to determine the appropriate test for association, we must first specify a genetic model. Currently, it is quite established in the scientific community that an additive model is the most correct model that should be used (not dominance, recessive or general model), which impose a structure in which each additional copy of the variant allele increases the response by the same amount.

In other words, if there are two alleles at a gene locus, then the additive effect (A/B) is half of the the difference between the mean of all cases that are homozygous for one version of the allele (A/A) compared to the mean of all cases that are homozygous for the other allele (B/B):

Genotypes A/A A/B B/B
Effect? 0 1 2


I read that a flexible way of carrying out such tests is by use of logistic regression. I do not know if I can carry out a logistic regression with the counts of a specific gene as the outcome variable and genotype as an explanatory variable. Anyway, I do not how to implement this in a particular program such as DESeq or edgeR.

 

Confounding factors

I have individuals from three populations and two genders. Therefore, I should correct the analysis for these important confounding factors, shouldn't I? I will put these factors in the design formula, this is clear.

 

R code (DESeq and edgeR)

I put here the R code I used; however, I did three comparations (A/A vs A/B, A/B vs B/B, and B/B vs A/A), which is not exactly what I wanted.

I guess that the important comparison should be A/A vs B/B, because if we assume an additive model, the greatest differences should be captured in this comparison. However, if we use a real additive model (with your help) maybe it should be more sensible and I will have more power to capture real differentially expressed genes.

I hope you can help me!
Many many thanks in advance!!!!
 

DESeq code

pairs<-combn(levels(design$Condition),2)

dds<-DESeqDataSetFromMatrix(countData=counts,
                            colData=design,
                            design=~Pop+Sex+Condition)

dds<-dds[rowSums(counts(dds))>1, ]

dds<-DESeq(dds)

list_comp_deseq<-list()
for(s in 1:length(pairs[1,])){
  first_c<-pairs[1,s]
  second_c<-pairs[2,s]
  res<-results(dds,contrast=c("Condition",first_c,second_c))
  res<-res[!is.na(res$padj),]
  res<-res[order(res$padj),]
  list_comp_deseq[[s]]<-res
  write.table(res, file=paste("DESeq_",first_c,"_",second_c,sep=""))
}

save(dds,file="dds")

 

edgeR code

y<-DGEList(counts,group=design$Condition)

keep <- rowSums(cpm(y)>1) >= 20
y <- y[keep, , keep.lib.sizes=FALSE]

y<-calcNormFactors(y)
design_edge<-model.matrix(~Sex+Pop+Condition,design)

y<-estimateDisp(y,design_edge)

fit<-glmFit(y,design_edge)

numbers<-grep("Condition",colnames(fit$coefficients))
pairs_edge<-list()
counter<-1
for(u in numbers[1]:numbers[length(numbers)]){
  cond<-substr(colnames(fit$coefficients)[u],10,12)
  cond<-c(cond,u)
  pairs_edge[[counter]]<-cond
  counter<-counter+1
}
names_edge<-names(table(design$Condition))

if(length(names_edge)==2){
  if(names_edge[1]==pairs_edge[[1]][1]){
    intercept<-names_edge[2]
  }else{
    intercept<-names_edge[1]
  }
}else{
  if(names_edge[1]!=pairs_edge[[1]][1] && names_edge[1]!=pairs_edge[[2]][1]){
    intercept<-names_edge[1]
    
  }else if(names_edge[2]!=pairs_edge[[1]][1] && names_edge[2]!=pairs_edge[[2]][1]){
    intercept<-names_edge[2]
   
  }else{
    intercept<-names_edge[3]

  }
}


list_comp_edge<-list()
for(s in 1:length(pairs_edge)){
 
  lrt<- glmLRT(fit, coef=as.numeric(pairs_edge[[s]][2]))
  write.table(topTags(lrt,n=50000), file=paste("edgeR_",pairs_edge[[s]][1],"_",intercept,sep=""))
  list_comp_edge[[s]]<-topTags(lrt,n=50000)
 
  if(s==2){
    vector<-rep(0,as.numeric(pairs_edge[[2]][2]))
    vector[as.numeric(pairs_edge[[1]][2])]<-(-1)
    vector[as.numeric(pairs_edge[[2]][2])]<-(1)
    lrt<- glmLRT(fit, contrast=vector)
    write.table(topTags(lrt,n=50000),  file=paste("edgeR_",pairs_edge[[1]][1],"_",pairs_edge[[2]][1],sep=""))
    list_comp_edge[[3]]<-topTags(lrt,n=50000)
  }
}

save(fit,file="fit")

 

Thanks in advance again!!!!!!!!

Jon

deseq2 edger differential gene expression additive model • 3.6k views
ADD COMMENT
5
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 3 hours ago
The city by the bay

Currently, it is quite established in the scientific community that an additive model is the most correct model that should be used (not dominance, recessive or general model).

Really? I can easily imagine situations where gene expression will not exhibit a linear response to dosage, so an additive model with respect to your genotype would do rather poorly. In fact, it's arguable that genes without a non-linear responses are probably more relevant, as they are probably being subjected to interesting regulatory processes. You'd overlook them if you forced everyone into an additive set-up.

I would stick with what you've already got, where you set up a design matrix with one coefficient per genotype.  This is more flexible than an additive model; not just for DE testing, but also in dispersion estimation, where an overly restrictive model will result in inflated estimates of the dispersion and loss of DE detection power.

I have individuals from three populations and two genders. Therefore, I should correct the analysis for these important confounding factors, shouldn't I?

Well, perhaps you should look at the MDS or PCA plot and see if the samples are segregating according to population and sex. If not, then you don't need to do anything, but if they are, then you should block on them in the design matrix as you've already done:

design_edge<-model.matrix(~Sex+Pop+Condition,design)

Note that blocking in the design only works if your genotypes are mixed across the confounding factors. If, for example, all A/A samples are male, then you won't be able to distinguish between the effect of sex and that of the A/A genotype.

I put here the R code I used; however, I did three comparations (A/A vs A/B, A/B vs B/B, and B/B vs A/A), which is not exactly what I wanted.

You've got too much code and it's difficult to read, so I'm just going to use your three-sample example table above to demonstrate. Assume that the table is stored in design. We do some editing to make it nicer, and then we make our design matrix:

design$Genotype <- sub("/", "", design$Genotype)
design.edge <- model.matrix(~0 + Population + Sex + Genotype, design)

Running colnames(design.edge) gives us:

[1] "PopulationAfrican"  "PopulationEuropean" "SexMale"           
[4] "GenotypeAB"         "GenotypeBB"        

The first three coefficients represent the blocking factors for the confounding variables, and can be ignored for the sake of simplicity in interpretation. The last two coefficients represent the log-fold change of AB over AA and BB over AA, respectively. If you want to test for AB vs AA or BB vs AA, you can do that with coef=4 or coef=5 to drop the respective coefficients in glmLRT. If you want to test for BB vs AB, you can do that by setting the contrast argument in glmLRT to:

con.bbvab <- makeContrasts(GenotypeBB-GenotypeAB, levels=design.edge)

If you want to test for any DE across the genotypes, you can do that with an ANOVA-like test by specifying coef=4:5. This is probably the most general test that will give you all genes that respond to changes in genotype. If you want to test for non-additive changes in expression with respect to the genotype (i.e., allele A dosage), you need to do something like:

con.nonadd <- makeContrasts(GenotypeBB/2 - GenotypeAB, levels=design.edge)

This asks if the AB vs. AA log-fold change is equal to half the BB vs. AA log-fold change, which is what you would expect under an additive model. Conversely, you can informally identify genes that exhibit additive effects by looking for those genes that are not significant under con.nonadd, but are still significant for the individual pairwise comparisons between genotypes.

P.S. Note that n=Inf will retrieve all genes in topTags. This is safer than using an arbitrarily large number - in your case, you use 50000, but this would fail to return everything if you actually had more than 50000 genes.

P.P.S. Note that design.edge has more coefficients than samples, such that the former values cannot be estimated. Hopefully, your real data set has more samples so that this is not a problem. For future reference, it is better to post an example that is big enough to represent what you're working with. For the majority of cases (and especially in this case), a table with three samples is too small.

ADD COMMENT
0
Entering edit mode

Wow, many many thanks, Aaron!!

I am sorry for the delay. I have been doing many other things and I completely forgot this issue.

I like your answer; however, I have some other ideas. The thing is, I understand that not all the models have to be additive. And the contrast AA vs BB should be the appropriate one if I want to obtain those genes that are differentially expressed, if I am assuming that AB genotype has an intermediate effect on gene expression.

Nonetheless, when I compare AA vs BB, I think I do not have complete power, because the major part of the samples are AB. When I asked this question, I was looking for a method that take the power of these AB samples in the comparison of AA vs BB in order to obtain those genes that are truly differentially expressed (I thought that an additive model was OK). I do not know if I am explaining this idea correctly.

In addition, I read somewhere that I should use a glm function (generalized linear model) if I would like to compare the three genotypes. Does it make sense?

If you have more ideas, I would be very grateful!!

 

 

ADD REPLY
0
Entering edit mode

Well, if you fit an additive model and it doesn't fit well for the AB samples (e.g., because of some heterozygote effect that results in non-intermediate behaviour, haplo(in)sufficiency such that expression in the heterozygote is equivalent to that of one of the the homozygotes, etc.), then you'll end up inflating your dispersion estimates. This will result in loss of power for these genes. Moreover, the EB approach shares information between genes, so any increase in the dispersions for the affected genes will propagate to all other genes in the data set. This will end up reducing DE detection power for all genes.

In general, it's not unreasonable to expect that the heterozygote has intermediate behaviour between the two homozygous extremes. However, to try to take advantage of this, you need to make some quantitative statements about how "intermediate" it is expected to be. This would require some fairly strong assumptions; for example, would the expression of a gene in the AB samples be equal to the average of the expressions in AA and BB? There's no guarantee (and in fact, it's probably unlikely) that changing the dosage of alleles between genotypes leads to directionally proportional effects in their regulatory targets.

Finally, I don't know why you're using glm, you should use glmFit if you're using edgeR.

ADD REPLY
0
Entering edit mode

In summary, I should make pairwise comparisons between genotypes (AAvsBB, ABvsAA and ABvsBB), shouldn't I? And also, it could be interesting make comparisons like (AA+AB)vsBB, (AA+BB)vsAB and (BB+AB)vsAA in order to see if there is dominance, co-dominance or it is a recessive model, isn't it?

However, I am still quite worried about the AAvsBB comparison, because the major part of the samples have the AB genotype, and it is like.. maybe I could have more DE detection power if there is a method that can use these AB samples in the AAvsBB comparison...

Well, I really appreciate your advice and your knowledge. Many thanks!

 

ADD REPLY
0
Entering edit mode
jlerga • 0
@jlerga-9341
Last seen 9.0 years ago

What about: (in edgeR)?

> my.contrast <- makeContrasts((AA+(1/2)*AB)-(BB+(1/2)*AB), levels=design)
> lrt <- glmLRT(fit, contrast=my.contrast)

??

ADD COMMENT
0
Entering edit mode

This won't do anything, as the AB terms will cancel out:

(AA+(1/2)*AB)-(BB+(1/2)*AB) = AA + (1/2)*AB - BB - (1/2)*AB = AA - BB
ADD REPLY

Login before adding your answer.

Traffic: 523 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6