Using edgeR::voomLmFit for the analysis of paired clinical RNA-seq samples with multiple cell types
1
0
Entering edit mode
Devin • 0
@507495b5
Last seen 5 hours ago
United States

Dear all,

I am looking for guidance using edgeR::voomLmFit to analyze clinical RNA-seq datasets from matched remission & disease samples. Briefly, the study design is:

20 patients, each patient has a remission and disease time point (biospecimen). Each biospecimen is sorted into two cell populations. The (ideal) objectives in layperson terms are to:

  • A) Identify transcriptomic changes between disease states and remission shared by all cell types and patients
  • B) Identify transcriptomic changes between disease states and remission for individual cell types, shared by all patients
  • C) Identify transcriptomic changes between disease and remission for individual patients and cell types

Code:

library(edgeR)

# Prepare metadata
set.seed(1)
metadata <- data.frame(
  BIOSPECIMEN=factor(paste0('S',1:20)),
  INDIVIDUAL=factor(paste0('Patient',c(rep(1,4),rep(2,4),rep(3,4),rep(4,4),rep(5,4)))),
  CELLTYPE=factor(paste0('Cell',c(rep(c(1,2),5)))),
  DISEASESTATE=factor(c('Remission','Remission','D1','D1',
                        'Remission','Remission','D1','D1',
                        'Remission','Remission','D2','D2',
                        'Remission','Remission','D3','D3',
                        'Remission','Remission','D1','D1'),
                      levels=c('Remission','D1','D2','D3')),
  DISEASESTATEBIN=factor(rep(c('Remission','Remission','D','D'),5)),
  RIN=rnorm(20,mean=6,sd=1.5),
  AGE=unlist(lapply(rnorm(5,mean=45,sd=8),rep,4))
)

# Center and scale RIN score
metadata$RINS <- scale(metadata$RIN)

# Generate random counts
counts <- sapply(1:nrow(metadata),function(i) {
  rnbinom(n=1000,size=sample(seq(0.1,1,by=0.1),1),prob=0.1)
})
colnames(counts) <- metadata$BIOSPECIMEN
row.names(counts) <- paste0('gene',1:nrow(counts))

Specific Question 1: Is there a preferred paradigm for voomLmFit for paired RNA-seq between blocking on individual and including individual in the model design (Option 1 vs Option 2 below):

### Analysis A) Option 1
y1 <- DGEList(counts=counts,samples=metadata)
design1 <- model.matrix(~0 + DISEASESTATE + RINS,
                        data=y1$samples)
keep1 <- filterByExpr(y1,design1)
y1 <- y1[keep1,,keep.lib.sizes=FALSE]
y1 <- normLibSizes(y1)
fit1 <- voomLmFit(y1,design1,block=y1$samples$INDIVIDUAL,sample.weights=TRUE)
contrasts1 <- makeContrasts('DISEASESTATED1 - DISEASESTATERemission',
                            levels=design1)
fit1 <- contrasts.fit(fit1,contrasts1)
fit1 <- eBayes(fit1)
res1 <- topTable(fit1,sort.by='P',n=Inf)

### Analysis A) Option 2
y2 <- DGEList(counts=counts,samples=metadata)
design2 <- model.matrix(~0 + DISEASESTATE + INDIVIDUAL + RINS,
                        data=y2$samples)
keep2 <- filterByExpr(y2,design2)
y2 <- y2[keep2,,keep.lib.sizes=FALSE]
y2 <- normLibSizes(y2)
fit2 <- voomLmFit(y2,design2,sample.weights=TRUE)
contrasts2 <- makeContrasts('DISEASESTATED1 - DISEASESTATERemission',
                            levels=design2)
fit2 <- contrasts.fit(fit2,contrasts2)
fit2 <- eBayes(fit2)
res2 <- topTable(fit2,sort.by='P',n=Inf)

# Comparison
comp <- merge(res1,res2,by='row.names')
plot(comp$logFC.x,comp$logFC.y)
message(cor(comp$logFC.x,comp$logFC.y))

Specific Question 2: For analysis B), is the best approach to create a combined group as below, or should I instead split the counts into individual cell types first? The differences in cell types (as assessed by MDS/PCA) are much greater than differences between patients or disease status:

# Analysis B)
metadata$CELLTYPE_DISEASESTATE <- paste(metadata$CELLTYPE,metadata$DISEASESTATE,sep='_')
y3 <- DGEList(counts=counts,samples=metadata)
design3 <- model.matrix(~0 + CELLTYPE_DISEASESTATE + INDIVIDUAL + RINS,
                        data=y3$samples)
keep3 <- filterByExpr(y3,design3)
y3 <- y3[keep3,,keep.lib.sizes=FALSE]
y3 <- normLibSizes(y3)
fit3 <- voomLmFit(y3,design3,sample.weights=TRUE)
contrasts3 <- makeContrasts('CELLTYPE_DISEASESTATECell1_D1 - CELLTYPE_DISEASESTATECell1_Remission',
                            levels=design3)
fit3 <- contrasts.fit(fit3,contrasts3)
fit3 <- eBayes(fit3)
res3 <- topTable(fit3,sort.by='P',n=Inf)

General questions:

  • Is it appropriate to account for differences in RNA integrity (RIN) score using the centered/scaled RIN values in the model design as shown above?
  • It seems reasonable to default to sample.weights=TRUE for clinical samples. Are there downsides to this approach?

This is my first post. Please forgive me for any issues with this post, and thank you in advance for any help.

edgeR limma voomLmFit • 61 views
ADD COMMENT
1
Entering edit mode
@gordon-smyth
Last seen 40 minutes ago
WEHI, Melbourne, Australia

Specific Question 1: Is there a preferred paradigm for voomLmFit for paired RNA-seq between blocking on individual and including individual in the model design (Option 1 vs Option 2 below):

Since the experimental design is balanced (every individual has 4 samples), I recommend including individual in the model matrix.

Specific Question 2: For analysis B), is the best approach to create a combined group as below, or should I instead split the counts into individual cell types first?

It makes no difference, provided that you end up fitting all four groups for each individual. I recommended the combined group for simplicity.

Is it appropriate to account for differences in RNA integrity (RIN) score using the centered/scaled RIN values in the model design as shown above?

There are some publications that recommend doing so, but I don't find it very useful. It is up to you.

It seems reasonable to default to sample.weights=TRUE for clinical samples. Are there downsides to this approach?

No, there are no real downsides. I generally recommend sample weights for human clinical samples.

ADD COMMENT

Login before adding your answer.

Traffic: 994 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6