Question

Help with design matrix for RNA-seq data using edgeR

erik.burchard • 0

Hello all,

I was wondering if someone could help me create an appropriate design matrix for some RNA-seq data. There are 244 samples, 125 of phenotype "A" and 119 of phenotype "B", with 3 sampling stages ("early", "mid", and "late") and 4 tissue types. The relevant metadata columns are phenotype (pheno), tissue_type, sampling_stage, cultivar, and group. The group column is a concatenation of the phenotype and the sampling_stage, and it defines the primary comparison of interest here:

[Image: first few rows of the sample metadata table]

We are interested in finding DEGs between the two phenotypes within each sampling stage only (A_late - B_late, A_early - B_early, etc.). However, I'm not sure how to handle the other variables when creating a design matrix. The MDS plots show very strong clustering of the data by both pheno and cultivar (each cultivar has only one phenotype or the other), which we expected and were hoping to see. But there is also some degree of clustering by tissue_type. To further complicate things, tissue_type and sampling_stage are partly confounded: these are plant samples collected seasonally from the field, so some tissue types are only available during certain sampling stages.

When I fit the model like this...

design = model.matrix(~0 + group)

...I get a BCV of 1.35.

When I do it like this...

design.group.tissue_type = model.matrix(~0 + group + tissue_type)

...I get a slightly lower BCV of 1.32.

Here is an image of the plot:

[Image: BCV plot]

So my question is what is the appropriate model to use here? I think some of my confusion comes from the fact that I don't fully understand how to interpret BCV. It could be that the values I have are OK but I'm overthinking this since they seem a bit higher than those that I've seen in other forum threads and publications.

Thanks!

edgeR RNASeqData • 507 views

updated 4 months ago by Gordon Smyth • written 4 months ago by erik.burchard

score 1 · Answer 1 · 2024-09-06

This forum is intended to help with software usage and syntax rather than to answer research questions, and your question is definitely a research question rather than a question about how to use edgeR. In any case, it is not possible to give you good advice about how to conduct the analysis from just the first few lines of the metadata table and without more knowledge of the scientific background. It isn't entirely clear, for example, what the sampling unit of your data is. How are the cultivars chosen for each phenotype? How are samples chosen for each cultivar?

I will make a couple of brief comments that I hope will be helpful. The BCV values that you show are huge and do indicate a problem either with the data or with the analysis. Either the fitted model is omitting important covariates, or your data contains outliers or batch effects, or the data simply isn't reproducible. I would break the samples down into groups sharing the same tissue_type, sampling_stage and cultivar, fit a model with a combined group from all those variables, and see if there is reasonable reproducibility between samples with identical metadata. The models you've fitted so far seem too simple to me.
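A minimal sketch of the combined-group check suggested above, not a recommended final analysis. The data frame `meta`, its values (cultivars `cv1` to `cv4`, a single tissue `leaf`), and the count matrix name `count_matrix` are hypothetical stand-ins for the real 244-sample data. Only the column names pheno, tissue_type, sampling_stage and cultivar come from the question.

```r
# Toy metadata standing in for the real sample table (all values are made up).
meta <- data.frame(
  pheno          = rep(c("A", "B"), each = 6),
  cultivar       = rep(c("cv1", "cv2", "cv3", "cv4"), each = 3),
  sampling_stage = rep(c("early", "early", "late"), times = 4),
  tissue_type    = "leaf",
  stringsAsFactors = FALSE
)

# One combined factor: samples with identical tissue_type, sampling_stage
# and cultivar fall into the same level. Each cultivar has only one
# phenotype, so pheno is already implied by cultivar.
combo <- factor(paste(meta$tissue_type, meta$sampling_stage,
                      meta$cultivar, sep = "_"))

# Replicates per combined group. Groups with a single sample contribute
# nothing to the estimate of within-group variability.
reps <- table(combo)
singletons <- names(reps)[reps < 2]
print(reps)

# One-way layout design matrix: one column per combined group.
design <- model.matrix(~ 0 + combo)
colnames(design) <- levels(combo)

# With a real count matrix (hypothetical name `count_matrix`, genes in rows,
# samples in the same order as `meta`), the dispersion could then be
# re-estimated under this design. This part is skipped if the object or
# the edgeR package is not available.
if (exists("count_matrix") && requireNamespace("edgeR", quietly = TRUE)) {
  y <- edgeR::DGEList(counts = count_matrix, group = combo)
  y <- edgeR::calcNormFactors(y)
  y <- edgeR::estimateDisp(y, design)
  print(sqrt(y$common.dispersion))  # common BCV = sqrt(common dispersion)
  edgeR::plotBCV(y)
}
```

If the BCV stays very large even within groups of samples that share identical metadata, that would point to outliers, batch effects, or poor reproducibility rather than to missing covariates in the design.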