Question

Can DESeq correlate genes expression patterns with a numerical sample factor?

0

Entering edit mode

matt.arno • 0

@mattarno-8491

Last seen 9.7 years ago

United Kingdom

Hi - is it possible to correlate gene expression patterns with a linear numerical factor in DESeq? Here's the background: I have used DESeq to analyse gene expression in 40 RNAseq samples; three sample groups.

I have noticed on the PCA that gender is a very strong determinant (=PC1) and one of the other components (PC3) also separates the samples relatively well, into 2 rough clusters, but this clustering pattern or sample grouping does not correlate with any of the known factors I have for the samples. I have an image which shows this pattern - not sure how to upload/attach it though; the image button wants a URL but this is a PNG file (not on the web).

I was thinking of either splitting the samples into two groups (low and high) based on an arbitrary cutoff roughly half way along the PC, or perhaps using the principle component factor value (from the pca$x table) to find genes that correlate their expression with this value across the samples, to find the genes responsible for this stratification in PC3.

This is kind of a long shot as my main factor(s) didn't show any significant changes, and I'm wondering whether this factor could show some biological significance in itself (depending on the types of genes changed) or by accounting for it in the design model matrix, and removing it's individual effects which could be muddying the waters for my main factors, e.g. ~gender+PC3_factor+condition.

As always, here's my disclaimer: I'm learning this as I go along, no formal training in maths or stats, just a biologist moving toward bioinformatics as a set of skills essential for my field. I hope I've asked the question in a clear and concise way and not broken too many rules. As I don't have a specific problem or error with a package there is no code or sessionInfo() to include (I don't think).

Cheers for any help or input!

Matt

deseq2 correlation deseq linear • 2.7k views

ADD COMMENT • link updated 9.7 years ago by Ryan C. Thompson ★ 7.9k • written 9.7 years ago by matt.arno • 0

score 2 · Answer 1 · 2015-08-14

2

Entering edit mode

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 6 months ago

Icahn School of Medicine at Mount Sinai…

First of all, yes, you can use both numerical and categorical covariates. But rather than try to use PCA as a method for batch effect correction (which it is not), why not use a package such as sva or RUVSeq that is specifically designed for that purpose?

ADD COMMENT • link 9.7 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

here's a pointer for sva + DESeq2

http://master.bioconductor.org/help/workflows/rnaseqGene/#batch

ADD REPLY • link 9.7 years ago Michael Love 43k

0

Entering edit mode

thanks - i'll look into that. I wasn't using PCA for batch error correction, just to examine sample relationships. Just to clarify: I don't think this is a batch effect per se, as there were no batches: all samples were processed together. no lane effect as all samples were spread equally across 8 lanes of single flowcell. it's weird, and that's why I think it may correlate with a as-yet-unknown sample characteristic. Samples were collected years ago, so I'm not sure how easy it will be to obtain extra info about them (to find a factor that correlates with this PCA clustering pattern). I suppose you could classify it as Unwanted Variation (à la RUVSeq), so I will give that a shot. Thanks again! Matt

ADD REPLY • link 9.7 years ago matt.arno • 0

0

Entering edit mode

just in case you are interested in my outcome with this? (might be helpful for others looking to do same thing)...

Firstly, I tried the sva method, followed the code exactly as Michael's guide on the RNAseqGene page, and here is my head(results(dds),4): ranked on pvalue

> head(resSva.O, 4)
log2 fold change (MAP): condition Non-ASD vs ASD 
Wald test p-value: condition Non-ASD vs ASD 
DataFrame with 4 rows and 6 columns
          baseMean log2FoldChange      lfcSE      stat       pvalue      padj
         <numeric>      <numeric>  <numeric> <numeric>    <numeric> <numeric>
TAS2R13  53.167524     -0.4453818 0.11195912 -3.978075 6.947541e-05 0.7090342
KANTR   149.987511     -0.3153299 0.08045521 -3.919323 8.879820e-05 0.7090342
POU4F1    1.210617      0.2682640 0.06850487  3.915984 9.003608e-05 0.7090342
SLPI     33.024806      0.4240727 0.11039906  3.841271 1.223988e-04 0.7229178

not great padj values...

however, when I look at different contrasts, as I have three groups under 'condition'...

> head(resSva.Low.non.O)

log2 fold change (MAP): condition Low Total Cast vs Non-ASD 
Wald test p-value: condition Low Total Cast vs Non-ASD 
DataFrame with 6 rows and 6 columns
           baseMean log2FoldChange      lfcSE      stat       pvalue        padj
          <numeric>      <numeric>  <numeric> <numeric>    <numeric>   <numeric>
FCRL1    1098.90425      0.5110747 0.09597991  5.324809 1.010592e-07 0.001317307
MIR600HG  352.42715      0.4183308 0.08300525  5.039811 4.659913e-07 0.003037098
PTPRK     142.01688      0.4933695 0.10028901  4.919478 8.677551e-07 0.003462373
KIAA0125  407.44631      0.4316904 0.08846663  4.879698 1.062485e-06 0.003462373
TSTD3      22.97953      0.5605473 0.11698908  4.791450 1.655803e-06 0.004316678
AFF3      541.36563      0.4043529 0.08737233  4.627929 3.693415e-06 0.008023943

> head(resSva.ASD.Low.O)
log2 fold change (MAP): condition ASD vs Low Total Cast 
Wald test p-value: condition ASD vs Low Total Cast 
DataFrame with 6 rows and 6 columns
               baseMean log2FoldChange      lfcSE      stat       pvalue         padj
              <numeric>      <numeric>  <numeric> <numeric>    <numeric>    <numeric>
FOLR3        341.296934      0.6076173 0.10955879  5.546039 2.922144e-08 0.0004840532
COL19A1      420.415327     -0.4130494 0.08880545 -4.651172 3.300545e-06 0.0200067582
TPD52        309.831782     -0.3071897 0.06632051 -4.631896 3.623319e-06 0.0200067582
ANKRD20A11P    7.549347      0.4524846 0.10032811  4.510048 6.481296e-06 0.0245543233
AFF3         541.365633     -0.3407873 0.07604289 -4.481515 7.411507e-06 0.0245543233
FCRL1       1098.904245     -0.3645985 0.08498879 -4.289960 1.787052e-05 0.0451427384

...I get great padj! because certain contrasts give more significant results, I guess this is because the surrogate variables only affect some of the groups/contrasts and not all. which is interesting.

So thanks very much, Ryan and Michael: you've saved my day.

Cheers,

matt

ADD REPLY • link 9.7 years ago matt.arno • 0