Question

ballgown custom models

1

Entering edit mode

sbcn ▴ 80

@sbcn-4752

Last seen 2.7 years ago

Spain

Dear all,

I am using for the first time "ballgown" for differential expression of transcripts (after tophat2 + Stringtie).

I do have troubles interpreting the results (although it is not really ballgown specific but rather my confusion regarding complex/custom models).

Background:
I am working with 58 samples extracted from mice, so I have a "paired" effect to take into account.
Regardless of the paired effect, the samples are assigned to different experimental groups: parts of the brain (x 4), gender (x 2), and time points (x 3 , but I am at the moment not interested in a time course analysis).

The phenotype information is set as:

pData(bg) = data.frame(id=sampleNames(bg), brain=brainPart, gender=genderList, time=timePoints, pairs=samplePairs)

I run:

1.
stat_results = stattest(bg, feature='transcript', meas='FPKM', covariate='brain', adjustvars='pairs')

2.
mod = model.matrix(~ pData(bg)$brain + pData(bg)$gender + pData(bg)$time)
mod0 = model.matrix(~ pData(bg)$gender + pData(bg)$time)
custom_results = stattest(bg, feature='transcript', meas='FPKM', mod0=mod0, mod=mod, adjustvars="pairs")

Are most significant results to be interpreted as:
stat_results: "transcripts that are most likely to change in expression across the 4 brain regions, regardless of gender and time"
custom_results: "transcripts that are most likely to change in expression across the 4 brain regions, taking into account gender and time"
?

Sorry if I'm completely confused here.

Also, I would like to know if it is possible to perform a simple pairwise comparison between 2 parts of the brain, even though the "brain" column contains 4 variables?

Thanks a lot!

Best,

sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Scientific Linux release 6.5 (Carbon)

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] ballgown_2.2.0

loaded via a namespace (and not attached):
[1] AnnotationDbi_1.32.3       XVector_0.10.0
[3] GenomicRanges_1.22.4       BiocGenerics_0.16.1
[5] splines_3.2.3              zlibbioc_1.16.0
[7] GenomicAlignments_1.6.3    IRanges_2.4.7
[9] BiocParallel_1.4.3         xtable_1.8-2
[11] lattice_0.20-33            GenomeInfoDb_1.6.3
[13] tools_3.2.3                grid_3.2.3
[15] SummarizedExperiment_1.0.2 parallel_3.2.3
[17] nlme_3.1-122               Biobase_2.30.0
[19] mgcv_1.8-9                 DBI_0.3.1
[21] genefilter_1.52.1          lambda.r_1.1.7
[23] futile.logger_1.4.1        survival_2.38-3
[25] Matrix_1.2-3               sva_3.18.0
[27] rtracklayer_1.30.3         RColorBrewer_1.1-2
[29] S4Vectors_0.8.11           futile.options_1.0.0
[31] bitops_1.0-6               RCurl_1.95-4.7
[33] RSQLite_1.0.0              limma_3.26.8
[35] Biostrings_2.38.4          Rsamtools_1.22.0
[37] stats4_3.2.3               XML_3.98-1.3
[39] annotate_1.48.0

ballgown linear model model matrix custom model • 3.5k views

ADD COMMENT • link 9.1 years ago • updated 9.0 years ago sbcn ▴ 80

score 1 · Answer 1 · 2016-03-27

Let's see if I can clarify some of the ballgown-specific questions here!

In "stat_results", your specification will perform a differential expression test between brain regions, holding the "pairs" variable constant (this is equivalent to comparing a model including only pairs + library size to a model including pairs + library size + brain). This will *not* do any adjustment for time or gender. The top transcripts here would then be "transcripts showing the most significant expression differences across the 4 brain regions, ignoring gender and time and holding "pairs" constant."

It's not totally clear to me from this description what the pairs variable is, but I'm assuming you have two technical replicates for each mouse (?). In that case, R will interpret "pair" as approximately "is this sample replicate A or replicate B for this mouse," which means it will assume that being replicate A vs. replicate B is meaningful in some way. I am guessing this is not exactly what you want. Instead I would suggest either (a) combining expression measurements across technical replicates (i.e., taking the average), or (b) including "mouse" as a variable instead and adjusting for that. (Instead of labeling replicate A vs. replicate B, have mouse ID as a covariate.) If you have more questions on this model setup specifically, I'd recommend talking to a statistician at your institution as this is more a question of how to set up the linear model comparison and less a ballgown-specific question.

In "custom_results", your specification will perform a differential expression test between brain regions, holding the "time" and "gender" variables constant and ignoring pairs (and library size). Note that (per the documentation in ?stattest, in the 'Details' section in the paragraph beginning with "'mod' and 'mod0'...") the "adjustvars" argument is ignored when you provide custom models. Specifically (from the docs):

"...if you supply ‘mod’ and ‘mod0’, ‘covariate’, ‘timecourse’, ‘adjustvars’, and ‘df’ are ignored, so make sure your covariate of interest and all appropriate confounder adjustments, including library size, are specified in ‘mod’ and ‘mod0’."

The top transcripts in "custom_results" as it stands would be "transcripts showing the most significant expression differences across the 4 brain regions, holding gender and time constant and ignoring pairs."

As for pairwise comparisons between two brain regions, I'd recommend subsetting your ballgown object to only the two brain regions you are interested in (see ?subset -- there is a subset method for ballgown objects). An example would be:

`bg_regions12 = subset(bg, 'brain == 1 | brain == 2', genomesubset=FALSE)`

(to, e.g., subset to just regions 1 and 2 of the brain.) Then you could use "stattest" on "bg_regions12."

Hope this helps!

score 0 · Answer 2 · 2016-04-22

Hi Alyssa,

It definitely helps a lot, thanks!

I understand much better now and was able to run the analysis with more confidence.

What I mean by pairs - maybe I shouldn't name it this way - are not exactly technical replicates: different regions in the brain from the same animal were sampled. Each of these regions will be compared as they are of interest, but as they come from the same animal, it is a bias to take into account.

So essentially for animal A we will take regions 1 and 2, and from animal B we will take regions 1 and 2 as well, and so on. But when we check the difference between region 2 and 1, we want to take into account the fact that there might be an "individual/animal" bias.

So that is what I called pairs and I wanted to introduce in the model: samples from the same animal but that represent different experimental conditions.

I don't know if I am clear :-)

In any case, thanks a lot for your help!