StringTie + Ballgown: handling biological replicates
bhawley1991 ▴ 10

Hi all,

I've been trying to analyse an RNA-seq dataset, and I decided to try the newer HISAT2>StringTie>Ballgown approach instead of Tophat2>Cufflinks>CummeRbund etc.

I'm having real trouble working out how to handle my biological replicates, as there doesn't seem to be much documentation or discussion on these newer tools. It seems like most people would use Cuffnorm, and it's easy to see why, as you can very easily specify which replicates belong to each sample. I'm sure there's a way to do this in Ballgown, but I'm far too inexperienced to spot it, so any help would be fantastic.

Thanks in advance.

ballgown stringtie • 6.1k views
Alyssa Frazee ▴ 210

Ballgown handles biological replicates. The idea is to run StringTie on each replicate (either biological or technical) separately using the -B option (for "ballgown"), constructing the output directory structure as specified in the ballgown README (https://github.com/alyssafrazee/ballgown#loading-data-into-r), which will give you a separate output directory for each replicate. When the data is loaded into R from there, ballgown and the associated statistical tests (in "stattest") assume only that each sample (each separate output directory) is independent of the others. (So they can either be a set of technical replicates from one biological sample, or a set of biological replicates.)
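As a rough sketch of the layout described above, assuming StringTie was run with -B once per replicate and each run wrote to its own directory: the five .ctab names are the tables that -B produces, while the directory names and sample IDs are hypothetical, and the files are created empty only so that the check can run without real data.

```r
# Tables written by `stringtie -B` into each sample's output directory.
ctab_files <- c("e_data.ctab", "i_data.ctab", "t_data.ctab",
                "e2t.ctab", "i2t.ctab")

# Hypothetical layout: one directory per replicate under a common parent.
data_dir <- file.path(tempdir(), "ballgown_input")
sample_ids <- c("ctrl_rep1", "ctrl_rep2", "treated_rep1", "treated_rep2")
sample_dirs <- file.path(data_dir, sample_ids)

# Create empty placeholder files so this sketch runs without real data.
for (d in sample_dirs) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(d, ctab_files))
}

# Check that every replicate directory is complete before loading.
complete <- vapply(sample_dirs,
                   function(d) all(file.exists(file.path(d, ctab_files))),
                   logical(1))
stopifnot(all(complete))

# With real StringTie output, the directories could then be loaded with:
# library(ballgown)
# bg <- ballgown(samples = sample_dirs, meas = "all")
```

Checking for the tables up front is just a convenience; it makes a missing or misnamed replicate directory obvious before ballgown tries to read it.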

If you have both biological and technical replicates, one way to handle this with ballgown is to read in the data as you normally would (one directory per bio/tech rep), but include a column in "pData" denoting bio rep ID. Then you could combine expression values across tech reps (e.g. using average expression) to get a data set with one row per bio rep, and you could use that data set with the stattest function. 
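One way the averaging step could look, as a minimal sketch on toy data rather than a definitive recipe: the matrix stands in for an expression table such as texpr(bg, meas = "FPKM"), and the column names bio_rep and group are hypothetical. The averaging happens after the data is loaded and before stattest is called.

```r
# Toy FPKM matrix: 3 transcripts (rows) x 8 runs (columns).
# With real data this could come from texpr(bg, meas = "FPKM").
set.seed(1)
expr <- matrix(runif(24, 0, 100), nrow = 3,
               dimnames = list(paste0("tx", 1:3), paste0("run", 1:8)))

# Phenotype table with one row per run; `bio_rep` and `group` are
# hypothetical column names. Here each bio rep has 2 technical replicates.
pd <- data.frame(id = colnames(expr),
                 bio_rep = rep(c("A1", "A2", "B1", "B2"), each = 2),
                 group = rep(c("A", "B"), each = 4),
                 stringsAsFactors = FALSE)

# Average the technical replicates (columns) belonging to each bio rep.
cols_by_rep <- split(seq_len(ncol(expr)), pd$bio_rep)
expr_bio <- sapply(cols_by_rep,
                   function(i) rowMeans(expr[, i, drop = FALSE]))

# Collapse the phenotype table to one row per bio rep, in matching order.
pd_bio <- unique(pd[, c("bio_rep", "group")])
pd_bio <- pd_bio[match(colnames(expr_bio), pd_bio$bio_rep), ]
stopifnot(identical(colnames(expr_bio), pd_bio$bio_rep))

# The averaged table (one column per bio rep) could then be tested, e.g.:
# library(ballgown)
# stattest(gowntable = expr_bio, pData = pd_bio,
#          feature = "transcript", covariate = "group")
```

Passing the averaged matrix through the gowntable and pData arguments of stattest keeps the test at the level of biological replicates, so technical replicates are not counted as independent samples.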

 


Hi Alyssa,

I have a similar question to what was posted here, except I have 6 biological replicates (2 samples, 3 replicates each) and 4 technical replicates per biological replicate (for a total of 24). I have done as you stated for denoting the replicates in pData. How do I go about combining the expression values and getting the average expression? And at what step of the analysis do I do that?

Thanks. 


Hi Alyssa, I'm new to R and ballgown. I have 16 samples run through hisat2 with the --dta option and stringtie with the -B option. I made pheno_data and a ballgown dir containing the 16 sample dirs, each with the .ctab and .gtf files for that sample. I got ballgown to run OK and made .csv files for genes and transcripts. What I need to do now is tell ballgown how to handle the 16 samples. There are 2 biological reps per sample, two treatment groups (ctr and bmp2), and 4 time points. Could you give me some help on how to make the pheno_data csv file? I need to deal with the variance between the biological reps first, then the stats of the difference between ctr and bmp2 treatments, then the stats of the changes between time points and treatment. Thanks so much, enjoying the program. steveharris
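For a design like the one described (2 treatments, 4 time points, 2 biological replicates), a phenotype table could be sketched as follows. The time point labels, sample IDs, and file path are hypothetical; the IDs would need to match the actual StringTie output directory names. The model matrices follow the custom-model pattern (mod and mod0) that stattest accepts.

```r
# Hypothetical labels for the design described above:
# 2 treatments x 4 time points x 2 biological replicates = 16 samples.
design <- expand.grid(rep = 1:2,
                      time = paste0("t", 1:4),
                      treatment = c("ctr", "bmp2"),
                      stringsAsFactors = FALSE)

# One row per sample; `id` must match the sample directory names.
pheno_data <- data.frame(
  id = paste(design$treatment, design$time, paste0("rep", design$rep),
             sep = "_"),
  treatment = factor(design$treatment, levels = c("ctr", "bmp2")),
  time = factor(design$time),
  rep = design$rep
)

# Sorted on the assumption that the sample directories are listed
# alphabetically; the row order must match the order of the samples.
pheno_data <- pheno_data[order(pheno_data$id), ]

csv_path <- file.path(tempdir(), "pheno_data.csv")
write.csv(pheno_data, csv_path, row.names = FALSE)

# Nested models: test the treatment effect while adjusting for time.
mod  <- model.matrix(~ treatment + time, data = pheno_data)
mod0 <- model.matrix(~ time, data = pheno_data)

# library(ballgown)
# stattest(bg, feature = "transcript", meas = "FPKM", mod = mod, mod0 = mod0)
```

The biological replicates need no separate column in the model here: the two replicates in each treatment and time cell supply the variance estimate. To ask whether the treatment effect changes over time, the same pattern could be used with mod built from ~ treatment * time and mod0 from ~ treatment + time.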

jnpitt • 0

Alyssa, can you please demonstrate how you would add the bio rep ID to your built-in extdata, say to treat your 20 provided samples as 10 independent biological replicates from 2 different treatments, and then use stattest to look at the statistically significant changes between the 2 treatments?

jnpitt • 0

Just to answer my own question, from the ballgown docs:

pData(bg) = data.frame(id=sampleNames(bg), group=rep(c(1,0), each=10))

 

Here, group= assigns each sample to either group 1 or group 0; subsequent stattest calls that use group as the covariate compare groups 0 and 1.
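A short sketch of how that assignment feeds into the test. The first part uses hypothetical sample IDs so it runs anywhere; the guarded part uses the example bg object that ships with ballgown and only runs if the package is installed.

```r
# Twenty hypothetical sample IDs standing in for sampleNames(bg).
sample_ids <- sprintf("sample%02d", 1:20)

# First 10 samples in group 1, last 10 in group 0, as in the line above.
pheno <- data.frame(id = sample_ids, group = rep(c(1, 0), each = 10))
table(pheno$group)

# With ballgown installed, the same kind of table drives the comparison.
if (requireNamespace("ballgown", quietly = TRUE)) {
  library(ballgown)
  data(bg)  # example ballgown object shipped with the package
  pData(bg) <- data.frame(id = sampleNames(bg),
                          group = rep(c(1, 0), each = 10))
  results <- stattest(bg, feature = "transcript", meas = "FPKM",
                      covariate = "group")
  head(results)
}
```

The first column of the phenotype table holds the sample IDs, so building it from sampleNames(bg) keeps the rows in the same order as the loaded samples.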

 

 

Alyssa Frazee ▴ 210

Yep, the above is the correct answer. You can edit pData directly. Each column of the data frame is a covariate and each row is a sample; the group each sample belongs to should be denoted by a covariate (column) exactly as you wrote. 

jnpitt • 0

Another thing that wasn't clear is that ballgown also requires the sample IDs (the sample directory names) to be unique. For example, a samples vector

filelist <-c("/data/wildtype/sample1", "/data/wildtype/sample2","/data/wildtype/sample3", "/data/mutant/sample1","/data/mutant/sample2", "/data/mutant/sample3") 

when loaded into ballgown thus:

bg = ballgown(samples= filelist,meas='all')

will NOT be treated as independent samples...however renaming the directories thus will:

filelist <-c("/data/wildtype/sample1", "/data/wildtype/sample2","/data/wildtype/sample3", "/data/mutant/sample4","/data/mutant/sample5", "/data/mutant/sample6") 
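A quick way to catch this before loading, sketched under the assumption that the sample name is taken from the last component of each path (which is consistent with the behaviour described above); the prefixing scheme is only one hypothetical way to pick new names.

```r
filelist <- c("/data/wildtype/sample1", "/data/wildtype/sample2",
              "/data/wildtype/sample3", "/data/mutant/sample1",
              "/data/mutant/sample2", "/data/mutant/sample3")

# Assumption: sample names come from the last path component.
ids <- basename(filelist)
dup_ids <- unique(ids[duplicated(ids)])
dup_ids  # names shared by more than one directory

# Hypothetical naming scheme: prefix each sample with its parent directory
# to get distinct names that the directories could be renamed to.
unique_ids <- paste(basename(dirname(filelist)), ids, sep = "_")
stopifnot(anyDuplicated(unique_ids) == 0)
```

The directories themselves would still need to be renamed (or the StringTie output written to distinctly named directories in the first place) for the new names to take effect.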

 

 

 

