Hi everybody!
I'm just getting started with differential affinity analysis and have run into a series of problems (please forgive my English).
I'm interested in doing a differential affinity analysis on multiple H3K27ac ChIP samples from cancer cells. I have many cell lines, and the samples come from different labs, so samples from a particular cell line are not strictly replicates, and I don't have a real replicate for each sample (let's say some replicates were not good).
Setting that aside for a moment, my first question is how to identify differential activity in peaks between different cell subtypes: Luminal, Basal A and Basal B. One possibility, I think, is to make a table like the one below, where I use the cell subtype as my Condition. Each condition would then act as a single sample group whose "replicates" are all the real samples that belong to that subtype. I haven't settled on this. Alternatively, I could mix the labs and use the labs as replicates for each cell line. By the way, I'm interested in the peaks, not so much in the cell lines.
I know this may sound a little odd, but bear with me, because my actual problem appears later.
So this is the table I made:
SampleID | Tissue | Factor | Condition | bamReads | ControlID | bamControl | Peaks | PeakCaller |
BT549-Schor | BT549 | H27K27Ac | BaB | /home/bam/BT549_Young | BT549-Young-c | /home/bamC/BT549_Young | /home/bed/BT549_Young | bed |
BT549-Young | BT549 | H27K27Ac | BaB | /home/bam/BT549_Young | BT549-Young-c | /home/bamC/BT549_Young | /home/bed/BT549_Young | bed |
HCC1569 | HCC1569 | H27K27Ac | BaA | /home/bam/HCC1569 | HCC1569-c | /home/bamC/HCC1569 | /home/bed/HCC1569 | bed |
HCC1569-Schor | HCC1569 | H27K27Ac | BaA | /home/bam/HCC1569 | HCC1569-c | /home/bamC/HCC1569 | /home/bed/HCC1569 | bed |
MDAMB231-Arc | MDAMB231 | H27K27Ac | BaB | /home/bam/MDAMB231_Hardy | MDAMB231-Hardy.merged-c | /home/bamC/MDAMB231_Hardy | /home/bed/MDAMB231_Hardy | bed |
MDAMB231-Hardy.merged | MDAMB231 | H27K27Ac | BaB | /home/bam/MDAMB231_Hardy | MDAMB231-Hardy.merged-c | /home/bamC/MDAMB231_Hardy | /home/bed/MDAMB231_Hardy | bed |
MDAMB468-Arc | MDAMB468 | H27K27Ac | BaA | /home/bam/MDAMB468_Young | MDAMB468-Young-c | /home/bamC/MDAMB468_Young | /home/bed/MDAMB468_Young | bed |
MDAMB468-Young | MDAMB468 | H27K27Ac | BaA | /home/bam/MDAMB468_Young | MDAMB468-Young-c | /home/bamC/MDAMB468_Young | /home/bed/MDAMB468_Young | bed |
MCF7-Hardy | MCF7 | H27K27Ac | Lu | /home/bam/MCF7_Hardy | MCF7-Hardy-c | /home/bamC/MCF7_Hardy | /home/bed/MCF7_Hardy | |
MCF7-Schor | MCF7 | H27K27Ac | Lu | /home/bam/MCF7_Schor | MCF7-Schor-c | /home/bamC/MCF7_Schor | /home/bed/MCF7_Schor | |
And this is the script I have so far (note that I used a shortened table just to test that it can be read):
> GeneClusterDBA<- dba(sampleSheet="lines.name-type7.csv")
MDAMB468-Young MDAMB468 H27K27Ac BaA NA bed
HCC1569 HCC1569 H27K27Ac BaA NA bed
MDAMB231-Hardy.merged MDAMB231 H27K27Ac BaB NA bed
BT549-Young BT549 H27K27Ac BaB NA bed
....
> GeneClusterDBA
4 Samples, 22384 sites in matrix (58575 total):
ID Tissue Factor Condition Caller Intervals
1 MDAMB468-Young MDAMB468 H27K27Ac BaA bed 34448
2 HCC1569 HCC1569 H27K27Ac BaA bed 32683
3 MDAMB231-Hardy.merged MDAMB231 H27K27Ac BaB bed 3091
4 BT549-Young BT549 H27K27Ac BaB bed 30053
....
> GeneClusterDBA<-dba.count(GeneClusterDBA)
> GeneClusterDBA
4 Samples, 22384 sites in matrix:
ID Tissue Factor Condition Caller Intervals
1 MDAMB468-Young MDAMB468 H27K27Ac BaA counts 22384
2 HCC1569 HCC1569 H27K27Ac BaA counts 22384
3 MDAMB231-Hardy.merged MDAMB231 H27K27Ac BaB counts 22384
4 BT549-Young BT549 H27K27Ac BaB counts 22384
FRiP
1 0.37
2 0.44
3 0.09
4 0.43
.....
First, I don't know whether the NAs that appear when the program reads the .csv are normal. Maybe there is something the program cannot read.
Second, I don't know whether I should have named the conditions column "DBA_CONDITION" instead of "Condition", so that I could call dba.contrast as shown below without getting a warning along the lines of "No contrasts have been made".
How can I use "DBA_CONDITION"?
GeneClusterDBA<-dba.contrast(GeneClusterDBA, categories=DBA_CONDITION, block=...)
The three dots after "block=" are there because I think I should use a blocking factor, but I don't really understand how to use one. If I want to compare Lu against BaB and BaA, then BaB against Lu and BaA, and then BaA against Lu and BaB, do I have to run three analyses (first blocking Lu, then BaB and finally BaA)? Or can I make all the possible comparisons in the same analysis?
Also, when using "block", I don't know whether I have to specify a particular column name or something similar. How does the program know which factor I want to block? Should I use a mask? Is that the only possibility?
Do I also have to specify a column of replicates, treating as replicates the samples grouped under the first possibility I described above (one sample group per condition)? Will the program understand this even though the Tissue column has different names, many of which represent the same condition?
Finally, where can I read more about edgeR and DESeq2? I would like to understand them better and maybe learn how to change their parameters.
Sorry for all my basic questions, I'm really lost with this.
I would very much appreciate your help.
Thank you in advance,
Camila