I have MeDIP-seq results, and I need to define a state for each genomic position (each one corresponding to a "CpG" site) that varies between samples. I am asking because the available tools, such as "MeDIPS", assign a status to a "CpG window" and not to individual CpGs. When I tried to set a window size of 2 (ws=2), the script was killed by bash with a synthesis error.
This led me to make my own list of CpGs.
I have a table like the one below. The columns are: chromosome (chr), the start and end position of the base of interest (start, end), the expected coverage from the input (depth), and the observed coverage in 6 different animals (depth1-depth6).
Example:
data <- "chr start end depth depth1 depth2 depth3 depth4 depth5 depth6
chr1 3273 3273 7 200 35 1 200 850 0
chr1 3274 3274 3 50 25 5 300 1500 2
chr1 3275 3275 8 600 15 8 100 300 5
chr1 3276 3276 4 30 2 10 59 20 0
chr1 3277 3277 25 20 7 4 600 45 0"
data <- read.table(text = data, header = TRUE)
I need to add a column with the state of each row. The states are: often methylated, alternately methylated and rarely methylated.
To do that, I first need to normalize the depth between samples so that the values can be compared between individuals. Second, I have to define the ranges that separate the states (for now, any range is acceptable).
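A minimal sketch of the first step, assuming simple library-size scaling (counts per million) is an acceptable normalization. It uses base R only. The names `dat`, `counts` and `cpm_manual` are illustrative, and on real data the column totals should come from the whole genome, not from five rows.

```r
# Toy table from the example in this post.
txt <- "chr start end depth depth1 depth2 depth3 depth4 depth5 depth6
chr1 3273 3273 7 200 35 1 200 850 0
chr1 3274 3274 3 50 25 5 300 1500 2
chr1 3275 3275 8 600 15 8 100 300 5
chr1 3276 3276 4 30 2 10 59 20 0
chr1 3277 3277 25 20 7 4 600 45 0"
dat <- read.table(text = txt, header = TRUE)

# All coverage columns: the input (depth) and the six animals (depth1-depth6).
depth_cols <- grep("^depth", names(dat), value = TRUE)
counts <- as.matrix(dat[, depth_cols])

# Library size = total coverage per column.
# Assumption: here it is the sum over the toy rows only.
lib_size <- colSums(counts)

# Counts per million: divide each column by its total and scale by 1e6.
cpm_manual <- sweep(counts, 2, lib_size, "/") * 1e6

# Put the coordinates back next to the normalized values.
norm <- cbind(dat[, c("chr", "start", "end")], round(cpm_manual, 1))
print(norm)
```

After this step every column sums to one million, so values are on the same scale across animals regardless of how deeply each one was sequenced.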
I have also looked at this function from the "edgeR" package for R, which is used to normalize RNA-seq data and appears to be used in MeDIPS too:
calcNormFactors(object, method=c("TMM","RLE","upperquartile","none"), refColumn = NULL,
logratioTrim = .3, sumTrim = 0.05, doWeighting=TRUE, Acutoff=-1e10, p=0.75)
but I have not managed to apply it to my data yet.
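A sketch of how this function could be applied to a table like mine, assuming the edgeR package from Bioconductor is installed and that the coverage columns can be treated as a count matrix. `DGEList()`, `calcNormFactors()` and `cpm()` are edgeR functions; the object names (`dat`, `counts`, `y`, `norm_cpm`) are only illustrative. TMM factors estimated from five rows are not meaningful, so this only shows the mechanics.

```r
# Toy table from the example in this post.
txt <- "chr start end depth depth1 depth2 depth3 depth4 depth5 depth6
chr1 3273 3273 7 200 35 1 200 850 0
chr1 3274 3274 3 50 25 5 300 1500 2
chr1 3275 3275 8 600 15 8 100 300 5
chr1 3276 3276 4 30 2 10 59 20 0
chr1 3277 3277 25 20 7 4 600 45 0"
dat <- read.table(text = txt, header = TRUE)

# Count matrix: one row per CpG position, one column per library.
counts <- as.matrix(dat[, grep("^depth", names(dat))])
rownames(counts) <- paste(dat$chr, dat$start, sep = ":")

if (requireNamespace("edgeR", quietly = TRUE)) {
  # Wrap the counts in a DGEList, the container edgeR works on.
  y <- edgeR::DGEList(counts = counts)

  # Estimate TMM scaling factors (stored in y$samples$norm.factors).
  y <- edgeR::calcNormFactors(y, method = "TMM")

  # CPM values that use the effective library size
  # (lib.size * norm.factors) of each column.
  norm_cpm <- edgeR::cpm(y)

  print(y$samples)
  print(round(norm_cpm, 1))
} else {
  message("edgeR is not installed; install it from Bioconductor to run this sketch")
}
```

The key point is that `calcNormFactors()` expects a count matrix or a `DGEList`, not the full data frame with the chr/start/end columns.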
The final result I am hoping for is something like this:
chr start State
chr1 3273 often
chr1 3274 alternately
chr1 3275 rarely
chr1 3276 often
chr1 3277 rarely
That said, I would already be satisfied with just the normalized depth for each sample.
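To make the second step concrete, here is a rough sketch of one way the states could be assigned once the depths are normalized. Everything about the rule is an assumption made for illustration: a site counts as methylated in an animal when its normalized coverage is at least `min_fold` times the normalized input, and the state depends on the fraction of animals in which that holds. The cut-offs (2-fold, one third, two thirds) are arbitrary placeholders.

```r
# Toy table from the example in this post.
txt <- "chr start end depth depth1 depth2 depth3 depth4 depth5 depth6
chr1 3273 3273 7 200 35 1 200 850 0
chr1 3274 3274 3 50 25 5 300 1500 2
chr1 3275 3275 8 600 15 8 100 300 5
chr1 3276 3276 4 30 2 10 59 20 0
chr1 3277 3277 25 20 7 4 600 45 0"
dat <- read.table(text = txt, header = TRUE)

# Simple counts-per-million normalization (library-size scaling).
counts <- as.matrix(dat[, grep("^depth", names(dat))])
cpm <- sweep(counts, 2, colSums(counts), "/") * 1e6

input   <- cpm[, "depth"]                # expected coverage (input)
samples <- cpm[, paste0("depth", 1:6)]   # the six animals

# Hypothetical rule: methylated in an animal if the normalized
# coverage is at least min_fold times the normalized input.
min_fold <- 2
enriched <- samples >= min_fold * input  # row i is compared with input[i]

# Fraction of animals in which each site is called methylated.
frac <- rowMeans(enriched)

# Hypothetical ranges: up to 1/3 of animals = rarely,
# above 1/3 and up to 2/3 = alternately, above 2/3 = often.
state <- cut(frac,
             breaks = c(-Inf, 1/3, 2/3, Inf),
             labels = c("rarely", "alternately", "often"))

result <- data.frame(chr = dat$chr, start = dat$start, State = state)
print(result)
```

Since any range is acceptable for now, the thresholds are kept in two places only (`min_fold` and `breaks`) so they can be changed without touching the rest.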
Thank you very much. I have now normalized the data using "cpm" from the "edgeR" package for R.
Even after normalization, 3 samples have many values equal to zero because of their low coverage. I think I will have to exclude them from the analysis. I am also considering a PCA to see how these samples group. I would appreciate feedback on the normalization method and on the second part of my problem (defining the states).
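A small sketch of the two checks mentioned here, using base R only: the fraction of zero-coverage positions per animal, and a PCA of the samples on log-transformed CPM values. The pseudocount of 1 before the log and the object names are assumptions for illustration, and the toy table is far too small for the PCA to mean anything.

```r
# Toy table from the example in this post.
txt <- "chr start end depth depth1 depth2 depth3 depth4 depth5 depth6
chr1 3273 3273 7 200 35 1 200 850 0
chr1 3274 3274 3 50 25 5 300 1500 2
chr1 3275 3275 8 600 15 8 100 300 5
chr1 3276 3276 4 30 2 10 59 20 0
chr1 3277 3277 25 20 7 4 600 45 0"
dat <- read.table(text = txt, header = TRUE)

# Only the six animals, without the input column.
counts <- as.matrix(dat[, paste0("depth", 1:6)])

# Fraction of positions with zero coverage in each animal.
zero_frac <- colMeans(counts == 0)
print(round(zero_frac, 2))

# Library-size scaling, then log2 with a pseudocount of 1 (assumption)
# so that zeros do not become -Inf.
cpm <- sweep(counts, 2, colSums(counts), "/") * 1e6
log_cpm <- log2(cpm + 1)

# PCA with samples as rows and positions as variables.
pca <- prcomp(t(log_cpm), center = TRUE, scale. = FALSE)

# Coordinates of each animal on the first two components.
print(round(pca$x[, 1:2], 2))
```

Animals with a high `zero_frac` that also sit apart from the rest on the first components would be candidates for exclusion.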