Hello,
I have to compute DNA shape features (MGW, Roll, ProT, HelT) for genomic ranges of my interest. My GRanges object is something like:
> head(gr) GRanges object with 6 ranges and 0 metadata columns: seqnames ranges strand <Rle> <IRanges> <Rle> 3575 chr10 [179300, 179500] * 3576 chr10 [179350, 179550] * 3577 chr10 [179400, 179600] * 3578 chr10 [179450, 179650] * 3579 chr10 [179500, 179700] * 3580 chr10 [179550, 179750] * ------- seqinfo: 20 sequences from an unspecified genome; no seqlengths
I used following code:
gr <- makeGRangesFromDataFrame(sample, keep.extra.columns = F, ignore.strand = T, seqinfo = NULL, seqnames.field = "chr", start.field = "start", end.field = "stop")
getFasta(gr, BSgenome = Hsapiens, width = 200, filename = "tmp.fa")
fn <- "tmp.fa"
pred2 <- getShape(fn)
The output I get is:
> pred2[1:4,1:5] MGW.1 MGW.2 MGW.3 MGW.4 MGW.5 1 NA NA 5.08 4.66 3.90 2 NA NA 2.85 3.75 4.43 3 NA NA 5.57 5.96 4.68 4 NA NA 5.89 5.21 4.52
The no.of rows in pred2 are equal to no.of rows of GRanges object (gr= 90) and no.of columns are 798 and I am not getting this idea why I am getting 798 columns. I think it is because I defined width=200. According to help information: width = A number indicating a fixed width of sequences. I kept width = 200 because if you notice that my genomic ranges are divided into bins of 200 neucleotides each.
My Questions:
- Why I am getting 798 columns?
- Why there are NA in MGW.1, MGW.2, MGW.199 and MGW.200 columns and similarly for HelT and so on for every feature?
- Ideally I want the feature values (4 features= MGW, Roll, ProT, HelT) for each nucleotide. For that what I expect is a matrix of 90 rows and 800 columns (1 column for each feature value in 200 nucleotides, 4*200 = 800) but instead what I am getting is 90*798 with NA values in 12 columns.
If somebody has used the DNAshapeR package then kindly guide me how to interpret the results. Thank you.
@tsupeich Does it make sense to take average of all the individual (1 neucletide resolution) features e.g. MGW.1 ... MGW.200 ? Isn't it that we loose the information content of true TF binding sites by taking average of feature value across the entire width of the given sequence (e.g. 200 in this case) ?
Hi,
Tsu-Pei was referring to a replacement for NA values that are naturally occurring at the boundaries of the genomic regions you specified, he was not suggesting to average nucleotide-resolution and base-pair-step-resolution parameters.
@tsupeich We could consider adding an option to automatically fill in values at boundaries.
Federico
@federico.comoglio Sure, we can do that. The NA values might not only happen at the boundaries but also at any position in the sequence (according to the input). Maybe the function should be able to handle all NA values.
yes, agreed.
Hi,
I feel like averaging 200 bps would loose some of the information content after all the MGW can go up and down along the region, which cannot be explained only by one value. The features generated byDNAshapeR are more likely to describe the local structure of DNA (instead of the global one). The way we usually analyze the MGW for TF binding sites is based on one base pair resolution, for example, the analyses in the Fig. 2 and 4 of [1] and the Fig. 5 of [2]. Other than that, we did average the MGW but only for the central five base pairs for Fis protein binding site in the Fig.1 of [3]. For your case, I feel like if you have to average the MGW for 200 base pairs, I would suggest you to do more statistics in addition to averaging the values, for example, also checking the distribution.
[1] N. Abe, I. Dror, L. Yang, M. Slattery, T. Zhou, H.J. Bussemaker, R. Rohs*, and R.S. Mann*: Deconvolving the recognition of DNA sequence from shape. Cell 161, 307-318 (2015)
[2] T.P. Chiu,&, L. Yang,&, T. Zhou, B.J. Main, S.C. Parker, S.V. Nuzhdin, T.D. Tullius, and R. Rohs*: GBshape: a genome browser database for DNA shape annotations. Nucleic Acids Res. 43, D103-109 (2015)
[3] T. Zhou, L. Yang, Y. Lu, I. Dror, A.C. Dantas Machado, T. Ghane, R. Di Felice, and R. Rohs*: DNAshape: a method for the high-throughput prediction of DNA structural features on a genome-wide scale. Nucleic Acids Res. 41, W56-62 (2013)
Best wishes,
Tsu-Pei