Hello,
I'm trying to find the nearest CpG islands around each miRNA of interest(gene) and also return distance between them.
For example for hsa-mir-137 the nearest CpG islands are 73 and 91 where CpG 73 overlaps with this miRNA gene and CpG 91 is ~7500 bp away. What is more is it possible to filter only miRNAs where CpG overlaps it or where distance between them is less than 1000 bp?
Here is what I got:
hub <- AnnotationHub()
query(hub, c("cpg","hg19"))
cpgs <- hub[["AH5086"]] ## the granges of all CpG's (hg19)
gffRangedData<-import.gff("allmirna.gff3")
mirna_granges<-as(gffRangedData, "GRanges") ## I downloaded all annotated miRNAs from miRBase and I converted them to granges object
sample_mirnas <- read.table("sample_data.txt", header = TRUE, sep ='\t') #here are my 3 miRNAs of interest (hsa-mir-106b,hsa-mir-150,hsa-mir-107)
final_mirnas <- mirna_granges[mirna_granges$Name %in% sample_mirnas$mirna] # now I wanted to have granges only for my miRNA so I subset data from miRBase
So as a result I have:
1) CpG's granges
> cpgs
GRanges object with 28691 ranges and 1 metadata column:
seqnames ranges strand | name
<Rle> <IRanges> <Rle> | <character>
[1] chr1 [ 28736, 29810] * | CpG:_116
[2] chr1 [135125, 135563] * | CpG:_30
[3] chr1 [327791, 328229] * | CpG:_29
[4] chr1 [437152, 438164] * | CpG:_84
[5] chr1 [449274, 450544] * | CpG:_99
... ... ... ... . ...
[28687] chr9_gl000201_random [ 15651, 15909] * | CpG:_30
[28688] chr9_gl000201_random [ 26397, 26873] * | CpG:_43
[28689] chr11_gl000202_random [ 16284, 16540] * | CpG:_23
[28690] chr17_gl000204_random [ 54686, 57368] * | CpG:_228
[28691] chr17_gl000205_random [117501, 117801] * | CpG:_23
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
2) Granges of miRNA's of interest
> final_mirnas
GRanges object with 3 ranges and 8 metadata columns:
seqnames ranges strand | source type score
<Rle> <IRanges> <Rle> | <factor> <factor> <numeric>
[1] chr10 [ 89592747, 89592827] - | <NA> miRNA_primary_transcript <NA>
[2] chr19 [ 49500785, 49500868] - | <NA> miRNA_primary_transcript <NA>
[3] chr7 [100093993, 100094074] - | <NA> miRNA_primary_transcript <NA>
phase ID Alias Name Derives_from
<integer> <character> <CharacterList> <character> <character>
[1] <NA> MI0000114 MI0000114 hsa-mir-107 <NA>
[2] <NA> MI0000479 MI0000479 hsa-mir-150 <NA>
[3] <NA> MI0000734 MI0000734 hsa-mir-106b <NA>
-------
seqinfo: 24 sequences from an unspecified genome; no seqlengths
Now, how to combine this data togesher to obtain something like (results are fake, made them up for question):
miRNA CpG_upstream CpG_downstream CpG_up_distance CpG_down_distance
hsa-mir-106b 57 23 852 15
hsa-mir-150 NA 77 NA 987
hsa-mir-107 12 NA -12 NA
I put `NA`If distance between miRNA and CpG is greater than 1000bp but it can be actual value, what is more i put negative number when CpG and miRNA gene overlap.
I tried to use distance() and follow() functions but there are only bare numbers, can you help me how to figure it out?