trying to annotate overlaps with external file information
1
1
Entering edit mode
@chrisclarkson100-11114
Last seen 2.6 years ago
United Kingdom

I have found the overlapping regions between the genomic regions I am interested in and genes with 'intersectBed'- external of R. 

intersectBed -a regions_of_interest.bed -b genes.bed > overlap_genes.bed

So now the file looks like this

>head overlap_genes.bed

chr1    180110925    180113591    Region_665    0    0

chr1    180110925    180113591    Region_665    0    0

chr1    180110925    180113591    Region_665    0    0

chr11    43113461    43117525    Region_837    0    0


This however is no good to me as I also need to include columns in this bed file which specify the gene that is being overlapped and where it starts and ends....

head genes.bed​

 

chr1    109408122    109422169    +    Serpinb2    

chr1    174405505    174408706    +    Slamf9    

chr1    133724589    133743546    +    Slc41a1    

chr1    173847813    173848810    +    Slamf6    

i.e.. what I would like to see is:

head overlap_genes.bed
chr1    180110925    180113591    Region_665    0    0    gene_name .....

chr1    180110925    180113591    Region_665    0    0    another_gene 

chr1    180110925    180113591    Region_665    0    0    one_more_gene

So how do I bind the according gene_names and their start, end and orientation to the overlap_genes object? My first thought was to iterate through each row and check if there were overlaps between the i-th range intervals and if so- bind the concerned gene-name and extra info to that i-th overlap_genes row.

genes=read.table('genes.bed', sep='\t', header=F, stringsAsFactors=F)
overlap_genes=read.table('overlap_genes.bed', sep='\t', header=F, stringsAsFactors=F)

informed=data.frame()
for(i in nrow(overlap_genes)){
  apply(genes,1, function(x){tryCatch(

                            if(intersect(seq(x[2], x[3]), seq(overlap_genes[i,1], overlap_genes[i,2])))

                            {rbind(informed,cbind(overlap_genes[i,],x))}

                            , error=function(e) e)})
  }

This however has not worked for me ... The informed data frame comes up empty.... which is obviously wrong as the 'overlap_genes' is the product of being overlapped with genes so all of its rows should overlap with 'genes'... I have tried using Iranges but it seems to posses the same functionality as intersectBed but cannot retain extra columns from external data.....

Does anyone know of a way around this?? Thanks

 

iranges granges • 1.5k views
ADD COMMENT
3
Entering edit mode
@michael-lawrence-3846
Last seen 3.1 years ago
United States

It looks like you want what comes from the bedtools -wb option. It would write out the gene ranges after the intersections. The HelloRanges package maps bedtools command lines to IRanges code. For example,

> bedtools_intersect("-a regions_of_interest.bed -b genes.bed -wb")
{
    genome <- Seqinfo(genome = NA_character_)
    gr_a <- import("regions_of_interest.bed", genome = genome)
    gr_b <- import("genes.bed", genome = genome)
    pairs <- findOverlapPairs(gr_a, gr_b, ignore.strand = TRUE)
    ans <- Pairs(pintersect(pairs, ignore.strand = TRUE), second(pairs))
    ans
}

That gives a Pairs object that pairs up the intersections and genes (as GRanges). You could also just stick the gene GRanges as a meta column on the intersection GRanges, like:

intersection <- first(ans)
intersection$genes <- second(ans)

Hopefully this gets you on the right path. Oh, and note the call to Seqinfo(). You should fill in your genome identifier, eg, "GRCh38" to get the full benefit of the framework.

ADD COMMENT
0
Entering edit mode

hi thank you- i should have looked through the intersectbed documentation and found out about the -wao option

ADD REPLY

Login before adding your answer.

Traffic: 641 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6