Question

findOverlaps(), queryHits and subjectHits()

0

Entering edit mode

user_bioinfo • 0

@9ae4560e

Last seen 3 days ago

France

Hello, I am overlapping chipseq peaks with hichip peaks in RStudio using findOVerlaps() function. I performed the overlapping on each fragment of the hichip separately and extracted the query hits of each overlaps and then create empty columns in my chip peak file so I can fill it with the fragments that overlapped with it using the approach of queryHits and subjectHits. But I got kind of lost with query and subject hits. Does anyone have an idea about it plz.

hichip_gr <- GRangesList(frg1, frg2)
# Create GRanges for the first set of ranges
frg1 <- GRanges(seqnames = Rle(hichip$frg1),
               ranges = IRanges(start = hichip$start1, end = hichip$end1))

# Create GRanges for the second set of ranges
frg2 <- GRanges(seqnames = Rle(hichip$frg2),
               ranges = IRanges(start = hichip$start2, end = hichip$end2))

# Load my chip peak file   
S1 <- read.delim(file.path(path, "S1_Invi_fdr.bed") , header = FALSE)

# Convert it to Grange 
S1 <- GRanges(seqnames = Rle(S1$seqnames),
              ranges = IRanges(start = S1$start, end = S1$end))

# creating a data columns for hichip fragments
mcols(S1)$frg1 <- NA
mcols(S1)$start.1 <- NA
mcols(S1)$end.1 <- NA
mcols(S1)$frg2 <- NA
mcols(S1)$start.2 <- NA
mcols(S1)$end.2 <- NA

# find overlapping 
overlap1 <- findOverlaps(S1, frg1)

# extract queryHits from overlap1
query_hits <- overlap1$subjectHits
subset_frg1 <- frg1[query_hits]

# this is the result of overlap1
 overlap1
       queryHits subjectHits
    1:         1       35804
    2:         1       35805
    3:         5       44884
    4:         5       44885
    5:         5       44886
   ---                      
12398:     15969       43123
12399:     15970       25561
12400:     15971       55972
12401:     15971       55973
12402:     15971       55974


sessionInfo( )
R version 4.1.1 (2021-08-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS/LAPACK: /shared/ifbstor1/software/miniconda/envs/r-4.1.1/lib/libopenblasp-r0.3.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] grid      stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] data.table_1.14.2                        reshape2_1.4.4                          
 [3] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2  ChIPseeker_1.30.3                       
 [5] ChIPpeakAnno_3.28.0                      msigdbr_7.4.1                           
 [7] ggupset_0.3.0                            DOSE_3.20.1                             
 [9] RColorBrewer_1.1-3                       fgsea_1.20.0                            
[11] org.Hs.eg.db_3.14.0                      enrichplot_1.14.2                       
[13] clusterProfiler_4.2.2                    pheatmap_1.0.12                         
[15] VennDiagram_1.7.3                        futile.logger_1.4.3                     
[17] enrichR_3.2                              rtracklayer_1.54.0                      
[19] TxDb.Hsapiens.UCSC.hg38.knownGene_3.14.0 GenomicFeatures_1.46.1                  
[21] AnnotationDbi_1.56.1                     DiffBind_3.4.11                         
[23] BiocParallel_1.28.3                      BiocManager_1.30.23                     
[25] openxlsx_4.2.5.2                         readxl_1.4.3                            
[27] EnhancedVolcano_1.13.2                   ggrepel_0.9.5                           
[29] magrittr_2.0.3                           lubridate_1.9.3                         
[31] forcats_1.0.0                            stringr_1.5.1                           
[33] dplyr_1.1.4                              purrr_1.0.2                             
[35] readr_2.1.5                              tidyr_1.3.1                             
[37] tibble_3.2.1                             tidyverse_2.0.0                         
[39] DESeq2_1.34.0                            SummarizedExperiment_1.24.0             
[41] MatrixGenerics_1.6.0                     matrixStats_0.62.0                      
[43] GenomicRanges_1.46.1                     GenomeInfoDb_1.30.1                     
[45] IRanges_2.28.0                           S4Vectors_0.32.4                        
[47] Biobase_2.54.0                           BiocGenerics_0.40.0                     
[49] limma_3.50.3                             ggplot2_3.5.1                           

loaded via a namespace (and not attached):
  [1] utf8_1.2.2               tidyselect_1.2.1         RSQLite_2.2.8            htmlwidgets_1.5.4       
  [5] scatterpie_0.1.7         munsell_0.5.0            systemPipeR_2.0.4        withr_2.5.0             
  [9] colorspace_2.0-3         GOSemSim_2.20.0          filelock_1.0.2           rstudioapi_0.16.0       
 [13] bbmle_1.0.24             GenomeInfoDbData_1.2.7   mixsqp_0.3-43            hwriter_1.3.2           
 [17] polyclip_1.10-0          bit64_4.0.5              farver_2.1.1             downloader_0.4          
 [21] treeio_1.18.1            coda_0.19-4              vctrs_0.6.5              generics_0.1.3          
 [25] lambda.r_1.2.4           timechange_0.3.0         BiocFileCache_2.2.0      regioneR_1.26.0         
 [29] R6_2.5.1                 apeglm_1.16.0            graphlayouts_0.8.0       invgamma_1.1            
 [33] locfit_1.5-9.4           AnnotationFilter_1.18.0  bitops_1.0-7             cachem_1.0.6            
 [37] gridGraphics_0.5-1       DelayedArray_0.20.0      assertthat_0.2.1         BiocIO_1.4.0            
 [41] scales_1.3.0             ggraph_2.0.5             gtable_0.3.1             ensembldb_2.18.2        
 [45] tidygraph_1.2.0          rlang_1.1.4              genefilter_1.76.0        splines_4.1.1           
 [49] lazyeval_0.2.2           yaml_2.3.6               qvalue_2.26.0            RBGL_1.70.0             
 [53] tools_4.1.1              ggplotify_0.1.0          gplots_3.1.3.1           Rcpp_1.0.9              
 [57] plyr_1.8.7               progress_1.2.2           zlibbioc_1.40.0          RCurl_1.98-1.9          
 [61] prettyunits_1.1.1        viridis_0.6.2            ashr_2.2-47              futile.options_1.0.1    
 [65] DO.db_2.9                truncnorm_1.0-8          mvtnorm_1.1-3            SQUAREM_2021.1          
 [69] amap_0.8-19              ProtGenerics_1.26.0      patchwork_1.1.1          hms_1.1.3               
 [73] xtable_1.8-4             XML_3.99-0.9             emdbook_1.3.12           jpeg_0.1-9              
 [77] gridExtra_2.3            compiler_4.1.1           biomaRt_2.50.0           bdsmatrix_1.3-7         
 [81] shadowtext_0.0.9         KernSmooth_2.23-20       crayon_1.5.2             htmltools_0.5.2         
 [85] ggfun_0.0.4              tzdb_0.2.0               geneplotter_1.72.0       aplot_0.1.1             
 [89] DBI_1.1.2                tweenr_1.0.2             formatR_1.11             WriteXLS_6.6.0          
 [93] dbplyr_2.1.1             MASS_7.3-54              rappdirs_0.3.3           boot_1.3-28             
 [97] babelgene_21.4           ShortRead_1.52.0         Matrix_1.3-4             cli_3.6.3               
[101] parallel_4.1.1           igraph_2.0.3             pkgconfig_2.0.3          GenomicAlignments_1.30.0
[105] numDeriv_2016.8-1.1      InteractionSet_1.22.0    xml2_1.3.3               ggtree_3.2.1            
[109] annotate_1.72.0          multtest_2.50.0          XVector_0.34.0           yulab.utils_0.0.4       
[113] digest_0.6.30            graph_1.72.0             Biostrings_2.62.0        cellranger_1.1.0        
[117] fastmatch_1.1-3          tidytree_0.3.6           restfulr_0.0.13          GreyListChIP_1.26.0     
[121] curl_4.3.3               Rsamtools_2.10.0         gtools_3.9.3             rjson_0.2.20            
[125] jsonlite_1.8.8           nlme_3.1-153             lifecycle_1.0.3          viridisLite_0.4.1       
[129] BSgenome_1.62.0          fansi_1.0.3              pillar_1.9.0             lattice_0.20-45         
[133] plotrix_3.8-2            KEGGREST_1.34.0          fastmap_1.1.0            httr_1.4.4              
[137] survival_3.2-13          GO.db_3.14.0             glue_1.6.2               zip_2.2.0               
[141] png_0.1-7                bit_4.0.4                ggforce_0.3.3            stringi_1.7.8           
[145] blob_1.2.2               latticeExtra_0.6-29      caTools_1.18.2           memoise_2.0.1           
[149] ape_5.5                  irlba_2.3.5.1

GenomicFiles • 1.7k views

ADD COMMENT • link updated 9 months ago by Hervé Pagès 16k • written 9 months ago by user_bioinfo • 0

score 3 · Accepted Answer · 2024-07-15

3

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 1 day ago

United States

The queryHits are the positions in the query that the subjectHits overlap. As an example,

> gr1 <- GRanges(rep("chr1", 5), IRanges(c(5,80, 329, 1111, 2345), c(35, 99, 450, 1211, 3456)))
> gr1
GRanges object with 5 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      5-35      *
  [2]     chr1     80-99      *
  [3]     chr1   329-450      *
  [4]     chr1 1111-1211      *
  [5]     chr1 2345-3456      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> gr2 <- rev(gr1)
> findOverlaps(gr1, gr2)
Hits object with 5 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           5
  [2]         2           4
  [3]         3           3
  [4]         4           2
  [5]         5           1
  -------
  queryLength: 5 / subjectLength: 5

In this contrived example, the first range in gr1 overlaps the last range in gr2 (because it's the same thing, only reversed), and 2 overlaps 4, etc.

ADD COMMENT • link 9 months ago James W. MacDonald 68k

0

Entering edit mode

thank you for your help. but if I want to assign each query to its subjecthits I can have as results a data with the genomic coordinates well assigned?

ADD REPLY • link 9 months ago user_bioinfo • 0

0

Entering edit mode

Can you restate the question? It's not clear to me what you are asking.

ADD REPLY • link 9 months ago James W. MacDonald 68k

0

Entering edit mode

sure! in ur example, findOverlaps(gr1, gr2) the 1st query is assigned to the 5th subjectHits, and the 2nd query is assigned to the 4th subjectHits. this is result represented as the indices of each gr1 and gr2, what if I want to represent it as genomic coordinates and not only the indices. So instead of having

queryHits subjectHits
      <integer>   <integer>
  [1]         1           5

we'll have, ie

seqnames    ranges strand
         <Rle> <IRanges>    <Rle>     <Rle1>    <IRanges>
  [1]     chr1      5-35      *        chr5       5-40

chr1 overlaps with chr5 on this know position

How can we do this?

ADD REPLY • link 9 months ago user_bioinfo • 0

2

Entering edit mode

We could use a GenomicPairs object for that.

> library(GenomicAlignments)
> gr1 <- GRanges(rep("chr1", 5), IRanges(c(5,80, 329, 1111, 2345), c(35, 99, 450, 1211, 3456)))
> gr2 <- rev(gr1)
> gr2 <- shift(gr2, 30)
> fo <- findOverlaps(gr1, gr2)
> z <- GAlignmentPairs(as(gr1[queryHits(fo)], "GAlignments"), as(gr2[subjectHits(fo)], "GAlignments"))
> z
GAlignmentPairs object with 4 pairs, strandMode=1, and 0 metadata columns:
      seqnames strand :    ranges --
         <Rle>  <Rle> : <IRanges> --
  [1]     chr1      * :      5-35 --
  [2]     chr1      * :   329-450 --
  [3]     chr1      * : 1111-1211 --
  [4]     chr1      * : 2345-3456 --
         ranges
      <IRanges>
  [1]     35-65
  [2]   359-480
  [3] 1141-1241
  [4] 2375-3486
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

ADD REPLY • link 9 months ago James W. MacDonald 68k

0

Entering edit mode

ohh interesting! I didn't know about that. Thank you so much for ur help :)

ADD REPLY • link 9 months ago user_bioinfo • 0

1

Entering edit mode

GAlignmentPairs objects are highly specialized objects used to represent aligned paired-end reads, typically by loading them from a BAM file. They are not really the right kind of object for representing the ranges of the overlaps between two GRanges objects. Better to use pintersect() for that:

pintersect(gr1[queryHits(fo)], gr2[subjectHits(fo)])
# GRanges object with 4 ranges and 1 metadata column:
#       seqnames    ranges strand |       hit
#          <Rle> <IRanges>  <Rle> | <logical>
#   [1]     chr1        35      * |      TRUE
#   [2]     chr1   359-450      * |      TRUE
#   [3]     chr1 1141-1211      * |      TRUE
#   [4]     chr1 2375-3456      * |      TRUE
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

The returned GRanges object is _parallel_ to the Hits object fo and contains the range of the overlap for each hit in fo. It can be stored in the Hits object itself as a metadata column:

mcols(fo)$range <- pintersect(gr1[queryHits(fo)], gr2[subjectHits(fo)])
fo
# Hits object with 4 hits and 1 metadata column:
#       queryHits subjectHits |          range
#       <integer>   <integer> |      <GRanges>
#   [1]         1           5 |        chr1:35
#   [2]         3           3 |   chr1:359-450
#   [3]         4           2 | chr1:1141-1211
#   [4]         5           1 | chr1:2375-3456
#   -------
#   queryLength: 5 / subjectLength: 5

Hope this helps,

H.

ADD REPLY • link 9 months ago Hervé Pagès 16k