I would like to take the "start_position" for each gene in my data.frame and draw a window of 100000 bases upstream and 100000 bases downstream (each in 200 base windows, so 500 windows upstream/downstream of the "start_position") and intersect the window with a granges object that has a score column corresponding to 200 base bin positions.
The gene data.frame looks like this:
> head(gene) ensembl_transcript_id length eff_length est_counts tpm ensembl_gene_id mgi_symbol start_position end_position strand chromosome_name 1 ENSMUST00000000001 3262 3063 7876.000 78.1469000 ENSMUSG00000000001 Gnai3 107910198 107949064 - chr3 2 ENSMUST00000000003 902 703 0.000 0.0000000 ENSMUSG00000000003 Pbsn 75083240 75098962 - chrX 3 ENSMUST00000000010 2574 2375 502.564 6.4310300 ENSMUSG00000020875 Hoxb9 96132771 96137909 + chr11 4 ENSMUST00000000028 2143 1944 570.259 8.9151500 ENSMUSG00000000028 Cdc45 18780540 18812080 - chr16 5 ENSMUST00000000033 3708 3509 26338.300 228.1170000 ENSMUSG00000048583 Igf2 149836673 149852721 - chr7 6 ENSMUST00000000049 1190 991 3.000 0.0920027 ENSMUSG00000000049 Apoh 108204668 108275710 + chr11
The granges object looks like this:
> score_data UCSC track 'MEF_K27AC.downsampled.sorted' UCSCData object with 13274466 ranges and 1 metadata column: seqnames ranges strand | score <Rle> <IRanges> <Rle> | <numeric> [1] chr1 [ 1, 200] * | 0 [2] chr1 [201, 400] * | 0 [3] chr1 [401, 600] * | 0 [4] chr1 [601, 800] * | 0 [5] chr1 [801, 1000] * | 0 ... ... ... ... . ... [13274462] chrY [15901401, 15901600] * | 0 [13274463] chrY [15901601, 15901800] * | 0 [13274464] chrY [15901801, 15902000] * | 0 [13274465] chrY [15902001, 15902200] * | 0 [13274466] chrY [15902201, 15902400] * | 0 ------- seqinfo: 21 sequences from mm9 genome
The goal here is to compute for each gene TSS (start_position), get the score values that are upstream and downstream 500 windows. I'd like to add these windows as columns to the gene data.frame is possible. Ideally I would like the data in some form resembling:
ensembl_transcript_id | tpm | upstream_500 | upstream_499 | … | downstream_499 | downstream_500 | |
1 | ENSMUST00000000001 | 78.1469 | |||||
2 | ENSMUST00000000003 | 0 | |||||
3 | ENSMUST00000000010 | 6.43103 | |||||
4 | ENSMUST00000000028 | 8.91515 | |||||
5 | ENSMUST00000000033 | 228.117 | |||||
6 | ENSMUST00000000049 | 0.0920027 |
Any help would be much appreciated. I've posted links to the data and current code below:
Gene data.frame (this can be read in the code as the exp variable)
Score object (this can be read in the code as the mef_chip variable)
Current code that should work with the inputs above
Hi Michael,
I was wondering if you could help me out here. I am creating a new Granges object with the expression data and reducing the range to only the TSS, then I draw a window around the TSS. The exp_granges object looks correct to me. Now I want to subset the scores data.frame with these windows. The code below seems to work with the first two rows of the exp_granges object but when I use the entire object I only get a faction of the total rows in the expression data. I think this is because some of the ranges overlap? or the TSS locations are the same? Any help would be much appreciated!
I think the problem is that some transcripts have the same start and end position and these are not preserved as separate entries in subsetByOverlap. This seems to work but is VERY slow:
I'll again recommend using
promoters()
to formexp_granges
.subsetByOverlaps()
will just return the elements ofchip
that overlap any range inexp_granges
. Instead, I think you want to match the ranges up usingfindOverlaps()
or similar. But you probably want to group the scores by TSS, so something like:I think that will only work though if you first round off the TSS windows so that they align with the chip windows. That way, there's the same number of chip windows for every TSS window.