There are multiple ways in which you can do this, I'll show you one. first, you need to fetch the gene annotations you want. You've mentioned Gencode. One version of Gencode annotations is available in the TxDb.Hsapiens.UCSC.hg38.knownGene annotation package:
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
next, you need to fetch coordinates of the genes, you have not mentioned whether you want to calculate the average conservation spanning the gene boundaries or, for instance, the average conservation of the spliced exons. i'm going to assume you want the latter:
exonsbygene <- exonsBy(txdb, by="gene")
class(exonsbygene)
[1] "CompressedGRangesList"
attr(,"package")
[1] "GenomicRanges"
length(exonsbygene)
[1] 27363
these exons by gene coordinates are stored in a list-like data structure with 27,363 elements, thus indicating we have exon annotations for 27,363 genes. let's load now the annotation package with the phastCons conservation scores and check whether all sequence names (whole chromosomes and chromosome alternate sequences and patches in GRCh38) in the gene annotations are present in the phastCons package:
library(phastCons100way.UCSC.hg38)
phast <- phastCons100way.UCSC.hg38
seqmask <- seqlevels(exonsbygene) %in% seqlevels(phast)
all(seqmask)
[1] FALSE
length(seqmask)
[1] 595
sum(!seqmask)
[1] 140
head(seqlevels(exonsbygene)[!seqmask])
[1] "chr1_KN196472v1_fix" "chr1_KN196473v1_fix" "chr1_KN196474v1_fix"
[4] "chr1_KN538360v1_fix" "chr1_KN538361v1_fix" "chr1_KQ031383v1_fix"
so, not all sequence names in the gene annotations are present in the phastCons
package, concretely for 140 of the alternate and sequences and patches. we have to remove the annotations from those sequences because otherwise GenomicScores
will prompt an error. we do this by restricting annotations to those occurring in "standard" sequence names, corresponding to whole chromosomes:
exonsbygene <- keepStandardChromosomes(exonsbygene, pruning.mode="coarse")
length(exonsbygene)
[1] 25906
so, we have reduced the number of gene annotations from 27,363 to 25,906. still, this like-like data structure cannot be fed directly into the gscores()
function of the GenomicScores
package because it expects a GRanges
object. for this reason, we will first "unlist" this list-like data structure into a GRanges
object:
allexons <- unlist(exonsbygene)
length(allexons)
[1] 521670
head(allexons)
GRanges object with 6 ranges and 2 metadata columns:
seqnames ranges strand | exon_id exon_name
<Rle> <IRanges> <Rle> | <integer> <character>
1 chr19 58345178-58347029 - | 573605 <NA>
1 chr19 58345183-58347029 - | 573606 <NA>
1 chr19 58346854-58347029 - | 573607 <NA>
1 chr19 58346858-58347029 - | 573608 <NA>
1 chr19 58346860-58347029 - | 573609 <NA>
1 chr19 58347353-58347634 - | 573610 <NA>
-------
seqinfo: 25 sequences (1 circular) from hg38 genome
now extract the average conservation scores for each of those exons and put them as an additional metadata column in the GRanges
object (this took a couple of minutes in my laptop):
allexons$phast <- gscores(phast, allexons)$default
head(allexons)
GRanges object with 6 ranges and 3 metadata columns:
seqnames ranges strand | exon_id exon_name
<Rle> <IRanges> <Rle> | <integer> <character>
1 chr19 58345178-58347029 - | 573605 <NA>
1 chr19 58345183-58347029 - | 573606 <NA>
1 chr19 58346854-58347029 - | 573607 <NA>
1 chr19 58346858-58347029 - | 573608 <NA>
1 chr19 58346860-58347029 - | 573609 <NA>
1 chr19 58347353-58347634 - | 573610 <NA>
phast
<numeric>
1 0.0209503239740821
1 0.0204656199242014
1 0.0318181818181818
1 0.0325581395348837
1 0.0329411764705882
1 0.119148936170213
-------
seqinfo: 25 sequences (1 circular) from hg38 genome
if we want to have the conservation scores organized by gene as in the annotations, we can do the following:
exonsbygene <- relist(allexons, exonsbygene)
i'm hitting the 5000 character limit in my answer, so i can't give further details on how to go on from here.
Is this also by you: https://www.biostars.org/p/427723/ ?
yes, I will close that in biostars