Dear Bioconductor community,
based on a project, trying to identify damaging somatic alterations per patient, I was trying to annotate some selected CNAs with their respective total copy number, to obtain the relevant annotated genes; On this premise, I tried to use the CNVRanger R package and proceed as follows:
head(ex.dt) # the input data frame of selected copy number annotations based on hg19 reference
# A tibble: 6 x 5
chromosome start end state file
<dbl> <dbl> <dbl> <dbl> <chr>
1 2 10500 28856096 3 ACCC_01_CRC
2 2 28863705 71742899 3 ACCC_01_CRC
3 2 71743258 74786985 3 ACCC_01_CRC
4 2 74787273 92325671 3 ACCC_01_CRC
5 2 95326671 132111157 3 ACCC_01_CRC
6 2 132592158 141570717 3 ACCC_01_CRC
dim(ex.dt)
[1] 54 5
grl <- GenomicRanges::makeGRangesListFromDataFrame(ex.dt, keep.extra.columns=TRUE,
split.field="file")
grl
GRangesList object of length 1:
$ACCC_01_CRC
GRanges object with 54 ranges and 1 metadata column:
seqnames ranges strand | state
<Rle> <IRanges> <Rle> | <numeric>
[1] 2 10500-28856096 * | 3
[2] 2 28863705-71742899 * | 3
[3] 2 71743258-74786985 * | 3
[4] 2 74787273-92325671 * | 3
[5] 2 95326671-132111157 * | 3
... ... ... ... . ...
[50] 18 10500-15410398 * | 1
[51] 18 19348569-78016748 * | 1
[52] 20 68259-26319069 * | 3
[53] 20 29420069-60419852 * | 3
[54] 20 60427797-62959343 * | 3
-------
seqinfo: 11 sequences from an unspecified genome; no seqlengths
library(EnsDb.Hsapiens.v75)
human.hg19.genes <- ensembldb::genes(EnsDb.Hsapiens.v75)
sel.hg19.genes <- subset(human.hg19.genes, gene_biotype == "protein_coding")
# move to the process of finding the overlapping regions of protein-coding genes
olaps <- GenomicRanges::findOverlaps(sel.hg19.genes, grl, ignore.strand=TRUE)
qh <- S4Vectors::queryHits(olaps)
sh <- S4Vectors::subjectHits(olaps)
cgenes <- sel.hg19.genes[qh]
cgenes$type <- grl$type[sh]
head(cgenes)
GRanges object with 6 ranges and 6 metadata columns:
seqnames ranges strand | gene_id gene_name gene_biotype
<Rle> <IRanges> <Rle> | <character> <character> <character>
ENSG00000268640 12 147052-147377 + | ENSG00000268640 AC026369.1 protein_coding
ENSG00000120645 12 175931-287626 + | ENSG00000120645 IQSEC3 protein_coding
ENSG00000111181 12 299243-323736 - | ENSG00000111181 SLC6A12 protein_coding
ENSG00000010379 12 329789-372039 - | ENSG00000010379 SLC6A13 protein_coding
ENSG00000073614 12 389295-498620 - | ENSG00000073614 KDM5A protein_coding
ENSG00000120647 12 498439-551811 + | ENSG00000120647 CCDC77 protein_coding
seq_coord_system symbol entrezid
<character> <character> <list>
ENSG00000268640 chromosome AC026369.1 <NA>
ENSG00000120645 chromosome IQSEC3 100996825,440073
ENSG00000111181 chromosome SLC6A12 6539
ENSG00000010379 chromosome SLC6A13 6540
ENSG00000073614 chromosome KDM5A 5927
ENSG00000120647 chromosome CCDC77 84318
-------
seqinfo: 273 sequences from GRCh37 genome
ra.grl <- RaggedExperiment::RaggedExperiment(grl) # the initial GRanges object from the txt file with all alterations
### select the largest call overlapping a gene in a sample, avoiding any averaging:
xx2 <- qreduceAssay(ra.grl, cgenes, simplifyReduce = CNVRanger:::.largest)
head(xx2)
ACCC_01_CRC
12:147052-147377:+ 3
12:175931-287626:+ 3
12:299243-323736:- 3
12:329789-372039:- 3
12:389295-498620:- 3
12:498439-551811:+ 3
rownames(xx2) <- cgenes$gene_name
head(xx2)
ACCC_01_CRC
AC026369.1 3
IQSEC3 3
SLC6A12 3
SLC6A13 3
KDM5A 3
CCDC77 3
table(duplicated(rownames(xx2)))
FALSE TRUE
5201 13
sessionInfo( )
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] EnsDb.Hsapiens.v75_2.99.0 ensembldb_2.18.2 AnnotationFilter_1.18.0
[4] GenomicFeatures_1.46.1 AnnotationDbi_1.56.2 Biobase_2.54.0
[7] data.table_1.14.2 forcats_0.5.1 stringr_1.4.0
[10] dplyr_1.0.7 purrr_0.3.4 readr_2.1.1
[13] tidyr_1.2.0 tibble_3.1.6 ggplot2_3.3.5
[16] tidyverse_1.3.1 CNVRanger_1.10.0 RaggedExperiment_1.18.0
[19] GenomicRanges_1.46.1 GenomeInfoDb_1.30.0 IRanges_2.28.0
[22] S4Vectors_0.32.3 BiocGenerics_0.40.0
Thus, my main questions are:
1) Is it incorrect or not expected to have duplicated gene symbols in my final annotated object? And how I should remove them? Like simply using: x.sel <- xx2[!duplicated(rownames(xx2)),]
? Or something is erroneous in my above steps and could be modified to get unique gene symbols?
2) Is there a way also to filter out any copy number alterations, that have an overlap less than 30% with the respective annotated gene?
Thank you in advance,
Efstathios