Annotating copy number alterations per sample using CNVRanger R package and return of duplicated gene symbols
0
0
Entering edit mode
svlachavas ▴ 840
@svlachavas-7225
Last seen 14 months ago
Germany/Heidelberg/German Cancer Resear…

Dear Bioconductor community,

based on a project, trying to identify damaging somatic alterations per patient, I was trying to annotate some selected CNAs with their respective total copy number, to obtain the relevant annotated genes; On this premise, I tried to use the CNVRanger R package and proceed as follows:


head(ex.dt) # the input data frame of selected copy number annotations based on hg19 reference
# A tibble: 6 x 5
  chromosome     start       end state file       
       <dbl>     <dbl>     <dbl> <dbl> <chr>      
1          2     10500  28856096     3 ACCC_01_CRC
2          2  28863705  71742899     3 ACCC_01_CRC
3          2  71743258  74786985     3 ACCC_01_CRC
4          2  74787273  92325671     3 ACCC_01_CRC
5          2  95326671 132111157     3 ACCC_01_CRC
6          2 132592158 141570717     3 ACCC_01_CRC

dim(ex.dt)
[1] 54  5

grl <- GenomicRanges::makeGRangesListFromDataFrame(ex.dt, keep.extra.columns=TRUE,
split.field="file")

grl
GRangesList object of length 1:
$ACCC_01_CRC
GRanges object with 54 ranges and 1 metadata column:
       seqnames             ranges strand |     state
          <Rle>          <IRanges>  <Rle> | <numeric>
   [1]        2     10500-28856096      * |         3
   [2]        2  28863705-71742899      * |         3
   [3]        2  71743258-74786985      * |         3
   [4]        2  74787273-92325671      * |         3
   [5]        2 95326671-132111157      * |         3
   ...      ...                ...    ... .       ...
  [50]       18     10500-15410398      * |         1
  [51]       18  19348569-78016748      * |         1
  [52]       20     68259-26319069      * |         3
  [53]       20  29420069-60419852      * |         3
  [54]       20  60427797-62959343      * |         3
  -------
  seqinfo: 11 sequences from an unspecified genome; no seqlengths


library(EnsDb.Hsapiens.v75)
human.hg19.genes <- ensembldb::genes(EnsDb.Hsapiens.v75)

sel.hg19.genes <- subset(human.hg19.genes, gene_biotype == "protein_coding")

# move to the process of finding the overlapping regions of protein-coding genes

olaps <- GenomicRanges::findOverlaps(sel.hg19.genes, grl, ignore.strand=TRUE)

qh <- S4Vectors::queryHits(olaps)

sh <- S4Vectors::subjectHits(olaps)

cgenes <- sel.hg19.genes[qh]

cgenes$type <- grl$type[sh]

head(cgenes)
GRanges object with 6 ranges and 6 metadata columns:
                  seqnames        ranges strand |         gene_id   gene_name   gene_biotype
                     <Rle>     <IRanges>  <Rle> |     <character> <character>    <character>
  ENSG00000268640       12 147052-147377      + | ENSG00000268640  AC026369.1 protein_coding
  ENSG00000120645       12 175931-287626      + | ENSG00000120645      IQSEC3 protein_coding
  ENSG00000111181       12 299243-323736      - | ENSG00000111181     SLC6A12 protein_coding
  ENSG00000010379       12 329789-372039      - | ENSG00000010379     SLC6A13 protein_coding
  ENSG00000073614       12 389295-498620      - | ENSG00000073614       KDM5A protein_coding
  ENSG00000120647       12 498439-551811      + | ENSG00000120647      CCDC77 protein_coding
                  seq_coord_system      symbol         entrezid
                       <character> <character>           <list>
  ENSG00000268640       chromosome  AC026369.1             <NA>
  ENSG00000120645       chromosome      IQSEC3 100996825,440073
  ENSG00000111181       chromosome     SLC6A12             6539
  ENSG00000010379       chromosome     SLC6A13             6540
  ENSG00000073614       chromosome       KDM5A             5927
  ENSG00000120647       chromosome      CCDC77            84318
  -------
  seqinfo: 273 sequences from GRCh37 genome

ra.grl <- RaggedExperiment::RaggedExperiment(grl) # the initial GRanges object from the txt file with all alterations

### select the largest call overlapping a gene in a sample, avoiding any averaging:

xx2 <- qreduceAssay(ra.grl, cgenes, simplifyReduce = CNVRanger:::.largest)

head(xx2)
                   ACCC_01_CRC
12:147052-147377:+           3
12:175931-287626:+           3
12:299243-323736:-           3
12:329789-372039:-           3
12:389295-498620:-           3
12:498439-551811:+           3

rownames(xx2) <- cgenes$gene_name

head(xx2)
           ACCC_01_CRC
AC026369.1           3
IQSEC3               3
SLC6A12              3
SLC6A13              3
KDM5A                3
CCDC77               3

table(duplicated(rownames(xx2)))

FALSE  TRUE 
 5201    13 

sessionInfo( )

R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] EnsDb.Hsapiens.v75_2.99.0 ensembldb_2.18.2          AnnotationFilter_1.18.0  
 [4] GenomicFeatures_1.46.1    AnnotationDbi_1.56.2      Biobase_2.54.0           
 [7] data.table_1.14.2         forcats_0.5.1             stringr_1.4.0            
[10] dplyr_1.0.7               purrr_0.3.4               readr_2.1.1              
[13] tidyr_1.2.0               tibble_3.1.6              ggplot2_3.3.5            
[16] tidyverse_1.3.1           CNVRanger_1.10.0          RaggedExperiment_1.18.0  
[19] GenomicRanges_1.46.1      GenomeInfoDb_1.30.0       IRanges_2.28.0           
[22] S4Vectors_0.32.3          BiocGenerics_0.40.0

Thus, my main questions are:

1) Is it incorrect or not expected to have duplicated gene symbols in my final annotated object? And how I should remove them? Like simply using: x.sel <- xx2[!duplicated(rownames(xx2)),] ? Or something is erroneous in my above steps and could be modified to get unique gene symbols?

2) Is there a way also to filter out any copy number alterations, that have an overlap less than 30% with the respective annotated gene?

Thank you in advance,

Efstathios

CopyNumberVariation CNVRanger GenomicRanges • 953 views
ADD COMMENT

Login before adding your answer.

Traffic: 517 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6