How to remove duplicate files per case in TCGAbiolinks
raf4 ▴ 30
@raf4-8249
Last seen 22 months ago
United States

Dear List,

I ran a search and received the message "Check if there are duplicated cases". Later, when downloading, the program told me "There are samples duplicated. We will not be able to prepare it".

How do I get rid of the duplicate files? My code and session info follow.

> library(TCGAbiolinks)
> library(xlsx)
> library(DT)
> library(edgeR)
> library(org.Hs.eg.db)
> 
> query.cnv <- GDCquery(project = "TCGA-LUAD", data.category = "Copy Number Variation",  data.type = "Gene Level Copy Number",platform="Affymetrix SNP 6.0",legacy=FALSE)
--------------------------------------
o GDCquery: Searching in GDC database
--------------------------------------
Genome of reference: hg38
--------------------------------------------
oo Accessing GDC. This might take a while...
--------------------------------------------
ooo Project: TCGA-LUAD
--------------------
oo Filtering results
--------------------
ooo By platform
ooo By data.type
----------------
oo Checking data
----------------
ooo Check if there are duplicated cases
Warning: There are more than one file for the same case. Please verify query results. You can use the command View(getResults(query)) in rstudio
ooo Check if there results for the query
-------------------
o Preparing output
-------------------
> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] org.Hs.eg.db_3.14.0  AnnotationDbi_1.56.1 IRanges_2.28.0       S4Vectors_0.32.2    
 [5] Biobase_2.54.0       BiocGenerics_0.40.0  edgeR_3.36.0         limma_3.50.0        
 [9] DT_0.19              xlsx_0.6.5           TCGAbiolinks_2.23.1 

loaded via a namespace (and not attached):
 [1] bitops_1.0-7                matrixStats_0.61.0          bit64_4.0.5                
 [4] filelock_1.0.2              progress_1.2.2              httr_1.4.2                 
 [7] GenomeInfoDb_1.30.0         tools_4.1.1                 utf8_1.2.2                 
[10] R6_2.5.1                    DBI_1.1.1                   colorspace_2.0-2           
[13] tidyselect_1.1.1            prettyunits_1.1.1           bit_4.0.4                  
[16] curl_4.3.2                  compiler_4.1.1              rvest_1.0.2                
[19] xml2_1.3.2                  DelayedArray_0.20.0         scales_1.1.1               
[22] readr_2.0.2                 rappdirs_0.3.3              stringr_1.4.0              
[25] digest_0.6.28               R.utils_2.11.0              XVector_0.34.0             
[28] pkgconfig_2.0.3             htmltools_0.5.2             MatrixGenerics_1.6.0       
[31] dbplyr_2.1.1                fastmap_1.1.0               highr_0.9                  
[34] htmlwidgets_1.5.4           rlang_0.4.12                RSQLite_2.2.8              
[37] generics_0.1.1              jsonlite_1.7.2              dplyr_1.0.7                
[40] R.oo_1.24.0                 RCurl_1.98-1.5              magrittr_2.0.1             
[43] GenomeInfoDbData_1.2.7      Matrix_1.3-4                Rcpp_1.0.7                 
[46] munsell_0.5.0               fansi_0.5.0                 lifecycle_1.0.1            
[49] R.methodsS3_1.8.1           stringi_1.7.5               SummarizedExperiment_1.24.0
[52] zlibbioc_1.40.0             plyr_1.8.6                  BiocFileCache_2.2.0        
[55] grid_4.1.1                  blob_1.2.2                  crayon_1.4.2               
[58] lattice_0.20-45             Biostrings_2.62.0           xlsxjars_0.6.1             
[61] hms_1.1.1                   KEGGREST_1.34.0             locfit_1.5-9.4             
[64] knitr_1.36                  pillar_1.6.4                GenomicRanges_1.46.0       
[67] TCGAbiolinksGUI.data_1.14.0 biomaRt_2.50.0              XML_3.99-0.8               
[70] glue_1.5.0                  downloader_0.4              data.table_1.14.2          
[73] png_0.1-7                   vctrs_0.3.8                 tzdb_0.2.0                 
[76] gtable_0.3.0                purrr_0.3.4                 tidyr_1.1.4                
[79] assertthat_0.2.1            cachem_1.0.6                ggplot2_3.3.5              
[82] xfun_0.28                   tibble_3.1.6                rJava_1.0-5                
[85] memoise_2.0.0               ellipsis_0.3.2             
>

Thanks and best wishes,

Rich

Richard Friedman,

Columbia University

TCGAbiolinks • 2.6k views
raf4 ▴ 30
@raf4-8249
Last seen 22 months ago
United States

Dear List,

I figured out how to do this myself. I reproduce the code below in case it is of use to anyone else.

Best wishes,

Rich

Richard Friedman, Columbia University

library(TCGAbiolinks)
library(xlsx)
library(DT)
library(edgeR)
library(org.Hs.eg.db)

query.cnv <- GDCquery(project = "TCGA-LUAD", data.category = "Copy Number Variation",  data.type = "Gene Level Copy Number",platform="Affymetrix SNP 6.0",legacy=FALSE)


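# Case barcodes returned by the query, one entry per file
query.cnv.cases <- getResults(query.cnv, cols = "cases")
length(query.cnv.cases)
# Barcodes that occur more than once
query.cnv.cases.dups <- query.cnv.cases[duplicated(query.cnv.cases)]
length(query.cnv.cases.dups)
# Distinct barcodes
query.cnv.cases.unique <- unique(query.cnv.cases)
length(query.cnv.cases.unique)
# Barcodes with exactly one file (every duplicated case is dropped entirely)
query.cnv.cases.nodups <- setdiff(query.cnv.cases.unique, query.cnv.cases.dups)
length(query.cnv.cases.nodups)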

query.cnv.nodups <- GDCquery(project = "TCGA-LUAD", data.category = "Copy Number Variation", data.type = "Gene Level Copy Number",platform="Affymetrix SNP 6.0",legacy=FALSE,barcode=query.cnv.cases.nodups)
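
For anyone who wants to check the filtering logic without contacting GDC, here is a minimal base-R sketch of the same steps on a toy vector. The barcodes are made up and stand in for the output of getResults(query.cnv, cols="cases"); the frequency-table variant is just an alternative way of writing the same filter, not something taken from the TCGAbiolinks documentation.

```r
# Toy barcodes standing in for getResults(query.cnv, cols = "cases").
# These values are invented for illustration only.
cases <- c("case-A", "case-B", "case-B", "case-C",
           "case-D", "case-D", "case-D")

# Cases that occur more than once, each listed a single time.
dup_cases <- unique(cases[duplicated(cases)])

# Same idea as the code above: keep only cases with exactly one file.
nodup_cases <- setdiff(unique(cases), dup_cases)

# Alternative formulation using a frequency table.
counts <- table(cases)
nodup_cases_alt <- names(counts)[counts == 1]

print(dup_cases)
print(nodup_cases)
print(nodup_cases_alt)
```

Note that this removes every case that has more than one file rather than keeping one file for each such case, so the second query returns fewer cases than the first.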