GDCquery_Maf error
e.iich • 0

Hi all, I really need some help. I am trying to run GDCquery_Maf which worked fine until yesterday. Now I get the following error:

Error in GDCquery(paste0("TCGA-", tumor), data.category = "Simple Nucleotide Variation",  : 
  Please set a valid workflow.type argument from the list below:
  => Aliquot Ensemble Somatic Variant Merging and Masking

The command I used is below. Any help would be greatly appreciated.


maf <- GDCquery_Maf(tumor = "COAD", pipelines = "mutect2")

sessionInfo( )
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_SG.UTF-8       LC_NUMERIC=C               LC_TIME=en_SG.UTF-8       
 [4] LC_COLLATE=en_SG.UTF-8     LC_MONETARY=en_SG.UTF-8    LC_MESSAGES=en_SG.UTF-8   
 [7] LC_PAPER=en_SG.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_SG.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] TCGAbiolinks_2.20.1         maftools_2.8.05             survivalROC_1.0.3          
 [4] rms_6.2-0                   SparseM_1.81                Hmisc_4.6-0                
 [7] Formula_1.2-4               lattice_0.20-45             biomaRt_2.48.3             
[10] plotROC_2.2.1               survminer_0.4.9             ggpubr_0.4.0               
[13] pheatmap_1.0.12             glmnet_4.1-3                Matrix_1.4-0               
[16] survival_3.2-13             vsn_3.60.0                  DESeq2_1.32.0              
[19] limma_3.50.0                SummarizedExperiment_1.24.0 Biobase_2.54.0             
[22] GenomicRanges_1.46.1        GenomeInfoDb_1.30.1         IRanges_2.28.0             
[25] S4Vectors_0.32.3            BiocGenerics_0.40.0         MatrixGenerics_1.6.0       
[28] matrixStats_0.61.0          EnhancedVolcano_1.10.0      ggrepel_0.9.1              
[31] forcats_0.5.1               stringr_1.4.0               dplyr_1.0.8                
[34] purrr_0.3.4                 readr_2.1.2                 tidyr_1.2.0                
[37] tibble_3.1.6                ggplot2_3.3.5               tidyverse_1.3.1            

loaded via a namespace (and not attached):
  [1] utf8_1.2.2                  R.utils_2.11.0              tidyselect_1.1.2           
  [4] RSQLite_2.2.11              AnnotationDbi_1.56.2        htmlwidgets_1.5.4          
  [7] grid_4.1.2                  BiocParallel_1.28.3         munsell_0.5.0              
 [10] codetools_0.2-18            preprocessCore_1.54.0       withr_2.5.0                
 [13] colorspace_2.0-3            filelock_1.0.2              ggalt_0.4.0                
 [16] knitr_1.38                  rstudioapi_0.13             ggsignif_0.6.3             
 [19] Rttf2pt1_1.3.10             labeling_0.4.2              GenomeInfoDbData_1.2.7     
 [22] hwriter_1.3.2               KMsurv_0.1-5                farver_2.1.0               
 [25] bit64_4.0.5                 downloader_0.4              vctrs_0.3.8                
 [28] generics_0.1.2              TH.data_1.1-0               xfun_0.30                  
 [31] BiocFileCache_2.0.0         EDASeq_2.26.1               markdown_1.1               
 [34] R6_2.5.1                    ggbeeswarm_0.6.0            locfit_1.5-9.5             
 [37] bitops_1.0-7                cachem_1.0.6                DelayedArray_0.20.0        
 [40] assertthat_0.2.1            BiocIO_1.2.0                vroom_1.5.7                
 [43] scales_1.1.1                multcomp_1.4-18             nnet_7.3-16                
 [46] beeswarm_0.4.0              gtable_0.3.0                ash_1.0-15                 
 [49] affy_1.70.0                 sandwich_3.0-1              rlang_1.0.2                
 [52] MatrixModels_0.5-0          genefilter_1.74.1           splines_4.1.2              
 [55] rtracklayer_1.52.1          rstatix_0.7.0               extrafontdb_1.0            
 [58] broom_0.7.12                checkmate_2.0.0             yaml_2.3.5                 
 [61] BiocManager_1.30.16         abind_1.4-5                 modelr_0.1.8               
 [64] GenomicFeatures_1.44.2      backports_1.4.1             gridtext_0.1.4             
 [67] extrafont_0.17              tools_4.1.2                 affyio_1.62.0              
 [70] ellipsis_0.3.2              RColorBrewer_1.1-2          Rcpp_1.0.8.3               
 [73] plyr_1.8.7                  base64enc_0.1-3             progress_1.2.2             
 [76] zlibbioc_1.40.0             RCurl_1.98-1.6              prettyunits_1.1.1          
 [79] rpart_4.1-15                cowplot_1.1.1               zoo_1.8-9                  
 [82] haven_2.4.3                 cluster_2.1.2               fs_1.5.2                   
 [85] magrittr_2.0.2              data.table_1.14.2           reprex_2.0.1               
 [88] mvtnorm_1.1-3               aroma.light_3.22.0          hms_1.1.1                  
 [91] TCGAbiolinksGUI.data_1.12.0 xtable_1.8-4                XML_3.99-0.9               
 [94] jpeg_0.1-9                  readxl_1.4.0                gridExtra_2.3              
 [97] shape_1.4.6                 compiler_4.1.2              maps_3.4.0                 
[100] KernSmooth_2.23-20          crayon_1.5.1                R.oo_1.24.0                
[103] htmltools_0.5.2             tzdb_0.3.0                  ggtext_0.1.1               
[106] geneplotter_1.70.0          lubridate_1.8.0             DBI_1.1.2                  
[109] dbplyr_2.1.1                proj4_1.0-11                MASS_7.3-54                
[112] rappdirs_0.3.3              ShortRead_1.50.0            car_3.0-12                 
[115] cli_3.2.0                   R.methodsS3_1.8.1           parallel_4.1.2             
[118] km.ci_0.5-2                 pkgconfig_2.0.3             GenomicAlignments_1.28.0   
[121] foreign_0.8-81              xml2_1.3.3                  foreach_1.5.2              
[124] annotate_1.72.0             vipor_0.4.5                 XVector_0.34.0             
[127] rvest_1.0.2                 digest_0.6.29               Biostrings_2.62.0          
[130] cellranger_1.1.0            survMisc_0.5.5              htmlTable_2.4.0            
[133] restfulr_0.0.13             curl_4.3.2                  Rsamtools_2.8.0            
[136] quantreg_5.88               rjson_0.2.21                lifecycle_1.0.1            
[139] nlme_3.1-152                jsonlite_1.8.0              carData_3.0-5              
[142] fansi_1.0.3                 pillar_1.7.0                ggrastr_1.0.1              
[145] KEGGREST_1.34.0             fastmap_1.1.0               httr_1.4.2                 
[148] glue_1.6.2                  png_0.1-7                   iterators_1.0.14           
[151] bit_4.0.4                   stringi_1.7.6               blob_1.2.2                 
[154] polspline_1.1.19            latticeExtra_0.6-29         memoise_2.0.1
TCGAbiolinks GDCquery_Maf • 5.8k views

Hello,

I get the same error. The function still worked for me yesterday but not today, with no change to my R version or my TCGAbiolinks package version. So probably something changed in the database that TCGAbiolinks queries? I see similar discrepancies when querying gene expression data with the function GDCquery().


Hi, I got the same error. You should use:

query1 <- GDCquery(project = "TCGA-COAD", data.category = "Simple Nucleotide Variation", data.type = "Masked Somatic Mutation", legacy = FALSE)

GDCdownload(query1, directory = "GDCdata/")

muts <- GDCprepare(query1, directory = "GDCdata/")

This way you will obtain hg38 by default (I think Benedek is right about the TCGAbiolinks database changing). I tried it and obtained the same mutation datasets with both functions, GDCquery and GDCquery_Maf. The problem is that the object returned by GDCquery_Maf is not compatible with the GDCprepare function, so we cannot get a single file.

Barbara


Hi Barbara,

My problem now is what happens if I query this way:

tcga_maf <- GDCquery(project = "TCGA-HNSC", 
                     data.category = "Simple Nucleotide Variation", # Simple nucleotide variation if legacy
                     data.type = "Masked Somatic Mutation",
                     access = "open", 
                     legacy = F,
                     sample.type = "Primary Tumor")

Output:

--------------------------------------
o GDCquery: Searching in GDC database
--------------------------------------
Genome of reference: hg38
--------------------------------------------
oo Accessing GDC. This might take a while...
--------------------------------------------
ooo Project: TCGA-HNSC
--------------------
oo Filtering results
--------------------
ooo By access
ooo By data.type
----------------
oo Checking data
----------------
ooo Check if there are duplicated cases
Warning: There are more than one file for the same case. Please verify query results. You can use the command View(getResults(query)) in rstudio
ooo Check if there results for the query
-------------------
o Preparing output
-------------------

And then:

GDCdownload(tcga_maf,
            directory = "/home/rstudio/san1/BD/datasets/TCGA_biolinks")
tcga_maf <- GDCprepare(tcga_maf)

I get this error message:

Error in GDCprepare(tcga_maf) : 
  There are samples duplicated. We will not be able to prepare it

Checking the results of the query:

tcga_maf$results[[1]] %>% head(3)

Output:

                                    id data_format cases access
1 ee627805-b05b-4ee8-832b-de0cae5e0b3f         MAF         open
2 67d2d32c-688b-4ddb-9d67-08328e3d6fed         MAF         open
3 5fa9ea87-4838-44ea-9f8e-f163be8de716         MAF         open
                                                                file_name                         submitter_id
1 ac88ad4e-1605-42b0-ac95-bb1d5fc28134.wxs.aliquot_ensemble_masked.maf.gz cffb420a-1221-419f-a22e-1c4ea97c295d
2 072406cf-ed8a-4017-9d28-35b0882e3dbe.wxs.aliquot_ensemble_masked.maf.gz 4ae27122-1a52-4f55-bf98-c2e47b14143a
3 a36857bd-dc86-4f0f-8d6d-8a491c362eaa.wxs.aliquot_ensemble_masked.maf.gz 62e2fab3-ce23-4b08-a33e-08e78d45de2d
                data_category                    type file_size                 created_datetime
1 Simple Nucleotide Variation masked_somatic_mutation     42558 2022-01-26T15:57:59.120442-06:00
2 Simple Nucleotide Variation masked_somatic_mutation     22036 2022-01-26T16:01:25.477980-06:00
3 Simple Nucleotide Variation masked_somatic_mutation      7636 2022-01-26T16:02:47.525883-06:00
                            md5sum                 updated_datetime                              file_id
1 b4d4f5e1724e11afe8a035737d0496a3 2022-02-28T13:36:16.196122-06:00 ee627805-b05b-4ee8-832b-de0cae5e0b3f
2 0b7738f9fcff344027e8d8560e7aa9eb 2022-02-28T13:36:01.102328-06:00 67d2d32c-688b-4ddb-9d67-08328e3d6fed
3 3b9a90288870a71279fe8dbca59b6aff 2022-02-28T13:35:40.649488-06:00 5fa9ea87-4838-44ea-9f8e-f163be8de716
                data_type    state experimental_strategy version data_release   project workflow_version
1 Masked Somatic Mutation released                   WXS       1         32.0 TCGA-HNSC   20211008T1907Z
2 Masked Somatic Mutation released                   WXS       1         32.0 TCGA-HNSC   20211008T1907Z
3 Masked Somatic Mutation released                   WXS       1         32.0 TCGA-HNSC   20211008T1907Z
         analysis_updated_datetime analysis_workflow_link                analysis_submitter_id analysis_state
1 2022-02-01T17:22:16.912149-06:00         quay.io/ncigdc ac88ad4e-1605-42b0-ac95-bb1d5fc28134       released
2 2022-02-01T17:22:16.912149-06:00         quay.io/ncigdc 072406cf-ed8a-4017-9d28-35b0882e3dbe       released
3 2022-02-01T17:22:16.912149-06:00         quay.io/ncigdc a36857bd-dc86-4f0f-8d6d-8a491c362eaa       released
                                analysis_workflow_type                 analysis_analysis_id
1 Aliquot Ensemble Somatic Variant Merging and Masking ff3ab545-8f7d-43b6-9430-e7bc9e5e7867
2 Aliquot Ensemble Somatic Variant Merging and Masking b61146b1-75ca-4b07-9349-9a421dbcb791
3 Aliquot Ensemble Somatic Variant Merging and Masking 54c3835f-81d5-4372-95e3-31b46c104d29
         analysis_created_datetime
1 2022-01-26T15:40:56.423362-06:00
2 2022-01-26T15:47:30.156028-06:00
3 2022-01-26T15:50:06.445540-06:00

So although I specify primary tumors, I think it returns the normal cases as well (duplicate samples). As a result I cannot even match the IDs to the TCGA sample barcodes to remove the duplicate samples... Everything worked fine with the GDCquery() function until yesterday. :( Also, all the ID columns are unique (509 unique elements) and the cases column does not contain any values, so it is impossible to find the duplicate samples.
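
For anyone hitting the same wall: if sample barcodes can be recovered some other way (for example from the Tumor_Sample_Barcode column of the downloaded MAF files, which is an assumption here since the cases column above is empty), the sample type and the duplicates can be read off the barcode itself. This is a minimal sketch on made-up barcodes, not output from the query above. The helper names `sample_type_code` and `patient_id` are hypothetical and not part of TCGAbiolinks.

```r
# Hypothetical helper: characters 14-15 of a TCGA barcode hold the sample
# type code ("01" = Primary Tumor, "11" = Solid Tissue Normal).
sample_type_code <- function(barcode) substr(barcode, 14, 15)

# Hypothetical helper: the first 12 characters identify the patient.
patient_id <- function(barcode) substr(barcode, 1, 12)

# Made-up barcodes for illustration only.
barcodes <- c(
  "TCGA-AA-0001-01A-01D-0001-08",
  "TCGA-AA-0001-11A-01D-0001-08",
  "TCGA-AA-0002-01A-01D-0001-08",
  "TCGA-AA-0002-01B-01D-0001-08"
)

codes <- sample_type_code(barcodes)

# Keep primary tumor samples only.
primary <- barcodes[codes == "01"]

# Patients that still have more than one primary tumor sample.
primary_patients <- patient_id(primary)
dup_patients <- unique(primary_patients[duplicated(primary_patients)])

print(codes)
print(dup_patients)
```

With real data, `barcodes` would be replaced by the barcode column of whatever table is available. Which of the duplicated samples to keep is a separate decision that this sketch does not make.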


Hi Benedek. The code below worked on my side. You also need to set the directory argument in GDCprepare.

tcga_maf <- GDCquery(
  project = "TCGA-HNSC",
  data.category = "Simple Nucleotide Variation", # Simple nucleotide variation if legacy
  data.type = "Masked Somatic Mutation",
  access = "open",
  legacy = FALSE,
  sample.type = "Primary Tumor"
)
GDCdownload(tcga_maf, directory = "/home/rstudio/san1/BD/datasets/TCGA_biolinks")
tcga_maf <- GDCprepare(tcga_maf, directory = "/home/rstudio/san1/BD/datasets/TCGA_biolinks")

It's weird but I still get the same error message:

tcga_maf <- GDCquery(project = "TCGA-HNSC", 
                     data.category = "Simple Nucleotide Variation", # Simple nucleotide variation if legacy
                     data.type = "Masked Somatic Mutation",
                     access = "open", 
                     legacy = F,
                     sample.type = "Primary Tumor")
GDCdownload(tcga_maf,
            directory = "/home/rstudio/san1/BD/datasets/TCGA_biolinks")
tcga_maf <- GDCprepare(tcga_maf,
                       directory = "/home/rstudio/san1/BD/datasets/TCGA_biolinks")

Output:

Error in GDCprepare(tcga_maf, directory = "/home/rstudio/san1/BD/datasets/TCGA_biolinks") : 
  There are samples duplicated. We will not be able to prepare it

I have TCGAbiolinks version 2.14.1 and R version 3.6.1 (2019-07-05).

Update: with a newer version of R (4.2) it works, just as it does for you. Thanks for the help!
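
Since the fix here came down to versions, a quick way to compare a local setup against the ones reported in this thread is sketched below. The helper name `version_report` is hypothetical. The only version facts used are the ones stated above (R 3.6.1 with TCGAbiolinks 2.14.1 failed, R 4.2 worked). No minimum required version is implied.

```r
# Hypothetical helper: report the running R version and, if the package is
# installed, its version, for comparison with the versions in this thread.
version_report <- function(pkg = "TCGAbiolinks") {
  pkg_version <- if (requireNamespace(pkg, quietly = TRUE)) {
    as.character(utils::packageVersion(pkg))
  } else {
    NA_character_  # package not installed
  }
  list(
    r = as.character(getRversion()),
    package = pkg,
    package_version = pkg_version
  )
}

report <- version_report()

# TRUE if this R is no newer than the R 3.6.1 setup that failed above.
report$r_not_newer_than_failing <- getRversion() <= "3.6.1"

print(report)
```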


Hi Benedek, I ran your code and it worked. Maybe, as Tiago suggested, the problem is the directory argument of GDCprepare. Bye. Barbara


I still got the same error message with R version 3.6, but with a newer R version it works well. Thanks for your help!


library(TCGAbiolinks)

query_SNV <- GDCquery(
  project = "TCGA-GBM",
  data.category = "Simple Nucleotide Variation",
  data.type = "Masked Somatic Mutation",
  workflow.type = "Aliquot Ensemble Somatic Variant Merging and Masking"
)
GDCdownload(query_SNV)
