Question

GDCquery_Maf error

0

e.iich • 0

Hi all, I really need some help. I am trying to run GDCquery_Maf which worked fine until yesterday. Now I get the following error:

Error in GDCquery(paste0("TCGA-", tumor), data.category = "Simple Nucleotide Variation",  : 
  Please set a valid workflow.type argument from the list below:
  => Aliquot Ensemble Somatic Variant Merging and Masking

The command I used is below. Any help would be greatly appreciated.


maf <- GDCquery_Maf(tumor = "COAD", pipelines = "mutect2")

sessionInfo( )
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_SG.UTF-8       LC_NUMERIC=C               LC_TIME=en_SG.UTF-8       
 [4] LC_COLLATE=en_SG.UTF-8     LC_MONETARY=en_SG.UTF-8    LC_MESSAGES=en_SG.UTF-8   
 [7] LC_PAPER=en_SG.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_SG.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] TCGAbiolinks_2.20.1         maftools_2.8.05             survivalROC_1.0.3          
 [4] rms_6.2-0                   SparseM_1.81                Hmisc_4.6-0                
 [7] Formula_1.2-4               lattice_0.20-45             biomaRt_2.48.3             
[10] plotROC_2.2.1               survminer_0.4.9             ggpubr_0.4.0               
[13] pheatmap_1.0.12             glmnet_4.1-3                Matrix_1.4-0               
[16] survival_3.2-13             vsn_3.60.0                  DESeq2_1.32.0              
[19] limma_3.50.0                SummarizedExperiment_1.24.0 Biobase_2.54.0             
[22] GenomicRanges_1.46.1        GenomeInfoDb_1.30.1         IRanges_2.28.0             
[25] S4Vectors_0.32.3            BiocGenerics_0.40.0         MatrixGenerics_1.6.0       
[28] matrixStats_0.61.0          EnhancedVolcano_1.10.0      ggrepel_0.9.1              
[31] forcats_0.5.1               stringr_1.4.0               dplyr_1.0.8                
[34] purrr_0.3.4                 readr_2.1.2                 tidyr_1.2.0                
[37] tibble_3.1.6                ggplot2_3.3.5               tidyverse_1.3.1            

loaded via a namespace (and not attached):
  [1] utf8_1.2.2                  R.utils_2.11.0              tidyselect_1.1.2           
  [4] RSQLite_2.2.11              AnnotationDbi_1.56.2        htmlwidgets_1.5.4          
  [7] grid_4.1.2                  BiocParallel_1.28.3         munsell_0.5.0              
 [10] codetools_0.2-18            preprocessCore_1.54.0       withr_2.5.0                
 [13] colorspace_2.0-3            filelock_1.0.2              ggalt_0.4.0                
 [16] knitr_1.38                  rstudioapi_0.13             ggsignif_0.6.3             
 [19] Rttf2pt1_1.3.10             labeling_0.4.2              GenomeInfoDbData_1.2.7     
 [22] hwriter_1.3.2               KMsurv_0.1-5                farver_2.1.0               
 [25] bit64_4.0.5                 downloader_0.4              vctrs_0.3.8                
 [28] generics_0.1.2              TH.data_1.1-0               xfun_0.30                  
 [31] BiocFileCache_2.0.0         EDASeq_2.26.1               markdown_1.1               
 [34] R6_2.5.1                    ggbeeswarm_0.6.0            locfit_1.5-9.5             
 [37] bitops_1.0-7                cachem_1.0.6                DelayedArray_0.20.0        
 [40] assertthat_0.2.1            BiocIO_1.2.0                vroom_1.5.7                
 [43] scales_1.1.1                multcomp_1.4-18             nnet_7.3-16                
 [46] beeswarm_0.4.0              gtable_0.3.0                ash_1.0-15                 
 [49] affy_1.70.0                 sandwich_3.0-1              rlang_1.0.2                
 [52] MatrixModels_0.5-0          genefilter_1.74.1           splines_4.1.2              
 [55] rtracklayer_1.52.1          rstatix_0.7.0               extrafontdb_1.0            
 [58] broom_0.7.12                checkmate_2.0.0             yaml_2.3.5                 
 [61] BiocManager_1.30.16         abind_1.4-5                 modelr_0.1.8               
 [64] GenomicFeatures_1.44.2      backports_1.4.1             gridtext_0.1.4             
 [67] extrafont_0.17              tools_4.1.2                 affyio_1.62.0              
 [70] ellipsis_0.3.2              RColorBrewer_1.1-2          Rcpp_1.0.8.3               
 [73] plyr_1.8.7                  base64enc_0.1-3             progress_1.2.2             
 [76] zlibbioc_1.40.0             RCurl_1.98-1.6              prettyunits_1.1.1          
 [79] rpart_4.1-15                cowplot_1.1.1               zoo_1.8-9                  
 [82] haven_2.4.3                 cluster_2.1.2               fs_1.5.2                   
 [85] magrittr_2.0.2              data.table_1.14.2           reprex_2.0.1               
 [88] mvtnorm_1.1-3               aroma.light_3.22.0          hms_1.1.1                  
 [91] TCGAbiolinksGUI.data_1.12.0 xtable_1.8-4                XML_3.99-0.9               
 [94] jpeg_0.1-9                  readxl_1.4.0                gridExtra_2.3              
 [97] shape_1.4.6                 compiler_4.1.2              maps_3.4.0                 
[100] KernSmooth_2.23-20          crayon_1.5.1                R.oo_1.24.0                
[103] htmltools_0.5.2             tzdb_0.3.0                  ggtext_0.1.1               
[106] geneplotter_1.70.0          lubridate_1.8.0             DBI_1.1.2                  
[109] dbplyr_2.1.1                proj4_1.0-11                MASS_7.3-54                
[112] rappdirs_0.3.3              ShortRead_1.50.0            car_3.0-12                 
[115] cli_3.2.0                   R.methodsS3_1.8.1           parallel_4.1.2             
[118] km.ci_0.5-2                 pkgconfig_2.0.3             GenomicAlignments_1.28.0   
[121] foreign_0.8-81              xml2_1.3.3                  foreach_1.5.2              
[124] annotate_1.72.0             vipor_0.4.5                 XVector_0.34.0             
[127] rvest_1.0.2                 digest_0.6.29               Biostrings_2.62.0          
[130] cellranger_1.1.0            survMisc_0.5.5              htmlTable_2.4.0            
[133] restfulr_0.0.13             curl_4.3.2                  Rsamtools_2.8.0            
[136] quantreg_5.88               rjson_0.2.21                lifecycle_1.0.1            
[139] nlme_3.1-152                jsonlite_1.8.0              carData_3.0-5              
[142] fansi_1.0.3                 pillar_1.7.0                ggrastr_1.0.1              
[145] KEGGREST_1.34.0             fastmap_1.1.0               httr_1.4.2                 
[148] glue_1.6.2                  png_0.1-7                   iterators_1.0.14           
[151] bit_4.0.4                   stringi_1.7.6               blob_1.2.2                 
[154] polspline_1.1.19            latticeExtra_0.6-29         memoise_2.0.1

TCGAbiolinks GDCquery_Maf • 5.8k views

ADD COMMENT • link updated 2.3 years ago by 1526466763 • 0 • written 2.7 years ago by e.iich • 0

0

Hello,

I get the same error. For me the function still worked yesterday, but not today (no change in my R version or my TCGAbiolinks package version). So probably something changed in the TCGAbiolinks database? I see similar discrepancies when querying gene expression data with the function GDCquery().

ADD REPLY • link 2.7 years ago Benedek ▴ 20

0

Hi, I got the same error. You should use:

query1 <- GDCquery(
  project = "TCGA-COAD",
  data.category = "Simple Nucleotide Variation",
  data.type = "Masked Somatic Mutation",
  legacy = FALSE
)

GDCdownload(query1, directory = "GDCdata/")

muts <- GDCprepare(query1, directory = "GDCdata/")

This gives you hg38 data by default (I think Benedek is right that the TCGAbiolinks database changed). I tried it and obtained the same mutation datasets with both functions, GDCquery and GDCquery_Maf. The problem is that the object returned by GDCquery_Maf is not compatible with the GDCprepare function, so we cannot get a single combined file.

Barbara

ADD REPLY • link 2.7 years ago Barbara ▴ 10

0

Hi Barbara,

My problem now is what happens when I query this way:

tcga_maf <- GDCquery(project = "TCGA-HNSC", 
                     data.category = "Simple Nucleotide Variation", # Simple nucleotide variation if legacy
                     data.type = "Masked Somatic Mutation",
                     access = "open", 
                     legacy = F,
                     sample.type = "Primary Tumor")

Output:

--------------------------------------
o GDCquery: Searching in GDC database
--------------------------------------
Genome of reference: hg38
--------------------------------------------
oo Accessing GDC. This might take a while...
--------------------------------------------
ooo Project: TCGA-HNSC
--------------------
oo Filtering results
--------------------
ooo By access
ooo By data.type
----------------
oo Checking data
----------------
ooo Check if there are duplicated cases
Warning: There are more than one file for the same case. Please verify query results. You can use the command View(getResults(query)) in rstudio
ooo Check if there results for the query
-------------------
o Preparing output
-------------------

And then:

GDCdownload(tcga_maf,
            directory = "/home/rstudio/san1/BD/datasets/TCGA_biolinks")
tcga_maf <- GDCprepare(tcga_maf)

I get this error message:

Error in GDCprepare(tcga_maf) : 
  There are samples duplicated. We will not be able to prepare it

Checking the results of the query:

tcga_maf$results[[1]] %>% head(3)

Output:

                                    id data_format cases access
1 ee627805-b05b-4ee8-832b-de0cae5e0b3f         MAF         open
2 67d2d32c-688b-4ddb-9d67-08328e3d6fed         MAF         open
3 5fa9ea87-4838-44ea-9f8e-f163be8de716         MAF         open
                                                                file_name                         submitter_id
1 ac88ad4e-1605-42b0-ac95-bb1d5fc28134.wxs.aliquot_ensemble_masked.maf.gz cffb420a-1221-419f-a22e-1c4ea97c295d
2 072406cf-ed8a-4017-9d28-35b0882e3dbe.wxs.aliquot_ensemble_masked.maf.gz 4ae27122-1a52-4f55-bf98-c2e47b14143a
3 a36857bd-dc86-4f0f-8d6d-8a491c362eaa.wxs.aliquot_ensemble_masked.maf.gz 62e2fab3-ce23-4b08-a33e-08e78d45de2d
                data_category                    type file_size                 created_datetime
1 Simple Nucleotide Variation masked_somatic_mutation     42558 2022-01-26T15:57:59.120442-06:00
2 Simple Nucleotide Variation masked_somatic_mutation     22036 2022-01-26T16:01:25.477980-06:00
3 Simple Nucleotide Variation masked_somatic_mutation      7636 2022-01-26T16:02:47.525883-06:00
                            md5sum                 updated_datetime                              file_id
1 b4d4f5e1724e11afe8a035737d0496a3 2022-02-28T13:36:16.196122-06:00 ee627805-b05b-4ee8-832b-de0cae5e0b3f
2 0b7738f9fcff344027e8d8560e7aa9eb 2022-02-28T13:36:01.102328-06:00 67d2d32c-688b-4ddb-9d67-08328e3d6fed
3 3b9a90288870a71279fe8dbca59b6aff 2022-02-28T13:35:40.649488-06:00 5fa9ea87-4838-44ea-9f8e-f163be8de716
                data_type    state experimental_strategy version data_release   project workflow_version
1 Masked Somatic Mutation released                   WXS       1         32.0 TCGA-HNSC   20211008T1907Z
2 Masked Somatic Mutation released                   WXS       1         32.0 TCGA-HNSC   20211008T1907Z
3 Masked Somatic Mutation released                   WXS       1         32.0 TCGA-HNSC   20211008T1907Z
         analysis_updated_datetime analysis_workflow_link                analysis_submitter_id analysis_state
1 2022-02-01T17:22:16.912149-06:00         quay.io/ncigdc ac88ad4e-1605-42b0-ac95-bb1d5fc28134       released
2 2022-02-01T17:22:16.912149-06:00         quay.io/ncigdc 072406cf-ed8a-4017-9d28-35b0882e3dbe       released
3 2022-02-01T17:22:16.912149-06:00         quay.io/ncigdc a36857bd-dc86-4f0f-8d6d-8a491c362eaa       released
                                analysis_workflow_type                 analysis_analysis_id
1 Aliquot Ensemble Somatic Variant Merging and Masking ff3ab545-8f7d-43b6-9430-e7bc9e5e7867
2 Aliquot Ensemble Somatic Variant Merging and Masking b61146b1-75ca-4b07-9349-9a421dbcb791
3 Aliquot Ensemble Somatic Variant Merging and Masking 54c3835f-81d5-4372-95e3-31b46c104d29
         analysis_created_datetime
1 2022-01-26T15:40:56.423362-06:00
2 2022-01-26T15:47:30.156028-06:00
3 2022-01-26T15:50:06.445540-06:00

So although I specify primary tumors, I think it returns the normal cases as well (duplicate samples). This way I cannot even match the IDs to the TCGA sample barcodes to remove the duplicate samples... Everything worked fine with the GDCquery() function until yesterday. :( Also, all the ID columns are unique (509 unique elements) and the cases column does not contain any values, so it is impossible to find the duplicate samples.
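One possible reading of the numbers above, not confirmed anywhere in this thread: if the duplicate check looks at the cases column (an assumption about the internals of GDCprepare), then an empty cases column would be flagged as duplicated even though every file id is unique, because base R's duplicated() treats repeated empty strings as duplicates. A minimal base-R sketch with a toy data frame standing in for getResults(query); the helper name summarise_duplicates is hypothetical and not part of TCGAbiolinks:

```r
# Toy stand-in for getResults(query): unique file ids, empty `cases` column.
results <- data.frame(
  id = c("file-a", "file-b", "file-c"),
  cases = c("", "", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not a TCGAbiolinks function): counts repeated ids,
# repeated case labels, and rows whose case label is missing or empty.
summarise_duplicates <- function(results) {
  list(
    duplicated_ids = sum(duplicated(results$id)),
    duplicated_cases = sum(duplicated(results$cases)),
    empty_cases = sum(is.na(results$cases) | results$cases == "")
  )
}

dup <- summarise_duplicates(results)
print(dup)
```

If empty_cases equals the number of rows, the "duplicates" are more likely missing barcodes in the query results than genuinely repeated samples.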

ADD REPLY • link 2.7 years ago Benedek ▴ 20

1

Hi Benedek, the code below worked on my side. You also need to set the directory argument in GDCprepare.

tcga_maf <- GDCquery(
  project = "TCGA-HNSC",
  data.category = "Simple Nucleotide Variation", # Simple nucleotide variation if legacy
  data.type = "Masked Somatic Mutation",
  access = "open",
  legacy = FALSE,
  sample.type = "Primary Tumor"
)
GDCdownload(tcga_maf, directory = "/home/rstudio/san1/BD/datasets/TCGA_biolinks")
tcga_maf <- GDCprepare(tcga_maf, directory = "/home/rstudio/san1/BD/datasets/TCGA_biolinks")

ADD REPLY • link 2.7 years ago Tiago C. Silva ▴ 270

0

It's weird but I still get the same error message:

tcga_maf <- GDCquery(project = "TCGA-HNSC", 
                     data.category = "Simple Nucleotide Variation", # Simple nucleotide variation if legacy
                     data.type = "Masked Somatic Mutation",
                     access = "open", 
                     legacy = F,
                     sample.type = "Primary Tumor")
GDCdownload(tcga_maf,
            directory = "/home/rstudio/san1/BD/datasets/TCGA_biolinks")
tcga_maf <- GDCprepare(tcga_maf,
                       directory = "/home/rstudio/san1/BD/datasets/TCGA_biolinks")

Output:

Error in GDCprepare(tcga_maf, directory = "/home/rstudio/san1/BD/datasets/TCGA_biolinks") : 
  There are samples duplicated. We will not be able to prepare it

I have TCGAbiolinks version 2.14.1 and R version 3.6.1 (2019-07-05).

Update: with a newer version of R (4.2) it works, just as it does for you. Thanks for the help!
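Since the outcome here depended on the R and package versions, it may help to print both before comparing results with someone else. A small sketch in base R; installed_version is a hypothetical helper, not a TCGAbiolinks function, and the thread does not say which TCGAbiolinks version is the minimum that works:

```r
# Hypothetical helper: installed version of a package as a string,
# or NA if the package is not installed.
installed_version <- function(pkg) {
  tryCatch(
    as.character(utils::packageVersion(pkg)),
    error = function(e) NA_character_
  )
}

cat("R:", R.version.string, "\n")
cat("TCGAbiolinks:", installed_version("TCGAbiolinks"), "\n")

# Versions quoted in this thread, compared as versions rather than strings.
older <- package_version("2.14.1") # reported together with R 3.6.1
newer <- package_version("2.20.1") # from the sessionInfo() in the question
print(older < newer)
```

Comparing with package_version() avoids the pitfalls of comparing version strings alphabetically.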

ADD REPLY • link 2.7 years ago Benedek ▴ 20

1

Hi Benedek, I ran your code and it works. Maybe, as Tiago suggested, the problem is the directory argument of GDCprepare. Bye. Barbara

ADD REPLY • link 2.7 years ago Barbara ▴ 10

0

I still got the same error message with R version 3.6, but with a newer R version it works well. Thanks for your help!

ADD REPLY • link 2.7 years ago Benedek ▴ 20

0

library(TCGAbiolinks)

query_SNV <- GDCquery(
  project = "TCGA-GBM",
  data.category = "Simple Nucleotide Variation",
  data.type = "Masked Somatic Mutation",
  workflow.type = "Aliquot Ensemble Somatic Variant Merging and Masking"
)
GDCdownload(query_SNV)

ADD REPLY • link 2.3 years ago 1526466763 • 0