Hi,
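I downloaded MAF files with TCGAbiolinks:
    library(TCGAbiolinks)
    library(dplyr)
    # Download the MuTect2 MAF for the COAD cohort
    coad.mutect.maf <- GDCquery_Maf("COAD", pipelines = "mutect")
    # Number of distinct tumor samples in the MAF
    coad.mutect.maf %>% summarise(n_distinct(Tumor_Sample_Barcode))  # 435
    # Number of distinct tumor samples with at least one KRAS mutation
    coad.mutect.maf %>% filter(Hugo_Symbol == "KRAS") %>% summarise(n_distinct(Tumor_Sample_Barcode))  # 39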
Only 39/435 samples have a KRAS mutation.
However, if I download the data from Firehose (http://firebrowse.org/?cohort=COAD&download_dialog=true),
around half the samples have KRAS mutations. So what's the difference?
    # total
    ~/Downloads/gdac.broadinstitute.org_COAD.Mutation_Packager_Raw_Calls.Level_3.2016012800.0.0$ ls -1 *txt | wc -l
    368
    # with KRAS mutation
    ~/Downloads/gdac.broadinstitute.org_COAD.Mutation_Packager_Raw_Calls.Level_3.2016012800.0.0$ grep -l KRAS *txt | wc -l
    176
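Note that `grep -l KRAS` matches the string anywhere in a file, so it would also count a file where KRAS appears only in another column or as part of a longer name. A minimal sketch of a stricter count, assuming the per-sample files are tab-delimited MAFs with the gene symbol in the first column (`Hugo_Symbol`); the function name `count_files_with_gene` is made up for illustration:

```shell
# Count how many of the given files have at least one row whose first
# tab-separated column is exactly the gene symbol.
# Assumption: column 1 is Hugo_Symbol, as in the standard MAF layout.
count_files_with_gene() {
    gene=$1
    shift
    n=0
    for f in "$@"; do
        # awk exits 0 if some row has an exact match in column 1, 1 otherwise
        if awk -F '\t' -v g="$gene" '$1 == g { found = 1; exit } END { exit !found }' "$f"; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}
```

Run from the download directory, `count_files_with_gene KRAS *txt` should give a number comparable to the `grep -l` count above; a large gap between the two would suggest the plain `grep` is picking up matches outside the gene column.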
One difference is that GDCquery_Maf accesses data aligned to hg38, while the data in Firehose was aligned to hg19.
Can I get the old hg19 data? But even if it is aligned to hg38, the difference is just too big.
Yes, you can download MAFs aligned to hg19, but this is not in the manual yet; I'm adding it.
https://gist.github.com/tiagochst/03f5a11aa45d67940d65c8dd9bc90a70
Yes, the difference is really big!
From the GDC FAQs:
Why might variants found in TCGA-generated MAFs be missing from the GDC open access MAF files?
Some of the reasons particular mutations may have been removed include updates to third party databases, more conservative germline-masking rules by the GDC, and different mutation calling pipelines and versions. Despite these differences, the GDC recaptures over 97% of TCGA-validated variants in the controlled-access MAF files. The GDC suggests using controlled-access MAF files if important variants cannot be found in somatic MAF files.
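Because the two downloads also cover different numbers of samples (435 vs 368), it can help to check which patients appear in only one of them before comparing mutation rates. A rough sketch, assuming you have already written the sample barcodes from each source to a plain text file, one per line, and relying on the first 12 characters of a TCGA barcode identifying the patient; the function names and input files are hypothetical:

```shell
# Reduce TCGA barcodes (one per line on stdin) to unique,
# sorted 12-character patient IDs such as TCGA-AA-3516.
patient_ids() {
    cut -c1-12 | sort -u
}

# Print patient IDs found in the first barcode list but not in the second.
only_in_first() {
    a=$(mktemp)
    b=$(mktemp)
    patient_ids < "$1" > "$a"
    patient_ids < "$2" > "$b"
    # comm -23 keeps lines unique to the first (sorted) file
    comm -23 "$a" "$b"
    rm -f "$a" "$b"
}
```

For example, `only_in_first gdc_barcodes.txt firehose_barcodes.txt | wc -l` would count the patients present only in the GDC list; swapping the arguments gives the other direction.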