What is the difference between the 3 RNA-seq BAMs for the same sample/aliquot_barcode? What has changed in the manifest file (GDC download)?
Take TCGA-A6-A5ZU-01A in COAD as an example:
There are 3 BAMs for it in the manifest file, listed below. What is the difference between these files:
rna_seq.transcriptome.gdc_realn.bam, rna_seq.chimeric.gdc_realn.bam, and rna_seq.genomic.gdc_realn.bam?
I can retrieve them separately by setting "analysis.workflow_type" in "GenomicDataCommons::filter". But what is the difference between "rna_seq.transcriptome.gdc_realn.bam" and "rna_seq.genomic.gdc_realn.bam"? If the latter is the alignment to the genome, why is it only about half the size of the former (rna_seq.transcriptome.gdc_realn.bam)? I looked into the BAM files to check the commands used to generate them, but the commands (which used STAR) are the same. I am confused by this and not sure which file to use for downstream analysis, for which I would like the reads aligned to the transcriptome.
# A tibble: 3 × 3
filename md5 size
<chr> <chr> <dbl>
1 1ab4a652-7fde-4128-ade4-0a1c56eede64.rna_seq.genomic.gdc_realn.bam 2b2fe9d2d677e8ed185c716bd3515616 2771700699
2 1ab4a652-7fde-4128-ade4-0a1c56eede64.rna_seq.transcriptome.gdc_realn.bam e1551e125123abcd60010b744becb6f5 4945055003
3 1ab4a652-7fde-4128-ade4-0a1c56eede64.rna_seq.chimeric.gdc_realn.bam 011fd055ae3966f41290e9d7cd786052 29159436
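To make the size comparison concrete, here is a minimal Python sketch (not part of GenomicDataCommons; the helper name `alignment_type` is hypothetical) that labels each manifest row by the alignment type embedded in its filename and computes the genomic-to-transcriptome size ratio from the three rows above.

```python
# Rows copied from the manifest above: (filename, size in bytes).
manifest = [
    ("1ab4a652-7fde-4128-ade4-0a1c56eede64.rna_seq.genomic.gdc_realn.bam", 2771700699),
    ("1ab4a652-7fde-4128-ade4-0a1c56eede64.rna_seq.transcriptome.gdc_realn.bam", 4945055003),
    ("1ab4a652-7fde-4128-ade4-0a1c56eede64.rna_seq.chimeric.gdc_realn.bam", 29159436),
]


def alignment_type(filename):
    """Return the token that follows 'rna_seq' in the filename (hypothetical helper)."""
    parts = filename.split(".")
    return parts[parts.index("rna_seq") + 1]


# Map alignment type -> file size for this one aliquot.
sizes = {alignment_type(name): size for name, size in manifest}

ratio = sizes["genomic"] / sizes["transcriptome"]
print(f"genomic / transcriptome size ratio: {ratio:.2f}")
```

For this aliquot the ratio comes out at roughly 0.56, which is the "about half" mentioned above.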
By setting "analysis.workflow_type", I get 62 files each for rna_seq.transcriptome.gdc_realn.bam and rna_seq.chimeric.gdc_realn.bam, and 66 for rna_seq.genomic.gdc_realn.bam, for the cohorts I am interested in.
For the description above, I used GenomicDataCommons_1.20.3, and the retrieved BAMs were aligned to GRCh38.
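For reference, the per-workflow selection described above can also be written as a filter for the GDC REST API `files` endpoint, which GenomicDataCommons wraps. The following is only a sketch that builds the filter without contacting the server. The helper `build_filter` is hypothetical, the field names are assumed from the GDC data model, and the workflow label "STAR 2-Pass Transcriptome" is an assumption that should be checked against the `analysis.workflow_type` values your own query returns.

```python
import json


def build_filter(sample_barcode, workflow_type):
    """Build a GDC-style nested filter (hypothetical helper, no network access)."""

    def is_in(field, values):
        # One "field is in values" clause in the GDC filter syntax.
        return {"op": "in", "content": {"field": field, "value": values}}

    return {
        "op": "and",
        "content": [
            is_in("cases.samples.submitter_id", [sample_barcode]),
            is_in("data_format", ["BAM"]),
            is_in("experimental_strategy", ["RNA-Seq"]),
            is_in("analysis.workflow_type", [workflow_type]),
        ],
    }


# Assumed workflow label; verify it against the values seen in your cohort.
filters = build_filter("TCGA-A6-A5ZU-01A", "STAR 2-Pass Transcriptome")

# Query parameters as they might be sent to the files endpoint (not sent here).
params = {
    "filters": json.dumps(filters),
    "fields": "file_name,md5sum,file_size",
    "size": "100",
}
print(json.dumps(filters, indent=2))
```

Swapping the workflow label for the genomic or chimeric one would select the other two BAM types for the same sample.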
I also used GenomicDataCommons before, but can't recall the version. The point is that I got a different manifest file with the previous version. Taking TCGA-A6-A5ZU-01A as an example again, there was only one BAM for it.
From the previous run (can't recall the version):
The folder id is 1ba88783-3486-4ecc-98a5-1a99eaccd77c, the bam file is bbbc6be9-0d92-47fb-b1fa-946915b6e533_gdc_realn_rehead.bam, and the size is 3256030252, based on the previous manifest file.
The number of BAMs was 66 for the cohorts I am interested in. Given the results above with GenomicDataCommons_1.20.3, I speculate that the previous BAMs were aligned to the genome (because the count matches that of rna_seq.genomic.gdc_realn.bam)?