Up until now I have used the UCSC Xenabrowser to access TCGA survival data. I decided to learn TCGAbiolinks because I thought it would be more fluid and automatable. I know from having done many survival analyses that from Xenabrowser I can access 371 survival data points (censoring and time of event). This is shown in this image https://photos.app.goo.gl/itfh5a1ppdYakWrz7 where OS.time is the time of the event, and is equal to either daystodeath or daystolastfollowup. OS is the event (1 for a death, zero for censoring). I downloaded the data and confirmed in excel that there were no duplicate patient IDs. Still a total of 371.
Here is a script that I wrote to reproduce these data using TCGAbiolinks: Forgive my hacky and verbose R coding, but I think that my comments should be informative. In the end there are only 343 survival data points! Is TCGAbiolinks not retrieving the data correctly? Perhaps there's a bug I'm not seeing and you have some advice for me to streamline my R code. Perhaps I'm not using TCGAbiolinks correctly. Any insight into this discrepancy would be greatly appreciated!
Code updated to be simpler, more legible
library("TCGAbiolinks")
library("tidyverse")
query <- GDCquery( project = "TCGA-LIHC",
data.category = "Clinical",
file.type = "xml",
legacy = FALSE)
GDCdownload(query,directory = ".")
clinical <- GDCprepare_clinic(query, clinical.info = "patient",directory = ".")
#getting the survival time of event data
survival_data <- as_tibble(clinical[,c("days_to_last_followup","days_to_death","vital_status","bcr_patient_barcode","patient_id")])
survival_data <- filter(survival_data,!is.na(days_to_last_followup)|!is.na(days_to_death)) #not both NA
survival_data <- filter(survival_data,is.na(days_to_last_followup)|days_to_last_followup>0)&is.na(days_to_death)|days_to_death > 0 )) #ensuring positive values
survival_data <- survival_data[!duplicated(survival_data$patient_id),] #ensuring no duplicates
dim(survival_data) #should be 371
Update: There are missing data when I take this approach as well
query <- GDCquery(project = "TCGA-LIHC",
data.category = "Clinical",
data.type = "Clinical Supplement",
data.format = "BCR Biotab")
GDCdownload(query)
clinical.BCRtab.all <- GDCprepare(query)
followup <- clinical.BCRtab.all$clinical_follow_up_v4.0_lihc
I need to check better your code to understand it. Unfortunately I don't have a lot of time now. But:
1) did you tried the indexed clinical data? This should be the clinical data with the follow up updated. index.clinical <- GDCqueryclinic("TCGA-LIHC") index.clinical[,c("daystolastfollowup","daysto_death")]
2) Did you check this paper: "An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics" https://doi.org/10.1016/j.cell.2018.02.052. They manually curated some of the TCGA clinical data.
3) for the BRC Biotab, not all patients might have follow up information. The final clinical data is the initial clinical (clinicalpatientlihc) + follow ups (clinicalfollowupv4.0lihc)
Hey Tiago! Thanks for your helpful response. I haven't read that paper but can tell that I need to.
I just finished going through the indexed clinical lihc data, and both the 'followup' and 'patient' sections of both the bcr biotab and the xml, and I found that they all have complete survival data that are completely equivalent to one another; and that you have to merge the most up to date survival data from both the followup and patient sections of the xml and biotab to do get them to be completely equivalent. I am guessing that GDCquery_clinic does this merge automatically in the background. Compared to the complete LIHC survival data from xenabrowser there is only one difference: patient TCGA-DD-A11A is listed as alive in the data from biolinks but dead in xenabrowser (event time is the same). I know that this is a small difference but it makes me wonder whether there might be other small mistakes for other cancers. Or the error could be in the xenabrowser data. I am not sure which one is right. In any case maybe it would be good to make a small note in the documentation of the fact that you have to perform this merge if you're using survival data from the xml or biotab?
Thanks again for your help.