Question

Analysing categorical CGH data

Aedin Culhane ▴ 510

@aedin-culhane-1526

Last seen 5.4 years ago

United States

Dear Richard,

I was on holidays when you mailed BioC, and I only spotted your question today.

CA is not the best approach for analysing your factor table with 3 categories. It is designed for count data, and was originally developed for contingency tables (species x traits counts). If you wish to apply an ordination (i.e. dimension reduction) method to your categorical data, you could apply multiple correspondence analysis, which is available in the ade4 package in the function dudi.acm. If you have made4 installed, ade4 will have been installed automatically. Currently I have no wrappers between Bioconductor and the function dudi.acm in ade4; however, I could implement this as an extension to ord if you wish.

There are further methods available in ade4. If you can apply weights to the categories, there is fuzzy correspondence analysis (dudi.fca), or if you have a mix of quantitative and factor data, you can apply dudi.mix or dudi.hillsmith. See the ade4 manual for more details on these.

I have not applied any of these approaches to CGH data myself, so I can't comment on how well they will work. However, I am glad to help you if I can.

Regards,
Aedin

Message: 3
Date: Mon, 17 Jul 2006 14:57:00 +0100
From: "Richard Birnie"
Subject: [BioC] Analysing categorical CGH data

Hi all,

This is actually a relatively broad question about which methods would be most appropriate for analysing a particular dataset and, by extension, which packages I need to perform those methods. If this is not the appropriate place for this question then I apologise; could someone please suggest where I might be more likely to find help?

The two main questions I am looking for answers to are:

1) Are there methods available for clustering categorical data of the type I describe below, and if so what are they? Even just references to other resources would be a good start.
2) Following on from this, is the application of correspondence analysis a reasonable approach, and have I done it correctly?
What follows is a little long-winded, but I couldn't express it any more briefly.

I am attempting to analyse a set of CGH data from a series of colorectal cancer samples that I have downloaded from the Progenetix CGH database (found at http://www.progenetix.de/~pgscripts/progenetix/Aboutprogenetix.html). This dataset includes samples from colorectal adenomas, carcinomas and metastases. Each case (patient) is described in terms of gain or loss of individual chromosome bands using the 862 band resolution ISCN notation, such that each band is scored as 0 = no change, -1 = loss, 1 = gain or 2 = high level amplification. Essentially this results in a dataset with 440 observations (there are 440 cases) of 862 variables, where each variable can take 1 of 4 possible values. Additionally, each case falls into one of three possible categories: adenoma, primary tumour or metastasis.

I wish to perform cluster analysis on this dataset to identify changes that are associated with particular stages in the progression from adenoma to carcinoma to metastasis. What would be the most appropriate method for this, and which packages supply it?

So far I have tried hierarchical clustering using the pvclust R package, k-means clustering from the stats package and correspondence analysis from the made4 package. Are these methods valid for categorical data? I have tried searching the web and the mailing lists for this question without finding a satisfactory answer. It seems to be implied that they are only suitable for continuous data, but I could not find an explicit answer.

To try to get around this I attempted correspondence analysis, which I am led to believe from reading around is suitable for categorical data. However, this method is outside my current (fairly elementary) knowledge of statistics, so I wanted to confirm whether I am applying it correctly. I loaded my dataset as a dataframe with cases as columns and chromosome bands as rows.
I also loaded a class vector that categorised each case (column of the dataframe) as 'adenoma', 'primary' or 'tumour'. I then ran the analysis using

    P.coa <- ord(Progenetix.coa, type="coa", classvec=Progenetix.class)
    plot(P.coa, classvec=Progenetix.class, arraycol=c("red","blue","yellow","green"))

This resulted in everything being clustered close to the origin, with the three classes on top of each other. This suggests that there are no informative clusters in the data set; however, my concern is that this is caused by my applying the wrong methods.

Any advice anyone has the time to give will be much appreciated. I have deliberately not described my so far unsuccessful attempts in much detail, to try to keep the length down. I can supply more detail on what I have tried if required.

Thanks for your patience.

regards,
Richard

Dr Richard Birnie
Scientific Officer
Section of Pathology and Tumour Biology
Welcome Brenner Building, LIMM
St James University Hospital
Beckett St, Leeds, LS9 7TF
Tel: 0113 3438624
e-mail: r.birnie at leeds.ac.uk
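
For readers who want to try the multiple correspondence analysis suggested in the reply, here is a minimal sketch. It is not taken from the thread. The data are simulated, and the names scores, cgh, stage and indicator are made up for illustration. It assumes the band scores are recoded as factors with cases as rows (Richard's table, with cases as columns, would first need transposing with t()). The ade4 calls only run if ade4 is installed, and, as the reply says, nothing here has been checked on real CGH data.

```r
# Simulated stand-in for the CGH table: 40 cases x 6 bands, each band scored
# -1 (loss), 0 (no change), 1 (gain) or 2 (high level amplification).
set.seed(1)
n_cases <- 40
n_bands <- 6
scores <- matrix(sample(c(-1, 0, 1, 2), n_cases * n_bands, replace = TRUE),
                 nrow = n_cases,
                 dimnames = list(paste0("case", seq_len(n_cases)),
                                 paste0("band", seq_len(n_bands))))

# dudi.acm works on a data frame of factors with cases (individuals) as rows.
cgh <- as.data.frame(lapply(as.data.frame(scores), factor,
                            levels = c(-1, 0, 1, 2),
                            labels = c("loss", "none", "gain", "amp")))
rownames(cgh) <- rownames(scores)
cgh <- droplevels(cgh)  # drop any category a band never takes

# Hypothetical class vector, one stage label per case.
stage <- factor(sample(c("adenoma", "primary", "metastasis"),
                       n_cases, replace = TRUE))

# MCA is correspondence analysis of the indicator (complete disjunctive)
# table: one 0/1 column per band/category pair. Built here in base R only
# to show what the method operates on.
indicator <- do.call(cbind, unname(lapply(cgh, function(f) {
  outer(as.integer(f), seq_len(nlevels(f)), "==") * 1
})))

# Run the MCA itself only if ade4 is available.
if (requireNamespace("ade4", quietly = TRUE)) {
  acm <- ade4::dudi.acm(cgh, scannf = FALSE, nf = 2)
  # acm$li holds the case coordinates; group them by stage on the first plane.
  ade4::s.class(acm$li, fac = stage, col = c("red", "blue", "darkgreen"))
}
```

Every row of indicator sums to the number of bands, because each case takes exactly one category per band. This is the coding that lets a correspondence-type analysis treat -1/0/1/2 as unordered categories rather than as counts.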

Clustering Cancer CGH GLAD made4 • 1.2k views

18.5 years ago Aedin Culhane ▴ 510