Hi:
I am working with Affymetrix microarray gene expression data and trying out different feature selection methods. I am particularly interested in one popular method, minimum redundancy maximum relevance (mRMR). I found the CRAN package mRMRe, "Parallelized Minimum Redundancy, Maximum Relevance (mRMR) Ensemble Feature Selection", and used it for gene filtering, but I was not able to select top-ranked genes incrementally. I could not find a corresponding Bioconductor package that implements mRMR for gene expression data. Can anyone point me to a recommended workflow for extracting top-ranked genes with the mRMR feature selection method? Any strategy or ideas would be highly appreciated.
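To make concrete what I mean by selecting top-ranked genes incrementally, here is a rough base R sketch of a greedy mRMR-style ranking. It is only an illustration under my own assumptions: mrmr_greedy is a hypothetical helper (not part of mRMRe), absolute Pearson correlation stands in for mutual information, the score is relevance minus mean redundancy, and the data are simulated.

```r
# Hypothetical helper (not from mRMRe): greedy mRMR-style ranking.
# x: numeric matrix, samples in rows, genes in named columns.
# y: numeric target, one value per sample.
# Assumption: |Pearson correlation| is used as a proxy for mutual information.
mrmr_greedy <- function(x, y, n_select) {
  x <- as.matrix(x)
  relevance <- abs(cor(x, y))[, 1]      # relevance of each gene to the target
  redundancy_sum <- numeric(ncol(x))    # running sum of |cor| with selected genes
  selected <- integer(0)
  candidates <- seq_len(ncol(x))

  for (k in seq_len(min(n_select, ncol(x)))) {
    if (length(selected) == 0) {
      score <- relevance[candidates]    # first pick: relevance only
    } else {
      # relevance minus mean redundancy with the genes picked so far
      score <- relevance[candidates] -
        redundancy_sum[candidates] / length(selected)
    }
    best <- candidates[which.max(score)]
    selected <- c(selected, best)
    candidates <- setdiff(candidates, best)

    if (length(candidates) > 0) {
      # Only one column of correlations is computed per step,
      # so the full gene-by-gene matrix is never held in memory.
      redundancy_sum[candidates] <- redundancy_sum[candidates] +
        abs(cor(x[, candidates, drop = FALSE], x[, best]))[, 1]
    }
  }
  colnames(x)[selected]                 # gene names in order of selection
}

# Simulated stand-in data: g2 is a near-duplicate of g1
set.seed(42)
n <- 30
y <- rnorm(n)
g1 <- y + rnorm(n, sd = 0.5)
x <- cbind(g1 = g1,
           g2 = g1 + rnorm(n, sd = 0.1),
           g3 = y + rnorm(n, sd = 0.8),
           g4 = rnorm(n),
           g5 = rnorm(n))

ranked <- mrmr_greedy(x, y, n_select = 3)
print(ranked)
```

The output is the list of genes in the order they were picked, which is the kind of incremental ranking I would like to get from mRMRe on the real data.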
reproducible data:
Here is a minimal reproducible gene expression data set from which I want to select top-ranked genes using the mRMR method:
> dput(raw_genes)
structure(list(SampleID = c("Tarca_001_P1A01", "Tarca_013_P1B01",
"Tarca_025_P1C01", "Tarca_037_P1D01", "Tarca_049_P1E01", "Tarca_061_P1F01",
"Tarca_051_P1E03", "Tarca_063_P1F03"), target_age = c(11, 15.3, 21.7,
26.7, 31.3, 32.1, 19.7, 23.6), `1_at` = c(6.06221469449721, 5.8755020052495,
6.12613148162098, 6.1345548976595, 6.28953417729806, 6.08561779473768,
6.25857984382111, 6.22016811759586), `10_at` = c(3.79648446367096,
3.45024474095539, 3.62841140410044, 3.51232455992681, 3.56819306931016,
3.54911765491621, 3.59024881523945, 3.69553021972333), `100_at` = c(5.84933778267459,
6.55052475296263, 6.42187743053935, 6.15489279092855, 6.34807354206396,
6.11780116002087, 6.24635169763079, 6.25479583503303), `1000_at` = c(3.5677794435745,
3.31613364795286, 3.43245075704917, 3.63813996294905, 3.39904385276621,
3.54214650423219, 3.51532853598111, 3.50451431462302), `10000_at` = c(6.16681461038468,
6.18505928400759, 5.6337568741831, 5.14814946571171, 5.64064316609978,
6.25755205471611, 5.68110995701518, 5.14171528059565), `100009613_at` = c(4.44302662142323,
4.3934877055859, 4.6237834519809, 4.66743523288194, 4.97483476597509,
4.78673497541689, 4.77791032146269, 4.64089637146557), `100009676_at` = c(5.83652223195279,
5.89836406552412, 6.01979203584278, 5.98400432133011, 6.1149144301085,
5.74573650612351, 6.04564052289621, 6.10594091413241)), class = "data.frame", row.names = c("Tarca_001_P1A01",
"Tarca_013_P1B01", "Tarca_025_P1C01", "Tarca_037_P1D01", "Tarca_049_P1E01",
"Tarca_061_P1F01", "Tarca_051_P1E03", "Tarca_063_P1F03"))
my attempt:
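library(mRMRe)

# Combine the expression data with the target variable.
# Note: raw_genes already contains target_age and the character SampleID column.
data.cgps <- data.frame(raw_genes, raw_genes$target_age)
dd <- mRMR.data(data = data.cgps)

# Classic mRMR. Note: hta.all is not defined in the example above.
res <- mRMR.classic(data = dd, target_ind = hta.all[[2]], feature_count = 6)
solutions(res)

# Ensemble mRMR with a single solution, using column 1 as the target
res_ens <- mRMR.ensemble(data = dd, target_indices = c(1), solution_count = 1, feature_count = 6)
solutions(res_ens)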
I put this script together from the mRMRe documentation, but it didn't work for me. I suspect my data.cgps representation is wrong.
When I tried this workflow on the actual gene expression data (367 samples in rows and 30,000 genes in columns), my computer became very slow and even froze for a while. I think the MIM (mutual information matrix) computation is too demanding for my machine (4 GB RAM). Perhaps the above attempt is not a good fit for filtering gene expression data. Any recommendations?
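One workaround I am considering for the memory problem is to shrink the gene set before running mRMR, for example by keeping only the most variable genes. The sketch below is an assumption on my part, not something from the mRMRe documentation: filter_by_variance is a hypothetical helper, the cutoff of 10 genes is arbitrary, and the matrix is simulated.

```r
# Hypothetical helper (not from mRMRe): keep the n_keep most variable genes.
# expr: numeric matrix or data frame, samples in rows, genes in columns.
filter_by_variance <- function(expr, n_keep) {
  expr <- as.matrix(expr)
  gene_var <- apply(expr, 2, var)                     # per-gene variance
  n_keep <- min(n_keep, ncol(expr))
  keep <- order(gene_var, decreasing = TRUE)[seq_len(n_keep)]
  expr[, keep, drop = FALSE]
}

# Simulated stand-in for the real expression matrix (20 samples, 50 genes)
set.seed(1)
expr <- matrix(rnorm(20 * 50), nrow = 20,
               dimnames = list(paste0("S", 1:20), paste0("g", 1:50)))

small <- filter_by_variance(expr, n_keep = 10)
dim(small)  # 20 samples, 10 genes
```

I am not sure whether an unsupervised pre-filter like this would bias the later mRMR ranking, so comments on that are welcome too.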
I am thinking about using parallel processing for mRMR feature selection to incrementally select top-ranked genes. I could not find a similar thread on the Bioconductor support site, or a Bioconductor package for gene filtering based on minimum redundancy maximum relevance feature selection. Can anyone recommend a feasible workflow for selecting top-ranked genes with the mRMR method? Any thoughts, script help, or suggested workflows would be highly appreciated. Thanks!