I am inexperienced with iteration in R and am hoping to speed up a process as I am implementing some analysis in a website. I realise this may be a Q for Stackoverflow but I couldn't get an answer from them..
I found a very useful tutorial that allows me to iterate through an adjacency matrix (used to make a data frame as input for networkd3), pick out the criteria data above a certain threshold (>0.01) and then permeate three vectors: "source, target and corr" with these values to eventually make a nicely organised data frame.
adjacency<-adjacency(dat)
m<-adjacency
source=c()
target=c()
corr<-c()
g1<-rownames(adjacency)[1:dim(m)[1]]
g2<-g1
for(gene in g1){
for(gen in g2){
if(m[gene,gen]>0.001){
source<-c(source,gene)
target<-c(target,gen)
corr<-c(corr,m[gene,gen])
}
}
}
network<-data.frame(source,target,corr)
However for big data frames this code is quite slow and I switched to vectorised format. And yet when I try it in vectorised format its output is different...
network2 = which(m > 0.001, arr.ind = TRUE)
source = network2[,1]
target = network2[,2]
corr = m[m > 0.001]
network2<-cbind(network2, corr)
colnames(network2)<-c("source", "target", "corr")
My problem is that when I run these two pieces of code on the same vector
>network
source target corr
1 n1 n1 1.000000000
2 n1 n3 0.001466542
3 n1 n7 0.001164927
4 n1 n9 0.004027818
5 n2 n2 1.000000000
6 n2 n9 0.002580978
The above is the expected output from the for loop code:
However for big data frames this code is quite slow and I switched to vectorised format. And yet when I try it in vectorised format its output is different...
network2 = which(m > 0.001)
source = network2[,1]
target = network2[,2]
corr = m[m > 0.001]
network2<-cbind(network2, corr)
colnames(network2)<-c("source", "target", "corr")
The problem is that I get the following error:
matching_indices = which(m > 0.01)
source = matching_indices[,1]
Error in matching_indices[, 1] : incorrect number of dimensions
This is because I have not used the argument 'are.ind=TRUE' in the 'match' command. But specifying this gives undesired (as it takes only the self-interacting nodes into account) output... so I'm wondering what I should do about this?
The undesired output is as follows (notice that 'corr' is unanimously 1, only taking self replicating nodes into account):
> network2
source target corr
n1 1 1 1
n2 2 2 1
n3 3 3 1
n4 4 4 1
n5 5 5 1
n6 6 6 1
How do I get the output of the vectorised format code to be the same as the the for loop kind?
example data can be stored in the variable 'dat' with the following code:
library('dplyr')
library('igraph')
library('RColorBrewer')
set.seed(1)
# generate a couple clusters
nodes_per_cluster <- 30
n <- 10
nvals <- nodes_per_cluster * n
# cluster 1 (increasing)
cluster1 <- matrix(rep((1:n)/4, nodes_per_cluster) +
rnorm(nvals, sd=1),
nrow=nodes_per_cluster, byrow=TRUE)
# cluster 2 (decreasing)
cluster2 <- matrix(rep((n:1)/4, nodes_per_cluster) +
rnorm(nvals, sd=1),
nrow=nodes_per_cluster, byrow=TRUE)
# noise cluster
noise <- matrix(sample(1:2, nvals, replace=TRUE) +
rnorm(nvals, sd=1.5),
nrow=nodes_per_cluster, byrow=TRUE)
dat <- rbind(cluster1, cluster2, noise)
colnames(dat) <- paste0('n', 1:n)
rownames(dat) <- c(paste0('cluster1_', 1:nodes_per_cluster),
paste0('cluster2_', 1:nodes_per_cluster),
paste0('noise_', 1:nodes_per_cluster))
Hi there thanks for the reply, while your code gets rid of the problem with needing to specify "arr.ind=TRUE" it produces the same output as the unwanted df given above- where it only seems to take self -interacting nodes into account.... do you reckon there is something wrong with the 'for loop' code that I give? I got it from an example tutorial and its output seems to indicate that it is attempting something different to that of the vectorised code... I am trying to take the edge values that are higher than 0.01 and make a network out of them....
I added the first few rows of output to my answer. The results include all nodes satisfying the criterion, not just self-interacting. The order of results is different from the for loop; is this what is causing you problems? I also added a variant where the row and column names are reported, rather than the indexes. Or put another way, what do you mean by 'self-interacting'?
Hi yes I have checked with my own code and unfortunately, while the output is improved as the self-interactive nodes are no longer a problem (i.e. the nodes have correlation values for nodes besides themselves), the output is till different to that of the 'for loop' code- self-interating means that the nodes interact with themselves by being given a correlation value of 1 diagonally through the matrix you can get rid of this with 'diag(adj_matrix)<-0'
I updated my response to show that the results of the for loop and my approach (also, your approach using
which(m > 0.001, arr.ind=TRUE)
) are identical except the rows of the data frames are in different order.Thanks so much it seems to work like you said