Question

most efficient way to read adjacency matrix to make networkd3 graph

0

Entering edit mode

chrisclarkson100 ▴ 50

@chrisclarkson100-11114

Last seen 2.8 years ago

United Kingdom

I am inexperienced with iteration in R and am hoping to speed up a process as I am implementing some analysis in a website. I realise this may be a Q for Stackoverflow but I couldn't get an answer from them..

I found a very useful tutorial that allows me to iterate through an adjacency matrix (used to make a data frame as input for networkd3), pick out the criteria data above a certain threshold (>0.01) and then permeate three vectors: "source, target and corr" with these values to eventually make a nicely organised data frame.

adjacency<-adjacency(dat)
m<-adjacency
source=c()
target=c()
corr<-c()
g1<-rownames(adjacency)[1:dim(m)[1]]
g2<-g1
for(gene in g1){
  for(gen in g2){
    if(m[gene,gen]>0.001){
      source<-c(source,gene)
      target<-c(target,gen)
      corr<-c(corr,m[gene,gen])
    }
  }
}
network<-data.frame(source,target,corr)

However for big data frames this code is quite slow and I switched to vectorised format. And yet when I try it in vectorised format its output is different...

network2 = which(m > 0.001, arr.ind = TRUE)
source = network2[,1]
target  = network2[,2]
corr = m[m > 0.001]
network2<-cbind(network2, corr)
colnames(network2)<-c("source", "target", "corr")

My problem is that when I run these two pieces of code on the same vector

>network
   source target        corr
1      n1     n1 1.000000000
2      n1     n3 0.001466542
3      n1     n7 0.001164927
4      n1     n9 0.004027818
5      n2     n2 1.000000000
6      n2     n9 0.002580978

The above is the expected output from the for loop code:

However for big data frames this code is quite slow and I switched to vectorised format. And yet when I try it in vectorised format its output is different...

network2 = which(m > 0.001)
source = network2[,1]
target  = network2[,2]
corr = m[m > 0.001]
network2<-cbind(network2, corr)
colnames(network2)<-c("source", "target", "corr")

The problem is that I get the following error:

matching_indices = which(m > 0.01)
source = matching_indices[,1]

Error in matching_indices[, 1] : incorrect number of dimensions

This is because I have not used the argument 'are.ind=TRUE' in the 'match' command. But specifying this gives undesired (as it takes only the self-interacting nodes into account) output... so I'm wondering what I should do about this?

The undesired output is as follows (notice that 'corr' is unanimously 1, only taking self replicating nodes into account):

> network2
    source target corr
n1       1      1    1
n2       2      2    1
n3       3      3    1
n4       4      4    1
n5       5      5    1
n6       6      6    1

How do I get the output of the vectorised format code to be the same as the the for loop kind?

example data can be stored in the variable 'dat' with the following code:

library('dplyr')
library('igraph')
library('RColorBrewer')

set.seed(1)

# generate a couple clusters
nodes_per_cluster <- 30
n <- 10

nvals <- nodes_per_cluster * n

# cluster 1 (increasing) 
cluster1 <- matrix(rep((1:n)/4, nodes_per_cluster) + 
                   rnorm(nvals, sd=1),
                   nrow=nodes_per_cluster, byrow=TRUE)

# cluster 2 (decreasing)
cluster2 <- matrix(rep((n:1)/4, nodes_per_cluster) + 
                   rnorm(nvals, sd=1),
                   nrow=nodes_per_cluster, byrow=TRUE)

# noise cluster
noise <- matrix(sample(1:2, nvals, replace=TRUE) +
                rnorm(nvals, sd=1.5),
                nrow=nodes_per_cluster, byrow=TRUE)

dat <- rbind(cluster1, cluster2, noise)
colnames(dat) <- paste0('n', 1:n)
rownames(dat) <- c(paste0('cluster1_', 1:nodes_per_cluster), 
                   paste0('cluster2_', 1:nodes_per_cluster),
                   paste0('noise_',    1:nodes_per_cluster))

r iterators performance • 2.8k views

ADD COMMENT • link updated 8.6 years ago by phil.chapman ▴ 150 • written 8.6 years ago by chrisclarkson100 ▴ 50

score 1 · Answer 1 · 2016-07-20

1

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 7 weeks ago

United States

Here's some data

set.seed(123)
m = cor(matrix(rnorm(100), 10, dimnames=list(LETTERS[1:10], LETTERS[1:10])))

Something like

idx = m > 0.001
df = data.frame(
    source = row(m)[idx],
    target = col(m)[idx],
    corr = m[idx])

Code that uses a for loop, and 'copy-and-append' where a result grows inside the loop, should be viewed very suspiciously -- it makes n * (n-1) / 2 copies as the the result grows, so scales quadratically with the size of the problem.

The first few rows of the data frame are

> head(df)
  source target       corr
1      1      1 1.00000000
2      2      1 0.57761512
3      4      1 0.67301956
4      6      1 0.09055842
5      8      1 0.65713875
6      1      2 0.57761512

A variant of this would use the row / column names rather than indexes

df = data.frame(
    source = rownames(m)[ row(m)[idx] ],
    target = colnames(m)[ col(m)[idx] ],
    corr = m[idx])

This result is identical to the for loop, but with rows in a different order. This is shown by the following code, where I have run the for loop using adjacency = m

> df1 = df[order(df$source, df$target),]; rownames(df1) = NULL
> network1 = network[order(network$source, network$target),]; rownames(network1) = NULL
> identical(df1, network1)
[1] TRUE

ADD COMMENT • link 8.6 years ago Martin Morgan 25k

0

Entering edit mode

Hi there thanks for the reply, while your code gets rid of the problem with needing to specify "arr.ind=TRUE" it produces the same output as the unwanted df given above- where it only seems to take self -interacting nodes into account.... do you reckon there is something wrong with the 'for loop' code that I give? I got it from an example tutorial and its output seems to indicate that it is attempting something different to that of the vectorised code... I am trying to take the edge values that are higher than 0.01 and make a network out of them....

ADD REPLY • link 8.6 years ago chrisclarkson100 ▴ 50

0

Entering edit mode

I added the first few rows of output to my answer. The results include all nodes satisfying the criterion, not just self-interacting. The order of results is different from the for loop; is this what is causing you problems? I also added a variant where the row and column names are reported, rather than the indexes. Or put another way, what do you mean by 'self-interacting'?

ADD REPLY • link 8.6 years ago Martin Morgan 25k

0

Entering edit mode

Hi yes I have checked with my own code and unfortunately, while the output is improved as the self-interactive nodes are no longer a problem (i.e. the nodes have correlation values for nodes besides themselves), the output is till different to that of the 'for loop' code- self-interating means that the nodes interact with themselves by being given a correlation value of 1 diagonally through the matrix you can get rid of this with 'diag(adj_matrix)<-0'

ADD REPLY • link 8.6 years ago chrisclarkson100 ▴ 50

0

Entering edit mode

I updated my response to show that the results of the for loop and my approach (also, your approach using which(m > 0.001, arr.ind=TRUE)) are identical except the rows of the data frames are in different order.

ADD REPLY • link 8.6 years ago Martin Morgan 25k

0

Entering edit mode

Thanks so much it seems to work like you said

ADD REPLY • link 8.6 years ago chrisclarkson100 ▴ 50

score 0 · Answer 2 · 2016-07-20

0

Entering edit mode

phil.chapman ▴ 150

@philchapman-8324

Last seen 8.5 years ago

United Kingdom

It's quite difficult to follow your code without being able to run it. As I understand it you're starting from a matrix, so perhaps you could make a very small, toy object that you could include in your code to get us started? The following approach may be useful as it generates an R command at the end which can be copied and pasted into a script and run to construct the (in this case) toy object:

> data(mtcars)
> m <- as.matrix(mtcars)
> toy <- m[1:5,1:5]
> toy
                   mpg cyl disp  hp drat
Mazda RX4         21.0   6  160 110 3.90
Mazda RX4 Wag     21.0   6  160 110 3.90
Datsun 710        22.8   4  108  93 3.85
Hornet 4 Drive    21.4   6  258 110 3.08
Hornet Sportabout 18.7   8  360 175 3.15
> dput(toy)
structure(c(21, 21, 22.8, 21.4, 18.7, 6, 6, 4, 6, 8, 160, 160,
108, 258, 360, 110, 110, 93, 110, 175, 3.9, 3.9, 3.85, 3.08,
3.15), .Dim = c(5L, 5L), .Dimnames = list(c("Mazda RX4", "Mazda RX4 Wag",
"Datsun 710", "Hornet 4 Drive", "Hornet Sportabout"), c("mpg",
"cyl", "disp", "hp", "drat")))

ADD COMMENT • link 8.6 years ago phil.chapman ▴ 150

0

Entering edit mode

hi there, thanks for the reply does this adj matrix work for you?

6450255 2570615 6370619 2600039 2650615 5340672 2000519 3870044 6450255 1.0000000 0.27049207 -0.189722700 0.4304964296 0.26144185 0.203182365 0.5261722409 0.28750105 2570615 0.2704921 1.00000000 0.042253692 0.3269207024 0.26079551 0.172129526 -0.1091231736 0.13630307 6370619 -0.1897227 0.04225369 1.000000000 -0.3046903972 0.15350119 -0.002081143 0.0154829510 -0.28944753 2600039 0.4304964 0.32692070 -0.304690397 1.0000000000 -0.05116379 -0.265505477 -0.0003627644 0.18117889 2650615 0.2614419 0.26079551 0.153501188 -0.0511637926 1.00000000 0.070419824 0.1920755290 0.39206708 5340672 0.2031824 0.17212953 -0.002081143 -0.2655054768 0.07041982 1.000000000 0.0118562293 0.23229705 2000519 0.5261722 -0.10912317 0.015482951 -0.0003627644 0.19207553 0.011856229 1.0000000000 0.01503415 3870044 0.2875011 0.13630307 -0.289447531 0.1811788864 0.39206708 0.232297053 0.0150341497 1.00000000 7050209 0.1073443 -0.11619209 -0.043299033 0.0086922869 0.13417166 -0.002184810 0.5265883416 0.11916228 1580181 -0.4386352 -0.20304406 0.441508770 -0.3195269908 -0.17324532 -0.086536659 -0.0233773362 -0.34569040 7050209 1580181 6450255 0.107344307 -0.43863522 2570615 -0.116192095 -0.20304406 6370619 -0.043299033 0.44150877 2600039 0.008692287 -0.31952699 2650615 0.134171658 -0.17324532 5340672 -0.002184810 -0.08653666 2000519 0.526588342 -0.02337734 3870044 0.119162282 -0.34569040 7050209 1.000000000 0.05452883 1580181 0.054528831 1.00000000

ADD REPLY • link 8.6 years ago chrisclarkson100 ▴ 50