Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
498 views
in Technique[技术] by (71.8m points)

r - Efficient (memory-wise) function for repeated distance matrix calculations AND chunking of extra large distance matrices

I wonder if anyone could have a look at the following code and minimal example and suggest improvements - in particular regarding efficiency of the code when working with really large data sets.

The function takes a data.frame and splits it by a grouping variable (factor) and then calculates the distance matrix for all the rows in each group.

I do not need to keep the distance matrices - only some statistics ie the mean, the histogram .., then they can be discarded.

I don't know much about memory allocation and the like and am wondering what would be the best way to do this, since I will be working with 10.000 - 100.000 of cases per group. Any thoughts will be greatly appreciated!

Also, what would be the least painful way of including bigmemory or some other large data handling package into the function as is in case I run into serious memory issues?

FactorDistances <- function(df) {
  # df is the data frame where the first column is the grouping variable. 
  # find names and number of groups in df (in the example there are three:(2,3,4)
  factor.names <- unique(df[1])
  n.factors <-length(unique(df$factor))
  # split df by factor into list - each subset dataframe is one list element
  df.l<-list()
  for (f in 1:n.factors) {df.l[[f]]<-df[which(df$factor==factor.names[f,]),]}
  # use lapply to go through list and calculate distance matrix for each group
  # this results in a new list where each element is a distance matrix
  distances <- lapply (df.l, function(x) dist(x[,2:length(x)], method="minkowski", p=2))  
  # again use lapply to get the mean distance for each group
  means <- lapply (distances,  mean)  
  rm(distances)
  gc()
  return(means)
}

df <- data.frame(cbind(factor=rep(2:4,2:4), rnorm(9), rnorm(9)))
FactorDistances(df)
# The result are three average euclidean distances between all pairs in each group
# If a group has only one member, the value is NaN

Edit: I edited the title to reflect the chunking issue I posted as an answer..

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

I've come up with a chunking solution for those extra large matrices that dist() can't handle, which I'm posting here in case anyone else finds it helpful (or finds fault with it, please!). It is significantly slower than dist(), but that is kind of irrelevant, since it should only ever be used when dist() throws an error - usually one of the following:

"Error in double(N * (N - 1)/2) : vector size specified is too large" 
"Error: cannot allocate vector of size 6.0 Gb"
"Error: negative length vectors are not allowed"

The function calculates the mean distance for the matrix, but you can change that to anything else, but in case you want to actually save the matrix I believe some sort of filebacked bigmemory matrix is in order.. Kudos to link for the idea and Ari for his help!

FunDistanceMatrixChunking <- function (df, blockSize=100){
  n <- nrow(df)
  blocks <- n %/% blockSize
  if((n %% blockSize) > 0)blocks <- blocks + 1
  chunk.means <- matrix(NA, nrow=blocks*(blocks+1)/2, ncol= 2)
  dex <- 1:blockSize
  chunk <- 0
  for(i in 1:blocks){    
    p <- dex + (i-1)*blockSize
    lex <- (blockSize+1):(2*blockSize)
    lex <- lex[p<= n]
    p <- p[p<= n]
    for(j in 1:blocks){
      q <- dex +(j-1)*blockSize
      q <- q[q<=n]     
      if (i == j) {       
        chunk <- chunk+1
        x <- dist(df[p,])
        chunk.means[chunk,] <- c(length(x), mean(x))}
      if ( i > j) {
        chunk <- chunk+1
        x <- as.matrix(dist(df[c(q,p),]))[lex,dex] 
        chunk.means[chunk,] <- c(length(x), mean(x))}
    }
  }
  mean <- weighted.mean(chunk.means[,2], chunk.means[,1])
  return(mean)
}
df <- cbind(var1=rnorm(1000), var2=rnorm(1000))
mean(dist(df))
FunDistanceMatrixChunking(df, blockSize=100)

Not sure whether I should have posted this as an edit, instead of an answer.. It does solve my problem, although I didn't really specify it this way..


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...