70000 is not large. It's not small, but it's also not particularly large... The problem is the limited scalability of matrix-oriented approaches: a full 70000 x 70000 distance matrix has about 4.9 billion entries, close to 40 GB as 8-byte doubles (half that if you only store the upper triangle).
But there are plenty of clustering algorithms which do not use matrices and do not need O(n^2) (or, even worse, O(n^3)) runtime.
You may want to try ELKI, which has great index support (try the R*-tree with SortTileRecursive bulk loading). The index support makes it a lot, lot faster.
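For illustration only, here is a rough sketch of what an ELKI command line could look like. The jar name, class path, and parameter names below are written from memory and differ between ELKI versions, so treat all of them as assumptions and check the ELKI documentation:

```
# Sketch only -- jar name, class names and parameter names are assumptions;
# verify them against your ELKI version.
# DBSCAN on a CSV file, with an R*-tree built by Sort-Tile-Recursive bulk loading:
java -cp elki.jar de.lmu.ifi.dbs.elki.application.KDDCLIApplication \
  -dbc.in data.csv \
  -db.index tree.spatial.rstarvariants.rstar.RStarTreeFactory \
  -spatial.bulkstrategy SortTileRecursiveBulkSplit \
  -algorithm clustering.DBSCAN \
  -dbscan.epsilon 0.5 -dbscan.minpts 20
```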
If you insist on using R, at least give kmeans and the fastcluster package a try. K-means has runtime complexity O(n*k*i) (where k is the number of clusters and i is the number of iterations); fastcluster has an O(n) memory, O(n^2) runtime implementation of single-linkage clustering comparable to the SLINK algorithm in ELKI. (The R "agnes" hierarchical clustering will use O(n^3) runtime and O(n^2) memory.)
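A minimal R sketch of both options (it assumes the fastcluster package is installed from CRAN; the data and parameter values are made up for illustration):

```r
# Toy data: 70000 points in 2 dimensions
x <- matrix(rnorm(70000 * 2), ncol = 2)

# k-means: O(n*k*i), no distance matrix is ever built
km <- kmeans(x, centers = 10, iter.max = 50)

# fastcluster::hclust.vector works on the raw vectors, so single linkage
# needs O(n) memory and O(n^2) time instead of an n x n dissimilarity matrix
library(fastcluster)
hc <- hclust.vector(x, method = "single")
labels <- cutree(hc, k = 10)
```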
Implementation matters. Often, implementations in R aren't the best IMHO, except for core R, which usually at least has competitive numerical precision. But R was built by statisticians, not by data miners. Its focus is on statistical expressiveness, not on scalability. So the authors aren't to blame; it's just the wrong tool for large data.
Oh, and if your data is 1-dimensional, don't use clustering at all. Use kernel density estimation. 1-dimensional data is special: it's ordered. Any good algorithm for breaking 1-dimensional data into intervals should exploit the fact that you can sort the data.
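A hedged base-R illustration of that idea (the toy data and the choice to split at local minima of the density are assumptions, not a prescription):

```r
# Toy 1-dimensional data with three modes
v <- c(rnorm(500, 0), rnorm(500, 5), rnorm(500, 12))

d <- density(v)                                 # kernel density estimate
minima <- which(diff(sign(diff(d$y))) > 0) + 1  # local minima of the KDE
breaks <- d$x[minima]                           # cut points between modes
groups <- findInterval(v, sort(breaks))         # assign each point to an interval
table(groups)
```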