This is an implementation of the DBSCAN clustering algorithm on top of Apache Spark. It is loosely based on the paper by He et al., "MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data".

This project extends the version by Irving Cordova, which can only be applied to two-dimensional datasets; this version can be applied to datasets of arbitrary dimension. The first version is complete and has been released.

This version does not yet use an R-tree index to speed up neighborhood queries. DBSCAN on Spark is built against Scala 2.10.
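To illustrate what `eps` and `minPoints` mean for arbitrary-dimension input, here is a minimal, self-contained sketch of DBSCAN's core-point rule on plain Scala collections. This is not the Spark implementation above; all names in it (`EpsNeighborhoodSketch`, `isCore`, `distance`) are illustrative and not part of the library's API.

```scala
// Minimal sketch of DBSCAN's core-point test, independent of Spark.
// A point is a "core" point if at least minPoints points (itself
// included) lie within distance eps of it. Works for any dimension.
object EpsNeighborhoodSketch {
  type Point = Array[Double]

  // Euclidean distance between two points of the same dimension
  def distance(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Core-point test under DBSCAN's eps / minPoints rule
  def isCore(p: Point, all: Seq[Point], eps: Double, minPoints: Int): Boolean =
    all.count(q => distance(p, q) <= eps) >= minPoints

  def main(args: Array[String]): Unit = {
    val points = Seq(
      Array(0.0, 0.0, 0.0), // dense group in 3-D space
      Array(0.1, 0.0, 0.0),
      Array(0.0, 0.1, 0.0),
      Array(5.0, 5.0, 5.0)  // isolated point
    )
    points.foreach { p =>
      println(s"${p.mkString(",")} core=${isCore(p, points, eps = 0.3, minPoints = 3)}")
    }
  }
}
```

The three nearby points qualify as core points, while the isolated one does not; the Spark version applies the same rule, but distributes the neighborhood queries across partitions.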
```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.clustering.dbscan.DBSCAN

object DBSCANSample {

  def main(args: Array[String]): Unit = {
    // Expected arguments: <input path> <output path> <eps> <minPoints> <maxPointsPerPartition>
    val src = args(0)
    val dest = args(1)
    val eps = args(2).toDouble
    val minPoints = args(3).toInt
    val maxPointsPerPartition = args(4).toInt

    val conf = new SparkConf().setAppName("DBSCAN Sample")
    val sc = new SparkContext(conf)

    // Parse each line "x1,x2,...,xn" into a dense vector of arbitrary dimension
    val data = sc.textFile(src)
    val parsedData = data.map(s => Vectors.dense(s.split(',').map(_.toDouble))).cache()

    println(s"EPS: $eps minPoints: $minPoints")

    val model = DBSCAN.train(
      parsedData,
      eps = eps,
      minPoints = minPoints,
      maxPointsPerPartition = maxPointsPerPartition)

    model.labeledPoints.map(p => s"${p.x},${p.y},${p.cluster}").saveAsTextFile(dest)

    sc.stop()
  }
}
```
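Assuming the sample above is packaged into a jar, it could be launched with `spark-submit` along these lines. The jar name, master URL, file paths, and parameter values below are placeholders, not part of this project:

```shell
# Hypothetical invocation; adjust jar name, master, paths, and parameters.
# Arguments: <input path> <output path> <eps> <minPoints> <maxPointsPerPartition>
spark-submit \
  --class DBSCANSample \
  --master local[4] \
  dbscan-sample.jar \
  /path/to/input.csv /path/to/output 0.3 10 250
```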
DBSCAN on Spark is available under the Apache 2.0 license. See the LICENSE file for details.
If you have any questions, feel free to contact me: [email protected]