Details

Type: New Feature

Status: Open

Priority: Major

Resolution: Unresolved

Affects Version/s: None

Fix Version/s: None

Component/s: Algorithms

Labels: None

Description
Implement the naive DBSCAN algorithm in Mahout Samsara, as part of the Algorithms Framework.
Activity
Github user rawkintrevo commented on the issue:
https://github.com/apache/mahout/pull/334
Aditya and I talked about this on a hangout; notes for posterity: it needs to be refactored into in-core/distributed versions, made to fit the Mahout Algos framework, and distanceMetric should leverage Mahout's current distance metrics.
Github user dlyubimov commented on a diff in the pull request:
https://github.com/apache/mahout/pull/334#discussion_r128654179
— Diff: mathscala/src/main/scala/org/apache/mahout/math/algorithms/clustering/DBSCAN_Incore.scala —
@@ -0,0 +1,142 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.mahout.math.algorithms.clustering
+
+import scala.collection.JavaConversions.iterableAsScalaIterable
+import org.apache.mahout.math.scalabindings.::
+import org.apache.mahout.math.scalabindings.RLikeOps.m2mOps
+import org.apache.mahout.math.scalabindings.RLikeOps.v2vOps
+import org.apache.mahout.math.scalabindings.dvec
+import org.apache.mahout.math._
+import scalabindings._
+import scala.collection.mutable.ArrayBuffer
+import scala.collection.mutable.Queue
+
+class DBSCAN_Incore {
+
+ def expandCluster(i: Int, data: DenseMatrix, neighbours: DenseVector, clusterId: Int, Eps: Double, Minpts: Int) = {
+ data(i, 3) = clusterId.toDouble
+ var neighbourQueue = new Queue[Int]()
+ for (index <- 0 until neighbours.length)
+
+ while(neighbourQueue.nonEmpty) {
+ var curr: Int = neighbourQueue.dequeue()
+ if(data(curr, 1) == 0) {
+ var currNeighbours: DenseVector = findNeighbours(curr, data, Eps)
+ data(curr, 1) = 1
+ if(currNeighbours.length >= Minpts) {
+ data(curr, 0) = 1 //coreFlag == True
+ for (index <- 0 until currNeighbours.length)
+ }
+ if(data(curr, 3) == 1)
+ }
+ }
+ }
+
+ def DBSCAN(data: DenseMatrix, Eps: Double, Minpts: Int): Unit = {
+
+ var clusterId: Int = 0
+ for (i <- 0 until data.nrow) {
+ if(data(i, 1) != 1) {
+ //i.e. if notProcessed...
+ val neighbours: DenseVector = findNeighbours(i, data, Eps)
+ data(i, 1) = 1.0 // ==> processedFlag = True
+ if(neighbours.length >= Minpts)
+ else {
+ data(i, 4) = 1.0 // ==> noiseFlag = True
+ }
+ }
+ }
+ }
+
+ /*
+ @args
+ point: Int  Id of the point whose neighbours are to be computed
+ data: DenseMatrix[Double]  The augmented matrix containing all the data points along with the 4 appended columns
+ Returns: A DenseVector containing the ID's of all the points that are at an Eps distance from point.
+ */
+ def findNeighbours(point: Int, data: DenseMatrix, Eps: Double) : DenseVector = {
+ val pointId: Int = data(point, 2).toInt
+ var neighbours: ArrayBuffer[Int] = new ArrayBuffer[Int]()
+ val pointData: DenseVector = dvec(data(pointId, ::))
— End diff –
redundant dvec (imo), slice is already an `o.a.m.math.Vector`
Github user dlyubimov commented on a diff in the pull request:
https://github.com/apache/mahout/pull/334#discussion_r128653815
— Diff: mathscala/src/main/scala/org/apache/mahout/math/algorithms/clustering/DBSCAN_Incore.scala —
@@ -0,0 +1,142 @@
+ def findNeighbours(point: Int, data: DenseMatrix, Eps: Double) : DenseVector = {
+ val pointId: Int = data(point, 2).toInt
+ var neighbours: ArrayBuffer[Int] = new ArrayBuffer[Int]()
+ val pointData: DenseVector = dvec(data(pointId, ::))
+ var neighbourCount = 0
+ for (row <- data) {
+ if(row(0).toInt != pointId) {
+ val arg1 = dvec(row)
— End diff –
i am not sure if `dvec` use is warranted... row is already an `o.a.m.math.Vector`
Github user dlyubimov commented on a diff in the pull request:
https://github.com/apache/mahout/pull/334#discussion_r128653491
— Diff: mathscala/src/main/scala/org/apache/mahout/math/algorithms/clustering/DBSCAN_Incore.scala —
@@ -0,0 +1,142 @@
+ def findNeighbours(point: Int, data: DenseMatrix, Eps: Double) : DenseVector = {
+ val pointId: Int = data(point, 2).toInt
+ var neighbours: ArrayBuffer[Int] = new ArrayBuffer[Int]()
+ val pointData: DenseVector = dvec(data(pointId, ::))
+ var neighbourCount = 0
+ for (row <- data) {
+ if(row(0).toInt != pointId) {
+ val arg1 = dvec(row)
+ val arg2 = dvec(pointData)
+ if(distanceMetric(arg1, arg2) <= Eps)
+ }
+ }
+ var neighboursDvec: DenseVector = dvec(neighbours)
+ neighboursDvec
+ }
+
+ def addColumns(arg: Array[Double]): Array[Double] = {
+ val newArr = new Array[Double](arg.length + 5)
+ newArr(0) = 0.0 //coreFlag //Initialize all points as non-core points
+ newArr(1) = 0.0 //processedFlag //Initialize all points as non-processed points
+ newArr(2) = 1.0 //globalId
+ newArr(3) = -1.0 //clusterId //Initialize all points as not belonging to any cluster
+ newArr(4) = 0.0 //noiseFlag //Initialize all points as not noise
+
+ for (i <- 0 until arg.size)
+ newArr
+ }
+
+ /*
+ Takes two rows as input: rows that contain the coordinates of the data as well as the augmented columns added at the beginning.
+ Computes the distance of only the data part of the rows, ignoring the augmented columns.
+ //DistanceMetric is Euclidean distance as of now. Check and add other types of distance metrics and rename methods accordingly.
+ Also give the user the option of choosing the distance metric while running the algorithm.
+ */
+ def distanceMetric(arg1: DenseVector, arg2: DenseVector) : Double = {
+ var diffsqSum = -1.0
+ if(arg1.length != arg2.length)
+ else {
+ val diff = arg1 - arg2
+ val diffsq = diff ^ 2
+ diffsqSum = 0.0
+ for (i <- 4 until diffsq.length)
+ diffsqSum = math.sqrt(diffsqSum)
— End diff –
I'd just import scala.math._ and use sqrt(a) etc.
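The reviewer's suggestion, as a runnable sketch. Since the Mahout vector types are not assumed available here, plain `Array[Double]` stands in for `DenseVector` (an assumption made purely for illustration); the point is simply that importing `scala.math._` lets `sqrt` be called unqualified.

```scala
import scala.math._

// Euclidean distance over the data part of two augmented rows,
// skipping the first 4 bookkeeping columns as the PR does.
def distanceMetric(arg1: Array[Double], arg2: Array[Double]): Double = {
  require(arg1.length == arg2.length, "vectors must have equal length")
  var diffsqSum = 0.0
  for (i <- 4 until arg1.length) {
    val d = arg1(i) - arg2(i)
    diffsqSum += d * d
  }
  sqrt(diffsqSum) // unqualified sqrt, courtesy of scala.math._
}
```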
Github user dlyubimov commented on a diff in the pull request:
https://github.com/apache/mahout/pull/334#discussion_r128653039
— Diff: mathscala/src/main/scala/org/apache/mahout/math/algorithms/clustering/DBSCAN_Incore.scala —
@@ -0,0 +1,142 @@
+ def DBSCAN(data: DenseMatrix, Eps: Double, Minpts: Int): Unit = {
— End diff –
do we really care to force DenseMatrix vs. just generic Matrix?
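The point of the question is programming against the generic `o.a.m.math.Matrix` trait rather than the concrete `DenseMatrix`. A plain-Scala analogy of that design choice (Mahout is not assumed on the classpath here; `Mat`, `Dense`, and `Constant` are made-up names for illustration):

```scala
// A minimal stand-in for a matrix trait (hypothetical names).
trait Mat {
  def nrow: Int
  def apply(i: Int, j: Int): Double
}

// A dense, array-backed implementation.
class Dense(rows: Array[Array[Double]]) extends Mat {
  def nrow: Int = rows.length
  def apply(i: Int, j: Int): Double = rows(i)(j)
}

// A very different backing representation satisfying the same trait.
class Constant(n: Int, v: Double) extends Mat {
  def nrow: Int = n
  def apply(i: Int, j: Int): Double = v
}

// Accepting the trait, as the reviewer suggests for DBSCAN's signature,
// lets the same algorithm code run over any backing representation.
def columnSum(m: Mat, j: Int): Double =
  (0 until m.nrow).map(i => m(i, j)).sum
```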
Github user dlyubimov commented on a diff in the pull request:
https://github.com/apache/mahout/pull/334#discussion_r128653396
— Diff: mathscala/src/main/scala/org/apache/mahout/math/algorithms/clustering/DBSCAN_Incore.scala —
@@ -0,0 +1,142 @@
+ def distanceMetric(arg1: DenseVector, arg2: DenseVector) : Double = {
+ var diffsqSum = -1.0
+ if(arg1.length != arg2.length)
+ else {
+ val diff = arg1 - arg2
+ val diffsq = diff ^ 2
+ diffsqSum = 0.0
+ for (i <- 4 until diffsq.length) {
— End diff –
DSL suggestion for this loop (not necessarily faster though, but DSL-er :) ):
diffSqSum += diffsq(4 until diffsq.length).sum
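The equivalence behind the suggestion, transliterated to plain Scala collections (the real code uses Mahout's vector slice-and-`sum` DSL; here an immutable `Vector[Double]` stands in for illustration):

```scala
// Explicit loop, as in the PR: accumulate the squared differences
// starting at index 4 (past the bookkeeping columns).
def sumFromLoop(diffsq: Vector[Double]): Double = {
  var diffsqSum = 0.0
  for (i <- 4 until diffsq.length)
    diffsqSum += diffsq(i)
  diffsqSum
}

// The reviewer's slice-and-sum form: same result, one expression.
def sumFromSlice(diffsq: Vector[Double]): Double =
  diffsq.slice(4, diffsq.length).sum
```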
Github user rawkintrevo commented on the issue:
https://github.com/apache/mahout/pull/334
@andrewpalumbo proposing merging this to feature branch. It ended up being a bit hairier than we expected.
Github user andrewpalumbo commented on the issue:
https://github.com/apache/mahout/pull/334
Sounds good to me. Thx.
Sent from my Verizon Wireless 4G LTE smartphone
-------- Original message --------
From: Trevor Grant <notifications@github.com>
Date: 08/25/2017 5:52 AM (GMT-08:00)
To: apache/mahout <mahout@noreply.github.com>
Cc: Andrew Palumbo <ap.dev@outlook.com>, Mention <mention@noreply.github.com>
Subject: Re: [apache/mahout] [WIP] MAHOUT-1991 - Added code for in-core DBSCAN ( O(n^2) ) (#334)
@andrewpalumbo<https://github.com/andrewpalumbo> proposing merging this to feature branch. It ended up being a bit hairier than we expected.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<https://github.com/apache/mahout/pull/334#issuecomment-324910936>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AHU2HcFyPlJMHCu0rmwyKpWWUH_Zzks5sbsONgaJpZM4OdJ9F>.
Github user AdityaAS commented on a diff in the pull request:
https://github.com/apache/mahout/pull/334#discussion_r135379456
— Diff: mathscala/src/main/scala/org/apache/mahout/math/algorithms/clustering/DBSCAN_Incore.scala —
@@ -0,0 +1,142 @@
+ def DBSCAN(data: DenseMatrix, Eps: Double, Minpts: Int): Unit = {
— End diff –
Nope. Changed all occurrences accordingly
Github user AdityaAS commented on a diff in the pull request:
https://github.com/apache/mahout/pull/334#discussion_r135379575
— Diff: mathscala/src/main/scala/org/apache/mahout/math/algorithms/clustering/DBSCAN_Incore.scala —
@@ -0,0 +1,142 @@
+ def findNeighbours(point: Int, data: DenseMatrix, Eps: Double) : DenseVector = {
+ val pointId: Int = data(point, 2).toInt
+ var neighbours: ArrayBuffer[Int] = new ArrayBuffer[Int]()
+ val pointData: DenseVector = dvec(data(pointId, ::))
— End diff –
Updated
Github user AdityaAS commented on a diff in the pull request:
https://github.com/apache/mahout/pull/334#discussion_r135379577
— Diff: mathscala/src/main/scala/org/apache/mahout/math/algorithms/clustering/DBSCAN_Incore.scala —
@@ -0,0 +1,142 @@
+ for (row <- data) {
+ if(row(0).toInt != pointId) {
+ val arg1 = dvec(row)
— End diff –
Updated
Github user AdityaAS commented on a diff in the pull request:
https://github.com/apache/mahout/pull/334#discussion_r135379586
— Diff: mathscala/src/main/scala/org/apache/mahout/math/algorithms/clustering/DBSCAN_Incore.scala —
@@ 0,0 +1,142 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.mahout.math.algorithms.clustering
+
+import scala.collection.JavaConversions.iterableAsScalaIterable
+import org.apache.mahout.math.scalabindings.::
+import org.apache.mahout.math.scalabindings.RLikeOps.m2mOps
+import org.apache.mahout.math.scalabindings.RLikeOps.v2vOps
+import org.apache.mahout.math.scalabindings.dvec
+import org.apache.mahout.math._
+import scalabindings._
+import scala.collection.mutable.ArrayBuffer
+import scala.collection.mutable.Queue
+
+class DBSCAN_Incore {
+
+ def expandCluster(i: Int, data: DenseMatrix, neighbours: DenseVector, clusterId: Int, Eps: Double, Minpts: Int) = {
+ data(i, 3) = clusterId.toDouble
+ var neighbourQueue = new Queue[Int]()
+ for(index < 0 until neighbours.length)
+
+ while(neighbourQueue.nonEmpty) {
+ var curr: Int = neighbourQueue.dequeue()
+ if(data(curr, 1) == 0) {
+ var currNeighbours: DenseVector = findNeighbours(curr, data, Eps)
+ data(curr, 1) = 1
+ if(currNeighbours.length >= Minpts) {
+ data(curr, 0) = 1 //coreFlag == True
+ for (index < 0 until currNeighbours.length)
+ }
+ if(data(curr, 3) == 1)
+ }
+ }
+ }
+
+ def DBSCAN(data: DenseMatrix, Eps: Double, Minpts: Int): Unit = {
+
+ var clusterId: Int = 0
+ for(i < 0 until data.nrow) {
+ if(data(i, 1) != 1) {
+ //i.e. if notProcessed...
+ val neighbours: DenseVector = findNeighbours(i, data, Eps)
+ data(i, 1) = 1.0 // ==> processedFlag = True
+ if(neighbours.length >= Minpts)
+ else
{ + data(i, 4) = 1.0 // ==> noiseFlag = True + } + }
+ }
+ }
+
+ /*
+ @args
+ point: Int  Id of the point whose neighbours are to be computed
+ data: DenseMatrix[Double]  The augmented matrix containing all the data points along with the 4 appended columns
+ Returns: A DenseVector containing the ID's of all the points that are at an Eps distance from point.
+ */
+ def findNeighbours(point: Int, data: DenseMatrix, Eps: Double) : DenseVector = {
+ val pointId: Int = data(point, 2).toInt
+ var neighbours: ArrayBuffer[Int] = new ArrayBuffer[Int]()
+ val pointData: DenseVector = dvec(data(pointId, :)
+ var neighbourCount = 0
+ for(row < data) {
+ if(row(0).toInt != pointId) {
+ val arg1 = dvec(row)
+ val arg2 = dvec(pointData)
+ if(distanceMetric(arg1, arg2) <= Eps)
+ }
+ }
+ var neighboursDvec: DenseVector = dvec(neighbours)
+ neighboursDvec
+ }
+
+  def addColumns(arg: Array[Double]): Array[Double] = {
+    val newArr = new Array[Double](arg.length + 5)
+    newArr(0) = 0.0  //coreFlag      //Initialize all points as non-core points
+    newArr(1) = 0.0  //processedFlag //Initialize all points as non-processed points
+    newArr(2) = -1.0 //globalId      //To be assigned a unique id per point
+    newArr(3) = -1.0 //clusterId     //Initialize all points as not belonging to any cluster
+    newArr(4) = 0.0  //noiseFlag     //Initialize all points as not-noise
+
+    for (i <- 0 until arg.size)
+      newArr(i + 5) = arg(i) //copy the original coordinates after the flag columns
+    newArr
+  }
+
+  /*
+  Takes two rows as input: rows that contain the coordinates of the data as well as the augmented columns added at the beginning.
+  Computes the distance over only the data part of the rows, ignoring the augmented columns.
+  //The distance metric is Euclidean distance as of now. Check and add other types of distance metrics and rename methods accordingly.
+  Also give the user the option of choosing the distance metric while running the algorithm.
+  */
+  def distanceMetric(arg1: DenseVector, arg2: DenseVector): Double = {
+    var diffsqSum = -1.0 //sentinel returned when the lengths do not match
+    if (arg1.length == arg2.length) {
+      val diff = arg1 - arg2
+      val diffsq = diff * diff
+      diffsqSum = 0.0
+      for (i <- 5 until diffsq.length) //skip the 5 augmented columns
+        diffsqSum += diffsq(i)
+      diffsqSum = math.sqrt(diffsqSum)
+    }
+    diffsqSum
+  }
— End diff –
Updated
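For posterity, the "leverage Mahout's current distance metrics" refactor discussed here amounts to making the metric pluggable. Below is a minimal stand-alone sketch of that idea; `DistanceMetric`, `Euclidean`, `Chebyshev`, and `withinEps` are illustrative names for this sketch only, not Mahout's actual `distance` package API:

```scala
// Sketch of a pluggable distance metric: the neighbourhood query takes the
// metric as a value instead of hard-coding Euclidean distance.
trait DistanceMetric {
  def distance(a: Array[Double], b: Array[Double]): Double
}

object Euclidean extends DistanceMetric {
  def distance(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
  }
}

object Chebyshev extends DistanceMetric {
  def distance(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    a.zip(b).map { case (x, y) => math.abs(x - y) }.max
  }
}

// A caller picks the metric once and passes it to the neighbourhood test.
def withinEps(metric: DistanceMetric, p: Array[Double], q: Array[Double], eps: Double): Boolean =
  metric.distance(p, q) <= eps
```

With this shape, `findNeighbours` would take a `DistanceMetric` argument rather than calling a fixed `distanceMetric` helper.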
Github user rawkintrevo commented on a diff in the pull request:
https://github.com/apache/mahout/pull/334#discussion_r136095747
— Diff: website/docs/algorithms/clustering/dbscan/index.md —
@@ -0,0 +1,101 @@
+---
+layout: algorithm
+title: DBSCAN
+theme:
+  name: retro-mahout
+---
+
+### About
+
+[DBSCAN Clustering](https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf)
+ is a simple and efficient density-based clustering algorithm for grouping objects
+ into clusters. DBSCAN differs from other clustering algorithms in the way it
+ clusters objects: it uses the density information of points to define clusters.
+ Each object is represented as a point in the multidimensional feature space. The algorithm
+ uses the notion of core points, border points and noise points to perform the clustering.
+
+ Parameters of the Algorithm: (Epsilon, Min points)
+ * Epsilon (Eps) - the radius of the sphere that defines the neighbourhood of a given data point
+ * Min points (Minpts) - the minimum number of points in a point p's Eps-neighbourhood required to
+ make p a core point
+
+ Basic Definitions
+ * Eps-neighbourhood of a point - the Eps-neighbourhood of a point 'p' is the set of points S
+ such that for each x belonging to S, distance(p, x) <= Eps
+ * Core point - a point 'p' is called a core point if the number of points in its Eps-neighbourhood is >= Minpts
+ * Border point - a point that is not a core point itself but has a core point 'p'
+ in its neighbourhood
+ * Noise point - any point that is neither a core point nor a border point. Noise points
+ do not belong to any cluster.
+
+ Algorithm:
+ The DBSCAN algorithm takes as parameters Eps and MinPts. Eps restricts the neighbourhood of a point, and MinPts
+ is the threshold on the number of points in the Eps-neighbourhood of a point p for it to form a cluster.
+ The algorithm randomly chooses a point q and finds its Eps-neighbourhood. If the number of points in the
+ Eps-neighbourhood of q is less than MinPts, q is marked as a noise point. Otherwise it is marked as a core point
+ and a new cluster is created. An Eps-neighbourhood query is then performed iteratively on each of
+ its neighbours, and points are added to the created cluster. When no unvisited point can be added, the
+ cluster is complete and no points will be added to it in subsequent iterations. Another unvisited point q'
+ is then picked and the same process is repeated. The algorithm terminates when all points have been visited,
+ i.e. when every point is either assigned to a cluster or marked as noise. The total number of
+ Eps-neighbourhood queries performed equals the size of the data, and if no indexing data structure is
+ used, each Eps-neighbourhood query involves a distance calculation against all the points.
+ Thus, without an indexing data structure, the complexity of the DBSCAN algorithm is O(n^2).
+
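The naive algorithm described above can be sketched in a few dozen lines of stand-alone Scala. This is an illustrative sketch only; the names `dbscan` and `regionQuery` are ours and this is not the Mahout implementation:

```scala
import scala.collection.mutable

// Naive O(n^2) DBSCAN over plain arrays. Labels: 0 = unvisited,
// -1 = noise, positive integers = cluster ids.
def regionQuery(points: Array[Array[Double]], i: Int, eps: Double): Seq[Int] =
  points.indices.filter { j =>
    // Euclidean distance; the query includes the point itself.
    math.sqrt(points(i).zip(points(j)).map { case (a, b) => (a - b) * (a - b) }.sum) <= eps
  }

def dbscan(points: Array[Array[Double]], eps: Double, minPts: Int): Array[Int] = {
  val labels = Array.fill(points.length)(0)
  var clusterId = 0
  for (i <- points.indices if labels(i) == 0) {
    val neighbours = regionQuery(points, i, eps)
    if (neighbours.length < minPts) labels(i) = -1 // tentatively noise
    else {
      clusterId += 1
      labels(i) = clusterId
      val queue = mutable.Queue(neighbours: _*)
      while (queue.nonEmpty) {
        val j = queue.dequeue()
        if (labels(j) == -1) labels(j) = clusterId // noise becomes a border point
        if (labels(j) == 0) {
          labels(j) = clusterId
          val jn = regionQuery(points, j, eps)
          if (jn.length >= minPts) queue ++= jn // j is a core point: expand through it
        }
      }
    }
  }
  labels
}
```

A point first marked as noise is relabelled as a border point if it later turns up in a core point's neighbourhood, which matches the border-point definition above.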
+#### Strategy for parallelization
+
+The DBSCAN algorithm is an inherently sequential algorithm. A quick look at the pseudo code provided in the original
+paper will show why.
+
+The parallelization strategy for DBSCAN involves 3 phases.
+* Data distribution phase preserving spatial locality
+* InCore DBSCAN Clustering
+* Merging InCore clusters to form Global clusters
+
+Few methods to parallelize DBSCAN have been proposed in the recent past and can be found [here](http://delivery.acm.org/10.1145/2390000/2389081/a62patwary.pdf?ip=14.139.128.15&id=2389081&acc=ACTIVE%20SERVICE&key=045416EF4DDA69D9%2EDB7584019D0D7099%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&CFID=973347130&CFTOKEN=86904048&__acm__=1503663674_474f7b3eaa352d59c2905c1f530907c1) and [here](http://delivery.acm.org/10.1145/2840000/2834894/a2gotz.pdf?ip=14.139.128.15&id=2834894&acc=ACTIVE%20SERVICE&key=045416EF4DDA69D9%2EDB7584019D0D7099%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&CFID=973347130&CFTOKEN=86904048&__acm__=1503663710_9001c4bdf3d4dc3ed6658d8a6aaf1190)
+
+In order to minimize communication costs for large datasets, we provide an approximate DBSCAN algorithm. Data is divided among nodes randomly,
+relaxing the spatial locality condition. Clustering is performed in-core, and cluster merging is done by merging pairs of clusters
+that are close to each other and contain a significant number of points.
+
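One hedged way to realize the merge phase sketched above: summarize each in-core cluster as a (centroid, size) pair and union pairs of summaries whose centroids lie within a merge radius. `ClusterSummary`, `mergeClusters`, `mergeEps`, and `minSize` are illustrative names for this sketch, not part of Mahout's API:

```scala
// Each in-core cluster is reduced to a centroid and a point count.
case class ClusterSummary(centroid: Array[Double], size: Int)

// Returns, for each local cluster index, the representative index of its
// merged global cluster (simple union-find with path compression).
def mergeClusters(summaries: Array[ClusterSummary], mergeEps: Double, minSize: Int): Array[Int] = {
  val parent = Array.tabulate(summaries.length)(identity)
  def find(i: Int): Int =
    if (parent(i) == i) i else { parent(i) = find(parent(i)); parent(i) }
  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  for {
    i <- summaries.indices
    j <- (i + 1) until summaries.length
    if summaries(i).size >= minSize && summaries(j).size >= minSize // only significant clusters
    if dist(summaries(i).centroid, summaries(j).centroid) <= mergeEps // only nearby clusters
  } parent(find(j)) = find(i)

  summaries.indices.map(find).toArray
}
```

A production version would merge on boundary points rather than centroids, but the centroid form shows the shape of the reduce step.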
+### Parameters
+<div class="tablestriped">
+ <table class="table">
+ <tr>
+ <th>Parameter</th>
+ <th>Description</th>
+ <th>Default Value</th>
+ </tr>
+ <tr>
+ <td><code>'distanceMeasure</code></td>
+ <td>The metric used for calculating distance, see <a href="../distancemetrics.html">Distance Metrics</a></td>
+ <td><code>'Euclidean</code></td>
+ </tr>
+ <tr>
+ <td><code>'Eps</code></td>
+ <td>The radius that defines the neighbourhood of a point</td>
+ <td>NA</td>
+ </tr>
+ <tr>
+ <td><code>'Minpts</code></td>
+ <td>The threshold for the number of points in the neighbourhood of another point</td>
+ <td>NA</td>
+ </tr>
+ </table>
+</div>
+
+
+### Example
+
+ //Using the InCoreDBSCAN module
+ val input = dense((1.0, 1.2, 1.3, 1.4), (1.1, 1.5, 2.5, 1.0), (6.0, 5.2, 5.2, 5.3), (7.0,6.0, 5.0, 5.0), (10.0, 1.0, 20.0, 10.0))
+
+ import org.apache.mahout.math.algorithms.clustering.InCoreDBSCAN
+ val distanceMetric = DistanceMetricSelector.select(DistanceMetricSelector.namedMetricLookup('Euclidean))
— End diff –
+1 utilizing existing distanceMetrics framework.
Github user rawkintrevo commented on the issue:
https://github.com/apache/mahout/pull/334
@AdityaAS @andrewpalumbo @dlyubimov
Afaik, the RTree/Rect portion of this is solid. I could really use those for something else I'm working on (ML/AI on Flink for upcoming talks).
As RTrees are a core component of many hierarchical clustering algorithms, does anyone have opinions on refactoring this JIRA into multiple issues, the first of which would be "Add RTree implementation" (with additional ones for DBSCAN etc.)?
Still proposing a feature branch for most of this; just splitting the RTrees, which are done, off into their own JIRA ticket/PR and merging.
Thoughts?
Github user andrewpalumbo commented on the issue:
https://github.com/apache/mahout/pull/334
@rawkintrevo @AdityaAS @dlyubimov Admittedly, I've not had a chance to review this closely, but I think that, if the RTrees are solid, moving them to something like `org.apache.mahout.math.common` and committing separately would make sense. We have other data structures from the pre-algo-framework days that need to be shared as well, e.g. `ConfusionMatrix`: https://github.com/apache/mahout/tree/master/math-scala/src/main/scala/org/apache/mahout/classifier/stats which could also go in such a package.
Github user andrewpalumbo commented on the issue:
https://github.com/apache/mahout/pull/334
@rawkintrevo to clarify, you're talking about a PR against `master` for RTrees, correct?
Github user andrewpalumbo commented on a diff in the pull request:
https://github.com/apache/mahout/pull/334#discussion_r136179872
— Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/clustering/DBSCAN.scala —
@@ -0,0 +1,217 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.mahout.math.algorithms.clustering
+
+import org.apache.mahout.math.scalabindings._
+import org.apache.mahout.math.scalabindings.RLikeOps._
+import org.apache.mahout.math._
+import org.apache.mahout.math.algorithms.common.distance._
+import org.apache.mahout.math.drm._
+import org.apache.mahout.math.drm.RLikeDrmOps._
+
+import scala.collection.mutable
+import scala.io.Source
+
+class DistributedDBSCAN extends ClusteringFitter {
+
+ var epsilon: Double = _
+ var minPts: Int = _
+ var distanceMeasure: Symbol = _
+
+  def setStandardHyperparameters(hyperparameters: Map[Symbol, Any] = Map('foo -> None)): Unit =
+
+ def fit[K](input: DrmLike[K],
+ hyperparameters: (Symbol, Any)*): DBSCANModel = {
+
+ setStandardHyperparameters(hyperparameters.toMap)
+ implicit val ctx = input.context
+ implicit val ktag = input.keyClassTag
+
+ val dmNumber = DistanceMetricSelector.namedMetricLookup(distanceMeasure)
+
+ val configBC = drmBroadcast(dvec(epsilon, minPts, dmNumber))
+
+ val clusters = input.allreduceBlock(
+ {
+ // Assign All Points to Clusters
+ case (keys, block: Matrix) =>
+ }, {
+ // Optionally Merge Clusters that are close enough
+ case (metadata1: Matrix, metadata2: Matrix) =>
+ })
+
+ val model = new DBSCANModel(1)
+ model.summary = s"""foo the bar"""
— End diff –
@AdityaAS do you have a plan for the model summary?
Github user andrewpalumbo commented on a diff in the pull request:
https://github.com/apache/mahout/pull/334#discussion_r136180989
— Diff: mathscala/src/test/scala/org/apache/mahout/math/algorithms/ClusteringSuiteBase.scala —
@@ -45,4 +48,17 @@ trait ClusteringSuiteBase extends DistributedMahoutSuite with Matchers
+
+ test("inCore DBSCAN test") {
+// import org.apache.mahout.math.algorithms.clustering._
+//
+// val dm = DistanceMetricSelector.namedMetricLookup('Euclidean)
+//// val url: URL = getClass().getResource("./data21.txt")
+// //TO DO //Need help
— End diff –
@AdityaAS if you need some toy data files for tests, you can place them in `$MAHOUT_HOME/examples/bin/resources`.
Github user andrewpalumbo commented on a diff in the pull request:
https://github.com/apache/mahout/pull/334#discussion_r136181125
— Diff: mathscala/src/test/scala/org/apache/mahout/math/algorithms/data21.txt —
@@ 0,0 +1,23 @@
+21
— End diff –
see above comment.
Github user AdityaAS commented on a diff in the pull request:
https://github.com/apache/mahout/pull/334#discussion_r136310677
— Diff: mathscala/src/test/scala/org/apache/mahout/math/algorithms/ClusteringSuiteBase.scala —
@@ 45,4 +48,17 @@ trait ClusteringSuiteBase extends DistributedMahoutSuite with Matchers
+
+ test("inCore DBSCAN test") {
+// import org.apache.mahout.math.algorithms.clustering._
+//
+// val dm = DistanceMetricSelector.namedMetricLookup('Euclidean)
+//// val url: URL = getClass().getResource("./data21.txt")
+// //TO DO //Need help
— End diff –
Ah. Thanks. I'll do that.
Github user rawkintrevo commented on a diff in the pull request:
https://github.com/apache/mahout/pull/334#discussion_r136356536
— Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/clustering/rtree/Rectangle.scala —
@@ -0,0 +1,157 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+
+package org.apache.mahout.math.algorithms.clustering.rtree
+
+import scala.math._
+
+/*
+ A rectangle is defined by two parameters: the coordinates of its top-right corner and the coordinates of its bottom-left corner.
+ */
+trait Rectangle {
+
+  require(bottomLeft.size == topRight.size) //Both corners must have the same dimensionality
+
+  def bottomLeft: List[Double] //Coordinates of bottom-left corner
+  def topRight: List[Double] //Coordinates of top-right corner
+  def n: Int = bottomLeft.size //Dimensionality
+
+  //Return the lower-left corner as a Point
+  def lowerLeftPoint: Point = Point(this.bottomLeft)
+
+  //Return the upper-right corner as a Point
+  def upperRightPoint: Point = Point(this.topRight)
+
+  //Checks if 'this' contains the Rectangle rect
+  def contains(rect: Rectangle): Boolean = {
+    for (i <- 0 until rect.n) {
+      if (!(this.bottomLeft(i) <= rect.bottomLeft(i) && rect.topRight(i) <= this.topRight(i)))
+        return false
+    }
+    true
+  }
+
+  //Checks if 'this' strictly wraps rect
+  def wraps(rect: Rectangle): Boolean = {
+    for (i <- 0 until rect.n) {
+      if (!(this.bottomLeft(i) < rect.bottomLeft(i) && rect.topRight(i) < this.topRight(i)))
+        return false
+    }
+    true
+  }
+
+  def intersects(rect: Rectangle): Boolean = {
+    for (i <- 0 until rect.n) {
+      if (this.topRight(i) < rect.bottomLeft(i) || rect.topRight(i) < this.bottomLeft(i))
+        return false
+    }
+    true
+  }
+
+  def area(): Double = {
+    var area = 1.0
+    for (i <- 0 until this.n)
+      area *= (this.topRight(i) - this.bottomLeft(i))
+    area
+  }
+
+  def toMBR: MBR = MBR(this.bottomLeft, this.topRight)
+
+  //Smallest rectangle enclosing both 'this' and rect
+  def expandRectangle(rect: Rectangle): MBR = {
+    val minBottomLeft = new Array[Double](rect.n)
+    val maxTopRight = new Array[Double](rect.n)
+    for (i <- 0 until rect.n) {
+      minBottomLeft(i) = min(this.bottomLeft(i), rect.bottomLeft(i))
+      maxTopRight(i) = max(this.topRight(i), rect.topRight(i))
+    }
+    MBR(minBottomLeft.toList, maxTopRight.toList)
+  }
+
+  //Increase in the area of 'this' rectangle after merging with 'rect'
+  def diffAreaAfterExpansion(rect: Rectangle): Double = {
+    var area: Double = 1.0
+    for (i <- 0 until rect.n)
+      area *= (max(this.topRight(i), rect.topRight(i)) - min(this.bottomLeft(i), rect.bottomLeft(i)))
+    area - this.area
+  }
+
+  //Distance of a point from the rectangle (zero if the point lies inside it).
+  def distance(pt: Point): Double = {
+    val n = pt.coordinates.size
+    val dList = new Array[Double](n)
+    for (dim <- 0 until n)
+      dList(dim) = max(0.0, max(this.bottomLeft(dim) - pt.coordinates(dim), pt.coordinates(dim) - this.topRight(dim)))
+    var squaredSum: Double = 0
+    for (i <- 0 until n)
+      squaredSum += dList(i) * dList(i)
+    sqrt(squaredSum)
+  }
+
+}
+
+//Represents a point with coordinates and n information
+case class Point(coordinates: List[Double]) extends Rectangle {
— End diff –
Why not move this to its own file?
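For reference, the per-dimension checks elided from the Rectangle diff above could read roughly as follows. `Rect` here is a stand-alone stand-in for the `Rectangle` trait, with assumed containment/intersection semantics, not the PR's actual class:

```scala
// Minimal n-dimensional rectangle with the usual R-tree predicates.
case class Rect(bottomLeft: List[Double], topRight: List[Double]) {
  require(bottomLeft.size == topRight.size) // corners must agree in dimensionality
  def n: Int = bottomLeft.size

  // 'this' contains rect iff rect lies inside 'this' in every dimension
  def contains(rect: Rect): Boolean =
    (0 until n).forall { i =>
      bottomLeft(i) <= rect.bottomLeft(i) && rect.topRight(i) <= topRight(i)
    }

  // rectangles intersect iff their intervals overlap in every dimension
  def intersects(rect: Rect): Boolean =
    (0 until n).forall { i =>
      bottomLeft(i) <= rect.topRight(i) && rect.bottomLeft(i) <= topRight(i)
    }

  // hyper-volume: product of side lengths over all dimensions
  def area: Double =
    (0 until n).map(i => topRight(i) - bottomLeft(i)).product

  // smallest rectangle enclosing both 'this' and rect (the MBR expansion)
  def expand(rect: Rect): Rect =
    Rect(
      (0 until n).map(i => math.min(bottomLeft(i), rect.bottomLeft(i))).toList,
      (0 until n).map(i => math.max(topRight(i), rect.topRight(i))).toList)
}
```

These are the standard minimum-bounding-rectangle operations an R-tree needs for insert and search.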
Github user rawkintrevo commented on a diff in the pull request:
https://github.com/apache/mahout/pull/334#discussion_r136356644
— Diff: mathscala/src/main/scala/org/apache/mahout/math/algorithms/clustering/rtree/Rectangle.scala —
@@ -0,0 +1,157 @@
[license header and Rectangle trait identical to the diff quoted above]
+//Represents a point with coordinates and dimensionality information
+case class Point(coordinates: List[Double]) extends Rectangle {
+
+ override def bottomLeft: List[Double] = coordinates
+ override def topRight: List[Double] = coordinates
+
+ override def n: Int = bottomLeft.size
+
+ override def lowerLeftPoint: Point = this
+ override def upperRightPoint: Point = this
+
+  override def distance(pt: Point): Double = {
+    var squaredSum: Double = 0
+    for (i <- 0 until pt.n)
+      squaredSum += (this.coordinates(i) - pt.coordinates(i)) * (this.coordinates(i) - pt.coordinates(i))
+    sqrt(squaredSum)
+  }
+
+ override def wraps(rect: Rectangle): Boolean = false
+}
+
+//MBR is a physical manifestation of the Rectangle trait
+case class MBR(bottomLeft: List[Double], topRight: List[Double]) extends Rectangle {
— End diff –
same with case class and companion object. (move to own file)
Github user andrewpalumbo commented on a diff in the pull request:
https://github.com/apache/mahout/pull/334#discussion_r136427426
— Diff: mathscala/src/test/scala/org/apache/mahout/math/algorithms/ClusteringSuiteBase.scala —
@@ 45,4 +48,17 @@ trait ClusteringSuiteBase extends DistributedMahoutSuite with Matchers
+
+ test("inCore DBSCAN test") {
+// import org.apache.mahout.math.algorithms.clustering._
+//
+// val dm = DistanceMetricSelector.namedMetricLookup('Euclidean)
+//// val url: URL = getClass().getResource("./data21.txt")
+// //TO DO //Need help
— End diff –
I apologize @AdityaAS, I'd not noticed that this `.txt` file was this short. No need to move it to `$MAHOUT_HOME/examples/bin/resources` for unit-test input. FYI though, if you write an example for this to be placed in `examples`, that's where you would place `.txt` files or any other resources.
Github user AdityaAS commented on a diff in the pull request:
https://github.com/apache/mahout/pull/334#discussion_r136441501
— Diff: mathscala/src/main/scala/org/apache/mahout/math/algorithms/clustering/DBSCAN.scala —
@@ -0,0 +1,217 @@
[diff content identical to the DBSCAN.scala diff quoted above]
+ val model = new DBSCANModel(1)
+ model.summary = s"""foo the bar"""
— End diff –
Yes. I do. Will update it.
Github user AdityaAS commented on a diff in the pull request:
https://github.com/apache/mahout/pull/334#discussion_r136441895
— Diff: mathscala/src/main/scala/org/apache/mahout/math/algorithms/clustering/rtree/Rectangle.scala —
@@ -0,0 +1,157 @@
[diff content identical to the Rectangle.scala diff quoted above]
+
+//MBR is a physical manifestation of the Rectangle trait
+case class MBR(bottomLeft: List[Double], topRight: List[Double]) extends Rectangle {
— End diff –
My idea was to keep the classes describing the data structures in one file and the logic in another. But if having different files per class is good design, I'll update it.
Github user AdityaAS commented on the issue:
https://github.com/apache/mahout/pull/334
@rawkintrevo @andrewpalumbo having a separate PR for RTrees is a good idea. RTrees can be used in other algorithms as well (kNN, for example). Should I go ahead and create a new issue on JIRA?
Github user rawkintrevo commented on the issue:
https://github.com/apache/mahout/pull/334
@AdityaAS Yes please. Also don't forget to add a jira ticket.
Github user rawkintrevo commented on the issue:
https://github.com/apache/mahout/pull/334
status on this @AdityaAS ?
GitHub user AdityaAS opened a pull request:
https://github.com/apache/mahout/pull/334
[WIP] MAHOUT-1991
Added code for in-core DBSCAN (O(n^2) complexity).
I added the code; I need feedback from the community members on structuring it properly and making it usable. I keep getting confused about how to structure it appropriately.
@rawkintrevo do help me out.
Please mark each with an "x"
If all of these things aren't complete, but you still feel it is
appropriate to open a PR, please add [WIP] after MAHOUT-XXXX before the
description, e.g. "MAHOUT-XXXX [WIP] Description of Change"
Does this change break earlier versions?
Is this the beginning of a larger project for which a feature branch should be made?
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/AdityaAS/mahout MAHOUT-1991
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/mahout/pull/334.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #334
commit 790005be3b149de2b515ddb2b77eb75ab73d0f1a
Author: AdityaAS <asaditya1195@gmail.com>
Date: 2017-07-19T18:54:47Z
Added code for in-core DBSCAN (O(n^2) complexity).