Mahout / MAHOUT-1991

Implement naive DBSCAN Algorithm - O(n^2) complexity

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Algorithms
    • Labels: None

      Description

      Implement the naive DBSCAN algorithm in Mahout Samsara, as part of the Algorithms Framework.
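      For reference, the naive O(n^2) algorithm this ticket asks for can be sketched in plain Scala. This is a minimal illustration only, with no Mahout types; all names below are hypothetical and not part of the Mahout API:

```scala
// Minimal, self-contained sketch of naive O(n^2) DBSCAN in plain Scala.
// No Mahout types are used; every name below is illustrative, not Mahout API.
// Labels: -2 = unvisited, -1 = noise, 0.. = cluster id.
object NaiveDbscan {

  def euclidean(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  def run(points: Array[Array[Double]], eps: Double, minPts: Int): Array[Int] = {
    val n = points.length
    val labels = Array.fill(n)(-2)

    // Linear scan per query point -- this is what makes the algorithm O(n^2).
    def neighbours(i: Int): Seq[Int] =
      (0 until n).filter(j => euclidean(points(i), points(j)) <= eps)

    var cluster = -1
    for (i <- 0 until n if labels(i) == -2) {
      val ns = neighbours(i)
      if (ns.size < minPts) labels(i) = -1 // provisionally noise
      else {
        cluster += 1
        labels(i) = cluster
        val queue = scala.collection.mutable.Queue(ns: _*)
        while (queue.nonEmpty) {
          val j = queue.dequeue()
          if (labels(j) == -1) labels(j) = cluster // noise reachable from a core point => border
          if (labels(j) == -2) {
            labels(j) = cluster
            val njs = neighbours(j)
            if (njs.size >= minPts) queue ++= njs // j is itself a core point
          }
        }
      }
    }
    labels
  }
}
```

      The quadratic cost comes from rescanning all n points for every region query; a spatial index would reduce that, but is out of scope for this naive version.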

        Issue Links

          Activity

          githubbot ASF GitHub Bot added a comment -

          GitHub user AdityaAS opened a pull request:

          https://github.com/apache/mahout/pull/334

          [WIP] MAHOUT-1991

          Added code for incore DBSCAN ( O(n^2) ) complexity.

          I have added the code and need feedback from the community on structuring it properly and making it usable; I keep getting confused about how to structure it appropriately.
          @rawkintrevo, please help me out.

              1. Important ToDos
                Please mark each with an "x"
          • [ ] A JIRA ticket exists (if not, please create this first) https://issues.apache.org/jira/browse/ZEPPELIN/
          • [ ] Title of PR is "MAHOUT-XXXX Brief Description of Changes" where XXXX is the JIRA number.
          • [ ] Created unit tests where appropriate
          • [ ] Added correct licenses on newly added files
          • [ ] Assigned JIRA to self
          • [ ] Added documentation in scala docs/java docs, and to website
          • [ ] Successfully built and ran all unit tests, verified that all tests pass locally.

          If all of these things aren't complete, but you still feel it is
          appropriate to open a PR, please add [WIP] after MAHOUT-XXXX before the
          description, e.g. "MAHOUT-XXXX [WIP] Description of Change"

          Does this change break earlier versions?

          Is this the beginning of a larger project for which a feature branch should be made?

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/AdityaAS/mahout MAHOUT-1991

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/mahout/pull/334.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #334


          commit 790005be3b149de2b515ddb2b77eb75ab73d0f1a
          Author: AdityaAS <asaditya1195@gmail.com>
          Date: 2017-07-19T18:54:47Z

          Added code for incore DBSCAN ( O(n^2) ) complexity.


          githubbot ASF GitHub Bot added a comment -

          Github user rawkintrevo commented on the issue:

          https://github.com/apache/mahout/pull/334

          Aditya and I talked about this on a hangout; for posterity: it needs to be refactored into incore/distributed, made to fit the Mahout Algos framework, and distanceMetric should leverage Mahout's current distance metrics.
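          For illustration, the pluggable-metric refactor suggested here could look roughly like the following plain-Scala sketch; the trait and object names are made up and are not Mahout's actual distance-metric API:

```scala
// Hypothetical sketch of a pluggable distance metric, illustrating the
// "leverage Mahout's current distance metrics" suggestion. The trait and
// object names here are invented, NOT Mahout's actual API.
trait DistanceMetric {
  def distance(a: Array[Double], b: Array[Double]): Double
}

object Euclidean extends DistanceMetric {
  def distance(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
}

object Chebyshev extends DistanceMetric {
  def distance(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => math.abs(x - y) }.max
}

// findNeighbours would then take the metric as a parameter instead of
// hard-coding Euclidean distance:
def neighbours(points: Array[Array[Double]], i: Int, eps: Double,
               metric: DistanceMetric): Seq[Int] =
  points.indices.filter(j => metric.distance(points(i), points(j)) <= eps)
```

          Passing the metric as a parameter keeps the neighbour search agnostic to the distance function, which is the shape the Algos framework refactor would want.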

          githubbot ASF GitHub Bot added a comment -

          Github user dlyubimov commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/334#discussion_r128654179

          — Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/clustering/DBSCAN_Incore.scala —
          @@ -0,0 +1,142 @@
          +/**
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing,
          + * software distributed under the License is distributed on an
          + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
          + * KIND, either express or implied. See the License for the
          + * specific language governing permissions and limitations
          + * under the License.
          + */
          +package org.apache.mahout.math.algorithms.clustering
          +
          +import scala.collection.JavaConversions.iterableAsScalaIterable
          +import org.apache.mahout.math.scalabindings.::
          +import org.apache.mahout.math.scalabindings.RLikeOps.m2mOps
          +import org.apache.mahout.math.scalabindings.RLikeOps.v2vOps
          +import org.apache.mahout.math.scalabindings.dvec
          +import org.apache.mahout.math._
          +import scalabindings._
          +import scala.collection.mutable.ArrayBuffer
          +import scala.collection.mutable.Queue
          +
          +class DBSCAN_Incore {
          +
          + def expandCluster(i: Int, data: DenseMatrix, neighbours: DenseVector, clusterId: Int, Eps: Double, Minpts: Int) = {
          + data(i, 3) = clusterId.toDouble
          + var neighbourQueue = new Queue[Int]()
          + for(index <- 0 until neighbours.length) {
          + neighbourQueue += neighbours(index).toInt
          + }
          +
          + while(neighbourQueue.nonEmpty) {
          + var curr: Int = neighbourQueue.dequeue()
          + if(data(curr, 1) == 0) {
          + var currNeighbours: DenseVector = findNeighbours(curr, data, Eps)
          + data(curr, 1) = 1
          + if(currNeighbours.length >= Minpts) {
          + data(curr, 0) = 1 //coreFlag == True
          + for (index <- 0 until currNeighbours.length) {
          + neighbourQueue += currNeighbours(index).toInt
          + }
          + }
          + if(data(curr, 3) == -1) {
          + data(curr, 3) = clusterId
          + }
          + }
          + }
          + }
          +
          + def DBSCAN(data: DenseMatrix, Eps: Double, Minpts: Int): Unit = {
          +
          + var clusterId: Int = 0
          + for(i <- 0 until data.nrow) {
          + if(data(i, 1) != 1) {
          + //i.e. if notProcessed...
          + val neighbours: DenseVector = findNeighbours(i, data, Eps)
          + data(i, 1) = 1.0 // ==> processedFlag = True
          + if(neighbours.length >= Minpts) {
          + // ==> i corresponds to a core point.
          + data(i, 0) = 1 // ==> coreFlag = True
          + // data(i, 3) = clusterId
          + expandCluster(i, data, neighbours, clusterId, Eps, Minpts)
          + clusterId += 1
          + }
          + else {
          + data(i, 4) = 1.0 // ==> noiseFlag = True
          + }
          + }
          + }
          + }
          +
          + /*
          + @args
          + point: Int - Id of the point whose neighbours are to be computed
          + data: DenseMatrix[Double] - The augmented matrix containing all the data points along with the 4 appended columns
          + Returns: A DenseVector containing the ID's of all the points that are at an Eps distance from point.
          + */
          + def findNeighbours(point: Int, data: DenseMatrix, Eps: Double) : DenseVector = {
          + val pointId: Int = data(point, 2).toInt
          + var neighbours: ArrayBuffer[Int] = new ArrayBuffer[Int]()
          + val pointData: DenseVector = dvec(data(pointId, ::))
          — End diff –

          redundant dvec (imo), slice is already an `o.a.m.math.Vector`

          githubbot ASF GitHub Bot added a comment -

          Github user dlyubimov commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/334#discussion_r128653815

          — Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/clustering/DBSCAN_Incore.scala —
          @@ -0,0 +1,142 @@
          +/**
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing,
          + * software distributed under the License is distributed on an
          + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
          + * KIND, either express or implied. See the License for the
          + * specific language governing permissions and limitations
          + * under the License.
          + */
          +package org.apache.mahout.math.algorithms.clustering
          +
          +import scala.collection.JavaConversions.iterableAsScalaIterable
          +import org.apache.mahout.math.scalabindings.::
          +import org.apache.mahout.math.scalabindings.RLikeOps.m2mOps
          +import org.apache.mahout.math.scalabindings.RLikeOps.v2vOps
          +import org.apache.mahout.math.scalabindings.dvec
          +import org.apache.mahout.math._
          +import scalabindings._
          +import scala.collection.mutable.ArrayBuffer
          +import scala.collection.mutable.Queue
          +
          +class DBSCAN_Incore {
          +
          + def expandCluster(i: Int, data: DenseMatrix, neighbours: DenseVector, clusterId: Int, Eps: Double, Minpts: Int) = {
          + data(i, 3) = clusterId.toDouble
          + var neighbourQueue = new Queue[Int]()
          + for(index <- 0 until neighbours.length) {
          + neighbourQueue += neighbours(index).toInt
          + }
          +
          + while(neighbourQueue.nonEmpty) {
          + var curr: Int = neighbourQueue.dequeue()
          + if(data(curr, 1) == 0) {
          + var currNeighbours: DenseVector = findNeighbours(curr, data, Eps)
          + data(curr, 1) = 1
          + if(currNeighbours.length >= Minpts) {
          + data(curr, 0) = 1 //coreFlag == True
          + for (index <- 0 until currNeighbours.length) {
          + neighbourQueue += currNeighbours(index).toInt
          + }
          + }
          + if(data(curr, 3) == -1) {
          + data(curr, 3) = clusterId
          + }
          + }
          + }
          + }
          +
          + def DBSCAN(data: DenseMatrix, Eps: Double, Minpts: Int): Unit = {
          +
          + var clusterId: Int = 0
          + for(i <- 0 until data.nrow) {
          + if(data(i, 1) != 1) {
          + //i.e. if notProcessed...
          + val neighbours: DenseVector = findNeighbours(i, data, Eps)
          + data(i, 1) = 1.0 // ==> processedFlag = True
          + if(neighbours.length >= Minpts) {
          + // ==> i corresponds to a core point.
          + data(i, 0) = 1 // ==> coreFlag = True
          + // data(i, 3) = clusterId
          + expandCluster(i, data, neighbours, clusterId, Eps, Minpts)
          + clusterId += 1
          + }
          + else {
          + data(i, 4) = 1.0 // ==> noiseFlag = True
          + }
          + }
          + }
          + }
          +
          + /*
          + @args
          + point: Int - Id of the point whose neighbours are to be computed
          + data: DenseMatrix[Double] - The augmented matrix containing all the data points along with the 4 appended columns
          + Returns: A DenseVector containing the ID's of all the points that are at an Eps distance from point.
          + */
          + def findNeighbours(point: Int, data: DenseMatrix, Eps: Double) : DenseVector = {
          + val pointId: Int = data(point, 2).toInt
          + var neighbours: ArrayBuffer[Int] = new ArrayBuffer[Int]()
          + val pointData: DenseVector = dvec(data(pointId, ::))
          + var neighbourCount = 0
          + for(row <- data) {
          + if(row(0).toInt != pointId) {
          + val arg1 = dvec(row)
          — End diff –

          i am not sure if `dvec` use is warranted... row is already an `o.a.m.math.Vector`

          githubbot ASF GitHub Bot added a comment -

          Github user dlyubimov commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/334#discussion_r128653491

          — Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/clustering/DBSCAN_Incore.scala —
          @@ -0,0 +1,142 @@
          +/**
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing,
          + * software distributed under the License is distributed on an
          + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
          + * KIND, either express or implied. See the License for the
          + * specific language governing permissions and limitations
          + * under the License.
          + */
          +package org.apache.mahout.math.algorithms.clustering
          +
          +import scala.collection.JavaConversions.iterableAsScalaIterable
          +import org.apache.mahout.math.scalabindings.::
          +import org.apache.mahout.math.scalabindings.RLikeOps.m2mOps
          +import org.apache.mahout.math.scalabindings.RLikeOps.v2vOps
          +import org.apache.mahout.math.scalabindings.dvec
          +import org.apache.mahout.math._
          +import scalabindings._
          +import scala.collection.mutable.ArrayBuffer
          +import scala.collection.mutable.Queue
          +
          +class DBSCAN_Incore {
          +
          + def expandCluster(i: Int, data: DenseMatrix, neighbours: DenseVector, clusterId: Int, Eps: Double, Minpts: Int) = {
          + data(i, 3) = clusterId.toDouble
          + var neighbourQueue = new Queue[Int]()
          + for(index <- 0 until neighbours.length) {
          + neighbourQueue += neighbours(index).toInt
          + }
          +
          + while(neighbourQueue.nonEmpty) {
          + var curr: Int = neighbourQueue.dequeue()
          + if(data(curr, 1) == 0) {
          + var currNeighbours: DenseVector = findNeighbours(curr, data, Eps)
          + data(curr, 1) = 1
          + if(currNeighbours.length >= Minpts) {
          + data(curr, 0) = 1 //coreFlag == True
          + for (index <- 0 until currNeighbours.length) {
          + neighbourQueue += currNeighbours(index).toInt
          + }
          + }
          + if(data(curr, 3) == -1) {
          + data(curr, 3) = clusterId
          + }
          + }
          + }
          + }
          +
          + def DBSCAN(data: DenseMatrix, Eps: Double, Minpts: Int): Unit = {
          +
          + var clusterId: Int = 0
          + for(i <- 0 until data.nrow) {
          + if(data(i, 1) != 1) {
          + //i.e. if notProcessed...
          + val neighbours: DenseVector = findNeighbours(i, data, Eps)
          + data(i, 1) = 1.0 // ==> processedFlag = True
          + if(neighbours.length >= Minpts) {
          + // ==> i corresponds to a core point.
          + data(i, 0) = 1 // ==> coreFlag = True
          + // data(i, 3) = clusterId
          + expandCluster(i, data, neighbours, clusterId, Eps, Minpts)
          + clusterId += 1
          + }
          + else {
          + data(i, 4) = 1.0 // ==> noiseFlag = True
          + }
          + }
          + }
          + }
          +
          + /*
          + @args
          + point: Int - Id of the point whose neighbours are to be computed
          + data: DenseMatrix[Double] - The augmented matrix containing all the data points along with the 4 appended columns
          + Returns: A DenseVector containing the ID's of all the points that are at an Eps distance from point.
          + */
          + def findNeighbours(point: Int, data: DenseMatrix, Eps: Double) : DenseVector = {
          + val pointId: Int = data(point, 2).toInt
          + var neighbours: ArrayBuffer[Int] = new ArrayBuffer[Int]()
          + val pointData: DenseVector = dvec(data(pointId, ::))
          + var neighbourCount = 0
          + for(row <- data) {
          + if(row(0).toInt != pointId) {
          + val arg1 = dvec(row)
          + val arg2 = dvec(pointData)
          + if(distanceMetric(arg1, arg2) <= Eps) {
          + neighbourCount += 1
          + neighbours += pointData(2).toInt
          + }
          + }
          + }
          + var neighboursDvec: DenseVector = dvec(neighbours)
          + neighboursDvec
          + }
          +
          + def addColumns(arg: Array[Double]): Array[Double] = {
          + val newArr = new Array[Double](arg.length + 5)
          + newArr(0) = 0.0 //coreFlag //Initialize all points as non-core points
          + newArr(1) = 0.0 //processedFlag //Initialize all points as non-processed points
          + newArr(2) = -1.0 //globalId
          + newArr(3) = -1.0 //clusterId //Initialize all points as not belonging to any cluster
          + newArr(4) = 0.0 //noiseFlag //Initialize all points as not-Noise
          +
          + for (i <- 0 until (arg.size)) {
          + newArr(i + 5) = arg(i)
          + }
          + newArr
          + }
          +
          + /*
          + Takes in two rows as input. Rows that contain the co-ordinates of the data as well as the augmented columns added at the beginning.
          + Computes distance of only the data part of the rows, ignoring the augmented columns.
          + //DistanceMetric is Euclidean distance as of now, Check and add other types of distance metrics and rename methods accordingly
          + Also give user the option of choosing the distance metric while running the algorithm.
          + */
          + def distanceMetric(arg1: DenseVector, arg2: DenseVector) : Double = {
          + var diffsqSum = -1.0
          + if(arg1.length != arg2.length) {
          + return diffsqSum
          + }
          + else {
          + val diff = arg1 - arg2
          + val diffsq = diff^2
          + diffsqSum = 0.0
          + for(i <- 4 until diffsq.length) {
          + diffsqSum += diffsq(i)
          + }
          + diffsqSum = math.sqrt(diffsqSum)
          — End diff –

          i'd just import scala.math._ and use just sqrt(a) etc.
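          The suggestion in minimal form: with scala.math._ in scope, the math functions can be called unqualified.

```scala
import scala.math._

// Once scala.math._ is imported, sqrt/pow/abs etc. need no "math." prefix.
val hypotenuse = sqrt(pow(3.0, 2) + pow(4.0, 2))
```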

          Show
          githubbot ASF GitHub Bot added a comment - Github user dlyubimov commented on a diff in the pull request: https://github.com/apache/mahout/pull/334#discussion_r128653491 — Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/clustering/DBSCAN_Incore.scala — @@ -0,0 +1,142 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
          githubbot ASF GitHub Bot added a comment -

          Github user dlyubimov commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/334#discussion_r128653039

          — Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/clustering/DBSCAN_Incore.scala —
          @@ -0,0 +1,142 @@
          +/**
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing,
          + * software distributed under the License is distributed on an
          + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
          + * KIND, either express or implied. See the License for the
          + * specific language governing permissions and limitations
          + * under the License.
          + */
          +package org.apache.mahout.math.algorithms.clustering
          +
          +import scala.collection.JavaConversions.iterableAsScalaIterable
          +import org.apache.mahout.math.scalabindings.::
          +import org.apache.mahout.math.scalabindings.RLikeOps.m2mOps
          +import org.apache.mahout.math.scalabindings.RLikeOps.v2vOps
          +import org.apache.mahout.math.scalabindings.dvec
          +import org.apache.mahout.math._
          +import scalabindings._
          +import scala.collection.mutable.ArrayBuffer
          +import scala.collection.mutable.Queue
          +
          +class DBSCAN_Incore {
          +
          + def expandCluster(i: Int, data: DenseMatrix, neighbours: DenseVector, clusterId: Int, Eps: Double, Minpts: Int) = {
          + data(i, 3) = clusterId.toDouble
          + var neighbourQueue = new Queue[Int]()
           + for(index <- 0 until neighbours.length) {
           +   neighbourQueue += neighbours(index).toInt
           + }
          +
          + while(neighbourQueue.nonEmpty) {
          + var curr: Int = neighbourQueue.dequeue()
          + if(data(curr, 1) == 0) {
          + var currNeighbours: DenseVector = findNeighbours(curr, data, Eps)
          + data(curr, 1) = 1
          + if(currNeighbours.length >= Minpts) {
          + data(curr, 0) = 1 //coreFlag == True
           + for (index <- 0 until currNeighbours.length) {
           +   neighbourQueue += currNeighbours(index).toInt
           + }
          + }
           + if(data(curr, 3) == -1) {
           +   data(curr, 3) = clusterId
           + }
          + }
          + }
          + }
          +
          + def DBSCAN(data: DenseMatrix, Eps: Double, Minpts: Int): Unit = {
          — End diff –

          do we really care to force DenseMatrix vs. just generic Matrix?
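For illustration, the distinction the reviewer is pointing at can be sketched in plain Scala with toy stand-in types (hypothetical; not Mahout's actual `Matrix`/`DenseMatrix`): code written against a trait accepts any implementation.

```scala
// Hypothetical sketch of typing against a Matrix trait instead of a
// concrete DenseMatrix. These toy types only stand in for Mahout's.
trait Matrix {
  def nrow: Int
  def apply(r: Int, c: Int): Double
}

// Two implementations standing in for dense vs. other matrix flavours.
class Dense(data: Array[Array[Double]]) extends Matrix {
  def nrow: Int = data.length
  def apply(r: Int, c: Int): Double = data(r)(c)
}

class Diagonal(d: Array[Double]) extends Matrix {
  def nrow: Int = d.length
  def apply(r: Int, c: Int): Double = if (r == c) d(r) else 0.0
}

// A routine written against the trait accepts both implementations.
def trace(m: Matrix): Double = (0 until m.nrow).map(i => m(i, i)).sum

println(trace(new Dense(Array(Array(1.0, 2.0), Array(3.0, 4.0)))))  // 5.0
println(trace(new Diagonal(Array(2.0, 3.0))))                       // 5.0
```

With the concrete `DenseMatrix` signature, the `Diagonal` call above would not compile; with the trait, both pass.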

          githubbot ASF GitHub Bot added a comment -

          Github user dlyubimov commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/334#discussion_r128653396

          — Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/clustering/DBSCAN_Incore.scala —
          @@ -0,0 +1,142 @@
          +package org.apache.mahout.math.algorithms.clustering
          +
          +import scala.collection.JavaConversions.iterableAsScalaIterable
          +import org.apache.mahout.math.scalabindings.::
          +import org.apache.mahout.math.scalabindings.RLikeOps.m2mOps
          +import org.apache.mahout.math.scalabindings.RLikeOps.v2vOps
          +import org.apache.mahout.math.scalabindings.dvec
          +import org.apache.mahout.math._
          +import scalabindings._
          +import scala.collection.mutable.ArrayBuffer
          +import scala.collection.mutable.Queue
          +
          +class DBSCAN_Incore {
          +
          + def expandCluster(i: Int, data: DenseMatrix, neighbours: DenseVector, clusterId: Int, Eps: Double, Minpts: Int) = {
          + data(i, 3) = clusterId.toDouble
          + var neighbourQueue = new Queue[Int]()
           + for(index <- 0 until neighbours.length) {
           +   neighbourQueue += neighbours(index).toInt
           + }
          +
          + while(neighbourQueue.nonEmpty) {
          + var curr: Int = neighbourQueue.dequeue()
          + if(data(curr, 1) == 0) {
          + var currNeighbours: DenseVector = findNeighbours(curr, data, Eps)
          + data(curr, 1) = 1
          + if(currNeighbours.length >= Minpts) {
          + data(curr, 0) = 1 //coreFlag == True
           + for (index <- 0 until currNeighbours.length) {
           +   neighbourQueue += currNeighbours(index).toInt
           + }
          + }
           + if(data(curr, 3) == -1) {
           +   data(curr, 3) = clusterId
           + }
          + }
          + }
          + }
          +
          + def DBSCAN(data: DenseMatrix, Eps: Double, Minpts: Int): Unit = {
          +
          + var clusterId: Int = 0
          + for(i <- 0 until data.nrow) {
          + if(data(i, 1) != 1) {
          + //i.e. if notProcessed...
          + val neighbours: DenseVector = findNeighbours(i, data, Eps)
          + data(i, 1) = 1.0 // ==> processedFlag = True
           + if(neighbours.length >= Minpts) {
           +   // ==> i corresponds to a core point.
           +   data(i, 0) = 1 // ==> coreFlag = True
           +   // data(i, 3) = clusterId
           +   expandCluster(i, data, neighbours, clusterId, Eps, Minpts)
           +   clusterId += 1
           + } else {
           +   data(i, 4) = 1.0 // ==> noiseFlag = True
           + }

          + }
          + }
          + }
          +
          + /*
          + @args
          + point: Int - Id of the point whose neighbours are to be computed
          + data: DenseMatrix[Double] - The augmented matrix containing all the data points along with the 4 appended columns
          + Returns: A DenseVector containing the ID's of all the points that are at an Eps distance from point.
          + */
          + def findNeighbours(point: Int, data: DenseMatrix, Eps: Double) : DenseVector = {
          + val pointId: Int = data(point, 2).toInt
          + var neighbours: ArrayBuffer[Int] = new ArrayBuffer[Int]()
          + val pointData: DenseVector = dvec(data(pointId, :)
          + var neighbourCount = 0
          + for(row <- data) {
          + if(row(0).toInt != pointId) {
          + val arg1 = dvec(row)
          + val arg2 = dvec(pointData)
           + if(distanceMetric(arg1, arg2) <= Eps) {
           +   neighbourCount += 1
           +   neighbours += pointData(2).toInt
           + }
          + }
          + }
          + var neighboursDvec: DenseVector = dvec(neighbours)
          + neighboursDvec
          + }
          +
          + def addColumns(arg: Array[Double]): Array[Double] = {
          + val newArr = new Array[Double](arg.length + 5)
          + newArr(0) = 0.0 //coreFlag //Initialize all points as non-core points
          + newArr(1) = 0.0 //processedFlag //Initialize all points as non-processed points
          + newArr(2) = -1.0 //globalId
          + newArr(3) = -1.0 //clusterId //Initialize all points as not belonging to any cluster
          + newArr(4) = 0.0 //noiseFlag //Initialize all points as not-Noise
          +
           + for (i <- 0 until (arg.size)) {
           +   newArr(i + 5) = arg(i)
           + }
          + newArr
          + }
          +
          + /*
          + Takes in two rows as input. Rows that contain the co-ordinates of the data as well as the augmented columns added at the beginning.
          + Computes distance of only the data part of the rows, ignoring the augmented columns.
          + //DistanceMetric is Euclidean distance as of now, Check and add other types of distance metrics and rename methods accordingly
          + Also give user the option of choosing the distance metric while running the algorithm.
          + */
          + def distanceMetric(arg1: DenseVector, arg2: DenseVector) : Double = {
          + var diffsqSum = -1.0
           + if(arg1.length != arg2.length) {
           +   return diffsqSum
           + }
          + else {
          + val diff = arg1 - arg2
          + val diffsq = diff^2
          + diffsqSum = 0.0
          + for(i <- 4 until diffsq.length) {
          — End diff –

          DSL suggestion for this loop (not necessarily faster though, but DSL-er :) ):

          diffSqSum += diffsq(4 until diffsq.length).sum
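The equivalence this suggestion relies on can be sketched in plain Scala (a toy array standing in for the Mahout vector; the leading entries stand in for the augmented flag columns excluded from the distance):

```scala
// Plain-Scala illustration of the slice-and-sum suggestion: summing
// entries 4..n-1 in a loop is equivalent to slicing from index 4 and
// summing the slice.
val diffsq = Array(9.0, 9.0, 9.0, 9.0, 1.0, 4.0, 9.0)

// Loop form, as in the PR:
var diffsqSum = 0.0
for (i <- 4 until diffsq.length) diffsqSum += diffsq(i)

// Sliced form, as suggested:
val sliced = diffsq.slice(4, diffsq.length).sum

assert(diffsqSum == sliced)  // both 14.0
println(math.sqrt(sliced))   // the resulting Euclidean distance
```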

          githubbot ASF GitHub Bot added a comment -

          Github user rawkintrevo commented on the issue:

          https://github.com/apache/mahout/pull/334

          @andrewpalumbo proposing merging this to feature branch. It ended up being a bit hairier than we expected.

          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo commented on the issue:

          https://github.com/apache/mahout/pull/334

          Sounds good to me. Thx.

          githubbot ASF GitHub Bot added a comment -

          Github user AdityaAS commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/334#discussion_r135379456

          — Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/clustering/DBSCAN_Incore.scala —
          @@ -0,0 +1,142 @@
          +package org.apache.mahout.math.algorithms.clustering
          +
          +import scala.collection.JavaConversions.iterableAsScalaIterable
          +import org.apache.mahout.math.scalabindings.::
          +import org.apache.mahout.math.scalabindings.RLikeOps.m2mOps
          +import org.apache.mahout.math.scalabindings.RLikeOps.v2vOps
          +import org.apache.mahout.math.scalabindings.dvec
          +import org.apache.mahout.math._
          +import scalabindings._
          +import scala.collection.mutable.ArrayBuffer
          +import scala.collection.mutable.Queue
          +
          +class DBSCAN_Incore {
          +
          + def expandCluster(i: Int, data: DenseMatrix, neighbours: DenseVector, clusterId: Int, Eps: Double, Minpts: Int) = {
          + data(i, 3) = clusterId.toDouble
          + var neighbourQueue = new Queue[Int]()
           + for(index <- 0 until neighbours.length) {
           +   neighbourQueue += neighbours(index).toInt
           + }
          +
          + while(neighbourQueue.nonEmpty) {
          + var curr: Int = neighbourQueue.dequeue()
          + if(data(curr, 1) == 0) {
          + var currNeighbours: DenseVector = findNeighbours(curr, data, Eps)
          + data(curr, 1) = 1
          + if(currNeighbours.length >= Minpts) {
          + data(curr, 0) = 1 //coreFlag == True
           + for (index <- 0 until currNeighbours.length) {
           +   neighbourQueue += currNeighbours(index).toInt
           + }
          + }
           + if(data(curr, 3) == -1) {
           +   data(curr, 3) = clusterId
           + }
          + }
          + }
          + }
          +
          + def DBSCAN(data: DenseMatrix, Eps: Double, Minpts: Int): Unit = {
          — End diff –

          Nope. Changed all occurrences accordingly

          githubbot ASF GitHub Bot added a comment -

          Github user AdityaAS commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/334#discussion_r135379575

          — Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/clustering/DBSCAN_Incore.scala —
          @@ -0,0 +1,142 @@
          +package org.apache.mahout.math.algorithms.clustering
          +
          +import scala.collection.JavaConversions.iterableAsScalaIterable
          +import org.apache.mahout.math.scalabindings.::
          +import org.apache.mahout.math.scalabindings.RLikeOps.m2mOps
          +import org.apache.mahout.math.scalabindings.RLikeOps.v2vOps
          +import org.apache.mahout.math.scalabindings.dvec
          +import org.apache.mahout.math._
          +import scalabindings._
          +import scala.collection.mutable.ArrayBuffer
          +import scala.collection.mutable.Queue
          +
          +class DBSCAN_Incore {
          +
          + def expandCluster(i: Int, data: DenseMatrix, neighbours: DenseVector, clusterId: Int, Eps: Double, Minpts: Int) = {
          + data(i, 3) = clusterId.toDouble
          + var neighbourQueue = new Queue[Int]()
           + for(index <- 0 until neighbours.length) {
           +   neighbourQueue += neighbours(index).toInt
           + }
          +
          + while(neighbourQueue.nonEmpty) {
          + var curr: Int = neighbourQueue.dequeue()
          + if(data(curr, 1) == 0) {
          + var currNeighbours: DenseVector = findNeighbours(curr, data, Eps)
          + data(curr, 1) = 1
          + if(currNeighbours.length >= Minpts) {
          + data(curr, 0) = 1 //coreFlag == True
           + for (index <- 0 until currNeighbours.length) {
           +   neighbourQueue += currNeighbours(index).toInt
           + }
          + }
           + if(data(curr, 3) == -1) {
           +   data(curr, 3) = clusterId
           + }
          + }
          + }
          + }
          +
          + def DBSCAN(data: DenseMatrix, Eps: Double, Minpts: Int): Unit = {
          +
          + var clusterId: Int = 0
          + for(i <- 0 until data.nrow) {
          + if(data(i, 1) != 1) {
          + //i.e. if notProcessed...
          + val neighbours: DenseVector = findNeighbours(i, data, Eps)
          + data(i, 1) = 1.0 // ==> processedFlag = True
           + if(neighbours.length >= Minpts) {
           +   // ==> i corresponds to a core point.
           +   data(i, 0) = 1 // ==> coreFlag = True
           +   // data(i, 3) = clusterId
           +   expandCluster(i, data, neighbours, clusterId, Eps, Minpts)
           +   clusterId += 1
           + } else {
           +   data(i, 4) = 1.0 // ==> noiseFlag = True
           + }
          + }
          + }
          + }
          +
          + /*
          + @args
          + point: Int - Id of the point whose neighbours are to be computed
          + data: DenseMatrix[Double] - The augmented matrix containing all the data points along with the 4 appended columns
          + Returns: A DenseVector containing the ID's of all the points that are at an Eps distance from point.
          + */
          + def findNeighbours(point: Int, data: DenseMatrix, Eps: Double) : DenseVector = {
          + val pointId: Int = data(point, 2).toInt
          + var neighbours: ArrayBuffer[Int] = new ArrayBuffer[Int]()
          + val pointData: DenseVector = dvec(data(pointId, :)
          — End diff –

          Updated
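For context, the neighbourhood query that this row extraction feeds can be sketched in plain Scala (toy stand-ins for the Mahout types, Euclidean distance only): return the indices of all points within `eps` of point `i`.

```scala
// Plain-Scala sketch of an eps-neighbourhood query, as in DBSCAN's
// findNeighbours: all points (other than i itself) within eps of point i.
def euclidean(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

def findNeighbours(i: Int, points: Array[Array[Double]], eps: Double): Seq[Int] =
  points.indices.filter(j => j != i && euclidean(points(i), points(j)) <= eps)

val pts = Array(Array(0.0, 0.0), Array(0.0, 1.0), Array(5.0, 5.0))
println(findNeighbours(0, pts, 1.5))  // only point 1 is within 1.5 of point 0
```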

          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user AdityaAS commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/334#discussion_r135379577

          — Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/clustering/DBSCAN_Incore.scala —
          @@ -0,0 +1,142 @@
          +/**
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing,
          + * software distributed under the License is distributed on an
          + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
          + * KIND, either express or implied. See the License for the
          + * specific language governing permissions and limitations
          + * under the License.
          + */
          +package org.apache.mahout.math.algorithms.clustering
          +
          +import scala.collection.JavaConversions.iterableAsScalaIterable
          +import org.apache.mahout.math.scalabindings.::
          +import org.apache.mahout.math.scalabindings.RLikeOps.m2mOps
          +import org.apache.mahout.math.scalabindings.RLikeOps.v2vOps
          +import org.apache.mahout.math.scalabindings.dvec
          +import org.apache.mahout.math._
          +import scalabindings._
          +import scala.collection.mutable.ArrayBuffer
          +import scala.collection.mutable.Queue
          +
          +class DBSCAN_Incore {
          +
          + def expandCluster(i: Int, data: DenseMatrix, neighbours: DenseVector, clusterId: Int, Eps: Double, Minpts: Int) = {
          + data(i, 3) = clusterId.toDouble
          + var neighbourQueue = new Queue[Int]()
          + for(index <- 0 until neighbours.length) {
          +   neighbourQueue += neighbours(index).toInt
          + }
          +
          + while(neighbourQueue.nonEmpty) {
          + var curr: Int = neighbourQueue.dequeue()
          + if(data(curr, 1) == 0) {
          + var currNeighbours: DenseVector = findNeighbours(curr, data, Eps)
          + data(curr, 1) = 1
          + if(currNeighbours.length >= Minpts) {
          + data(curr, 0) = 1 //coreFlag == True
          + for (index <- 0 until currNeighbours.length) {
          +   neighbourQueue += currNeighbours(index).toInt
          + }
          + }
          + if(data(curr, 3) == -1) {
          +   data(curr, 3) = clusterId
          + }
          + }
          + }
          + }
          +
          + def DBSCAN(data: DenseMatrix, Eps: Double, Minpts: Int): Unit = {
          +
          + var clusterId: Int = 0
          + for(i <- 0 until data.nrow) {
          + if(data(i, 1) != 1) {
          + //i.e. if notProcessed...
          + val neighbours: DenseVector = findNeighbours(i, data, Eps)
          + data(i, 1) = 1.0 // ==> processedFlag = True
          + if(neighbours.length >= Minpts) {
          +   // ==> i corresponds to a core point.
          +   data(i, 0) = 1 // ==> coreFlag = True
          +   // data(i, 3) = clusterId
          +   expandCluster(i, data, neighbours, clusterId, Eps, Minpts)
          +   clusterId += 1
          + }
          + else {
          +   data(i, 4) = 1.0 // ==> noiseFlag = True
          + }
          + }
          + }
          + }
          +
          + /*
          + @args
          + point: Int - Id of the point whose neighbours are to be computed
          + data: DenseMatrix[Double] - The augmented matrix containing all the data points along with the 4 appended columns
          + Returns: A DenseVector containing the ID's of all the points that are at an Eps distance from point.
          + */
          + def findNeighbours(point: Int, data: DenseMatrix, Eps: Double) : DenseVector = {
          + val pointId: Int = data(point, 2).toInt
          + var neighbours: ArrayBuffer[Int] = new ArrayBuffer[Int]()
          + val pointData: DenseVector = dvec(data(pointId, ::))
          + var neighbourCount = 0
          + for(row <- data) {
          + if(row(0).toInt != pointId) {
          + val arg1 = dvec(row)
          — End diff –

          Updated
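For reference, the control flow under review (one Eps-neighbourhood query per point, then queue-based cluster expansion) can be sketched as a self-contained plain-Scala version. The `dbscan` function, the Array-based point representation, and the label convention (-1 = noise) are illustrative assumptions, not the Mahout API:

```scala
import scala.collection.mutable

// Naive O(n^2) DBSCAN sketch; returns per-point labels: -1 = noise, 0+ = cluster id.
// Following the diff above, MinPts is compared against the neighbour count
// excluding the point itself.
def dbscan(points: Array[Array[Double]], eps: Double, minPts: Int): Array[Int] = {
  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
  // Eps-neighbourhood query: compares against every point, hence O(n) per query.
  def neighbours(i: Int): Seq[Int] =
    points.indices.filter(j => j != i && dist(points(i), points(j)) <= eps)

  val labels = Array.fill(points.length)(-1)   // everything starts as noise
  val visited = Array.fill(points.length)(false)
  var clusterId = 0
  for (i <- points.indices if !visited(i)) {
    visited(i) = true
    val ns = neighbours(i)
    if (ns.size >= minPts) {                   // i is a core point: grow a new cluster
      labels(i) = clusterId
      val queue = mutable.Queue(ns: _*)
      while (queue.nonEmpty) {
        val cur = queue.dequeue()
        if (!visited(cur)) {
          visited(cur) = true
          val curNs = neighbours(cur)
          if (curNs.size >= minPts) queue ++= curNs // cur is also core: keep expanding
        }
        if (labels(cur) == -1) labels(cur) = clusterId
      }
      clusterId += 1
    }
  }
  labels
}
```

Points reachable through a chain of core points end up in the same cluster; everything else keeps the noise label.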

          githubbot ASF GitHub Bot added a comment -

          Github user AdityaAS commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/334#discussion_r135379586

          — Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/clustering/DBSCAN_Incore.scala —
          @@ -0,0 +1,142 @@
          +/**
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing,
          + * software distributed under the License is distributed on an
          + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
          + * KIND, either express or implied. See the License for the
          + * specific language governing permissions and limitations
          + * under the License.
          + */
          +package org.apache.mahout.math.algorithms.clustering
          +
          +import scala.collection.JavaConversions.iterableAsScalaIterable
          +import org.apache.mahout.math.scalabindings.::
          +import org.apache.mahout.math.scalabindings.RLikeOps.m2mOps
          +import org.apache.mahout.math.scalabindings.RLikeOps.v2vOps
          +import org.apache.mahout.math.scalabindings.dvec
          +import org.apache.mahout.math._
          +import scalabindings._
          +import scala.collection.mutable.ArrayBuffer
          +import scala.collection.mutable.Queue
          +
          +class DBSCAN_Incore {
          +
          + def expandCluster(i: Int, data: DenseMatrix, neighbours: DenseVector, clusterId: Int, Eps: Double, Minpts: Int) = {
          + data(i, 3) = clusterId.toDouble
          + var neighbourQueue = new Queue[Int]()
          + for(index <- 0 until neighbours.length) {
          +   neighbourQueue += neighbours(index).toInt
          + }
          +
          + while(neighbourQueue.nonEmpty) {
          + var curr: Int = neighbourQueue.dequeue()
          + if(data(curr, 1) == 0) {
          + var currNeighbours: DenseVector = findNeighbours(curr, data, Eps)
          + data(curr, 1) = 1
          + if(currNeighbours.length >= Minpts) {
          + data(curr, 0) = 1 //coreFlag == True
          + for (index <- 0 until currNeighbours.length) {
          +   neighbourQueue += currNeighbours(index).toInt
          + }
          + }
          + if(data(curr, 3) == -1) {
          +   data(curr, 3) = clusterId
          + }
          + }
          + }
          + }
          +
          + def DBSCAN(data: DenseMatrix, Eps: Double, Minpts: Int): Unit = {
          +
          + var clusterId: Int = 0
          + for(i <- 0 until data.nrow) {
          + if(data(i, 1) != 1) {
          + //i.e. if notProcessed...
          + val neighbours: DenseVector = findNeighbours(i, data, Eps)
          + data(i, 1) = 1.0 // ==> processedFlag = True
          + if(neighbours.length >= Minpts) {
          +   // ==> i corresponds to a core point.
          +   data(i, 0) = 1 // ==> coreFlag = True
          +   // data(i, 3) = clusterId
          +   expandCluster(i, data, neighbours, clusterId, Eps, Minpts)
          +   clusterId += 1
          + }
          + else {
          +   data(i, 4) = 1.0 // ==> noiseFlag = True
          + }
          + }
          + }
          + }
          +
          + /*
          + @args
          + point: Int - Id of the point whose neighbours are to be computed
          + data: DenseMatrix[Double] - The augmented matrix containing all the data points along with the 4 appended columns
          + Returns: A DenseVector containing the ID's of all the points that are at an Eps distance from point.
          + */
          + def findNeighbours(point: Int, data: DenseMatrix, Eps: Double) : DenseVector = {
          + val pointId: Int = data(point, 2).toInt
          + var neighbours: ArrayBuffer[Int] = new ArrayBuffer[Int]()
          + val pointData: DenseVector = dvec(data(pointId, ::))
          + var neighbourCount = 0
          + for(row <- data) {
          + if(row(0).toInt != pointId) {
          + val arg1 = dvec(row)
          + val arg2 = dvec(pointData)
          + if(distanceMetric(arg1, arg2) <= Eps) {
          +   neighbourCount += 1
          +   neighbours += pointData(2).toInt
          + }
          + }
          + }
          + var neighboursDvec: DenseVector = dvec(neighbours)
          + neighboursDvec
          + }
          +
          + def addColumns(arg: Array[Double]): Array[Double] = {
          + val newArr = new Array[Double](arg.length + 5)
          + newArr(0) = 0.0 //coreFlag //Initialize all points as non-core points
          + newArr(1) = 0.0 //processedFlag //Initialize all points as non-processed points
          + newArr(2) = -1.0 //globalId
          + newArr(3) = -1.0 //clusterId //Initialize all points as not belonging to any cluster
          + newArr(4) = 0.0 //noiseFlag //Initialize all points as not-Noise
          +
          + for (i <- 0 until arg.size) {
          +   newArr(i + 5) = arg(i)
          + }
          + newArr
          + }
          +
          + /*
          + Takes in two rows as input. Rows that contain the co-ordinates of the data as well as the augmented columns added at the beginning.
          + Computes distance of only the data part of the rows, ignoring the augmented columns.
          + //DistanceMetric is Euclidean distance as of now, Check and add other types of distance metrics and rename methods accordingly
          + Also give user the option of choosing the distance metric while running the algorithm.
          + */
          + def distanceMetric(arg1: DenseVector, arg2: DenseVector) : Double = {
          + var diffsqSum = -1.0
          + if(arg1.length != arg2.length) {
          +   return diffsqSum
          + }
          + else {
          + val diff = arg1 - arg2
          + val diffsq = diff^2
          + diffsqSum = 0.0
          + for(i <- 4 until diffsq.length) {
          +   diffsqSum += diffsq(i)
          + }
          + diffsqSum = math.sqrt(diffsqSum)
          — End diff –

          Updated
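As a standalone counterpart to the metric discussed above, a Euclidean distance over augmented rows that skips the prepended bookkeeping columns might look like the sketch below; the `augmentedDistance` name and the parameterised offset are assumptions for illustration (the diff itself hard-codes the starting index):

```scala
// Euclidean distance over augmented rows, ignoring the first `offset`
// bookkeeping columns (coreFlag, processedFlag, globalId, clusterId, noiseFlag).
def augmentedDistance(a: Array[Double], b: Array[Double], offset: Int = 5): Double = {
  require(a.length == b.length, "rows must have the same length")
  var sum = 0.0
  for (i <- offset until a.length) {
    val d = a(i) - b(i)
    sum += d * d
  }
  math.sqrt(sum)
}
```

Making the offset explicit keeps the metric in sync with however many columns `addColumns` prepends.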

          githubbot ASF GitHub Bot added a comment -

          Github user rawkintrevo commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/334#discussion_r136095747

          — Diff: website/docs/algorithms/clustering/dbscan/index.md —
          @@ -0,0 +1,101 @@
          +---
          +layout: algorithm
          +title: DBSCAN
          +theme:
          + name: retro-mahout
          +---
          +
          +### About
          +
          +[DBSCAN Clustering](https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf)
          + is a simple and efficient density-based clustering algorithm for grouping objects
          + into clusters. DBSCAN differs from other clustering algorithms in the way it
          + clusters objects: it uses the density information of points to define clusters.
          + Each object is represented as a point in the multidimensional feature space. The algorithm
          + uses the notions of core points, border points and noise points to perform the clustering.
          +
          + Parameters of the Algorithm: (Epsilon, Min points)
          + * Epsilon (Eps) - denotes the radius of the sphere that defines the neighbourhood of a given data point
          + * Min points (Minpts) - the minimum threshold for the number of points in a point p's Eps-neighbourhood to
          + make p a core point
          +
          + Basic Definitions
          + * Eps-neighbourhood of a point - The Eps-neighbourhood of a point 'p' is defined as the set of points S
          + such that for each x 'belongs to' S, distance(p,x) <= eps
          + * Core point - A point 'p' is called a core point if the number of points in its Eps-neighbourhood is >= Minpts
          + * Border point - A point that is not a core point, but for which there exists a point 'p' in its
          + neighbourhood such that p is a core point
          + * Noise point - Any point that does not classify as a core point or border point is a noise point. Noise points
          + do not belong to any cluster.
          +
          + Algorithm:
          + The DBSCAN algorithm takes as parameters Eps and MinPts. Eps restricts the neighbourhood of a point and MinPts
          + denotes the threshold for the number of neighbours in the Eps-neighbourhood of a point p, for it to form a cluster.
          + The algorithm randomly chooses a point q, and finds its Eps-neighbourhood. If the number of points in the
          + Eps-neighbourhood of q is less than MinPts, q is marked as a noise point. Otherwise it is marked as a core point
          + and a new cluster is created. If q is a core point, an Eps-neighbourhood query is iteratively performed on each of
          + its neighbours and the points are added to the created cluster. If no unvisited points can be added to the cluster, the new
          + cluster is complete and no points will be added to it in subsequent iterations. Then another unvisited point q'
          + is picked and the same process is repeated. The algorithm terminates when all the points have been visited,
          + i.e. when every point has either been added to a cluster or marked as a noise point. The total number of
          + Eps-neighbourhood queries performed is equal to the size of the data, and if no indexing data structure is
          + used then each Eps-neighbourhood query involves a distance calculation against all the points.
          + Thus, if no indexing data structure is used, the complexity of the DBSCAN algorithm is O(n^2).
          +
          +#### Strategy for parallelization
          +
          +The DBSCAN algorithm is an inherently sequential algorithm. A quick look at the pseudo code provided in the original
          +paper will show why.
          +
          +The parallelization strategy for DBSCAN involves 3 phases.
          +* Data distribution phase preserving spatial locality
          +* InCore DBSCAN Clustering
          +* Merging InCore clusters to form Global clusters
          +
          +Few methods to parallelize DBSCAN have been proposed in the recent past and can be found [here](http://delivery.acm.org/10.1145/2390000/2389081/a62-patwary.pdf?ip=14.139.128.15&id=2389081&acc=ACTIVE%20SERVICE&key=045416EF4DDA69D9%2EDB7584019D0D7099%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&CFID=973347130&CFTOKEN=86904048&__acm__=1503663674_474f7b3eaa352d59c2905c1f530907c1) and [here](http://delivery.acm.org/10.1145/2840000/2834894/a2-gotz.pdf?ip=14.139.128.15&id=2834894&acc=ACTIVE%20SERVICE&key=045416EF4DDA69D9%2EDB7584019D0D7099%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&CFID=973347130&CFTOKEN=86904048&__acm__=1503663710_9001c4bdf3d4dc3ed6658d8a6aaf1190)
          +
          +In order to minimize the communication costs for large datasets, we provide an approximate DBSCAN algorithm. Data division among nodes is done randomly,
          +violating the spatial locality condition. Clustering is performed InCore, and cluster merging is done by merging two clusters
          +that are nearby and that contain a significant number of points.
          +
          +### Parameters
          +<div class="table-striped">
          + <table class="table">
          + <tr>
          + <th>Parameter</th>
          + <th>Description</th>
          + <th>Default Value</th>
          + </tr>
          + <tr>
          + <td><code>'distanceMeasure</code></td>
          + <td>The metric used for calculating distance, see <a href="../distance-metrics.html">Distance Metrics</a></td>
          + <td><code>'Euclidean</code></td>
          + </tr>
          + <tr>
          + <td><code>'Eps</code></td>
          + <td>The radius that defines the neighbourhood of a point</td>
          + <td>NA</td>
          + </tr>
          + <tr>
          + <td><code>'Minpts</code></td>
          + <td>The threshold for the number of points in the neighbourhood of another point</td>
          + <td>NA</td>
          + </tr>
          + </table>
          +</div>
          +
          +
          +### Example
          +
          + //Using the InCoreDBSCAN module
          + val input = dense((1.0, 1.2, 1.3, 1.4), (1.1, 1.5, 2.5, 1.0), (6.0, 5.2, -5.2, 5.3), (7.0,6.0, 5.0, 5.0), (10.0, 1.0, 20.0, -10.0))
          +
          + import org.apache.mahout.math.algorithms.clustering.InCoreDBSCAN
          + val distanceMetric = DistanceMetricSelector.select(DistanceMetricSelector.namedMetricLookup('Euclidean))
          — End diff –

          +1 utilizing existing distanceMetrics framework.
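The core/border/noise definitions in the documentation diff above can be exercised with a small standalone sketch; `classify` and its convention of counting neighbours excluding the point itself are assumptions for illustration, not part of the documented API:

```scala
sealed trait PointKind
case object Core extends PointKind
case object Border extends PointKind
case object Noise extends PointKind

// Classify each point per the DBSCAN definitions: core if its Eps-neighbourhood
// holds at least minPts points, border if a core point lies in its neighbourhood,
// otherwise noise.
def classify(points: Array[Array[Double]], eps: Double, minPts: Int): Array[PointKind] = {
  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
  def nbrs(i: Int): Seq[Int] =
    points.indices.filter(j => j != i && dist(points(i), points(j)) <= eps)
  val isCore = points.indices.map(i => nbrs(i).size >= minPts)
  points.indices.map { i =>
    if (isCore(i)) Core
    else if (nbrs(i).exists(isCore)) Border
    else Noise
  }.toArray
}
```

Only noise points stay outside every cluster; border points attach to the cluster of a nearby core point.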

          Show
          githubbot ASF GitHub Bot added a comment - Github user rawkintrevo commented on a diff in the pull request: https://github.com/apache/mahout/pull/334#discussion_r136095747 — Diff: website/docs/algorithms/clustering/dbscan/index.md — @@ -0,0 +1,101 @@ +--- +layout: algorithm +title: DBSCAN +theme: + name: retro-mahout +--- + +### About + + [DBSCAN Clustering] ( https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf ) + is a simple and efficient density based clustering algorithm for group objects + into clusters. DBSCAN differs from other clustering algorithms in the way it + clusters objects. DBSCAN uses the density information of points to define clusters + Each object is represented as a point in the multidimensional feature space. The algorithm + uses the notion of core points, border points and noise points to perform the clustering. + + Parameters of the Algorithm: (Epsilon, Min points) + * Epsilon (Eps) - denotes the radius of the sphere that defines the neighbourhood of a given data point + * Min points (Minpts) - the minimum threshold for the number of points in a point p's Eps-neighbourhood to + make p a core point + + Basic Definitions + * Eps-neighbourhood of a point - The Eps-neighbourhood of a point 'p' is defined as the set of points S + such that for each x 'belongs to' S, distance(p,x) <= eps + * Core point - A point 'p' is called a core point if the number of points in its Eps-neighbourhood is >= Minpts + * Border point - A point that is not a core point but such that there exists a 'p' in its neighbourhood + such that p is a core point + * Noise point - Any point that does not classify as a core point or border point is a noise point. Noise points + do not belong to any cluster. + + Algorithm: + The DBSCAN algorithm takes as parameters Eps and MinPts. Eps restricts the neighbourhood of a point and MinPts + denotes the threshold for the number of neighbours in Eps-neighbours of point p, for it to form a cluster. 
+ The algorithm randomly chooses a point q, and finds it Eps-neighbourhood. If the number of points in the + Eps-neighbourhood of q is less than MinPts, it is marked as a noise point. Otherwise it is marked as a core point + and a new cluster is created. If p is a core point, iteratively a Eps-neighbourhood query is performed on each of + its neighbours and points are added to the created cluster. If no unvisited points can be added to cluster, the new + cluster is complete and no points will be added to the cluster in subsequent iterations. Then another unvisited point q' + is picked up and the same process is repeated. The algorithm terminates when all the points have been visited + i.e. when some of the points are added to clusters and some are marked as noise points. The total number of + Eps-neighbourhood queries performed is equal to the size of the data and if no indexing data structure is + used then the calculation of Eps-neighbourhood query involved the distance calculation with all the points . + Thus, if no indexing data structure is used, the complexity of the DBSCAN algorithm is O(n^2) + +#### Strategy for parallelization + +The DBSCAN algorithm is an inherently sequential algorithm. A quick look at the pseudo code provided in the original +paper will show why. + +The parallelization strategy for DBSCAN involves 3 phases. 
+* Data distribution phase preserving spatial locality +* InCore DBSCAN Clustering +* Merging InCore clusters to form Global clusters + +Few methods to parallelize DBSCAN have been proposed in the recent past and can be found [here] ( http://delivery.acm.org/10.1145/2390000/2389081/a62-patwary.pdf?ip=14.139.128.15&id=2389081&acc=ACTIVE%20SERVICE&key=045416EF4DDA69D9%2EDB7584019D0D7099%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&CFID=973347130&CFTOKEN=86904048&__acm__=1503663674_474f7b3eaa352d59c2905c1f530907c1 ) and [here] ( http://delivery.acm.org/10.1145/2840000/2834894/a2-gotz.pdf?ip=14.139.128.15&id=2834894&acc=ACTIVE%20SERVICE&key=045416EF4DDA69D9%2EDB7584019D0D7099%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&CFID=973347130&CFTOKEN=86904048&__acm__=1503663710_9001c4bdf3d4dc3ed6658d8a6aaf1190 ) + +In order to minimize the communication costs for large datasets, we provide an approximate DBSCAN algorithm. Data division among nodes is done randomly +violating the spatial locality condition. Clustering is performed InCore and cluster merging is done by merging two clusters +that are nearby and that contain significant number of points. 
+ +### Parameters +<div class="table-striped"> + <table class="table"> + <tr> + <th>Parameter</th> + <th>Description</th> + <th>Default Value</th> + </tr> + <tr> + <td><code>'distanceMeasure</code></td> + <td>The metric used for calculating distance, see <a href="../distance-metrics.html">Distance Metrics</a></td> + <td><code>'Euclidean</code></td> + </tr> + <tr> + <td><code>'Eps</code></td> + <td>The radius that defines the neighbourhood of a point</code></td> + <td>NA</td> + </tr> + <tr> + <td><code>'Minpts</code></td> + <td>The threshold for number of points in the neighbourhood of another point</code></td> + <td>NA</td> + </tr> + </tr> + </table> +</div> + + +### Example + + //Using the InCoreDBSCAN module + val input = dense((1.0, 1.2, 1.3, 1.4), (1.1, 1.5, 2.5, 1.0), (6.0, 5.2, -5.2, 5.3), (7.0,6.0, 5.0, 5.0), (10.0, 1.0, 20.0, -10.0)) + + import org.apache.mahout.math.algorithms.clustering.InCoreDBSCAN + val distanceMetric = DistanceMetricSelector.select(DistanceMetricSelector.namedMetricLookup('Euclidean))) — End diff – +1 utilizing existing distanceMetrics framework.
          githubbot ASF GitHub Bot added a comment -

          Github user rawkintrevo commented on the issue:

          https://github.com/apache/mahout/pull/334

          @AdityaAS @andrewpalumbo @dlyubimov

          Afaik, the R-Tree/Rect portion of this is solid- I could really use those for something else I'm working on- (ML/AI on Flink for upcoming talks).

          As R-Trees are a core component of many hierarchical clustering algorithms, does anyone have any opinions on refactoring this JIRA into multiple - the first of which being "Add R-Tree implementation" (and then additional tickets for DBSCAN/ etc.)

          Still proposing a feature branch for most of this - just splitting the R-Trees, which are done, off into their own JIRA ticket/PR and merging.

          Thoughts?

          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo commented on the issue:

          https://github.com/apache/mahout/pull/334

          @rawkintrevo @AdityaAS @dlyubimov Admittedly, I've not had a chance to review this closely. I think that, if the RTrees are solid, moving them to something like `org.apache.mahout.math.common` and committing separately would make sense. We have other data structures from the pre-algo framework days that need to be shared as well, e.g.: `ConfusionMatrix`: https://github.com/apache/mahout/tree/master/math-scala/src/main/scala/org/apache/mahout/classifier/stats which could also go in such a package.

          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo commented on the issue:

          https://github.com/apache/mahout/pull/334

          @rawkintrevo to clarify, you're talking about a PR against `master` for RTrees, correct?

          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/334#discussion_r136179872

          — Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/clustering/DBSCAN.scala —
          @@ -0,0 +1,217 @@
          +/**
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing,
          + * software distributed under the License is distributed on an
          + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
          + * KIND, either express or implied. See the License for the
          + * specific language governing permissions and limitations
          + * under the License.
          + */
          +
          +package org.apache.mahout.math.algorithms.clustering
          +
          +import org.apache.mahout.math.scalabindings._
          +import org.apache.mahout.math.scalabindings.RLikeOps._
          +import org.apache.mahout.math._
          +import org.apache.mahout.math.algorithms.common.distance.{DistanceMetric, DistanceMetricSelector}

          +import org.apache.mahout.math.drm._
          +import org.apache.mahout.math.drm.RLikeDrmOps._
          +
          +import scala.collection.mutable
          +import scala.io.Source
          +
          +class DistributedDBSCAN extends ClusteringFitter {
          +
          + var epsilon: Double = _
          + var minPts: Int = _
          + var distanceMeasure: Symbol = _
          +
          + def setStandardHyperparameters(hyperparameters: Map[Symbol, Any] = Map('foo -> None)): Unit = {
          +   epsilon = hyperparameters.asInstanceOf[Map[Symbol, Double]].getOrElse('epsilon, 0.5)
          +   minPts = hyperparameters.asInstanceOf[Map[Symbol, Int]].getOrElse('minPts, 1)
          +   distanceMeasure = hyperparameters.asInstanceOf[Map[Symbol, Symbol]].getOrElse('distanceMeasure, 'Euclidean)
          + }

          +
          + def fit[K](input: DrmLike[K],
          + hyperparameters: (Symbol, Any)*): DBSCANModel = {
          +
          + setStandardHyperparameters(hyperparameters.toMap)
          + implicit val ctx = input.context
          + implicit val ktag = input.keyClassTag
          +
          + val dmNumber = DistanceMetricSelector.namedMetricLookup(distanceMeasure)
          +
          + val configBC = drmBroadcast(dvec(epsilon, minPts, dmNumber))
          +
          + val clusters = input.allreduceBlock(
          + {
          + // Assign All Points to Clusters
          + case (keys, block: Matrix) => {
          +   val epsilon_local = configBC.value.get(0)
          +   val minPts_local = configBC.value.get(1)
          +
          +   val distanceMetric = DistanceMetricSelector.select(configBC.value.get(3))
          +   val icDBSCAN = new InCoreDBSCAN(block, epsilon_local, minPts_local.toInt, distanceMetric)
          +   // do stuff on icDBSCAN
          +   icDBSCAN.DBSCAN()
          + }
          + }, {
          + // Optionally Merge Clusters that are close enough
          + case (metadata1: Matrix, metadata2: Matrix) => {
          +   // this does nothing- just returns the left matrix
          +   metadata1
          + }

          + })
          +
          + val model = new DBSCANModel(1)
          + model.summary = s"""foo the bar"""
          — End diff –

          @AdityaAS do you have a plan for the model summary?
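The merge side of the quoted `allreduceBlock` is still a stub (it just returns the left matrix). One plausible shape for the "merge clusters that are close enough" step — combining per-block cluster summaries whose centroids fall within eps of each other — can be sketched in plain Scala; `ClusterSummary`, `mergeSummaries`, and the centroid-distance merge rule below are illustrative assumptions, not the PR's or Samsara's API:

```scala
// Illustrative merge of per-block cluster summaries (centroid + point count).
// A plain-Scala sketch of the idea, not the allreduceBlock signature.
case class ClusterSummary(centroid: Array[Double], count: Long)

def mergeSummaries(left: Seq[ClusterSummary],
                   right: Seq[ClusterSummary],
                   eps: Double): Seq[ClusterSummary] = {
  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  val merged = scala.collection.mutable.ArrayBuffer(left: _*)
  for (r <- right) {
    merged.indexWhere(l => dist(l.centroid, r.centroid) <= eps) match {
      case -1 => merged += r                 // no nearby cluster: keep as-is
      case i  =>                             // fuse: count-weighted centroid
        val l = merged(i)
        val total = l.count + r.count
        val c = l.centroid.zip(r.centroid).map { case (x, y) =>
          (x * l.count + y * r.count) / total
        }
        merged(i) = ClusterSummary(c, total)
    }
  }
  merged.toSeq
}
```

With two summaries 0.5 apart and eps = 1.0, the pair fuses into one count-4 cluster while a distant cluster survives unchanged.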

          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/334#discussion_r136180989

          — Diff: math-scala/src/test/scala/org/apache/mahout/math/algorithms/ClusteringSuiteBase.scala —
          @@ -45,4 +48,17 @@ trait ClusteringSuiteBase extends DistributedMahoutSuite with Matchers

          {
            val epsilon = 1E-6
            (myAnswer.norm - correctAnswer.norm) should be <= epsilon
          }

          +
          + test("inCore DBSCAN test") {
          +// import org.apache.mahout.math.algorithms.clustering._
          +//
          +// val dm = DistanceMetricSelector.namedMetricLookup('Euclidean)
          +//// val url: URL = getClass().getResource("./data21.txt")
          +// //TO DO //Need help
          — End diff –

          @AdityaAS if you need some toy data files for tests, you can place them in `$MAHOUT_HOME/examples/bin/resources`.

          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/334#discussion_r136181125

          — Diff: math-scala/src/test/scala/org/apache/mahout/math/algorithms/data21.txt —
          @@ -0,0 +1,23 @@
          +21
          — End diff –

          see above comment.

          githubbot ASF GitHub Bot added a comment -

          Github user AdityaAS commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/334#discussion_r136310677

          — Diff: math-scala/src/test/scala/org/apache/mahout/math/algorithms/ClusteringSuiteBase.scala —
          @@ -45,4 +48,17 @@ trait ClusteringSuiteBase extends DistributedMahoutSuite with Matchers

          {
            val epsilon = 1E-6
            (myAnswer.norm - correctAnswer.norm) should be <= epsilon
          }

          +
          + test("inCore DBSCAN test") {
          +// import org.apache.mahout.math.algorithms.clustering._
          +//
          +// val dm = DistanceMetricSelector.namedMetricLookup('Euclidean)
          +//// val url: URL = getClass().getResource("./data21.txt")
          +// //TO DO //Need help
          — End diff –

          Ah. Thanks. I'll do that.

          githubbot ASF GitHub Bot added a comment -

          Github user rawkintrevo commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/334#discussion_r136356536

          — Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/clustering/rtree/Rectangle.scala —
          @@ -0,0 +1,157 @@
          +/**
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing,
          + * software distributed under the License is distributed on an
          + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
          + * KIND, either express or implied. See the License for the
          + * specific language governing permissions and limitations
          + * under the License.
          + */
          +
          +
          +package org.apache.mahout.math.algorithms.clustering.rtree
          +
          +import scala.math._
          +
          +/*
          + A rectangle is defined by two parameters: the coordinates of its top-right corner and the coordinates of its bottom-left corner.
          + */
          +trait Rectangle{
          +
          + require(bottomLeft.size == topRight.size) //Both corners should have the same dimensionality
          +
          + def bottomLeft: List[Double] //Coordinates of bottom left corner
          + def topRight: List[Double] //Coordinates of top right corner
          + def n: Int = bottomLeft.size //dimensionality
          +
          + //Return the lowerLeft corner as a Point
          + def lowerLeftPoint: Point = Point(this.bottomLeft)
          +
          + //Return the upper right corner as a Point
          + def upperRightPoint: Point = Point(this.topRight)
          +
          + //Checks if this contains the Rectangle rect
          + def contains(rect: Rectangle): Boolean = {
          +   for(i <- 0 until rect.n ) {
          +     if( !(this.bottomLeft <= rect.bottomLeft && rect.topRight <= this.topRight) ) {
          +       false
          +     }
          +   }
          +   true
          + }
          +
          + def wraps(rect: Rectangle): Boolean = {
          +   for(i <- 0 until rect.n ) {
          +     if( !(this.bottomLeft < rect.bottomLeft && rect.topRight < this.topRight) ) {
          +       false
          +     }
          +   }
          +   true
          + }
          +
          + def intersects(rect: Rectangle): Boolean = {
          +   for(i <- 0 until rect.n) {
          +     if( !(this.bottomLeft(i) <= rect.topRight(i) && rect.bottomLeft(i) <= this.topRight(i)) )
          +       false
          +   }
          +   true
          + }
          +
          + def area(): Double = {
          +   var area = 1.0
          +   for(i <- 0 until this.n) {
          +     area = area * (this.topRight(i) - this.bottomLeft(i))
          +   }
          +   area
          + }
          +
          + def toMBR: MBR = MBR(this.bottomLeft, this.topRight)
          +
          + def expandRectangle(rect: Rectangle): MBR = {
          +   val minBottomLeft = new Array[Double](rect.n)
          +   val maxTopRight = new Array[Double](rect.n)
          +   for(i <- 0 until rect.n) {
          +     minBottomLeft(i) = math.min(bottomLeft(i), rect.bottomLeft(i))
          +     maxTopRight(i) = math.max(topRight(i), rect.topRight(i))
          +   }
          +   MBR(minBottomLeft.toList, maxTopRight.toList)
          + }
          +
          + //Area of 'this' rectangle after merging with 'rect'
          + def diffAreaAfterExpansion(rect: Rectangle): Double = {
          +   val minBottomLeft = new Array[Double](rect.n)
          +   val maxTopRight = new Array[Double](rect.n)
          +   var area: Double = 0
          +
          +   for(i <- 0 until rect.n) {
          +     minBottomLeft(i) = math.min(bottomLeft(i), rect.bottomLeft(i))
          +     maxTopRight(i) = math.max(topRight(i), rect.topRight(i))
          +     area += (maxTopRight(i) - minBottomLeft(i))*(maxTopRight(i) - minBottomLeft(i))
          +   }
          +   area - this.area
          + }
          +
          + //Distance of a point from the rectangle.
          + def distance(pt: Point): Double = {
          +   val n = pt.coordinates.size
          +   val dList = new Array[Double]
          +   for(dim <- 0 until n) {
          +     var x = this.bottomLeft(dim)
          +     var x2 = this.topRight(dim)
          +     dList(dim) = if (pt.coordinates(dim) < x) x - pt.coordinates(dim) else if (pt.coordinates(dim) < x2) 0D else pt.coordinates(dim) - x2
          +   }
          +   var squaredSum: Double = 0
          +   for(i <- 0 until n) {
          +     squaredSum += dList(i)*dList(i)
          +   }
          +   sqrt(squaredSum)
          + }
          +
          +}
          +
          +//Represents a point with coordinates and dimensionality information
          +case class Point(coordinates: List[Double]) extends Rectangle {
          — End diff –

          Why not move this to its own file?
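A side note on the quoted predicates: in Scala, a bare `false` inside a `for` body is an expression whose value is simply discarded, so `contains`, `wraps`, and `intersects` as written always fall through to the trailing `true` (and the first two compare whole `List[Double]`s rather than per-dimension coordinates). A corrected per-dimension formulation using `forall`, which short-circuits properly, might look like this — `Rect` is an illustrative stand-in, not the PR's `Rectangle` trait:

```scala
// Corrected axis-aligned rectangle predicates. The quoted for-loops discard
// their `false` values; `forall` evaluates each dimension and short-circuits.
case class Rect(bottomLeft: List[Double], topRight: List[Double]) {
  require(bottomLeft.size == topRight.size)
  def n: Int = bottomLeft.size

  // this contains rect iff rect fits inside on every dimension (closed bounds)
  def contains(rect: Rect): Boolean =
    (0 until n).forall(i =>
      bottomLeft(i) <= rect.bottomLeft(i) && rect.topRight(i) <= topRight(i))

  // strict containment: rect strictly inside this on every dimension
  def wraps(rect: Rect): Boolean =
    (0 until n).forall(i =>
      bottomLeft(i) < rect.bottomLeft(i) && rect.topRight(i) < topRight(i))

  // the two boxes overlap iff their intervals overlap on every dimension
  def intersects(rect: Rect): Boolean =
    (0 until n).forall(i =>
      bottomLeft(i) <= rect.topRight(i) && rect.bottomLeft(i) <= topRight(i))
}
```

A small box inside a larger one is both contained and wrapped; a box poking past a corner intersects without being contained.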

          githubbot ASF GitHub Bot added a comment -

          Github user rawkintrevo commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/334#discussion_r136356644

          — Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/clustering/rtree/Rectangle.scala —
          @@ -0,0 +1,157 @@
          +/**
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing,
          + * software distributed under the License is distributed on an
          + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
          + * KIND, either express or implied. See the License for the
          + * specific language governing permissions and limitations
          + * under the License.
          + */
          +
          +
          +package org.apache.mahout.math.algorithms.clustering.rtree
          +
          +import scala.math._
          +
          +/*
          + A rectangle is defined by two parameters: the coordinates of its top-right corner and the coordinates of its bottom-left corner.
          + */
          +trait Rectangle{
          +
          + require(bottomLeft.size == topRight.size) //Both corners should have the same dimensionality
          +
          + def bottomLeft: List[Double] //Coordinates of bottom left corner
          + def topRight: List[Double] //Coordinates of top right corner
          + def n: Int = bottomLeft.size //dimensionality
          +
          + //Return the lowerLeft corner as a Point
          + def lowerLeftPoint: Point = Point(this.bottomLeft)
          +
          + //Return the upper right corner as a Point
          + def upperRightPoint: Point = Point(this.topRight)
          +
          + //Checks if this contains the Rectangle rect
          + def contains(rect: Rectangle): Boolean = {
          +   for(i <- 0 until rect.n ) {
          +     if( !(this.bottomLeft <= rect.bottomLeft && rect.topRight <= this.topRight) ) {
          +       false
          +     }
          +   }
          +   true
          + }
          +
          + def wraps(rect: Rectangle): Boolean = {
          +   for(i <- 0 until rect.n ) {
          +     if( !(this.bottomLeft < rect.bottomLeft && rect.topRight < this.topRight) ) {
          +       false
          +     }
          +   }
          +   true
          + }
          +
          + def intersects(rect: Rectangle): Boolean = {
          +   for(i <- 0 until rect.n) {
          +     if( !(this.bottomLeft(i) <= rect.topRight(i) && rect.bottomLeft(i) <= this.topRight(i)) )
          +       false
          +   }
          +   true
          + }
          +
          + def area(): Double = {
          +   var area = 1.0
          +   for(i <- 0 until this.n) {
          +     area = area * (this.topRight(i) - this.bottomLeft(i))
          +   }
          +   area
          + }
          +
          + def toMBR: MBR = MBR(this.bottomLeft, this.topRight)
          +
          + def expandRectangle(rect: Rectangle): MBR = {
          +   val minBottomLeft = new Array[Double](rect.n)
          +   val maxTopRight = new Array[Double](rect.n)
          +   for(i <- 0 until rect.n) {
          +     minBottomLeft(i) = math.min(bottomLeft(i), rect.bottomLeft(i))
          +     maxTopRight(i) = math.max(topRight(i), rect.topRight(i))
          +   }
          +   MBR(minBottomLeft.toList, maxTopRight.toList)
          + }
          +
          + //Area of 'this' rectangle after merging with 'rect'
          + def diffAreaAfterExpansion(rect: Rectangle): Double = {
          +   val minBottomLeft = new Array[Double](rect.n)
          +   val maxTopRight = new Array[Double](rect.n)
          +   var area: Double = 0
          +
          +   for(i <- 0 until rect.n) {
          +     minBottomLeft(i) = math.min(bottomLeft(i), rect.bottomLeft(i))
          +     maxTopRight(i) = math.max(topRight(i), rect.topRight(i))
          +     area += (maxTopRight(i) - minBottomLeft(i))*(maxTopRight(i) - minBottomLeft(i))
          +   }
          +   area - this.area
          + }
          +
          + //Distance of a point from the rectangle.
          + def distance(pt: Point): Double = {
          +   val n = pt.coordinates.size
          +   val dList = new Array[Double]
          +   for(dim <- 0 until n) {
          +     var x = this.bottomLeft(dim)
          +     var x2 = this.topRight(dim)
          +     dList(dim) = if (pt.coordinates(dim) < x) x - pt.coordinates(dim) else if (pt.coordinates(dim) < x2) 0D else pt.coordinates(dim) - x2
          +   }
          +   var squaredSum: Double = 0
          +   for(i <- 0 until n) {
          +     squaredSum += dList(i)*dList(i)
          +   }
          +   sqrt(squaredSum)
          + }
          +
          +}
          +
          +//Represents a point with coordinates and dimensionality information
          +case class Point(coordinates: List[Double]) extends Rectangle {
          +
          + override def bottomLeft: List[Double] = coordinates
          + override def topRight: List[Double] = coordinates
          +
          + override def n: Int = bottomLeft.size
          +
          + override def lowerLeftPoint: Point = this
          + override def upperRightPoint: Point = this
          +
          + override def distance(pt: Point): Double = {
          + var squaredSum: Double = 0
          +   for(i <- 0 until pt.n) {
          +     squaredSum += (pt.bottomLeft(i) - bottomLeft(i))*(pt.bottomLeft(i) - bottomLeft(i))
          +   }
          +   squaredSum
          + }
          +
          + override def wraps(rect: Rectangle): Boolean = false
          +}
          +
          +//MBR is a physical manifestation of the Rectangle trait
          +case class MBR(bottomLeft: List[Double], topRight: List[Double]) extends Rectangle {
          — End diff –

          same with case class and companion object. (move to own file)
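Two smaller issues in the quoted distance code are worth flagging: the trait's `val dList = new Array[Double]` won't compile (an array needs a length), and `Point`'s `distance` override returns the squared sum without taking the square root, unlike the trait's version. The underlying math — clamping the point to the box on each dimension — is sound; a compilable sketch of both distances (free-standing helper names, not the PR's API):

```scala
// Minimum Euclidean distance from a point to an axis-aligned box:
// per dimension, measure how far the coordinate lies outside the slab.
def pointToBox(p: List[Double], bottomLeft: List[Double], topRight: List[Double]): Double = {
  val perDim = p.indices.map { i =>
    if (p(i) < bottomLeft(i)) bottomLeft(i) - p(i)   // below the box
    else if (p(i) > topRight(i)) p(i) - topRight(i)  // above the box
    else 0.0                                         // inside the slab
  }
  math.sqrt(perDim.map(d => d * d).sum)
}

// Point-to-point distance; note the sqrt the quoted override omits.
def pointToPoint(a: List[Double], b: List[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
```

A point at the origin is sqrt(2) away from the unit-offset box [1,2]×[1,2], and a point inside a box is at distance zero.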

          Show
          githubbot ASF GitHub Bot added a comment - Github user rawkintrevo commented on a diff in the pull request: https://github.com/apache/mahout/pull/334#discussion_r136356644 — Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/clustering/rtree/Rectangle.scala — @@ -0,0 +1,157 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + + +package org.apache.mahout.math.algorithms.clustering.rtree + +import scala.math._ + +/* + A rectangle is defined by two parameters. 
          githubbot ASF GitHub Bot added a comment -

          Github user andrewpalumbo commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/334#discussion_r136427426

          — Diff: math-scala/src/test/scala/org/apache/mahout/math/algorithms/ClusteringSuiteBase.scala —
          @@ -45,4 +48,17 @@ trait ClusteringSuiteBase extends DistributedMahoutSuite with Matchers

          { val epsilon = 1E-6 (myAnswer.norm - correctAnswer.norm) should be <= epsilon }

          +
          + test("inCore DBSCAN test") {
          +// import org.apache.mahout.math.algorithms.clustering._
          +//
          +// val dm = DistanceMetricSelector.namedMetricLookup('Euclidean)
          +//// val url: URL = getClass().getResource("./data21.txt")
          +// //TO DO //Need help
          — End diff –

          I apologize @AdityaAS, I hadn't noticed that this `.txt` file was this short - no need to move it to `$MAHOUT_HOME/examples/bin/resources` for unit test input. FYI though, if you write an example for this to be placed in `examples`, that's where you would place `.txt` files or any other resources.

          githubbot ASF GitHub Bot added a comment -

          Github user AdityaAS commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/334#discussion_r136441501

          — Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/clustering/DBSCAN.scala —
          @@ -0,0 +1,217 @@
          +/**
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing,
          + * software distributed under the License is distributed on an
          + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
          + * KIND, either express or implied. See the License for the
          + * specific language governing permissions and limitations
          + * under the License.
          + */
          +
          +package org.apache.mahout.math.algorithms.clustering
          +
          +import org.apache.mahout.math.scalabindings._
          +import org.apache.mahout.math.scalabindings.RLikeOps._
          +import org.apache.mahout.math._
          +import org.apache.mahout.math.algorithms.common.distance.{DistanceMetric, DistanceMetricSelector}
          +import org.apache.mahout.math.drm._
          +import org.apache.mahout.math.drm.RLikeDrmOps._
          +
          +import scala.collection.mutable
          +import scala.io.Source
          +
          +class DistributedDBSCAN extends ClusteringFitter {
          +
          + var epsilon: Double = _
          + var minPts: Int = _
          + var distanceMeasure: Symbol = _
          +
          + def setStandardHyperparameters(hyperparameters: Map[Symbol, Any] = Map('foo -> None)): Unit = {
          +   epsilon = hyperparameters.asInstanceOf[Map[Symbol, Double]].getOrElse('epsilon, 0.5)
          +   minPts = hyperparameters.asInstanceOf[Map[Symbol, Int]].getOrElse('minPts, 1)
          +   distanceMeasure = hyperparameters.asInstanceOf[Map[Symbol, Symbol]].getOrElse('distanceMeasure, 'Euclidean)
          + }
          +
          + def fit[K](input: DrmLike[K],
          +            hyperparameters: (Symbol, Any)*): DBSCANModel = {
          +
          +   setStandardHyperparameters(hyperparameters.toMap)
          +   implicit val ctx = input.context
          +   implicit val ktag = input.keyClassTag
          +
          +   val dmNumber = DistanceMetricSelector.namedMetricLookup(distanceMeasure)
          +
          +   val configBC = drmBroadcast(dvec(epsilon, minPts, dmNumber))
          +
          +   val clusters = input.allreduceBlock(
          +     {
          +       // Assign all points to clusters
          +       case (keys, block: Matrix) => {
          +         val epsilon_local = configBC.value.get(0)
          +         val minPts_local = configBC.value.get(1)
          +
          +         // dvec(epsilon, minPts, dmNumber) is zero-indexed, so the metric is at index 2
          +         val distanceMetric = DistanceMetricSelector.select(configBC.value.get(2))
          +         val icDBSCAN = new InCoreDBSCAN(block, epsilon_local, minPts_local.toInt, distanceMetric)
          +         icDBSCAN.DBSCAN()
          +       }
          +     }, {
          +       // Optionally merge clusters that are close enough
          +       case (metadata1: Matrix, metadata2: Matrix) => {
          +         // This does nothing yet - just returns the left matrix
          +         metadata1
          +       }
          +     })
          +
          +   val model = new DBSCANModel(1)
          +   model.summary = s"""foo the bar"""
          — End diff –

          Yes. I do. Will update it.
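          For context, a minimal in-core sketch of what a naive O(n^2) DBSCAN like the one behind `InCoreDBSCAN.DBSCAN()` could look like. This is illustrative only - the object and method names are hypothetical and do not reflect the PR's actual implementation, which operates on a Mahout `Matrix` block rather than plain arrays:

          ```scala
          // Hypothetical sketch of naive DBSCAN; every regionQuery scans all points,
          // giving the O(n^2) complexity this ticket targets.
          object NaiveDBSCAN {
            val Noise = -1
            val Unvisited = -2

            def euclidean(a: Array[Double], b: Array[Double]): Double =
              math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

            // O(n) per call: linear scan over all points
            def regionQuery(points: Array[Array[Double]], i: Int, eps: Double): Seq[Int] =
              points.indices.filter(j => euclidean(points(i), points(j)) <= eps)

            // Returns a cluster label per point; noise points get -1
            def cluster(points: Array[Array[Double]], eps: Double, minPts: Int): Array[Int] = {
              val labels = Array.fill(points.length)(Unvisited)
              var clusterId = 0
              for (i <- points.indices if labels(i) == Unvisited) {
                val neighbors = regionQuery(points, i, eps)
                if (neighbors.size < minPts) labels(i) = Noise
                else {
                  labels(i) = clusterId
                  val queue = scala.collection.mutable.Queue(neighbors: _*)
                  while (queue.nonEmpty) {
                    val j = queue.dequeue()
                    if (labels(j) == Noise) labels(j) = clusterId // border point
                    if (labels(j) == Unvisited) {
                      labels(j) = clusterId
                      val jNeighbors = regionQuery(points, j, eps)
                      if (jNeighbors.size >= minPts) queue ++= jNeighbors // core point: expand
                    }
                  }
                  clusterId += 1
                }
              }
              labels
            }
          }
          ```

          An R-tree-backed `regionQuery` (the subject of the Rectangle/MBR code below) would replace the linear scan and bring the average query cost down from O(n) to O(log n).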

          githubbot ASF GitHub Bot added a comment -

          Github user AdityaAS commented on a diff in the pull request:

          https://github.com/apache/mahout/pull/334#discussion_r136441895

          — Diff: math-scala/src/main/scala/org/apache/mahout/math/algorithms/clustering/rtree/Rectangle.scala —
          @@ -0,0 +1,157 @@
          +/**
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing,
          + * software distributed under the License is distributed on an
          + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
          + * KIND, either express or implied. See the License for the
          + * specific language governing permissions and limitations
          + * under the License.
          + */
          +
          +
          +package org.apache.mahout.math.algorithms.clustering.rtree
          +
          +import scala.math._
          +
          +/*
          + * A rectangle is defined by two parameters: the coordinates of its top-right
          + * corner and the coordinates of its bottom-left corner.
          + */
          +trait Rectangle {
          +
          +  require(bottomLeft.size == topRight.size) // Both corners must have the same dimensionality
          +
          +  def bottomLeft: List[Double] // Coordinates of the bottom-left corner
          +  def topRight: List[Double]   // Coordinates of the top-right corner
          +  def n: Int = bottomLeft.size // Dimensionality
          +
          +  // Return the lower-left corner as a Point
          +  def lowerLeftPoint: Point = Point(this.bottomLeft)
          +
          +  // Return the upper-right corner as a Point
          +  def upperRightPoint: Point = Point(this.topRight)
          +
          +  // Checks whether this rectangle contains the rectangle rect
          +  def contains(rect: Rectangle): Boolean = {
          +    for (i <- 0 until rect.n) {
          +      if (!(this.bottomLeft(i) <= rect.bottomLeft(i) && rect.topRight(i) <= this.topRight(i))) {
          +        return false
          +      }
          +    }
          +    true
          +  }
          +
          +  // Checks whether this rectangle strictly wraps rect (no shared edges)
          +  def wraps(rect: Rectangle): Boolean = {
          +    for (i <- 0 until rect.n) {
          +      if (!(this.bottomLeft(i) < rect.bottomLeft(i) && rect.topRight(i) < this.topRight(i))) {
          +        return false
          +      }
          +    }
          +    true
          +  }
          +
          +  def intersects(rect: Rectangle): Boolean = {
          +    for (i <- 0 until rect.n) {
          +      if (!(this.bottomLeft(i) <= rect.topRight(i) && rect.bottomLeft(i) <= this.topRight(i))) {
          +        return false
          +      }
          +    }
          +    true
          +  }
          +
          +  def area(): Double = {
          +    var area = 1.0
          +    for (i <- 0 until this.n) {
          +      area = area * (this.topRight(i) - this.bottomLeft(i))
          +    }
          +    area
          +  }
          +
          +  def toMBR: MBR = MBR(this.bottomLeft, this.topRight)
          +
          +  // Minimum bounding rectangle of 'this' merged with 'rect'
          +  def expandRectangle(rect: Rectangle): MBR = {
          +    val minBottomLeft = new Array[Double](rect.n)
          +    val maxTopRight = new Array[Double](rect.n)
          +    for (i <- 0 until rect.n) {
          +      minBottomLeft(i) = math.min(bottomLeft(i), rect.bottomLeft(i))
          +      maxTopRight(i) = math.max(topRight(i), rect.topRight(i))
          +    }
          +    MBR(minBottomLeft.toList, maxTopRight.toList)
          +  }
          +
          +  // Increase in area of 'this' rectangle after merging with 'rect'
          +  def diffAreaAfterExpansion(rect: Rectangle): Double = {
          +    val minBottomLeft = new Array[Double](rect.n)
          +    val maxTopRight = new Array[Double](rect.n)
          +    var area: Double = 1.0
          +
          +    for (i <- 0 until rect.n) {
          +      minBottomLeft(i) = math.min(bottomLeft(i), rect.bottomLeft(i))
          +      maxTopRight(i) = math.max(topRight(i), rect.topRight(i))
          +      area = area * (maxTopRight(i) - minBottomLeft(i))
          +    }
          +    area - this.area
          +  }
          +
          +  // Distance of a point from the rectangle (zero if the point lies inside it)
          +  def distance(pt: Point): Double = {
          +    val n = pt.coordinates.size
          +    val dList = new Array[Double](n)
          +    for (dim <- 0 until n) {
          +      val x = this.bottomLeft(dim)
          +      val x2 = this.topRight(dim)
          +      dList(dim) =
          +        if (pt.coordinates(dim) < x) x - pt.coordinates(dim)
          +        else if (pt.coordinates(dim) < x2) 0D
          +        else pt.coordinates(dim) - x2
          +    }
          +    var squaredSum: Double = 0
          +    for (i <- 0 until n) {
          +      squaredSum += dList(i) * dList(i)
          +    }
          +    sqrt(squaredSum)
          +  }
          +
          +}
          +
          +// Represents a point with coordinates and dimensionality information
          +case class Point(coordinates: List[Double]) extends Rectangle {
          +
          +  override def bottomLeft: List[Double] = coordinates
          +  override def topRight: List[Double] = coordinates
          +
          +  override def n: Int = bottomLeft.size
          +
          +  override def lowerLeftPoint: Point = this
          +  override def upperRightPoint: Point = this
          +
          +  override def distance(pt: Point): Double = {
          +    var squaredSum: Double = 0
          +    for (i <- 0 until pt.n) {
          +      squaredSum += (pt.bottomLeft(i) - bottomLeft(i)) * (pt.bottomLeft(i) - bottomLeft(i))
          +    }
          +    sqrt(squaredSum)
          +  }
          +
          +  override def wraps(rect: Rectangle): Boolean = false
          +}
          +
          +// MBR is a physical manifestation of the Rectangle trait
          +case class MBR(bottomLeft: List[Double], topRight: List[Double]) extends Rectangle {
          — End diff –

          My idea was to keep the classes describing the data structures in one file and the logic in another. But if having separate files per class is good design, I'll update it.
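          For reference, the one-file-per-data-structure layout suggested in the review might look roughly like the following hypothetical `MBR.scala` - a case class plus a companion object for factory helpers (the `enclosing` helper is illustrative, not part of the PR):

          ```scala
          // Hypothetical MBR.scala: one data structure per file, with a companion
          // object holding factory-style helpers.
          package org.apache.mahout.math.algorithms.clustering.rtree

          case class MBR(bottomLeft: List[Double], topRight: List[Double]) extends Rectangle

          object MBR {
            // Smallest MBR enclosing a non-empty set of points
            def enclosing(points: Seq[Point]): MBR = {
              val n = points.head.n
              val bl = (0 until n).map(d => points.map(_.coordinates(d)).min).toList
              val tr = (0 until n).map(d => points.map(_.coordinates(d)).max).toList
              MBR(bl, tr)
            }
          }
          ```

          Keeping the case class and its companion object in the same file is required for them to be companions in Scala, which is one argument for splitting by class rather than grouping all data structures together.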

          githubbot ASF GitHub Bot added a comment -

          Github user AdityaAS commented on the issue:

          https://github.com/apache/mahout/pull/334

          @rawkintrevo @andrewpalumbo having a separate PR for RTrees is a good idea. RTrees can be used in other algorithms as well (kNN for example). Should I go ahead and create a new issue on JIRA?

          githubbot ASF GitHub Bot added a comment -

          Github user rawkintrevo commented on the issue:

          https://github.com/apache/mahout/pull/334

          @AdityaAS Yes please. Also don't forget to add a jira ticket.

          githubbot ASF GitHub Bot added a comment -

          Github user rawkintrevo commented on the issue:

          https://github.com/apache/mahout/pull/334

          Status on this, @AdityaAS?


            People

            • Assignee:
              AdityaAS Aditya AS
              Reporter:
              AdityaAS Aditya AS
            • Votes:
              1
              Watchers:
              2

              Dates

              • Due:
                Created:
                Updated:

                Development