Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: v2.0
    • Component/s: k-NN
    • Labels:

      Description

      Follow on from https://issues.apache.org/jira/browse/MADLIB-927
      which supports one distance function. This JIRA is to

      (1)
      add additional distance metrics. The model to follow is
      http://madlib.incubator.apache.org/docs/latest/group__grp__kmeans.html

      fn_dist (optional)
      TEXT, default: 'squared_dist_norm2'. The name of the function to use to calculate the distance between data points.

      The following distance functions can be used (computation of barycenter/mean in parentheses):

      dist_norm1: 1-norm/Manhattan (element-wise median [Note that MADlib does not provide a median aggregate function for support and performance reasons.])
      dist_norm2: 2-norm/Euclidean (element-wise mean)
      squared_dist_norm2: squared Euclidean distance (element-wise mean)
      dist_angle: angle (element-wise mean of normalized points)
      dist_tanimoto: tanimoto (element-wise mean of normalized points [5])
      user defined function with signature DOUBLE PRECISION[] x, DOUBLE PRECISION[] y -> DOUBLE PRECISION

      and also check if there are other distance functions under
      http://madlib.apache.org/docs/latest/group__grp__linalg.html
      that might make sense to include while you are at it, in addition to the ones listed above.

      (2) Add an option for weighted average in the voting.
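
As a rough illustration of the distance metrics named in item (1), the sketch below writes them out in plain Python. It assumes the common textbook definitions (in particular, dist_angle as the angle in radians between the vectors and dist_tanimoto as one minus the Tanimoto similarity); it is not taken from the MADlib source, and the helper `_dot` is a hypothetical name used only here.

```python
import math


def _dot(x, y):
    # Hypothetical helper: inner product of two equal-length sequences.
    return sum(a * b for a, b in zip(x, y))


def dist_norm1(x, y):
    # 1-norm (Manhattan) distance.
    return sum(abs(a - b) for a, b in zip(x, y))


def squared_dist_norm2(x, y):
    # Squared Euclidean distance (avoids the square root).
    return sum((a - b) ** 2 for a, b in zip(x, y))


def dist_norm2(x, y):
    # 2-norm (Euclidean) distance.
    return math.sqrt(squared_dist_norm2(x, y))


def dist_angle(x, y):
    # Angle in radians between x and y; undefined for zero vectors.
    cos = _dot(x, y) / math.sqrt(_dot(x, x) * _dot(y, y))
    # Clamp to [-1, 1] to guard against floating-point drift.
    return math.acos(max(-1.0, min(1.0, cos)))


def dist_tanimoto(x, y):
    # Assumed definition: 1 - (x.y) / (|x|^2 + |y|^2 - x.y).
    xy = _dot(x, y)
    return 1.0 - xy / (_dot(x, x) + _dot(y, y) - xy)
```

Any function with the same two-array signature could stand in for a user-defined metric in this sketch.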

          Activity

          fmcquillan Frank McQuillan added a comment -

          After working on
          https://github.com/apache/madlib/pull/184
          Himanshu Pandey suggested he would like to work on this as well, so assigning to him.

          Thank you Himanshu

          fmcquillan Frank McQuillan added a comment -

          Adding a comment from Nandish Jayaram that he put in
          https://issues.apache.org/jira/browse/MADLIB-1129

          "Himanshu,
          Since you are working on including more distance functions for kNN, I thought
          extending that to the output layer might also be useful. Right now, it looks like
          MADlib does a simple average of the k-nearest neighbors to come up with the
          final value for both classification and regression. Doing a weighted average instead
          might be a desirable functionality. The weighting for the average can be based on the
          distance of the k-nearest neighbors.
          We can probably provide an optional parameter to let users choose how the final
          classification label or regression score has to be computed (avg or weighted avg).
          Frank McQuillan any thoughts?"

          I think this is a good idea to do at the same time as adding the distance functions.
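
The weighted option discussed above could look roughly like the following sketch. It assumes inverse-distance weights (1 / distance), which is one common choice but is not specified in this JIRA; the names `EPS`, `weighted_vote`, and `weighted_average` are hypothetical, and `EPS` is only there to avoid division by zero when a neighbor coincides with the test point.

```python
from collections import defaultdict

# Assumption: small constant so a zero distance does not divide by zero.
EPS = 1e-12


def weighted_vote(neighbors):
    # Classification: neighbors is a list of (distance, label) pairs.
    # Each neighbor contributes weight 1 / distance to its label.
    totals = defaultdict(float)
    for dist, label in neighbors:
        totals[label] += 1.0 / (dist + EPS)
    # Return the label with the largest accumulated weight.
    return max(totals, key=totals.get)


def weighted_average(neighbors):
    # Regression: neighbors is a list of (distance, value) pairs.
    weights = [1.0 / (dist + EPS) for dist, _ in neighbors]
    weighted_sum = sum(w * value for w, (_, value) in zip(weights, neighbors))
    return weighted_sum / sum(weights)
```

With neighbors at distances 1.0 (label 'a'), 4.0 and 4.0 (label 'b'), a simple majority would pick 'b', whereas this weighting picks 'a' because the single close neighbor outweighs the two distant ones.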

          hpandey@pivotal.io Himanshu Pandey added a comment -

          Frank McQuillan,

          Currently k-NN uses the squared_dist_norm2 function by default to calculate the distance. So with this functionality, the idea is that it can use any of these functions to calculate the distance, right?

          dist_norm1
          dist_norm2 
          squared_dist_norm2
          dist_angle 
          dist_tanimoto 
          user defined function with signature DOUBLE PRECISION[] x, DOUBLE PRECISION[] y -> DOUBLE PRECISION
          
          

          Are we going to add an overloaded function with this extra param, or make it optional in the same function?

          fmcquillan Frank McQuillan added a comment -

          Yes, that is correct. The user can pick the distance function of interest, or use the default one.

          I know Orhan Kislal is away for a bit, but Nandish Jayaram can perhaps comment on preferred implementation.

          Also, is the above the full set of distance functions available in MADlib today, or are there any other ones we could add to the list?

          njayaram Nandish Jayaram added a comment -

          By overloaded function do you mean defining multiple UDFs in SQL? If yes, then the answer is no.
          We can have fn_dist as one optional VARCHAR parameter in knn(...), as described in the description of this JIRA. It can take any of the values you have mentioned, while the default could be squared_dist_norm2. We can process the string in Python to figure out and use the correct distance metric.

          Please let me know if I misunderstood your question.
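
A minimal sketch of that string processing, under stated assumptions: the names `BUILTIN_DIST_FUNCTIONS`, `DEFAULT_FN_DIST`, and `resolve_fn_dist` are hypothetical and not part of MADlib, and checking that a user-defined function actually exists with the expected signature would need a database catalog lookup that is left out here.

```python
# Built-in distance function names listed in this JIRA.
BUILTIN_DIST_FUNCTIONS = frozenset([
    'dist_norm1',
    'dist_norm2',
    'squared_dist_norm2',
    'dist_angle',
    'dist_tanimoto',
])

# Default proposed in the description.
DEFAULT_FN_DIST = 'squared_dist_norm2'


def resolve_fn_dist(fn_dist=None):
    # Return (function_name, is_builtin) for the optional fn_dist argument.
    # A missing or blank value falls back to the default.
    if fn_dist is None or not fn_dist.strip():
        return DEFAULT_FN_DIST, True
    name = fn_dist.strip()
    # Built-in names are matched case-insensitively.
    if name.lower() in BUILTIN_DIST_FUNCTIONS:
        return name.lower(), True
    # Anything else is treated as a user-defined function name; its
    # existence and signature are NOT verified in this sketch.
    return name, False
```

Keeping a single optional parameter with a default, rather than multiple SQL UDF overloads, means existing calls that omit fn_dist would keep their current behavior.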

          hpandey@pivotal.io Himanshu Pandey added a comment -

          Nandish Jayaram Yes, I meant multiple UDFs. Thanks for the clarification!


            People

            • Assignee: hpandey12 Himanshu Pandey
            • Reporter: fmcquillan Frank McQuillan
            • Votes: 0
            • Watchers: 3
