  1. Mahout
  2. MAHOUT-1019

VectorDistanceSimilarityJob

    Details

      Description

      The VectorDistanceSimilarityJob is a fantastic tool, but it risks producing terabytes of output of dubious value. For example, I have ~10K seed vectors and millions of vectors to compare them against. I would like to add an optional parameter to this job that specifies a maximum distance threshold, so that any distance above this value is not written to the output. The default would be 1.0d so that no filtering is applied, which ensures backwards compatibility; if a value is supplied, only rows where the distance is less than the threshold would be output from the mapper. This can reduce the storage requirements of the output immensely. The parameter could be named something like: noOutputIfDistanceGreaterThan
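
      The filtering idea described above can be sketched in plain Java, without any Hadoop or Mahout dependencies. This is only an illustration under stated assumptions, not the attached patch: the class name DistanceThresholdSketch, the Row type, and the maxDistance argument are all hypothetical, Euclidean distance stands in for whatever distance measure the job is configured with, and a pair is dropped only when its distance is strictly greater than the threshold (following the proposed parameter name).

```java
import java.util.ArrayList;
import java.util.List;

public class DistanceThresholdSketch {

    // Hypothetical output row: seed index, vector index, and their distance.
    static final class Row {
        final int seedIndex;
        final int vectorIndex;
        final double distance;

        Row(int seedIndex, int vectorIndex, double distance) {
            this.seedIndex = seedIndex;
            this.vectorIndex = vectorIndex;
            this.distance = distance;
        }
    }

    // Plain Euclidean distance; a stand-in for the job's configured measure.
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Compares every vector with every seed, as a mapper would per input
    // vector, and keeps a row only if the distance does not exceed maxDistance.
    static List<Row> filteredDistances(double[][] seeds, double[][] vectors, double maxDistance) {
        List<Row> out = new ArrayList<>();
        for (int v = 0; v < vectors.length; v++) {
            for (int s = 0; s < seeds.length; s++) {
                double d = euclidean(seeds[s], vectors[v]);
                if (d <= maxDistance) {
                    out.add(new Row(s, v, d)); // anything farther away is never written
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] seeds = {{0.0, 0.0}};
        double[][] vectors = {{0.3, 0.4}, {3.0, 4.0}};
        // With a threshold of 1.0, only the nearby vector should survive.
        for (Row r : filteredDistances(seeds, vectors, 1.0)) {
            System.out.println(r.seedIndex + "\t" + r.vectorIndex + "\t" + r.distance);
        }
    }
}
```

      In the sketch the threshold check happens before a row is created, which mirrors the intent of filtering in the mapper: rejected pairs never reach the output, so they cost no storage.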

        Attachments

        1. MAHOUT-1019.patch
          5 kB
          Timothy Potter

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              thelabdude Timothy Potter
            • Votes:
              1
              Watchers:
              3

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                Estimated:
                12h
                Remaining:
                12h
                Logged:
                Not Specified