Details
Type: Improvement
Status: Closed
Priority: Minor
Resolution: Fixed
0.8
all
Description
The VectorDistanceSimilarityJob is a fantastic tool, but it risks creating terabytes of output of dubious value. For example, I have ~10K seed vectors and millions of vectors to compare against them. I would like to add an optional parameter to this job that specifies a maximum distance threshold, so that any distance above this value is not written to the output. The default would be 1.0d, so no filtering is applied, which ensures backwards compatibility. If the parameter is supplied, only rows where the distance does not exceed the threshold would be output from the mapper. This can reduce the storage requirements of the output immensely. The parameter could be named something like noOutputIfDistanceGreaterThan.
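To make the proposed behaviour concrete, here is a minimal standalone sketch of the filtering step in plain Java. It is not the job's actual mapper code: the class name, the method names, and the choice of cosine distance are all assumptions made for illustration. It also assumes non-negative vector components, so that distances fall in [0, 1] and a threshold of 1.0 filters nothing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from Mahout.
class DistanceThresholdSketch {

    // Cosine distance (1 - cosine similarity).
    // Assumes non-zero vectors of equal length.
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns one {seedIndex, vectorIndex, distance} row per pair whose
    // distance does not exceed maxDistance; all other pairs are skipped,
    // which is the point where a mapper would decline to write output.
    static List<double[]> pairsWithinThreshold(double[][] seeds,
                                               double[][] vectors,
                                               double maxDistance) {
        List<double[]> rows = new ArrayList<>();
        for (int s = 0; s < seeds.length; s++) {
            for (int v = 0; v < vectors.length; v++) {
                double d = cosineDistance(seeds[s], vectors[v]);
                if (d <= maxDistance) {
                    rows.add(new double[] {s, v, d});
                }
            }
        }
        return rows;
    }
}
```

For a seed {1, 0} compared against {1, 0}, {0, 1}, and {1, 1}, a threshold of 1.0 keeps all three pairs, while a threshold of 0.5 drops the orthogonal vector {0, 1} and keeps the other two.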