Mahout / MAHOUT-1771

Cluster dumper omits indices and 0 elements for dense vector or sparse containing 0s

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.9
    • Fix Version/s: 0.11.1
    • Component/s: Clustering, mrlegacy
    • Labels:
      None

      Description

      (EDIT: fixed incorrect analysis)

      Blast from the past – are patches still being accepted for "mrlegacy" code? Something turned up incidentally when working with a customer that looks like a minor bug in the cluster dumper code.

      In AbstractCluster.java:

      public static List<Object> formatVectorAsJson(Vector v, String[] bindings) throws IOException {
      
          boolean hasBindings = bindings != null;
          boolean isSparse = !v.isDense() && v.getNumNondefaultElements() != v.size();
      
          // we assume sequential access in the output
          Vector provider = v.isSequentialAccess() ? v : new SequentialAccessSparseVector(v);
      
          List<Object> terms = new LinkedList<>();
          String term = "";
      
          for (Element elem : provider.nonZeroes()) {
      
            if (hasBindings && bindings.length >= elem.index() + 1 && bindings[elem.index()] != null) {
              term = bindings[elem.index()];
            } else if (hasBindings || isSparse) {
              term = String.valueOf(elem.index());
            }
      
            Map<String, Object> term_entry = new HashMap<>();
            double roundedWeight = (double) Math.round(elem.get() * 1000) / 1000;
            if (hasBindings || isSparse) {
              term_entry.put(term, roundedWeight);
              terms.add(term_entry);
            } else {
              terms.add(roundedWeight);
            }
          }
      
          return terms;
      }
      

      The problem is that this never outputs elements whose value is 0, and in some cases it also omits the indices. The output is then shorter than the number of dimensions, with no way to tell where the omitted 0s were.

      Indices are omitted if the vector is dense, or if the number of non-default elements equals the size (which covers sparse vectors that contain 0 values, if those were set explicitly). The iteration, however, covers non-zero elements only.
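To make the ambiguity concrete, here is a minimal sketch that models only the dense, no-bindings path with a plain `double[]`. The class `DenseDumpSketch` and its `dump` method are hypothetical stand-ins, not Mahout code. They just mirror the "skip zeros, emit bare weights" behaviour of the method quoted above.

```java
import java.util.LinkedList;
import java.util.List;

class DenseDumpSketch {

  // Hypothetical stand-in for the dense, no-bindings path: zeros are skipped
  // and only the rounded weights are emitted, without indices.
  static List<Object> dump(double[] values) {
    List<Object> terms = new LinkedList<>();
    for (double value : values) {
      if (value == 0.0) {
        continue; // mirrors iterating over nonZeroes()
      }
      terms.add((double) Math.round(value * 1000) / 1000);
    }
    return terms;
  }

  public static void main(String[] args) {
    // Two different 3-dimensional vectors produce the same 2-element output.
    System.out.println(dump(new double[] {1.5, 0.0, 2.25})); // [1.5, 2.25]
    System.out.println(dump(new double[] {0.0, 1.5, 2.25})); // [1.5, 2.25]
  }
}
```

Both inputs yield the same list, so a reader of the dump cannot recover which dimension held the 0.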

      The fix seems to be to print indices whenever the number of non-zero elements is less than the size, for any vector:

          boolean isSparse = v.getNumNonZeroElements() != v.size();
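As a rough illustration of the proposed rule, the sketch below again uses a plain `double[]` rather than a Mahout `Vector`, and leaves out bindings entirely. The class `IndexedDumpSketch` and its `dump` method are hypothetical. The only point is that indices are emitted whenever the count of non-zero elements differs from the size.

```java
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;

class IndexedDumpSketch {

  // Hypothetical stand-in: emit {index=weight} entries whenever at least one
  // element is zero, and bare weights only when every element is non-zero.
  static List<Object> dump(double[] values) {
    int nonZero = 0;
    for (double value : values) {
      if (value != 0.0) {
        nonZero++;
      }
    }
    boolean printIndices = nonZero != values.length;

    List<Object> terms = new LinkedList<>();
    for (int i = 0; i < values.length; i++) {
      if (values[i] == 0.0) {
        continue; // zeros are still skipped, as in the original loop
      }
      double rounded = (double) Math.round(values[i] * 1000) / 1000;
      if (printIndices) {
        terms.add(Collections.singletonMap(String.valueOf(i), rounded));
      } else {
        terms.add(rounded);
      }
    }
    return terms;
  }

  public static void main(String[] args) {
    System.out.println(dump(new double[] {1.5, 0.0, 2.25})); // [{0=1.5}, {2=2.25}]
    System.out.println(dump(new double[] {1.5, 3.0, 2.25})); // [1.5, 3.0, 2.25]
  }
}
```

With indices present, the position of each omitted 0 can be inferred from the gaps, while a vector with no zeros keeps the compact form.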
      

      Pretty straightforward, and minor, but wanted to check with everyone before making a change.

            People

            • Assignee:
              smarthi Suneel Marthi
              Reporter:
              srowen Sean Owen