Mahout
  1. Mahout
  2. MAHOUT-900

RandomSeedGenerator samples / output k texts incorrectly

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 0.5
    • Fix Version/s: 0.6
    • Component/s: Clustering
    • Labels:
      None

      Description

                int currentSize = chosenTexts.size();
                if (currentSize < k) {
                  chosenTexts.add(newText);
                  chosenClusters.add(newCluster);
                } else if (random.nextInt(currentSize + 1) == 0) { // with chance 1/(currentSize+1) pick new element
                  int indexToRemove = random.nextInt(currentSize); // evict one chosen randomly
                  chosenTexts.remove(indexToRemove);
                  chosenClusters.remove(indexToRemove);
                  chosenTexts.add(newText);
                  chosenClusters.add(newCluster);
                }
      

      The second "if" condition ought to be "!= 0", right? Only if it is 0 do we skip the body, which removes an existing element, since the new element itself is evicted.

      Second, this code:

              for (int i = 0; i < k; i++) {
                writer.append(chosenTexts.get(i), chosenClusters.get(i));
              }
      

      ... assumes that at least k elements existed in the input, and fails otherwise. Probably need to cap this.

      Patch attached.

        Activity

        Sean Owen created issue -
        Sean Owen made changes -
        Field Original Value New Value
        Status Open [ 1 ] Patch Available [ 10002 ]
        Sean Owen made changes -
        Attachment MAHOUT-900.patch [ 12505318 ]
        Sean Owen made changes -
        Description           int currentSize = chosenTexts.size();
                  if (currentSize < k) {
                    chosenTexts.add(newText);
                    chosenClusters.add(newCluster);
                  } else if (random.nextInt(currentSize + 1) == 0) { // with chance 1/(currentSize+1) pick new element
                    int indexToRemove = random.nextInt(currentSize); // evict one chosen randomly
                    chosenTexts.remove(indexToRemove);
                    chosenClusters.remove(indexToRemove);
                    chosenTexts.add(newText);
                    chosenClusters.add(newCluster);
                  }

        The second "if" condition ought to be "!= 0", right? Only if it is 0 do we skip the body, which removes an existing element, since the new element itself is evicted.

        Second, this code:

                for (int i = 0; i < k; i++) {
                  writer.append(chosenTexts.get(i), chosenClusters.get(i));
                }

        ... assumes that at least k elements existed in the input, and fails otherwise. Probably need to cap this.

        Patch attached.
        {code}
                  int currentSize = chosenTexts.size();
                  if (currentSize < k) {
                    chosenTexts.add(newText);
                    chosenClusters.add(newCluster);
                  } else if (random.nextInt(currentSize + 1) == 0) { // with chance 1/(currentSize+1) pick new element
                    int indexToRemove = random.nextInt(currentSize); // evict one chosen randomly
                    chosenTexts.remove(indexToRemove);
                    chosenClusters.remove(indexToRemove);
                    chosenTexts.add(newText);
                    chosenClusters.add(newCluster);
                  }
        {code}

        The second "if" condition ought to be "!= 0", right? Only if it is 0 do we skip the body, which removes an existing element, since the new element itself is evicted.

        Second, this code:

        {code}
                for (int i = 0; i < k; i++) {
                  writer.append(chosenTexts.get(i), chosenClusters.get(i));
                }
        {code}

        ... assumes that at least k elements existed in the input, and fails otherwise. Probably need to cap this.

        Patch attached.
        Hide
        Sean Owen added a comment -

        I committed as tests passed and I think Robin gave the nod in a private thread.

        Show
        Sean Owen added a comment - I committed as tests passed and I think Robin gave the nod in a private thread.
        Sean Owen made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Assignee Robin Anil [ robinanil ] Sean Owen [ srowen ]
        Resolution Fixed [ 1 ]
        Hide
        Hudson added a comment -

        Integrated in Mahout-Quality #1208 (See https://builds.apache.org/job/Mahout-Quality/1208/)
        MAHOUT-900 fix sampling logic and handle case of < k elements

        srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1207508
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java
        Show
        Hudson added a comment - Integrated in Mahout-Quality #1208 (See https://builds.apache.org/job/Mahout-Quality/1208/ ) MAHOUT-900 fix sampling logic and handle case of < k elements srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1207508 Files : /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java
        Sean Owen made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Sean Owen
            Reporter:
            Sean Owen
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Due:
              Created:
              Updated:
              Resolved:

              Development