Mahout
  1. Mahout
  2. MAHOUT-900

RandomSeedGenerator samples / output k texts incorrectly

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 0.5
    • Fix Version/s: 0.6
    • Component/s: Clustering
    • Labels:
      None

      Description

                int currentSize = chosenTexts.size();
                if (currentSize < k) {
                  chosenTexts.add(newText);
                  chosenClusters.add(newCluster);
                } else if (random.nextInt(currentSize + 1) == 0) { // with chance 1/(currentSize+1) pick new element
                  int indexToRemove = random.nextInt(currentSize); // evict one chosen randomly
                  chosenTexts.remove(indexToRemove);
                  chosenClusters.remove(indexToRemove);
                  chosenTexts.add(newText);
                  chosenClusters.add(newCluster);
                }
      

      The second "if" condition ought to be "!= 0", right? Only if it is 0 do we skip the body, which removes an existing element, since the new element itself is evicted.

      Second, this code:

              for (int i = 0; i < k; i++) {
                writer.append(chosenTexts.get(i), chosenClusters.get(i));
              }
      

      ... assumes that at least k elements existed in the input, and fails otherwise. Probably need to cap this.

      Patch attached.

        Activity

        Transition Time In Source Status Execution Times Last Executer Last Execution Date
        Open Open Patch Available Patch Available
        45s 1 Sean Owen 28/Nov/11 10:41
        Patch Available Patch Available Resolved Resolved
        8h 13m 1 Sean Owen 28/Nov/11 18:54
        Resolved Resolved Closed Closed
        72d 19h 5m 1 Sean Owen 09/Feb/12 14:00
        Sean Owen made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Hide
        Hudson added a comment -

        Integrated in Mahout-Quality #1208 (See https://builds.apache.org/job/Mahout-Quality/1208/)
        MAHOUT-900 fix sampling logic and handle case of < k elements

        srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1207508
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java
        Show
        Hudson added a comment - Integrated in Mahout-Quality #1208 (See https://builds.apache.org/job/Mahout-Quality/1208/ ) MAHOUT-900 fix sampling logic and handle case of < k elements srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1207508 Files : /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java
        Sean Owen made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Assignee Robin Anil [ robinanil ] Sean Owen [ srowen ]
        Resolution Fixed [ 1 ]
        Hide
        Sean Owen added a comment -

        I committed as tests passed and I think Robin gave the nod in a private thread.

        Show
        Sean Owen added a comment - I committed as tests passed and I think Robin gave the nod in a private thread.
        Sean Owen made changes -
        Description           int currentSize = chosenTexts.size();
                  if (currentSize < k) {
                    chosenTexts.add(newText);
                    chosenClusters.add(newCluster);
                  } else if (random.nextInt(currentSize + 1) == 0) { // with chance 1/(currentSize+1) pick new element
                    int indexToRemove = random.nextInt(currentSize); // evict one chosen randomly
                    chosenTexts.remove(indexToRemove);
                    chosenClusters.remove(indexToRemove);
                    chosenTexts.add(newText);
                    chosenClusters.add(newCluster);
                  }

        The second "if" condition ought to be "!= 0", right? Only if it is 0 do we skip the body, which removes an existing element, since the new element itself is evicted.

        Second, this code:

                for (int i = 0; i < k; i++) {
                  writer.append(chosenTexts.get(i), chosenClusters.get(i));
                }

        ... assumes that at least k elements existed in the input, and fails otherwise. Probably need to cap this.

        Patch attached.
        {code}
                  int currentSize = chosenTexts.size();
                  if (currentSize < k) {
                    chosenTexts.add(newText);
                    chosenClusters.add(newCluster);
                  } else if (random.nextInt(currentSize + 1) == 0) { // with chance 1/(currentSize+1) pick new element
                    int indexToRemove = random.nextInt(currentSize); // evict one chosen randomly
                    chosenTexts.remove(indexToRemove);
                    chosenClusters.remove(indexToRemove);
                    chosenTexts.add(newText);
                    chosenClusters.add(newCluster);
                  }
        {code}

        The second "if" condition ought to be "!= 0", right? Only if it is 0 do we skip the body, which removes an existing element, since the new element itself is evicted.

        Second, this code:

        {code}
                for (int i = 0; i < k; i++) {
                  writer.append(chosenTexts.get(i), chosenClusters.get(i));
                }
        {code}

        ... assumes that at least k elements existed in the input, and fails otherwise. Probably need to cap this.

        Patch attached.
        Sean Owen made changes -
        Attachment MAHOUT-900.patch [ 12505318 ]
        Sean Owen made changes -
        Field Original Value New Value
        Status Open [ 1 ] Patch Available [ 10002 ]
        Sean Owen created issue -

          People

          • Assignee:
            Sean Owen
            Reporter:
            Sean Owen
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Due:
              Created:
              Updated:
              Resolved:

              Development