MAHOUT-326

a possible bug with the isConverged() method in KMeansDriver.java

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.2
    • Fix Version/s: 0.4
    • Component/s: Clustering
    • Labels: None

      Description

      In one of today's test runs using the clustering example from the book "Mahout in Action", I noticed the following exception thrown by KMeansClusterMapper:

      ----------------------------
      java.lang.RuntimeException: Error in configuring object
          at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
          at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
          at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
          at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)
          at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
          at org.apache.hadoop.mapred.Child.main(Child.java:159)
      Caused by: java.lang.reflect.InvocationTargetException
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
          ... 5 more
      Caused by: java.lang.RuntimeException: Error in configuring object
          at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
          at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
          at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
          at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
          ... 10 more
      Caused by: java.lang.reflect.InvocationTargetException
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
          ... 13 more
      Caused by: java.lang.NullPointerException: Cluster is empty!!!
          at org.apache.mahout.clustering.kmeans.KMeansClusterMapper.configure(KMeansClusterMapper.java:63)
      ---------------------------

      which says that the runClustering method didn't see the cluster output. The same map task did finally succeed after a few failed attempts.

      After looking into KMeansDriver.java, I think there may be a bug in the isConverged method. Basically, this method doesn't wait for the cluster output file to be fully populated. If the part-* file doesn't exist yet or has not been fully written, then this method can return true prematurely. I am not sure if this is a bug in Hadoop itself, because it may report a successful job before the mapred output file is fully written. Meanwhile, a possible way to fix this problem is to force the isConverged method to wait for the existence of the cluster output file and make sure the file contains the 'converged' values for all the clusters.

      Please note, I have seen this problem only once in the many test runs I have done so far. It may be a little difficult to reproduce. If you need any further information, please let me know.

      Thanks.
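The check proposed in the description could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the local filesystem through java.nio.file as a stand-in for Hadoop's FileSystem, and the class name OutputReadySketch and the method outputLooksReady are hypothetical names that do not appear in Mahout. It covers only the "output exists and is non-empty" part of the proposal, not the per-cluster 'converged' values.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class OutputReadySketch {

    /**
     * Returns true only if the directory exists and holds at least one
     * non-empty part-* file. Hypothetical local-filesystem stand-in for
     * a check that would run against HDFS in the real driver.
     */
    public static boolean outputLooksReady(Path clustersOut) throws IOException {
        if (!Files.isDirectory(clustersOut)) {
            return false; // output directory has not been created yet
        }
        // The directory stream is closed automatically when the block exits.
        try (DirectoryStream<Path> parts = Files.newDirectoryStream(clustersOut, "part-*")) {
            for (Path part : parts) {
                if (Files.size(part) > 0) {
                    return true;
                }
            }
        }
        return false; // no part-* file, or only empty ones
    }
}
```

A driver written this way would treat a missing or empty output directory as "not ready" instead of letting a convergence check run against it.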

      1. mahout_bug.png
        146 kB
        Chad Chen

        Activity

        Chad Chen created issue -
        Chad Chen added a comment -

        I am attaching an image to show that the map task failed twice before it succeeded on the third attempt. Second and third attempts are on the same node so the node itself should not be the problem. It's more likely that the cluster output file was ready only on the third attempt.

        Chad Chen made changes -
        Field Original Value New Value
        Attachment mahout_bug.png [ 12438482 ]
        Robin Anil added a comment -

        Hi Chad, you are talking about chapter 7, right? That example was meant to be run like a normal Java program, so I am glad that you made it work on the cluster. I tried running it on my local cluster, but I can't seem to reproduce the error. Can you give some more details about your setup?

        Robin Anil added a comment -

        BTW, I hope you tried it with the latest trunk. I see that you have tagged this issue with version 0.2. If that's the case, I would strongly suggest that you move to the trunk.

        Chad Chen added a comment -

        Hi Robin,
        The bug does seem to be very difficult to reproduce. I have run the same test many more times and have not seen the same problem again.
        Unfortunately I lost the original logging output to the console. Otherwise, I would be able to tell whether or not the jc.monitorAndPrintJob call in the following method returned true:

        public static RunningJob runJob(JobConf job) throws IOException {
          JobClient jc = new JobClient(job);
          RunningJob rj = jc.submitJob(job);
          try {
            if (!jc.monitorAndPrintJob(job, rj)) {
              throw new IOException("Job failed!");
            }
          } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
          }
          return rj;
        }

        However, I am a little confused about the following code block:
        private static boolean runIteration(..) {
          ...
          try {
            JobClient.runJob(conf);
            FileSystem fs = FileSystem.get(outPath.toUri(), conf);
            return isConverged(clustersOut, conf, fs);
          } catch (IOException e) {
            log.warn(e.toString(), e);
            return true;
          }
        }

        So if the call to JobClient.runJob throws an IOException, runIteration will return true? In this case, the runClustering method may encounter the same problem I saw (i.e., the cluster output file was not ready). Is my understanding correct?

        Thanks.
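The concern raised in this comment can be illustrated with a small self-contained sketch. It does not use the Hadoop API: IterationJob is a hypothetical stand-in for "run one job, then check convergence", and both method names are invented for the illustration. The first method mirrors the shape of the quoted catch block; the second shows one alternative, letting the failure propagate.

```java
import java.io.IOException;

public class RunIterationSketch {

    /** Hypothetical stand-in for submitting one k-means iteration job. */
    public interface IterationJob {
        /** Runs the job and returns whether all clusters converged. */
        boolean runAndCheckConverged() throws IOException;
    }

    /** Shape questioned in the thread: a failure is logged and reported as "converged". */
    public static boolean runIterationSwallowing(IterationJob job) {
        try {
            return job.runAndCheckConverged();
        } catch (IOException e) {
            System.err.println("iteration failed: " + e);
            return true; // the caller stops iterating as if the clusters had converged
        }
    }

    /** Alternative: let the failure propagate so the driver stops with an error. */
    public static boolean runIterationPropagating(IterationJob job) throws IOException {
        return job.runAndCheckConverged();
    }
}
```

With the swallowing variant, a failed job is indistinguishable from a converged one, so a later clustering step could start against output that was never written.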

        Jeff Eastman added a comment -

        The isConverged method was not closing the SequenceFileReader and this was causing the mapper to fail at unpredictable times due to a lack of available file handles. This has been fixed in KMeansDriver and FuzzyKMeansDriver as well. Shall we close this issue or keep it around for a while longer?
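The pattern described here, closing the reader on every exit path of the convergence check, might look roughly like the sketch below. It is not the committed fix: ClusterFlagReader is a hypothetical stand-in for the SequenceFile reader, and it uses try-with-resources, which requires Java 7 or later (a try/finally block would do the same job on older JVMs). Treating empty output as "not converged" is also an assumption of this sketch.

```java
import java.io.Closeable;
import java.util.Iterator;
import java.util.List;

public class ConvergenceCheckSketch {

    /** Hypothetical stand-in for a reader over the per-cluster 'converged' flags. */
    public static final class ClusterFlagReader implements Closeable {
        private final Iterator<Boolean> flags;
        private boolean closed = false;

        public ClusterFlagReader(List<Boolean> convergedFlags) {
            this.flags = convergedFlags.iterator();
        }

        public boolean hasNext() { return flags.hasNext(); }

        public boolean nextConverged() { return flags.next(); }

        public boolean isClosed() { return closed; }

        @Override
        public void close() { closed = true; } // a real reader would release its file handle here
    }

    /**
     * Returns true only if at least one cluster was read and all were converged.
     * The reader is closed on every path, including the early return.
     */
    public static boolean isConverged(ClusterFlagReader reader) {
        try (ClusterFlagReader r = reader) {
            boolean sawAny = false;
            while (r.hasNext()) {
                sawAny = true;
                if (!r.nextConverged()) {
                    return false; // early exit still closes the reader
                }
            }
            return sawAny; // assumption: empty output counts as not converged
        }
    }
}
```

Because the driver calls the check once per iteration, a reader left open on each call would accumulate open handles over a long run, which matches the unpredictable failures described above.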

        Sean Owen added a comment -

        Sounds close enough to fixed to move ahead on this issue.

        Sean Owen made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Assignee Jeff Eastman [ jeastman ]
        Fix Version/s 0.4 [ 12314396 ]
        Resolution Fixed [ 1 ]
        Chad Chen added a comment - - edited

        Is it still the same in terms of the try and catch blocks below?

        try {
          JobClient.runJob(conf);
          FileSystem fs = FileSystem.get(outPath.toUri(), conf);
          return isConverged(clustersOut, conf, fs);
        } catch (IOException e) {
          log.warn(e.toString(), e);
          return true; // <---- this seems to be a little buggy
        }
        Sean Owen added a comment -

        Nope, it just errors in this case now.

        Chad Chen added a comment -

        Great. Thanks for the fix.

        Sean Owen made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee: Jeff Eastman
          • Reporter: Chad Chen
          • Votes: 0
          • Watchers: 0
