MAHOUT-958

NullPointerException in RepresentativePointsMapper when running cluster-reuters.sh example with kmeans

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6
    • Fix Version/s: 0.8
    • Component/s: Examples
    • Labels:
      None
    • Environment:

      Description

      > svn info
      Path: .
      URL: http://svn.apache.org/repos/asf/mahout/trunk
      Repository Root: http://svn.apache.org/repos/asf
      Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
      Revision: 1235544
      Node Kind: directory
      Schedule: normal
      Last Changed Author: tdunning
      Last Changed Rev: 1231800
      Last Changed Date: 2012-01-15 16:01:38 -0800 (Sun, 15 Jan 2012)
      
      > ./examples/bin/cluster-reuters.sh
      ...
      1. kmeans clustering
      ...
      Inter-Cluster Density: NaN
      Intra-Cluster Density: 0.0
      CDbw Inter-Cluster Density: 0.0
      CDbw Intra-Cluster Density: NaN
      CDbw Separation: 0.0
      12/01/24 16:08:47 INFO clustering.ClusterDumper: Wrote 20 clusters
      12/01/24 16:08:47 INFO driver.MahoutDriver: Program took 126749 ms (Minutes: 2.1124833333333335)
      

      All five "Representative Points Driver" jobs fail.

      2012-01-24 16:07:11,555 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
      2012-01-24 16:07:11,881 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
      2012-01-24 16:07:11,896 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
      2012-01-24 16:07:11,896 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680
      2012-01-24 16:07:11,956 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
      2012-01-24 16:07:11,979 INFO org.apache.hadoop.io.nativeio.NativeIO: Initialized cache for UID to User mapping with a cache timeout of 14400 seconds.
      2012-01-24 16:07:11,979 INFO org.apache.hadoop.io.nativeio.NativeIO: Got UserName vernica for UID 1000 from the native implementation
      2012-01-24 16:07:11,981 WARN org.apache.hadoop.mapred.Child: Error running child
      java.lang.NullPointerException
      	at org.apache.mahout.clustering.evaluation.RepresentativePointsMapper.mapPoint(RepresentativePointsMapper.java:73)
      	at org.apache.mahout.clustering.evaluation.RepresentativePointsMapper.map(RepresentativePointsMapper.java:60)
      	at org.apache.mahout.clustering.evaluation.RepresentativePointsMapper.map(RepresentativePointsMapper.java:40)
      	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
      	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
      	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
      	at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at javax.security.auth.Subject.doAs(Subject.java:415)
      	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
      	at org.apache.hadoop.mapred.Child.main(Child.java:253)
      
      Attachments:
      • MAHOUT-958.patch (2 kB) by Vikram Dixit K

        Activity

        Grant Ingersoll added a comment - edited

        Hmm, works for me on Mac. Trying Linux.

        Grant Ingersoll added a comment -

        Works for me on Mac. I wonder if it is Java 7? Also, can you try a clean install?

        ehgjr added a comment -

        Same error here when running on top of Hadoop; running locally is just fine. I guess it has something to do with the evaluation (-e) part of clusterdump. Please help fix this. Many thanks...

        ehgjr added a comment -

        I've seen the problem. It's a wildcard issue with FsShell.

        Grant Ingersoll added a comment -

        ehgjr: do you have a patch, by chance?

        ehgjr added a comment -

        Hello Grant. I would love to, but I'm just a newbie on this, still trying to learn Mahout...

        Todd Rose added a comment - edited

        I'm just starting with Mahout and I had the exact same issue running with Hadoop on a single node. I did an mvn clean install on the main project, examples, core, and integration, then copied the jar files from the target directories to the main project directory and it worked fine.

        Edit: This problem re-appeared (and has me asking whether it ever actually worked on this version); all the rebuilding in the world hasn't fixed it. The problem is that the List "repPoints" in RepresentativePointsMapper is null after the call to ".get(key)". Checking for null and skipping the totalDistance summation stops the error, but I'm not sure if it's just hiding another problem further up.

        if (repPoints != null) {
          for (VectorWritable refPoint : repPoints) {
            totalDistance += measure.distance(refPoint.get(), point.getVector());
          }
        }

        Johannes Rauber added a comment -

        The same happens for me when using the option -e, even when running locally.

        Current JDK is 1.6 and not 1.7 like yours.

        hduser@johannesvb:~$ java -version
        java version "1.6.0_35"
        Java(TM) SE Runtime Environment (build 1.6.0_35-b10)
        Java HotSpot(TM) 64-Bit Server VM (build 20.10-b01, mixed mode)

        I used the fix from Todd Rose, which kind of fixed it. But the output

        12/09/29 22:58:59 INFO evaluation.ClusterEvaluator: Inter-Cluster Density = NaN
        12/09/29 22:58:59 INFO evaluation.ClusterEvaluator: Intra-Cluster Density = 0.0
        12/09/29 22:59:01 INFO clustering.ClusterDumper: Wrote 200 clusters

        implies that something breaks.

        Adam J. Baron added a comment -

        I had the exact same issue, but what ehgjr said about wildcards in a January 2012 comment gave me an idea.

        The problem in the cluster-reuters.sh script is the 'clusters-*-final':

        $MAHOUT clusterdump \
          -i ${WORK_DIR}/reuters-kmeans/clusters-*-final \
          -o ${WORK_DIR}/reuters-kmeans/clusterdump \
          -d ${WORK_DIR}/reuters-out-seqdir-sparse-kmeans/dictionary.file-0 \
          -dt sequencefile -b 100 -n 20 --evaluate -dm org.apache.mahout.common.distance.CosineDistanceMeasure -sp 0 \
          --pointsDir ${WORK_DIR}/reuters-kmeans/clusteredPoints

        For me, the clusters-*-final resolved to clusters-2-final. So I just re-ran that one clusterdump command outside of the cluster-reuters.sh script using 'clusters-2-final' instead and all ran fine. Obviously not a fix to cluster-reuters.sh, but a workaround to help you see the clusterdump results.

        PS: I'm running this over a 20-node Hadoop cluster, not locally. It seems strange that the --input, --dictionary and --pointsDir parameters reference HDFS locations while the --output parameter references your EdgeNode's file system.

        Vikram Dixit K added a comment -

        This fixes the null pointer exception. The issue was basically the * in the path name, as mentioned earlier by someone. I used the globStatus API from HDFS to get the directory that matches the glob pattern and used that as the input directory instead of directly using the passed-in string. Please let me know if this works as expected in your case.

        Thanks
        Vikram.
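
        The glob-resolution idea in the patch can be sketched against the local filesystem (this is an illustration, not the actual patch: the real fix calls Hadoop's FileSystem.globStatus on HDFS paths, and the class name GlobResolve and the directory names here are hypothetical):

        ```java
        import java.io.IOException;
        import java.nio.file.DirectoryStream;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.util.ArrayList;
        import java.util.List;

        public class GlobResolve {
            // Resolve a glob like "clusters-*-final" to concrete directory paths.
            // Hadoop's FileSystem.globStatus plays the equivalent role for HDFS.
            static List<Path> resolveGlob(Path baseDir, String glob) throws IOException {
                List<Path> matches = new ArrayList<>();
                try (DirectoryStream<Path> stream = Files.newDirectoryStream(baseDir, glob)) {
                    for (Path p : stream) {
                        matches.add(p);
                    }
                }
                return matches;
            }

            public static void main(String[] args) throws IOException {
                Path base = Files.createTempDirectory("kmeans");
                Files.createDirectory(base.resolve("clusters-0"));
                Files.createDirectory(base.resolve("clusters-1"));
                Files.createDirectory(base.resolve("clusters-2-final"));

                // The literal string "clusters-*-final" names no real directory,
                // but resolving the glob first yields the concrete path, which can
                // then be used as the job's input directory.
                List<Path> resolved = resolveGlob(base, "clusters-*-final");
                System.out.println(resolved.size());
                System.out.println(resolved.get(0).getFileName());
            }
        }
        ```

        Passing the resolved concrete path downstream avoids the mapper ever seeing the unmatched pattern string, which is what left repPoints null.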

        Vikram Dixit K added a comment -

        Fixes the NPE in representative points mapper.

        Grant Ingersoll added a comment -

        I couldn't reproduce this, but I suspect it is due to CDH, where I have seen similar behaviors before. At any rate, the code looked good, so I went ahead and committed it.

        Hudson added a comment -

        Integrated in Mahout-Quality #2054 (See https://builds.apache.org/job/Mahout-Quality/2054/)
        MAHOUT-958: fix use with globs, MAHOUT-944: minor tweak to driver.classes (Revision 1490793)

        Result = FAILURE
        gsingers :
        Files :

        • /mahout/trunk/CHANGELOG
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/clustering/evaluation/RepresentativePointsDriver.java
        • /mahout/trunk/src/conf/driver.classes.default.props

          People

          • Assignee: Dan Filimon
          • Reporter: Rares Vernica
          • Votes: 0
          • Watchers: 7