MAHOUT-800

bin/mahout attempts cluster mode if HADOOP_CONF_DIR is set plausibly (and hence appended to classpath), even with MAHOUT_LOCAL set and no HADOOP_HOME


Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.6
    • Component/s: classic
    • Labels: None
    • Environment: OSX; java version "1.6.0_26"

    Description

      (This began as a build-reuters.sh bug report, but the problem seemed deeper; please excuse the narrative format here)

      Summary: both examples/bin/build-reuters.sh and bin/mahout will attempt cluster mode if the HADOOP_CONF_DIR env variable points at a Hadoop conf/ directory, because bin/mahout appends it to Java's classpath. This seems to trigger something in Mahout's Java code that tries to use the cluster, without this being explicitly requested.
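      For reference, the append in bin/mahout looks roughly like the line below (paraphrased; the exact surrounding script lines in trunk may differ). Once the conf directory is on the classpath, Hadoop's Configuration picks up core-site.xml as a classpath resource, which is what points jobs at the cluster:

      # in bin/mahout (paraphrased): the conf dir always lands on the classpath,
      # so core-site.xml there becomes visible to Hadoop's Configuration loader
      CLASSPATH=${CLASSPATH}:$HADOOP_CONF_DIR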

      There have been reports (Jeff Eastman, myself; http://mail-archives.apache.org/mod_mbox/mahout-user/201108.mbox/%3CCAFNgM+Y4twNVL_RSyNb+hGhoAu0xW917YfUTW3a5-m=Z0dynDA@mail.gmail.com%3E ) of build-reuters.sh attempting cluster mode even while claiming "MAHOUT_LOCAL is set, running locally" (or, under slightly different conditions, "no HADOOP_HOME set, running locally").

      Experimenting with a fresh trunk install and a clean ~/.m2/ on a laptop that has a pseudo-cluster Hadoop configuration available, I find HADOOP_CONF_DIR seems to be the key.

      When HADOOP_CONF_DIR is set to a working value (regardless of whether the cluster is running), and regardless of HADOOP_HOME and MAHOUT_LOCAL, build-reuters.sh tries to use the cluster. Aside: this is not the same as using Hadoop's non-clustered local mode, since I see errors such as "11/09/02 09:27:10 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 1 time(s)." unless the cluster is up. If the cluster is up and accessible, I see java.io.IOException instead, presumably because the expected files aren't there.

      If I do 'export HADOOP_CONF_DIR=' then build-reuters.sh (both kmeans and lda modes) runs OK without real Hadoop.

      If I retry with a bogus value for HADOOP_CONF_DIR, e.g. /foo, this also seems fine. Only when it finds a real Hadoop installation does it get confused.
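      In other words, the three cases look like this (the "real" path is just my local install; substitute your own):

      # unset: runs locally, OK
      export HADOOP_CONF_DIR=

      # bogus path: still runs locally, OK
      export HADOOP_CONF_DIR=/foo

      # real Hadoop conf directory: attempts cluster mode
      export HADOOP_CONF_DIR=$HOME/working/hadoop/hadoop-0.20.2/conf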

      Minimally I'd consider this a documentation issue: nothing in the build-reuters.sh script mentions the role of HADOOP_CONF_DIR. Reading build-reuters.sh I get the impression both clustered and local modes are possible; however, mailing list discussion leaves me unsure whether clustered mode is still supposed to work in trunk.

      Tests: (with no HADOOP_HOME set)

      Running these extracts from build-reuters.sh in examples/bin/ after having previously run build-reuters.sh to fetch data...

      #this one runs OK
      MAHOUT_LOCAL=true HADOOP_CONF_DIR=/foo ../../bin/mahout seqdirectory \
      -i mahout-work/reuters-out -o mahout-work/reuters-out-seqdir -c UTF-8 -chunk 5

      #this fails (assuming there's a Hadoop there) by attempting clustered mode: 'Call to localhost/127.0.0.1:9000 failed...'

      MAHOUT_LOCAL=true HADOOP_CONF_DIR=$HOME/working/hadoop/hadoop-0.20.2/conf ../../bin/mahout seqdirectory \
      -i mahout-work/reuters-out -o mahout-work/reuters-out-seqdir -c UTF-8 -chunk 5

      Same thing with seq2sparse

      #fails, localhost:9000
      HADOOP_CONF_DIR=$HOME/working/hadoop/hadoop-0.20.2/conf MAHOUT_LOCAL=true ../../bin/mahout seq2sparse \
      -i mahout-work/reuters-out-seqdir/ -o mahout-work/reuters-out-seqdir-sparse-kmeans

      #runs locally just fine (because of bad hadoop conf path)
      HADOOP_CONF_DIR=$HOME/bad/path/working/hadoop/hadoop-0.20.2/conf MAHOUT_LOCAL=true ../../bin/mahout seq2sparse \
      -i mahout-work/reuters-out-seqdir/ -o mahout-work/reuters-out-seqdir-sparse-kmeans

      I get the same behaviour from '../../bin/mahout kmeans' too, so the problem seems general, not driver-specific.

      All this seems to contradict the notes in ../../bin/mahout, i.e.

      MAHOUT_LOCAL   set to anything other than an empty string to force
                     mahout to run locally even if
                     HADOOP_CONF_DIR and HADOOP_HOME are set

      Digging into bin/mahout, it seems the accidental clustering happens deeper down in Java-land, not in the .sh; the script isn't invoking hadoop directly. We get this far:

      exec "$JAVA" $JAVA_HEAP_MAX $MAHOUT_OPTS -classpath "$CLASSPATH" $CLASS "$@"

      I compared the Java command lines generated by successful vs. accidentally-cluster-invoking runs of bin/mahout; it seems the only difference is whether a Hadoop conf directory is on the classpath passed to Java.
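      (For anyone reproducing this: a quick, purely diagnostic way to capture those command lines is to temporarily swap the exec for an echo in bin/mahout:)

      # debugging tweak, not part of any patch: print the java invocation
      # instead of running it, so the classpath difference is visible
      echo "$JAVA" $JAVA_HEAP_MAX $MAHOUT_OPTS -classpath "$CLASSPATH" $CLASS "$@"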

      If I blank out both variables with 'HADOOP_CONF_DIR=' and 'HADOOP_HOME=' and then run

      MAHOUT_LOCAL=true ../../bin/mahout kmeans \
      -i mahout-work/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ \
      -c mahout-work/reuters-kmeans-clusters \
      -o mahout-work/reuters-kmeans \
      -x 10 -k 20 -ow

      ...against an edited version of bin/mahout that appends a hadoop conf dir to the classpath, i.e.

      exec "$JAVA" $JAVA_HEAP_MAX $MAHOUT_OPTS -classpath "$CLASSPATH:/Users/danbri/working/hadoop/hadoop-0.20.2/conf" $CLASS "$@"

      This is enough to get "Exception in thread "main" java.io.IOException: Call to localhost/127.0.0.1:9000 failed on local exception: java.io.EOFException"

      (...and if I remove the /conf path from the classpath, we're back to expected behaviour).

      Not sure whether it's best to patch this in bin/mahout, or in the Java (perhaps the former might mask issues that'll cause later confusion?)

      Perhaps only do

      CLASSPATH=${CLASSPATH}:$HADOOP_CONF_DIR

      if MAHOUT_LOCAL is not set?
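      Concretely, a guard along these lines in bin/mahout might do it (a minimal, untested sketch; it treats an empty MAHOUT_LOCAL as "not set", matching the script's own convention):

      # only expose the Hadoop conf directory to Java when local mode
      # is not being forced; empty MAHOUT_LOCAL means "not set"
      if [ -z "$MAHOUT_LOCAL" ] && [ -n "$HADOOP_CONF_DIR" ]; then
        CLASSPATH=${CLASSPATH}:$HADOOP_CONF_DIR
      fi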

      Attachments

        1. MAHOUT-800.patch (0.7 kB, Dan Brickley)


          People

            Assignee: Sean R. Owen (srowen)
            Reporter: Dan Brickley (danbri)
