Apache Arrow
ARROW-2113

[Python] Incomplete CLASSPATH with "hadoop" contained in it can fool the classpath setting HDFS logic


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 0.8.0
    • Fix Version: 0.12.0
    • Component: Python
    • Environment: Linux Red Hat 7.4, Anaconda 4.4.7, Python 2.7.12, CDH 5.13.1

    Description

      Steps to replicate the issue:

      mkdir /tmp/test
      cd /tmp/test
      mkdir jars
      cd jars
      touch test1.jar
      mkdir -p ../lib/zookeeper
      cd ../lib/zookeeper
      ln -s ../../jars/test1.jar ./test1.jar
      ln -s test1.jar test.jar
      mkdir -p ../hadoop/lib
      cd ../hadoop/lib
      ln -s ../../../lib/zookeeper/test.jar ./test.jar
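
      (Both test.jar paths created above resolve to the same physical file; this can be confirmed from Python, for example:)

      import os
      # both symlink chains end at the same physical jar
      os.path.realpath('/tmp/test/lib/hadoop/lib/test.jar')   # '/tmp/test/jars/test1.jar'
      os.path.realpath('/tmp/test/lib/zookeeper/test.jar')    # '/tmp/test/jars/test1.jar'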

      (This part depends on your configuration; pyarrow.hdfs needs these values to work.)

      (Path to libjvm:)

      export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera

      (Path to libhdfs:)

      export LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/lib64/

      export CLASSPATH="/tmp/test/lib/hadoop/lib/test.jar"

      python
      import pyarrow.hdfs as hdfs;
      fs = hdfs.connect(user="hdfs")


      Ends with error:

      ------------
      loadFileSystems error:
      (unable to get root cause for java.lang.NoClassDefFoundError)
      (unable to get stack trace for java.lang.NoClassDefFoundError)
      hdfsBuilderConnect(forceNewInstance=0, nn=default, port=0, kerbTicketCachePath=(NULL), userName=pa) error:
      (unable to get root cause for java.lang.NoClassDefFoundError)
      (unable to get stack trace for java.lang.NoClassDefFoundError)
      Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 170, in connect
      kerb_ticket=kerb_ticket, driver=driver)
      File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 37, in _init_
      self._connect(host, port, user, kerb_ticket, driver)
      File "pyarrow/io-hdfs.pxi", line 87, in pyarrow.lib.HadoopFileSystem._connect (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:61673)
      File "pyarrow/error.pxi", line 79, in pyarrow.lib.check_status (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:8345)
      pyarrow.lib.ArrowIOError: HDFS connection failed
      -------------


      export CLASSPATH="/tmp/test/lib/zookeeper/test.jar"
      python
      import pyarrow.hdfs as hdfs;
      fs = hdfs.connect(user="hdfs")


      Works properly.
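
      One observable difference between the two CLASSPATH values (cf. the issue title): only the failing one contains the literal substring "hadoop":

      # the failing path contains "hadoop"; the working one does not
      'hadoop' in "/tmp/test/lib/hadoop/lib/test.jar"    # True
      'hadoop' in "/tmp/test/lib/zookeeper/test.jar"     # False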


      I can't find a reason why the first CLASSPATH doesn't work while the second one does, because both are paths to the same .jar, just with an extra symlink in the chain. To me, it looks like pyarrow.lib.check_status has a problem with symlinks defined through multiple ../ components.

      I would expect pyarrow to work with any definition of the path to the .jar.
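
      As the issue title suggests, the trigger appears to be that substring check rather than the symlinks themselves. Below is a minimal sketch of the kind of detection logic involved (illustrative only; the exact code in pyarrow/hdfs.py may differ): if "hadoop" already appears anywhere in CLASSPATH, pyarrow assumes the classpath is complete and skips asking the hadoop CLI for the full one, so an incomplete CLASSPATH that merely has "hadoop" in a directory name is taken at face value and the Hadoop classes are never found.

      import os
      import subprocess

      def _maybe_set_hadoop_classpath():
          # Assumption: any CLASSPATH already containing the string "hadoop"
          # is treated as complete and left untouched.
          if 'hadoop' in os.environ.get('CLASSPATH', ''):
              return
          # Otherwise the full glob classpath is obtained from the hadoop CLI.
          classpath = subprocess.check_output(['hadoop', 'classpath', '--glob'])
          os.environ['CLASSPATH'] = classpath.decode('utf-8')

      Under that assumption, "/tmp/test/lib/hadoop/lib/test.jar" trips the check and short-circuits the setup, while "/tmp/test/lib/zookeeper/test.jar" does not, which matches the behavior above.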

      Please note that the paths are not made up at random; they are copied from the Cloudera distribution of Hadoop (the original file was zookeeper.jar).

      Because of this issue, our customer currently can't use the pyarrow library in Oozie workflows.


            People

              Assignee: Andrew Harris (andharris)
              Reporter: Michal Danko (michal.danko)


                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 1h 50m