Apache Arrow / ARROW-2025

[Python/C++] HDFS Client disconnect closes all open clients

Details

    Description

      In the Python library, if an instance of `HadoopFileSystem` is garbage collected, all other existing instances become invalid. I haven't checked with a C++-only example, but from reading the Cython code I can't see how Cython is responsible, so I think this is a bug in the C++ library.

       

      >>> import pyarrow as pa
      >>> h = pa.hdfs.connect()
      18/01/24 16:54:25 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      18/01/24 16:54:26 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
      >>> h.ls("/")
      ['/benchmarks', '/hbase', '/tmp', '/user', '/var']
      >>> h2 = pa.hdfs.connect()
      >>> del h  # close one client
      >>> h2.ls("/")  # all filesystem operations now fail
      hdfsListDirectory(/): FileSystem#listStatus error:
      IOException: Filesystem closedjava.io.IOException: Filesystem closed
              at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:865)
              at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:2106)
              at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:2092)
              at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:743)
              at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:113)
              at org.apache.hadoop.hdfs.DistributedFileSystem$16.doCall(DistributedFileSystem.java:808)
              at org.apache.hadoop.hdfs.DistributedFileSystem$16.doCall(DistributedFileSystem.java:804)
              at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
              at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:804)
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/opt/conda/lib/python3.6/site-packages/pyarrow/hdfs.py", line 88, in ls
          return super(HadoopFileSystem, self).ls(path, detail)
        File "io-hdfs.pxi", line 248, in pyarrow.lib.HadoopFileSystem.ls
        File "error.pxi", line 79, in pyarrow.lib.check_status
      pyarrow.lib.ArrowIOError: HDFS: list directory failed
      >>> h2.is_open  # The python object still thinks it's open
      True
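
      The transcript above is consistent with both clients sharing one underlying connection handle, although this report does not confirm that as the cause. The following is a minimal pure-Python sketch of that assumed mechanism. It does not use pyarrow or HDFS, and every name in it (`SharedHandle`, `Client`, `connect`, `_HANDLES`) is hypothetical. An explicit `close()` stands in for what happens when an instance is garbage collected.

```python
class SharedHandle:
    """Hypothetical stand-in for a low-level filesystem handle."""

    def __init__(self):
        self.closed = False

    def ls(self, path):
        if self.closed:
            raise IOError("Filesystem closed")
        return ["/tmp", "/user"]


# Assumption: handles are cached per host, so every client connecting to the
# same host receives the same handle object.
_HANDLES = {}


class Client:
    """Toy wrapper that tracks its own is_open flag, separate from the handle."""

    def __init__(self, handle):
        self._handle = handle
        self.is_open = True

    def ls(self, path):
        return self._handle.ls(path)

    def close(self):
        if self.is_open:
            # Closing the shared handle invalidates every other client too.
            self._handle.closed = True
            self.is_open = False


def connect(host="default"):
    """Return a new Client backed by the cached handle for this host."""
    handle = _HANDLES.setdefault(host, SharedHandle())
    return Client(handle)


if __name__ == "__main__":
    h = connect()
    h2 = connect()
    h.close()  # close one client
    try:
        h2.ls("/")
    except IOError as exc:
        print("h2.ls failed:", exc)
    print("h2.is_open:", h2.is_open)
```

      In this model the second client fails after the first is closed while its own `is_open` flag stays `True`, matching the symptoms reported here. If the assumption holds, giving each client its own handle, or reference counting the shared one, would avoid the problem.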
      

            People

              jim.crist Jim Crist