ARROW-2081: [Python] Hdfs client isn't fork-safe

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: C++, Python
    • Labels: None

      Description

      Given the following script:

      import multiprocessing as mp
      import pyarrow as pa
      
      
      def ls(h):
          # List /tmp via the HDFS client that was shipped to the worker
          print("calling ls")
          return h.ls("/tmp")
      
      
      if __name__ == '__main__':
          h = pa.hdfs.connect()
      
          # 'spawn' starts fresh interpreter processes; the client is
          # pickled and sent over to them
          print("Using 'spawn'")
          pool = mp.get_context('spawn').Pool(2)
          results = pool.map(ls, [h, h])
          sol = h.ls("/tmp")
          for r in results:
              assert r == sol
          print("'spawn' succeeded\n")
      
          # 'fork' clones this process, live libhdfs/JVM state included
          print("Using 'fork'")
          pool = mp.get_context('fork').Pool(2)
          results = pool.map(ls, [h, h])
          sol = h.ls("/tmp")
          for r in results:
              assert r == sol
          print("'fork' succeeded")
      
      Results in the following output:

      $ python test.py
      Using 'spawn'
      calling ls
      calling ls
      'spawn' succeeded
      
      Using 'fork'

      The process then hangs, and I have to `kill -9` the forked worker processes.

      I'm unable to get the libhdfs3 driver to work, so I'm unsure whether this is a problem with libhdfs itself or with Arrow's use of it (a quick Google search didn't turn up anything useful).

            People

            • Assignee: Unassigned
            • Reporter: Jim Crist (jim.crist)
            • Votes: 1
            • Watchers: 4
