Apache Arrow / ARROW-11596

[Python][Dataset] SIGSEGV when executing scan tasks with Python executors

Details

    Description

      This crashes for me with a segfault:

      import concurrent.futures
      import queue
      
      import numpy as np
      import pyarrow as pa
      import pyarrow.dataset as ds
      import pyarrow.fs as fs
      import pyarrow.parquet as pq
      
      
      schema = pa.schema([("foo", pa.float64())])
      table = pa.table([np.random.uniform(size=1024)], schema=schema)
      path = "/tmp/foo.parquet"
      pq.write_table(table, path)
      dataset = ds.FileSystemDataset.from_paths(
          [path],
          schema=schema,
          format=ds.ParquetFileFormat(),
          filesystem=fs.LocalFileSystem(),
      )
      
      with concurrent.futures.ThreadPoolExecutor(2) as executor:
          tasks = dataset.scan()
          q = queue.Queue()
      
          def _prebuffer():
              for task in tasks:
                  iterator = task.execute()
                  next(iterator)
                  q.put(iterator)
      
          executor.submit(_prebuffer).result()
          next(q.get())
      
      $ uname -a
      Linux chaconne 5.10.4-arch2-1 #1 SMP PREEMPT Fri, 01 Jan 2021 05:29:53 +0000 x86_64 GNU/Linux
      $ pip freeze
      numpy==1.20.1
      pyarrow==3.0.0
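
For comparison, here is the same hand-off pattern reduced to the standard library. It is only a sketch: `fake_scan_task` and `prebuffer` are hypothetical stand-ins, not pyarrow APIs, and plain generators take the place of scan tasks. Each iterator is started and advanced once on a worker thread, passed through a queue, and then advanced further on the main thread. With pure-Python generators this works, which suggests (but does not prove) that the crash comes from the native pyarrow side rather than from the threading pattern itself.

```python
import concurrent.futures
import queue


def fake_scan_task(task_id, n_batches=3):
    """Hypothetical stand-in for a scan task: yields labelled 'batches'."""
    for i in range(n_batches):
        yield (task_id, i)


def prebuffer(tasks, q):
    # Runs on the worker thread: start each iterator, consume its first
    # item there, then hand the partially consumed iterator to the queue.
    for task in tasks:
        iterator = iter(task)
        first = next(iterator)
        q.put((first, iterator))


q = queue.Queue()
tasks = [fake_scan_task(0), fake_scan_task(1)]

with concurrent.futures.ThreadPoolExecutor(2) as executor:
    # .result() blocks until the worker has primed every iterator and
    # re-raises any exception raised inside prebuffer.
    executor.submit(prebuffer, tasks, q).result()

# Back on the main thread: keep advancing the iterators the worker started.
results = []
while not q.empty():
    first, iterator = q.get()
    results.append([first, *iterator])
```

Here the iterators are never advanced from two threads at the same time; ownership simply moves from the worker to the main thread, as in the reproduction above.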
      

            People

              lidavidm David Li

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 1h 40m