Apache Arrow / ARROW-11596

[Python][Dataset] SIGSEGV when executing scan tasks with Python executors


Details

    Description

      This crashes for me with a segfault:

      import concurrent.futures
      import queue
      
      import numpy as np
      import pyarrow as pa
      import pyarrow.dataset as ds
      import pyarrow.fs as fs
      import pyarrow.parquet as pq
      
      
      schema = pa.schema([("foo", pa.float64())])
      table = pa.table([np.random.uniform(size=1024)], schema=schema)
      path = "/tmp/foo.parquet"
      pq.write_table(table, path)
      dataset = ds.FileSystemDataset.from_paths(
          [path],
          schema=schema,
          format=ds.ParquetFileFormat(),
          filesystem=fs.LocalFileSystem(),
      )
      
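      with concurrent.futures.ThreadPoolExecutor(2) as executor:
          tasks = dataset.scan()
          q = queue.Queue()
      
          def _prebuffer():
              # Runs on a worker thread: start each scan task, pull its
              # first batch, then hand the live iterator to the main thread.
              for task in tasks:
                  iterator = task.execute()
                  next(iterator)
                  q.put(iterator)
      
          executor.submit(_prebuffer).result()
          # Resume the iterator on the main thread.
          next(q.get())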
      
      $ uname -a
      Linux chaconne 5.10.4-arch2-1 #1 SMP PREEMPT Fri, 01 Jan 2021 05:29:53 +0000 x86_64 GNU/Linux
      $ pip freeze
      numpy==1.20.1
      pyarrow==3.0.0
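
To collect more detail about a crash like this, one option is to run the reproducer in a child interpreter with `faulthandler` enabled, so the parent process survives and can inspect the exit status and the dumped traceback. The sketch below is not part of the original report: `run_isolated` and `died_from_segfault` are hypothetical helper names, and the check on a negative return code assumes a POSIX system such as the Linux machine above.

```python
import signal
import subprocess
import sys


def run_isolated(source: str, timeout: float = 60.0):
    """Run `source` in a child interpreter with faulthandler enabled.

    Hypothetical helper, not part of pyarrow. Returns (returncode, stderr).
    On POSIX, a negative returncode -N means the child was killed by signal N.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", source],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stderr


def died_from_segfault(returncode: int) -> bool:
    """Return True if the child process was terminated by SIGSEGV (POSIX)."""
    return returncode == -signal.SIGSEGV
```

Passing the text of the reproducer script to `run_isolated` and printing the returned stderr should show which Python frame was active when the signal arrived, assuming the crash happens while that thread is inside a Python call.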
      

          People

            lidavidm David Li
            Votes: 0
            Watchers: 2

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 1h 40m
