Apache Arrow / ARROW-14965

[Python][C++] Contention when reading Parquet files with multi-threading


Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 6.0.0
    • Fix Version/s: None
    • Component/s: C++, Python
    • Labels: None

    Description

      I'm attempting to read a table from multiple Parquet files where I already know which row_groups I want to read from each file. I also want to apply a filter expression while reading. To do this, my code looks roughly like this:

       

      from concurrent.futures import ThreadPoolExecutor

      import pyarrow as pa
      import pyarrow.dataset as ds

      def read_file(filepath):
          format = ds.ParquetFileFormat(...)
          # Only the row groups I already know I need from this file.
          fragment = format.make_fragment(filepath, row_groups=[0, 1, 2, ...])
          scanner = ds.Scanner.from_fragment(
              fragment,
              use_threads=True,
              use_async=False,
              filter=...,
          )
          return scanner.to_reader().read_all()

      with ThreadPoolExecutor() as pool:
          table = pa.concat_tables(pool.map(read_file, file_paths))

      When run with a ProcessPoolExecutor, each of my 13 read_file calls takes at most 2 seconds. With a ThreadPoolExecutor, however, some of the read_file calls take 20+ seconds.
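      For reference, the process-based comparison is essentially the same code with the executor swapped out. A minimal sketch (assuming read_file is defined at module level so it can be pickled, and file_paths is the same list of 13 files):

      from concurrent.futures import ProcessPoolExecutor

      import pyarrow as pa

      # Each file is read in a separate worker process, so the workers don't
      # share the GIL or Arrow's internal thread pools with each other.
      with ProcessPoolExecutor() as pool:
          table = pa.concat_tables(pool.map(read_file, file_paths))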

       

      I've tried running this with various combinations of use_threads and use_async to see what's happening. The stack traces below were captured with py-spy, and the contention was identified with viztracer.
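      For the viztracer side, the tracing can be wrapped around the threaded run roughly like this (a sketch rather than the exact invocation; the output file name is arbitrary). py-spy samples the process from the outside, so the stack traces below needed no changes to the script:

      from concurrent.futures import ThreadPoolExecutor

      import pyarrow as pa
      from viztracer import VizTracer

      # Reusing read_file and file_paths from the snippet above. The trace
      # gives a per-thread timeline, so calls that spend their time blocked
      # (e.g. read_all) stand out from calls doing actual decoding work.
      with VizTracer(output_file="threaded_read.json"):
          with ThreadPoolExecutor() as pool:
              pa.concat_tables(pool.map(read_file, file_paths))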

       

      use_threads: False, use_async: False

       

      from_fragment (pyarrow/_dataset.cpython-37m-x86_64-linux-gnu.so)
      __pyx_getprop_7pyarrow_8_dataset_8Fragment_physical_schema (pyarrow/_dataset.cpython-37m-x86_64-linux-gnu.so)
      __pthread_cond_timedwait (libpthread-2.17.so) 

       

      use_threads: False, use_async: True

      • There's no longer any contention for pyarrow._dataset.from_fragment
      • But there's lots of contention for pyarrow.lib.RecordBatchReader.read_all

       

      arrow::RecordBatchReader::ReadAll (pyarrow/libarrow.so.600)
      arrow::dataset::(anonymous namespace)::ScannerRecordBatchReader::ReadNext (pyarrow/libarrow_dataset.so.600)
      arrow::Iterator<arrow::dataset::TaggedRecordBatch>::Next<arrow::GeneratorIterator<arrow::dataset::TaggedRecordBatch> > (pyarrow/libarrow_dataset.so.600)
      arrow::FutureImpl::Wait (pyarrow/libarrow.so.600) 
      std::condition_variable::wait (libstdc++.so.6.0.19)

      use_threads: True, use_async: False

      • Appears to be some contention on Scanner.to_reader
      • But most contention remains for RecordBatchReader.read_all

      arrow::RecordBatchReader::ReadAll (pyarrow/libarrow.so.600)
      arrow::dataset::(anonymous namespace)::ScannerRecordBatchReader::ReadNext (pyarrow/libarrow_dataset.so.600)
      arrow::Iterator<arrow::dataset::TaggedRecordBatch>::Next<arrow::FunctionIterator<arrow::dataset::(anonymous namespace)::SyncScanner::ScanBatches(arrow::Iterator<std::shared_ptr<arrow::dataset::ScanTask> >)::{lambda()#1}, arrow::dataset::TaggedRecordBatch> > (pyarrow/libarrow_dataset.so.600)
      std::condition_variable::wait (libstdc++.so.6.0.19)
      __pthread_cond_wait (libpthread-2.17.so) 

      use_threads: True, use_async: True

      • Contention again mostly for RecordBatchReader.read_all, but seems to complete in ~12 seconds rather than 20

      arrow::RecordBatchReader::ReadAll (pyarrow/libarrow.so.600)
      arrow::dataset::(anonymous namespace)::ScannerRecordBatchReader::ReadNext (pyarrow/libarrow_dataset.so.600)
      arrow::Iterator<arrow::dataset::TaggedRecordBatch>::Next<arrow::GeneratorIterator<arrow::dataset::TaggedRecordBatch> > (pyarrow/libarrow_dataset.so.600)
      arrow::FutureImpl::Wait (pyarrow/libarrow.so.600)
      std::condition_variable::wait (libstdc++.so.6.0.19)
      __pthread_cond_wait (libpthread-2.17.so) 

      Is this expected behaviour? Or should it be possible to achieve the same performance from multi-threading as from multi-processing?

       

       


          People

            Assignee: Unassigned
            Reporter: Nick Gates (gatesn)
