Apache Arrow / ARROW-9458

[Python] Dataset Scanner is single-threaded only


Details

    Description

      I'm not sure whether this is a misunderstanding on my part, a compilation issue (flags?), or an issue in the C++ layer.

      I have 1000 parquet files with a total of 1 billion rows (1 million rows per file, ~20 columns). I wanted to see if I could go through all rows of 1 or 2 columns efficiently (the vaex use case).

       

      import glob
      import pyarrow.parquet
      import pyarrow as pa
      import pyarrow.dataset as ds
      # Name the dataset object `dataset` so it does not shadow the `ds` module alias.
      dataset = ds.dataset(glob.glob('/data/taxi_parquet/data_*.parquet'))
      scanned = 0
      # Each scan task yields record batches; count the rows in every batch.
      for scan_task in dataset.scan(batch_size=1_000_000, columns=['passenger_count'], use_threads=True):
          for record_batch in scan_task.execute():
              scanned += record_batch.num_rows
      scanned
      

      This seems to use only one CPU core.

      Using a threadpool from Python:

      # %%timeit
      import concurrent.futures
      pool = concurrent.futures.ThreadPoolExecutor()
      dataset = pa.dataset.dataset(glob.glob('/data/taxi_parquet/data_*.parquet'))
      def process(scan_task):
          # Count the rows produced by a single scan task.
          scan_count = 0
          for record_batch in scan_task.execute():
              scan_count += len(record_batch)
          return scan_count
      # The scan tasks run on the Python thread pool, so use_threads is False here.
      sum(pool.map(process, dataset.scan(batch_size=1_000_000, columns=['passenger_count'], use_threads=False)))
      

      This gives me similar performance: again only 100% CPU usage (i.e. one core).
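      For reference, one way to put a number on the "only 100% CPU" observation is to compare the CPU time of the process with the elapsed wall time. The following is a minimal sketch using only the standard library; it is not part of the Arrow API, and `cpu_utilization`, `busy` and `run_in_pool` are hypothetical helper names, with `busy` standing in for the real scan work.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def cpu_utilization(fn, *args):
    """Run fn(*args) and return (result, cpu_seconds / wall_seconds).

    time.process_time() sums CPU time over all threads of the process,
    so a ratio near 1.0 means roughly one busy core and a ratio near N
    means roughly N busy cores.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = fn(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    # Guard against a zero wall time on very coarse clocks.
    return result, cpu / max(wall, 1e-9)


def busy(n):
    # Hypothetical stand-in workload: a pure-Python loop that holds the GIL.
    total = 0
    for i in range(n):
        total += i
    return total


def run_in_pool(n_tasks, n):
    # Same shape as the thread-pool scan: map a function over the tasks.
    with ThreadPoolExecutor() as pool:
        return sum(pool.map(busy, [n] * n_tasks))


if __name__ == "__main__":
    result, ratio = cpu_utilization(run_in_pool, 4, 500_000)
    print(f"result={result}, cpu/wall ratio={ratio:.2f}")
```

      Wrapping either scan loop in `cpu_utilization` instead of `run_in_pool` should then show whether the ratio stays around 1.0 (one core) or climbs towards the number of cores.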

      py-spy (a profiler for Python) shows that the GIL is not being held, so this might be something at the C++ layer.

      Am I 'holding it wrong', or could this be a bug? Note that IO speed is not a problem on this system (everything actually comes from the OS cache; no disk reads were observed).

       

      Attachments

        1. image-2020-07-14-14-31-29-943.png
          178 kB
          Maarten Breddels
        2. image-2020-07-14-14-38-16-767.png
          190 kB
          Maarten Breddels

            People

              maartenbreddels Maarten Breddels