Apache Arrow / ARROW-15410

[C++][Datasets] Improve memory usage of datasets API when scanning parquet


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 8.0.0
    • Component/s: C++

    Description

      This is a more targeted fix to improve memory usage when scanning parquet files. It is related to broader issues like ARROW-14648, but those will likely take longer to fix. The goal here is to make it possible to scan large parquet datasets with many files, where each file has reasonably sized row groups (e.g. 1 million rows). Currently we run out of memory scanning a configuration as simple as:

      21 parquet files
      Each parquet file has 10 million rows split into row groups of size 1 million
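
      For context, a scan over such a dataset is typically issued through the C++ Datasets API along the lines of the sketch below. This is a minimal illustration only, not code from this issue: the path /data/parquet_dataset is a placeholder for a layout like the one described above, and the incremental RecordBatchReader loop is just one common way to consume the results.

      // Minimal sketch: scan a directory of parquet files with the Datasets API.
      // The dataset path is a placeholder for a layout like the one described
      // above (21 files, 10 million rows each, 1-million-row row groups).
      #include <iostream>
      #include <memory>

      #include <arrow/api.h>
      #include <arrow/dataset/api.h>
      #include <arrow/filesystem/api.h>

      namespace ds = arrow::dataset;
      namespace fs = arrow::fs;

      arrow::Status ScanDataset() {
        // Discover every parquet file under the base directory.
        auto filesystem = std::make_shared<fs::LocalFileSystem>();
        fs::FileSelector selector;
        selector.base_dir = "/data/parquet_dataset";  // placeholder path
        selector.recursive = true;

        auto format = std::make_shared<ds::ParquetFileFormat>();
        ARROW_ASSIGN_OR_RAISE(
            auto factory,
            ds::FileSystemDatasetFactory::Make(filesystem, selector, format,
                                               ds::FileSystemFactoryOptions{}));
        ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());

        // Build and run the scan.
        ARROW_ASSIGN_OR_RAISE(auto scanner_builder, dataset->NewScan());
        ARROW_RETURN_NOT_OK(scanner_builder->UseThreads(true));
        ARROW_ASSIGN_OR_RAISE(auto scanner, scanner_builder->Finish());

        // Consume the scan incrementally rather than materializing a full table.
        ARROW_ASSIGN_OR_RAISE(auto reader, scanner->ToRecordBatchReader());
        int64_t total_rows = 0;
        std::shared_ptr<arrow::RecordBatch> batch;
        while (true) {
          ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
          if (!batch) break;
          total_rows += batch->num_rows();
        }
        std::cout << "Scanned " << total_rows << " rows" << std::endl;
        return arrow::Status::OK();
      }

      int main() {
        arrow::Status st = ScanDataset();
        if (!st.ok()) {
          std::cerr << st.ToString() << std::endl;
          return 1;
        }
        return 0;
      }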

            People

              westonpace Weston Pace
              westonpace Weston Pace
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

    Time Tracking

      Estimated: Not Specified
      Remaining: 0h
      Logged: 5h 10m