Apache Arrow / ARROW-16530

[Go] Serial read operations on columns, even when parallel = true

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 8.0.0
    • Fix Version/s: 9.0.0
    • Component/s: Go
    • Environment: Linux, golang 1.18, AMD64

    Description

      I have submitted a pull request with the changes.

       https://github.com/apache/arrow/pull/13120#issuecomment-1123982147

      In pqarrow, column readers for columns and struct members are obtained in a for loop that processes each column serially. Getting a reader triggers a read request, so these reads are always issued one after another, even when parallel = true. The logic that retrieves the next batch of records works the same way: a for loop iterates through the columns and issues the reads serially. The performance impact is especially large on high-latency storage such as cloud object stores.

      I'm working with complex Parquet files with 500+ "root" columns, where some fields are lists of structs. Some of these structs have hundreds of columns. In my tests, 800+ read operations are issued to GCS serially, which makes the current pqarrow implementation too slow to be usable.

      The proposed change processes the columns concurrently when retrieving child readers and column readers, and issues the batch requests concurrently as well.
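
      To illustrate the difference between the two access patterns, here is a minimal sketch. It is not the code from the pull request and does not use the pqarrow API: columnReader, openFunc, openSerial, and openConcurrent are hypothetical names, and the 20 ms sleep is an assumed stand-in for the latency of one storage request.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// columnReader is a hypothetical stand-in for a per-column reader.
type columnReader struct {
	index int
}

// openFunc is a hypothetical callback that builds the reader for one column.
// In the scenario described in this issue, each call triggers a read request.
type openFunc func(col int) (*columnReader, error)

// openSerial mirrors the serial pattern: one column, and one request, at a time.
func openSerial(numCols int, open openFunc) ([]*columnReader, error) {
	readers := make([]*columnReader, numCols)
	for i := 0; i < numCols; i++ {
		r, err := open(i)
		if err != nil {
			return nil, err
		}
		readers[i] = r
	}
	return readers, nil
}

// openConcurrent starts one goroutine per column so the requests overlap.
// Each goroutine writes only to its own slot, so column order is preserved
// and no mutex is needed.
func openConcurrent(numCols int, open openFunc) ([]*columnReader, error) {
	readers := make([]*columnReader, numCols)
	errs := make([]error, numCols)
	var wg sync.WaitGroup
	for i := 0; i < numCols; i++ {
		wg.Add(1)
		go func(col int) {
			defer wg.Done()
			readers[col], errs[col] = open(col)
		}(i)
	}
	wg.Wait()
	// Report the first error in column order, if any.
	for _, err := range errs {
		if err != nil {
			return nil, err
		}
	}
	return readers, nil
}

func main() {
	// Assumed latency per request; real cloud storage latency will vary.
	open := func(col int) (*columnReader, error) {
		time.Sleep(20 * time.Millisecond)
		return &columnReader{index: col}, nil
	}

	start := time.Now()
	if _, err := openSerial(50, open); err != nil {
		fmt.Println("serial error:", err)
		return
	}
	fmt.Println("serial:    ", time.Since(start))

	start = time.Now()
	if _, err := openConcurrent(50, open); err != nil {
		fmt.Println("concurrent error:", err)
		return
	}
	fmt.Println("concurrent:", time.Since(start))
}
```

      With the serial version the total time grows roughly with the number of columns multiplied by the per-request latency, while the concurrent version is bounded mostly by the slowest single request. A real implementation would likely also cap the number of in-flight requests.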

            People

              Assignee: Unassigned
              Reporter: Robert Purdom
              Votes: 0
              Watchers: 3

                Time Tracking

                  Estimated: 24h
                  Remaining: 20.5h
                  Logged: 3.5h