Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-11989

[C++][Python] Improve ChunkedArray's complexity for the access of elements

Details

    Description

      Chunked arrays are stored as a C++ vector of Arrays.

      There is currently no indexing structure on top of the vector to allow for anything better than O(chunk) to access an arbitrary element.

      For example, with a Table consisting of 1 column “text” defined by:

      • 1024 chunks
      • each chunk is 1024 rows
      • each row is a text of 1024 characters

      Then the time it takes to access one example are:

      Time to access example at i=0%	: 6.7μs
      Time to access example at i=10%	: 7.2μs
      Time to access example at i=20%	: 9.1μs
      Time to access example at i=30%	: 11.4μs
      Time to access example at i=40%	: 13.8μs
      Time to access example at i=50%	: 16.2μs
      Time to access example at i=60%	: 18.7μs
      Time to access example at i=70%	: 21.1μs
      Time to access example at i=80%	: 26.8μs
      Time to access example at i=90%	: 25.2μs
      

      The time measured are the average times to do `table[“text”][j]` depending on the index we want to fetch (from the first example at 0% to the example at 90% of the length of the table).

      You can take a look at the code that produces this benchmark here.

      Some discussions in this thread on the mailing list suggested different approaches to improve the complexity:

      • use a contiguous array of chunk lengths, since having a contiguous array of lengths makes the iteration over the chunks lengths faster;
      • use a binary search, as in the Julia implementation here;
      • use interpolation search.

      Apparently there is also a lookup structure in the compute layer here.

      cc Micah Kornfield, Wes McKinney

      Thanks again for the amazing work !

      Attachments

        Issue Links

          Activity

            People

              edponce Eduardo Ponce
              lhoestq quentin lhoest
              Votes:
              1 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 14h 20m
                  14h 20m

                  Slack

                    Issue deployment