[ARROW-11989] [C++][Python] Improve ChunkedArray's complexity for the access of elements - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.0.0
Fix Version/s: 8.0.0
Component/s: C++, Python
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/27821

Description

Chunked arrays are stored as a C++ vector of Arrays.

There is currently no indexing structure on top of the vector to allow for anything better than O(chunk) to access an arbitrary element.

For example, with a Table consisting of 1 column “text” defined by:

1024 chunks
each chunk is 1024 rows
each row is a text of 1024 characters

Then the time it takes to access one example are:

Time to access example at i=0%	: 6.7μs
Time to access example at i=10%	: 7.2μs
Time to access example at i=20%	: 9.1μs
Time to access example at i=30%	: 11.4μs
Time to access example at i=40%	: 13.8μs
Time to access example at i=50%	: 16.2μs
Time to access example at i=60%	: 18.7μs
Time to access example at i=70%	: 21.1μs
Time to access example at i=80%	: 26.8μs
Time to access example at i=90%	: 25.2μs

The time measured are the average times to do `table[“text”][j]` depending on the index we want to fetch (from the first example at 0% to the example at 90% of the length of the table).

You can take a look at the code that produces this benchmark here.

Some discussions in this thread on the mailing list suggested different approaches to improve the complexity:

use a contiguous array of chunk lengths, since having a contiguous array of lengths makes the iteration over the chunks lengths faster;
use a binary search, as in the Julia implementation here;
use interpolation search.

Apparently there is also a lookup structure in the compute layer here.

cc Micah Kornfield, Wes McKinney

Thanks again for the amazing work !

Attachments

Issue Links

links to

GitHub Pull Request #12055

Activity

People

Assignee:: Eduardo Ponce

Reporter:: quentin lhoest

Votes:: 1 Vote for this issue

Watchers:: 9 Start watching this issue

Dates

Created:: 16/Mar/21 18:27

Updated:: 11/Jan/23 08:23

Resolved:: 14/Apr/22 10:43

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

[C++][Python] Improve ChunkedArray's complexity for the access of elements