Details
-
Improvement
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
3.0.0
Description
Chunked arrays are stored as a C++ vector of Arrays.
There is currently no indexing structure on top of the vector to allow for anything better than O(chunk) to access an arbitrary element.
For example, with a Table consisting of 1 column “text” defined by:
- 1024 chunks
- each chunk is 1024 rows
- each row is a text of 1024 characters
Then the time it takes to access one example are:
Time to access example at i=0% : 6.7μs Time to access example at i=10% : 7.2μs Time to access example at i=20% : 9.1μs Time to access example at i=30% : 11.4μs Time to access example at i=40% : 13.8μs Time to access example at i=50% : 16.2μs Time to access example at i=60% : 18.7μs Time to access example at i=70% : 21.1μs Time to access example at i=80% : 26.8μs Time to access example at i=90% : 25.2μs
The time measured are the average times to do `table[“text”][j]` depending on the index we want to fetch (from the first example at 0% to the example at 90% of the length of the table).
You can take a look at the code that produces this benchmark here.
Some discussions in this thread on the mailing list suggested different approaches to improve the complexity:
- use a contiguous array of chunk lengths, since having a contiguous array of lengths makes the iteration over the chunks lengths faster;
- use a binary search, as in the Julia implementation here;
- use interpolation search.
Apparently there is also a lookup structure in the compute layer here.
cc Micah Kornfield, Wes McKinney
Thanks again for the amazing work !
Attachments
Issue Links
- links to