  1. Apache Arrow
  2. ARROW-39

C++: Logical chunked arrays / columns: conforming to fixed chunk sizes


Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: 0.3.0
    • Component/s: C++
    • Labels: None

    Description

      Implementing algorithms on large arrays assembled in physical chunks is problematic if:

      • The chunks are not all the same size (except possibly the last chunk, which may be smaller). With irregular chunk sizes, retrieving a particular element is in general an O(log num_chunks) operation, since the containing chunk has to be found by searching the chunk offsets
      • The chunk size is not a power of 2. Computing an integer modulus by a value that is not a power of 2 requires more clock cycles (in other words, i % p is much more expensive to compute than i & (p - 1), but the latter only works if p is a power of 2)
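
To make the cost difference concrete, here is a minimal sketch of the three lookup strategies. The names `ChunkLocation`, `LocateIrregular`, `LocateRegular`, and `LocatePow2` are hypothetical and are not part of the Arrow C++ API; the offsets vector is an assumed representation of cumulative chunk boundaries.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type: which chunk holds element i, and where inside it.
struct ChunkLocation {
  int64_t chunk_index;
  int64_t index_in_chunk;
};

// Irregular chunks: binary search over cumulative offsets, O(log num_chunks).
// offsets[k] is the logical index of the first element of chunk k, and
// offsets.back() is the total length (offsets.size() == num_chunks + 1).
inline ChunkLocation LocateIrregular(const std::vector<int64_t>& offsets, int64_t i) {
  assert(offsets.size() >= 2 && i >= 0 && i < offsets.back());
  // Find the first offset strictly greater than i; the chunk before it holds i.
  auto it = std::upper_bound(offsets.begin(), offsets.end(), i);
  const int64_t chunk = static_cast<int64_t>(it - offsets.begin()) - 1;
  return {chunk, i - offsets[static_cast<std::size_t>(chunk)]};
}

// Regular chunks of arbitrary size: O(1), but needs integer division and modulus.
inline ChunkLocation LocateRegular(int64_t chunk_size, int64_t i) {
  assert(chunk_size > 0 && i >= 0);
  return {i / chunk_size, i % chunk_size};
}

// Regular power-of-2 chunks: O(1) using only a shift and a mask.
// log2_chunk_size == 16 corresponds to 64K values per chunk.
inline ChunkLocation LocatePow2(int log2_chunk_size, int64_t i) {
  assert(log2_chunk_size >= 0 && log2_chunk_size < 62 && i >= 0);
  const int64_t mask = (int64_t{1} << log2_chunk_size) - 1;
  return {i >> log2_chunk_size, i & mask};
}
```

For a power-of-2 chunk size, `LocatePow2` and `LocateRegular` return the same location; the former simply avoids the division.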

      Most of the Arrow data adapters will either feature contiguous data (1 chunk, so chunking is not an issue) or a regular chunk size, so this isn't as much of an immediate concern, but we should consider making it a contract of any data structures dealing in multiple arrays.
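
One way such a contract could be checked is sketched below. This is only an illustration under assumed names (`IsPowerOfTwo`, `ConformsToChunkSize`) and an assumed input (a plain list of chunk lengths); it does not reflect any existing Arrow function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// True if n is a positive power of 2.
inline bool IsPowerOfTwo(int64_t n) { return n > 0 && (n & (n - 1)) == 0; }

// Hypothetical contract check: chunk_size is a power of 2, every chunk except
// the last has exactly chunk_size elements, and the last chunk is non-empty
// and no larger than chunk_size.
inline bool ConformsToChunkSize(const std::vector<int64_t>& chunk_lengths,
                                int64_t chunk_size) {
  if (!IsPowerOfTwo(chunk_size)) return false;
  if (chunk_lengths.empty()) return true;  // assumption: no chunks trivially conforms
  for (std::size_t k = 0; k + 1 < chunk_lengths.size(); ++k) {
    assert(chunk_lengths[k] >= 0);  // precondition: lengths are non-negative
    if (chunk_lengths[k] != chunk_size) return false;
  }
  const int64_t last = chunk_lengths.back();
  return last > 0 && last <= chunk_size;
}
```

A single contiguous chunk passes this check whenever its length does not exceed `chunk_size`.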

      In general, it would be preferable to reorganize memory into either a regular chunk size (like 64K values per chunk) or a contiguous memory region. For the moment, I would prefer not to invest significant energy in writing algorithms for data with irregular chunk sizes.
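
As a rough sketch of what reorganizing into a regular chunk size could look like, the function below copies irregular chunks into fixed-size ones. `Rechunk` is a hypothetical name, and `std::vector<int32_t>` stands in for a real array type; null handling and non-primitive types are not covered.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Copies the values of irregularly sized chunks into new chunks that each hold
// exactly chunk_size values, except possibly the last one, which may be shorter.
inline std::vector<std::vector<int32_t>> Rechunk(
    const std::vector<std::vector<int32_t>>& chunks, std::size_t chunk_size) {
  assert(chunk_size > 0);
  std::vector<std::vector<int32_t>> out;
  for (const auto& chunk : chunks) {
    std::size_t pos = 0;
    while (pos < chunk.size()) {
      // Start a new output chunk when there is none yet or the current one is full.
      if (out.empty() || out.back().size() == chunk_size) {
        out.emplace_back();
        out.back().reserve(chunk_size);
      }
      // Copy as much as fits into the current output chunk.
      const std::size_t n =
          std::min(chunk_size - out.back().size(), chunk.size() - pos);
      const auto first = chunk.begin() + static_cast<std::ptrdiff_t>(pos);
      out.back().insert(out.back().end(), first,
                        first + static_cast<std::ptrdiff_t>(n));
      pos += n;
    }
  }
  return out;
}
```

Passing a `chunk_size` at least as large as the total length yields a single contiguous chunk, which covers the other option mentioned above.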

          People

            wesm Wes McKinney
            Votes: 0
            Watchers: 6
