  Apache Arrow / ARROW-1407

Dictionaries can only hold a maximum of 4096 indices

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.6.0
    • Fix Version/s: 0.7.0
    • Component/s: Java
    • Labels:
      None

      Description

      Dictionaries seem to be able to hold only 4096 indices, meaning that only vectors with 4096 values or fewer can be dictionary-encoded. The attached image is a stack trace of what happens when I try to encode a vector containing 4097 strings using a dictionary containing two distinct values.

      The error can be traced to line 95 of DictionaryEncoder.java (`setter.invoke(mutator, i, encoded);`). The indices array that holds the encoded values is allocated on line 84 with `indices.allocateNew()`, and `allocateNew()` appears to allocate only a fixed initial capacity (apparently room for 4096 values). The code runs with 4096 rows of data or fewer; with any more, the same error occurs.
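
      The failure mode described above can be illustrated with a minimal, self-contained sketch. It does not use Arrow's classes: `IndexBufferSketch`, its `set` and `setSafe` methods, and the doubling growth policy are hypothetical stand-ins chosen for illustration, and the 4096 initial capacity is taken from the observation in this report. It shows why an unchecked write past the initial allocation fails on the 4097th value, and how a write that grows the buffer first would avoid it. Whether the actual fix takes this form is an assumption.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a fixed-width index vector; NOT Arrow's actual class. */
class IndexBufferSketch {
    private int[] data;

    IndexBufferSketch(int initialCapacity) {
        data = new int[initialCapacity];
    }

    int capacity() {
        return data.length;
    }

    int get(int index) {
        return data[index];
    }

    /** Plain write: throws ArrayIndexOutOfBoundsException when index >= capacity. */
    void set(int index, int value) {
        data[index] = value;
    }

    /** Growing write: doubles the capacity until the index fits, then writes. */
    void setSafe(int index, int value) {
        int newCapacity = Math.max(data.length, 1);
        while (index >= newCapacity) {
            newCapacity *= 2;
        }
        if (newCapacity != data.length) {
            data = Arrays.copyOf(data, newCapacity);
        }
        data[index] = value;
    }

    public static void main(String[] args) {
        int rows = 4097;                      // one more than the assumed initial capacity
        String[] dictionary = {"foo", "bar"}; // two distinct values, as in the report

        // Encoding with the plain write fails on the last row.
        IndexBufferSketch unchecked = new IndexBufferSketch(4096);
        try {
            for (int i = 0; i < rows; i++) {
                unchecked.set(i, i % dictionary.length);
            }
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("plain set failed: " + e.getMessage());
        }

        // Encoding with the growing write succeeds for all rows.
        IndexBufferSketch growing = new IndexBufferSketch(4096);
        for (int i = 0; i < rows; i++) {
            growing.setSafe(i, i % dictionary.length);
        }
        System.out.println("capacity after setSafe: " + growing.capacity());
    }
}
```

      An alternative under the same assumptions would be to size the buffer up front from the input vector's value count, which avoids repeated reallocation when the row count is known before encoding starts.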

            People

            • Assignee: Li Jin (icexelloss)
            • Reporter: Shayan Monshizadeh (shayanm)
            • Votes: 0
            • Watchers: 5

              Dates

              • Created:
                Updated:
                Resolved: