Apache Arrow / ARROW-1407

Dictionaries can only hold a maximum of 4096 indices


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.6.0
    • Fix Version/s: 0.7.0
    • Component/s: Java
    • Labels:
      None

      Description

      Dictionaries seem to be able to hold only 4096 indices, meaning that only vectors with 4096 values or fewer can be dictionary-encoded. The attached image is a stack trace of what happens when I try to encode a vector containing 4097 strings against a dictionary containing two distinct values.

      The error can be traced to line 95 of DictionaryEncoder.java (`setter.invoke(mutator, i, encoded);`). The indices vector that holds the encoded values is allocated on line 84 with `indices.allocateNew()`, and it seems that `allocateNew()` initially allocates room for only 4096 values. The code runs if there are 4096 rows of data or fewer. Any more and the same error occurs.
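
      To make the suspected failure mode concrete, here is a minimal sketch in plain Java. It does not use Arrow's classes: `IndexBufferSketch`, its `INITIAL_CAPACITY` constant, and both `encode*` helpers are hypothetical stand-ins that only model a fixed default allocation combined with an unchecked write. The sketch also shows a bounds-checked write that grows the buffer, which is one possible remedy; it is not necessarily how the issue was fixed in 0.7.0.

```java
import java.util.Arrays;

/** Hypothetical model of an index vector; NOT Arrow's actual implementation. */
final class IndexBufferSketch {
    // Assumed default capacity, chosen to match the 4096-row limit seen in this report.
    static final int INITIAL_CAPACITY = 4096;

    private int[] data = new int[0];

    /** Allocates the default capacity, regardless of how many values will be written. */
    void allocateNew() {
        data = new int[INITIAL_CAPACITY];
    }

    /** Unchecked write: throws once index reaches the allocated capacity. */
    void set(int index, int value) {
        data[index] = value;
    }

    /** Checked write: doubles the capacity until index fits, then writes. */
    void setSafe(int index, int value) {
        int capacity = Math.max(data.length, 1);
        while (index >= capacity) {
            capacity *= 2;
        }
        if (capacity != data.length) {
            data = Arrays.copyOf(data, capacity);
        }
        data[index] = value;
    }

    int capacity() {
        return data.length;
    }

    int get(int index) {
        return data[index];
    }

    /** Mimics the reported pattern: default allocation plus unchecked writes. */
    static boolean encodeUnchecked(int rows) {
        IndexBufferSketch indices = new IndexBufferSketch();
        indices.allocateNew();
        try {
            for (int i = 0; i < rows; i++) {
                indices.set(i, i % 2); // two distinct dictionary values
            }
            return true;
        } catch (ArrayIndexOutOfBoundsException e) {
            return false; // in this model, row 4096 is the first write to fail
        }
    }

    /** Same loop, but using the growing write. */
    static IndexBufferSketch encodeSafe(int rows) {
        IndexBufferSketch indices = new IndexBufferSketch();
        indices.allocateNew();
        for (int i = 0; i < rows; i++) {
            indices.setSafe(i, i % 2);
        }
        return indices;
    }
}
```

      In this model, `encodeUnchecked(4096)` succeeds and `encodeUnchecked(4097)` fails, which matches the behaviour described above. Sizing the allocation from the input vector's value count before writing would be another way to avoid the overflow.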

        Attachments

        1. Screen Shot 2017-08-22 at 7.14.07 PM.png
          150 kB
          Shayan Monshizadeh


            People

            • Assignee:
              icexelloss Li Jin
              Reporter:
              shayanm Shayan Monshizadeh
            • Votes:
              0
              Watchers:
              5
