  1. Apache Arrow
  2. ARROW-5140

[Bug?][Parquet] Can write a jagged array column of strings to disk, but hit `ArrowNotImplementedError` on read



    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Not A Problem
    • Affects Version/s: 0.12.0
    • Fix Version/s: 0.14.0
    • Component/s: Python
    • Labels: None
    • Environment: Debian 8



      I encountered an issue on a proprietary dataset where we have a schema that looks roughly like:

       |-- ids: array (nullable = true)
       |    |-- element: string (containsNull = true)

      I was able to write this dataset to Parquet without problems (using pq.write_table), but on reading it back (using pq.read_table) I hit the following error: `ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs` (a full stacktrace is attached below)

      I find this confusing because I could serialize this table but not deserialize it. The failure also does not occur at every dataset size: a smaller sample read back fine. So I built a small reproduction harness to find where it starts to fail:

      Further investigation

      • Reducing the maximum number of elements per row of ids lets me serialize and deserialize more rows.
      • With at most 15 elements per row, each element being at most 20 characters, reads start failing at about 1.3e5 rows.
      • Within the limits of my willingness to spend time building giant dataframes, I have not been able to reproduce this issue with, e.g., longs instead of strings.
      • Another column in this dataset holds much longer strings: its total character count is ~3x that of this trouble column (counting the strings in each row as simply concatenated). I have no issue serializing or deserializing that column.
      • The arrays being of different lengths does not seem to matter: if I force every row to have ~14 elements, it fails with the same error even at 1e5 rows.

      Reproduction code

      This gist should have both a stacktrace and reproduction code.
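
      The experiments listed under "Further investigation" can be approximated with a harness along the following lines. This is only a sketch under stated assumptions, not the code from the gist: the function and parameter names (make_ids_column, total_chars, round_trip, max_elems, max_chars, fixed_len) are hypothetical, and random lowercase strings stand in for the proprietary data.

```python
import random
import string


def make_ids_column(n_rows, max_elems=15, max_chars=20, fixed_len=None, seed=0):
    """Build a jagged list-of-strings column shaped like the `ids` column.

    Hypothetical helper: the parameters mirror the knobs from the bullets
    (elements per row, characters per element, fixed vs. jagged lengths).
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        # fixed_len forces every row to the same length; otherwise rows are jagged.
        n_elems = fixed_len if fixed_len is not None else rng.randint(0, max_elems)
        rows.append([
            "".join(rng.choices(string.ascii_lowercase, k=rng.randint(1, max_chars)))
            for _ in range(n_elems)
        ])
    return rows


def total_chars(rows):
    """Total number of characters across all elements of all rows."""
    return sum(len(s) for row in rows for s in row)


def round_trip(rows, path):
    """Write the column with pq.write_table, then read it back with pq.read_table."""
    # Imported lazily so the generator above can be used without pyarrow installed.
    import pyarrow as pa
    import pyarrow.parquet as pq

    ids = pa.array(rows, type=pa.list_(pa.string()))
    table = pa.Table.from_arrays([ids], names=["ids"])
    pq.write_table(table, path)
    return pq.read_table(path)
```

      For example, round_trip(make_ids_column(130_000), "ids.parquet") exercises the settings from the second bullet, where the column holds at most 130,000 × 15 × 20 = 39,000,000 characters. Whether the read raises the error will depend on the pyarrow version and on how closely the synthetic data matches the original.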

      Version info

      pyarrow==0.12.0
      parquet==1.2
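
      To compare another environment against these pins, a small standard-library sketch such as the following can report what is installed. The helper name installed_versions is hypothetical, and it assumes Python 3.8 or later for importlib.metadata.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Print results in the same `name==version` form used for the pins above.
    for name, version in installed_versions(["pyarrow", "parquet"]).items():
        print(f"{name}=={version}" if version else f"{name}: not installed")
```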

      Mea culpa

      I copy-pasted this from GitHub on request (https://github.com/apache/arrow/issues/4115), and Jira formatting is a nightmare compared to Markdown, so I apologize.




            Assignee: Unassigned
            Reporter: Zachary Jablons (zmjjmz)
            Votes: 1
            Watchers: 4