Apache Arrow / ARROW-17137

[Python] Converting data frame to Table with large nested column fails with `Invalid: Struct child array has length smaller than expected`


Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • None
    • 11.0.0
    • Component/s: Python

    Description

      Hey,

      I have a data frame in which one column is a nested struct array. Converting it to a pyarrow.Table fails once the data frame gets too large. I can reproduce the bug with the minimal example below, which uses anonymized data roughly similar to mine. With N_ROWS=500_000 or smaller, the conversion works fine.

       

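      import pandas as pd
      import pyarrow as pa

      # Conversion works with 500_000 rows or fewer and fails with 800_000.
      N_ROWS = 800_000

      # One anonymized item; every user record below holds 24 copies of it.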
      item_record = {
          "someImportantAssets": [
              {
                  "square": "https://some.super.loooooooooong.link.com/withmany/lorem/upload/"
                  "ipsum/stilllooooooooooonger/lorem/{someparameter}/156fdjjf644984dfdfaera64"
                  "/specificLink-i15348891"
              }
          ],
          "id": "i15348891",
          "title": "Some Long Item Title i15348891",
      }
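      # One user row: an ID plus a list of 24 item structs.
      user_record = {
          "userId": "faa4648-4964drf-64648fafa648-4648falj",
          "data": [item_record for _ in range(24)],
      }

      # The same row repeated N_ROWS times; "data" becomes a nested list-of-structs column.
      df = pd.DataFrame([user_record for _ in range(N_ROWS)])

      # Raises pyarrow.lib.ArrowInvalid when N_ROWS is large.
      table = pa.Table.from_pandas(df)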

       

      Traceback (most recent call last):
        File "/.../scratch/experiment_pq.py", line 23, in <module>
          table = pa.Table.from_pandas(df)
        File "pyarrow/table.pxi", line 3472, in pyarrow.lib.Table.from_pandas
        File "pyarrow/table.pxi", line 3574, in pyarrow.lib.Table.from_arrays
        File "pyarrow/table.pxi", line 2793, in pyarrow.lib.Table.validate
        File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
      pyarrow.lib.ArrowInvalid: Column 1: In chunk 0: Invalid: List child array invalid: Invalid: Struct child array #1 invalid: Invalid: List child array invalid: Invalid: Struct child array #0 has length smaller than expected for struct array (13338407 < 13338408) 

      The length is always smaller than expected by 1.
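
The off-by-one can be checked numerically. The sketch below is one possible reading of the numbers in the traceback, not a confirmed diagnosis. It assumes (this is not stated in the traceback) that the innermost string values are stored with 32-bit offsets, as Arrow's default string type does, so that a single array chunk can address at most 2**31 - 1 bytes of string data. The two lengths are copied from the error message; the per-element size is derived from them, not measured.

```python
# Assumption: string data in one chunk is limited by signed 32-bit offsets.
INT32_MAX = 2**31 - 1

observed_length = 13_338_407  # child array length reported in the traceback
expected_length = 13_338_408  # length the parent struct array expected

# Largest per-element byte size for which `observed_length` elements
# still fit within INT32_MAX bytes in total.
bytes_per_element = INT32_MAX // observed_length

print(bytes_per_element)                                 # 161
print(observed_length * bytes_per_element <= INT32_MAX)  # True
print(expected_length * bytes_per_element <= INT32_MAX)  # False
```

Under this assumption, 13338407 strings of 161 bytes each are the most that fit in one chunk, and one more would exceed the limit. Whether 161 bytes matches the `square` URL in the example can be verified by calling `len(...)` on that string.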

       

      Expected behavior:

      The conversion should either run without errors or fail with a clearer error message.

       

      System Info and Versions:

      Apple M1 Pro; the failure also occurred on an amd64 Linux machine on AWS.

       

      arrow-cpp                 7.0.0           py39h8a997f0_8_cpu    conda-forge
      pyarrow                   7.0.0           py39h3a11367_8_cpu    conda-forge
      python                    3.9.7           h54d631c_3_cpython    conda-forge
      

      I could also reproduce this with pyarrow 8.0.0.


            People

              Assignee: Unassigned
              Reporter: Simon Weiß (SimonCW)
              Votes: 0
              Watchers: 4
