Apache Arrow / ARROW-1459

[Python] PyArrow fails to load partitioned parquet files with non-primitive types

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6.0
    • Fix Version/s: 0.7.0
    • Component/s: Python
    • Labels:
      None

      Description

      When reading partitioned Parquet files that contain list columns (tested with files produced by Spark), the list column of the resulting table appears to contain data loaded from only one partition. Primitive types seem to be loaded correctly.

      It can be reproduced using the following code (Arrow 0.6.0, Spark 2.1.1):

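      >>> import numpy as np
      >>> import pyarrow.parquet as pq
      >>> # `spark` is assumed to be an existing SparkSession (e.g. from the pyspark shell)
      >>> df = spark.createDataFrame(list(zip(np.arange(10).tolist(), np.arange(20).reshape((10,2)).tolist())))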
      >>> df.toPandas()
         _1        _2
      0   0    [0, 1]
      1   1    [2, 3]
      2   2    [4, 5]
      3   3    [6, 7]
      4   4    [8, 9]
      5   5  [10, 11]
      6   6  [12, 13]
      7   7  [14, 15]
      8   8  [16, 17]
      9   9  [18, 19]
      >>> df.repartition(2).write.parquet('df_parts.parquet')
      >>> pq.read_table('df_parts.parquet').to_pandas()
         _1        _2
      0   0    [0, 1]
      1   2    [4, 5]
      2   4    [8, 9]
      3   6  [12, 13]
      4   8  [16, 17]
      5   1    [0, 1]
      6   3    [4, 5]
      7   5    [8, 9]
      8   7  [12, 13]
      9   9  [16, 17]
      

      When the data is loaded using Spark or coalesced into one partition, everything works as expected:

      >>> spark.read.parquet('df_parts.parquet').toPandas()
         _1        _2
      0   1    [2, 3]
      1   3    [6, 7]
      2   5  [10, 11]
      3   7  [14, 15]
      4   9  [18, 19]
      5   0    [0, 1]
      6   2    [4, 5]
      7   4    [8, 9]
      8   6  [12, 13]
      9   8  [16, 17]
      >>> df.coalesce(1).write.parquet('df_single.parquet')
      >>> pq.read_table('df_single.parquet').to_pandas()
         _1        _2
      0   0    [0, 1]
      1   1    [2, 3]
      2   2    [4, 5]
      3   3    [6, 7]
      4   4    [8, 9]
      5   5  [10, 11]
      6   6  [12, 13]
      7   7  [14, 15]
      8   8  [16, 17]
      9   9  [18, 19]
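
      The symptom can also be checked programmatically rather than by eye. The following is a minimal sketch, assuming the reproduction data above, where the row with key i in column _1 was created with the list [2*i, 2*i + 1] in column _2. The helper name find_mismatched_rows is hypothetical and not part of PyArrow or Spark; the row literals are copied from the outputs printed in this report.

```python
def find_mismatched_rows(rows):
    """Return the keys of rows whose list column does not match its key.

    Assumes the reproduction data from this report, where the row with
    key i was created with the list [2*i, 2*i + 1].
    """
    return [key for key, values in rows
            if list(values) != [2 * key, 2 * key + 1]]


# Rows as printed by pq.read_table('df_parts.parquet').to_pandas() in this report.
observed = [
    (0, [0, 1]), (2, [4, 5]), (4, [8, 9]), (6, [12, 13]), (8, [16, 17]),
    (1, [0, 1]), (3, [4, 5]), (5, [8, 9]), (7, [12, 13]), (9, [16, 17]),
]

# Rows as printed by spark.read.parquet('df_parts.parquet').toPandas() in this report.
expected = [
    (1, [2, 3]), (3, [6, 7]), (5, [10, 11]), (7, [14, 15]), (9, [18, 19]),
    (0, [0, 1]), (2, [4, 5]), (4, [8, 9]), (6, [12, 13]), (8, [16, 17]),
]

print(find_mismatched_rows(observed))  # keys whose lists do not belong to them
print(find_mismatched_rows(expected))  # empty when every list matches its key
```

      For the PyArrow output, the mismatched keys are exactly the odd ones (1, 3, 5, 7, 9), which is consistent with the second partition's list values being replaced by those of the first. With a real pandas DataFrame, the rows could presumably be built with zip(pdf['_1'], pdf['_2']).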
      

            People

            • Assignee: Wes McKinney (wesm)
            • Reporter: Jonas Amrich (jonasamrich)