Apache Arrow / ARROW-3728

[Python] Merging Parquet Files - Pandas Meta in Schema Mismatch

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.10.0, 0.11.0, 0.11.1
    • Fix Version/s: 0.12.0
    • Component/s: Python
    • Environment: Python 3.6.3, OSX 10.14

    Description

      From: https://stackoverflow.com/questions/53214288/merging-parquet-files-pandas-meta-in-schema-mismatch
       
      I am trying to merge multiple Parquet files into one. Their schemas are identical field-wise, but my ParquetWriter complains that they are not. After some investigation I found that the pandas metadata stored in the schemas differs between files, which causes this error.
       
      Sample:

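      import pyarrow.parquet as pq

      # files, MESS_DIR and COMPRESSED_FILE are assumed to be defined earlier in the script
      writer = None  # created lazily from the first table's schema
      pq_tables = []
      for file_ in files:
          pq_table = pq.read_table(f'{MESS_DIR}/{file_}')
          pq_tables.append(pq_table)
          if writer is None:
              writer = pq.ParquetWriter(COMPRESSED_FILE, schema=pq_table.schema, use_deprecated_int96_timestamps=True)
          # On the affected versions this raises ValueError when the pandas metadata differs
          writer.write_table(table=pq_table)
      if writer is not None:
          writer.close()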
      

      The error:

      Traceback (most recent call last):
        File "{PATH_TO}/main.py", line 68, in lambda_handler
          writer.write_table(table=pq_table)
        File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 335, in write_table
          raise ValueError(msg)
      ValueError: Table schema does not match schema used to create file:
      

            People

              Assignee: Krisztian Szucs (kszucs)
              Reporter: Micah Williamson (micahwilliamson)

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 1h