Apache Arrow / ARROW-3728

[Python] Merging Parquet Files - Pandas Meta in Schema Mismatch


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.10.0, 0.11.0, 0.11.1
    • Fix Version/s: 0.12.0
    • Component/s: Python
    • Environment: Python 3.6.3, OSX 10.14

    Description

      From: https://stackoverflow.com/questions/53214288/merging-parquet-files-pandas-meta-in-schema-mismatch
       
      I am trying to merge multiple Parquet files into one. Their schemas are identical field-wise, but my ParquetWriter complains that they are not. After some investigation I found that the pandas metadata in the schemas differs, which causes this error.
       
      Sample:

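      import pyarrow.parquet as pq

      # files, MESS_DIR and COMPRESSED_FILE are assumed to be defined elsewhere.
      writer = None
      pq_tables = []
      for file_ in files:
          pq_table = pq.read_table(f'{MESS_DIR}/{file_}')
          pq_tables.append(pq_table)
          if writer is None:
              # The first table's schema, pandas metadata included, becomes the file schema.
              writer = pq.ParquetWriter(COMPRESSED_FILE, schema=pq_table.schema, use_deprecated_int96_timestamps=True)
          # Raises ValueError when a later table's schema differs from the writer's.
          writer.write_table(table=pq_table)
      if writer is not None:
          writer.close()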
      

      The error:

      Traceback (most recent call last):
        File "{PATH_TO}/main.py", line 68, in lambda_handler
          writer.write_table(table=pq_table)
        File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 335, in write_table
          raise ValueError(msg)
      ValueError: Table schema does not match schema used to create file:
      


          People

            Assignee: Krisztian Szucs (kszucs)
            Reporter: Micah Williamson (micahwilliamson)
            Votes: 0
            Watchers: 4


              Time Tracking

              Estimated: Not Specified
              Remaining: 0h
              Logged: 1h
