  1. Apache Arrow
  2. ARROW-11069

[C++] Parquet writer incorrect data being written when data type is struct


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.0.0
    • Fix Version/s: 3.0.0
    • Component/s: Python
    • Labels: None
    • Environment: pandas v1.0.4

    Description

      Writing a dict (struct) column to Parquet with pyarrow produces incorrect data. To reproduce:

       

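      import pandas as pd

      # Read the attached file, then write it straight back out.
      orig = pd.read_parquet("original.parquet")
      orig.to_parquet("first_write.parquet")

      # Read the rewritten file back in.
      first_write = pd.read_parquet("first_write.parquet")

      # Should be True if the round trip preserved the data.
      print(orig.equals(first_write))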
      

       
      The incorrect results start appearing after index 1024. first_write.parquet was created by reading original.parquet and then writing it out again. I don't see any obvious pattern in the shuffled rows.
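To pin down where the two files start to diverge, a small helper like the one below may be useful. It is only a sketch and not part of the original report: `first_mismatch` is a hypothetical name, and the demo uses synthetic struct-like records (shuffled after position 1024 on purpose) rather than the attached files.

```python
from typing import Optional, Sequence


def first_mismatch(left: Sequence[dict], right: Sequence[dict]) -> Optional[int]:
    """Return the position of the first record that differs, or None if all match."""
    if len(left) != len(right):
        raise ValueError("inputs have different lengths")
    for pos, (l_row, r_row) in enumerate(zip(left, right)):
        if l_row != r_row:
            return pos
    return None


# Synthetic stand-in for a struct column: one dict per row.
good = [{"s": {"a": i, "b": str(i)}} for i in range(2000)]

# Simulate the reported symptom: rows are intact up to index 1024,
# then appear in a different order.
shuffled = good[:1025] + good[1025:][::-1]

print(first_mismatch(good, good))      # None
print(first_mismatch(good, shuffled))  # 1025
```

With the real data, the two inputs could be obtained from the frames in the reproduction script via `orig.to_dict("records")` and `first_write.to_dict("records")`. Note that plain `!=` treats NaN as unequal to itself, so rows containing NaN would be flagged even if they round-tripped correctly.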


       Original records

      Written records

      Attachments

        1. original.parquet
          5 kB
          Palash Goel
        2. image-2020-12-30-01-20-45-183.png
          130 kB
          Palash Goel
        3. image-2020-12-30-01-19-42-739.png
          109 kB
          Palash Goel
        4. image-2020-12-30-01-19-20-491.png
          109 kB
          Palash Goel
        5. first_write.parquet
          5 kB
          Palash Goel


            People

              Assignee: Unassigned
              Reporter: Palash Goel (palashgoel7)
              Votes: 0
              Watchers: 3
