  1. Apache Arrow
  2. ARROW-13214

[C++] [Parquet] uint32 does not roundtrip?

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Component/s: Parquet

    Description

      I found that the following columns do not roundtrip through Parquet (each entry is a file name and the type of a failing column):

      [('generated_primitive', DataType(uint32)), ('generated_primitive', DataType(uint32))]
      [('generated_primitive_no_batches', DataType(uint32)), ('generated_primitive_no_batches', DataType(uint32))]
      [('generated_primitive_zerolength', DataType(uint32)), ('generated_primitive_zerolength', DataType(uint32))]
      

      The exact code I am using for this is:

      import os
      
      import pyarrow.ipc
      import pyarrow.parquet as pq
      
      
      def get_file_path(file: str):
          return f"../testing/arrow-testing/data/arrow-ipc-stream/integration/1.0.0-littleendian/{file}.arrow_file"
      
      
      def _expected(file: str):
          return pyarrow.ipc.RecordBatchFileReader(get_file_path(file)).read_all()
      
      
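      def check_file(file):
          # Read the reference table from the Arrow IPC integration file.
          expected = _expected(file)
          path = f"{file}.parquet"
      
          # Write it to Parquet without compression or statistics, then read it back.
          pq.write_table(expected, path, compression=None, write_statistics=False)
      
          table = pq.read_table(path)
          os.remove(path)
      
          # Collect (file, original type) for every column that differs after the roundtrip.
          failing = []
          for c1, c2 in zip(expected, table):
              if c1 != c2:
                  failing.append((file, c1.type))
          return failing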
      
      
      for file in [
          "generated_primitive",
          "generated_primitive_no_batches",
          "generated_primitive_zerolength",
          "generated_null",
          "generated_null_trivial",
          "generated_primitive_large_offsets",
      ]:
          failing = check_file(file)
          if failing:
              print(failing)
      

      Note: when I generate the same Parquet file with the experimental parquet2 instead, the roundtrip succeeds, which suggests that the error is on the write side.

      Upon further investigation, the only difference appears to be the type: c1 (the column from the original table) has type uint32, while c2 (the column read back from Parquet) has type int64.
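To separate a type change from a change in values, the two schemas can be compared directly. The following is a minimal sketch, not part of the original report: `type_mismatches` is a hypothetical helper name, and it assumes both arguments are pyarrow tables with the same columns in the same order.

```python
from typing import TYPE_CHECKING, List, Tuple

if TYPE_CHECKING:  # pyarrow is only needed for the type hints below
    import pyarrow as pa


def type_mismatches(expected: "pa.Table", actual: "pa.Table") -> List[Tuple[str, str, str]]:
    """Return (column name, expected type, actual type) for each column whose type changed.

    Hypothetical helper, not a pyarrow API: it walks the two schemas pairwise
    and compares only the declared field types, ignoring the column values.
    """
    mismatches = []
    for want, got in zip(expected.schema, actual.schema):
        if want.type != got.type:
            mismatches.append((want.name, str(want.type), str(got.type)))
    return mismatches
```

Calling `type_mismatches(expected, table)` inside `check_file` would list each affected column together with both types, which makes a report such as uint32 against int64 visible without comparing any data.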

            People

              Assignee: Unassigned
              Reporter: Jorge Leitão (jorgecarleitao)