Apache Arrow / ARROW-10121

[C++][Python] Variable dictionaries do not survive roundtrip to IPC stream


Details

    Description

      Failing test case, taken from a thread on the dev@ mailing list (https://lists.apache.org/thread.html/r338942b4e9f9316b48e87aab41ac49c7ffedd45733d4a6349523b7eb%40%3Cdev.arrow.apache.org%3E):

      import pyarrow as pa
      from io import BytesIO
      
      print(pa.__version__)  # record the pyarrow version under test
      
      schema = pa.schema([pa.field('foo', pa.int32()), pa.field('bar', pa.dictionary(pa.int32(), pa.string()))] )
      r1 = pa.record_batch(
          [
              [1, 2, 3, 4, 5],
              pa.array(["a", "b", "c", "d", "e"]).dictionary_encode()
          ],
          schema
      )
      
      r1.validate()
      r2 = pa.record_batch(
          [
              [1, 2, 3, 4, 5],
              pa.array(["c", "c", "e", "f", "g"]).dictionary_encode()
          ],
          schema
      )
      
      r2.validate()
      
      assert r1.column(1).dictionary != r2.column(1).dictionary
      
      
      sink = pa.BufferOutputStream()
      writer = pa.RecordBatchStreamWriter(sink, schema)
      
      writer.write(r1)
      writer.write(r2)
      
      serialized = BytesIO(sink.getvalue().to_pybytes())
      stream = pa.ipc.open_stream(serialized)
      
      deserialized = []
      
      while True:
          try:
              deserialized.append(stream.read_next_batch())
          except StopIteration:
              break
      
      assert deserialized[1][1].to_pylist() == r2[1].to_pylist()
      

            People

              apitrou Antoine Pitrou
              wesm Wes McKinney
                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 2h