  1. Apache Arrow
  2. ARROW-10145

[C++][Dataset] Assert integer overflow in partitioning falls back to string


Details

    Description

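      From https://stackoverflow.com/questions/64137664/how-to-override-type-inference-for-partition-columns-in-hive-partitioned-dataset

      A partition value that overflows int32 (here 3760212050, above the int32 maximum of 2147483647) makes reading through the new dataset implementation fail, while the legacy implementation reads the same directory and yields int64 partition values.

      Small reproducer: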

      import pyarrow as pa
      import pyarrow.parquet as pq
      
      table = pa.table({'part': [3760212050]*10, 'col': range(10)})
      pq.write_to_dataset(table, "test_int64_partition", partition_cols=['part'])
      
      In [35]: pq.read_table("test_int64_partition/")
      ...
      ArrowInvalid: error parsing '3760212050' as scalar of type int32
      In ../src/arrow/scalar.cc, line 333, code: VisitTypeInline(*type_, this)
      In ../src/arrow/dataset/partition.cc, line 218, code: (_error_or_value26).status()
      In ../src/arrow/dataset/partition.cc, line 229, code: (_error_or_value27).status()
      In ../src/arrow/dataset/discovery.cc, line 256, code: (_error_or_value17).status()
      
      In [36]: pq.read_table("test_int64_partition/", use_legacy_dataset=True)
      Out[36]: 
      pyarrow.Table
      col: int64
      part: dictionary<values=int64, indices=int32, ordered=0>
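
The behaviour named in the issue title (an integer partition value that overflows should fall back to string instead of raising) can be illustrated with a small pure-Python sketch. This is not Arrow's implementation: `infer_partition_type` and the `"int32"`/`"string"` labels are hypothetical names used only for illustration, and the sketch assumes inference looks at the raw path segments as strings.

```python
# Hypothetical sketch of "try int32, fall back to string" inference.
# It only mirrors the idea in the issue title, not Arrow's actual code.
INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1


def infer_partition_type(segments):
    """Return "int32" if every raw segment parses as an integer that fits
    in int32, otherwise return "string"."""
    for raw in segments:
        try:
            value = int(raw)
        except ValueError:
            # Not an integer at all: keep the values as strings.
            return "string"
        if not INT32_MIN <= value <= INT32_MAX:
            # Integer, but too large for int32: fall back instead of failing.
            return "string"
    return "int32"


print(infer_partition_type(["1", "2", "3"]))   # int32
print(infer_partition_type(["3760212050"]))    # string
```

With this kind of fallback, the value from the reproducer would be read as a string partition key rather than raising `ArrowInvalid`.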
      

            People

              bkietz Ben Kietzman
              jorisvandenbossche Joris Van den Bossche

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 1h 10m