Apache Arrow / ARROW-7545

[C++] [Dataset] Scanning dataset with dictionary type hangs

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: 0.16.0
    • Component/s: C++

    Description

      I assume this is an issue on the C++ side of the datasets code, but the reproducer is in Python.

      I create a small Parquet file with a single column of dictionary type. Reading it with pq.read_table works fine, but reading it with the datasets machinery hangs when scanning:

      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq
      
      df = pd.DataFrame({'a': pd.Categorical(['a', 'b']*10)})
      arrow_table = pa.Table.from_pandas(df)
      
      filename = "test.parquet"
      pq.write_table(arrow_table, filename)
      
      from pyarrow.fs import LocalFileSystem
      from pyarrow.dataset import ParquetFileFormat, Dataset, FileSystemDataSourceDiscovery, FileSystemDiscoveryOptions
      
      filesystem = LocalFileSystem()
      format = ParquetFileFormat()
      options = FileSystemDiscoveryOptions()
      
      discovery = FileSystemDataSourceDiscovery(
              filesystem, [filename], format, options)
      inspected_schema = discovery.inspect()
      dataset = Dataset([discovery.finish()], inspected_schema)
      
      # dataset.schema works fine and gives correct schema
      dataset.schema
      
      scanner_builder = dataset.new_scan()
      scanner = scanner_builder.finish()
      # this hangs
      scanner.to_table()
      

            People

              Assignee: fsaintjacques Francois Saint-Jacques
              Reporter: jorisvandenbossche Joris Van den Bossche
              Votes: 0
              Watchers: 2