Apache Arrow / ARROW-7727

[Python] Unable to read a ParquetDataset when schema validation is on.

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.15.1
    • Fix Version/s: 0.16.0
    • Component/s: Python
    • Labels: None

    Description

      I was trying to read a subset of my Parquet files using the `ParquetDataset` object with a predefined schema. When it tries to validate the schema, `to_arrow_schema` is called, and the schema object does not support this method. I don't know what is happening; here is a sample.

      import pyarrow as pa
      import pyarrow.parquet as pq
      import pandas as pd
      import numpy as np
      
      schema = pa.schema([
          pa.field("field1", pa.string()),
          pa.field("field2", pa.string()),
          pa.field("field3", pa.string()),
      ])
      
       ...
      
      pq_dataset = pq.ParquetDataset(file_groups[0], schema=schema)
      
      AttributeError: 'pyarrow.lib.Schema' object has no attribute 'to_arrow_schema'
      

      If we check the type of the schema as defined above we get:

      type(schema)
      pyarrow.lib.Schema

      But the required type according to the docs is `pyarrow.parquet.Schema`. I don't know how to produce an object of this type, since we are forbidden from using the Schema constructor directly.

      If we check the implementation on GitHub, we find this line:

      dataset_schema = self.schema.to_arrow_schema()

      Is this a problem in the schema builder or the parquet dataset object?

          People

            Assignee: Unassigned
            Reporter: Otávio Vasques (otaviocv)