Apache Arrow / ARROW-4516

[Python] Error while creating a ParquetDataset on a path without `_common_metadata` but with an empty `_tempfile`

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.12.0
    • Fix Version/s: 0.14.0
    • Component/s: Python

    Description

      I suspect that there's an error in this line of code:

      https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L926

      While validating the schema during the initialisation of a ParquetDataset, we assume that if no _common_metadata file exists, the schema should be inferred from the first piece of the dataset. In my experience, the first piece can be a file whose name starts with an underscore. Such a file does not necessarily contain the schema and may even be empty, e.g. _tempfile.

      /tmp/pq/
      ├── part1.parquet
      └── _tempfile
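
A minimal sketch of why `_tempfile` can end up as the first piece in a layout like the one above. It uses only the standard library and assumes that pieces are ordered by a plain string sort of their file names, which is an assumption about how the manifest is built and is not confirmed here. The helper name `first_piece_by_name` is hypothetical, and a temporary directory stands in for /tmp/pq/.

```python
import os
import tempfile


def first_piece_by_name(directory):
    """Return the file name that sorts first among the regular files in `directory`."""
    names = [
        name
        for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    ]
    return sorted(names)[0]


# Stand-in for /tmp/pq/; both files are empty placeholders, not real Parquet data.
root = tempfile.mkdtemp()
for name in ("part1.parquet", "_tempfile"):
    open(os.path.join(root, name), "w").close()

first = first_piece_by_name(root)
print(first)  # prints "_tempfile": '_' (0x5F) sorts before lowercase 'p' (0x70)
```

Under that ordering assumption, any schema inference that reads "the first piece" would open the empty `_tempfile` instead of `part1.parquet`.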
      
      

      A layout like this is allowed by the Parquet specification, so we should probably ignore such pieces.

      On a cursory look, we could do any of the following:

      1. Choose the first piece whose file name does not start with "_".
      2. Sort pieces by name, but place all the "_" pieces last when building the manifest: https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L729
      3. Silently exclude all files whose names start with "_" here, but this will need to be tested: https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L770
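
The three options above could be sketched roughly as follows. This is an illustration on plain path strings and not the code in parquet.py; all function names here are hypothetical, and the paths are assumed to use "/" as the separator.

```python
import posixpath


def is_underscore_file(path):
    """True if the file name (not a parent directory) starts with '_'."""
    return posixpath.basename(path).startswith("_")


def first_data_piece(paths):
    """Option 1: return the first path whose file name does not start with '_'."""
    for path in paths:
        if not is_underscore_file(path):
            return path
    return None  # no usable piece found


def sort_underscore_last(paths):
    """Option 2: sort by name, but place '_' files after all other files."""
    return sorted(paths, key=lambda p: (is_underscore_file(p), p))


def exclude_underscore_files(paths):
    """Option 3: drop every path whose file name starts with '_'."""
    return [p for p in paths if not is_underscore_file(p)]


# Pieces as they would appear after a plain sort by name.
pieces = sorted(["/tmp/pq/part1.parquet", "/tmp/pq/_tempfile"])

option1 = first_data_piece(pieces)
option2 = sort_underscore_last(pieces)
option3 = exclude_underscore_files(pieces)
```

With the example directory, option 1 selects `/tmp/pq/part1.parquet`, option 2 moves `_tempfile` to the end of the list, and option 3 removes it. Note that a filter like option 3 would also match `_metadata` and `_common_metadata`, so those files would need to be picked up before the filter runs.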

          People

            Assignee: Unassigned
            Reporter: yogesh garg