Apache Arrow / ARROW-1429

[Python] Error loading parquet file with _metadata from HDFS

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6.0
    • Fix Version/s: 0.7.0
    • Component/s: Python
    • Labels:
      None
    • Environment:
      RHEL 6.8, Python 3.5.4 (Anaconda), Hadoop 2.6.0-cdh5.8.3

      Description

      I can open tables stored on HDFS as long as there is no _metadata file alongside the Parquet files.

      For two tables with a _metadata file, I get the following traceback:

      Traceback (most recent call last):
        File "<string>", line 1, in <module>
        File "/home/bmachie/Documents/ml_irissearch/python/util.py", line 199, in read_table
          pq_table = read_hdfs_parquet(hdfs_path, columns)
        File "/home/bmachie/Documents/ml_irissearch/python/util.py", line 251, in read_hdfs_parquet
          return HDFS_CONNECTION.read_parquet(hdfs_path, columns)
        File "/data/data01/dev/edl/infra/mstr/landing/condaenvs/ml_irissearch/lib/python3.5/site-packages/pyarrow/filesystem.py", line 168, in read_parquet
          filesystem=self)
        File "/data/data01/dev/edl/infra/mstr/landing/condaenvs/ml_irissearch/lib/python3.5/site-packages/pyarrow/parquet.py", line 535, in __init__
          self.common_metadata = ParquetFile(self.metadata_path).metadata
        File "/data/data01/dev/edl/infra/mstr/landing/condaenvs/ml_irissearch/lib/python3.5/site-packages/pyarrow/parquet.py", line 54, in __init__
          self.reader.open(source, metadata=metadata)
        File "_parquet.pyx", line 398, in pyarrow._parquet.ParquetReader.open
        File "io.pxi", line 705, in pyarrow.lib.get_reader
        File "io.pxi", line 472, in pyarrow.lib.memory_map
        File "io.pxi", line 451, in pyarrow.lib.MemoryMappedFile._open
        File "error.pxi", line 72, in pyarrow.lib.check_status
      pyarrow.lib.ArrowIOError: Failed to open local file: hdfs://nameservice1/path/to/table/_metadata
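
The last frames above show the hdfs:// URI of the _metadata file reaching MemoryMappedFile._open, i.e. it is being treated as a local path. The following is only a minimal sketch of the kind of dispatch that avoids this, not the pyarrow implementation: `is_local_path` and `open_metadata` are hypothetical helpers, and `filesystem` is assumed to be any object exposing an `open(path, mode)` method.

```python
from urllib.parse import urlparse


def is_local_path(path):
    """Return True if `path` has no URI scheme or uses the file:// scheme.

    Note: a Windows drive path such as C:\\data would be misclassified here;
    this sketch only targets POSIX-style paths like those in the report.
    """
    return urlparse(path).scheme in ("", "file")


def open_metadata(path, filesystem=None):
    """Hypothetical helper: open a _metadata path with the right backend.

    `filesystem` is an assumption: any object with an open(path, mode) method.
    """
    if is_local_path(path):
        # Plain local file: the builtin open() is sufficient.
        return open(path, "rb")
    if filesystem is None:
        raise ValueError("remote path %r needs a filesystem object" % path)
    # Remote URI: delegate to the filesystem instead of the local opener.
    return filesystem.open(path, "rb")
```

With this kind of check, a path such as hdfs://nameservice1/path/to/table/_metadata is routed to the filesystem object rather than to a local open call.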
      

      For another table with a _metadata file:

      Traceback (most recent call last):
        File "<string>", line 1, in <module>
        File "/home/bmachie/Documents/ml_irissearch/python/util.py", line 199, in read_table
          pq_table = read_hdfs_parquet(hdfs_path, columns)
        File "/home/bmachie/Documents/ml_irissearch/python/util.py", line 251, in read_hdfs_parquet
          return HDFS_CONNECTION.read_parquet(hdfs_path, columns)
        File "/data/data01/dev/edl/infra/mstr/landing/condaenvs/ml_irissearch/lib/python3.5/site-packages/pyarrow/filesystem.py", line 168, in read_parquet
          filesystem=self)
        File "/data/data01/dev/edl/infra/mstr/landing/condaenvs/ml_irissearch/lib/python3.5/site-packages/pyarrow/parquet.py", line 548, in __init__
          self.validate_schemas()
        File "/data/data01/dev/edl/infra/mstr/landing/condaenvs/ml_irissearch/lib/python3.5/site-packages/pyarrow/parquet.py", line 557, in validate_schemas
          self.schema = self.pieces[0].get_metadata(open_file).schema
      IndexError: list index out of range
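
The IndexError on `self.pieces[0]` suggests that the list of dataset pieces was empty when the schemas were validated. Why it was empty is not established here. As a rough sketch only (the helpers `split_listing` and `first_piece` are hypothetical and not part of pyarrow), separating underscore-prefixed sidecar files from data files and failing with an explicit message would look like this:

```python
def split_listing(names):
    """Split file names into (data_files, sidecar_files).

    Assumption: sidecar files such as _metadata start with an underscore.
    """
    data_files, sidecar_files = [], []
    for name in names:
        if name.startswith("_"):
            sidecar_files.append(name)
        else:
            data_files.append(name)
    return data_files, sidecar_files


def first_piece(pieces):
    """Return the first piece, or raise a descriptive error if there is none."""
    if not pieces:
        raise ValueError("no Parquet data files were found for this dataset")
    return pieces[0]
```

A guard like `first_piece` does not fix the underlying discovery problem, but it replaces the bare IndexError with a message that points at the empty piece list.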
      

            People

            • Assignee:
              brechtm Brecht Machiels
              Reporter:
              brechtm Brecht Machiels
            • Votes:
              0
              Watchers:
              2
