Apache Arrow / ARROW-2842

[Python] Cannot read Parquet files with row group size of 1 from HDFS

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Python
    • Labels: None

    Description

      This might be a bug in parquet-cpp; I need to spend a bit more time tracking it down. Given a Parquet file with a single row on HDFS, reading it with pyarrow yields this error:

      ```

      TcpSocket.cpp: 79: HdfsEndOfStream: Read 8 bytes failed from: End of the stream
      @ Unknown
      @ Unknown
      @ Unknown
      @ Unknown
      @ Unknown
      @ Unknown
      @ Unknown
      @ Unknown
      @ Unknown
      @ arrow::io::HdfsReadableFile::ReadAt(long, long, long*, void*)
      @ parquet::ArrowInputFile::ReadAt(long, long, unsigned char*)
      @ parquet::SerializedFile::ParseMetaData()
      @ parquet::ParquetFileReader::Contents::Open(std::unique_ptr<parquet::RandomAccessSource, std::default_delete<parquet::RandomAccessSource> >, parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData> const&)
      @ parquet::ParquetFileReader::Open(std::unique_ptr<parquet::RandomAccessSource, std::default_delete<parquet::RandomAccessSource> >, parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData> const&)
      @ parquet::arrow::OpenFile(std::shared_ptr<arrow::io::RandomAccessFile> const&, arrow::MemoryPool*, parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData> const&, std::unique_ptr<parquet::arrow::FileReader, std::default_delete<parquet::arrow::FileReader> >*)
      @ __pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*, _object*)

      ```

      The following code causes it:

      ```
      import pyarrow
      import pyarrow.parquet as pq

      fs = pyarrow.hdfs.connect('my-namenode-url', driver='libhdfs3')  # fill in namenode information
      file_object = fs.open('single-row.parquet')  # update for hdfs path of file
      pq.read_metadata(file_object)  # this works
      parquet_file = pq.ParquetFile(file_object)
      parquet_file.read_row_group(0)  # throws error
      ```

      I am working on writing a unit test for this. Note that I am using libhdfs3.
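
Since the trace ends inside `parquet::SerializedFile::ParseMetaData()`, one way to rule out a malformed file is to check the footer of a local copy of the attachment without going through HDFS or pyarrow at all. The following is a minimal sketch, not part of the original report: the helper name `read_footer_length` is hypothetical, and it relies only on the documented layout of an unencrypted Parquet file (leading `PAR1` magic, and a trailing 4-byte little-endian metadata length followed by `PAR1`).

```python
import struct

PARQUET_MAGIC = b"PAR1"
FOOTER_SIZE = 8  # 4-byte little-endian metadata length + 4-byte magic


def read_footer_length(data: bytes) -> int:
    """Return the metadata length recorded in a Parquet footer.

    Hypothetical helper for triage only. Raises ValueError if the
    bytes do not look like an unencrypted Parquet file.
    """
    # Smallest plausible file: leading magic plus the 8-byte footer.
    if len(data) < len(PARQUET_MAGIC) + FOOTER_SIZE:
        raise ValueError("too small to be a Parquet file")
    # The magic bytes appear at both the start and the end of the file.
    if data[:4] != PARQUET_MAGIC or data[-4:] != PARQUET_MAGIC:
        raise ValueError("missing PAR1 magic bytes")
    # The 4 bytes before the trailing magic hold the metadata length.
    (metadata_len,) = struct.unpack("<I", data[-8:-4])
    # The metadata must fit between the leading magic and the footer.
    if metadata_len > len(data) - FOOTER_SIZE - len(PARQUET_MAGIC):
        raise ValueError("footer length exceeds file size")
    return metadata_len
```

For example, `read_footer_length(open('single-row.parquet', 'rb').read())` on a local copy should return without raising if the footer is intact, which would point the investigation at the HDFS read path rather than at the file itself.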

      Attachments

        1. single-row.parquet
          0.4 kB
          Robbie Gruener

          People

            Assignee: Unassigned
            Reporter: Robbie Gruener (rgruener)
            Votes: 0
            Watchers: 2
