Apache Arrow / ARROW-2842

[Python] Cannot read parquet files with row group size of 1 From HDFS


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Python
    • Labels: None

      Description

      This might be a bug in parquet-cpp; I need to spend a bit more time tracking it down. Basically, given a file with a single row on HDFS, reading it with pyarrow yields this error:

      ```
      TcpSocket.cpp: 79: HdfsEndOfStream: Read 8 bytes failed from: End of the stream
      @ Unknown
      @ Unknown
      @ Unknown
      @ Unknown
      @ Unknown
      @ Unknown
      @ Unknown
      @ Unknown
      @ Unknown
      @ arrow::io::HdfsReadableFile::ReadAt(long, long, long*, void*)
      @ parquet::ArrowInputFile::ReadAt(long, long, unsigned char*)
      @ parquet::SerializedFile::ParseMetaData()
      @ parquet::ParquetFileReader::Contents::Open(std::unique_ptr<parquet::RandomAccessSource, std::default_delete<parquet::RandomAccessSource> >, parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData> const&)
      @ parquet::ParquetFileReader::Open(std::unique_ptr<parquet::RandomAccessSource, std::default_delete<parquet::RandomAccessSource> >, parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData> const&)
      @ parquet::arrow::OpenFile(std::shared_ptr<arrow::io::RandomAccessFile> const&, arrow::MemoryPool*, parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData> const&, std::unique_ptr<parquet::arrow::FileReader, std::default_delete<parquet::arrow::FileReader> >*)
      @ __pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*, _object*)
      ```

      The following code reproduces it:

      ```
      import pyarrow
      import pyarrow.parquet as pq

      fs = pyarrow.hdfs.connect('my-namenode-url', driver='libhdfs3')  # fill in namenode information
      file_object = fs.open('single-row.parquet')  # update for hdfs path of file

      pq.read_metadata(file_object)  # this works
      parquet_file = pq.ParquetFile(file_object)
      parquet_file.read_row_group(0)  # throws error
      ```


      I am working on writing a unit test for this. Note that I am using libhdfs3.
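
      For the unit test, a file like the attached single-row.parquet can be generated locally with the standard pyarrow API. The sketch below is an assumption about how the attachment was produced (one row, one row group); it also shows that the metadata-then-row-group sequence works against a local file, so the failure reported here is specific to the HDFS/libhdfs3 path.

      ```python
      import pyarrow as pa
      import pyarrow.parquet as pq

      # Hypothetical local stand-in for the attached single-row.parquet:
      # a one-row table written as a single row group.
      table = pa.table({'col': [1]})
      pq.write_table(table, 'single-row.parquet', row_group_size=1)

      # Locally, reading metadata and then row group 0 both succeed;
      # the HdfsEndOfStream error above only appeared over libhdfs3.
      meta = pq.read_metadata('single-row.parquet')
      parquet_file = pq.ParquetFile('single-row.parquet')
      table_back = parquet_file.read_row_group(0)
      ```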

        Attachments

        1. single-row.parquet (0.4 kB, Robbie Gruener)


            People

            • Assignee: Unassigned
            • Reporter: rgruener Robbie Gruener
            • Votes: 0
            • Watchers: 1

              Dates

              • Created:
              • Updated:
              • Resolved: