Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Invalid
Description
This might be a bug in parquet-cpp; I need to spend more time tracking it down. Given a file with a single row on HDFS, reading it with pyarrow yields this error:
```
TcpSocket.cpp: 79: HdfsEndOfStream: Read 8 bytes failed from: End of the stream
@ Unknown
@ Unknown
@ Unknown
@ Unknown
@ Unknown
@ Unknown
@ Unknown
@ Unknown
@ Unknown
@ arrow::io::HdfsReadableFile::ReadAt(long, long, long*, void*)
@ parquet::ArrowInputFile::ReadAt(long, long, unsigned char*)
@ parquet::SerializedFile::ParseMetaData()
@ parquet::ParquetFileReader::Contents::Open(std::unique_ptr<parquet::RandomAccessSource, std::default_delete<parquet::RandomAccessSource> >, parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData> const&)
@ parquet::ParquetFileReader::Open(std::unique_ptr<parquet::RandomAccessSource, std::default_delete<parquet::RandomAccessSource> >, parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData> const&)
@ parquet::arrow::OpenFile(std::shared_ptr<arrow::io::RandomAccessFile> const&, arrow::MemoryPool*, parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData> const&, std::unique_ptr<parquet::arrow::FileReader, std::default_delete<parquet::arrow::FileReader> >*)
@ __pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*, _object*)
```
The following code causes it:
```
import pyarrow
import pyarrow.parquet as pq
fs = pyarrow.hdfs.connect('my-namenode-url', driver='libhdfs3') # fill in namenode information
file_object = fs.open('single-row.parquet') # update for hdfs path of file
pq.read_metadata(file_object) # this works
parquet_file = pq.ParquetFile(file_object)
parquet_file.read_row_group(0) # throws error
```
I am working on writing a unit test for this. Note that I am using libhdfs3.
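One way to narrow this down might be to check the file's trailer independently of parquet-cpp. A Parquet file ends with a 4-byte little-endian metadata length followed by the 4-byte magic `PAR1`, and the failing 8-byte read under `ParseMetaData()` is the same size as that trailer, although the trace alone does not confirm that this is the read that failed. The sketch below is a hypothetical helper (`read_footer_length` is not part of pyarrow) and assumes a seekable binary file-like object:

```
import io
import struct

PARQUET_MAGIC = b"PAR1"
TRAILER_SIZE = 8  # 4-byte little-endian metadata length + 4-byte magic


def read_footer_length(f):
    """Return the metadata length stored in the last 8 bytes of a Parquet file.

    Hypothetical diagnostic helper; `f` is assumed to be a seekable
    binary file-like object.
    """
    f.seek(0, io.SEEK_END)
    file_size = f.tell()
    # Smallest possible file: leading magic plus the 8-byte trailer.
    if file_size < len(PARQUET_MAGIC) + TRAILER_SIZE:
        raise ValueError("file too small to be Parquet: %d bytes" % file_size)

    f.seek(file_size - TRAILER_SIZE)
    trailer = f.read(TRAILER_SIZE)
    if len(trailer) != TRAILER_SIZE:
        raise EOFError("short read: got %d of %d bytes" % (len(trailer), TRAILER_SIZE))

    metadata_len, magic = struct.unpack("<I4s", trailer)
    if magic != PARQUET_MAGIC:
        raise ValueError("bad trailing magic: %r" % magic)
    # The metadata must fit between the leading magic and the trailer.
    if metadata_len > file_size - TRAILER_SIZE - len(PARQUET_MAGIC):
        raise ValueError("metadata length %d exceeds file size" % metadata_len)
    return metadata_len


# Usage with an in-memory stand-in for a real file (10 bytes of fake metadata).
fake = io.BytesIO(b"PAR1" + b"x" * 10 + struct.pack("<I", 10) + b"PAR1")
print(read_footer_length(fake))
```

Running the same check on the object returned by `fs.open(...)`, both before and after the `pq.read_metadata` call, could show whether the trailer read itself fails on a handle that has already been used, or only inside `ParquetFile`.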