Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Description
Hi,
We are successfully reading Parquet files block by block, but we are running into a JVM out-of-memory error in one edge case. Consider the following scenario:
The Parquet file has one column and one block, and is 10 GB.
Our JVM heap is 5 GB.
Is there any way to read such a file? Our implementation and the stack trace are below.
Stack trace:

Caused by: java.lang.OutOfMemoryError: Java heap space
    at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:778)
    at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:511)

Implementation:

try {
    ParquetMetadata readFooter = ParquetFileReader.readFooter(hfsConfig, path, ParquetMetadataConverter.NO_FILTER);
    MessageType schema = readFooter.getFileMetaData().getSchema();
    // size of the largest block in the file
    long a = readFooter.getBlocks().stream()
            .reduce(0L,
                    (left, right) -> left > right.getTotalByteSize() ? left : right.getTotalByteSize(),
                    (leftl, rightl) -> leftl > rightl ? leftl : rightl);
    for (BlockMetaData block : readFooter.getBlocks()) {
        try {
            fileReader = new ParquetFileReader(hfsConfig, readFooter.getFileMetaData(), path,
                    Collections.singletonList(block), schema.getColumns());
            PageReadStore pages;
            // exception gets thrown here on blocks larger than JVM memory
            while (null != (pages = fileReader.readNextRowGroup())) {
                final long rows = pages.getRowCount();
                final MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
                final RecordReader<Group> recordReader = columnIO.getRecordReader(pages, new GroupRecordConverter(schema));
                for (int i = 0; i < rows; i++) {
                    final Group group = recordReader.read();
                    int fieldCount = group.getType().getFieldCount();
                    for (int field = 0; field < fieldCount; field++) {
                        int valueCount = group.getFieldRepetitionCount(field);
                        Type fieldType = group.getType().getType(field);
                        String fieldName = fieldType.getName();
                        for (int index = 0; index < valueCount; index++) {
                            // Process data
                        }
                    }
                }
            }
        } catch (IOException e) {
            ...
        } finally {
            ...
        }
    }
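For reference, here is a minimal, self-contained version of the footer-only size check (the reduce over getTotalByteSize() above), compared against the JVM's max heap. The class name RowGroupSizeCheck and the comparison against Runtime.getRuntime().maxMemory() are just our own diagnostic, not part of the Parquet API; it only tells us up front that a row group will not fit, it does not avoid the OOM:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.format.converter.ParquetMetadataConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class RowGroupSizeCheck {

    // Total byte size of the largest row group (block), taken from the footer only;
    // no column data is read.
    static long largestRowGroupBytes(Configuration conf, Path path) throws IOException {
        ParquetMetadata footer = ParquetFileReader.readFooter(conf, path, ParquetMetadataConverter.NO_FILTER);
        long max = 0L;
        for (BlockMetaData block : footer.getBlocks()) {
            max = Math.max(max, block.getTotalByteSize());
        }
        return max;
    }

    public static void main(String[] args) throws IOException {
        long largest = largestRowGroupBytes(new Configuration(), new Path(args[0]));
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.printf("largest row group: %d bytes, max heap: %d bytes%n", largest, maxHeap);
        if (largest >= maxHeap) {
            // readNextRowGroup() materializes the whole row group, so this case fails for us
            System.out.println("row group is larger than the heap");
        }
    }
}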