Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version: 1.8.1
Description
I have a dataset with large strings (each record is about 15 MB) stored in Parquet.
When I try to open all the Parquet parts, I get an OutOfMemory exception.
How can I read only the header (the first 100 characters) of each string record without reading the whole record?
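import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;

// Projection schema: request only the "idx" and "text" columns
Schema avroProj = SchemaBuilder.builder()
    .record("proj").fields()
    .name("idx").type().nullable().longType().noDefault()
    .name("text").type().nullable().bytesType().noDefault()
    .endRecord();
Configuration conf = new Configuration();
AvroReadSupport.setRequestedProjection(conf, avroProj);
ParquetReader<GenericRecord> parquetReader = AvroParquetReader
    .<GenericRecord>builder(new Path(filePath))
    .withConf(conf)
    .build();
GenericRecord record = parquetReader.read(); // the record already has the full text in RAM
Long idx = (Long) record.get("idx");
ByteBuffer rawText = (ByteBuffer) record.get("text");
// Decode via duplicate() so the buffer's offset and limit are respected (UTF-8 assumed)
String text = StandardCharsets.UTF_8.decode(rawText.duplicate()).toString();
String header = text.substring(0, Math.min(100, text.length()));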
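
A side note on the last step of the snippet: it decodes the entire buffer into a String before cutting the header out, which creates a second full-size copy of each 15 MB value. The sketch below decodes only the first few characters instead. It assumes the bytes are UTF-8, and `HeaderDecoder` and `decodePrefix` are hypothetical names used for illustration. It does not address the actual question, since the full value is still in RAM after `parquetReader.read()`; it only avoids the extra copy on the decoding side.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class HeaderDecoder {
    /**
     * Decodes at most maxChars characters from the start of buf.
     * Assumption: the bytes are UTF-8 encoded text.
     */
    static String decodePrefix(ByteBuffer buf, int maxChars) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        // The output buffer caps how much is decoded: decoding stops
        // as soon as it is full or the input is exhausted.
        CharBuffer out = CharBuffer.allocate(maxChars);
        // duplicate() leaves the caller's position and limit untouched
        decoder.decode(buf.duplicate(), out, true);
        out.flip();
        return out.toString();
    }
}
```

With the names from the snippet above, the call would be `String header = HeaderDecoder.decodePrefix(rawText, 100);`. If the character at the cut-off point needs a surrogate pair that does not fit, the result is one character shorter than requested.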