[PARQUET-1237] Reading big texts causes OutOfMemory error. How to read text partially? - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.8.1
Fix Version/s: None
Component/s: parquet-avro
Labels: None

Description

I have a dataset with large strings (each record is about 15 MB) stored in Parquet.

When I try to open all the Parquet parts, I get an OutOfMemory exception.

How can I get only the header (the first 100 characters) of each string record without reading the whole record?

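import java.nio.ByteBuffer;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;

// Projection schema: request only the "idx" and "text" columns.
Schema avroProj = SchemaBuilder.builder()
    .record("proj").fields()
    .name("idx").type().nullable().longType().noDefault()
    .name("text").type().nullable().bytesType().noDefault()
    .endRecord();

Configuration conf = new Configuration();
AvroReadSupport.setRequestedProjection(conf, avroProj);

ParquetReader<GenericRecord> parquetReader = AvroParquetReader
    .<GenericRecord>builder(new Path(filePath))
    .withConf(conf)
    .build();

GenericRecord record = parquetReader.read(); // the record already holds the full text in RAM
Long idx = (Long) record.get("idx");
ByteBuffer rawText = (ByteBuffer) record.get("text");
String header = new String(rawText.array()).substring(0, 200);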
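
The last line of the snippet above decodes the entire value into a String before trimming it, so each record briefly exists twice in memory. The following is a minimal sketch of decoding only a bounded prefix instead. It assumes the text is UTF-8 encoded; the class name `HeaderPrefix` and the method `header` are hypothetical and not part of parquet-avro. It does not stop the reader from loading the full value, which is the open question in this issue; it only avoids the second full-size copy.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of parquet-avro.
final class HeaderPrefix {

    /**
     * Decodes at most maxChars UTF-16 code units from the start of buf.
     * Assumes the bytes are UTF-8; malformed input is replaced, not rejected.
     */
    static String header(ByteBuffer buf, int maxChars) {
        // duplicate() shares the bytes but has its own position and limit,
        // so the caller's buffer is left untouched. Unlike array(), it also
        // respects the buffer's offset and works for read-only buffers.
        ByteBuffer view = buf.duplicate();

        // The output buffer caps how much is decoded: decode() stops as soon
        // as it is full, no matter how many input bytes remain.
        CharBuffer out = CharBuffer.allocate(maxChars);

        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        decoder.decode(view, out, true);

        out.flip();
        return out.toString();
    }
}
```

With this helper, the last line of the snippet could become `String header = HeaderPrefix.header(rawText, 200);`. The work and the extra allocation are then proportional to the requested prefix rather than to the 15 MB value.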

People

Assignee: Unassigned

Reporter: Andrei Iatsuk

Votes: 0

Watchers: 1

Dates

Created: 26/Feb/18 10:58

Updated: 26/Feb/18 10:58