Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Description
Recently a user ran into the following failure:
Caused by: java.lang.NullPointerException
    at java.lang.System.arraycopy(Native Method)
    at org.apache.hadoop.hive.ql.io.orc.DynamicByteArray.add(DynamicByteArray.java:115)
    at org.apache.hadoop.hive.ql.io.orc.StringRedBlackTree.addNewKey(StringRedBlackTree.java:48)
    at org.apache.hadoop.hive.ql.io.orc.StringRedBlackTree.add(StringRedBlackTree.java:55)
    at org.apache.hadoop.hive.ql.io.orc.WriterImpl$StringTreeWriter.write(WriterImpl.java:1250)
    at org.apache.hadoop.hive.ql.io.orc.WriterImpl$StructTreeWriter.write(WriterImpl.java:1797)
    at org.apache.hadoop.hive.ql.io.orc.WriterImpl.addRow(WriterImpl.java:2469)
    at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.write(OrcOutputFormat.java:86)
    at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:753)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:838)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:838)
    at org.apache.hadoop.hive.ql.exec.FilterOperator.process(FilterOperator.java:122)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:838)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:110)
    at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:165)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:536)
    ... 18 more
I tracked this down to the following in DynamicByteArray.java, which is being used to create the dictionary for a particular column:
private int length;
Because length is a signed 32-bit int, this has the side effect of capping the memory available for the dictionary at 2 GB; once that limit is exceeded, the counter overflows and the failure above is the result.
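To make the failure mode concrete, here is a minimal sketch (a hypothetical class, not the actual Hive code) of how a signed 32-bit length counter wraps negative once the accumulated bytes pass Integer.MAX_VALUE:

public class IntLengthOverflowSketch {
  // Same declaration as in DynamicByteArray.java: a signed 32-bit counter.
  private int length;

  // Simplified stand-in for adding a value's bytes to the buffer.
  void add(int valueSize) {
    length += valueSize;  // wraps negative past Integer.MAX_VALUE (~2 GB)
  }

  public static void main(String[] args) {
    IntLengthOverflowSketch sketch = new IntLengthOverflowSketch();
    sketch.length = Integer.MAX_VALUE - 10;  // pretend ~2 GB is already buffered
    sketch.add(100);
    // Prints -2147483559: a negative length that breaks the downstream
    // chunk bookkeeping and surfaces as the NullPointerException above.
    System.out.println(sketch.length);
  }
}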
Given the size of the column values in this use case, the user is exceeding that 2 GB limit. There should probably be a heuristic that bails out of dictionary creation early, so the limit is never reached; with the volume of distinct data required to hit it, a dictionary is unlikely to be useful anyway. A rough sketch of such a heuristic follows.
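As an illustration only (hypothetical names and threshold, not an existing Hive API), the writer could track the bytes added to the dictionary and fall back to direct encoding once a budget far below 2 GB is exceeded:

class DictionarySizeGuard {
  // Hypothetical budget: a dictionary approaching 2 GB of distinct keys is
  // very unlikely to compress well, so give up long before that.
  private static final long MAX_DICTIONARY_BYTES = 64L * 1024 * 1024;

  private long dictionaryBytes = 0;
  private boolean useDictionary = true;

  // Called each time a new key is added; returns false once the writer
  // should abandon the dictionary and switch to direct encoding.
  boolean onKeyAdded(int keyLength) {
    if (useDictionary) {
      dictionaryBytes += keyLength;
      if (dictionaryBytes > MAX_DICTIONARY_BYTES) {
        useDictionary = false;
      }
    }
    return useDictionary;
  }
}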