[HUDI-3091] Make simple index as the default hoodie.index.type - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.11.0
Component/s: index
Labels:
- pull-request-available

Description

When performing upserts on derived datasets, we often ran into OOM errors with the bloom index, so we switched the index type of all those datasets to simple to resolve the issue.

Some of the tables were non-partitioned, and the bloom index is not the right choice for such tables.

I'm proposing to make the simple index the default value of hoodie.index.type; on a case-by-case basis, folks can still choose the bloom index for the additional performance gains offered by bloom filters.

I agree that the performance will not be optimal, but for regular use cases the simple index would not break: it may give sub-optimal read/write performance, but it won't break any ingestion or derived jobs.
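To illustrate the per-table choice described above, here is a minimal sketch of building Hudi write options with an explicit `hoodie.index.type`. The helper name `hudi_upsert_options`, the table name, and the field names are hypothetical and only serve the example; the option keys are standard Hudi datasource write configs.

```python
# Sketch only: the helper, table name and field names are hypothetical.
SUPPORTED_INDEX_TYPES = {"SIMPLE", "BLOOM", "GLOBAL_SIMPLE", "GLOBAL_BLOOM"}


def hudi_upsert_options(table_name, record_key_field, precombine_field,
                        index_type="SIMPLE"):
    """Return a dict of Hudi write options for an upsert.

    index_type defaults to SIMPLE, mirroring the default proposed in this
    issue; callers can opt back into BLOOM for individual tables.
    """
    index_type = index_type.upper()
    if index_type not in SUPPORTED_INDEX_TYPES:
        raise ValueError(f"unsupported index type for this sketch: {index_type}")
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.index.type": index_type,
    }


# Proposed default: simple index.
default_opts = hudi_upsert_options("trips", "trip_id", "updated_at")
# Case-by-case override: bloom index for a table where it is known to help.
bloom_opts = hudi_upsert_options("trips", "trip_id", "updated_at",
                                 index_type="bloom")
```

With a Spark DataFrame `df`, such a dict would typically be passed along the lines of `df.write.format("hudi").options(**default_opts).mode("append").save(base_path)`, where `df` and `base_path` are assumed to exist.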

Tests to validate the flip:

Trigger some ingestions (either Spark datasource or DeltaStreamer) with record keys that have some timestamp characteristics.

Updates: 5 to 10% of records.

Dataset size: 100 GB.

Measure index lookup time for the bloom index versus the simple index.
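The test plan above could be scaffolded roughly as follows. This is a sketch under stated assumptions, not the validation that was actually run: `make_record_keys`, `pick_updates`, and `time_upsert` are hypothetical helpers, the key layout is just one way to give keys a timestamp prefix, and `run_upsert` stands in for whatever Spark datasource or DeltaStreamer job performs the write. It times the whole upsert call, so the index lookup time itself would still have to be read from the job's own metrics.

```python
import random
import time
from datetime import datetime, timedelta


def make_record_keys(n, start=datetime(2021, 12, 22)):
    """Generate n record keys with a timestamp prefix (hypothetical layout)."""
    keys = []
    for i in range(n):
        ts = (start + timedelta(seconds=i)).strftime("%Y%m%d%H%M%S")
        keys.append(f"{ts}_{i:06d}")
    return keys


def pick_updates(keys, fraction, seed=0):
    """Pick a reproducible random subset of keys to update (e.g. 5 to 10%)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    count = max(1, round(len(keys) * fraction))
    return rng.sample(keys, count)


def time_upsert(run_upsert, index_type, update_keys):
    """Return wall-clock seconds taken by run_upsert(index_type, update_keys)."""
    started = time.perf_counter()
    run_upsert(index_type, update_keys)
    return time.perf_counter() - started


keys = make_record_keys(1000)
updates = pick_updates(keys, 0.05)


def noop_upsert(index_type, update_keys):
    # Placeholder: a real run would upsert update_keys with the given index type.
    pass


timings = {t: time_upsert(noop_upsert, t, updates) for t in ("BLOOM", "SIMPLE")}
```

Comparing `timings["BLOOM"]` against `timings["SIMPLE"]` over several runs would give a first impression before digging into per-stage metrics.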

Issue Links

is duplicated by

HUDI-3278 Make Simple Index the default index type

Closed

links to

GitHub Pull Request #4659

People

Assignee: sivabalan narayanan

Reporter: Vinoth Govindarajan

Dates

Created: 22/Dec/21 00:22

Updated: 08/Feb/22 14:22

Resolved: 08/Feb/22 14:22
