[HUDI-432] Benchmark HFile for scan vs seek - ASF JIRA

Details

Type: Sub-task
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.6.0
Component/s: performance, storage-management
Labels: None

Description

We want to benchmark HFile scan vs. seek, since we intend to use HFile for record indexing. The HFile will be embedded inline in the Hudi log for indexing purposes.

As part of the benchmark, we want to find the point at which scan outperforms seek.

This is our experiment setup.

keysToRead = number of keys to look up // varies across experiment runs: 100k, 200k, 500k, 1M

N = number of iterations

1M entries were written to a single HFile as key-value pairs.
The keys were also stored in a separate file (key_file).
keyList = read all keys from key_file
for N iterations
{
    shuffle keyList
    trim the list to keysToRead
    start timer
    run the HFile read benchmark (scan or seek)
    end timer
}
compute the average of all captured timings
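
The loop above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the benchmark linked in this ticket: a sorted in-memory list stands in for the HFile, "seek" is modeled as one binary search per key, and "scan" as a single sequential pass over every entry. The names `build_store`, `lookup_by_seek`, `lookup_by_scan`, and `benchmark` are hypothetical, and in-memory timings will not reproduce the disk or S3 behavior of a real HFile.

```python
import bisect
import random
import time


def build_store(num_entries):
    """Hypothetical stand-in for the HFile: sorted keys with parallel values."""
    keys = [f"key-{i:07d}" for i in range(num_entries)]
    values = [f"value-{i}" for i in range(num_entries)]
    return keys, values


def lookup_by_seek(keys, values, wanted):
    """Seek (assumed model): one binary search per requested key."""
    found = {}
    for k in wanted:
        i = bisect.bisect_left(keys, k)
        if i < len(keys) and keys[i] == k:
            found[k] = values[i]
    return found


def lookup_by_scan(keys, values, wanted):
    """Scan (assumed model): one pass over every entry, keeping requested keys."""
    wanted_set = set(wanted)
    return {k: v for k, v in zip(keys, values) if k in wanted_set}


def benchmark(lookup, keys, values, keys_to_read, iterations, seed=0):
    """Average wall-clock time of `lookup` over `iterations` shuffled key samples."""
    rng = random.Random(seed)
    key_list = list(keys)  # plays the role of the keys read from key_file
    total = 0.0
    for _ in range(iterations):
        rng.shuffle(key_list)
        wanted = key_list[:keys_to_read]  # trim the list to keysToRead
        start = time.perf_counter()
        result = lookup(keys, values, wanted)
        total += time.perf_counter() - start
        assert len(result) == keys_to_read  # every requested key must be found
    return total / iterations


if __name__ == "__main__":
    keys, values = build_store(100_000)
    for n in (10_000, 50_000):
        seek = benchmark(lookup_by_seek, keys, values, n, iterations=3)
        scan = benchmark(lookup_by_scan, keys, values, n, iterations=3)
        print(f"keysToRead={n}: seek={seek:.4f}s scan={scan:.4f}s")
```

Only the lookup itself sits between the timer calls, so shuffling and trimming do not count toward the measured time, matching the placement of the timer in the loop above.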

Result:

With optimized configs, scan starts to outperform seek at roughly 350k to 400k lookups out of 1M entries.

Results can be found here: HFile benchmark.xlsx

Source for benchmarking can be found here:

https://github.com/nsivabalan/hudi/commit/94bef5ded3d70308e52b98e06b41e2cb999b5301

Attachments

- HFile benchmark_withS3.xlsx (09/Mar/20 07:23, 25 kB, sivabalan narayanan)
- HFile benchmark.xlsx (04/Jan/20 02:47, 15 kB, sivabalan narayanan)
- Screen Shot 2020-01-03 at 6.44.25 PM.png (04/Jan/20 02:45, 48 kB, sivabalan narayanan)
- Screen Shot 2020-03-09 at 12.22.54 AM.png (09/Mar/20 07:26, 22 kB, sivabalan narayanan)

Issue Links

is depended upon by: HUDI-466 [Umbrella] Record level, global low-latency index implementation (Open)

People

Assignee: sivabalan narayanan

Reporter: sivabalan narayanan

Votes: 0

Watchers: 4

Dates

Created: 17/Dec/19 23:15

Updated: 19/Jun/23 03:26

Resolved: 10/May/20 13:27