Details

    Description

      We want to benchmark HFile scan vs. seek, since we intend to use HFile for record indexing. HFile will be embedded inline in the Hudi log for indexing purposes.

      As part of this benchmarking, we want to find the point at which scan outperforms seek.
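      To make the two access patterns concrete, here is a minimal sketch that uses a sorted in-memory map (java.util.TreeMap) as a stand-in for an HFile. It does not use the HBase HFile reader API, and the names ScanVsSeek, lookupByScan and lookupBySeek are hypothetical. Scan walks the entries once in key order and merges them against the sorted lookup keys; seek issues one independent point lookup per key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: a TreeMap stands in for the sorted key-value layout of an HFile.
public class ScanVsSeek {

    // Scan: a single sequential pass over the store, merged against the sorted lookup keys.
    static List<String> lookupByScan(TreeMap<String, String> store, List<String> keys) {
        List<String> sortedKeys = new ArrayList<>(keys);
        Collections.sort(sortedKeys);
        List<String> values = new ArrayList<>();
        int i = 0;
        for (Map.Entry<String, String> entry : store.entrySet()) {
            // Skip lookup keys that sort before the current entry (they are not in the store).
            while (i < sortedKeys.size() && sortedKeys.get(i).compareTo(entry.getKey()) < 0) {
                i++;
            }
            if (i == sortedKeys.size()) {
                break; // every lookup key has been resolved, stop scanning early
            }
            // Emit the value once per matching lookup key.
            while (i < sortedKeys.size() && sortedKeys.get(i).equals(entry.getKey())) {
                values.add(entry.getValue());
                i++;
            }
        }
        return values;
    }

    // Seek: one point lookup per key; cost grows with the number of keys looked up.
    static List<String> lookupBySeek(TreeMap<String, String> store, List<String> keys) {
        List<String> sortedKeys = new ArrayList<>(keys);
        Collections.sort(sortedKeys); // sorted only so both methods return values in the same order
        List<String> values = new ArrayList<>();
        for (String key : sortedKeys) {
            String value = store.get(key);
            if (value != null) {
                values.add(value);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        TreeMap<String, String> store = new TreeMap<>();
        for (int i = 0; i < 10; i++) {
            store.put(String.format("key-%03d", i), "v" + i);
        }
        List<String> keys = Arrays.asList("key-007", "key-002", "missing");
        System.out.println(lookupByScan(store, keys));
        System.out.println(lookupBySeek(store, keys));
    }
}
```

      Scan cost is roughly fixed by the size of the store, while seek cost grows with the number of keys looked up, which is why a crossover point is expected as keysToRead increases.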

      This is our experiment setup:

      keysToRead = number of keys to look up; varies across experiment runs (100k, 200k, 500k, 1M)

      N = number of iterations

      1M entries were written to a single HFile as key-value pairs.
      The keys were also stored in a separate file (key_file).
      keyList = read all keys from key_file
      for N iterations
      {
          shuffle keyList
          trim the list to keysToRead keys
          start timer
          run HFile read benchmark (scan or seek)
          end timer
      }
      compute the average of all captured timings
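      The loop above can be sketched in Java roughly as follows. This is not the code from the linked commit; the class name ReadBenchmarkHarness, the method averageNanos and the Consumer-based readBenchmark parameter are hypothetical, and the actual HFile scan or seek call is left to the caller. The sketch assumes each iteration samples from the full key list, so trimming is done on a copy rather than in place.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.Consumer;

// Illustration only: a generic timing harness; the read under test is supplied by the caller.
public class ReadBenchmarkHarness {

    // Runs `iterations` timed rounds and returns the average elapsed time in nanoseconds.
    static double averageNanos(List<String> allKeys, int keysToRead, int iterations,
                               Consumer<List<String>> readBenchmark, Random rng) {
        if (iterations <= 0) {
            throw new IllegalArgumentException("iterations must be positive");
        }
        if (keysToRead < 0 || keysToRead > allKeys.size()) {
            throw new IllegalArgumentException("keysToRead must be between 0 and the number of keys");
        }
        // Work on a copy so the caller's list is left untouched.
        List<String> keyList = new ArrayList<>(allKeys);
        long totalNanos = 0L;
        for (int n = 0; n < iterations; n++) {
            Collections.shuffle(keyList, rng);
            // Trim to keysToRead on a copy, so the next iteration still shuffles the full list.
            List<String> sample = new ArrayList<>(keyList.subList(0, keysToRead));
            long start = System.nanoTime();          // start timer
            readBenchmark.accept(sample);            // scan or seek, provided by the caller
            totalNanos += System.nanoTime() - start; // end timer
        }
        return (double) totalNanos / iterations;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            keys.add("key-" + i);
        }
        // Placeholder workload; a real run would call the HFile scan or seek path here.
        double avg = averageNanos(keys, 100, 5, sample -> sample.forEach(String::hashCode), new Random(42));
        System.out.println("average nanos per iteration: " + avg);
    }
}
```

      Shuffling and trimming happen outside the timed region, so only the read itself is measured.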

      Result:

      With optimized configs, scan starts to outperform seek at roughly 350k to 400k lookups out of 1M entries.

      Results can be found in the attached HFile benchmark.xlsx.

      Source for benchmarking can be found here: 

      https://github.com/nsivabalan/hudi/commit/94bef5ded3d70308e52b98e06b41e2cb999b5301

      Attachments

        1. HFile benchmark_withS3.xlsx
          25 kB
          sivabalan narayanan
        2. HFile benchmark.xlsx
          15 kB
          sivabalan narayanan
        3. Screen Shot 2020-01-03 at 6.44.25 PM.png
          48 kB
          sivabalan narayanan
        4. Screen Shot 2020-03-09 at 12.22.54 AM.png
          22 kB
          sivabalan narayanan

            People

              shivnarayan sivabalan narayanan