Apache Hudi / HUDI-3091

Make simple index the default hoodie.index.type


Details

    • Type: New Feature
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.11.0
    • Component/s: index

    Description

      When performing upserts on derived datasets, we often ran into OOM errors with the bloom index, so we switched the index type of all those datasets to simple to resolve the issue.

      Some of the tables were non-partitioned, and the bloom index is not the right choice for non-partitioned tables.

      I propose making the simple index the default value of hoodie.index.type. On a case-by-case basis, users can still choose the bloom index for the additional performance gains that bloom filters offer.

      I agree that performance will not be optimal with this default. For regular use cases, however, the simple index may give sub-optimal read/write performance, but it will not break ingestion or derived jobs.
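
      As a minimal sketch of what choosing the index per table could look like from the Spark datasource, the helper below builds writer options with hoodie.index.type set explicitly. The helper name, table name, and field names are hypothetical. The option keys are standard Hudi write configs, and SIMPLE and BLOOM are the two index types discussed in this issue.

```python
# Hypothetical helper: builds Hudi writer options with an explicit index type.
# Only the two index types discussed in this issue are accepted by this sketch.
ALLOWED_INDEX_TYPES = {"SIMPLE", "BLOOM"}


def hudi_upsert_options(table_name, record_key, precombine_field, index_type="SIMPLE"):
    """Return a dict of Hudi datasource options for an upsert."""
    index_type = index_type.upper()
    if index_type not in ALLOWED_INDEX_TYPES:
        raise ValueError(f"unsupported index type for this sketch: {index_type}")
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.index.type": index_type,
    }


# Table and field names below are placeholders.
opts = hudi_upsert_options("derived_table", "uuid", "ts")

# With a Spark DataFrame `df` and a target `base_path` (both assumed to exist):
# df.write.format("hudi").options(**opts).mode("append").save(base_path)
```

      A table that benefits from bloom filters would pass index_type="BLOOM" instead of relying on the default.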

      Tests to validate the flip:

      Trigger some ingestions (either Spark datasource or deltastreamer) with record keys that have timestamp characteristics.

      Updates: 5 to 10%.

      Dataset size: 100 GB.

      Measure index lookup time with the bloom index and with the simple index.
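
      The test plan above could be driven by a small harness like the following sketch. It generates record keys with a timestamp prefix, samples a fraction of them as updates, and times one caller-supplied upsert per index type. Everything here is an assumption for illustration: the function names, the key layout, and the run_upsert callable, which would wrap the actual Spark datasource or deltastreamer job. Wall-clock time of the whole upsert is only a proxy; the index lookup stage itself is better read from the Spark UI.

```python
import random
import time


def make_keys(n, start_ms=1_640_995_200_000, step_ms=1_000):
    # Keys carry an increasing epoch-millis prefix (hypothetical layout,
    # arbitrary start value) followed by a zero-padded sequence number.
    return [f"{start_ms + i * step_ms}_{i:08d}" for i in range(n)]


def sample_updates(keys, fraction, seed=42):
    # Pick `fraction` of the existing keys to update (the plan calls for 5 to 10%).
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    return rng.sample(keys, int(len(keys) * fraction))


def time_index_types(run_upsert, update_keys, index_types=("BLOOM", "SIMPLE")):
    # run_upsert(index_type, update_keys) is supplied by the caller and is
    # expected to run one upsert with hoodie.index.type set to index_type.
    timings = {}
    for index_type in index_types:
        started = time.perf_counter()
        run_upsert(index_type, update_keys)
        timings[index_type] = time.perf_counter() - started
    return timings


keys = make_keys(1_000)
updates = sample_updates(keys, 0.05)
# A no-op stands in for the real job so the sketch runs without Spark.
timings = time_index_types(lambda index_type, ks: None, updates)
```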


          People

            Assignee: sivabalan narayanan (shivnarayan)
            Reporter: Vinoth Govindarajan (vino)
            Votes: 0
            Watchers: 1

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated: 1h
                Remaining: 0h
                Logged: 2h

                Agile

                  Completed Sprints:
                  Cont' improve - 2021/01/24 ended 01/Feb/22
                  Cont' improve - 2021/01/31 ended 08/Feb/22
                  Cont' improve - 2022/02/07 ended 15/Feb/22