  Geode / GEODE-10401

Oplog recovery takes too long due to fault in fastutil library


Details

    Description

      Delete operations in a .drf file contain only the OplogEntryID. During recovery, the server reads each OplogEntryID (byte by byte) and stores it in a hash set for later use when recovering .crf files. Two set types are used: IntOpenHashSet and LongOpenHashSet. An OplogEntryID that fits in an integer is stored in an IntOpenHashSet, and one that needs a long integer is stored in a LongOpenHashSet, probably for memory and performance reasons. OplogEntryID starts at zero and increments over time.
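The routing described above can be sketched as follows. This is only an illustration: it uses java.util.HashSet as a stand-in for the fastutil sets, the class and method names are hypothetical, and the int/long boundary (Integer.MAX_VALUE) is an assumption, since the exact cutoff Geode uses is not stated here.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the set of deleted OplogEntryIDs read from a .drf file.
class DeletedIds {
    private final Set<Integer> ints = new HashSet<>(); // ids that fit in an int
    private final Set<Long> longs = new HashSet<>();   // everything else

    // Route the id to the narrower set when it fits (assumed boundary).
    void add(long id) {
        if (id >= 0 && id <= Integer.MAX_VALUE) {
            ints.add((int) id);
        } else {
            longs.add(id);
        }
    }

    // Used later, during .crf recovery, to skip entries that were deleted.
    boolean contains(long id) {
        if (id >= 0 && id <= Integer.MAX_VALUE) {
            return ints.contains((int) id);
        }
        return longs.contains(id);
    }

    int intCount() { return ints.size(); }
    int longCount() { return longs.size(); }
}
```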

      We have observed in the logs that more than 4 minutes (sometimes even more) pass between the warning ("There is a large number of deleted entries") and the log entry preceding it.

      {"timestamp":"2022-06-14T21:41:43.772+08:00","severity":"info","message":"Recovering oplog#271 /opt/dbservice/data/datastore/BACKUPdataDiskStore_271.drf for disk store dataDiskStore.","metadata":
      {"timestamp":"2022-06-14T21:46:02.152+08:00","severity":"warning","message":"There is a large number of deleted entries within the disk-store, please execute an offline compaction.","metadata":
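For reference, the gap between the two log entries above can be computed directly from their timestamps. The snippet below is a small sketch using java.time; the class and method names are hypothetical.

```java
import java.time.Duration;
import java.time.OffsetDateTime;

class LogGap {
    // Milliseconds elapsed between two ISO-8601 timestamps with a UTC offset.
    static long gapMillis(String earlier, String later) {
        OffsetDateTime a = OffsetDateTime.parse(earlier);
        OffsetDateTime b = OffsetDateTime.parse(later);
        return Duration.between(a, b).toMillis();
    }

    public static void main(String[] args) {
        long ms = gapMillis("2022-06-14T21:41:43.772+08:00",
                            "2022-06-14T21:46:02.152+08:00");
        // Prints the gap as minutes and seconds.
        System.out.printf("%d min %.3f s%n", ms / 60000, (ms % 60000) / 1000.0);
    }
}
```

For the two entries shown, the gap is 258380 ms, i.e. 4 minutes 18.380 seconds.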
      

      When the above warning is logged, the limit of 805306401 entries in the IntOpenHashSet has been reached. In that case, the server rolls over to a new IntOpenHashSet, where the warning and the delay can occur again.

      The problem is that, due to a fault in the fastutil dependency (IntOpenHashSet and LongOpenHashSet), unnecessary rehashing happens multiple times before the max size is reached. From 805306368 entries onward, every new entry triggers a rehash until the max size is reached. This rehashing adds several minutes to .drf oplog recovery but achieves nothing, because the max size has already been reached.
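One way to see where the two numbers (805306368 and 805306401) could come from is to replay the capacity arithmetic in isolation. The sketch below does not call fastutil. It assumes that the table size is the next power of two of ceil(expected / loadFactor), that the division is done in float precision, that the table is capped at 2^30 slots, and that the load factor is 0.75. These are assumptions about the library's internals, not something stated in this issue, and all names are hypothetical.

```java
class RehashWindow {
    static final int MAX_TABLE = 1 << 30;     // assumed cap on the backing array
    static final float LOAD_FACTOR = 0.75f;   // assumed default load factor

    // Smallest power of two >= x.
    static long nextPowerOfTwo(long x) {
        if (x <= 1) return 1;
        return Long.highestOneBit(x - 1) << 1;
    }

    // Table size requested for 'expected' elements.
    // The int / float division is deliberately left in float precision.
    static long tableSizeFor(int expected) {
        return Math.max(2, nextPowerOfTwo((long) Math.ceil(expected / LOAD_FACTOR)));
    }

    // Number of elements a table of the given size holds before it wants to grow.
    static int maxFill(int tableSize) {
        return Math.min((int) Math.ceil(tableSize * LOAD_FACTOR), tableSize - 1);
    }

    // First 'expected' value at or above 'from' whose table size exceeds the cap.
    static int firstRejected(int from) {
        int expected = from;
        while (tableSizeFor(expected) <= MAX_TABLE) {
            expected++;
        }
        return expected;
    }

    public static void main(String[] args) {
        int fill = maxFill(MAX_TABLE);
        System.out.println("fill threshold at max table size: " + fill);
        System.out.println("first size rejected as too large:  " + firstRejected(fill));
    }
}
```

Under these assumptions the sketch reports 805306368 as the fill threshold and 805306401 as the first rejected size, which lines up with the numbers quoted above. Because float rounding maps every size in between back to the same 2^30 table, a rehash triggered in that window would copy the whole table without growing it.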

          People

            jvarenina Jakov Varenina