Accumulo / ACCUMULO-4314

Use statistics to choose better keys for RFile index

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6.6, 1.7.2, 1.8.0
    • Component/s: None
    • Labels: None

      Description

      The commit for ACCUMULO-1124 makes two changes:

      • Generates shorter keys, which may not exist in the data, to place in the RFile index.
      • Uses statistics to make better choices about which keys to place in the index. This change looks for keys whose size is average or below and excludes large keys (keys more than 3 standard deviations above the average).

      The change to generate shorter keys cannot be made in 1.7.X and 1.6.X, because it would generate RFiles that may not work properly with older 1.6 and 1.7 versions. However, the change to use statistics to pick better keys could be made in 1.6 and 1.7.
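
The statistics-based selection described above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the class and function names are hypothetical, and it assumes the statistic is key size in bytes, tracked as a running mean and standard deviation, with "large" meaning more than 3 standard deviations above the mean.

```python
import math


class KeySizeStats:
    """Running mean and standard deviation of key sizes (Welford's method).

    Hypothetical helper for illustration; not an Accumulo class.
    """

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the mean

    def add(self, size):
        self.count += 1
        delta = size - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (size - self.mean)

    def std_dev(self):
        # Population standard deviation; 0.0 when no keys have been seen.
        return math.sqrt(self._m2 / self.count) if self.count else 0.0


def is_preferred(size, stats):
    """Assumed rule: average-sized or smaller keys are good index candidates."""
    return size <= stats.mean


def is_excluded(size, stats, num_std_devs=3.0):
    """Assumed rule: keys far above the mean are kept out of the index."""
    return size > stats.mean + num_std_devs * stats.std_dev()


# 99 small keys and one very large key.
stats = KeySizeStats()
for key_size in [10] * 99 + [1000]:
    stats.add(key_size)
```

With these sample sizes, the 10-byte keys fall at or below the mean and the single 1000-byte key lies well beyond the 3 standard deviation cutoff, so a writer following this rule would skip it when choosing an index entry.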

          Activity

          kturner Keith Turner added a comment -

          Another important change to note is the depth of the index tree. In the original file the tree had 4 levels. After running with these changes it has only 2. Having fewer levels is not just a function of the total index size: larger keys tend to make the index tree deeper, so keeping them out of the index avoids this problem.
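
To see why key size affects tree depth, consider a deliberately simplified model of a multi-level index. This is a sketch under stated assumptions, not the RFile layout: index blocks are filled up to a hypothetical `block_size` budget, each full block contributes its last key as an entry in the level above, and every block holds at least two entries.

```python
def index_levels(key_sizes, block_size):
    """Count the levels of a simplified multi-level index.

    key_sizes:  sizes in bytes of the keys chosen as index entries
    block_size: assumed byte budget of one index block (hypothetical)
    """
    sizes = list(key_sizes)
    levels = 0
    while True:
        levels += 1
        parent_entries = []
        used = 0
        count = 0
        last = None
        for size in sizes:
            # Close the current block once it is over budget, but always
            # keep at least two entries per block so each level shrinks.
            if count >= 2 and used + size > block_size:
                parent_entries.append(last)
                used = 0
                count = 0
            used += size
            count += 1
            last = size
        parent_entries.append(last)  # final, possibly partial, block
        if len(parent_entries) == 1:
            return levels  # a single root block remains
        sizes = parent_entries


small = index_levels([10] * 1000, block_size=1000)
large = index_levels([400] * 1000, block_size=1000)
```

In this model the same number of index entries produces a shallow tree when the keys are small and a much deeper one when they are large, because fewer large keys fit in each block and those large keys are then repeated in the levels above.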

          kturner Keith Turner added a comment -

          I ran a test with the 1.7 changes for this issue, using the same file I used to test the changes for ACCUMULO-1124. The total index size went from 6.9M to 3.6M.

          $ accumulo rfile-info /accumulo/tables/2/default_tablet/A0000005.rf
          Reading file: hdfs://localhost:10000/accumulo/tables/2/default_tablet/A0000005.rf
          Locality group         : <DEFAULT>
              Start block          : 0
              Num   blocks         : 20,041
              Index level 1        : 4,140 bytes  1 blocks
              Index level 0        : 3,620,079 bytes  14 blocks
              First key            : um:d:385:%03;%01;10.30.170.244>>o>/2954%af; data:current [] 4611686019157309597 false
              Last key             : um:d:395:%03;%01;%ff; com.facebook>.www>s>/dialog/feed?app_id=90376669494... TRUNCATED data:current [] -6917529026891043602 false
              Num entries          : 24,299,468
              Column families      : [data]
          
          Meta block     : BCFile.index
                Raw size             : 4 bytes
                Compressed size      : 12 bytes
                Compression type     : gz
          
          Meta block     : RFile.index
                Raw size             : 4,258 bytes
                Compressed size      : 2,154 bytes
                Compression type     : gz
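
The total index size quoted above can be checked by summing the "Index level" lines of the report. The snippet below is a small illustrative helper, not part of Accumulo: the regular expression and function name are assumptions based only on the output format shown here.

```python
import re

# Matches lines such as "Index level 0 : 3,620,079 bytes  14 blocks".
INDEX_LINE = re.compile(
    r"Index level\s+(\d+)\s*:\s*([\d,]+) bytes\s+([\d,]+) blocks"
)


def index_level_bytes(report):
    """Return {level: bytes} parsed from rfile-info style text."""
    levels = {}
    for match in INDEX_LINE.finditer(report):
        level = int(match.group(1))
        levels[level] = int(match.group(2).replace(",", ""))
    return levels


sample = """
    Index level 1        : 4,140 bytes  1 blocks
    Index level 0        : 3,620,079 bytes  14 blocks
"""

levels = index_level_bytes(sample)
total = sum(levels.values())
```

For the two lines above, the sum is 3,624,219 bytes across 2 levels, which matches the roughly 3.6M figure and the 2-level tree mentioned in the comments.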
          
          kturner Keith Turner added a comment -

          I just pushed the changes. Unfortunately, I noticed that I put the wrong issue in the commit message: I used ACCUMULO-4318 (another issue I am currently working on). The commit in 1.6 is 63a8a5d. The merge commit in 1.7 is d33b2a0.

          mdrob Mike Drob added a comment -

          Keith Turner - are you still on track to get this done soon? I see that ACCUMULO-1124 has been completed.


            People

            • Assignee: Keith Turner (kturner)
            • Reporter: Keith Turner (kturner)
            • Votes: 0
            • Watchers: 2
