Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.8.0
    • Component/s: None
    • Labels: None

      Description

      I noticed HBASE-7845 and it seems like something we could do in RFile, too.

      Instead of putting the whole key in the index, you put in enough of the key to get the reader to the beginning of the block.
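
      A minimal sketch of the idea on raw byte arrays, not the RFile implementation. It assumes an index entry only has to sort after the last key of the previous block and no later than the first key of the next block; the class and method names are hypothetical, and real Accumulo keys have several fields (row, column, timestamp) that this ignores.

      ```java
      import java.nio.charset.StandardCharsets;
      import java.util.Arrays;

      class IndexKeyShortener {
          /**
           * Returns the shortest prefix s of nextFirst such that
           * prevLast < s <= nextFirst in unsigned lexicographic byte order.
           * Requires prevLast to sort strictly before nextFirst.
           */
          static byte[] shortestSeparator(byte[] prevLast, byte[] nextFirst) {
              if (Arrays.compareUnsigned(prevLast, nextFirst) >= 0) {
                  throw new IllegalArgumentException("prevLast must sort strictly before nextFirst");
              }
              // Index of the first differing byte, or prevLast.length when
              // prevLast is a proper prefix of nextFirst.
              int diff = Arrays.mismatch(prevLast, nextFirst);
              // Shared prefix plus the one byte that makes the result sort after prevLast.
              return Arrays.copyOf(nextFirst, diff + 1);
          }

          public static void main(String[] args) {
              byte[] prevLast = "com.example>.www>/about".getBytes(StandardCharsets.UTF_8);
              byte[] nextFirst = "com.facebook>.www>/dialog/feed".getBytes(StandardCharsets.UTF_8);
              byte[] sep = shortestSeparator(prevLast, nextFirst);
              // The separator is 5 bytes long instead of 30.
              System.out.println(new String(sep, StandardCharsets.UTF_8));
          }
      }
      ```

      The separator never appears in the data itself, so a reader has to treat index keys purely as seek targets rather than as keys it expects to find in a block.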

          Activity

          Keith Turner added a comment -

          I ran Fluo's Webindex example on EC2 for a long period and afterwards inspected some of the RFiles. Some of them had larger indexes than I expected. I suspect making the change described in this ticket would reduce the index size.

          Below is info for an RFile whose keys contain URLs from web pages. I am going to experiment with generating shorter keys in the index for this file. This file was generated using 64K data blocks and 256K index blocks.

          [centos@leader1 ~]$ accumulo rfile-info  --histogram /accumulo/tables/7/t-0003uq7/A000rxoi.rf
          2016-05-16 16:48:38,914 [rfile.PrintInfo] WARN : Attempting to find file across filesystems. Consider providing URI instead of path
          Reading file: hdfs://leader1:10000/accumulo/tables/7/t-0003uq7/A000rxoi.rf
          Locality group         : notify
          	Start block          : 0
          	Num   blocks         : 0
          	Index level 0        : 0 bytes  1 blocks
          	First key            : null
          	Last key             : null
          	Num entries          : 0
          	Column families      : [ntfy]
          Locality group         : <DEFAULT>
          	Start block          : 0
          	Num   blocks         : 21,818
          	Index level 3        : 120,581 bytes  1 blocks
          	Index level 2        : 451,008 bytes  2 blocks
          	Index level 1        : 714,687 bytes  3 blocks
          	Index level 0        : 6,915,137 bytes  25 blocks
          	First key            : um:d:385:%03;%01;10.30.170.244>>o>/2954%af; data:current [] 4611686019157309597 false
          	Last key             : um:d:395:%03;%01;%ff; com.facebook>.www>s>/dialog/feed?app_id=90376669494... TRUNCATED data:current [] -6917529026891043602 false
          	Num entries          : 24,299,468
          	Column families      : [data]
          
          Meta block     : BCFile.index
                Raw size             : 4 bytes
                Compressed size      : 12 bytes
                Compression type     : gz
          
          Meta block     : RFile.index
                Raw size             : 120,754 bytes
                Compressed size      : 21,719 bytes
                Compression type     : gz
          
          
          Up to size      count      %-age
                   10 :    9292962  22.56%
                  100 :   14947371  74.88%
                 1000 :      59017   2.45%
                10000 :        112   0.07%
               100000 :          6   0.04%
              1000000 :          0   0.00%
             10000000 :          0   0.00%
            100000000 :          0   0.00%
           1000000000 :          0   0.00%
          10000000000 :          0   0.00%
          
          Keith Turner added a comment -

          I experimented with shortening keys in the index and that gave some nice improvements, but not as much as I expected. I realized that even with those changes, bad keys were still being placed in the index. I added code to keep statistics on key sizes and used those statistics to try to select keys that were <= AVG(keySize). I also excluded keys that were too big (more than 3 standard deviations above the mean). With the key shortening and statistics changes I was able to reduce the index size for the file in my previous comment to the sizes shown below.

          RFile Version            : 8
          
          Locality group           : <DEFAULT>
          	Num   blocks           : 21,758
          	Index level 1          : 3,048 bytes  1 blocks
          	Index level 0          : 1,873,885 bytes  8 blocks
          	First key              : um:d:385:%03;%01;10.30.170.244>>o>/2954%af; data:current [] 4611686019157309597 false
          	Last key               : um:d:395:%03;%01;%ff; com.facebook>.www>s>/dialog/feed?app_id=90376669494... TRUNCATED data:current [] -6917529026891043602 false
          	Num entries            : 24,299,468
          	Column families        : [data]
          
          Meta block     : BCFile.index
                Raw size             : 4 bytes
                Compressed size      : 12 bytes
                Compression type     : gz
          
          Meta block     : RFile.index
                Raw size             : 3,163 bytes
                Compressed size      : 1,515 bytes
                Compression type     : gz
          

          At first I thought I could make these changes in 1.6 and 1.7. However, while working on this I realized the key shortening change is a breaking change, in that older RFile code would not be able to handle keys in the index that do not exist in the data. The changes to use statistics to choose better keys could be made in 1.6 and 1.7.
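
          The size-based selection described in this comment can be sketched as a running-statistics tracker. This is an illustration under assumptions, not the code from the PR: the class and method names are hypothetical, the running mean and deviation use Welford's online update, and how the writer decides where to close a block is not shown.

          ```java
          class KeySizeStats {
              private long count;
              private double mean;
              private double m2; // sum of squared deviations from the running mean

              /** Records one key size using Welford's online update. */
              void add(int keySize) {
                  count++;
                  double delta = keySize - mean;
                  mean += delta / count;
                  m2 += delta * (keySize - mean);
              }

              double mean() {
                  return mean;
              }

              /** Population standard deviation of the key sizes seen so far. */
              double stddev() {
                  return count > 1 ? Math.sqrt(m2 / count) : 0.0;
              }

              /** True if the key is no larger than the average key seen so far. */
              boolean isPreferred(int keySize) {
                  return count > 0 && keySize <= mean;
              }

              /** True if the key is more than three standard deviations above the mean. */
              boolean isTooBig(int keySize) {
                  return count > 1 && keySize > mean + 3 * stddev();
              }

              public static void main(String[] args) {
                  KeySizeStats stats = new KeySizeStats();
                  for (int i = 0; i < 50; i++) {
                      stats.add(40);
                      stats.add(60);
                  }
                  System.out.println("preferred(45) = " + stats.isPreferred(45));
                  System.out.println("tooBig(330380) = " + stats.isTooBig(330380));
              }
          }
          ```

          A writer could consult `isPreferred` when a data block is close to full and `isTooBig` to reject outliers as index keys; both thresholds here simply mirror the ones named in the comment.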

          Josh Elser added a comment -

          "I experimented with shortening keys in the index and that gave some nice improvements, but not as much as I expected. I realized that even with those changes, bad keys were still being placed in the index. I added code to keep statistics on key sizes and used those statistics to try to select keys that were <= AVG(keySize). I also excluded keys that were too big (more than 3 standard deviations above the mean)."

          I had the thought "how would we determine when index size is efficient" (both for evaluating the success of this change and for identifying perf issues in the future). Did you give any thought to how we could expose this information more easily? Maybe we include some extra information in the file entry in metadata so that the master/monitor could easily aggregate/report on file statistics? Not suggesting it needs to happen now, but wondering about your thoughts (since I assume you were doing all this investigation by hand).

          Keith Turner added a comment -

          One thing I thought about but did not get to was making rfile-info print some stats about the index. We can already calculate the average key size with the info that rfile-info prints out (using num blocks and total index size). For the histogram option, we could print stats and a histogram for both the index and all data. Having the histogram information plus stats for all keys and index keys would be really nice for comparing the index to all of the data in the file.

          I suspect that before this change larger keys had a higher chance of ending up in the index. Before this change, when a data block exceeded the size threshold, the code took the last key in the data block and put it in the index. Larger keys were more likely to push data blocks over the threshold. Making rfile-info print out these index vs data stats would show this for older files. Maybe I can add that to rfile-info in the PR.
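
          The back-of-the-envelope calculation mentioned above can be sketched as follows. It assumes one level 0 index entry per data block and divides the level 0 index size by the block count; the result is bytes per index entry, which presumably includes some per-entry overhead beyond the key itself, so treat it as an upper bound on average key size. The class and method names are hypothetical; the numbers are the ones printed by rfile-info earlier in this issue.

          ```java
          class IndexEntryEstimate {
              /** Average bytes per level 0 index entry, assuming one entry per data block. */
              static double avgBytesPerEntry(long level0IndexBytes, long numDataBlocks) {
                  if (numDataBlocks <= 0) {
                      throw new IllegalArgumentException("numDataBlocks must be positive");
                  }
                  return (double) level0IndexBytes / numDataBlocks;
              }

              public static void main(String[] args) {
                  // Original file: 6,915,137 bytes of level 0 index over 21,818 data blocks.
                  System.out.printf("before: %.2f%n", avgBytesPerEntry(6_915_137L, 21_818L));
                  // File rewritten with the shortened keys: 1,873,885 bytes over 21,758 data blocks.
                  System.out.printf("after:  %.2f%n", avgBytesPerEntry(1_873_885L, 21_758L));
              }
          }
          ```

          This works out to roughly 317 bytes per entry before the change and roughly 86 bytes per entry after it.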

          Keith Turner added a comment -

          I pushed a commit to the PR that adds a --keyStats option to rfile-info. Below is the output of running this command on the original file. You can see that the 6 largest keys all ended up in the index. Also, the average key size in the index is over twice that of the data.

          $ accumulo rfile-info --keyStats ~/A000rxoi.rf 
          Reading file: file:/home/fluo/A000rxoi.rf
          RFile Version            : 7
          
          Locality group           : notify
          	Start block            : 0
          	Num   blocks           : 0
          	Index level 0          : 0 bytes  1 blocks
          	First key              : null
          	Last key               : null
          	Num entries            : 0
          	Column families        : [ntfy]
          Locality group           : <DEFAULT>
          	Start block            : 0
          	Num   blocks           : 21,818
          	Index level 3          : 120,581 bytes  1 blocks
          	Index level 2          : 451,008 bytes  2 blocks
          	Index level 1          : 714,687 bytes  3 blocks
          	Index level 0          : 6,915,137 bytes  25 blocks
          	First key              : um:d:385:%03;%01;10.30.170.244>>o>/2954%af; data:current [] 4611686019157309597 false
          	Last key               : um:d:395:%03;%01;%ff; com.facebook>.www>s>/dialog/feed?app_id=90376669494... TRUNCATED data:current [] -6917529026891043602 false
          	Num entries            : 24,299,468
          	Column families        : [data]
          
          Meta block     : BCFile.index
                Raw size             : 4 bytes
                Compressed size      : 12 bytes
                Compression type     : gz
          
          Meta block     : RFile.index
                Raw size             : 120,754 bytes
                Compressed size      : 21,719 bytes
                Compression type     : gz
          
          
          Statistics for keys in data :
          	Up to size      count      %-age
          	         10 :   10768926  26.51%
          	        100 :   13471699  70.82%
          	       1000 :      58725   2.56%
          	      10000 :        112   0.07%
          	     100000 :          6   0.04%
          	    1000000 :          0   0.00%
          	   10000000 :          0   0.00%
          	  100000000 :          0   0.00%
          	 1000000000 :          0   0.00%
          	10000000000 :          0   0.00%
          
          	min:      31.00 max: 330,380.00 avg:     122.99 stddev:     157.51
          
          Statistics for keys in index :
          	Up to size      count      %-age
          	         10 :       6192   7.67%
          	        100 :      15024  49.96%
          	       1000 :        578  13.21%
          	      10000 :         18   8.83%
          	     100000 :          6  20.33%
          	    1000000 :          0   0.00%
          	   10000000 :          0   0.00%
          	  100000000 :          0   0.00%
          	 1000000000 :          0   0.00%
          	10000000000 :          0   0.00%
          
          	min:      36.00 max: 330,380.00 avg:     281.73 stddev:   3,901.56
          $
          

          Below is the output of running this command on a file compacted using the code in the PR. None of the largest keys are in the index, and the average key size in the index is less than half of that in the data.

          $ accumulo rfile-info --keyStats /accumulo/tables/2/default_tablet/A0000005.rf
          Reading file: hdfs://localhost:10000/accumulo/tables/2/default_tablet/A0000005.rf
          RFile Version            : 8
          
          Locality group           : <DEFAULT>
          	Num   blocks           : 21,758
          	Index level 1          : 3,048 bytes  1 blocks
          	Index level 0          : 1,873,885 bytes  8 blocks
          	First key              : um:d:385:%03;%01;10.30.170.244>>o>/2954%af; data:current [] 4611686019157309597 false
          	Last key               : um:d:395:%03;%01;%ff; com.facebook>.www>s>/dialog/feed?app_id=90376669494... TRUNCATED data:current [] -6917529026891043602 false
          	Num entries            : 24,299,468
          	Column families        : [data]
          
          Meta block     : BCFile.index
                Raw size             : 4 bytes
                Compressed size      : 12 bytes
                Compression type     : gz
          
          Meta block     : RFile.index
                Raw size             : 3,163 bytes
                Compressed size      : 1,515 bytes
                Compression type     : gz
          
          
          Statistics for keys in data :
          	Up to size      count      %-age
          	         10 :   10768926  26.51%
          	        100 :   13471699  70.82%
          	       1000 :      58725   2.56%
          	      10000 :        112   0.07%
          	     100000 :          6   0.04%
          	    1000000 :          0   0.00%
          	   10000000 :          0   0.00%
          	  100000000 :          0   0.00%
          	 1000000000 :          0   0.00%
          	10000000000 :          0   0.00%
          
          	min:      31.00 max: 330,380.00 avg:     122.99 stddev:     157.51
          
          Statistics for keys in index :
          	Up to size      count      %-age
          	         10 :      18153  68.40%
          	        100 :       3602  31.43%
          	       1000 :          1   0.17%
          	      10000 :          0   0.00%
          	     100000 :          0   0.00%
          	    1000000 :          0   0.00%
          	   10000000 :          0   0.00%
          	  100000000 :          0   0.00%
          	 1000000000 :          0   0.00%
          	10000000000 :          0   0.00%
          
          	min:       9.00 max:   2,134.00 avg:      58.49 stddev:      36.23
          $
          

            People

            • Assignee:
              Keith Turner
              Reporter:
              Eric Newton
            • Votes:
              0
              Watchers:
              4

                Time Tracking

                Estimated:
                Not Specified
                Remaining:
                0h
                Logged:
                3h 20m
