HIVE-2097

Explore mechanisms for better compression with RC Files

    Details

      Description

      The compression mechanisms used by RCFile should be explored and optimized.

      Some initial ideas:

      1. More efficient serialization/deserialization based on type-specific and storage-specific knowledge.

      For instance, storing sorted numeric values efficiently using delta coding techniques (a sketch follows this list).

      2. More efficient compression based on type-specific and storage-specific knowledge.

      Enable compression codecs to be specified per type or per individual column.

      3. Reordering the on-disk storage for better compression efficiency.
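
      As a rough illustration of idea #1, here is a minimal sketch of delta coding for a sorted numeric column: consecutive values are replaced by their gaps and written as varints. The class and method names are hypothetical and not part of the attached patch.

      import java.io.ByteArrayOutputStream;

      // Hypothetical sketch: delta + varint encoding for a sorted column.
      // Assumes non-negative values in ascending order.
      public class SortedLongColumnCodec {

        /** Encodes sorted values as varint-coded gaps between consecutive values. */
        public static byte[] encode(long[] sortedValues) {
          ByteArrayOutputStream out = new ByteArrayOutputStream();
          long previous = 0;
          for (long value : sortedValues) {
            writeVarLong(out, value - previous);  // gaps stay small for sorted data
            previous = value;
          }
          return out.toByteArray();
        }

        /** Unsigned varint: 7 bits per byte, high bit marks a continuation. */
        private static void writeVarLong(ByteArrayOutputStream out, long v) {
          while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
          }
          out.write((int) v);
        }
      }

      Sorted timestamps or auto-increment ids encoded this way shrink to a few bytes per row before the block-level codec even runs.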

      Attachments

      1. datacomp.tar.gz (23 kB, Krishna Kumar)


          Activity

          Krishna Kumar added a comment -

          Comment hijacked from HIVE-2065:

          He Yongqiang added a comment - 31/Mar/11 23:13

          We examined column groups and sorted the data internally based on one column within each column group. (But we did not try different compressions across column groups.) We tried this with 3-4 tables and saw ~20% storage savings on one table compared to the previous RCFile. The main problem with this approach is that it is hard to find the correct/most efficient column group definitions.
          One example: table tbl_1 has 20 columns, and the user can define:

          col_1,col_2,col_11,col_13:0;col_3,col_4,col_15,col_16:1;

          This will put col_1, col_2, col_11, col_13 into one column group and reorder that group by sorting on col_1 (0 being the first column in this column group); put col_3, col_4, col_15, col_16 into another column group and reorder it by sorting on col_4; and finally put all other columns into the default column group in their original order.
          It should also be easy to allow different compression codecs for different column groups.

          The main blocking issue for this approach is having a full set of utilities to find the best column group definition.
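
          A minimal sketch of how a column-group definition in the syntax above might be parsed (semicolon-separated groups, each listing its columns and the in-group index of the sort column). The parser class is hypothetical and only illustrates the format, not any committed Hive API.

          import java.util.ArrayList;
          import java.util.Arrays;
          import java.util.List;

          // Hypothetical parser for a spec such as
          // "col_1,col_2,col_11,col_13:0;col_3,col_4,col_15,col_16:1;"
          public class ColumnGroupSpec {
            public final List<String> columns;
            public final int sortColumnIndex;  // position of the sort column within this group

            ColumnGroupSpec(List<String> columns, int sortColumnIndex) {
              this.columns = columns;
              this.sortColumnIndex = sortColumnIndex;
            }

            public static List<ColumnGroupSpec> parse(String spec) {
              List<ColumnGroupSpec> groups = new ArrayList<ColumnGroupSpec>();
              for (String group : spec.split(";")) {
                if (group.isEmpty()) {
                  continue;  // tolerate a trailing ';'
                }
                int colon = group.lastIndexOf(':');
                List<String> cols = Arrays.asList(group.substring(0, colon).split(","));
                int sortIndex = Integer.parseInt(group.substring(colon + 1));
                groups.add(new ColumnGroupSpec(cols, sortIndex));
              }
              return groups;
            }
          }

          Columns not named in any group would fall into the default group in their original order, per the comment above.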

          alex gemini added a comment -

          A few suggestions:
          Columnar databases usually organize columns in a "highest selectivity first" order, and within each column they store the values sorted.
          Case 1: if we already know the pattern of each column in a big dataset (for example, we can compute a sample column distribution in the database), we need to know the number of distinct values in each column. In the create table statement:
          create table a
          (col1, col2, col3, col4, col5, col6, xxx)
          TBLPROPERTIES (col1_sample=0.001, col2_sample=0.01, col3_sample=0.5, col4_sample=0.02, col5_sample=0.002, col6_sample=0.005)
          When we organize column groups we then know which column has the highest selectivity. In this example the selectivity order of table a is col3 > col4 > col2 > col6 > col5 > col1, so we can organize the column groups as (col3, col4, col2), (col6, col5), col1.
          Case 2: if we do not know the table properties at create time, we can store the data normally and provide a utility such as hive --service rcfile_reorder 'some_hive_table_here'. Executing this command would submit several MapReduce jobs to calculate the selectivity of each column, store the results in the metastore, and then decompress each RCFile and reorganize it into a more space-efficient column group layout.
          Hope this helps.
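
          A rough sketch of the ordering step described in case 1, assuming the colN_sample table properties above hold sampled distinct-value ratios; neither the property names nor the class are an existing Hive API.

          import java.util.ArrayList;
          import java.util.Collections;
          import java.util.Comparator;
          import java.util.List;
          import java.util.Map;

          // Hypothetical sketch: order columns by descending sample ratio so column
          // groups can be formed from the most selective columns first.
          public class SelectivityOrdering {

            public static List<String> orderBySelectivity(final Map<String, Double> sampleRatios) {
              List<String> columns = new ArrayList<String>(sampleRatios.keySet());
              Collections.sort(columns, new Comparator<String>() {
                public int compare(String a, String b) {
                  return Double.compare(sampleRatios.get(b), sampleRatios.get(a));  // descending
                }
              });
              return columns;
            }
          }

          For the sample ratios in the example this yields col3, col4, col2, col6, col5, col1, matching the grouping suggested above.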

          Krishna Kumar added a comment -

          Thanks Alex for the suggestions.

          Just to be sure we are on the same page: I believe you are talking about approach #3 from the description above, which aligns with the ideas in He Yongqiang's comment. I am currently working on implementing #1 and #2.

          Regarding the #3 approaches, column grouping and row reordering are the general idea, but I do not understand your point about column selectivity. Why should selectivity play a role here, where any grouping/reordering is done for better compression? There are two effects we can exploit for better compression within a column group: (a) when the values in two columns are similar, and (b) when the values are correlated, that is, using conditional probabilities for better compression. In either case, my hope is that we can create type-specific compressors for structs/maps etc. which can exploit these features, i.e., a struct/map acts as a column group for compression purposes.

          alex gemini added a comment -

          Selectivity plays an important role in columnar databases because they use run-length encoding to compress most dimension-attribute columns. For example, take a log table: create table (gender, age, region, message). We know the selectivity order is gender = 1/2 > age = 1/20 > region = 1/300, and we can order the table columns as #1 (gender, age, region, message) or #2 (region, age, gender, message). With #1 we only need (2 + 2*20 + 2*20*300 + num_of_message) runs to store all the records in one DFS block, but with #2 we need (300 + 300*20 + 300*20*2 + num_of_message). Discarding num_of_message, #1 takes only about 66% of the space #2 requires; the difference is that run-length encoding is more space-efficient when the table is ordered by selectivity.
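
          The 66% figure can be checked directly from the run counts; the sketch below just evaluates the two expressions above, assuming perfectly clustered data with cardinalities 2 (gender), 20 (age), and 300 (region).

          // Worked check of the run counts quoted above, ignoring the message column.
          public class RunCountExample {
            public static void main(String[] args) {
              long ordering1 = 2 + 2 * 20 + 2 * 20 * 300;      // gender, age, region -> 12,042 runs
              long ordering2 = 300 + 300 * 20 + 300 * 20 * 2;  // region, age, gender -> 18,300 runs
              System.out.printf("#1=%d runs, #2=%d runs, ratio=%.2f%n",
                  ordering1, ordering2, (double) ordering1 / ordering2);  // ratio ~ 0.66
            }
          }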

          alex gemini added a comment -

          Another issue is efficient serialization/deserialization. For the same example above, assume every (gender, age, region) combination has 100 messages stored evenly in one DFS block. The gender column then stores values like this:

          {'male'} [1 - 60k]
          {'female'} [60k+1 - 120k]

          the age column looks like this:

          {21} [1 - 3k]
          {22} [3k+1 - 6k]
          {23} [6k+1 - 9k]

          and the region column looks like this:

          {'LA'} [1 - 300] {'NY'} [301 - 600] ...

          When we issue a query on a single table such as select sum(age) from logs where region='LA' and age=30, we count every column referenced in the select/where/group clauses, so we know which referenced column has the lowest selectivity (in this example, region). We then look up the region values:

          {'LA'} [(1-300), (30k+1 - 30k+300), (60k+1 - 60k+300), ...]
          {'NY'} [(301-600), (30k+301 - 30k+600), (60k+301 - 60k+600), ...]

          We only need to deserialize this, not decompress it, because we know the lowest-selectivity column. We then organize the InputSplit's key like

          {[age='21'][region='LA']}

          and its value like

          {(1-300), (30k+1 - 30k+300), (60k+1 - 60k+300), ...}

          This InputSplit key/value pair is unique per DFS block because the columns are already sorted by selectivity, so the lowest-selectivity column referenced in the select/where/group clauses must be unique.
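
          A minimal sketch of the value-to-row-range layout described above, where each distinct value maps to the row ranges holding it within one block; the types and names are hypothetical, not an existing RCFile structure.

          import java.util.ArrayList;
          import java.util.Collections;
          import java.util.LinkedHashMap;
          import java.util.List;
          import java.util.Map;

          // Hypothetical sketch: a run-length style column mapping each distinct value to
          // the row ranges that hold it, e.g. 'LA' -> [1-300], [30001-30300], ...
          public class RangeEncodedColumn {

            public static final class RowRange {
              public final long first;
              public final long last;  // inclusive
              public RowRange(long first, long last) {
                this.first = first;
                this.last = last;
              }
            }

            private final Map<String, List<RowRange>> ranges =
                new LinkedHashMap<String, List<RowRange>>();

            /** Appends a run of identical values occupying rows [first, last]. */
            public void addRun(String value, long first, long last) {
              List<RowRange> list = ranges.get(value);
              if (list == null) {
                list = new ArrayList<RowRange>();
                ranges.put(value, list);
              }
              list.add(new RowRange(first, last));
            }

            /** A predicate such as region='LA' touches only the matching ranges. */
            public List<RowRange> rowsFor(String value) {
              List<RowRange> list = ranges.get(value);
              return list == null ? Collections.<RowRange>emptyList() : list;
            }
          }

          An InputSplit key such as {[age='21'][region='LA']} would then carry a range list like this as its value, so the reader can skip non-matching rows without decompressing them.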

          alex gemini added a comment -

          I'm not quite sure I explained the columnar database execution strategy clearly; I hope the following material helps:
          #1 http://www.infoq.com/news/2011/09/nosqlnow-columnar-databases
          #2 http://www.oscon.com/oscon2010/public/schedule/detail/15561
          #3 http://www.vertica.com/2010/05/26/why-verticas-compression-is-better/
          #4 http://www.vertica.com/2011/09/01/the-power-of-projections-part-1/
          Good luck.
          Krishna Kumar added a comment -

          Unrefactored source for all the implemented compression codecs (datacomp.tar.gz, attached).


            People

            • Assignee:
              Krishna Kumar
            • Reporter:
              Krishna Kumar
            • Votes:
              0
            • Watchers:
              11
