  Phoenix / PHOENIX-2565

Store data for immutable tables in single KeyValue

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.10.0
    • Labels: None

      Description

      Since an immutable table (i.e. declared with IMMUTABLE_ROWS=true) will never update a column value, it'd be more efficient to store all column values for a row in a single KeyValue. We could use the existing format we have for variable length arrays.

      For backward compatibility, we'd need to support the current mechanism. Also, you'd no longer be allowed to transition an existing table to/from being immutable. I think the best approach would be to introduce a new IMMUTABLE keyword and use it like this:

      CREATE IMMUTABLE TABLE ...
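
       A minimal sketch of issuing the proposed DDL through the Phoenix JDBC driver (the IMMUTABLE keyword is the syntax proposed above; the table and column names are hypothetical):

       import java.sql.Connection;
       import java.sql.DriverManager;
       import java.sql.Statement;

       public class CreateImmutableTableExample {
           public static void main(String[] args) throws Exception {
               // Connect through the Phoenix thick driver (assumes a local ZooKeeper quorum).
               try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
                    Statement stmt = conn.createStatement()) {
                   // Proposed syntax: IMMUTABLE is declared at creation time, so the table
                   // can never be transitioned to/from IMMUTABLE_ROWS later.
                   stmt.execute("CREATE IMMUTABLE TABLE metrics (" +
                           "host VARCHAR NOT NULL, " +
                           "ts DATE NOT NULL, " +
                           "cpu DOUBLE, mem DOUBLE " +
                           "CONSTRAINT pk PRIMARY KEY (host, ts))");
               }
           }
       }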
      
      1. PHOENIX-2565-wip.patch
        113 kB
        Thomas D'Silva
      2. PHOENIX-2565-v2.patch
        93 kB
        Thomas D'Silva
      3. PHOENIX-2565.patch
        100 kB
        Thomas D'Silva

        Issue Links

          Activity

          ram_krish ramkrishna.s.vasudevan added a comment -

           Before starting on this - currently, if we declare a table with IMMUTABLE_ROWS = true and then try to update the same row, do we throw an exception or does it just work? Is this property only useful for indexes, so that the index does not need to be updated every time?

          ram_krish ramkrishna.s.vasudevan added a comment -

           I tried out some test cases by upserting the same row twice, but did not see any exceptions while doing so.

          ram_krish ramkrishna.s.vasudevan added a comment -

           So in order to do this compression of the Cells, we need to tell the server that the mutation is being applied to an IMMUTABLE_ROWS = true table, and if so, implement preBatchMutate() in a RegionObserver and do the compression there. Do we already have a coprocessor implementing preBatchMutate() other than the Indexer? If so we can use it; otherwise I believe we need to introduce a new one specific to these mutations.
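
           For illustration only, a rough sketch of what such a hook could look like (a hypothetical coprocessor, not Phoenix's Indexer; assumes the HBase 1.x RegionObserver API, and both the IMMUTABLE_TABLE attribute and the packColumns() helper are made up):

           import java.io.IOException;
           import org.apache.hadoop.hbase.client.Mutation;
           import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
           import org.apache.hadoop.hbase.coprocessor.ObserverContext;
           import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
           import org.apache.hadoop.hbase.regionserver.MiniBatchOperationInProgress;

           public class PackColumnsObserver extends BaseRegionObserver {
               @Override
               public void preBatchMutate(ObserverContext<RegionCoprocessorEnvironment> ctx,
                       MiniBatchOperationInProgress<Mutation> miniBatchOp) throws IOException {
                   for (int i = 0; i < miniBatchOp.size(); i++) {
                       Mutation m = miniBatchOp.getOperation(i);
                       // Only rewrite mutations the client has flagged (e.g. via a mutation
                       // attribute) as targeting an IMMUTABLE_ROWS = true table.
                       if (m.getAttribute("IMMUTABLE_TABLE") != null) {
                           packColumns(m); // hypothetical helper that collapses the cells into one KeyValue
                       }
                   }
               }

               private void packColumns(Mutation m) {
                   // placeholder for the actual re-serialization logic
               }
           }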

          tdsilva Thomas D'Silva added a comment -

          Parking WIP patch.

          enis Enis Soztutar added a comment -

          This is a great addition. Is there a short design doc that explains the feature?
           I think we should not couple immutable tables with "compact storage". There can be cases where you would want to keep columns non-compact in immutable tables, and compact for mutable tables. The features should be orthogonal. If you are mutating a mutable column value, it can be read-modify-write, similar to secondary index updates.
           Is compact storage all or nothing for a table? What about column families? Should we look into having a subset of columns stored together, something like the original locality groups versus column families from Bigtable (an HBase column family is both a locality group and a column family in the Bigtable terminology)?

          tdsilva Thomas D'Silva added a comment -

          Enis Soztutar

          I didn't add a design doc because we were planning on enabling this feature only for immutable tables and using the existing array serialization format, so the implementation seemed straightforward.

           All column values for a given column family are stored in a single KeyValue. A new StorageScheme (to be added as part of PHOENIX-1598), COLUMNS_STORED_IN_SINGLE_CELL, is used to denote a table with columns stored in this format. Existing tables will have the NON_ENCODED_COLUMN_NAMES storage scheme and will work as before. Once a table is stored with the COLUMNS_STORED_IN_SINGLE_CELL storage scheme, it cannot be transitioned to/from being immutable.

           The existing serialization format used to store arrays (see PArrayDataType) will be used to serialize multiple columns into a single byte[]. An ArrayConstructorExpression will be constructed with the column values as LiteralExpressions and evaluated to generate the byte array.
           A new column expression, ArrayColumnExpression, which stores the index at which the column is stored in the array, will be used instead of KeyValueColumnExpression. The getEncodedColumnQualifier() method of PColumn (to be added as part of PHOENIX-1598) will be used for the index.

           The remaining changes involve handling the new ArrayColumnExpression where previously we only used a KeyValueColumnExpression (for example in WhereCompiler.setScanFilter()). Currently, when a column is deleted we don't remove the entry from the array, as this would involve rewriting all KeyValues. We were thinking of investigating whether we could remove the deleted column values from the array during compaction.
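
           As a rough illustration of the packing idea (not the actual PArrayDataType serialization), writing several column values back to back with a trailing offset array so a value can be located by its encoded position:

           import java.io.ByteArrayOutputStream;
           import java.io.DataOutputStream;
           import java.io.IOException;
           import java.nio.ByteBuffer;

           public class PackedRowSketch {

               /** Packs values back to back, then appends a 4-byte offset per value and the value count. */
               public static byte[] pack(byte[][] values) throws IOException {
                   ByteArrayOutputStream data = new ByteArrayOutputStream();
                   int[] offsets = new int[values.length];
                   for (int i = 0; i < values.length; i++) {
                       offsets[i] = data.size();
                       if (values[i] != null) {
                           data.write(values[i]); // a null column simply contributes 0 bytes
                       }
                   }
                   DataOutputStream out = new DataOutputStream(data);
                   for (int offset : offsets) {
                       out.writeInt(offset);
                   }
                   out.writeInt(values.length); // element count, so a reader can find the offset area
                   return data.toByteArray();
               }

               /** Looks up the value at position i via the offset array - no delimiter scan needed. */
               public static byte[] get(byte[] packed, int i) {
                   ByteBuffer buf = ByteBuffer.wrap(packed);
                   int count = buf.getInt(packed.length - 4);
                   int offsetArea = packed.length - 4 - 4 * count;
                   int start = buf.getInt(offsetArea + 4 * i);
                   int end = (i + 1 < count) ? buf.getInt(offsetArea + 4 * (i + 1)) : offsetArea;
                   byte[] value = new byte[end - start];
                   System.arraycopy(packed, start, value, 0, value.length);
                   return value;
               }
           }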

          James Taylor what do you think about allowing users to specify a subset of columns that are stored together in single KeyValue?

          tdsilva Thomas D'Silva added a comment -

          James Taylor Samarth Jain

          I have attached a patch. Please review when you get a chance.

          Thanks,
          Thomas

          jamestaylor James Taylor added a comment -

          Thanks for the patch, Thomas D'Silva! Feel free to commit it to the encodecolumns branch. Here are a few comments/questions:

          • What's the scoop with the ReplaceArrayColumnWithKeyValueColumnExpressionVisitor in IndexMaintainer and ArrayColumnExpression needing to keep two types of expressions? Can't we just push the correct one into the IndexMaintainer based on the PTable storage scheme?
            +        this.arrayExpression = new KeyValueColumnExpression(column, cf, cf);
            +        this.origKVExpression = new KeyValueColumnExpression(column, displayName, encodedColumnName);
            
          • Looks like the imports in ExpressionType were changed to a *. Maybe we should keep them explicit (and increase our # of imports before turning it into a *)? Otherwise merges back to other branches may get painful?
            +import org.apache.phoenix.expression.function.*;
            
          • Don't think you need this method override in SingleKeyValueComparisonFilter as it doesn't do anything:
            +    @Override
            +    protected Boolean evaluate(Tuple input) {
            +        return super.evaluate(input);
            +    }
            
          jamestaylor James Taylor added a comment -

          Also, more of a note to Samarth Jain, but upon further thinking, encoded columns will work fine for transactional tables, so any special cases around those should be removed:

               public static boolean setMinMaxQualifiersOnScan(PTable table) {
          -        return EncodedColumnsUtil.usesEncodedColumnNames(table) && !table.isTransactional() && !hasDynamicColumns(table);
          +        return table.getStorageScheme() != null && table.getStorageScheme() == StorageScheme.ENCODED_COLUMN_NAMES
          +        		&& !table.isTransactional() && !hasDynamicColumns(table);
          
          tdsilva Thomas D'Silva added a comment -

          Thanks for the feedback.

           ReplaceArrayColumnWithKeyValueColumnExpressionVisitor is only used in one place, in IndexUtil.generateIndexData, because we use a ValueGetter to get the value of the data table column using the original data table column reference. This is also why ArrayColumnExpression needs to keep track of the original KeyValueColumnExpression.
           If we don't replace the array column expression with the original column expression, then when the ValueGetter looks up the column by its qualifier it won't find it.
           I will make the other changes you suggested.

           ValueGetter valueGetter = new ValueGetter() {

               @Override
               public byte[] getRowKey() {
                   return dataMutation.getRow();
               }

               @Override
               public ImmutableBytesWritable getLatestValue(ColumnReference ref) {
                   // Always return null for our empty key value, as this will cause the index
                   // maintainer to always treat this Put as a new row.
                   if (isEmptyKeyValue(table, ref)) {
                       return null;
                   }
                   byte[] family = ref.getFamily();
                   byte[] qualifier = ref.getQualifier();
                   RowMutationState rowMutationState = valuesMap.get(ptr);
                   PColumn column = null;
                   try {
                       column = table.getColumnFamily(family).getPColumnForColumnQualifier(qualifier);
                   } catch (ColumnNotFoundException e) {
                   } catch (ColumnFamilyNotFoundException e) {
                   }
                   if (rowMutationState != null && column != null) {
                       byte[] value = rowMutationState.getColumnValues().get(column);
                       ImmutableBytesPtr ptr = new ImmutableBytesPtr();
                       ptr.set(value == null ? ByteUtil.EMPTY_BYTE_ARRAY : value);
                       SchemaUtil.padData(table.getName().getString(), column, ptr);
                       return ptr;
                   }
                   return null;
               }
           };
          
          jamestaylor James Taylor added a comment -

          Thanks for the explanation, Thomas D'Silva. Would you mind filing a follow up JIRA to get rid of ReplaceArrayColumnWithKeyValueColumnExpressionVisitor with the above explanation?

          +1 to commit to encodecolumns branch.

          tdsilva Thomas D'Silva added a comment -

          Attached final patch.

          ankit@apache.org Ankit Singhal added a comment - - edited

           Thomas D'Silva/James Taylor, are we considering any other encoding for storing data in a single KeyValue apart from the VARBINARY array serialisation format?
           Because in our case, when there are hundreds of columns and each column is not null, the VARBINARY array serialisation takes 2x the space (as the offsets are stored along with the data) compared to the RowKey format used for serialising all the columns in a single byte array.

           Should we look into optimising the encoding here: storing columns in a byte array separated by a delimiter like the VAR array encoding but without offsets, or Avro serialization, or some other option?

          +Nick Dimiduk/Enis Soztutar/Sergey Soldatov/Devaraj Das

          tdsilva Thomas D'Silva added a comment - - edited

          Ankit Singhal

          For your use case, are all the columns of the same fixed width datatype? If so we can use the fixed width array serialization which doesn't need to store array offsets.
          If the columns are fixed length but of different data types, since the offsets are the same for all of the rows, we don't need to serialize the offset array for every row.

          ankit@apache.org Ankit Singhal added a comment -

           Thomas D'Silva, we had both fixed and variable width datatypes. I'm not sure why we can't just concatenate the bytes with a delimiter (including special encoding for nulls, and tracking the length of fixed width datatypes via the schema).

           Is the offset array used to optimise read time?

          jamestaylor James Taylor added a comment -

          Yes, the format is optimized for read time (we're seeing a 5-7x perf improvement). There could be different storage formats in the future. The KeyValueSchema format is as you've described.

          jamestaylor James Taylor added a comment -

           Also, this format is optimized for dense data. See PHOENIX-3559. I'm not sure we'll find a serialization format that's good for both dense and sparse storage; IMHO it's ok to optimize for dense storage provided we support plugging in other storage formats optimized in other dimensions.

          I'm not sure why we can't just concatenate the bytes with a delimiter (including special encoding for null and tracking of a length of fixed width datatype by schema).

           This is more or less the format we use for the bytes that make up the row key. There are limitations in that a VARBINARY and an ARRAY may only appear at the end of the row key, since there's no delimiter byte that we can count on not appearing in the data. You'd also need to walk through the bytes to get to the start of the column data (which would get slower and slower as the number of columns increases). The new format allows you to look up the byte offset via an array lookup, so it's pretty fast. We also don't need to store any separator bytes.

          samarthjain Samarth Jain added a comment -

           Another possible option would be to separate out the offset information into key values of their own instead of including it in the packed cell itself. In fact, instead of storing the offset information, we could then just store the starting position and length of the column bytes. We can use our number-based qualifiers for storing this start/length info. This lets us store the metadata only for those columns whose values have been packed into our "data" cell.

          Let me try to explain with an example:

          Consider the data cell that looks something like this:
          single_cell_column_qualifier :: col1\col20\col60\col100 where \ represents a separator byte.

          Then, the metadata cells with this new encoding scheme would look something like this:
          cq :: start position/length (value)
          1 :: 0/10
          20 :: 11/20
          60 :: 31/10
          100 :: 41/20

           So now, for fetching the value of col1 (whose qualifier also is 1), we would first get the start position and length by fetching the value of column qualifier 1. Then, using the start position/length information, we could easily extract the bytes out of the packed data cell.
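
           For illustration, a sketch of that two-step read (family/qualifier arguments are hypothetical, and it assumes the metadata cell stores start and length as two 4-byte ints):

           import org.apache.hadoop.hbase.client.Result;
           import org.apache.hadoop.hbase.util.Bytes;

           public class PackedCellLookupSketch {
               // First fetch the column's metadata cell (its numeric qualifier), which holds
               // <start><length>, then slice the packed "data" cell at that position.
               public static byte[] readColumn(Result row, byte[] family, byte[] metaQualifier,
                       byte[] packedQualifier) {
                   byte[] meta = row.getValue(family, metaQualifier);
                   byte[] packed = row.getValue(family, packedQualifier);
                   if (meta == null || packed == null) {
                       return null; // column absent from this row
                   }
                   int start = Bytes.toInt(meta, 0);
                   int length = Bytes.toInt(meta, 4);
                   return Bytes.copy(packed, start, length);
               }
           }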

          tdsilva Thomas D'Silva added a comment -

          This format reduces the bytes we need to read since we don't have to read the entire offset array, but it doesn't reduce the serialized bytes we store, right?

          samarthjain Samarth Jain added a comment -

           It stores the start position and length information only for those columns that have a value. This would provide savings for sparse data, unlike our VARBINARY ARRAY scheme where we end up storing the OFFSET information even for those columns that are not present in the packed cell. For dense data, I expect this scheme to occupy slightly more space, since we will be storing a length and start position instead of just an offset. It would be interesting to see how this behaves with Snappy or some other compression turned on.

          James Taylor - WDYT?

          enis Enis Soztutar added a comment -

          From the experience of trying to use this for billions of rows and hundreds of columns (where the schema is a regular RDBMS one), there are a couple of problems that the array encoding has in terms of packing data efficiently.

           • Array encoding uses all three of separators, offsets/lengths, and nullability encoding. This means there is a lot of unnecessary overhead for representing repetitive information.
           • Run-length-encoding-like null representation gets really expensive if you have data like a, <null>, b, <null>, c, <null>. A simple bitset is easier and more efficient. Or, if you are already encoding the offsets, you do not have to re-encode nullability: if offset_i and offset_i+1 are equal, the field is null.
           • The offsets are 4 or 2 bytes fixed length, not using varint encoding. This makes a difference for the majority of data, where the expected number of columns is <128.

          I think array encoding is this way because arrays can be part of the row key. However, for packing column values, we do not need the lexicographic sortable guarantee, meaning that we can do a way better job than the array encoding. The way forward for this I think is to leave the array encoding as it is, but instead do a PStructDataType that implements the new scheme.

           This is the exact problem that Avro / PB and Thrift encodings solve already. However, the requirements are a little different for Phoenix.

           • First, we have to figure out how we are going to deal with schema evolution.
           • We need an efficient way to access individual fields within the byte array without deserializing the whole byte[] (although notice that it is already read from disk and in memory).
           • Nullability support.

           Looking at this, I think something like FlatBuffers / Cap'n Proto looks more like the direction (especially with the requirement that we do not want to deserialize the whole thing).

          If we want to do a custom format with the given encodings, I think we can do something like this:

           <format_id><column_1><column_2>...<column_n><offset_1><offset_2>...<offset_n><offset_start>
          

          where

          • format_id : single byte showing the format of the data,
          • column_n : column data, NO separators
          • offset_n : byte offset of the nth column. It can be varint, if we can cache this data. Otherwise, can make this 1/2/4 bytes and encode that information at the tail.
           • offset_start : this is the offset of <offset_1>. The reader can find and cache how many columns there are in the encoded data by reading all of the offsets. Notice that we can only add columns to an existing table, and the schema is still in the catalog table. Columns not used anymore are always null.

           To read a column, you would find the offset of the column, and the length would be offset_n+1 - offset_n. If a column is null, it is always encoded as 0 bytes, and offset_n+1 would be equal to offset_n (a read-path sketch follows below).
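
           A small sketch of that read path (purely illustrative, assuming 4-byte offsets and that offset_start occupies the last 4 bytes of the encoded value):

           import java.nio.ByteBuffer;

           public class TailOffsetFormatReader {
               /**
                * Layout assumed: <format_id><column_1>...<column_n><offset_1>...<offset_n><offset_start>,
                * with offsets as 4-byte ints relative to the start of the encoded value.
                */
               public static byte[] readColumn(byte[] encoded, int n) {
                   ByteBuffer buf = ByteBuffer.wrap(encoded);
                   int offsetStart = buf.getInt(encoded.length - 4);        // position of offset_1
                   int numColumns = (encoded.length - 4 - offsetStart) / 4;
                   int start = buf.getInt(offsetStart + 4 * (n - 1));       // offset_n (columns are 1-based here)
                   int end = (n < numColumns)
                           ? buf.getInt(offsetStart + 4 * n)                // offset_n+1
                           : offsetStart;                                   // last column ends where offsets begin
                   if (end == start) {
                       return null; // null columns are encoded as 0 bytes, so adjacent offsets are equal
                   }
                   byte[] value = new byte[end - start];
                   buf.position(start);
                   buf.get(value);
                   return value;
               }
           }
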
          jamestaylor James Taylor added a comment -

          Enis Soztutar - the format you outlined is the format of the single key value format. It's very similar to the array format with the differences being:

          • no separators are stored
          • we store the number of elements and calculate the offset_start instead of storing it.

          Samarth Jain I think we should proceed with what we have but make sure that new/alternate storage schemes can be introduced. The bulk of the change is to create a level of indirection for column names and to have a way of doing a positional lookup. Our use cases don't require aggregation to be fast and our data is not sparse. There will be other use cases that can be better optimized with a different format, but given that new storage schemes can be added, these can be added in the future.

          For the particular format you mentioned, Samarth Jain:

          • You'd need to do a binary search given the position since you wouldn't be able to find the byte offset directly through an array dereference. This type of scheme would probably be similar in performance to our column encoding scheme for mutable data.
           • Only a count(*) query would get faster for aggregation - other types of aggregation would still be slower (as they'd require reading the larger, single KeyValue that contains all the values).
           • It's not necessary to store the length, because you can look at the offset of the next element to calculate it.

           I don't think we can do a whole lot to speed up aggregation. I think we're being hit with the cost of reading the large single KeyValue. There might be some simple things we could do to improve sparse storage. The extra storage cost is due to storing the byte offset for all elements between values that are set. For example, if column 1 is set and column 102 is set, we're storing offsets for columns 2 through 101. We could instead introduce a bit set that tracks whether a value is set. Instead of storing 100 shorts (200 bytes) we'd store 13 bytes for the bit set. I'm not sure this is going to make a big difference, though.

          enis Enis Soztutar added a comment -

          the format you outlined is the format of the single key value format. It's very similar to the array format with the differences being:

           Thanks. Reading the code of PArrayDataType and looking at the serialized data led us to believe otherwise. Maybe I am missing something.

          jamestaylor James Taylor added a comment -

           Yep, I can see the confusion, as the changes aren't in master yet. Take a look at the encodecolumns2 branch instead and the changes Thomas D'Silva made most recently. He and Samarth Jain intend to create a pull request very soon.

          tdsilva Thomas D'Silva added a comment -

           The 01ef5d commit in the encodecolumns2 branch changed the PArrayDataType serialization to not use a separator byte. We currently use 2 or 4 bytes for the offsets, but we are looking into using a bitset to save space while storing the column offsets.

          enis Enis Soztutar added a comment -

          Thanks, I was checking the code in the branch before 01ef5d.

          Here:
          https://github.com/apache/phoenix/blob/encodecolumns2/phoenix-core/src/main/java/org/apache/phoenix/schema/types/PArrayDataType.java#L1299
          we are still serializing the nulls, no?

           For example, if column 1 is set and column 102 is set, we're storing offsets for columns 2 through 101. We could instead introduce a bit set that tracks whether a value is set

           For doing nulls in Avro, you do a union of the type with the Null type, so all nullable fields are encoded like <is_null:1byte><type_data:0 or more bytes>. So Avro has to spend 1 byte per nullable field, regardless of whether the field is there or not. PB has a different model, where each field is prefixed with the id of the field, which also means that if the field is not there it is null. So the cost is 1 varint per field that is not null (as opposed to per field in the schema). Obviously, what is optimal depends on whether, on average, there are a lot of null fields in the data or not.

           The cost of doing a bitset for nullability would be 1 byte per 8 "declared" fields (regardless of whether they are null or not). If there is a single null field, we save 2 or 4 bytes (for its offset). So if, on average, we expect the data to have at least 1 null per 16 columns or so, it looks like a good idea to implement this.

          samarthjain Samarth Jain added a comment -

           If we go by the bitset approach where we don't store an offset for missing or null columns, then figuring out where the bytes for a column are stored in the packed cell would no longer be an O(1) operation. For finding a column, on average, we would have to go through half of the bitset to figure out where its bytes are. We could possibly optimize that by caching the bitset/offset information. This would come in handy especially when we have to evaluate multiple ArrayColumnExpressions against the packed cell. The caching would have to be done at a higher level though (the parent of ArrayColumnExpression?). The cache would be built when evaluating the first ArrayColumnExpression; subsequent evaluate() calls then won't have to recompute where the bytes are.

          jamestaylor James Taylor added a comment -

          Enis Soztutar - for a null value, we store an offset, but no data. There's a slight hack to differentiate setting something to null versus the value being absent (based on the offset being positive or negative). We need to be able to differentiate these two based on our default column value implementation. There's no cost for trailing null values.

          Using a bit set would still be close to O(1) as using Long.bitCount() would check 64 bits worth pretty quickly, but there'd be some overhead. Not sure if it's worth doing this right now or not - we could always add it later as a new storage format - it'd basically cut the storage cost of the offsets by 1/8. The data value bytes would remain the same.
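
           For reference, a sketch of the kind of bitset lookup being described (not code from the branch): using Long.bitCount() to find which slot in a dense offset array corresponds to column qualifier q when offsets are stored only for present columns:

           public class OffsetBitSetSketch {
               /**
                * words is the presence bitset, one bit per encoded column qualifier.
                * Returns the index into the dense offset array for qualifier q, or -1 if the
                * column is absent - i.e. the rank of bit q among the set bits.
                */
               public static int offsetIndex(long[] words, int q) {
                   int word = q >>> 6;            // which 64-bit word holds bit q
                   long bit = 1L << (q & 63);
                   if ((words[word] & bit) == 0) {
                       return -1;                 // no offset stored for this column
                   }
                   int rank = 0;
                   for (int i = 0; i < word; i++) {
                       rank += Long.bitCount(words[i]);            // count whole preceding words
                   }
                   rank += Long.bitCount(words[word] & (bit - 1)); // count lower bits in the same word
                   return rank;
               }
           }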

          IMHO, I think PHOENIX-3570 is higher priority.

          jamestaylor James Taylor added a comment -

          Another idea for further optimization: don't store offsets for fixed width columns. We'd still need an entry in our bitset, but we could be smart about not expecting an offset for these columns when we traverse the offsets (at the expense of a little more CPU to calculate the offset). It's difficult to judge the impact of these optimizations.

          tdsilva Thomas D'Silva added a comment -

           From Mujtaba Chohan's testing, performance degrades if the single key value that stores the serialized columns is large. Users will also not be able to write rows if the single key value is larger than the hbase.client.keyvalue.maxsize setting.
          We could store N columns of a column family in a single key value (and make N configurable).
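
           For reference, the client-side limit mentioned above can be raised via configuration; a minimal sketch (10 MB is the usual default, but check your HBase version):

           import org.apache.hadoop.conf.Configuration;
           import org.apache.hadoop.hbase.HBaseConfiguration;

           public class KeyValueMaxSizeExample {
               public static void main(String[] args) {
                   Configuration conf = HBaseConfiguration.create();
                   // Raise the maximum size of a single KeyValue the client will accept,
                   // e.g. to 50 MB, if packed rows can exceed the default limit.
                   conf.setLong("hbase.client.keyvalue.maxsize", 50L * 1024 * 1024);
                   System.out.println(conf.get("hbase.client.keyvalue.maxsize"));
               }
           }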

          jamestaylor James Taylor added a comment -

          Users are already able to put columns in different column families. I don't think we need a different mechanism at this point.

          tdsilva Thomas D'Silva added a comment -

           Shehzaad Nakhoda - We had discussed implementing a different storage format that uses a bitset to efficiently determine whether a value was null or missing. If we did this, then we could use this storage format for multi-tenant tables with views.


            People

            • Assignee:
              tdsilva Thomas D'Silva
              Reporter:
              jamestaylor James Taylor
            • Votes:
              0
              Watchers:
              9
