Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Information Provided
    • Affects Version/s: None
    • Fix Version/s: 4.11.0
    • Labels: None

      Description

      Schema with 5K columns

      -- T is a placeholder table name
      CREATE TABLE T (K1 INTEGER NOT NULL, K2 INTEGER NOT NULL, C1 VARCHAR, ... C5000 VARCHAR
          CONSTRAINT PK PRIMARY KEY (K1, K2))
          VERSIONS=1, MULTI_TENANT=true, IMMUTABLE_ROWS=true
      

      In this schema, only 100 randomly chosen columns are filled with random 15-character values; the rest are null.

      Data size is 6x larger with the encoded column scheme compared to non-encoded: roughly 12GB per 1M rows encoded vs ~2GB per 1M rows non-encoded.

      When compressed with GZ, the size with the encoded column scheme is still 35% higher.
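
      For reference, here is a minimal sketch of how the encoded and non-encoded variants being compared could be declared, assuming Phoenix 4.10+ where the COLUMN_ENCODED_BYTES table property is available (0 turns column-name encoding off) and HBase properties such as COMPRESSION can be passed through the DDL. The table names and the shortened column list are hypothetical, and properties not relevant to the comparison are omitted.

      -- Encoded column names (the default for new tables since 4.10)
      CREATE TABLE T_ENCODED (K1 INTEGER NOT NULL, K2 INTEGER NOT NULL,
          C1 VARCHAR, C2 VARCHAR, C3 VARCHAR  -- ...through C5000 in the actual test
          CONSTRAINT PK PRIMARY KEY (K1, K2))
          IMMUTABLE_ROWS=true, COMPRESSION='GZ';

      -- Same shape with column-name encoding turned off, as the non-encoded baseline
      CREATE TABLE T_NON_ENCODED (K1 INTEGER NOT NULL, K2 INTEGER NOT NULL,
          C1 VARCHAR, C2 VARCHAR, C3 VARCHAR
          CONSTRAINT PK PRIMARY KEY (K1, K2))
          IMMUTABLE_ROWS=true, COMPRESSION='GZ', COLUMN_ENCODED_BYTES=0;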

        Activity

        Mujtaba Chohan added a comment -

        A related issue, due to the higher space usage, is that the performance gains for point and aggregate queries with the encoded column scheme compared to non-encoded disappear; performance is almost the same for the case described above.

        James Taylor added a comment -

        The encoding scheme isn't optimized for sparse storage; the idea is to use it when your storage is dense. Potentially you could use the column encoding scheme but still use multiple key values, which would be a good choice for sparse data. You'd also want to use realistic column names for a test like this (instead of c1, c2, c3), as that's where you'd get some space savings. It'd be good to determine where the break-even point is in terms of sparseness.

        We could potentially improve our new storage format for sparse storage, but I'm not sure we'll find one optimum format for both dense and sparse storage. Enabling new storage formats to be defined will be valuable for this reason.
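
        As a concrete illustration of that encoded-names-plus-multiple-key-values combination, here is a minimal sketch, assuming the IMMUTABLE_STORAGE_SCHEME and COLUMN_ENCODED_BYTES table properties from the 4.10+ column mapping work; the table name and the shortened column list are hypothetical.

        -- Encoded (numeric) column qualifiers, but still one cell per column,
        -- so columns that are null in a sparse row take no space
        CREATE TABLE T_SPARSE (K1 INTEGER NOT NULL, K2 INTEGER NOT NULL,
            C1 VARCHAR, C2 VARCHAR, C3 VARCHAR  -- ...through C5000
            CONSTRAINT PK PRIMARY KEY (K1, K2))
            IMMUTABLE_ROWS=true,
            IMMUTABLE_STORAGE_SCHEME=ONE_CELL_PER_COLUMN,
            COLUMN_ENCODED_BYTES=2;  -- 2-byte qualifiers are enough for 5000 columns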

        Ankit Singhal added a comment -

        I did a comparison earlier with a table having all data types (except arrays) and dense records. Posting it here in case it saves Mujtaba Chohan some effort.

        Table (input data size = 121M)    | TABLE_SINGLE_KV (PHOENIX-2565) | TABLE_LARGE_COLUMN_NAME | TABLE_SMALL_COLUMN_NAME
        ----------------------------------|--------------------------------|-------------------------|------------------------
        UPSERT                            | 25.295 sec                     | 47.315 sec              | 46.779 sec
        COUNT                             | 5.95 sec                       | 7.719 sec               | 7.91 sec
        No compression (after compaction) | 183M                           | 182M                    | 182M
        GZ (compression ratio)            | 38M (4.32:1)                   | 44M (2.75:1)            | 41M (2.95:1)
        Snappy (compression ratio)        | 50M (2.42:1)                   | 56M (2.16:1)            | 56M (2.16:1)

        PHOENIX-2565 vs RowkeySchema

        Encoding (input data size = 111M) | Phoenix (single KV, PHOENIX-2565) | Phoenix with new encoding (like RowKey)
        ----------------------------------|-----------------------------------|----------------------------------------
        No compression (encoding ratio)   | 143M (1:0.77)                     | 95M (1.16:1)
        Snappy (compression ratio)        | 51M (2.17:1)                      | 46M (2.41:1)
        Mujtaba Chohan added a comment -

        Sure James Taylor, agreed. I see this is not optimized for sparse columns, but for one of our internal use cases, which is based on schemas driven by customers, encoded columns could potentially be used this way, so it's at least good to know the limits and the break-even point.

        I also tested with slightly longer column names (column_1 ... column_5000) and the comparative data sizes were the same, which might be due to the FAST_DIFF block encoding that we have on by default.

        Thanks Ankit Singhal for those data points.

        James Taylor added a comment -

        Mujtaba Chohan - is this essentially an issue that will be covered by documentation on when to use the column encoding feature?

        Mujtaba Chohan added a comment -

        Correct James Taylor, it's for documentation, and it's somewhat circumvented by using Snappy compression.
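
        For the documentation, here is a minimal sketch of that Snappy workaround, assuming HBase properties can be set through Phoenix DDL; the table name is hypothetical, and existing data only picks up the new codec after a major compaction.

        -- New table with Snappy enabled up front
        CREATE TABLE T (K1 INTEGER NOT NULL, K2 INTEGER NOT NULL, C1 VARCHAR
            CONSTRAINT PK PRIMARY KEY (K1, K2))
            IMMUTABLE_ROWS=true, COMPRESSION='SNAPPY';

        -- Or enable it on an existing table; a major compaction rewrites older store files
        ALTER TABLE T SET COMPRESSION='SNAPPY';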


          People

          • Assignee: Samarth Jain
          • Reporter: Mujtaba Chohan
          • Votes: 0
          • Watchers: 4
