Uploaded image for project: 'CarbonData'
  1. CarbonData
  2. CARBONDATA-910

Implement Partition feature

    XMLWordPrintableJSON

Details

    • New Feature
    • Status: Closed
    • Major
    • Resolution: Invalid
    • None
    • None
    • core, data-load, data-query
    • None

    Description

      Why need partition table
      Partition table provide an option to divide table into some smaller pieces.
      With partition table:
      1. Data could be better managed, organized and stored.
      2. We can avoid full table scan in some scenario and improve query performance. (partition column in filter,
      multiple partition tables join in the same partition column etc.)

      Partitioning design
      Range Partitioning
      range partitioning maps data to partitions according to the range of partition column values, operator '<' defines non-inclusive upper bound of current partition.
      List Partitioning
      list partitioning allows you map data to partitions with specific value list
      Hash Partitioning
      hash partitioning maps data to partitions with hash algorithm and put them to the given number of partitions
      Composite Partitioning(2 levels at most for now)
      Range-Range, Range-List, Range-Hash, List-Range, List-List, List-Hash, Hash-Range, Hash-List, Hash-Hash

      DDL-Create
      Create table sales(
      itemid long,
      logdate datetime,
      customerid int
      ...
      ...)
      [partition by range logdate(...)]
      [subpartition by list area(...)]
      Stored By 'carbondata'
      [tblproperties(...)];

      range partition:
      partition by range logdate(< '2016-01-01', < '2017-01-01', < '2017-02-01', < '2017-03-01', < '2099-01-01')
      list partition:
      partition by list area('Asia', 'Europe', 'North America', 'Africa', 'Oceania')
      hash partition:
      partition by hash(itemid, 9)
      composite partition:
      partition by range logdate(< '2016- -01', < '2017-01-01', < '2017-02-01', < '2017-03-01', < '2099-01-01')
      subpartition by list area('Asia', 'Europe', 'North America', 'Africa', 'Oceania')

      DDL-Rebuild, Add
      Alter table sales rebuild partition by (range|list|hash)(...);
      Alter table salse add partition (< '2018-01-01'); #only support range partitioning, list partitioning
      Alter table salse add partition ('South America');

      #Note: No delete operation for partition, please use rebuild.
      If need delete data, use delete statement, but the definition of partition will not be deleted.

      Partition Table Data Store
      [Option One]
      Use the current design, keep partition folder out of segments
      Fact

      ___Part0
        ___Segment_0
        ___ *******[bucketId].carbondata
        ___ *******[bucketId].carbondata
        ___Segment_1
      ...
      ___Part1
        ___Segment_0
        ___Segment_1
      ...

      [Option Two]
      remove partition folder, add partition id into file name and build btree in driver side.
      Fact

      ___Segment_0
        ___ *******[bucketId][partitionId].carbondata
        ___ *******[bucketId][partitionId].carbondata
      ___Segment_1
      ___Segment_2
      ...

      Pros & Cons:
      Option one would be faster to locate target files
      Option two need to store more metadata of folders

      Partition Table MetaData Store
      partitioni info should be stored in file footer/index file and load into memory before user query.

      Relationship with Bucket
      Bucket should be lower level of partition.

      Partition Table Query

      Example:
      Select * from sales
      where logdate <= date '2016-12-01';

      User should remember to add a partition filter when write SQL on a partition table.

      Attachments

        Issue Links

          Activity

            People

              lucao Cao, Lionel
              lucao Cao, Lionel
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 82.5h
                  82.5h