Details

      Description

      Table partitioning provides many facilities for maintaining large tables. First of all, it enables the data management system to prune input data that is not actually needed. In addition, it gives the system more optimization opportunities that exploit physical layouts.

      Basically, Tajo should follow the RDBMS-style partitioning system, including range, list, hash, and so on. In order to keep Hive compatibility, we also need to add a Hive partition type that does not exist in existing DBMS systems.

        Activity

        Hyunsik Choi created issue -
        Min Zhou added a comment -

        How do you implement partitioned tables? One HDFS file per partition, or one HDFS directory per partition? Is there any shuffle when generating a partitioned table?

        Hyunsik Choi added a comment -

        Hi Min,

        Currently, we assume that one HDFS directory is one partition. First, we will support Hive-style partitioning. Later, we will support range/interval, list, hash, and composite partitioning.

        For Hive-style partitioning, ColumnPartitionedTableStoreExec was implemented in TAJO-329. Tajo already uses hash and range shuffles for distributed group-by, join, and sort, so I expect that we can easily implement the other partition types.
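        To illustrate the directory-per-partition idea, here is a minimal sketch (the table name, columns, and paths are hypothetical, and the DDL follows Tajo's `PARTITION BY COLUMN` syntax as I understand it):

```sql
-- Hive-style (column) partitioning: each distinct value of the
-- partition column becomes one HDFS directory, e.g. (illustrative paths):
--   /warehouse/logs/dt=20131221/
--   /warehouse/logs/dt=20131222/
CREATE TABLE logs (host TEXT, msg TEXT)
  PARTITION BY COLUMN (dt TEXT);
```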

        Min Zhou added a comment -

        How will a user define a partitioned table? It seems that you have only implemented the DDL (CREATE TABLE).

        CTAS (CREATE TABLE AS SELECT) cannot directly produce a partitioned table, right?

        What kind of SQL statement will be translated into a ColumnPartitionedTableStoreExec operator?

        Jihoon Son added a comment -

        Hi, Min.
        TAJO-285 will help you.
        Thanks.

        Hyunsik Choi added a comment -

        Currently, as you mentioned, we have only implemented the DDL and 'INSERT OVERWRITE INTO' for partitioned tables.

        Please take a look at the TestInsertQuery::testInsertOverwritePartition* unit tests. First, you create a partitioned table with a 'PARTITION BY' clause. Then, you can store data into the partitioned table by executing an 'INSERT OVERWRITE INTO' statement. Later, Tajo will support CTAS into a partitioned table.
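        A minimal sketch of that workflow (the identifiers are illustrative, not taken from the test suite):

```sql
-- 1. Create a column-partitioned table with a PARTITION BY clause.
CREATE TABLE sales (id INT, amount FLOAT8)
  PARTITION BY COLUMN (sale_date TEXT);

-- 2. Store data into it with an INSERT OVERWRITE INTO statement;
--    the partition column appears in the select list.
INSERT OVERWRITE INTO sales
  SELECT id, amount, sale_date FROM raw_sales;
```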

        Basically, the logical planner can know whether an insert statement targets a partitioned table or not. If it does, StoreTableNode will have partition information, and then PhysicalPlannerImpl::createStorePlan chooses a proper partitioned store executor. Note that Tajo has used the word 'partition' to mean shuffle, so this may be confusing. We will refactor those names as soon as possible.

        In addition, partitioned tables in Tajo are still under heavy development. Right now, I'm implementing the query optimization part for partition pruning. I think we need more refactoring and refinement of the partitioned table code.

        Min Zhou added a comment -

        Thanks for the explanation. I noticed that the files generated by ColumnPartitionedTableStoreExec are in child directories of the staging dir; typically those files should be on HDFS. If we scan a regular table and insert it into a partitioned table, there is no data shuffling, just like a map-only hash, right?

        Hyunsik Choi added a comment -

        The result in the staging dir is finally moved to the specified output directory. Usually, the output is moved to the warehouse dir (e.g., /tajo/warehouse/xxxx).

        In TAJO-329, Jaehwa implemented a table partition executor for column-partitioned tables. Interestingly, TAJO-329 works correctly without any shuffle. However, this approach will create too many output files, up to the number of HDFS blocks, which does not fit HDFS's characteristics.

        So, I'm going to modify the distributed planner to allow a partitioned table store operator to use a proper shuffle method. For example, hash shuffle is good for the column, list, and hash partition types, and range shuffle is good for range partitioning. In some special cases, table partitioning without shuffle may be useful after TAJO-385, which merges a number of fragments into fewer fragments.

        Thanks!

        Min Zhou added a comment -

        Great! Thanks for the information.

        I was thinking about the small-HDFS-files issue if we don't merge through a shuffle. The number of files would be M * R, where M is the number of mapper tasks and R is the number of reducer tasks. If data shuffling is added, the file count would drop to R.

        Hyunsik Choi added a comment -

        Min,

        You are right. In Tajo, the number of files will be at most T x K, where T is the number of leaf tasks and K is the number of distinct keys. I confused the point earlier. Thank you for the correction.

        Min Zhou added a comment -

        I spent a couple of hours on the patches related to this feature. I finally realized that it's quite different from Hive's partitioning, but similar to Hive's buckets, which are generated by the DISTRIBUTE BY / CLUSTERED BY clauses.

        CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING) 
        COMMENT 'A bucketed copy of user_info' 
        CLUSTERED BY(user_id) INTO 256 BUCKETS;
        

        Hive's partitioning is quite simple: normally each partition maps to an HDFS directory, and the partition column is used like a column of the table, for example SELECT * FROM tbl WHERE part_date = '20131222'. There is one record in Hive's metadata for each partition of a table. Thus, if someone selects just one partition, say '20131222', Hive finds the partitions involved in the SQL through metadata and skips the HDFS directories that are not needed for the query. This, so-called partition pruning, is executed on the planner side of Hive.

        By contrast, Tajo stores only one record in the catalog for the partitions of a table, recording the number of partitions rather than the details of each partition. This may work for hash/column partitions, but what about list/range partitions? Furthermore, if we want to benefit from partition pruning as Hive does, how can we skip the I/O when there is no metadata recording the path of each partition?
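        For example, a per-partition catalog record could look roughly like this (a purely hypothetical schema sketched to illustrate the suggestion; Tajo's actual catalog tables may differ):

```sql
-- One row per partition, so the planner can prune by matching a query
-- predicate (e.g., part_date = '20131222') against PARTITION_VALUE and
-- skip the HDFS paths of every non-matching partition.
CREATE TABLE PARTITIONS (
  TID             INT,            -- id of the partitioned table
  PARTITION_NAME  VARCHAR(128),   -- e.g., 'part_date=20131222'
  PARTITION_VALUE VARCHAR(1024),  -- value(s) of the partition key
  PATH            VARCHAR(4096)   -- HDFS directory of this partition
);
```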

        The reason I think Tajo's partitioning is like Hive's buckets is that both are designed to distribute rows according to a column's value. In the early days of Hive, we used it this way: each table had a daily update, and people had to create a brand-new table for each new report day, like tbl_20090101, tbl_20090102, .... This was quite ugly and messy, so the Facebook folks created table partitioning, and later partition pruning, to avoid scanning the whole table.

        Min Zhou added a comment -

        I wrote the above comment simply to suggest that each partition should have its own record in the catalog.

        Min

        Jihoon Son added a comment -

        Thanks, Min, for your comments!
        Your suggestion looks reasonable, and it would increase query performance by reducing I/O overheads.
        But are there any problems when the number of partitions is large?

        Jihoon

        Hyunsik Choi added a comment -

        Min,

        Thank you for your comment and for reminding us where Tajo should go. I totally agree with you, and your suggestion is exactly what we planned. Actually, similar issues were discussed in TAJO-284. We haven't summarized the details so far, so we may need to describe the issue in more detail.

        Later, each partition of the hash, range, and column partition types will have its own row with indexable and searchable columns, as in an RDBMS, for efficient pruning. We should do that. Currently, Tajo's partitioning is under heavy development, and we usually work in a quick-and-dirty manner until a release, so the partition work is in progress =) In addition, we need an 'alter partition' statement and the distributed plan part in order to complete column-partitioned table support.

        I really welcome such suggestions and comments, and please keep going.

        Merry Christmas!

        Min Zhou added a comment - edited

        Jihoon Son
        That's absolutely a good question!
        I have thought about this problem. First, we should figure out how large a number of partitions is acceptable. From my experience, MySQL works well if we insert thousands of rows at a time; even tens of thousands are still acceptable. But if the order of magnitude grows to hundreds of thousands, or even millions or more, MySQL becomes very slow at inserting and retrieving those records.
        When using HASH partitioning, since we can define the number of buckets of the hash function, I think the number is under control, normally tens or hundreds. RANGE and LIST partitioning work as well, because the partitions are enumerable. The worst situation, I think, is COLUMN partitioning on a table, which is quite similar to Hive's dynamic partitioning, as below.

        CREATE TABLE dst_tbl (key int, value string) PARTITIONED BY (col1 string, col2 int) AS
        SELECT key, value, col1, col2 FROM src_tbl;
        

        Query users often have no knowledge of the table's value distribution. If the table has high cardinality (i.e., very many distinct values), that would be a disaster in the following areas:
        1. The number of files/directories on HDFS would be very large, putting heavy pressure on the HDFS namenode's memory.
        2. As you mentioned, this would be a big problem for the catalog.

        Actually, for the above reasons, at Alibaba.com, my previous employer, which has one of the largest single Hadoop clusters in the world, we disabled dynamic partitioning. I think you will run into the same problem when using column partitioning. I don't know why you decided to support this feature; could you give me some background on it? How can we benefit from column partitions?

        Hyunsik Choi
        It's good to know that Tajo will support indexes. I saw the binary search tree index in the branch. Actually, I am considering adding a Lucene index to Tajo, through which we could implement an online BA system on top of Tajo, like SenseiDB. We could do group-by aggregations on billions of rows in only a few milliseconds. If I implement it, we can put Tajo into production at LinkedIn, my current employer.

        Hyunsik Choi Jihoon Son
        Thank you. Merry Christmas!

        Min

        Jihoon Son added a comment -

        Min Zhou
        The biggest advantage of supporting column partitioning is compatibility with Hive.
        Since many Hive users already use column partitions, more users can consider Tajo as a replacement for Hive if column partitioning is supported.
        As you said, column partitioning can cause problems when there are a large number of partitions.
        We should devise a solution to handle that.

        In my opinion, supporting a Lucene index looks great!
        Since unstructured data are commonly processed in the Hadoop world, I think Tajo also needs to provide processing of unstructured data as well as relational data.
        Since Lucene is optimized for processing documents, it would be useful in Tajo, too.
        But it should be transparent to Tajo users, and its queries should be expressed in a SQL-like form.

        Thanks for your detailed response and great suggestion.

        Min Zhou Hyunsik Choi
        Happy Christmas!

        Jihoon

        Hyunsik Choi made changes -
        Fix Version/s: 1.0-incubating [ 12325340 ]
        Fix Version/s: 0.8-incubating [ 12324253 ]
        Hyunsik Choi made changes -
        Affects Version/s: 0.8-incubating [ 12324253 ]
        Affects Version/s: 1.0-incubating [ 12325340 ]

          People

          • Assignee:
            Hyunsik Choi
          • Reporter:
            Hyunsik Choi
          • Votes:
            0
          • Watchers:
            5
