Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0, 0.9.0
    • Component/s: None
    • Labels:

      Description

      Parquet is a columnar storage format developed by Twitter. Implement Parquet (http://parquet.io/) support for Tajo.

      The implementation consists of the following:

      • ParquetScanner and ParquetAppender - FileScanner and FileAppenders for reading and writing Parquet.
      • TajoParquetReader and TajoParquetWriter - Top-level reader and writer for serializing/deserializing to Tajo Tuples.
      • TajoReadSupport and TajoWriteSupport - Abstractions to perform conversion between Parquet and Tajo records.
      • TajoRecordMaterializer - Materializes Tajo Tuples from Parquet's internal representation.
      • TajoRecordConverter - Used by TajoRecordMateriailzer to materialize a Tajo Tuple.
      • TajoSchemaConverter - Converts between Tajo and Parquet schemas.
      1. TAJO-30_20140326_21:04:36.patch
        74 kB
        David Chen
      2. TAJO-30_20140326_05:34:17.patch
        74 kB
        David Chen
      3. TAJO-30_20140326_05:06:57.patch
        74 kB
        David Chen
      4. null_handling.patch
        4 kB
        Hyunsik Choi
      5. TAJO-30.patch
        56 kB
        David Chen

        Issue Links

          Activity

          Hide
          davidzchen David Chen added a comment -

          Excellent. Thanks!

          Show
          davidzchen David Chen added a comment - Excellent. Thanks!
          Hide
          hudson Hudson added a comment -

          SUCCESS: Integrated in Tajo-master-build #139 (See https://builds.apache.org/job/Tajo-master-build/139/)
          TAJO-30: Parquet Integration. (David Chen via hyunsik) (hyunsik: rev 9d00f9fefc36074098ce8117f752713ae020655b)

          • tajo-storage/src/main/java/org/apache/tajo/storage/parquet/TajoParquetReader.java
          • tajo-storage/src/main/java/org/apache/tajo/storage/parquet/TajoReadSupport.java
          • tajo-catalog/tajo-catalog-common/src/main/proto/CatalogProtos.proto
          • tajo-storage/src/main/java/org/apache/tajo/storage/parquet/package-info.java
          • tajo-storage/src/main/java/org/apache/tajo/storage/parquet/TajoParquetWriter.java
          • tajo-storage/src/main/java/org/apache/tajo/storage/parquet/ParquetAppender.java
          • tajo-storage/src/test/java/org/apache/tajo/storage/TestStorages.java
          • tajo-storage/src/test/resources/storage-default.xml
          • tajo-storage/src/main/java/org/apache/tajo/storage/parquet/ParquetScanner.java
          • tajo-storage/src/test/java/org/apache/tajo/storage/parquet/TestSchemaConverter.java
          • tajo-storage/src/main/java/org/apache/tajo/storage/parquet/TajoRecordMaterializer.java
          • tajo-storage/src/test/java/org/apache/tajo/storage/v2/TestStorages.java
          • tajo-storage/src/main/resources/storage-default.xml
          • tajo-storage/src/main/java/org/apache/tajo/storage/parquet/TajoWriteSupport.java
          • tajo-storage/pom.xml
          • tajo-storage/src/test/java/org/apache/tajo/storage/TestMergeScanner.java
          • tajo-storage/src/main/java/org/apache/tajo/storage/parquet/TajoRecordConverter.java
          • tajo-catalog/tajo-catalog-common/src/main/java/org/apache/tajo/catalog/CatalogUtil.java
          • CHANGES.txt
          • tajo-storage/src/test/java/org/apache/tajo/storage/parquet/TestReadWrite.java
          • tajo-storage/src/main/java/org/apache/tajo/storage/parquet/TajoSchemaConverter.java
          Show
          hudson Hudson added a comment - SUCCESS: Integrated in Tajo-master-build #139 (See https://builds.apache.org/job/Tajo-master-build/139/ ) TAJO-30 : Parquet Integration. (David Chen via hyunsik) (hyunsik: rev 9d00f9fefc36074098ce8117f752713ae020655b) tajo-storage/src/main/java/org/apache/tajo/storage/parquet/TajoParquetReader.java tajo-storage/src/main/java/org/apache/tajo/storage/parquet/TajoReadSupport.java tajo-catalog/tajo-catalog-common/src/main/proto/CatalogProtos.proto tajo-storage/src/main/java/org/apache/tajo/storage/parquet/package-info.java tajo-storage/src/main/java/org/apache/tajo/storage/parquet/TajoParquetWriter.java tajo-storage/src/main/java/org/apache/tajo/storage/parquet/ParquetAppender.java tajo-storage/src/test/java/org/apache/tajo/storage/TestStorages.java tajo-storage/src/test/resources/storage-default.xml tajo-storage/src/main/java/org/apache/tajo/storage/parquet/ParquetScanner.java tajo-storage/src/test/java/org/apache/tajo/storage/parquet/TestSchemaConverter.java tajo-storage/src/main/java/org/apache/tajo/storage/parquet/TajoRecordMaterializer.java tajo-storage/src/test/java/org/apache/tajo/storage/v2/TestStorages.java tajo-storage/src/main/resources/storage-default.xml tajo-storage/src/main/java/org/apache/tajo/storage/parquet/TajoWriteSupport.java tajo-storage/pom.xml tajo-storage/src/test/java/org/apache/tajo/storage/TestMergeScanner.java tajo-storage/src/main/java/org/apache/tajo/storage/parquet/TajoRecordConverter.java tajo-catalog/tajo-catalog-common/src/main/java/org/apache/tajo/catalog/CatalogUtil.java CHANGES.txt tajo-storage/src/test/java/org/apache/tajo/storage/parquet/TestReadWrite.java tajo-storage/src/main/java/org/apache/tajo/storage/parquet/TajoSchemaConverter.java
          Hide
          hudson Hudson added a comment -

          SUCCESS: Integrated in Tajo-0.8.0-build #44 (See https://builds.apache.org/job/Tajo-0.8.0-build/44/)
          TAJO-30: Parquet Integration. (David Chen via hyunsik) (hyunsik: rev 7c21fce8891d4343c21c7550e983c25ee388cfef)

          • tajo-storage/src/test/resources/storage-default.xml
          • tajo-storage/src/test/java/org/apache/tajo/storage/TestMergeScanner.java
          • tajo-storage/src/main/java/org/apache/tajo/storage/parquet/package-info.java
          • tajo-storage/src/main/java/org/apache/tajo/storage/parquet/TajoSchemaConverter.java
          • tajo-storage/src/main/java/org/apache/tajo/storage/parquet/ParquetAppender.java
          • tajo-storage/src/test/java/org/apache/tajo/storage/parquet/TestSchemaConverter.java
          • tajo-storage/src/main/java/org/apache/tajo/storage/parquet/TajoRecordConverter.java
          • tajo-storage/src/main/resources/storage-default.xml
          • tajo-catalog/tajo-catalog-common/src/main/java/org/apache/tajo/catalog/CatalogUtil.java
          • tajo-storage/src/main/java/org/apache/tajo/storage/parquet/TajoParquetWriter.java
          • tajo-storage/src/test/java/org/apache/tajo/storage/TestStorages.java
          • tajo-storage/src/main/java/org/apache/tajo/storage/parquet/TajoRecordMaterializer.java
          • tajo-catalog/tajo-catalog-common/src/main/proto/CatalogProtos.proto
          • tajo-storage/src/test/java/org/apache/tajo/storage/v2/TestStorages.java
          • tajo-storage/src/main/java/org/apache/tajo/storage/parquet/TajoWriteSupport.java
          • tajo-storage/pom.xml
          • tajo-storage/src/main/java/org/apache/tajo/storage/parquet/TajoParquetReader.java
          • tajo-storage/src/test/java/org/apache/tajo/storage/parquet/TestReadWrite.java
          • CHANGES.txt
          • tajo-storage/src/main/java/org/apache/tajo/storage/parquet/TajoReadSupport.java
          • tajo-storage/src/main/java/org/apache/tajo/storage/parquet/ParquetScanner.java
          Show
          hudson Hudson added a comment - SUCCESS: Integrated in Tajo-0.8.0-build #44 (See https://builds.apache.org/job/Tajo-0.8.0-build/44/ ) TAJO-30 : Parquet Integration. (David Chen via hyunsik) (hyunsik: rev 7c21fce8891d4343c21c7550e983c25ee388cfef) tajo-storage/src/test/resources/storage-default.xml tajo-storage/src/test/java/org/apache/tajo/storage/TestMergeScanner.java tajo-storage/src/main/java/org/apache/tajo/storage/parquet/package-info.java tajo-storage/src/main/java/org/apache/tajo/storage/parquet/TajoSchemaConverter.java tajo-storage/src/main/java/org/apache/tajo/storage/parquet/ParquetAppender.java tajo-storage/src/test/java/org/apache/tajo/storage/parquet/TestSchemaConverter.java tajo-storage/src/main/java/org/apache/tajo/storage/parquet/TajoRecordConverter.java tajo-storage/src/main/resources/storage-default.xml tajo-catalog/tajo-catalog-common/src/main/java/org/apache/tajo/catalog/CatalogUtil.java tajo-storage/src/main/java/org/apache/tajo/storage/parquet/TajoParquetWriter.java tajo-storage/src/test/java/org/apache/tajo/storage/TestStorages.java tajo-storage/src/main/java/org/apache/tajo/storage/parquet/TajoRecordMaterializer.java tajo-catalog/tajo-catalog-common/src/main/proto/CatalogProtos.proto tajo-storage/src/test/java/org/apache/tajo/storage/v2/TestStorages.java tajo-storage/src/main/java/org/apache/tajo/storage/parquet/TajoWriteSupport.java tajo-storage/pom.xml tajo-storage/src/main/java/org/apache/tajo/storage/parquet/TajoParquetReader.java tajo-storage/src/test/java/org/apache/tajo/storage/parquet/TestReadWrite.java CHANGES.txt tajo-storage/src/main/java/org/apache/tajo/storage/parquet/TajoReadSupport.java tajo-storage/src/main/java/org/apache/tajo/storage/parquet/ParquetScanner.java
          Hide
          jihoonson Jihoon Son added a comment -

          I'm very happy that Tajo finally support Parquet!!

          Show
          jihoonson Jihoon Son added a comment - I'm very happy that Tajo finally support Parquet!!
          Hide
          hyunsik Hyunsik Choi added a comment -

          committed the latest patch to master and branch-0.8.0. Thanks a lot!

          Here is the commit log.
          https://git-wip-us.apache.org/repos/asf?p=tajo.git;a=commit;h=9d00f9fefc36074098ce8117f752713ae020655b

          Show
          hyunsik Hyunsik Choi added a comment - committed the latest patch to master and branch-0.8.0. Thanks a lot! Here is the commit log. https://git-wip-us.apache.org/repos/asf?p=tajo.git;a=commit;h=9d00f9fefc36074098ce8117f752713ae020655b
          Hide
          hyunsik Hyunsik Choi added a comment -

          +1

          Thank you for your nice contribution. Let's commit it!

          Show
          hyunsik Hyunsik Choi added a comment - +1 Thank you for your nice contribution. Let's commit it!
          Hide
          davidzchen David Chen added a comment -

          Hi Hyunsik,

          No problem. I have posted a new patch with the comment fix correcting the default Parquet block size. I think this patch is now ready to be committed.

          Thanks!
          David

          Show
          davidzchen David Chen added a comment - Hi Hyunsik, No problem. I have posted a new patch with the comment fix correcting the default Parquet block size. I think this patch is now ready to be committed. Thanks! David
          Hide
          davidzchen David Chen added a comment -

          Updated the review request against branch master in reviewboard
          https://reviews.apache.org/r/19573/

          Show
          davidzchen David Chen added a comment - Updated the review request against branch master in reviewboard https://reviews.apache.org/r/19573/
          Hide
          hyunsik Hyunsik Choi added a comment -

          Thanks a lot. If you submit the third revision, I'll commit the patch with a brief review.

          • hyunsik
          Show
          hyunsik Hyunsik Choi added a comment - Thanks a lot. If you submit the third revision, I'll commit the patch with a brief review. hyunsik
          Hide
          davidzchen David Chen added a comment - - edited

          Nice! That looks great!

          Ah good catch. The actual default block size is 128 MB according to the current parquet-mr sources (https://github.com/Parquet/parquet-mr/blob/8cc3e29cc28896cfd47c90abec97f1aa866832c1/parquet-hadoop/src/main/java/parquet/hadoop/ParquetWriter.java). I adapted that comment from parquet-avro's AvroParquetWriter, which was probably not updated when the default block size is changed.

          I believe the patch file for my third revision for fixing the Javadoc comments also includes all of the other changes as well. I can easily post a new patch that fixes this comment as well, or I can fix that comment in a future patch, such as for TAJO-714. What works best for you?

          Show
          davidzchen David Chen added a comment - - edited Nice! That looks great! Ah good catch. The actual default block size is 128 MB according to the current parquet-mr sources ( https://github.com/Parquet/parquet-mr/blob/8cc3e29cc28896cfd47c90abec97f1aa866832c1/parquet-hadoop/src/main/java/parquet/hadoop/ParquetWriter.java ). I adapted that comment from parquet-avro's AvroParquetWriter , which was probably not updated when the default block size is changed. I believe the patch file for my third revision for fixing the Javadoc comments also includes all of the other changes as well. I can easily post a new patch that fixes this comment as well, or I can fix that comment in a future patch, such as for TAJO-714 . What works best for you?
          Hide
          hyunsik Hyunsik Choi added a comment -

          Big +1 for the latest patch.

          I have a trivial question. As you can see the comments in TajoParquetWriter, the default block is stated as 50 MB. I'm expecting that you may intend 128MB. Would you mind checking it?

          If we need to modify the comment, I'll commit your latest patch + the comment fix without letting you resubmit a updated patch. Because this is is very trivial, this way would be more productive.

          /**
             * Creates a new TajoParquetWriter. The default block size is 50 MB.
             * The default page size is 1 MB. Default compression is no compression.
             *
             * @param file The Path of the file to write to.
             * @param schema The Tajo schema of the table.
             * @throws IOException
             */
            public TajoParquetWriter(Path file, Schema schema) throws IOException {
          

          FYI, I share how to create parquet tables in tajo shell. You may already know the way.

          tpch> create table orders_parquet5 using parquet as select * from orders;
          Progress: 0%, response time: 0.405 sec
          Progress: 0%, response time: 1.207 sec
          Progress: 33%, response time: 2.209 sec
          Progress: 83%, response time: 3.211 sec
          Progress: 100%, response time: 3.471 sec
          final state: QUERY_SUCCEEDED, response time: 3.471 sec
          OK
          tpch> select count(*) from orders_parquet5;
          Progress: 8%, response time: 0.396 sec
          Progress: 100%, response time: 0.923 sec
          2014-03-27 11:02:49,276 WARN  storage.AbstractStorageManager (AbstractStorageManager.java:<init>(86)) - does not support block metadata. ('dfs.datanode.hdfs-blocks-metadata.enabled')
          final state: QUERY_SUCCEEDED, response time: 0.923 sec
          result: hdfs://127.0.0.1:8020/tmp/tajo-hyunsik/staging/q_1395884147706_0005/RESULT, 1 rows (8 B)
          ?count
          -------------------------------
          1500000
          tpch> 
          
          Show
          hyunsik Hyunsik Choi added a comment - Big +1 for the latest patch. I have a trivial question. As you can see the comments in TajoParquetWriter, the default block is stated as 50 MB. I'm expecting that you may intend 128MB. Would you mind checking it? If we need to modify the comment, I'll commit your latest patch + the comment fix without letting you resubmit a updated patch. Because this is is very trivial, this way would be more productive. /** * Creates a new TajoParquetWriter. The default block size is 50 MB. * The default page size is 1 MB. Default compression is no compression. * * @param file The Path of the file to write to. * @param schema The Tajo schema of the table. * @throws IOException */ public TajoParquetWriter(Path file, Schema schema) throws IOException { FYI, I share how to create parquet tables in tajo shell. You may already know the way. tpch> create table orders_parquet5 using parquet as select * from orders; Progress: 0%, response time: 0.405 sec Progress: 0%, response time: 1.207 sec Progress: 33%, response time: 2.209 sec Progress: 83%, response time: 3.211 sec Progress: 100%, response time: 3.471 sec final state: QUERY_SUCCEEDED, response time: 3.471 sec OK tpch> select count(*) from orders_parquet5; Progress: 8%, response time: 0.396 sec Progress: 100%, response time: 0.923 sec 2014-03-27 11:02:49,276 WARN storage.AbstractStorageManager (AbstractStorageManager.java:<init>(86)) - does not support block metadata. ('dfs.datanode.hdfs-blocks-metadata.enabled') final state: QUERY_SUCCEEDED, response time: 0.923 sec result: hdfs: //127.0.0.1:8020/tmp/tajo-hyunsik/staging/q_1395884147706_0005/RESULT, 1 rows (8 B) ?count ------------------------------- 1500000 tpch>
          Hide
          tajoqa Tajo QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12636910/TAJO-30_20140326_05%3A34%3A17.patch
          against master revision e06ffa9.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The applied patch does not increase the total number of javadoc warnings.

          +1 checkstyle. The patch generated 0 code style errors.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in tajo-catalog/tajo-catalog-common tajo-storage.

          Test results: https://builds.apache.org/job/PreCommit-TAJO-Build/266//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-TAJO-Build/266//artifact/incubator-tajo/patchprocess/newPatchFindbugsWarningstajo-catalog-common.html
          Console output: https://builds.apache.org/job/PreCommit-TAJO-Build/266//console

          This message is automatically generated.

          Show
          tajoqa Tajo QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12636910/TAJO-30_20140326_05%3A34%3A17.patch against master revision e06ffa9. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified test files. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 javadoc. The applied patch does not increase the total number of javadoc warnings. +1 checkstyle. The patch generated 0 code style errors. -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in tajo-catalog/tajo-catalog-common tajo-storage. Test results: https://builds.apache.org/job/PreCommit-TAJO-Build/266//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-TAJO-Build/266//artifact/incubator-tajo/patchprocess/newPatchFindbugsWarningstajo-catalog-common.html Console output: https://builds.apache.org/job/PreCommit-TAJO-Build/266//console This message is automatically generated.
          Hide
          jihoonson Jihoon Son added a comment -

          +1.
          The latest patch is great to me, too!

          Show
          jihoonson Jihoon Son added a comment - +1. The latest patch is great to me, too!
          Hide
          davidzchen David Chen added a comment -

          Updated the review request against branch master in reviewboard
          https://reviews.apache.org/r/19573/

          Show
          davidzchen David Chen added a comment - Updated the review request against branch master in reviewboard https://reviews.apache.org/r/19573/
          Hide
          tajoqa Tajo QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12636908/TAJO-30_20140326_05%3A06%3A57.patch
          against master revision e06ffa9.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 javadoc. The applied patch generated 50 javadoc warnings (more than the trunk's current 46 warnings).

          +1 checkstyle. The patch generated 0 code style errors.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in tajo-catalog/tajo-catalog-common tajo-storage.

          Test results: https://builds.apache.org/job/PreCommit-TAJO-Build/265//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-TAJO-Build/265//artifact/incubator-tajo/patchprocess/newPatchFindbugsWarningstajo-catalog-common.html
          Console output: https://builds.apache.org/job/PreCommit-TAJO-Build/265//console

          This message is automatically generated.

          Show
          tajoqa Tajo QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12636908/TAJO-30_20140326_05%3A06%3A57.patch against master revision e06ffa9. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified test files. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 javadoc. The applied patch generated 50 javadoc warnings (more than the trunk's current 46 warnings). +1 checkstyle. The patch generated 0 code style errors. -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in tajo-catalog/tajo-catalog-common tajo-storage. Test results: https://builds.apache.org/job/PreCommit-TAJO-Build/265//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-TAJO-Build/265//artifact/incubator-tajo/patchprocess/newPatchFindbugsWarningstajo-catalog-common.html Console output: https://builds.apache.org/job/PreCommit-TAJO-Build/265//console This message is automatically generated.
          Hide
          davidzchen David Chen added a comment -

          Hi Hyunsik,

          Excellent! That is really exciting to hear!

          Agreed, I have not added support for setting Parquet's tuning parameters such as column compression, row group and page size, enabling dictionary encoding, etc. I have opened TAJO-714 to track the work for adding support for those options.

          I have updated the review with a new patch. In addition to the changes I listed previously, I also added some unit tests for TajoSchemaConverter and, in doing so, fixed a bug in the conversion from Parquet strings to Tajo strings. I think this patch is ready unless there is additional feedback.

          Thanks,
          David

          Show
          davidzchen David Chen added a comment - Hi Hyunsik, Excellent! That is really exciting to hear! Agreed, I have not added support for setting Parquet's tuning parameters such as column compression, row group and page size, enabling dictionary encoding, etc. I have opened TAJO-714 to track the work for adding support for those options. I have updated the review with a new patch. In addition to the changes I listed previously, I also added some unit tests for TajoSchemaConverter and, in doing so, fixed a bug in the conversion from Parquet strings to Tajo strings. I think this patch is ready unless there is additional feedback. Thanks, David
          Hide
          davidzchen David Chen added a comment -

          Updated the review request against branch master in reviewboard
          https://reviews.apache.org/r/19573/

          Show
          davidzchen David Chen added a comment - Updated the review request against branch master in reviewboard https://reviews.apache.org/r/19573/
          Hide
          hyunsik Hyunsik Choi added a comment -

          Hi David,

          Thank you for opening new tickets.

          After I added the missed property to storage-default.xml, I created a parquet format file from CTAS on a real cluster and executed some queries. it worked well and was super fast. Your patch is unquestionable. The patch is straightforward and contains enough unit tests. I don't have additional comment. After your submit your in-progress work, I'll finish the review.

          There may be additional issues related to Parquet, such as how to specify column compression, how to set block or page size, and how to enable dictionary. I think that they don't need to be solved right now. It would be great if we create additional issues for them in order to avoid one big issue.

          Thanks,
          Hyunsik

          Show
          hyunsik Hyunsik Choi added a comment - Hi David, Thank you for opening new tickets. After I added the missed property to storage-default.xml, I created a parquet format file from CTAS on a real cluster and executed some queries. it worked well and was super fast. Your patch is unquestionable. The patch is straightforward and contains enough unit tests. I don't have additional comment. After your submit your in-progress work, I'll finish the review. There may be additional issues related to Parquet, such as how to specify column compression, how to set block or page size, and how to enable dictionary. I think that they don't need to be solved right now. It would be great if we create additional issues for them in order to avoid one big issue. Thanks, Hyunsik
          Hide
          hyunsik Hyunsik Choi added a comment -

          Hi David,

          You don't need to add the change log in CHANGES.txt because someone who finally commits your patch will add changes with your and his names. I also think that we can add your Parquet work to 0.8.0.

          Best regards,
          Hyunsik

          Show
          hyunsik Hyunsik Choi added a comment - Hi David, You don't need to add the change log in CHANGES.txt because someone who finally commits your patch will add changes with your and his names. I also think that we can add your Parquet work to 0.8.0. Best regards, Hyunsik
          Hide
          davidzchen David Chen added a comment -

          Great to hear. Thanks, Jinho Kim!

          I noticed that you guys add an entry to CHANGES.txt with your commits. Do I need to do the same with my changes as well? Will Parquet support be going into 0.8.0?

          Show
          davidzchen David Chen added a comment - Great to hear. Thanks, Jinho Kim ! I noticed that you guys add an entry to CHANGES.txt with your commits. Do I need to do the same with my changes as well? Will Parquet support be going into 0.8.0?
          Hide
          davidzchen David Chen added a comment - - edited

          Thanks for the clarification, Hyunsik.

          I have fixed the way NULLs are handled. I will post an updated patch after adding some more Javadoc comments and cleanup.

          By the way, I have opened a few tickets:

          • TAJO-709 - Add .reviewboardrc and clean up request-patch-review.py to use rbt
          • TAJO-710 - Support nested schemas and non-scalar types
          • TAJO-711 - Add Avro storage support. This is not for switching from Protobuf to Avro as discussed in TAJO-28 but to add a FileScanner and FileAppender for Avro. We currently store most of our data in Avro, and having Avro storage support is a must for us. I have a pretty good understanding of Avro and have contributed some to improvements to Parquet's Avro integration and would be happy to pick this up after we commit this patch.

          Thanks,
          David

          Show
          davidzchen David Chen added a comment - - edited Thanks for the clarification, Hyunsik. I have fixed the way NULLs are handled. I will post an updated patch after adding some more Javadoc comments and cleanup. By the way, I have opened a few tickets: TAJO-709 - Add .reviewboardrc and clean up request-patch-review.py to use rbt TAJO-710 - Support nested schemas and non-scalar types TAJO-711 - Add Avro storage support. This is not for switching from Protobuf to Avro as discussed in TAJO-28 but to add a FileScanner and FileAppender for Avro. We currently store most of our data in Avro, and having Avro storage support is a must for us. I have a pretty good understanding of Avro and have contributed some to improvements to Parquet's Avro integration and would be happy to pick this up after we commit this patch. Thanks, David
          Hide
          jhkim Jinho Kim added a comment -

          Guys, TAJO-705 is done.

          Show
          jhkim Jinho Kim added a comment - Guys, TAJO-705 is done.
          Hide
          hyunsik Hyunsik Choi added a comment -

          Hi David,

          Yes, you are correct. You need to set NullDatum to the column positions in the Tuple. Thank you for your additional work.

          Thanks,
          Hyunsik

          Show
          hyunsik Hyunsik Choi added a comment - Hi David, Yes, you are correct. You need to set NullDatum to the column positions in the Tuple. Thank you for your additional work. Thanks, Hyunsik
          Hide
          davidzchen David Chen added a comment -

          Hi Hyunsik,

          Not a problem. I have applied your patch and I think I understand the problem with the way I am handling NULLs. Just to confirm that I am understanding it correctly, is it the case that if a field's value is NULL, then a NullDatum is set for that field in the Tuple?

          Here is the status of what I am working on before posting an updated patch:

          • [Done] Add missing property to tajo-storage/src/main/resource/storage-default.xml.
          • [Done] Add TableStatistics to ParquetAppender
          • [Done] Fix findbugs warnings
          • [Done] Apply patch to add testNullHandlingTypes
          • [Done] Add package-info.java documentation file that describes overview of Parquet support and mapping between Parquet and Tajo data types.
          • [In Progress] Fix the way parquet-tajo handles NULLs
          • Clean up and add more Javadoc comments

          Thanks,
          David

          Show
          davidzchen David Chen added a comment - Hi Hyunsik, Not a problem. I have applied your patch and I think I understand the problem with the way I am handling NULLs. Just to confirm that I am understanding it correctly, is it the case that if a field's value is NULL, then a NullDatum is set for that field in the Tuple? Here is the status of what I am working on before posting an updated patch: [Done] Add missing property to tajo-storage/src/main/resource/storage-default.xml. [Done] Add TableStatistics to ParquetAppender [Done] Fix findbugs warnings [Done] Apply patch to add testNullHandlingTypes [Done] Add package-info.java documentation file that describes overview of Parquet support and mapping between Parquet and Tajo data types. [In Progress] Fix the way parquet-tajo handles NULLs Clean up and add more Javadoc comments Thanks, David
          Hide
          hyunsik Hyunsik Choi added a comment -

          Hi David,

          While I'm testing this patch, I found some bug (TAJO-705) which already exist in CTAS operation. I'm suspecting that this bug is recently introduced. TAJO-705 is blocking some cluster tests. After we fix TAJO-705 as soon as possible, I'll finish the review of your work.

          Thanks,
          Hyunsik

          Show
          hyunsik Hyunsik Choi added a comment - Hi David, While I'm testing this patch, I found some bug ( TAJO-705 ) which already exist in CTAS operation. I'm suspecting that this bug is recently introduced. TAJO-705 is blocking some cluster tests. After we fix TAJO-705 as soon as possible, I'll finish the review of your work. Thanks, Hyunsik
          Hide
          tajoqa Tajo QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12636292/null_handling.patch
          against master revision 7283c58.

          -1 patch. The patch command could not apply the patch.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-TAJO-Build/251//console

          This message is automatically generated.

          Show
          tajoqa Tajo QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12636292/null_handling.patch against master revision 7283c58. -1 patch. The patch command could not apply the patch. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 1 new or modified test files. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-TAJO-Build/251//console This message is automatically generated.
          Hide
          hyunsik Hyunsik Choi added a comment -

          Hi David,

          I'm sorry for late response. The patch looks nice to me. Due to my other schedule, I'm still reviewing the patch.

          While I'm reviewing, I found that your Parquet work handles NULL values in a different way. So, I created a diff file which can be applied to your work. This diff contains a unit test to show the different null handling. I think that it would be great if this patch is included in your work.

          In addition, in this patch, the unit tests for Trevni format are commented out because our Trevni implementation cannot handle NULL value correctly. This problem will be addressed in another issue, so you don't need to care about Trevni.

          Regards,
          Hyunsik

          Show
          hyunsik Hyunsik Choi added a comment - Hi David, I'm sorry for late response. The patch looks nice to me. Due to my other schedule, I'm still reviewing the patch. While I'm reviewing, I found that your Parquet work handles NULL values in a different way. So, I created a diff file which can be applied to your work. This diff contains a unit test to show the different null handling. I think that it would be great if this patch is included in your work. In addition, in this patch, the unit tests for Trevni format are commented out because our Trevni implementation cannot handle NULL value correctly. This problem will be addressed in another issue, so you don't need to care about Trevni. Regards, Hyunsik
          Hide
          tajoqa Tajo QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12636167/TAJO-30.patch
          against master revision 7283c58.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 javadoc. The applied patch generated 51 javadoc warnings (more than the trunk's current 48 warnings).

          +1 checkstyle. The patch generated 0 code style errors.

          -1 findbugs. The patch appears to introduce 3 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in tajo-catalog/tajo-catalog-common tajo-storage.

          Test results: https://builds.apache.org/job/PreCommit-TAJO-Build/250//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-TAJO-Build/250//artifact/incubator-tajo/patchprocess/newPatchFindbugsWarningstajo-storage.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-TAJO-Build/250//artifact/incubator-tajo/patchprocess/newPatchFindbugsWarningstajo-catalog-common.html
          Console output: https://builds.apache.org/job/PreCommit-TAJO-Build/250//console

          This message is automatically generated.

          Show
          tajoqa Tajo QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12636167/TAJO-30.patch against master revision 7283c58. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 5 new or modified test files. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 javadoc. The applied patch generated 51 javadoc warnings (more than the trunk's current 48 warnings). +1 checkstyle. The patch generated 0 code style errors. -1 findbugs. The patch appears to introduce 3 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in tajo-catalog/tajo-catalog-common tajo-storage. Test results: https://builds.apache.org/job/PreCommit-TAJO-Build/250//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-TAJO-Build/250//artifact/incubator-tajo/patchprocess/newPatchFindbugsWarningstajo-storage.html Findbugs warnings: https://builds.apache.org/job/PreCommit-TAJO-Build/250//artifact/incubator-tajo/patchprocess/newPatchFindbugsWarningstajo-catalog-common.html Console output: https://builds.apache.org/job/PreCommit-TAJO-Build/250//console This message is automatically generated.
          Hide
          hyunsik Hyunsik Choi added a comment -

          David,

          Thank you for submitting your patch to RB. If you create the jira issue for improving the script request-patch-review.py, I'll appreciate your effort.

          Show
          hyunsik Hyunsik Choi added a comment - David, Thank you for submitting your patch to RB. If you create the jira issue for improving the script request-patch-review.py , I'll appreciate your effort.
          Hide
          jhkim Jinho Kim added a comment -

          Thank you for your contribution!
          I've just updated status to patch available

          Show
          jhkim Jinho Kim added a comment - Thank you for your contribution! I've just updated status to patch available
          Hide
          davidzchen David Chen added a comment - - edited

          Apologies for the spurious posts. I had a few issues with my review tools setup. I have posted a RB review: https://reviews.apache.org/r/19573/

          I'll post an updated patch with some changes and clean-up soon.

          By the way, I noticed that the request-patch-review.py tool uses the old post-review tool, which will be deprecated in favor of rbt soon. If you would like, I can open a ticket and help clean it up, add a .reviewboardrc to the source tree, and use rbt to post the review.

          Show
          davidzchen David Chen added a comment - - edited Apologies for the spurious posts. I had a few issues with my review tools setup. I have posted a RB review: https://reviews.apache.org/r/19573/ I'll post an updated patch with some changes and clean-up soon. By the way, I noticed that the request-patch-review.py tool uses the old post-review tool, which will be deprecated in favor of rbt soon. If you would like, I can open a ticket and help clean it up, add a .reviewboardrc to the source tree, and use rbt to post the review.
          Hide
          davidzchen David Chen added a comment -

          Created a review request against branch master in reviewboard

          Show
          davidzchen David Chen added a comment - Created a review request against branch master in reviewboard
          Hide
          jihoonson Jihoon Son added a comment - - edited

          David, the patch contains a large changes.
          Would you put your patch on RB, please?
          It will help us review the patch and leave comments more conveniently.
          Thanks!

          Show
          jihoonson Jihoon Son added a comment - - edited David, the patch contains a large changes. Would you put your patch on RB, please? It will help us review the patch and leave comments more conveniently. Thanks!
          Hide
          blrunner Jaehwa Jung added a comment -

          Great Job, David!
          Thank you for your contribution.

          Show
          blrunner Jaehwa Jung added a comment - Great Job, David! Thank you for your contribution.
          Hide
          davidzchen David Chen added a comment -

          Hyunsik Choi, Jihoon Son Thanks! I look forward to iterating on your feedback.

          Hyoungjun Kim Thanks for running the benchmark! Really excited that we're already seeing performance gains with Parquet. I'll add the missing property.

          Henry Saputra Agreed. In the meantime, I'm adding some more detailed Javadoc comments in the source code and a package-info.java that will give an overview of the Parquet support code, including a mapping of Tajo to Parquet data types.

          Yes, I have moved my changes to https://github.com/davidzchen/tajo/tree/parquet after seeing the great news that Tajo has graduated to a TLP. Congratulations!

          Show
          davidzchen David Chen added a comment - Hyunsik Choi , Jihoon Son Thanks! I look forward to iterating on your feedback. Hyoungjun Kim Thanks for running the benchmark! Really excited that we're already seeing performance gains with Parquet. I'll add the missing property. Henry Saputra Agreed. In the meantime, I'm adding some more detailed Javadoc comments in the source code and a package-info.java that will give an overview of the Parquet support code, including a mapping of Tajo to Parquet data types. Yes, I have moved my changes to https://github.com/davidzchen/tajo/tree/parquet after seeing the great news that Tajo has graduated to a TLP. Congratulations!
          Hide
          hsaputra Henry Saputra added a comment -

          Looks like the work has been moved to https://github.com/davidzchen/tajo/tree/parquet

          Show
          hsaputra Henry Saputra added a comment - Looks like the work has been moved to https://github.com/davidzchen/tajo/tree/parquet
          Hide
          hsaputra Henry Saputra added a comment -

          Thanks Hyunsik, that sounds like a plan.
          I will file a ticket under TAJO-658 for Parquet format support once this JIRA is resolved.

          Show
          hsaputra Henry Saputra added a comment - Thanks Hyunsik, that sounds like a plan. I will file a ticket under TAJO-658 for Parquet format support once this JIRA is resolved.
          Hide
          hyunsik Hyunsik Choi added a comment -

          Henry Saputra No problem. As you mentioned, we need some documentation for file formats. The best place is will be here (http://tajo.incubator.apache.org/docs/0.8.0/table_management/file_formats.html). If there is no volunteer, I'll take the documentation issue and will do until 0.8.0 release. If you want this issue, I can yield this issue for you.

          Show
          hyunsik Hyunsik Choi added a comment - Henry Saputra No problem. As you mentioned, we need some documentation for file formats. The best place is will be here ( http://tajo.incubator.apache.org/docs/0.8.0/table_management/file_formats.html ). If there is no volunteer, I'll take the documentation issue and will do until 0.8.0 release. If you want this issue, I can yield this issue for you.
          Hide
          hsaputra Henry Saputra added a comment - - edited

          The code looks good. One question though, how do we want to document this?
          I strongly suggest we add good documentation for this one.

          Apache Spark do very well in doc, for example in the new "SQL" support module the patch is accompanied with doc file [1] which makes easier to follow.

          [1] http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html

          Show
          hsaputra Henry Saputra added a comment - - edited The code looks good. One question though, how do we want to document this? I strongly suggest we add good documentation for this one. Apache Spark do very well in doc, for example in the new "SQL" support module the patch is accompanied with doc file [1] which makes easier to follow. [1] http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html
          Hide
          jihoonson Jihoon Son added a comment -

          Awesome work, David!!

          Show
          jihoonson Jihoon Son added a comment - Awesome work, David!!
          Hide
          hyunsik Hyunsik Choi added a comment - - edited

          I briefly take a look at the patch. Thank you David for your awesome contribution!
          I'll review the patch and comment today's night.

          Show
          hyunsik Hyunsik Choi added a comment - - edited I briefly take a look at the patch. Thank you David for your awesome contribution! I'll review the patch and comment today's night.
          Hide
          hjkim Hyoungjun Kim added a comment -

          Thanks David. I really wanted parquet file format.
          I already tested your patch shortly with TPCH 100 scale data.

          select sum(l_extendedprice*l_discount) as revenue from lineitem_100 where l_shipdate >= '1994-01-01' and l_shipdate < '1995-01-01' and l_discount >= 0.05 and l_discount <= 0.07 and l_quantity < 24;
          
          • TextFile: 23.113 sec
          • Parquet File: 10.996 sec

          The following property is missing in the patch(tajo-storage/src/main/resource/storage-default.xml).

            <property>
              <name>tajo.storage.scanner-handler.parquet.class</name>
              <value>org.apache.tajo.storage.parquet.ParquetScanner</value>
            </property>
          
          Show
          hjkim Hyoungjun Kim added a comment - Thanks David. I really wanted parquet file format. I already tested your patch shortly with TPCH 100 scale data. select sum(l_extendedprice*l_discount) as revenue from lineitem_100 where l_shipdate >= '1994-01-01' and l_shipdate < '1995-01-01' and l_discount >= 0.05 and l_discount <= 0.07 and l_quantity < 24; TextFile: 23.113 sec Parquet File: 10.996 sec The following property is missing in the patch(tajo-storage/src/main/resource/storage-default.xml). <property> <name>tajo.storage.scanner-handler.parquet.class</name> <value>org.apache.tajo.storage.parquet.ParquetScanner</value> </property>
          Hide
          davidzchen David Chen added a comment -

          Initial patch for adding Parquet support.

          Show
          davidzchen David Chen added a comment - Initial patch for adding Parquet support.
          Hide
          davidzchen David Chen added a comment - - edited

          I ended up special-casing handling of the NULL_TYPE columns. On the write side, NULL_TYPE columns are ignored and not written to the Parquet file. On the read side, I added code to handle the case where NULL_TYPE columns are in the projection.

          All the tajo-storage tests now pass. I am posting a patch.

          FYI, as of now, there is a test in Tajo Core Backend that is failing (org.apache.tajo.benchmarkTestTPCH). This appears to be unrelated to my changes because this test fails on the master branch as well.

          Show
          davidzchen David Chen added a comment - - edited I ended up special-casing handling of the NULL_TYPE columns. On the write side, NULL_TYPE columns are ignored and not written to the Parquet file. On the read side, I added code to handle the case where NULL_TYPE columns are in the projection. All the tajo-storage tests now pass. I am posting a patch. FYI, as of now, there is a test in Tajo Core Backend that is failing (org.apache.tajo.benchmarkTestTPCH). This appears to be unrelated to my changes because this test fails on the master branch as well.
          Hide
          davidzchen David Chen added a comment - - edited

          Hi Hyunsik,

          That's an interesting idea. Do you mean that Tajo will use Parquet as the default storage format or have all storage formats deserialize into a representation that follows the Dremel model? Parquet doesn't really have its own in-memory representation. Each of the Parquet packages basically deserialize into a given in-memory representation using the readers and writers. For example, parquet-avro deserializes into Avro GenericRecords (or SpecificRecords), parquet-pig deserializes into Pig Tuples, and my code deserializes into Tajo Tuples.

          My changes are currently in the parquet branch in my fork on GitHub: https://github.com/davidzchen/tajo/tree/parquet

          They are almost ready. During further testing, I found a few more issues, most of them I have now fixed. One thing I noticed was that when reading a projection, the resulting Tuple still has all the columns of the table schema but the non-projected fields are simply null. What is the motivation for retaining all the columns in the Tuple rather than having the Tuple only contain the projected columns?

          There is one last test that is failing which is caused by the fact that I am not handling the NULL_TYPE data type when converting the Tajo schema to a Parquet schema on write. What is NULL_TYPE used for? I wasn't able to find much documentation on its use. I can always write this as a placeholder column or special-case it. Once I fix this, I will post a review request.

          There are some follow-up work items that I plan to do, most likely as review changes:

          • Add TableStats to ParquetAppender.
          • Figure out of ParquetAppender.flush() is needed.
          • Additional end-to-end testing
          • Add some documentation

          Thanks,
          David

          Edit: Update GitHub link.

          Show
          davidzchen David Chen added a comment - - edited Hi Hyunsik, That's an interesting idea. Do you mean that Tajo will use Parquet as the default storage format or have all storage formats deserialize into a representation that follows the Dremel model? Parquet doesn't really have its own in-memory representation. Each of the Parquet packages basically deserialize into a given in-memory representation using the readers and writers. For example, parquet-avro deserializes into Avro GenericRecords (or SpecificRecords), parquet-pig deserializes into Pig Tuples, and my code deserializes into Tajo Tuples. My changes are currently in the parquet branch in my fork on GitHub: https://github.com/davidzchen/tajo/tree/parquet They are almost ready. During further testing, I found a few more issues, most of them I have now fixed. One thing I noticed was that when reading a projection, the resulting Tuple still has all the columns of the table schema but the non-projected fields are simply null. What is the motivation for retaining all the columns in the Tuple rather than having the Tuple only contain the projected columns? There is one last test that is failing which is caused by the fact that I am not handling the NULL_TYPE data type when converting the Tajo schema to a Parquet schema on write. What is NULL_TYPE used for? I wasn't able to find much documentation on its use. I can always write this as a placeholder column or special-case it. Once I fix this, I will post a review request. There are some follow-up work items that I plan to do, most likely as review changes: Add TableStats to ParquetAppender. Figure out of ParquetAppender.flush() is needed. Additional end-to-end testing Add some documentation Thanks, David Edit: Update GitHub link.
          Hide
          hyunsik Hyunsik Choi added a comment -

          Hi David Chen,

          I have a plan for nested schema. Currently, Tajo only supports a flat schema like relational DBMS. So, even though Tajo is extended to nested data mode, it will not break the compatibility.

          I'm thinking that Tajo takes Parquet data model (= protobuf or BigQuery). When I consider nested data model, I thought two main points. Parquet data model satisfies with these points. The first point that I've thought is the processing model on nested data. Parquet data model is the same to that of BigQuery, and BigQuery already concreted the processing model including flattening, cross production on repeated fields, and aggregation on repeated fields [1][2]. The second point is file format. Parquet is a native file format for this model. Parquet already includes the efficient record assembly method. Besides, Parquet is already mature and is widely used in many systems.

          [1] http://research.google.com/pubs/pub36632.html
          [2] https://developers.google.com/bigquery/docs/data

          I'm thinking that we need three stages for this work. Firstly, we can start with a small change to improve our schema system. Then, we will add some physical operator to just flatten one nested row into a number of flattened rows. Finally, we will solve some query optimization issues like projection/filter push down on nested schema and will add some physical operators to directly process nested rows.

          If you have any idea, feel free to share with us.

          Thanks,
          Hyunsik

          Show
          hyunsik Hyunsik Choi added a comment - Hi David Chen , I have a plan for nested schema. Currently, Tajo only supports a flat schema like relational DBMS. So, even though Tajo is extended to nested data mode, it will not break the compatibility. I'm thinking that Tajo takes Parquet data model (= protobuf or BigQuery). When I consider nested data model, I thought two main points. Parquet data model satisfies with these points. The first point that I've thought is the processing model on nested data. Parquet data model is the same to that of BigQuery, and BigQuery already concreted the processing model including flattening, cross production on repeated fields, and aggregation on repeated fields [1][2]. The second point is file format. Parquet is a native file format for this model. Parquet already includes the efficient record assembly method. Besides, Parquet is already mature and is widely used in many systems. [1] http://research.google.com/pubs/pub36632.html [2] https://developers.google.com/bigquery/docs/data I'm thinking that we need three stages for this work. Firstly, we can start with a small change to improve our schema system. Then, we will add some physical operator to just flatten one nested row into a number of flattened rows. Finally, we will solve some query optimization issues like projection/filter push down on nested schema and will add some physical operators to directly process nested rows. If you have any idea, feel free to share with us. Thanks, Hyunsik
          Hide
          davidzchen David Chen added a comment -

          Thanks for the clarification, Hyunsik. I will clean up my changes and then post a patch.

          Just wondering, is support for nested schemas on the roadmap for Tajo?

          Show
          davidzchen David Chen added a comment - Thanks for the clarification, Hyunsik. I will clean up my changes and then post a patch. Just wondering, is support for nested schemas on the roadmap for Tajo?
          Hide
          hyunsik Hyunsik Choi added a comment - - edited

          Hi David Chen,

          I think that it is sufficient to support only flat schema right now. Of course, we have a plan to expand our schema and physical operators to support nested data model, but not now. Also, you can skip PROTOBUF data type because currently it is only used in intermediate data.

          Since you use only your own time, it may be hard to proceed this work faster. I understand. I think that if you share your progress or create some subtasks, some of us can collaborate this work with you.

          Best regards,
          Hyunsik Choi

          Show
          hyunsik Hyunsik Choi added a comment - - edited Hi David Chen , I think that it is sufficient to support only flat schema right now. Of course, we have a plan to expand our schema and physical operators to support nested data model, but not now. Also, you can skip PROTOBUF data type because currently it is only used in intermediate data. Since you use only your own time, it may be hard to proceed this work faster. I understand. I think that if you share your progress or create some subtasks, some of us can collaborate this work with you. Best regards, Hyunsik Choi
          Hide
          davidzchen David Chen added a comment -

          I have read and write pretty much working for flat schemas but am still working on getting nested schemas to work.

          Hyunsik - To confirm, does Tajo implement nested schemas using the PROTOBUF data type?

          Apologies for the delay. As I mentioned before, I have been busy with other priorities this quarter. I will try to finish Parquet support for Tajo as soon as possible.

          Thanks,
          David

          Show
          davidzchen David Chen added a comment - I have read and write pretty much working for flat schemas but am still working on getting nested schemas to work. Hyunsik - To confirm, does Tajo implement nested schemas using the PROTOBUF data type? Apologies for the delay. As I mentioned before, I have been busy with other priorities this quarter. I will try to finish Parquet support for Tajo as soon as possible. Thanks, David
          Hide
          pkshit98 kisoo park added a comment -

          Hi All.

          I have waited the parquet's implementation. When does you finish this issue ?

          I hope that The tajo will support parquet as soon as possible.

          Please Let me know When you work it out.

          Thanks you.

          Show
          pkshit98 kisoo park added a comment - Hi All. I have waited the parquet's implementation. When does you finish this issue ? I hope that The tajo will support parquet as soon as possible. Please Let me know When you work it out. Thanks you.
          Hide
          hyunsik Hyunsik Choi added a comment -

          Hi David,

          Above all, I appreciate your effort. I understand your situation. I hope that you don't feel any pressure

          Parquet will be an important role in Tajo. I just want to know the progress and want to estimate when Tajo supports Parquet format. The work seems to be done mostly. Please keep going your work.

          Thanks,
          Hyunsik

          Show
          hyunsik Hyunsik Choi added a comment - Hi David, Above all, I appreciate your effort. I understand your situation. I hope that you don't feel any pressure Parquet will be an important role in Tajo. I just want to know the progress and want to estimate when Tajo supports Parquet format. The work seems to be done mostly. Please keep going your work. Thanks, Hyunsik
          Hide
          davidzchen David Chen added a comment -

          Hi Hyunsik,

          Sorry for the wait. This quarter has been a bit busy, and I have been working on this entirely in my spare time.

          I actually have most of the code written and have both reading from and writing to Parquet mostly working in my tests. There are still some bugs and more tests that I would like to write.

          Would you like me to post a patch of what I have so far?

          Thanks,
          David

          Show
          davidzchen David Chen added a comment - Hi Hyunsik, Sorry for the wait. This quarter has been a bit busy, and I have been working on this entirely in my spare time. I actually have most of the code written and have both reading from and writing to Parquet mostly working in my tests. There are still some bugs and more tests that I would like to write. Would you like me to post a patch of what I have so far? Thanks, David
          Hide
          hyunsik Hyunsik Choi added a comment -

          Hi David Chen,

          You may be very busy in these days. Is there any progress update? If it is hard for you to have enough time to proceed this issue, I would like to take this issue.

          Regards,
          Hyunsik

          Show
          hyunsik Hyunsik Choi added a comment - Hi David Chen , You may be very busy in these days. Is there any progress update? If it is hard for you to have enough time to proceed this issue, I would like to take this issue. Regards, Hyunsik
          Hide
          hyunsik Hyunsik Choi added a comment -

          Because now is the end of year, you may be busy. You don't need to hasten this work. If you need other supports for this work, feel free to ask us or create jira issue. Thanks!

          Show
          hyunsik Hyunsik Choi added a comment - Because now is the end of year, you may be busy. You don't need to hasten this work. If you need other supports for this work, feel free to ask us or create jira issue. Thanks!
          Hide
          davidzchen David Chen added a comment -

          I now mostly understand the Tajo data model and have coded the boilerplate for the Parquet Tajo module. I have been busy with other work items, and we just had our Thanksgiving holiday last week. I am hoping to have a basic prototype working next week and will update you as soon as I can.

          Thanks!
          David

          Show
          davidzchen David Chen added a comment - I now mostly understand the Tajo data model and have coded the boilerplate for the Parquet Tajo module. I have been busy with other work items, and we just had our Thanksgiving holiday last week. I am hoping to have a basic prototype working next week and will update you as soon as I can. Thanks! David
          Hide
          hyunsik Hyunsik Choi added a comment -

          Any progress update?

          Show
          hyunsik Hyunsik Choi added a comment - Any progress update?
          Hide
          blrunner Jaehwa Jung added a comment -

          Great! Welcome David!
          I'm glad to work with you on Tajo, too.

          Show
          blrunner Jaehwa Jung added a comment - Great! Welcome David! I'm glad to work with you on Tajo, too.
          Hide
          jihoonson Jihoon Son added a comment -

          Great!
          Welcome to Tajo, David!

          Show
          jihoonson Jihoon Son added a comment - Great! Welcome to Tajo, David!
          Hide
          hsaputra Henry Saputra added a comment -

          David, welcome to Tajo.

          Thanks for the comment Dmitriy.

          Show
          hsaputra Henry Saputra added a comment - David, welcome to Tajo. Thanks for the comment Dmitriy.
          Hide
          jhkim Jinho Kim added a comment -

          David, welcome to Tajo!!

          Show
          jhkim Jinho Kim added a comment - David, welcome to Tajo!!
          Hide
          davidzchen David Chen added a comment -

          Thanks, Hyunsik and Min!

          I am excited to work with all of you on Tajo!

          Show
          davidzchen David Chen added a comment - Thanks, Hyunsik and Min! I am excited to work with all of you on Tajo!
          Hide
          hyunsik Hyunsik Choi added a comment -

          David, Min, and I meet together in the meetup yesterday. Min was accepted to assign this issue to David. So, I've assigned this issue to David.

          David, welcome to Tajo!

          Show
          hyunsik Hyunsik Choi added a comment - David, Min, and I meet together in the meetup yesterday. Min was accepted to assign this issue to David. So, I've assigned this issue to David. David, welcome to Tajo!
          Hide
          jihoonson Jihoon Son added a comment -

          Dmitriy,
          thanks for your comment!

          We need to modify the project dependency to take on only the Parquet dependency.

          Thanks,
          Jihoon

          Show
          jihoonson Jihoon Son added a comment - Dmitriy, thanks for your comment! We need to modify the project dependency to take on only the Parquet dependency. Thanks, Jihoon
          Hide
          dvryaboy Dmitriy V. Ryaboy added a comment -

          Hi folks,
          Just popping in to confirm that you do not need to take on a thrift dependency to use Parquet.

          Although internally the parquet format is described as thrift, the dependency is not forced on consumers; it's an implementation details that does not leak out of the Parquet APIs. Furthermore, we "shade" all the thrift classes, so even if you do take a thrift dependency later on, and if it's a conflicting version of thrift (some API-incompatible change or something like that), things will just keep on working.

          I think so far, we've managed to not have any non-parquet dependencies at all for the core packages (when we do pull things in, we shade them; and we are very picky about what we do pull in).

          Show
          dvryaboy Dmitriy V. Ryaboy added a comment - Hi folks, Just popping in to confirm that you do not need to take on a thrift dependency to use Parquet. Although internally the parquet format is described as thrift, the dependency is not forced on consumers; it's an implementation details that does not leak out of the Parquet APIs. Furthermore, we "shade" all the thrift classes, so even if you do take a thrift dependency later on, and if it's a conflicting version of thrift (some API-incompatible change or something like that), things will just keep on working. I think so far, we've managed to not have any non-parquet dependencies at all for the core packages (when we do pull things in, we shade them; and we are very picky about what we do pull in).
          Hide
          jihoonson Jihoon Son added a comment -

          Hi David,

          I'm very glad that you are interested in contributing to Tajo!
          Because Tajo is a very interesting project, there will be a lot of parts that you may have an interesting.
          I expect to cooperate with you!

          Thanks,
          Jihoon

          Show
          jihoonson Jihoon Son added a comment - Hi David, I'm very glad that you are interested in contributing to Tajo! Because Tajo is a very interesting project, there will be a lot of parts that you may have an interesting. I expect to cooperate with you! Thanks, Jihoon
          Hide
          davidzchen David Chen added a comment -

          Hi all,

          I am new to the Tajo project. I am very excited about Tajo's capabilities and am interested in contributing to the project.

          I am one of the main engineers working on deploying Parquet at LinkedIn and have made a number of contributions to the Parquet project such as adding support for the FIXED_LEN_BYTE_ARRAY data type and a number of Avro support improvements.

          I am excited to see that Parquet support is planned for Tajo as well. Due to the Parquet's generic design, adding Tajo integration would mostly involve writing a FileReader, FileWriter, and a SchemaConverter so that Tajo can automatically convert the schema and records to Tajo's internal representation on the read side and then vice versa on the right side. This is the approach that most of the packages under parquet-mr take, such as parquet-avro, parquet-thrift, etc.

          Min, have you started working on Parquet support for Tajo? If not, would it be fine if I take this ticket?

          Thanks!
          David

          Show
          davidzchen David Chen added a comment - Hi all, I am new to the Tajo project. I am very excited about Tajo's capabilities and am interested in contributing to the project. I am one of the main engineers working on deploying Parquet at LinkedIn and have made a number of contributions to the Parquet project such as adding support for the FIXED_LEN_BYTE_ARRAY data type and a number of Avro support improvements. I am excited to see that Parquet support is planned for Tajo as well. Due to the Parquet's generic design, adding Tajo integration would mostly involve writing a FileReader, FileWriter, and a SchemaConverter so that Tajo can automatically convert the schema and records to Tajo's internal representation on the read side and then vice versa on the right side. This is the approach that most of the packages under parquet-mr take, such as parquet-avro, parquet-thrift, etc. Min, have you started working on Parquet support for Tajo? If not, would it be fine if I take this ticket? Thanks! David
          Hide
          miniway Dongmin Yu added a comment -

          Never mind, we don't have to use parquet-format directly unless we're building a brand-new java parquet parser.

          We can just use parquet-mr's java implementation.

              <dependency>
                <groupId>com.twitter</groupId>
                <artifactId>parquet-column</artifactId>
                <version>1.0.0-SNAPSHOT</version>
              </dependency>
          
          Show
          miniway Dongmin Yu added a comment - Never mind, we don't have to use parquet-format directly unless we're building a brand-new java parquet parser. We can just use parquet-mr's java implementation. <dependency> <groupId>com.twitter</groupId> <artifactId>parquet-column</artifactId> <version>1.0.0-SNAPSHOT</version> </dependency>
          Hide
          hsaputra Henry Saputra added a comment -

          Hi Min, I was wondering if we could another approach than using Thrift dependency. Once we go using Thrift then there is no turning back =)

          Show
          hsaputra Henry Saputra added a comment - Hi Min, I was wondering if we could another approach than using Thrift dependency. Once we go using Thrift then there is no turning back =)
          Hide
          hyunsik Hyunsik Choi added a comment -

          Yes, we can have thrift dependency.

          There may be another approach. According to the following pull request, we could inject some converter to Parquet source tree.

          https://github.com/Parquet/parquet-mr/commit/f3ee0c9b2380aefc2b9d861a9e721bfce755bde0

          Show
          hyunsik Hyunsik Choi added a comment - Yes, we can have thrift dependency. There may be another approach. According to the following pull request, we could inject some converter to Parquet source tree. https://github.com/Parquet/parquet-mr/commit/f3ee0c9b2380aefc2b9d861a9e721bfce755bde0
          Hide
          miniway Dongmin Yu added a comment -

          Parquet format is described as a thrift file.

          https://github.com/Parquet/parquet-format

          So Tajo could have the thrift dependency. Another option is using pre-generated files or building a manually managed formatter.

          Show
          miniway Dongmin Yu added a comment - Parquet format is described as a thrift file. https://github.com/Parquet/parquet-format So Tajo could have the thrift dependency. Another option is using pre-generated files or building a manually managed formatter.
          Hide
          hyunsik Hyunsik Choi added a comment -

          Thank you, Dongmin! I've assigned this issue to you.

          Show
          hyunsik Hyunsik Choi added a comment - Thank you, Dongmin! I've assigned this issue to you.
          Hide
          miniway Dongmin Yu added a comment -

          +1

          Hope that I would take a role of investigating and implementing the parquet
          file format.

          Thanks
          Min

          Show
          miniway Dongmin Yu added a comment - +1 Hope that I would take a role of investigating and implementing the parquet file format. Thanks Min

            People

            • Assignee:
              davidzchen David Chen
              Reporter:
              hyunsik Hyunsik Choi
            • Votes:
              0 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development