Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0, 0.9.0
    • Component/s: Documentation
    • Labels:
      None

      Description

      Jinho and I wrote user documentation for file formats. This patch contains documentation for the CSV, RCFile, and Parquet file formats.

      1. TAJO-736_20140408_11:44:44.patch
        17 kB
        Hyunsik Choi
      2. TAJO-736.patch
        17 kB
        Hyunsik Choi
      3. TAJO-736.patch
        17 kB
        Hyunsik Choi

          Activity

          hyunsik Hyunsik Choi added a comment -

          Actually, we are not native English speakers, so our work may include some unnatural English expressions. Any suggestions and revisions are very welcome.

          Besides, I tried to upload the generated HTML to people.a.o, but people.a.o is having some problems and I cannot connect to its SSH server. I'll show you the generated documentation when people.a.o becomes available.

          hyunsik Hyunsik Choi added a comment -

          Here are the generated web pages:

          http://people.apache.org/~hyunsik/TAJO-736/table_management/csv.html
          http://people.apache.org/~hyunsik/TAJO-736/table_management/rcfile.html
          http://people.apache.org/~hyunsik/TAJO-736/table_management/parquet.html
          http://people.apache.org/~hyunsik/TAJO-736/partitioning/column_partitioning.html

          Any suggestions are welcome. Thanks!

          jihoonson Jihoon Son added a comment -

          Hyunsik Choi and Jinho Kim,
          First of all, I appreciate your efforts.
          These documents will be very useful and helpful to Tajo users.

          The documents look nice, but I have a few simple suggestions.

          • In CSV, some characters cannot be used as delimiters. For example, the line feed (\n) cannot be used as the delimiter, because it is used to separate lines. It would be great to add a description of this.
          • In RCFile, a period seems to be missing at the end of the first paragraph.
          • In Parquet, it would be great to add an example of DDL that creates a table with compression.
          • In Column Partitioning, the "Todo" section should be removed. Also, I think that there may be a compatibility issue with Hive. For example, can Tajo directly read Hive's partitioned tables? Whether it can or cannot, it would be better to add a short description of the compatibility.
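
The CSV delimiter restriction in the first point can be illustrated with a minimal sketch. This is not Tajo's CSV reader; `parse_delimited` is a hypothetical helper written only to show why the line feed cannot double as the field delimiter: rows are split on the line feed first, so a field delimiter equal to the line feed would make rows and fields indistinguishable.

```python
# Hypothetical illustration only; this is NOT Tajo's CSV scanner.
def parse_delimited(text, delimiter):
    """Split text into rows on line feeds, then split each row into fields."""
    # The line feed already marks row boundaries, so it cannot also
    # mark field boundaries within a row.
    if delimiter in ("\n", "\r"):
        raise ValueError("the line terminator cannot be used as the field delimiter")
    # Skip empty lines such as the one produced by a trailing line feed.
    return [line.split(delimiter) for line in text.split("\n") if line]


rows = parse_delimited("1|abc\n2|def\n", "|")
print(rows)
```

With `|` as the delimiter, the two input lines become two rows of two fields each; passing `"\n"` as the delimiter is rejected up front.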

          In addition, I think that David Chen's review will be very helpful for the Parquet document.
          David Chen, would you mind reviewing the Parquet document, please?

          Best regards,
          Jihoon Son

          davidzchen David Chen added a comment -

          Not a problem! The documentation looks great so far. I have some minor comments:

          For more details, please refer Parquet File Format.

          Should be "please refer to the Parquet File Format."

          If you are not familiar with CREATE TABLE statement, please refer Data Definition Language Data Definition Language.

          Similar to above, add a "to" after "refer." Also, "Data Definition Language" appears to be repeated.

          In order to specify a certain file format for your table, you need to use USING PARQUET clause

          Add a "the" after "use."

          The below is an example statement for creating a table using PARQUET files. WITH clause allows users to specify a set of physical properties.

          I'm not sure whether PARQUET needs to be all-caps here. Also, add a "the" before "WITH."

          Some table file formats provide special enable/disable features and the ways to adjust some physical parameters. WITH clause in CREATE TABLE statement allows users to set those physical parameters.

          I think it might be better to rephrase this as: "Some table storage formats provide parameters for enabling or disabling features and adjusting physical parameters. The WITH clause in the CREATE TABLE statement allows users to set those parameters."

          Now, Parquet file provides the following physical properties.

          "The Parquet storage format provides the following physical properties:"

          Larger values will improve the IO when reading but consume more memory when writing.

          IO should be I/O.

          The compression algorithm used to compress pages. It should be one of UNCOMPRESSED, SNAPPY, GZIP, LZO. Default is UNCOMPRESSED.

          I believe the convention for the compression codec names in the Parquet documentation is to use all lowercase (even though, code-wise, all caps still works anyway).
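
To make the WITH clause and the compression property discussed above concrete, here is a minimal sketch that assembles a CREATE TABLE statement as a string. The helper `create_table_ddl`, the table name, the columns, and the `parquet.compression` property key are all assumptions for illustration and are not taken from the patch; the Tajo documentation remains the authority on the exact property names.

```python
# Hypothetical helper; the table name, columns, and the
# 'parquet.compression' key are illustrative assumptions.
def create_table_ddl(table, columns, file_format, properties=None):
    """Build a CREATE TABLE ... USING ... [WITH (...)] statement."""
    cols = ", ".join(f"{name} {ctype}" for name, ctype in columns)
    ddl = f"CREATE TABLE {table} ({cols}) USING {file_format}"
    if properties:
        # Each physical property becomes a quoted 'key' = 'value' pair.
        props = ", ".join(f"'{k}' = '{v}'" for k, v in properties.items())
        ddl += f" WITH ({props})"
    return ddl + ";"


ddl = create_table_ddl(
    "table1",
    [("id", "int"), ("name", "text")],
    "PARQUET",
    {"parquet.compression": "snappy"},  # lowercase codec name, per the remark above
)
print(ddl)
# CREATE TABLE table1 (id int, name text) USING PARQUET WITH ('parquet.compression' = 'snappy');
```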

          Compatibility Issues

          Users might try to use Parquet files with nested schemas and non-scalar types, which is currently a compatibility issue. Should we add a note that we are currently working on adding support for nested schemas and non-scalar types?

          --------

          If you would like, I would be glad to review the documentation for the other storage formats too. Can you create an RB review for this patch? It might be easier to review on RB.

          Thanks,
          David

          hyunsik Hyunsik Choi added a comment -

          Hi Jihoon Son,

          Thank you for your comments. I see the points I missed. I'll reflect your comments in the next patch.

          Hi David Chen,

          I really appreciate your effort. Your reviews of the other storage formats would be very welcome. I'll submit the updated patch, which reflects Jihoon's and your comments, on RB.

          Regards,
          Hyunsik

          hyunsik Hyunsik Choi added a comment -

          Created a review request against the master branch in Review Board:
          https://reviews.apache.org/r/20060/

          hyunsik Hyunsik Choi added a comment -

          Thank you for your comments. I just submitted the updated patch which reflects your comments.

          davidzchen David Chen added a comment - - edited

          Hi Hyunsik,

          I have added my comments to the RB. Sorry if they are a bit nit-picky. I try to be as thorough as possible.

          Thanks for putting together the documentation!

          Best,
          David

          hyunsik Hyunsik Choi added a comment -

          Hi David,

          On the contrary, I really appreciate your detailed review! I'll reflect all your comments in our documentation. Thank you very much again!

          Warm regards,
          Hyunsik

          jihoonson Jihoon Son added a comment -

          Hi David.

          I truly appreciate your detailed review!
          Thanks!

          Best Regards,
          Jihoon

          davidzchen David Chen added a comment -

          No problem! Feel free to let me know if you have any questions.

          Thanks,
          David

          jhkim Jinho Kim added a comment -

          Thank you so much, guys!

          hyunsik Hyunsik Choi added a comment - - edited

          Updated the review request against the master branch in Review Board:
          https://reviews.apache.org/r/20060/

          The patch reflects David's comments. Thank you, David, for your nice comments.
          I think that the patch is ready to be committed.

          hyunsik Hyunsik Choi added a comment -

          You can see the updated pages at the links below:

          http://people.apache.org/~hyunsik/TAJO-736_3/table_management/csv.html
          http://people.apache.org/~hyunsik/TAJO-736_3/table_management/rcfile.html
          http://people.apache.org/~hyunsik/TAJO-736_3/table_management/parquet.html
          http://people.apache.org/~hyunsik/TAJO-736_3/partitioning/column_partitioning.html

          jhkim Jinho Kim added a comment -

          +1 for the patch.
          Thanks

          jihoonson Jihoon Son added a comment -

          Thanks for your work.
          The patch looks great to me.
          If any missing features or shortcomings turn up, we can improve them later.
          +1

          hudson Hudson added a comment -

          SUCCESS: Integrated in Tajo-0.8.0-build #63 (See https://builds.apache.org/job/Tajo-0.8.0-build/63/)
          TAJO-736: Add table management documentation. (hyunsik) (hyunsik: rev 50e8c230194ccd2e9e3bb0065770823064dd243c)

          • tajo-docs/src/main/sphinx/table_management/rcfile.rst
          • tajo-docs/src/main/sphinx/partitioning/column_partitioning.rst
          • tajo-docs/src/main/sphinx/table_management/csv.rst
          • CHANGES.txt
          • tajo-docs/src/main/sphinx/table_management/parquet.rst
          • tajo-docs/src/main/sphinx/partitioning/intro_to_partitioning.rst
          hyunsik Hyunsik Choi added a comment -

          Just committed it to the master and 0.8.0 branches.

          Thank you all. In particular, thank you very much, David, for your detailed review. It was a very big help for us.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Tajo-master-build #163 (See https://builds.apache.org/job/Tajo-master-build/163/)
          TAJO-736: Add table management documentation. (hyunsik) (hyunsik: rev d952b61ae1f3a8b4e32e2e390421f8d973b9f6fb)

          • tajo-docs/src/main/sphinx/table_management/rcfile.rst
          • CHANGES.txt
          • tajo-docs/src/main/sphinx/table_management/parquet.rst
          • tajo-docs/src/main/sphinx/configuration/cluster_setup.rst
          • tajo-docs/src/main/sphinx/table_management/csv.rst
          • tajo-docs/src/main/sphinx/partitioning/column_partitioning.rst
          • tajo-docs/src/main/sphinx/partitioning/intro_to_partitioning.rst

            People

            • Assignee:
              hyunsik Hyunsik Choi
              Reporter:
              hyunsik Hyunsik Choi
            • Votes:
              0
              Watchers:
              4
