TAJO-1209: Pluggable line (de)serializer for DelimitedTextFile

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.10.0
    • Component/s: Storage
    • Labels: None

      Description

      DelimitedTextFile directly parses line-delimited text files, splitting each line into CSV or TSV fields. Because this parsing is hard-coded, it is limiting when we deal with custom text-based file formats.

      This patch enables DelimitedTextFile to use a pluggable line (de)serializer.

      First, I added an abstract class for user-defined line serde classes, as follows:

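      import java.io.IOException;

      import io.netty.buffer.ByteBuf;
      import org.apache.tajo.catalog.Schema;
      import org.apache.tajo.catalog.TableMeta;
      import org.apache.tajo.storage.Tuple;

      public abstract class TextLineSerde {
        protected Schema schema;
        protected TableMeta meta;
        protected int[] targetColumnIndexes;

        public TextLineSerde(Schema schema, TableMeta meta, int[] targetColumnIndexes) {
          this.schema = schema;
          this.meta = meta;
          this.targetColumnIndexes = targetColumnIndexes;
        }

        // Prepares the serde before any line is processed.
        public abstract void init();

        // Parses the line held in buf and fills the given tuple.
        public abstract void buildTuple(final ByteBuf buf, Tuple tuple) throws IOException;

        // Frees any resources held by the serde.
        public abstract void release();
      }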
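
      To illustrate how a user-defined serde might plug into this contract, here is a minimal sketch. It does not use the real Tajo types: LineSerdeSketch is a hypothetical stand-in for the abstract class above, with a byte array in place of ByteBuf and a String array in place of Tuple. The pipe-delimited format and the reading of targetColumnIndexes as "the column positions to fill" are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PipeLineSerdeSketch {

  /** Hypothetical stand-in for the abstract serde class, reduced to JDK types. */
  abstract static class LineSerdeSketch {
    protected final int[] targetColumnIndexes;

    LineSerdeSketch(int[] targetColumnIndexes) {
      this.targetColumnIndexes = targetColumnIndexes;
    }

    abstract void init();

    abstract void buildTuple(byte[] line, String[] tuple);

    abstract void release();
  }

  /** Example user-defined serde: splits a line on '|' and fills only the target columns. */
  static class PipeLineSerde extends LineSerdeSketch {
    PipeLineSerde(int[] targetColumnIndexes) {
      super(targetColumnIndexes);
    }

    @Override
    void init() {
      // Nothing to prepare in this sketch.
    }

    @Override
    void buildTuple(byte[] line, String[] tuple) {
      // The -1 limit keeps trailing empty fields instead of dropping them.
      String[] fields = new String(line, StandardCharsets.UTF_8).split("\\|", -1);
      for (int idx : targetColumnIndexes) {
        // Columns missing from a short line are left as null.
        tuple[idx] = idx < fields.length ? fields[idx] : null;
      }
    }

    @Override
    void release() {
      // Nothing to free in this sketch.
    }
  }

  /** Runs the init / buildTuple / release lifecycle for a single line. */
  public static String[] parse(String line, int columnCount, int[] targets) {
    PipeLineSerde serde = new PipeLineSerde(targets);
    serde.init();
    String[] tuple = new String[columnCount];
    try {
      serde.buildTuple(line.getBytes(StandardCharsets.UTF_8), tuple);
    } finally {
      serde.release();
    }
    return tuple;
  }

  public static void main(String[] args) {
    // Only columns 0 and 2 are requested, so column 1 stays null.
    System.out.println(Arrays.toString(parse("1|2|3", 3, new int[] {0, 2})));
    // prints [1, null, 3]
  }
}
```

      Passing the target column indexes to the constructor lets the serde skip work for columns the query does not read, which is presumably why the abstract class carries them.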
      

      I also added a table property, text.serde.class, which allows users to specify a custom line serde. This table property affects only the TEXT file format. You can specify your own line serde as follows:

      CREATE TABLE XXX (x int, y int) USING TEXT WITH ('text.serde.class' = 'org.apache.tajo.storage.text.CSVLineSerde')
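
      A property like this is typically resolved by loading the named class reflectively. The sketch below shows that general technique with JDK calls only; it is not the code in this patch. The LineSerde interface, the two example implementations, and the fallback to a default class when the property is unset are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class SerdeLoaderSketch {
  // Property name taken from the description above.
  static final String SERDE_CLASS_KEY = "text.serde.class";

  /** Hypothetical minimal serde contract for this sketch. */
  public interface LineSerde {
    String name();
  }

  /** Hypothetical built-in serde, used when the property is not set (an assumption). */
  public static class DefaultSerde implements LineSerde {
    @Override
    public String name() {
      return "default";
    }
  }

  /** Hypothetical user-supplied serde. */
  public static class CustomSerde implements LineSerde {
    @Override
    public String name() {
      return "custom";
    }
  }

  /** Instantiates the serde class named in the table properties. */
  public static LineSerde load(Map<String, String> tableProperties) {
    String className =
        tableProperties.getOrDefault(SERDE_CLASS_KEY, DefaultSerde.class.getName());
    try {
      // asSubclass rejects classes that do not implement the serde contract.
      Class<? extends LineSerde> clazz =
          Class.forName(className).asSubclass(LineSerde.class);
      return clazz.getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException | ClassCastException e) {
      throw new IllegalStateException("Cannot load line serde: " + className, e);
    }
  }

  public static void main(String[] args) {
    Map<String, String> props = new HashMap<>();
    System.out.println(load(props).name()); // prints default

    props.put(SERDE_CLASS_KEY, CustomSerde.class.getName());
    System.out.println(load(props).name()); // prints custom
  }
}
```

      Failing fast with a clear message when the class is missing or has the wrong type makes a mistyped property value easier to diagnose than a later ClassCastException.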
      

        Activity

        hudson Hudson added a comment -

        FAILURE: Integrated in Tajo-master-CODEGEN-build #108 (See https://builds.apache.org/job/Tajo-master-CODEGEN-build/108/)
        TAJO-1209: Pluggable line (de)serializer for DelimitedTextFile. (hyunsik: rev 72dd29c520981a3ffaac2150ee7306ca41192893)

        • tajo-storage/src/main/java/org/apache/tajo/storage/text/TextLineSerializer.java
        • tajo-common/src/main/java/org/apache/tajo/util/ReflectionUtil.java
        • tajo-storage/src/main/java/org/apache/tajo/storage/text/TextLineDeserializer.java
        • tajo-common/src/main/java/org/apache/tajo/storage/StorageConstants.java
        • tajo-storage/src/main/java/org/apache/tajo/storage/text/CSVLineDeserializer.java
        • CHANGES
        • tajo-storage/src/main/java/org/apache/tajo/storage/text/CSVLineSerializer.java
        • tajo-storage/src/main/java/org/apache/tajo/storage/text/TextLineSerDe.java
        • tajo-storage/src/main/java/org/apache/tajo/storage/text/CSVLineSerDe.java
        • tajo-storage/src/main/java/org/apache/tajo/storage/text/DelimitedTextFile.java
        hudson Hudson added a comment -

        SUCCESS: Integrated in Tajo-master-build #466 (See https://builds.apache.org/job/Tajo-master-build/466/)
        TAJO-1209: Pluggable line (de)serializer for DelimitedTextFile. (hyunsik: rev 72dd29c520981a3ffaac2150ee7306ca41192893)

        • tajo-storage/src/main/java/org/apache/tajo/storage/text/TextLineSerDe.java
        • tajo-common/src/main/java/org/apache/tajo/storage/StorageConstants.java
        • tajo-storage/src/main/java/org/apache/tajo/storage/text/TextLineSerializer.java
        • tajo-storage/src/main/java/org/apache/tajo/storage/text/CSVLineDeserializer.java
        • tajo-common/src/main/java/org/apache/tajo/util/ReflectionUtil.java
        • tajo-storage/src/main/java/org/apache/tajo/storage/text/TextLineDeserializer.java
        • tajo-storage/src/main/java/org/apache/tajo/storage/text/CSVLineSerializer.java
        • CHANGES
        • tajo-storage/src/main/java/org/apache/tajo/storage/text/DelimitedTextFile.java
        • tajo-storage/src/main/java/org/apache/tajo/storage/text/CSVLineSerDe.java
        githubbot ASF GitHub Bot added a comment -

        Github user hyunsik closed the pull request at:

        https://github.com/apache/tajo/pull/271

        githubbot ASF GitHub Bot added a comment -

        Github user hyunsik commented on the pull request:

        https://github.com/apache/tajo/pull/271#issuecomment-64775068

        committed.

        hyunsik Hyunsik Choi added a comment -

        committed.

        githubbot ASF GitHub Bot added a comment -

        Github user jinossy commented on the pull request:

        https://github.com/apache/tajo/pull/271#issuecomment-64770248

        +1
        looks good to me.

        githubbot ASF GitHub Bot added a comment -

        Github user hyunsik commented on the pull request:

        https://github.com/apache/tajo/pull/271#issuecomment-64761757

        I've updated the patch.

        githubbot ASF GitHub Bot added a comment -

        Github user hyunsik commented on the pull request:

        https://github.com/apache/tajo/pull/271#issuecomment-64743403

        See the description in https://issues.apache.org/jira/browse/TAJO-1209.

        I kept the existing CSVFile class because TAJO-838 needs CSVFile, which is a seekable scanner.

        githubbot ASF GitHub Bot added a comment -

        GitHub user hyunsik opened a pull request:

        https://github.com/apache/tajo/pull/271

        TAJO-1209: Pluggable line (de)serializer for DelimitedTextFile.

        You can merge this pull request into a Git repository by running:

        $ git pull https://github.com/hyunsik/tajo TAJO-1209

        Alternatively you can review and apply these changes as the patch at:

        https://github.com/apache/tajo/pull/271.patch

        To close this pull request, make a commit to your master/trunk branch
        with (at least) the following in the commit message:

        This closes #271


        commit 7837dde631cb4d66635663b5c80db47a944db044
        Author: Hyunsik Choi <hyunsik@apache.org>
        Date: 2014-11-27T03:39:12Z

        TAJO-1209: Pluggable line (de)serializer for DelimitedTextFile.



          People

          • Assignee: Hyunsik Choi
          • Reporter: Hyunsik Choi
          • Votes: 0
          • Watchers: 2