Uploaded image for project: 'Tajo'
  1. Tajo
  2. TAJO-1222

DelimitedTextFile should be tolerant against parsing errors.

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.10.0
    • Component/s: Storage
    • Labels:
      None

      Description

      DelimitedTextFile is a base class for plan-text file formats like CSV or JSON. In practice, due to various reasons, parsing errors are usual. But, the current implementation does not allow any parsing error. It is inconvenient in many cases.

      The objective of this issue is to enable DelimitedTextFile to tolerate parsing errors up to the number given by users.

      1. TAJO-1222_2.patch
        22 kB
        Hyunsik Choi
      2. TAJO-1222_3.patch
        23 kB
        Hyunsik Choi
      3. TAJO-1222.patch
        22 kB
        Hyunsik Choi

        Activity

        Hide
        hyunsik Hyunsik Choi added a comment -

        This patch enables plan-text based file formats like Json and CSV to be tolerant against parsing errors.
        I added a table property text.error-tolerance.max-num to allow users to set the maximum number of
        permissible parsing errors. You can set this table property as follows:

        CREATE EXTERNAL TABLE ... (
          ...
          ...
          model TEXT,
          screen_height INT,
          os TEXT,
          createdat TEXT,
          "data usage" TEXT
        ) USING JSON WITH ('text.error-tolerance.max-num' = '0') PARTITION BY
        COLUMN (year TEXT, month TEXT) LOCATION 'file:///.../locket';
        

        As I mentioned above, text.error-tolerance.max-num means the maximum number of permissible parsing errors. Additionally, it can have three different behaviors according to the number.

        • If text.error-tolerance.max-num is set to negative integer (i.e., < 0), a running query will ignore all parsing errors.
        • If text.error-tolerance.max-num is set to 0, a running query does not allow any parsing error.
        • If text.error-tolerance.max-num is set to some positive integer number (i.e., > 0), a running query only ignores the given number of parsing errors in each task.

        By default, 'text.error-tolerance.max-num' will be set to 0 if this property is not specified.

        Show
        hyunsik Hyunsik Choi added a comment - This patch enables plan-text based file formats like Json and CSV to be tolerant against parsing errors. I added a table property text.error-tolerance.max-num to allow users to set the maximum number of permissible parsing errors. You can set this table property as follows: CREATE EXTERNAL TABLE ... ( ... ... model TEXT, screen_height INT, os TEXT, createdat TEXT, "data usage" TEXT ) USING JSON WITH ('text.error-tolerance.max-num' = '0') PARTITION BY COLUMN (year TEXT, month TEXT) LOCATION 'file:///.../locket'; As I mentioned above, text.error-tolerance.max-num means the maximum number of permissible parsing errors. Additionally, it can have three different behaviors according to the number. If text.error-tolerance.max-num is set to negative integer (i.e., < 0), a running query will ignore all parsing errors. If text.error-tolerance.max-num is set to 0, a running query does not allow any parsing error. If text.error-tolerance.max-num is set to some positive integer number (i.e., > 0), a running query only ignores the given number of parsing errors in each task. By default, 'text.error-tolerance.max-num' will be set to 0 if this property is not specified.
        Hide
        githubbot ASF GitHub Bot added a comment -

        GitHub user hyunsik opened a pull request:

        https://github.com/apache/tajo/pull/277

        TAJO-1222: DelimitedTextFile should be tolerant against parsing errors.

        Please see the description and comments at https://issues.apache.org/jira/browse/TAJO-1222.

        You can merge this pull request into a Git repository by running:

        $ git pull https://github.com/hyunsik/tajo TAJO-1222

        Alternatively you can review and apply these changes as the patch at:

        https://github.com/apache/tajo/pull/277.patch

        To close this pull request, make a commit to your master/trunk branch
        with (at least) the following in the commit message:

        This closes #277


        commit b260fcebd410e9942b515267ae8708214cf326d6
        Author: Hyunsik Choi <hyunsik@apache.org>
        Date: 2014-12-02T17:50:02Z

        TAJO-1222: DelimitedTextFile should be tolerant against parsing errors.


        Show
        githubbot ASF GitHub Bot added a comment - GitHub user hyunsik opened a pull request: https://github.com/apache/tajo/pull/277 TAJO-1222 : DelimitedTextFile should be tolerant against parsing errors. Please see the description and comments at https://issues.apache.org/jira/browse/TAJO-1222 . You can merge this pull request into a Git repository by running: $ git pull https://github.com/hyunsik/tajo TAJO-1222 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tajo/pull/277.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #277 commit b260fcebd410e9942b515267ae8708214cf326d6 Author: Hyunsik Choi <hyunsik@apache.org> Date: 2014-12-02T17:50:02Z TAJO-1222 : DelimitedTextFile should be tolerant against parsing errors.
        Hide
        tajoqa Tajo QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12684688/TAJO-1222.patch
        against master revision release-0.9.0-rc0-64-g1cdbe46.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The applied patch does not increase the total number of javadoc warnings.

        +1 checkstyle. The patch generated 0 code style errors.

        -1 findbugs. The patch appears to introduce 29 new Findbugs (version 2.0.3) warnings.

        -1 release audit. The applied patch generated 359 release audit warnings.

        +1 core tests. The patch passed unit tests in tajo-common tajo-storage.

        Test results: https://builds.apache.org/job/PreCommit-TAJO-Build/521//testReport/
        Release audit warnings: https://builds.apache.org/job/PreCommit-TAJO-Build/521//artifact/incubator-tajo/patchprocess/patchReleaseAuditProblems.txt
        Findbugs warnings: https://builds.apache.org/job/PreCommit-TAJO-Build/521//artifact/incubator-tajo/patchprocess/newPatchFindbugsWarningstajo-common.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-TAJO-Build/521//artifact/incubator-tajo/patchprocess/newPatchFindbugsWarningstajo-storage.html
        Console output: https://builds.apache.org/job/PreCommit-TAJO-Build/521//console

        This message is automatically generated.

        Show
        tajoqa Tajo QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12684688/TAJO-1222.patch against master revision release-0.9.0-rc0-64-g1cdbe46. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified test files. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 javadoc. The applied patch does not increase the total number of javadoc warnings. +1 checkstyle. The patch generated 0 code style errors. -1 findbugs. The patch appears to introduce 29 new Findbugs (version 2.0.3) warnings. -1 release audit. The applied patch generated 359 release audit warnings. +1 core tests. The patch passed unit tests in tajo-common tajo-storage. Test results: https://builds.apache.org/job/PreCommit-TAJO-Build/521//testReport/ Release audit warnings: https://builds.apache.org/job/PreCommit-TAJO-Build/521//artifact/incubator-tajo/patchprocess/patchReleaseAuditProblems.txt Findbugs warnings: https://builds.apache.org/job/PreCommit-TAJO-Build/521//artifact/incubator-tajo/patchprocess/newPatchFindbugsWarningstajo-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-TAJO-Build/521//artifact/incubator-tajo/patchprocess/newPatchFindbugsWarningstajo-storage.html Console output: https://builds.apache.org/job/PreCommit-TAJO-Build/521//console This message is automatically generated.
        Hide
        githubbot ASF GitHub Bot added a comment -

        Github user jinossy commented on a diff in the pull request:

        https://github.com/apache/tajo/pull/277#discussion_r21209793

        — Diff: tajo-storage/src/main/java/org/apache/tajo/storage/text/DelimitedTextFile.java —
        @@ -355,21 +368,58 @@ public float getProgress() {

        @Override
        public Tuple next() throws IOException {
        +
        + if (!reader.isReadable())

        { + return null; + }

        +
        + if (targets.length == 0)

        { + return EmptyTuple.get(); + }

        +
        + VTuple tuple = new VTuple(schema.size());
        +
        try {

        • if (!reader.isReadable()) return null;
        • ByteBuf buf = readLine();
        • if (buf == null) return null;
          + // this loop will continue until one tuple is build or EOS (end of stream).
          + do {
        • if (targets.length == 0) { - return EmptyTuple.get(); - }

          + ByteBuf buf = readLine();
          + if (buf == null)

          { + return null; + }

          +
          + try {
          +
          + deserializer.deserialize(buf, tuple);
          +

            • End diff –

        Could you move the recordCount to this line?

        Show
        githubbot ASF GitHub Bot added a comment - Github user jinossy commented on a diff in the pull request: https://github.com/apache/tajo/pull/277#discussion_r21209793 — Diff: tajo-storage/src/main/java/org/apache/tajo/storage/text/DelimitedTextFile.java — @@ -355,21 +368,58 @@ public float getProgress() { @Override public Tuple next() throws IOException { + + if (!reader.isReadable()) { + return null; + } + + if (targets.length == 0) { + return EmptyTuple.get(); + } + + VTuple tuple = new VTuple(schema.size()); + try { if (!reader.isReadable()) return null; ByteBuf buf = readLine(); if (buf == null) return null; + // this loop will continue until one tuple is build or EOS (end of stream). + do { if (targets.length == 0) { - return EmptyTuple.get(); - } + ByteBuf buf = readLine(); + if (buf == null) { + return null; + } + + try { + + deserializer.deserialize(buf, tuple); + End diff – Could you move the recordCount to this line?
        Hide
        hyunsik Hyunsik Choi added a comment -

        Jinho Kim,

        Thank you for your quick review

        I've updated the patch. In this patch, I moved the variable recordCount before 'return tuple'. As a result, recordCount will count only number of actual read records.

        Show
        hyunsik Hyunsik Choi added a comment - Jinho Kim , Thank you for your quick review I've updated the patch. In this patch, I moved the variable recordCount before 'return tuple'. As a result, recordCount will count only number of actual read records.
        Hide
        tajoqa Tajo QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12684810/TAJO-1222_2.patch
        against master revision release-0.9.0-rc0-64-g1cdbe46.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The applied patch does not increase the total number of javadoc warnings.

        +1 checkstyle. The patch generated 0 code style errors.

        -1 findbugs. The patch appears to introduce 29 new Findbugs (version 2.0.3) warnings.

        -1 release audit. The applied patch generated 115 release audit warnings.

        +1 core tests. The patch passed unit tests in tajo-common tajo-storage.

        Test results: https://builds.apache.org/job/PreCommit-TAJO-Build/522//testReport/
        Release audit warnings: https://builds.apache.org/job/PreCommit-TAJO-Build/522//artifact/incubator-tajo/patchprocess/patchReleaseAuditProblems.txt
        Findbugs warnings: https://builds.apache.org/job/PreCommit-TAJO-Build/522//artifact/incubator-tajo/patchprocess/newPatchFindbugsWarningstajo-storage.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-TAJO-Build/522//artifact/incubator-tajo/patchprocess/newPatchFindbugsWarningstajo-common.html
        Console output: https://builds.apache.org/job/PreCommit-TAJO-Build/522//console

        This message is automatically generated.

        Show
        tajoqa Tajo QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12684810/TAJO-1222_2.patch against master revision release-0.9.0-rc0-64-g1cdbe46. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified test files. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 javadoc. The applied patch does not increase the total number of javadoc warnings. +1 checkstyle. The patch generated 0 code style errors. -1 findbugs. The patch appears to introduce 29 new Findbugs (version 2.0.3) warnings. -1 release audit. The applied patch generated 115 release audit warnings. +1 core tests. The patch passed unit tests in tajo-common tajo-storage. Test results: https://builds.apache.org/job/PreCommit-TAJO-Build/522//testReport/ Release audit warnings: https://builds.apache.org/job/PreCommit-TAJO-Build/522//artifact/incubator-tajo/patchprocess/patchReleaseAuditProblems.txt Findbugs warnings: https://builds.apache.org/job/PreCommit-TAJO-Build/522//artifact/incubator-tajo/patchprocess/newPatchFindbugsWarningstajo-storage.html Findbugs warnings: https://builds.apache.org/job/PreCommit-TAJO-Build/522//artifact/incubator-tajo/patchprocess/newPatchFindbugsWarningstajo-common.html Console output: https://builds.apache.org/job/PreCommit-TAJO-Build/522//console This message is automatically generated.
        Hide
        tajoqa Tajo QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12684820/TAJO-1222_3.patch
        against master revision release-0.9.0-rc0-64-g1cdbe46.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The applied patch does not increase the total number of javadoc warnings.

        +1 checkstyle. The patch generated 0 code style errors.

        -1 findbugs. The patch appears to introduce 29 new Findbugs (version 2.0.3) warnings.

        -1 release audit. The applied patch generated 116 release audit warnings.

        +1 core tests. The patch passed unit tests in tajo-common tajo-storage.

        Test results: https://builds.apache.org/job/PreCommit-TAJO-Build/523//testReport/
        Release audit warnings: https://builds.apache.org/job/PreCommit-TAJO-Build/523//artifact/incubator-tajo/patchprocess/patchReleaseAuditProblems.txt
        Findbugs warnings: https://builds.apache.org/job/PreCommit-TAJO-Build/523//artifact/incubator-tajo/patchprocess/newPatchFindbugsWarningstajo-common.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-TAJO-Build/523//artifact/incubator-tajo/patchprocess/newPatchFindbugsWarningstajo-storage.html
        Console output: https://builds.apache.org/job/PreCommit-TAJO-Build/523//console

        This message is automatically generated.

        Show
        tajoqa Tajo QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12684820/TAJO-1222_3.patch against master revision release-0.9.0-rc0-64-g1cdbe46. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified test files. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 javadoc. The applied patch does not increase the total number of javadoc warnings. +1 checkstyle. The patch generated 0 code style errors. -1 findbugs. The patch appears to introduce 29 new Findbugs (version 2.0.3) warnings. -1 release audit. The applied patch generated 116 release audit warnings. +1 core tests. The patch passed unit tests in tajo-common tajo-storage. Test results: https://builds.apache.org/job/PreCommit-TAJO-Build/523//testReport/ Release audit warnings: https://builds.apache.org/job/PreCommit-TAJO-Build/523//artifact/incubator-tajo/patchprocess/patchReleaseAuditProblems.txt Findbugs warnings: https://builds.apache.org/job/PreCommit-TAJO-Build/523//artifact/incubator-tajo/patchprocess/newPatchFindbugsWarningstajo-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-TAJO-Build/523//artifact/incubator-tajo/patchprocess/newPatchFindbugsWarningstajo-storage.html Console output: https://builds.apache.org/job/PreCommit-TAJO-Build/523//console This message is automatically generated.
        Hide
        githubbot ASF GitHub Bot added a comment -

        Github user jinossy commented on the pull request:

        https://github.com/apache/tajo/pull/277#issuecomment-65362188

        +1 LGTM!
        Ship it!

        Show
        githubbot ASF GitHub Bot added a comment - Github user jinossy commented on the pull request: https://github.com/apache/tajo/pull/277#issuecomment-65362188 +1 LGTM! Ship it!
        Hide
        githubbot ASF GitHub Bot added a comment -

        Github user asfgit closed the pull request at:

        https://github.com/apache/tajo/pull/277

        Show
        githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/tajo/pull/277
        Hide
        hyunsik Hyunsik Choi added a comment -

        committed.

        Show
        hyunsik Hyunsik Choi added a comment - committed.
        Hide
        hudson Hudson added a comment -

        SUCCESS: Integrated in Tajo-master-build #476 (See https://builds.apache.org/job/Tajo-master-build/476/)
        TAJO-1222: DelimitedTextFile should be tolerant against parsing errors. (hyunsik: rev f69938abecd3d53968e318d97aba53d9acd3de40)

        • tajo-storage/src/test/resources/dataset/TestDelimitedTextFile/testErrorTolerance1.json
        • tajo-docs/src/main/sphinx/table_management/csv.rst
        • tajo-storage/src/main/java/org/apache/tajo/storage/text/DelimitedTextFile.java
        • tajo-storage/src/main/java/org/apache/tajo/storage/text/TextLineParsingError.java
        • tajo-storage/src/test/resources/dataset/TestDelimitedTextFile/testErrorTolerance2.json
        • tajo-common/src/main/java/org/apache/tajo/storage/StorageConstants.java
        • tajo-storage/src/main/java/org/apache/tajo/storage/json/JsonLineDeserializer.java
        • tajo-storage/src/test/java/org/apache/tajo/storage/TestDelimitedTextFile.java
        • tajo-storage/src/main/java/org/apache/tajo/storage/text/TextLineDeserializer.java
        • CHANGES
        • tajo-storage/src/main/java/org/apache/tajo/storage/text/CSVLineDeserializer.java
        • tajo-storage/src/main/java/org/apache/tajo/storage/FieldSerializerDeserializer.java
        Show
        hudson Hudson added a comment - SUCCESS: Integrated in Tajo-master-build #476 (See https://builds.apache.org/job/Tajo-master-build/476/ ) TAJO-1222 : DelimitedTextFile should be tolerant against parsing errors. (hyunsik: rev f69938abecd3d53968e318d97aba53d9acd3de40) tajo-storage/src/test/resources/dataset/TestDelimitedTextFile/testErrorTolerance1.json tajo-docs/src/main/sphinx/table_management/csv.rst tajo-storage/src/main/java/org/apache/tajo/storage/text/DelimitedTextFile.java tajo-storage/src/main/java/org/apache/tajo/storage/text/TextLineParsingError.java tajo-storage/src/test/resources/dataset/TestDelimitedTextFile/testErrorTolerance2.json tajo-common/src/main/java/org/apache/tajo/storage/StorageConstants.java tajo-storage/src/main/java/org/apache/tajo/storage/json/JsonLineDeserializer.java tajo-storage/src/test/java/org/apache/tajo/storage/TestDelimitedTextFile.java tajo-storage/src/main/java/org/apache/tajo/storage/text/TextLineDeserializer.java CHANGES tajo-storage/src/main/java/org/apache/tajo/storage/text/CSVLineDeserializer.java tajo-storage/src/main/java/org/apache/tajo/storage/FieldSerializerDeserializer.java
        Hide
        hudson Hudson added a comment -

        FAILURE: Integrated in Tajo-master-CODEGEN-build #117 (See https://builds.apache.org/job/Tajo-master-CODEGEN-build/117/)
        TAJO-1222: DelimitedTextFile should be tolerant against parsing errors. (hyunsik: rev f69938abecd3d53968e318d97aba53d9acd3de40)

        • tajo-storage/src/main/java/org/apache/tajo/storage/text/CSVLineDeserializer.java
        • tajo-common/src/main/java/org/apache/tajo/storage/StorageConstants.java
        • tajo-storage/src/test/resources/dataset/TestDelimitedTextFile/testErrorTolerance1.json
        • tajo-storage/src/main/java/org/apache/tajo/storage/text/TextLineDeserializer.java
        • tajo-storage/src/main/java/org/apache/tajo/storage/FieldSerializerDeserializer.java
        • tajo-storage/src/test/resources/dataset/TestDelimitedTextFile/testErrorTolerance2.json
        • tajo-storage/src/test/java/org/apache/tajo/storage/TestDelimitedTextFile.java
        • tajo-storage/src/main/java/org/apache/tajo/storage/json/JsonLineDeserializer.java
        • CHANGES
        • tajo-docs/src/main/sphinx/table_management/csv.rst
        • tajo-storage/src/main/java/org/apache/tajo/storage/text/TextLineParsingError.java
        • tajo-storage/src/main/java/org/apache/tajo/storage/text/DelimitedTextFile.java
        Show
        hudson Hudson added a comment - FAILURE: Integrated in Tajo-master-CODEGEN-build #117 (See https://builds.apache.org/job/Tajo-master-CODEGEN-build/117/ ) TAJO-1222 : DelimitedTextFile should be tolerant against parsing errors. (hyunsik: rev f69938abecd3d53968e318d97aba53d9acd3de40) tajo-storage/src/main/java/org/apache/tajo/storage/text/CSVLineDeserializer.java tajo-common/src/main/java/org/apache/tajo/storage/StorageConstants.java tajo-storage/src/test/resources/dataset/TestDelimitedTextFile/testErrorTolerance1.json tajo-storage/src/main/java/org/apache/tajo/storage/text/TextLineDeserializer.java tajo-storage/src/main/java/org/apache/tajo/storage/FieldSerializerDeserializer.java tajo-storage/src/test/resources/dataset/TestDelimitedTextFile/testErrorTolerance2.json tajo-storage/src/test/java/org/apache/tajo/storage/TestDelimitedTextFile.java tajo-storage/src/main/java/org/apache/tajo/storage/json/JsonLineDeserializer.java CHANGES tajo-docs/src/main/sphinx/table_management/csv.rst tajo-storage/src/main/java/org/apache/tajo/storage/text/TextLineParsingError.java tajo-storage/src/main/java/org/apache/tajo/storage/text/DelimitedTextFile.java

          People

          • Assignee:
            hyunsik Hyunsik Choi
            Reporter:
            hyunsik Hyunsik Choi
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development