Uploaded image for project: 'Tajo'
  1. Tajo
  2. TAJO-1374

Support multi-bytes delimiter for CSV file

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.11.0, 0.10.1
    • Component/s: Storage
    • Labels:
      None

      Description

      Supports multi-character / non-ascii delimiter for CSV file.

        Issue Links

          Activity

          Hide
          jhkim Jinho Kim added a comment -

          committed it to branch-0.10.1

          Show
          jhkim Jinho Kim added a comment - committed it to branch-0.10.1
          Hide
          jhkim Jinho Kim added a comment -

          committed it

          Show
          jhkim Jinho Kim added a comment - committed it
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/tajo/pull/400

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/tajo/pull/400
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user jinossy commented on the pull request:

          https://github.com/apache/tajo/pull/400#issuecomment-78809964

          +1 LGTM I'll commit it shortly.
          Thank you for your contribution

          Show
          githubbot ASF GitHub Bot added a comment - Github user jinossy commented on the pull request: https://github.com/apache/tajo/pull/400#issuecomment-78809964 +1 LGTM I'll commit it shortly. Thank you for your contribution
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user navis commented on the pull request:

          https://github.com/apache/tajo/pull/400#issuecomment-77995916

          Fixed test fails

          Show
          githubbot ASF GitHub Bot added a comment - Github user navis commented on the pull request: https://github.com/apache/tajo/pull/400#issuecomment-77995916 Fixed test fails
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user navis commented on a diff in the pull request:

          https://github.com/apache/tajo/pull/400#discussion_r26098848

          — Diff: tajo-storage/tajo-storage-common/src/main/java/org/apache/tajo/storage/LazyTuple.java —
          @@ -134,7 +134,7 @@ else if (textBytes.length <= fieldId) {
          }
          textBytes[fieldId] = null;
          } else {

          • //non-projection
            + values[fieldId] = NullDatum.get(); //non-projection
              • End diff –

          reverted this part. I've hardly noticed tajo expects null for not-picked and byte[0] for not-existing picked.

          Show
          githubbot ASF GitHub Bot added a comment - Github user navis commented on a diff in the pull request: https://github.com/apache/tajo/pull/400#discussion_r26098848 — Diff: tajo-storage/tajo-storage-common/src/main/java/org/apache/tajo/storage/LazyTuple.java — @@ -134,7 +134,7 @@ else if (textBytes.length <= fieldId) { } textBytes [fieldId] = null; } else { //non-projection + values [fieldId] = NullDatum.get(); //non-projection End diff – reverted this part. I've hardly noticed tajo expects null for not-picked and byte [0] for not-existing picked.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user jinossy commented on the pull request:

          https://github.com/apache/tajo/pull/400#issuecomment-77982695

          @navis
          OK, I've create new jira issue TAJO-1381.
          If you fix test failure, we can start review.

          Thanks

          Show
          githubbot ASF GitHub Bot added a comment - Github user jinossy commented on the pull request: https://github.com/apache/tajo/pull/400#issuecomment-77982695 @navis OK, I've create new jira issue TAJO-1381 . If you fix test failure, we can start review. Thanks
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user ykrips commented on a diff in the pull request:

          https://github.com/apache/tajo/pull/400#discussion_r26091955

          — Diff: tajo-storage/tajo-storage-common/src/main/java/org/apache/tajo/storage/LazyTuple.java —
          @@ -134,7 +134,7 @@ else if (textBytes.length <= fieldId) {
          }
          textBytes[fieldId] = null;
          } else {

          • //non-projection
            + values[fieldId] = NullDatum.get(); //non-projection
              • End diff –

          This code fills the NullDatum to values array of LazyTuple when user does not insert any data on this index. However, VTuple class does not initialize the values array on itself.

          Show
          githubbot ASF GitHub Bot added a comment - Github user ykrips commented on a diff in the pull request: https://github.com/apache/tajo/pull/400#discussion_r26091955 — Diff: tajo-storage/tajo-storage-common/src/main/java/org/apache/tajo/storage/LazyTuple.java — @@ -134,7 +134,7 @@ else if (textBytes.length <= fieldId) { } textBytes [fieldId] = null; } else { //non-projection + values [fieldId] = NullDatum.get(); //non-projection End diff – This code fills the NullDatum to values array of LazyTuple when user does not insert any data on this index. However, VTuple class does not initialize the values array on itself.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user hyunsik commented on the pull request:

          https://github.com/apache/tajo/pull/400#issuecomment-77974324

          The patch looks nice to me. Of course, your work is useful.

          It would be great if this feature is applied to DelimitedTextFile too because DelimitedTextFile is new replacement to CSVFile. DelimitedTextFile's performance is really great. According to some benchmark result, it can parse more than 500MB CSV data sets per second. It also can boost up query response times in many cases, especially I/O intensive workloads.

          I think that the work for DelimitedTextFile does not need to be done in this issue. We can do in another jira.

          Anyway, could you fix some test failure? It still has one test failure.

          Show
          githubbot ASF GitHub Bot added a comment - Github user hyunsik commented on the pull request: https://github.com/apache/tajo/pull/400#issuecomment-77974324 The patch looks nice to me. Of course, your work is useful. It would be great if this feature is applied to DelimitedTextFile too because DelimitedTextFile is new replacement to CSVFile. DelimitedTextFile's performance is really great. According to some benchmark result, it can parse more than 500MB CSV data sets per second. It also can boost up query response times in many cases, especially I/O intensive workloads. I think that the work for DelimitedTextFile does not need to be done in this issue. We can do in another jira. Anyway, could you fix some test failure? It still has one test failure.

            People

            • Assignee:
              navis Navis
              Reporter:
              navis Navis
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development