Uploaded image for project: 'Apache Drill'
  1. Apache Drill
  2. DRILL-5514

Enhance VectorContainer to merge two row sets

Details

    • Improvement
    • Status: Resolved
    • Minor
    • Resolution: Fixed
    • 1.10.0
    • 1.11.0
    • None

    Description

      Consider the concept of a "record batch" in Drill. On the one hand, one can envision a record batch as a stack of records:

      | a1 | b1 | c1 |
      ----------------
      | a2 | b2 | c2 |
      

      But, Drill is columnar. So a record batch is really a "bundle" of vectors:

      | a1 |    | b1 |    | c1 |
      | a2 |    | b2 |    | c2 |
      

      There are times when it is handy to build up a record batch as a merge of two different vector bundles:

      -- bundle 1 --    -- bundle 2 --
      | a1 |    | b1 |        | c1 |
      | a2 |    | b2 |        | c2 |
      

      For example, consider a reader. The reader implementation might read columns (a, b) from a file, say. Then, the "ScanBatch" might add (c) as an implicit vector (the file name, say.) The merged set of vectors comprises the final schema: (a, b, c).

      This ticket asks for the code to do the merge:

      • Merge two schemas A = (a, b), B = (c) to create schema C = (a, b, c).
      • Merge two vector containers C1 and C2 to create a new container, C3, that holds the merger of the vectors from the first two.

      Clearly, the merge only makes sense if:

      • The two input containers have the same row count, and
      • The columns in each input container are distinct.

      Because this feature is also useful for tests, add the merge to the "row set" tools also.

      Attachments

        Issue Links

          There are no Sub-Tasks for this issue.

          Activity

            githubbot ASF GitHub Bot added a comment -

            GitHub user paul-rogers opened a pull request:

            https://github.com/apache/drill/pull/837

            DRILL-5514: Enhance VectorContainer to merge two row sets

            Adds ability to merge two schemas and to merge two vector containers,
            in each case producing a new, merged result. See DRILL-5514 for details.

            Also provides a handy constructor to create a vector container given a
            pre-defined schema.

            You can merge this pull request into a Git repository by running:

            $ git pull https://github.com/paul-rogers/drill DRILL-5514

            Alternatively you can review and apply these changes as the patch at:

            https://github.com/apache/drill/pull/837.patch

            To close this pull request, make a commit to your master/trunk branch
            with (at least) the following in the commit message:

            This closes #837


            commit 5b2ceccd7d002b56b93abbff769bfb96b9ff0ff6
            Author: Paul Rogers <progers@maprtech.com>
            Date: 2017-05-15T22:59:35Z

            DRILL-5514: Enhance VectorContainer to merge two row sets

            Adds ability to merge two schemas and to merge two vector containers,
            in each case producing a new, merged result. See DRILL-5514 for details.

            Also provides a handy constructor to create a vector container given a
            pre-defined schema.


            githubbot ASF GitHub Bot added a comment - GitHub user paul-rogers opened a pull request: https://github.com/apache/drill/pull/837 DRILL-5514 : Enhance VectorContainer to merge two row sets Adds ability to merge two schemas and to merge two vector containers, in each case producing a new, merged result. See DRILL-5514 for details. Also provides a handy constructor to create a vector container given a pre-defined schema. You can merge this pull request into a Git repository by running: $ git pull https://github.com/paul-rogers/drill DRILL-5514 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/837.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #837 commit 5b2ceccd7d002b56b93abbff769bfb96b9ff0ff6 Author: Paul Rogers <progers@maprtech.com> Date: 2017-05-15T22:59:35Z DRILL-5514 : Enhance VectorContainer to merge two row sets Adds ability to merge two schemas and to merge two vector containers, in each case producing a new, merged result. See DRILL-5514 for details. Also provides a handy constructor to create a vector container given a pre-defined schema.
            githubbot ASF GitHub Bot added a comment -

            Github user bitblender commented on a diff in the pull request:

            https://github.com/apache/drill/pull/837#discussion_r118797793

            — Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/BatchSchema.java —
            @@ -157,4 +158,26 @@ private boolean majorTypeEqual(MajorType t1, MajorType t2)

            { return true; }

            + /**
            + * Merge two schema to produce a new, merged schema. The caller is responsible
            + * for ensuring that column names are unique. The order of the fields in the
            + * new schema is the same as that of this schema, with the other schema's fields
            + * appended in the order defined in the other schema. The resulting selection
            + * vector mode is the same as this schema. (That is, this schema is assumed to
            + * be the main part of the batch, possibly with a selection vector, with the
            + * other schema representing additional, new columns.)
            + * @param otherSchema the schema to merge with this one
            + * @return the new, merged, schema
            + */
            +
            + public BatchSchema merge(BatchSchema otherSchema) {
            + if (otherSchema.selectionVectorMode != SelectionVectorMode.NONE &&
            + selectionVectorMode != otherSchema.selectionVectorMode) {
            + throw new IllegalArgumentException("Left schema must carry the selection vector mode");
            — End diff –

            "Left schema must carry the same selection vector mode" + "as the right schema"?

            githubbot ASF GitHub Bot added a comment - Github user bitblender commented on a diff in the pull request: https://github.com/apache/drill/pull/837#discussion_r118797793 — Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/BatchSchema.java — @@ -157,4 +158,26 @@ private boolean majorTypeEqual(MajorType t1, MajorType t2) { return true; } + /** + * Merge two schema to produce a new, merged schema. The caller is responsible + * for ensuring that column names are unique. The order of the fields in the + * new schema is the same as that of this schema, with the other schema's fields + * appended in the order defined in the other schema. The resulting selection + * vector mode is the same as this schema. (That is, this schema is assumed to + * be the main part of the batch, possibly with a selection vector, with the + * other schema representing additional, new columns.) + * @param otherSchema the schema to merge with this one + * @return the new, merged, schema + */ + + public BatchSchema merge(BatchSchema otherSchema) { + if (otherSchema.selectionVectorMode != SelectionVectorMode.NONE && + selectionVectorMode != otherSchema.selectionVectorMode) { + throw new IllegalArgumentException("Left schema must carry the selection vector mode"); — End diff – "Left schema must carry the same selection vector mode" + "as the right schema"?
            githubbot ASF GitHub Bot added a comment -

            Github user bitblender commented on a diff in the pull request:

            https://github.com/apache/drill/pull/837#discussion_r120198724

            — Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/BatchSchema.java —
            @@ -157,4 +158,26 @@ private boolean majorTypeEqual(MajorType t1, MajorType t2)

            { return true; }

            + /**
            + * Merge two schema to produce a new, merged schema. The caller is responsible
            + * for ensuring that column names are unique. The order of the fields in the
            + * new schema is the same as that of this schema, with the other schema's fields
            + * appended in the order defined in the other schema. The resulting selection
            + * vector mode is the same as this schema. (That is, this schema is assumed to
            + * be the main part of the batch, possibly with a selection vector, with the
            + * other schema representing additional, new columns.)
            + * @param otherSchema the schema to merge with this one
            + * @return the new, merged, schema
            + */
            +
            + public BatchSchema merge(BatchSchema otherSchema) {
            + if (otherSchema.selectionVectorMode != SelectionVectorMode.NONE &&
            + selectionVectorMode != otherSchema.selectionVectorMode)

            { + throw new IllegalArgumentException("Left schema must carry the selection vector mode"); + }

            + List<MaterializedField> mergedFields = new ArrayList<>();
            — End diff –

            List<MaterializedField> mergedFields = new ArrayList(this.fields.size() + otherSchema.fields.size()) would avoid having to potentially grow the ArrayList twice.

            githubbot ASF GitHub Bot added a comment - Github user bitblender commented on a diff in the pull request: https://github.com/apache/drill/pull/837#discussion_r120198724 — Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/BatchSchema.java — @@ -157,4 +158,26 @@ private boolean majorTypeEqual(MajorType t1, MajorType t2) { return true; } + /** + * Merge two schema to produce a new, merged schema. The caller is responsible + * for ensuring that column names are unique. The order of the fields in the + * new schema is the same as that of this schema, with the other schema's fields + * appended in the order defined in the other schema. The resulting selection + * vector mode is the same as this schema. (That is, this schema is assumed to + * be the main part of the batch, possibly with a selection vector, with the + * other schema representing additional, new columns.) + * @param otherSchema the schema to merge with this one + * @return the new, merged, schema + */ + + public BatchSchema merge(BatchSchema otherSchema) { + if (otherSchema.selectionVectorMode != SelectionVectorMode.NONE && + selectionVectorMode != otherSchema.selectionVectorMode) { + throw new IllegalArgumentException("Left schema must carry the selection vector mode"); + } + List<MaterializedField> mergedFields = new ArrayList<>(); — End diff – List<MaterializedField> mergedFields = new ArrayList(this.fields.size() + otherSchema.fields.size()) would avoid having to potentially grow the ArrayList twice.
            githubbot ASF GitHub Bot added a comment -

            Github user bitblender commented on a diff in the pull request:

            https://github.com/apache/drill/pull/837#discussion_r122287096

            — Diff: exec/java-exec/src/test/java/org/apache/drill/exec/record/TestVectorContainer.java —
            @@ -110,13 +110,16 @@ public void testContainerMerge() {
            RowSet mergedRs = left.merge(right);
            comparison.verifyAndClear(mergedRs);

            • // Add a selection vector. Ensure the SV appears in the merged
            • // result. Test as a row set since container's don't actually
            • // carry the selection vector.
              + // Add a selection vector. Merging is forbidden.
                • End diff –

            Maybe this can be changed to "//Merging data with a selection vector is forbidden". As is the comment implies that we are adding a selection vector.

            githubbot ASF GitHub Bot added a comment - Github user bitblender commented on a diff in the pull request: https://github.com/apache/drill/pull/837#discussion_r122287096 — Diff: exec/java-exec/src/test/java/org/apache/drill/exec/record/TestVectorContainer.java — @@ -110,13 +110,16 @@ public void testContainerMerge() { RowSet mergedRs = left.merge(right); comparison.verifyAndClear(mergedRs); // Add a selection vector. Ensure the SV appears in the merged // result. Test as a row set since container's don't actually // carry the selection vector. + // Add a selection vector. Merging is forbidden. End diff – Maybe this can be changed to "//Merging data with a selection vector is forbidden". As is the comment implies that we are adding a selection vector.
            githubbot ASF GitHub Bot added a comment -

            Github user bitblender commented on a diff in the pull request:

            https://github.com/apache/drill/pull/837#discussion_r122287615

            — Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/BatchSchema.java —
            @@ -162,20 +162,22 @@ private boolean majorTypeEqual(MajorType t1, MajorType t2) {

            • Merge two schema to produce a new, merged schema. The caller is responsible
            • for ensuring that column names are unique. The order of the fields in the
            • new schema is the same as that of this schema, with the other schema's fields
            • * appended in the order defined in the other schema. The resulting selection
            • * vector mode is the same as this schema. (That is, this schema is assumed to
            • * be the main part of the batch, possibly with a selection vector, with the
            • * other schema representing additional, new columns.)
              + * appended in the order defined in the other schema.
              + * <p>
              + * Merging data with selection vectors is unlikely to be useful, or work well.
                • End diff –

            Can you please leave a comment about why this is unlikely to be useful, or work well?

            githubbot ASF GitHub Bot added a comment - Github user bitblender commented on a diff in the pull request: https://github.com/apache/drill/pull/837#discussion_r122287615 — Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/BatchSchema.java — @@ -162,20 +162,22 @@ private boolean majorTypeEqual(MajorType t1, MajorType t2) { Merge two schema to produce a new, merged schema. The caller is responsible for ensuring that column names are unique. The order of the fields in the new schema is the same as that of this schema, with the other schema's fields * appended in the order defined in the other schema. The resulting selection * vector mode is the same as this schema. (That is, this schema is assumed to * be the main part of the batch, possibly with a selection vector, with the * other schema representing additional, new columns.) + * appended in the order defined in the other schema. + * <p> + * Merging data with selection vectors is unlikely to be useful, or work well. End diff – Can you please leave a comment about why this is unlikely to be useful, or work well?
            githubbot ASF GitHub Bot added a comment -

            Github user paul-rogers commented on the issue:

            https://github.com/apache/drill/pull/837

            Fixed the two comments per CR suggestions. Please review. If acceptable, I'll squash the commits in preparation for commit.

            githubbot ASF GitHub Bot added a comment - Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/837 Fixed the two comments per CR suggestions. Please review. If acceptable, I'll squash the commits in preparation for commit.
            githubbot ASF GitHub Bot added a comment -

            Github user paul-rogers commented on the issue:

            https://github.com/apache/drill/pull/837

            Thanks for the review!

            Squashed commits and rebased on master to prepare for commit.

            githubbot ASF GitHub Bot added a comment - Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/837 Thanks for the review! Squashed commits and rebased on master to prepare for commit.
            githubbot ASF GitHub Bot added a comment -

            Github user asfgit closed the pull request at:

            https://github.com/apache/drill/pull/837

            githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/drill/pull/837

            Fixed in be43a9edd148ef3af6f92c5ce7cda235c5ac1ad6.

            arina Arina Ielchiieva added a comment - Fixed in be43a9edd148ef3af6f92c5ce7cda235c5ac1ad6.

            People

              paul-rogers Paul Rogers
              paul-rogers Paul Rogers
              Karthikeyan Manivannan Karthikeyan Manivannan
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - 2h
                  2h
                  Remaining:
                  Remaining Estimate - 2h
                  2h
                  Logged:
                  Time Spent - Not Specified
                  Not Specified