Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-32474

NullAwareAntiJoin multi-column support

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Minor
    • Resolution: Duplicate
    • 3.0.0
    • None
    • SQL
    • None

    Description

      This is a follow up improvement of Issue SPARK-32290.

      In SPARK-32290, we already optimize NAAJ from BroadcastNestedLoopJoin to BroadcastHashJoin, which improve total calculation from O(M*N) to O(M), but it's only targeting on Single Column Case, because it's much more complicate in multi column support.

      See. http://www.vldb.org/pvldb/vol2/vldb09-423.pdf Section 6

       

      FYI, code logical for single and multi column is defined at

      ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-single-column.sql

      ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-multi-column.sql

       

      For supporting multi column, I throw the following idea and see if is it worth to do multi-column support with some trade off. I would need to do some data expansion in HashedRelation, and i would call this new type of HashedRelation as NullAwareHashedRelation.

       

      In NullAwareHashedRelation, key with null column is allowed, which is opposite in LongHashedRelation and UnsafeHashedRelation; And single key might be expanded into 2^N - 1 records, (N refer to columnNum of the key). for example, if there is a record

      (1 ,2, 3) is about to insert into NullAwareHashedRelation, we take C(1,3), C(2,3) as a combination to copy origin key row, and setNull at target position, and then insert into NullAwareHashedRelation. including the origin key row, there will be 7 key row inserted as follow.

      (null, 2, 3)

      (1, null, 3)

      (1, 2, null)

      (null, null, 3)

      (null, 2, null)

      (1, null, null)

      (1, 2, 3)

       

      with the expanded data we can extract a common pattern for both single and multi column. allNull refer to a unsafeRow which has all null columns.

      • buildSide is empty input => return all rows
      • allNullColumnKey Exists In buildSide input => reject all rows
      • if streamedSideRow.allNull is true => drop the row
      • if streamedSideRow.allNull is false & findMatch in NullAwareHashedRelation => drop the row
      • if streamedSideRow.allNull is false & notFindMatch in NullAwareHashedRelation => return the row

       

      this solution will sure make buildSide data expand to 2^N-1 times, but since it is normally up to 2~3 column in NAAJ in normal production query, i suppose that it's acceptable to expand buildSide data to around 7X. I would also have a limitation of max column support for NAAJ, basically should not more than 3. 

       

       

       

       

       

       

      Attachments

        Activity

          People

            Unassigned Unassigned
            leanken Leanken.Lin
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: