Hive / HIVE-2520

left semi join will duplicate data

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.7.0
    • Fix Version/s: 0.9.0
    • None
    • Reviewed

    Description

      CREATE TABLE sales (name STRING, id INT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

      CREATE TABLE things (id INT, name STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

      The 'sales' table has its data in one file, sales.txt, which contains:
      Joe 2
      Hank 2

      The 'things' table has its data in two files, things.txt and things2.txt.
      The content of things.txt is:
      2 Tie
      The content of things2.txt is:
      2 Tie

      SELECT * FROM sales LEFT SEMI JOIN things ON (sales.id = things.id);
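      outputs:
      Joe 2
      Joe 2
      Hank 2
      Hank 2
      This result is wrong. A left semi join should return each row of 'sales' that has a match in 'things' exactly once, no matter how many rows of 'things' match, so the expected output is:
      Joe 2
      Hank 2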
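
      The intended semantics can be checked against a small stand-alone model. The following is a minimal sketch, not Hive code: the class SemiJoinExpected, the Sale type and the leftSemiJoin method are hypothetical names chosen for illustration. It treats the semi join as a membership test on the join key, so the number of matching rows in 'things' cannot affect the result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative model only: not Hive code. All names here are hypothetical.
class SemiJoinExpected {

    /** A row of the 'sales' table: (name STRING, id INT). */
    static final class Sale {
        final String name;
        final int id;

        Sale(String name, int id) {
            this.name = name;
            this.id = id;
        }

        @Override
        public String toString() {
            return name + "\t" + id;
        }
    }

    /**
     * LEFT SEMI JOIN semantics: keep a left row if at least one right row
     * has the same key. Collecting the right-side keys into a set makes the
     * number of matching right rows irrelevant.
     */
    static List<String> leftSemiJoin(List<Sale> sales, List<Integer> thingIds) {
        Set<Integer> keys = new HashSet<>(thingIds);
        List<String> out = new ArrayList<>();
        for (Sale s : sales) {
            if (keys.contains(s.id)) {
                out.add(s.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Sale> sales = Arrays.asList(new Sale("Joe", 2), new Sale("Hank", 2));
        // things.txt and things2.txt each contribute one row with id 2.
        List<Integer> thingIds = Arrays.asList(2, 2);
        for (String row : leftSemiJoin(sales, thingIds)) {
            System.out.println(row);
        }
    }
}
```

      Run on the data above, the model prints each 'sales' row once, even though 'things' holds two rows with id 2.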

      In CommonJoinOperator, a left semi join should generate its output with " genObject(null, 0, new IntermediateObject(new ArrayList[numAliases], 0), true); ",
      but it currently uses " genUniqueJoinObject(0, 0); ".
      The attached patch fixes this.
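
      The symptom can be pictured with a toy model of two row-generation strategies. The sketch below is not the CommonJoinOperator source and does not reproduce genObject or genUniqueJoinObject; the class and method names are hypothetical. It only assumes what the output above shows: one strategy emits a left row for every matching right-side row, while the other emits it once and stops scanning.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model of the two behaviours; this is NOT the CommonJoinOperator
// source. Class and method names are hypothetical.
class SemiJoinRowGeneration {

    /** Observed (buggy) behaviour: one output row per matching right-side row. */
    static List<String> emitPerMatch(List<String[]> left, List<Integer> rightIds) {
        List<String> out = new ArrayList<>();
        for (String[] l : left) {
            int key = Integer.parseInt(l[1]);
            for (int r : rightIds) {
                if (r == key) {
                    // Emitted again for every match, so rows are duplicated.
                    out.add(l[0] + "\t" + l[1]);
                }
            }
        }
        return out;
    }

    /** Intended behaviour: emit the left row on the first match, then stop. */
    static List<String> emitOnFirstMatch(List<String[]> left, List<Integer> rightIds) {
        List<String> out = new ArrayList<>();
        for (String[] l : left) {
            int key = Integer.parseInt(l[1]);
            for (int r : rightIds) {
                if (r == key) {
                    out.add(l[0] + "\t" + l[1]);
                    break; // further matches must not produce more rows
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> sales = Arrays.asList(
                new String[] {"Joe", "2"},
                new String[] {"Hank", "2"});
        // One id from things.txt and one from things2.txt.
        List<Integer> thingIds = Arrays.asList(2, 2);

        System.out.println("per match:   " + emitPerMatch(sales, thingIds).size() + " rows");
        System.out.println("first match: " + emitOnFirstMatch(sales, thingIds).size() + " rows");
    }
}
```

      With the two 'things' rows from the example, the first strategy yields four rows in the same order as the incorrect output above, and the second yields two.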

      Attachments

        1. ASF.LICENSE.NOT.GRANTED--HIVE-2520.D717.1.patch
          7 kB
          Phabricator
        2. hive-2520.2.patch
          7 kB
          Lijin Bin
        3. hive-2520.patch
          1 kB
          Lijin Bin

          People

            Assignee: Lijin Bin (binlijin)
            Reporter: Lijin Bin (binlijin)