Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.6.0
    • Fix Version/s: None
    • Component/s: Query Processor
    • Labels:
      None

      Description

      Consider this query:

      SELECT a.num FROM (
      SELECT a.num AS num, b.num AS num2
      FROM foo a LEFT OUTER JOIN bar b ON a.num=b.num
      ) a
      WHERE a.num2 IS NULL;

      ...in this case, the table alias 'a' is ambiguous. It could be the outer table (i.e., the subquery result), or it could be the inner table (foo).

      In the above case, Hive silently parses the outer reference to a as the inner reference. The result, then, is akin to:
      SELECT foo.num FROM foo WHERE bar.num IS NULL. This is bad.

      The bigger problem, however, is that Hive even lets people use the same table alias at multiple points in the query. We should simply throw an exception during the parse stage if there is any ambiguity in which table is which, just like we do if the column names are ambiguous.

      Or, if for some reason we need people to be able to use 'a' to refer to multiple tables or subqueries, it would be excellent if the exact parsing structure were made clear and added to the wiki. In that case, I will file a separate bug JIRA to complain about how it should be different.

        Activity

        Hide
        John Sichi added a comment -

        We're fixing the bugs and sticking with the normal SQL rules, which allow duplicate aliases, for the reasons mentioned above.

        Show
        John Sichi added a comment - We're fixing the bugs and sticking with the normal SQL rules, which allow duplicate aliases, for the reasons mentioned above.
        Hide
        John Sichi added a comment -

        Actually, after thinking about it some more, it's not practical to prevent alias reuse, even in strict mode. Here's why.

        Suppose I have

        CREATE VIEW V AS SELECT * FROM BLAH T1 JOIN FLUB T2 ON T1.J=T2.K;

        SELECT * FROM V T1 WHERE T1.X=3;

        When we expand the view reference in the query, we'll end up with

        SELECT * FROM (
        SELECT * FROM BLAH T1 JOIN FLUB T2 ON T1.J=T2.K
        ) V T1 WHERE T1.X=3;

        And now in the expansion, T1 is legitimately duplicated, even though the person querying the view didn't even know that T1 was used inside the view definition (in general, could be very deep).

        Expanding the view in this way is what allows us to do a lot of optimizations such as pushing predicates (e.g. T1.X=3) all the way down into the view.

        Show
        John Sichi added a comment - Actually, after thinking about it some more, it's not practical to prevent alias reuse, even in strict mode. Here's why. Suppose I have CREATE VIEW V AS SELECT * FROM BLAH T1 JOIN FLUB T2 ON T1.J=T2.K; SELECT * FROM V T1 WHERE T1.X=3; When we expand the view reference in the query, we'll end up with SELECT * FROM ( SELECT * FROM BLAH T1 JOIN FLUB T2 ON T1.J=T2.K ) V T1 WHERE T1.X=3; And now in the expansion, T1 is legitimately duplicated, even though the person querying the view didn't even know that T1 was used inside the view definition (in general, could be very deep). Expanding the view in this way is what allows us to do a lot of optimizations such as pushing predicates (e.g. T1.X=3) all the way down into the view.
        Hide
        John Sichi added a comment -

        OK, I did some testing and verified that

        1) this is due to the combination of reused alias and predicate push-down (either turning off ppd or using a different outermost alias fixes the plan)
        2) this is a different bug from HIVE-1342 (I tried applying Ted's patch and it did not fix this one)

        I'm going to let someone else who really wants it create a separate issue for preventing alias reuse in strict mode; as Ted says, we should fix these two bugs independent of that.

        Show
        John Sichi added a comment - OK, I did some testing and verified that 1) this is due to the combination of reused alias and predicate push-down (either turning off ppd or using a different outermost alias fixes the plan) 2) this is a different bug from HIVE-1342 (I tried applying Ted's patch and it did not fix this one) I'm going to let someone else who really wants it create a separate issue for preventing alias reuse in strict mode; as Ted says, we should fix these two bugs independent of that.
        Hide
        Ted Xu added a comment -

        +1 to disallow it in strict mode. Simply disallowing it at all may be too aggressive and cause more confusion.
        I'm pretty sure that ambiguous is caused by predicate push down, we shall work it out rather than avoid it.

        Show
        Ted Xu added a comment - +1 to disallow it in strict mode. Simply disallowing it at all may be too aggressive and cause more confusion. I'm pretty sure that ambiguous is caused by predicate push down, we shall work it out rather than avoid it.
        Hide
        Adam Kramer added a comment -

        Right--this is reported as a bug because Hive is crossing levels. The first use of a.num in the above query should look into the CLOSEST scope, which is the subquery labeled a. What hive is doing here is looking into the NON-CLOSEST scope, and returning foo.num when it should return (subquery).num. That is the bug.

        Whether we should allow it at ALL in hive, since it's confusing, is a broader question. I vote either for disallowing it at all, or disallowing it in strict mode.

        Show
        Adam Kramer added a comment - Right--this is reported as a bug because Hive is crossing levels. The first use of a.num in the above query should look into the CLOSEST scope, which is the subquery labeled a. What hive is doing here is looking into the NON-CLOSEST scope, and returning foo.num when it should return (subquery).num. That is the bug. Whether we should allow it at ALL in hive, since it's confusing, is a broader question. I vote either for disallowing it at all, or disallowing it in strict mode.
        Hide
        John Sichi added a comment -

        In the SQL standard, aliases are allowed to be reused at different levels of query nesting; conflicts are only illegal in the same FROM clause. This is the same rule as for column names, actually (they can be reused at different levels of the query nesting, but not at the same level). There are corresponding rules for resolving references when correlated subqueries are used (looking up into the closest scope).

        I'm fine with making the rules stricter for Hive, since reusing a table alias is very confusing; just pointing this out.

        Show
        John Sichi added a comment - In the SQL standard, aliases are allowed to be reused at different levels of query nesting; conflicts are only illegal in the same FROM clause. This is the same rule as for column names, actually (they can be reused at different levels of the query nesting, but not at the same level). There are corresponding rules for resolving references when correlated subqueries are used (looking up into the closest scope). I'm fine with making the rules stricter for Hive, since reusing a table alias is very confusing; just pointing this out.

          People

          • Assignee:
            Unassigned
            Reporter:
            Adam Kramer
          • Votes:
            1 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development