Hive
  1. Hive
  2. HIVE-2845

Add support for index joins in Hive

    Details

      Description

      Hive supports indexes, which are used for filters currently.

      It would be very useful to add support for index-based joins in Hive.
      If 2 tables A and B are being joined, and an index exists on the join key of A,
      B can be scanned (by the mappers), and for each row in B, a lookup for the corresponding row in A can be performed.
      This can be very useful for some usecases.

        Activity

        Namit Jain created issue -
        Namit Jain made changes -
        Field Original Value New Value
        Labels gsoc gsoc2012
        Hide
        Srinivasan Sembakkam Rajivelu added a comment -

        Could you please explain in which use case this will be useful. Because I feel that there is a space overhead for storing the index is also there.

        And there is already a map join optimization implemented in Hive,
        http://www.facebook.com/notes/facebook-engineering/join-optimization-in-apache-hive/470667928919

        So this index creation will be helpful do u think??

        Show
        Srinivasan Sembakkam Rajivelu added a comment - Could you please explain in which use case this will be useful. Because I feel that there is a space overhead for storing the index is also there. And there is already a map join optimization implemented in Hive, http://www.facebook.com/notes/facebook-engineering/join-optimization-in-apache-hive/470667928919 So this index creation will be helpful do u think??
        Hide
        Srinivasan Sembakkam Rajivelu added a comment -

        Could you please explain in which use case this will be useful. Because I feel that there is a space overhead for storing the index is also there.

        And there is already a map join optimization implemented in Hive,
        http://www.facebook.com/notes/facebook-engineering/join-optimization-in-apache-hive/470667928919

        So this index creation will be helpful do u think??

        Show
        Srinivasan Sembakkam Rajivelu added a comment - Could you please explain in which use case this will be useful. Because I feel that there is a space overhead for storing the index is also there. And there is already a map join optimization implemented in Hive, http://www.facebook.com/notes/facebook-engineering/join-optimization-in-apache-hive/470667928919 So this index creation will be helpful do u think??
        Hide
        Srinivasan Sembakkam Rajivelu added a comment -

        Could you please explain in which use case this will be useful. Because I feel that there is a space overhead for storing the index is also there.

        And there is already a map join optimization implemented in Hive,
        http://www.facebook.com/notes/facebook-engineering/join-optimization-in-apache-hive/470667928919

        So this index creation will be helpful do u think??

        Show
        Srinivasan Sembakkam Rajivelu added a comment - Could you please explain in which use case this will be useful. Because I feel that there is a space overhead for storing the index is also there. And there is already a map join optimization implemented in Hive, http://www.facebook.com/notes/facebook-engineering/join-optimization-in-apache-hive/470667928919 So this index creation will be helpful do u think??
        Hide
        Namit Jain added a comment -

        This may not be very useful in the current hive setup, since the point lookup on the hive index is not very fast.
        However, this opens up a very wide variety of applications.

        Consider the scenario when one of the tables is stored in HBase. In that case, a join with other table can be reduced
        to a map-only job, with the mapper doing a point lookup for every row. This is very different from map-join where one of
        the tables is so small that it fits in memory.

        Show
        Namit Jain added a comment - This may not be very useful in the current hive setup, since the point lookup on the hive index is not very fast. However, this opens up a very wide variety of applications. Consider the scenario when one of the tables is stored in HBase. In that case, a join with other table can be reduced to a map-only job, with the mapper doing a point lookup for every row. This is very different from map-join where one of the tables is so small that it fits in memory.
        Carl Steinbach made changes -
        Labels gsoc gsoc2012 gsoc gsoc2012 indexing joins performance
        Component/s Indexing [ 12313710 ]
        Component/s Query Processor [ 12312586 ]
        Hide
        Mahsa Mofidpoor added a comment -

        When where clause makes use of indexes, the query is re-written to process the index table, rather than the base one. The predicate (lets' assume a single condition) produces one single value to lookup in the index table.
        1-In case of a join, for each row in table B a lookup should be done, does it mean that for each row query should be re-written? This would lead to so many temp files.
        2- Lets's say the lookup is done, how the association between the result from lookup and the original base table is maintained? Finally we have to access the base table for extracting the desired columns.

        Show
        Mahsa Mofidpoor added a comment - When where clause makes use of indexes, the query is re-written to process the index table, rather than the base one. The predicate (lets' assume a single condition) produces one single value to lookup in the index table. 1-In case of a join, for each row in table B a lookup should be done, does it mean that for each row query should be re-written? This would lead to so many temp files. 2- Lets's say the lookup is done, how the association between the result from lookup and the original base table is maintained? Finally we have to access the base table for extracting the desired columns.
        Hide
        Mahsa Mofidpoor added a comment -

        what will be the operator rule that matches with this? Is it a join followed by a tablescan("JOIN%.*TS")?

        Show
        Mahsa Mofidpoor added a comment - what will be the operator rule that matches with this? Is it a join followed by a tablescan("JOIN%.*TS")?
        Mahsa Mofidpoor made changes -
        Labels gsoc gsoc2012 indexing joins performance indexing joins performance
        Hide
        Mahsa Mofidpoor added a comment -

        Can anybody tell me how close this JIRA is with HIVE-1644, I have already tried the same approach introduced in HIVE-1644 to perform join using the index but it produces nothing at the end.

        Show
        Mahsa Mofidpoor added a comment - Can anybody tell me how close this JIRA is with HIVE-1644 , I have already tried the same approach introduced in HIVE-1644 to perform join using the index but it produces nothing at the end.
        Hide
        He Yongqiang added a comment -

        With HIVE-1644, this should be done. Have you looked at the query plan, or looked at the patch of HIVE-1644? Maybe Hive-1644 does not process join cases (but the code is already there.) The filter needs to be pushed down to the mapper to trigger the auto index.

        Show
        He Yongqiang added a comment - With HIVE-1644 , this should be done. Have you looked at the query plan, or looked at the patch of HIVE-1644 ? Maybe Hive-1644 does not process join cases (but the code is already there.) The filter needs to be pushed down to the mapper to trigger the auto index.
        Hide
        Mahsa Mofidpoor added a comment -

        I've checked the query plan. It does not process joins.
        In this case the join key should be pushed to the mapper, but isn't it more appropriate to replace the TS operator(like HIVE-1694) rather than recompiling the query?

        Show
        Mahsa Mofidpoor added a comment - I've checked the query plan. It does not process joins. In this case the join key should be pushed to the mapper, but isn't it more appropriate to replace the TS operator(like HIVE-1694 ) rather than recompiling the query?
        Hide
        Namit Jain added a comment -

        Mahasa asked me the following:::

        In the following query:

        SELECT col_list FROM A JOIN B ON (A.col1 = B.col1)

        1. Since all tables are buffered except for the last one which is streamed, it is only table B that can make use of the index, am I right?

        >>>>> No, indexes can be used for both the tables if there are filters on any of the tables. It would be after the tablescan for either A or B.

        2. In order to do this, in the mapper, the TS should be done on the index table rather than the base table; what about the reduce stage? Don't you need to have access to base table/index table in the reduce phase too? For applying SEL?

        >>>>> Yes, the TS would be on the index table for either A or B. There would be no change after that – no change in the reduce phase.

        3. As far as I know, filter pushdown and group by use indexes to accelerate the query. Filter pushdown recompiles the re-written query whereas GB only replaces appropriate operators of the operator tree. Which one is more suitable to be inspired to implement HIVE-2845?

        >>>>> HIVE-2845 requires new changes. Essentially, one of the tables, say A would be read completely, and the other one, B would be probed for each key of A,
        or vice versa.

        4. May I ask to assign this ticket to me?

        >>>>> Yes, I dont think anyone is working on it right now.

        Show
        Namit Jain added a comment - Mahasa asked me the following::: In the following query: SELECT col_list FROM A JOIN B ON (A.col1 = B.col1) 1. Since all tables are buffered except for the last one which is streamed, it is only table B that can make use of the index, am I right? >>>>> No, indexes can be used for both the tables if there are filters on any of the tables. It would be after the tablescan for either A or B. 2. In order to do this, in the mapper, the TS should be done on the index table rather than the base table; what about the reduce stage? Don't you need to have access to base table/index table in the reduce phase too? For applying SEL? >>>>> Yes, the TS would be on the index table for either A or B. There would be no change after that – no change in the reduce phase. 3. As far as I know, filter pushdown and group by use indexes to accelerate the query. Filter pushdown recompiles the re-written query whereas GB only replaces appropriate operators of the operator tree. Which one is more suitable to be inspired to implement HIVE-2845 ? >>>>> HIVE-2845 requires new changes. Essentially, one of the tables, say A would be read completely, and the other one, B would be probed for each key of A, or vice versa. 4. May I ask to assign this ticket to me? >>>>> Yes, I dont think anyone is working on it right now.
        Hide
        Mahsa Mofidpoor added a comment -

        Can you please clarify on the new changes in more details?

        Show
        Mahsa Mofidpoor added a comment - Can you please clarify on the new changes in more details?
        Hide
        Namit Jain added a comment -

        Say there is a index of table A on 'key'.

        For a query of the type:

        select .. from A join B on A.key=B.key;

        the plan can be as follows:

        scan B
        for every row of B (or a batch of rows in B), lookup the value using the index in A

        The basic infra-structure is needed first. A lot of optimizations can be added later.

        Show
        Namit Jain added a comment - Say there is a index of table A on 'key'. For a query of the type: select .. from A join B on A.key=B.key; the plan can be as follows: scan B for every row of B (or a batch of rows in B), lookup the value using the index in A The basic infra-structure is needed first. A lot of optimizations can be added later.
        Hide
        Mahsa Mofidpoor added a comment -

        I put my code in optimizer/physical, pretty much the same as how HIVE-1644 is organized.
        I removed the TS from topToTable and opParseContext and also the table from topOps. Then I updated the correspondent operator and table(index table) in OpParseContext and ParseContext respectively, but the map logs shows the original table still being scanned.
        1.Should I regenerate the operator tree afterwards? probably by using RewriteParseContextGenerator?
        2.Is a new RowResolver needed?

        Show
        Mahsa Mofidpoor added a comment - I put my code in optimizer/physical, pretty much the same as how HIVE-1644 is organized. I removed the TS from topToTable and opParseContext and also the table from topOps. Then I updated the correspondent operator and table(index table) in OpParseContext and ParseContext respectively, but the map logs shows the original table still being scanned. 1.Should I regenerate the operator tree afterwards? probably by using RewriteParseContextGenerator? 2.Is a new RowResolver needed?
        Hide
        Mahsa Mofidpoor added a comment -

        Finally, I replaced the table in the operator tree with the index table. Now the problem is that i have only access to the indexed cols of the replaced base table. With the bucket name and offset it should not be a big deal to reach other columns.
        Can you please give me some clues on the stages in which other columns are extarcted by bucket and offset?

        Show
        Mahsa Mofidpoor added a comment - Finally, I replaced the table in the operator tree with the index table. Now the problem is that i have only access to the indexed cols of the replaced base table. With the bucket name and offset it should not be a big deal to reach other columns. Can you please give me some clues on the stages in which other columns are extarcted by bucket and offset?

          People

          • Assignee:
            Unassigned
            Reporter:
            Namit Jain
          • Votes:
            4 Vote for this issue
            Watchers:
            19 Start watching this issue

            Dates

            • Created:
              Updated:

              Development