Details

    • Type: New Feature New Feature
    • Status: Patch Available
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: physical operator
    • Labels:

      Description

      The hybrid hash join algorithm is well known as the best join algorithm which takes advantage of more available memory. The goal of this proposal is to implement HybridHashJoinExec class subclassed from PhysicalExec. Its original literature is "Join processing in database systems with large main memories" (http://www.cs.ucr.edu/~tsotras/cs236/W13/join.pdf). In addition, you could find many other materials available in the web.

      1. TAJO-35_2.patch
        65 kB
        Hyunsik Choi
      2. TAJO-35.patch
        134 kB
        Sergio Esteves

        Activity

        Hyunsik Choi made changes -
        Attachment TAJO-35_2.patch [ 12606982 ]
        Hide
        Hyunsik Choi added a comment -

        I've rebased and merged the first patch. For a while, I'll continue to improve your work. Thank you for your contribution!

        Show
        Hyunsik Choi added a comment - I've rebased and merged the first patch. For a while, I'll continue to improve your work. Thank you for your contribution!
        Hide
        Hyunsik Choi added a comment -

        Style changes have to be performed in a separate issue If not, It's very hard to distinguish the intended change and just style modification.

        Show
        Hyunsik Choi added a comment - Style changes have to be performed in a separate issue If not, It's very hard to distinguish the intended change and just style modification.
        Hide
        Hyunsik Choi added a comment -

        Have you verified 'mvn clean install'? This patch causes many unit tests errors in 'mvn test'.

        Show
        Hyunsik Choi added a comment - Have you verified 'mvn clean install'? This patch causes many unit tests errors in 'mvn test'.
        Hide
        Hyunsik Choi added a comment -

        Thank you for your contribution. I'll take a look at this patch in today's night.

        Show
        Hyunsik Choi added a comment - Thank you for your contribution. I'll take a look at this patch in today's night.
        Sergio Esteves made changes -
        Attachment TAJO-35.patch [ 12606712 ]
        Sergio Esteves made changes -
        Status In Progress [ 3 ] Patch Available [ 10002 ]
        Sergio Esteves made changes -
        Status Open [ 1 ] In Progress [ 3 ]
        Hyunsik Choi made changes -
        Assignee Sergio Esteves [ sesteves ]
        Hide
        Sergio Esteves added a comment -

        Thank you very much for the feedback.
        I already started exploring the source code, and perhaps I will complement my proposal with some more specific details about TAJO and physical operators.

        Show
        Sergio Esteves added a comment - Thank you very much for the feedback. I already started exploring the source code, and perhaps I will complement my proposal with some more specific details about TAJO and physical operators.
        Hide
        Jihoon Son added a comment -

        Hi, Sergio

        Your proposal looks great. As Hyunsik said above, you need to understand how Tajo works.
        We always ready to answer your questions.

        Show
        Jihoon Son added a comment - Hi, Sergio Your proposal looks great. As Hyunsik said above, you need to understand how Tajo works. We always ready to answer your questions.
        Hide
        Hyunsik Choi added a comment -

        Looks good to me. The proposal is well-surveyed and contains well-written approach and plan.

        Firstly, you may need to dig Tajo source code. You can find more information about Tajo from http://tajo.incubator.apache.org and http://wiki.apache.org/tajo. Especially, "GettingStarted" (http://wiki.apache.org/tajo/GettingStarted) is a good point to start, and "HowToContribute" would be helpful for you.

        TestBNLJoinExec and TestHashJoinExec are good starting points for understanding physical operators. With these unit tests, you can also execute certain physical operators.

        Also, I would like to suggest you not to miss the deadline.

        Show
        Hyunsik Choi added a comment - Looks good to me. The proposal is well-surveyed and contains well-written approach and plan. Firstly, you may need to dig Tajo source code. You can find more information about Tajo from http://tajo.incubator.apache.org and http://wiki.apache.org/tajo . Especially, "GettingStarted" ( http://wiki.apache.org/tajo/GettingStarted ) is a good point to start, and "HowToContribute" would be helpful for you. TestBNLJoinExec and TestHashJoinExec are good starting points for understanding physical operators. With these unit tests, you can also execute certain physical operators. Also, I would like to suggest you not to miss the deadline.
        Hide
        Sergio Esteves added a comment -

        Here is my proposal to this project: https://docs.google.com/document/d/1NucTfg4Uhsa7WFyPokHHSwQbWlfkGf9ajwdFjVznXj8/edit

        Any feedback appreciated.

        Show
        Sergio Esteves added a comment - Here is my proposal to this project: https://docs.google.com/document/d/1NucTfg4Uhsa7WFyPokHHSwQbWlfkGf9ajwdFjVznXj8/edit Any feedback appreciated.
        Gavin made changes -
        Workflow jira [ 12777541 ] patch-available, re-open possible [ 12778007 ]
        Hide
        Hyunsik Choi added a comment -

        The current Tajo has three join physical operators, such as BNLJoinExec, HashJoinExec, and MergeJoinExec. HashJoinExec builds a in-memory hash table for the smaller relation, and for each tuple in the larger relation it probes the hash table. It works very well and fast in many cases.

        However, it doesn't work if the smaller relation does not fit in memory. In such a case, now we use ExternalSortExec and then MergeJoinExec. It also works well, but external sort is very costly.

        Hybrid hash join (http://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join) is a well-known join algorithm. It is a variation of Grace hash join algorithm. Grace hash join algorithm works as follows:

        Given two relations R and S to be joined,
        1. Grace hash join algorithm partitions R into N buckets by hashing join keys in order to fit each bucket in memory
        2. it also partitions S into N buckets by hashing join keys
        3. for each pair of [i] buckets, 1 <= i <= N, Grace hash joins builds a in-memory hash table for bucket [i] of S relation, and probes the hash table for each tuple in bucket [i] of R relation.

        Hybrid hash join just keeps the first bucket of S in memory and probes the first bucket of R directly in order to avoid the write and read of the first bucket pair.

        You can find many other materials (e.g., PPTs, book chapters, and papers) for hybrid hash join algorithm via google.

        Thanks!

        Show
        Hyunsik Choi added a comment - The current Tajo has three join physical operators, such as BNLJoinExec, HashJoinExec, and MergeJoinExec. HashJoinExec builds a in-memory hash table for the smaller relation, and for each tuple in the larger relation it probes the hash table. It works very well and fast in many cases. However, it doesn't work if the smaller relation does not fit in memory. In such a case, now we use ExternalSortExec and then MergeJoinExec. It also works well, but external sort is very costly. Hybrid hash join ( http://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join ) is a well-known join algorithm. It is a variation of Grace hash join algorithm. Grace hash join algorithm works as follows: Given two relations R and S to be joined, 1. Grace hash join algorithm partitions R into N buckets by hashing join keys in order to fit each bucket in memory 2. it also partitions S into N buckets by hashing join keys 3. for each pair of [i] buckets, 1 <= i <= N, Grace hash joins builds a in-memory hash table for bucket [i] of S relation, and probes the hash table for each tuple in bucket [i] of R relation. Hybrid hash join just keeps the first bucket of S in memory and probes the first bucket of R directly in order to avoid the write and read of the first bucket pair. You can find many other materials (e.g., PPTs, book chapters, and papers) for hybrid hash join algorithm via google. Thanks!
        Hide
        Burak ISIKLI added a comment -

        Hello,
        I'm a master student in Gebze Institute of Technology. I'm working on about
        database, distributed system, datamining, IR
        I'm interested in working with this issue. Could you do more explanation
        about it what you mean hybrid hash operator?

        Thanks
        Best Regards...

        BURAK ISIKLI* *| *http://burakisikli.wordpress.com*
        *
        *

        Show
        Burak ISIKLI added a comment - Hello, I'm a master student in Gebze Institute of Technology. I'm working on about database, distributed system, datamining, IR I'm interested in working with this issue. Could you do more explanation about it what you mean hybrid hash operator? Thanks Best Regards... – BURAK ISIKLI * *| * http://burakisikli.wordpress.com* * *
        Hyunsik Choi made changes -
        Field Original Value New Value
        Labels gsoc2013 mentor gsoc gsoc2013 mentor
        Hyunsik Choi created issue -

          People

          • Assignee:
            Sergio Esteves
            Reporter:
            Hyunsik Choi
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:

              Development