Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.7.0
    • Component/s: Query Processor
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      Concurrency model for Hive:

      Currently, Hive does not provide a good concurrency model. The only guarantee provided in the case of concurrent readers and writers is that a
      reader will not see partial data from the old version (before the write) mixed with partial data from the new version (after the write).
      This has come across as a big problem, especially for background processes performing maintenance operations.

      The following possible solutions come to mind.

      1. Locks: Acquire read/write locks - they can be acquired at the beginning of the query, or the write locks can be delayed until the move
      task (when the directory is actually moved). Care needs to be taken to avoid deadlocks.

      2. Versioning: The writer can create a new version if the current version is being read. Note that this is not equivalent to snapshots:
      the old version can only be accessed by the current readers, and will be deleted when all of them have finished.
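
      For illustration, a minimal sketch of the versioning idea in option 2, assuming a simple reference-counted scheme; the class and
      method names below are invented for this sketch and do not exist in Hive.

          // Sketch of option 2 (versioning): readers pin the version that is
          // current when they start, a writer publishes a new version, and a
          // superseded version is deleted once its last reader finishes.
          import java.util.HashMap;
          import java.util.Map;

          class VersionedTableSketch {
            private int currentVersion = 0;
            private final Map<Integer, Integer> readerCounts = new HashMap<Integer, Integer>();

            // A reader pins the version that is current when it starts.
            synchronized int openRead() {
              Integer count = readerCounts.get(currentVersion);
              readerCounts.put(currentVersion, count == null ? 1 : count + 1);
              return currentVersion;
            }

            // When the last reader of a superseded version finishes, its data can go.
            synchronized void closeRead(int version) {
              int count = readerCounts.get(version) - 1;
              if (count == 0 && version != currentVersion) {
                readerCounts.remove(version);
                deleteVersionData(version);
              } else {
                readerCounts.put(version, count);
              }
            }

            // A writer never modifies the version readers are using; it publishes a new one.
            synchronized void publishNewVersion() {
              currentVersion++;
            }

            private void deleteVersionData(int version) { /* remove the old directory */ }
          }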

      Attachments

      1. hive.1293.9.patch
        122 kB
        Namit Jain
      2. hive.1293.8.patch
        121 kB
        Namit Jain
      3. hive.1293.7.patch
        122 kB
        Namit Jain
      4. hive.1293.6.patch
        117 kB
        Namit Jain
      5. hive.1293.5.patch
        101 kB
        Namit Jain
      6. hive.1293.4.patch
        100 kB
        Namit Jain
      7. hive.1293.3.patch
        99 kB
        Namit Jain
      8. hive.1293.2.patch
        89 kB
        Namit Jain
      9. hive.1293.1.patch
        87 kB
        Namit Jain
      10. hive_leases.txt
        3 kB
        Prasad Chakka

        Issue Links

          Activity

          Ashish Thusoo added a comment -

          I would vote for versioning. Since we do not have to deal with the complexity of a buffer cache I think this would be much simpler to implement than what is possible in traditional databases. At the same time, for locks we will have to do a lease based mechanism anyway in order to protect against locks leaking because of client crashes. And when you account for that, it seems that locking would not be significantly simpler to implement than versioning.

          Namit Jain added a comment -

           One option is to use ZooKeeper for locking - we don't need to worry about leases since ZooKeeper supports ephemeral nodes.
           The ZooKeeper quorum can be specified via configuration parameters, and it needs to be specified for concurrency to be enabled.
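
           For illustration, a minimal sketch of an exclusive lock held as an ephemeral znode, assuming a standard ZooKeeper 3.x client;
           the lock path, session timeout, and error handling are invented for this sketch and are not taken from the patch.

               // Hedged sketch: an exclusive lock as an ephemeral znode. If the client
               // crashes, its session expires and ZooKeeper deletes the node, so no lease
               // bookkeeping is needed. Parent znodes are assumed to already exist.
               import org.apache.zookeeper.CreateMode;
               import org.apache.zookeeper.KeeperException;
               import org.apache.zookeeper.WatchedEvent;
               import org.apache.zookeeper.Watcher;
               import org.apache.zookeeper.ZooDefs.Ids;
               import org.apache.zookeeper.ZooKeeper;

               public class EphemeralLockSketch {
                 public static void main(String[] args) throws Exception {
                   ZooKeeper zk = new ZooKeeper("zkhost:2181", 30000, new Watcher() {
                     public void process(WatchedEvent event) { /* no-op watcher */ }
                   });
                   String lockPath = "/hive_locks/default/T1"; // illustrative path
                   try {
                     // EPHEMERAL: the node disappears automatically if this session dies.
                     zk.create(lockPath, new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                     System.out.println("lock acquired");
                     // ... run the query ...
                   } catch (KeeperException.NodeExistsException e) {
                     System.out.println("lock held by another client; fail fast or retry");
                   } finally {
                     zk.close(); // closing the session releases any ephemeral nodes we created
                   }
                 }
               }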

          Namit Jain added a comment -

          The initial writeup is at http://wiki.apache.org/hadoop/Hive/Locking.
          Please comment

          John Sichi added a comment -

          Just to clarify, when you say this:

          "The lock modes are hierarchical in nature - if 'S' lock is acquired on T, implicitly 'S' lock is acquired on all partitions of T."

          it only applies when no partition is specified, so for

          select .. T1 partition P1 => S on T1, T1.P1

          we only S-lock T1.P1 (not all the other partitions of T), right?

          Namit Jain added a comment -

          I meant the above to be only true for 'X' locks - updated the wiki

          Edward Capriolo added a comment -

           I see you have made mention of ZooKeeper. Do we really need ZooKeeper to implement locking when we already have a transactional MetaStore via JPOX (Derby/MySQL)? From my perspective, on both the management side and the development side, having to integrate and understand the ZooKeeper API for simple low-performance locking seems like overkill.

           For example, I understand that ZooKeeper has facilities to auto-expire old locks and other features that make it seem ideal for this use case, but I think the visibility into it is rather low. If our locks were implemented in the metastore, clearing them would be easy to do through Hive commands or with JDBC. What tools would we have to investigate Hive-ZooKeeper locking?

           I say this because of experience with hbase-zookeeper. Other than 'hbase status', ZooKeeper is a total black box and makes troubleshooting that component of HBase difficult. I would not want to see Hive go down the same road.

           No matter which implementation we go with, we should have DDL commands like 'show locks' and 'clear locks'.

          John Sichi added a comment -

          Regarding the implementation of lock acquisition/release in terms of zookeeper, can you elaborate on how you are proposing to map the lock hierarchy into the znode hierarchy?

          It sounds like the znode paths will correspond to database.table.partition.subpartition... paths, which is good.

          However, does the lock acquisition recipe actually need to reflect the hierarchy, which I think is what you are proposing? In other words, can't we just come up with a flat list of object locks to take (including both table-level and partition-level), sort them, and then acquire them each independently using the non-hierarchical recipe (except as you mention with failfast instead of wait)? If any fail, then delete them all before re-entering the retry loop.

          Assuming the sorting matches the compound naming scheme, this should guarantee hierarchical lock acquisition order within each table.

           Also, I do not understand the part below.

           "The 'X' lock for table T is specified as follows:
           ...
           1. For all parent znodes of T, call getChildren() without setting the watch flag."

           Do you mean "for the parent znode of T" rather than "all parent znodes of T", and is this supposed to apply for the case where T is actually a partition?
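
           For illustration, a sketch of the flat, sorted, fail-fast acquisition described above, with placeholder types and lock-manager
           calls rather than the classes in the patch.

               // Sketch: collect table- and partition-level locks as a flat list, sort by
               // path so parents come before children, acquire each with fail-fast, and
               // release everything on any failure before re-entering the retry loop.
               import java.util.ArrayList;
               import java.util.Collections;
               import java.util.List;

               class LockRequest implements Comparable<LockRequest> {
                 final String path;   // e.g. "db/T1" or "db/T1/p=1"
                 final char mode;     // 'S' or 'X'
                 LockRequest(String path, char mode) { this.path = path; this.mode = mode; }
                 public int compareTo(LockRequest o) { return path.compareTo(o.path); }
               }

               class FlatLockListSketch {
                 boolean acquireAll(List<LockRequest> requests) {
                   Collections.sort(requests);                  // parents sort before children
                   List<LockRequest> held = new ArrayList<LockRequest>();
                   for (LockRequest r : requests) {
                     if (!tryAcquire(r)) {                      // fail fast, no waiting
                       for (LockRequest h : held) release(h);   // drop everything we got
                       return false;                            // caller re-enters the retry loop
                     }
                     held.add(r);
                   }
                   return true;
                 }
                 boolean tryAcquire(LockRequest r) { return true; } // placeholder lock-manager call
                 void release(LockRequest r) { }                    // placeholder lock-manager call
               }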

          Namit Jain added a comment -

           @Edward, I agree that the main reason we are tilting towards ZooKeeper is its ability to do auto-cleanup, which is very useful.
           I can add an interface, and there can be 2 implementations: one with ZooKeeper and the other with the metastore.
           I will try to address the other suggestion about 'show locks' and 'clear locks' - it will be very useful for debugging.
           But if I am stuck, I will do it in a follow-up jira.

          Namit Jain added a comment -

           @John, if we lock the whole flat list of objects, it might be a huge list - to lock a table 'T' in 'X' mode, we would be locking all the sub-partitions.
           Given that we have thousands of partitions for some tables, it will just be a problem for debugging (even more so once we have 'show locks' etc.).

          Do you mean "for the parent znode of T" rather than "all parent znodes of T", and this is supposed to apply for case where T is actually a partition?

          --> yes, will fix the wiki

          Namit Jain added a comment -

           Forgot to add before: I was thinking of a one-to-one mapping between a table and a znode.

           For example, if there are 2 tables T1 and T2, with partitions T1.p=1, T1.p=2 and T2.p1=1/p2=1, T2.p1=1/p2=2 respectively, the corresponding znodes will be:

           T1
           -> p=1
           -> p=2
           T2
           -> p1=1
              -> p2=1
              -> p2=2
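
           For illustration, a tiny sketch of such a one-to-one mapping from a table plus partition spec to a znode path; the /hive_locks
           root is an assumption made for this sketch.

               // Mirrors the tree above, e.g. T2 with p1=1/p2=2 -> /hive_locks/T2/p1=1/p2=2.
               class ZNodePathSketch {
                 static String toZNodePath(String table, String... partitionSpecs) {
                   StringBuilder path = new StringBuilder("/hive_locks/").append(table);
                   for (String spec : partitionSpecs) {   // e.g. "p1=1", "p2=2"
                     path.append('/').append(spec);
                   }
                   return path.toString();
                 }

                 public static void main(String[] args) {
                   System.out.println(toZNodePath("T1", "p=1"));          // /hive_locks/T1/p=1
                   System.out.println(toZNodePath("T2", "p1=1", "p2=2")); // /hive_locks/T2/p1=1/p2=2
                 }
               }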

          John Sichi added a comment -

          @Namit: do we actually have to explicitly X-lock all partitions when we X-lock a table? Doesn't the fact that we always S-lock the table before locking specific partitions already serve as a form of intent lock (we can't get the table X-lock if anyone already has a table S-lock, and we can't get a table S-lock if someone already has the table X-lock)?

          Namit Jain added a comment -

           Yes, we don't have to X-lock all partitions. But then, before acquiring an S-lock on a partition, we need to check that its parent is not X-locked, which is what I am proposing.

          John Sichi added a comment -

          Right, you get this if for partition p.q.r in t, you add the following to the flat lock list:

          t (S)
          t.p (S)
          t.p.q (S)
          t.p.q.r (S or X depending what the operation is)

          This doesn't add a lot of extra locks in general since there are more children than parents, it makes the low-level recipe a little simpler, and maybe makes show locks output clearer.

          It might be exactly what you are already proposing, in which case we're in agreement.
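
           For illustration, a sketch of building that flat lock list for a dotted name like t.p.q.r, with 'S' on every ancestor and the
           requested mode on the leaf; the types here are placeholders, not the patch's classes.

               // Builds [t S], [t.p S], [t.p.q S], [t.p.q.r X] for a write to t.p.q.r.
               import java.util.ArrayList;
               import java.util.List;

               class LockListSketch {
                 static List<String[]> lockListFor(String dottedName, String leafMode) {
                   List<String[]> locks = new ArrayList<String[]>();
                   String[] parts = dottedName.split("\\.");
                   StringBuilder prefix = new StringBuilder();
                   for (int i = 0; i < parts.length; i++) {
                     if (i > 0) {
                       prefix.append('.');
                     }
                     prefix.append(parts[i]);
                     // Ancestors always get 'S'; only the leaf gets the requested mode.
                     String mode = (i == parts.length - 1) ? leafMode : "S";
                     locks.add(new String[] { prefix.toString(), mode });
                   }
                   return locks;
                 }

                 public static void main(String[] args) {
                   for (String[] lock : lockListFor("t.p.q.r", "X")) {
                     System.out.println(lock[0] + " " + lock[1]);
                   }
                 }
               }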

          Namit Jain added a comment -

           Agreed, this is cleaner than what I had. I was checking the parents; you are suggesting locking the parents in 'S' mode, which achieves the desired effect
           but removes the need for hierarchy in the lock manager.

           It is even better given that we may have different lock manager implementations. I will update the wiki.

          Namit Jain added a comment -

          The unit tests are running right now (they should succeed) - submitted a patch for review.

          Also, all the jar files (3 of them) from hbase-handler/lib should be moved to lib.
          That is not part of the patch since those files are binary

          He Yongqiang added a comment -

          I will take a look.

          He Yongqiang added a comment -

           a few questions so far:
           1) Can the lock implementation guarantee atomicity? I mean, since the lock logic happens on the client side, it seems possible that two concurrent clients could get conflicting locks.
           2) About releasing locks: if a client does an unlock, will it also release locks made by other clients? I mean, if client A did a lock, and then client B did another lock, and client B did an unlock, will client A still hold its lock?

           still reading the code.

          Namit Jain added a comment -

          1. No, 2 clients cannot get conflicting locks – zookeeper will guarantee that, will read again to double-check
          2. If client A cannot get the lock, it will unlock only its own lock. However, there is no security - if client A does unlock table A explicitly, all locks on A are released

          Namit Jain added a comment -

          Confirmed 1

           Prasad Chakka added a comment -

           same as https://issues.apache.org/jira/browse/HIVE-829 ?

          Prasad Chakka added a comment -

           I had written up an algorithm to create leases in the metastore some time back. Not sure how useful it is now, but if someone wants to implement leases without depending on a 3rd-party system, this may come in handy.

          He Yongqiang added a comment -

          I am going to commit this patch in the next few days. Please post your comments if you have any, so we can fix them before this patch gets in.

          Namit Jain added a comment -

          The patch updated

          John Sichi added a comment -

          Added comments here:

          https://review.cloudera.org/r/563/

          (doesn't seem to be adding comments here)

          Namit Jain added a comment -

          comments addressed.

          running tests right now -

          Namit Jain added a comment -

          minor bug

          John Sichi added a comment -

          Two configuration questions:

           • You have hive.support.concurrency=true in hive-default.xml. Probably we want it false instead (turned on only during tests), since most people using Hive won't have a zookeeper quorum set up?
          • Isn't there a default value we can use for hive.zookeeper.client.port?

          One lib question:

          • Zookeeper is now available from maven. Maybe we should delete the one in hbase-handler/lib and get it via ivy instead of adding it in the top-level lib? The version we have checked in is 3.2.2, but the maven availability is 3.3.x, so we'd need to test to make sure everything (including hbase-handler) still works with the newer version.

          http://mvnrepository.com/artifact/org.apache.hadoop/zookeeper

          Two cleanups:

          • In QTestUtil.java, you left the following code commented out; can we get rid of it?

           + // for (int i = 0; i < qfiles.length; i++) {
           + //   qsetup[i].tearDown();
           + // }
          • In DDLTask.java, you left some commented-out debugging code (two instances):

          + // console.printError("conflicting lock present " + tbl + " cannot be locked in mode " + mode);

           Namit Jain added a comment - edited

           Did the cleanups and changed the default value of hive.support.concurrency to false.

           Not sure how we can set a default value for hive.zookeeper.client.port.

           Let us do the lib cleanup in a follow-up - filed https://issues.apache.org/jira/browse/HIVE-1533

          Namit Jain added a comment -

          new patch

          John Sichi added a comment -

          I think ZK default client port would be 2181; see HBASE-2305.

          John Sichi added a comment -

          From testing: the parsed lock mode seems to be case-sensitive:

          hive> lock table blah shared;
          Failed with exception No enum const class org.apache.hadoop.hive.ql.lockmgr.HiveLockMode.shared
          FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

          If I use lock table blah SHARED it works.
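
           For illustration, a likely explanation and fix, not verified against the patch: the error text matches what Enum.valueOf throws on
           a case mismatch, so upper-casing the parsed token before the lookup would accept either spelling. The enum name is taken from the
           error message; the surrounding parser code is assumed.

               // Hedged sketch: valueOf("shared") throws because enum constants are
               // upper case; normalizing the token first accepts either spelling.
               enum HiveLockMode { SHARED, EXCLUSIVE }

               class LockModeParseSketch {
                 static HiveLockMode parseLockMode(String token) {
                   return HiveLockMode.valueOf(token.toUpperCase());
                 }

                 public static void main(String[] args) {
                   System.out.println(parseLockMode("shared"));  // SHARED
                   System.out.println(parseLockMode("SHARED"));  // SHARED
                 }
               }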

          Basab Maulik added a comment -

          Re: One lib question: Zookeeper

          hbase-handler with hbase 0.20.x does not work with zk 3.3.1 but works fine with the version it ships with, zk 3.2.2. Have not investigated what breaks.

          John Sichi added a comment -

          Namit, I tried testing with a standalone zookeeper via CLI. Locking a table succeeded, but then show locks didn't show anything, and unlock said the lock didn't exist.

          I think the reason is that CLI is creating a new Driver for each statement executed, and when the old Driver is closed, the lock manager is closed along with it (closing the ZooKeeper client instance). As a result, locks are released immediately after LOCK TABLE is executed.

          When I tested with a thrift server plus two JDBC clients, all was well. I was able to take a lock from one client and prevent the other client from getting the same lock. So I guess the thrift server is keeping one Driver around per connection.

          John Sichi added a comment -

          Here's a scenario which is not working correctly. (Tested with thrift server plus JDBC clients.)

          Existing table foo.

          Client 1: LOCK TABLE foo EXCLUSIVE;

          Client 2: DROP TABLE foo;

          According to the doc, the DROP TABLE should fail, but it succeeds. Same is true for LOAD DATA. Probably the same reason in both cases: for these commands we don't register the output in the PREHOOK (only the POSTHOOK). INSERT is getting blocked correctly since it's in the PREHOOK.

          John Sichi added a comment -

          After seeing some other issues, had a chat with Namit about semantics; here's what we worked out.

          • Normally, locks should only be held for duration of statement execution.
          • However, LOCK TABLE should take a global lock (not tied to any particular session or statement).
          • UNLOCK TABLE should remove both kinds of lock (statement-level and global). Likewise, SHOW LOCKS shows all.
          • For fetching results, we'll need a parameter to control whether a dirty read is possible. Normally, this is not an issue since we're fetching from saved temp results, but when using select * from t to fetch directly from the original table, this behavior makes a difference. To prevent dirty reads, we'll need the statement-level lock to span the duration of the fetch.

          To avoid leaks, we need to make sure that once we create a ZooKeeper client, we always close it.

          John Sichi added a comment -

          Also, as a followup, need to add client info such as hostname, process ID to SHOW LOCKS.

          Joydeep Sen Sarma added a comment -

           A little bummed that locks need to be held for the entire query execution - that could mean a writer blocking readers for hours.

           Hive's query plans seem to have two distinct stages:
           1. read a bunch of stuff, compute intermediate/final data
           2. move final data into output locations

           i.e. a single query never reads what it writes (into a final output location). Even if #1 and #2 are mingled today, they can easily be put in order.

           In that sense, we only need to get shared locks for all the read entities involved in #1 to begin with. Once phase #1 is done, we can drop all the read locks and get the exclusive locks for all the write entities in #2, perform #2, and quit. That way exclusive locks are held for a very short duration. I think this scheme is similarly deadlock free (now there are two independent lock acquire/release phases, and each of them can lock objects in lexicographic order).
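
           For illustration, a sketch of the two-phase scheme described above, with placeholder lock-manager calls rather than the code in
           the patch.

               // Shared locks are held only while intermediate data is computed; exclusive
               // locks are taken afterwards, only around the final move into output
               // locations. Each phase acquires in sorted (lexicographic) order.
               import java.util.Collections;
               import java.util.List;

               class TwoPhaseLockingSketch {
                 void runQuery(List<String> readEntities, List<String> writeEntities) {
                   // Phase 1: shared locks for the read entities.
                   Collections.sort(readEntities);
                   for (String r : readEntities) acquire(r, "S");
                   try {
                     computeIntermediateResults();
                   } finally {
                     for (String r : readEntities) release(r);
                   }

                   // Phase 2: exclusive locks for the write entities, held only for the move.
                   Collections.sort(writeEntities);
                   for (String w : writeEntities) acquire(w, "X");
                   try {
                     moveResultsIntoOutputLocations();
                   } finally {
                     for (String w : writeEntities) release(w);
                   }
                 }

                 void acquire(String entity, String mode) { /* placeholder lock-manager call */ }
                 void release(String entity) { }
                 void computeIntermediateResults() { }
                 void moveResultsIntoOutputLocations() { }
               }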

          Joydeep Sen Sarma added a comment -

          also - i am missing something here:

           + for (WriteEntity output : plan.getOutputs()) {
           +   lockObjects.addAll(getLockObjects(output.getTable(), output.getPartition(), HiveLockMode.EXCLUSIVE));
           + }

           getLockObjects():

           + if (p != null) {
               ...
           +   locks.add(new LockObject(new HiveLockObject(p.getTable()), mode));
           + }

          doesn't this end up locking the table in exclusive mode if a partition is being written to? (whereas the design talks about locking the table in shared mode only?)

          Namit Jain added a comment -

          The partition being written to is locked in exclusive mode - the table should be locked in shared mode.
          The write entity should only consist of the partition.

          There might be a bug there - https://issues.apache.org/jira/browse/HIVE-1548 should populate the inputs and outputs appropriately.
          I will start on this now.

          Joydeep Sen Sarma added a comment -

           Can you check the getLockObjects() routine? It seemed to me that even if you called it with a partition in X mode, it would add the table to the list of objects to be locked as well (in the same X mode).

           I think we should, at least as a follow-on, make the optimization to not lock write entities for the duration of the query.

           Namit Jain added a comment - edited

           Agreed on the bug in getLockObjects() - will have a new patch.

           Filed a new jira for the followup: https://issues.apache.org/jira/browse/HIVE-1557

          Namit Jain added a comment -

           Fixed a lot of bugs, added a lot of comments, and tested it with a ZooKeeper cluster of 3 nodes.
           select * currently performs a dirty read; we can add a new parameter to change that behavior if need be.

          John Sichi added a comment -

          Did some more testing with latest:

          • thrift server is now broken so I'm not able to test multi-conn that way (I can use CLI + MySQL metastore instead though)
          • zookeeper communication is eager, so we start a zk client conn for stuff like show tables and describe table; can we make it lazy instead?
          • EXCLUSIVE/SHARED keyword is still case sensitive
          • something I noticed before but forgot to mention: we're getting NPE noise in the log (below); can we disable the watcher or supply a dummy?

          10/08/19 16:12:38 ERROR zookeeper.ClientCnxn: Error while calling watcher
          java.lang.NullPointerException
          at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:425)

          John Sichi added a comment -

          Some notes:

          • A typical administrative sequence will involve taking a global lock on the object(s) in question, then doing set hive.support.concurrency=false, then doing the housekeeping work, then doing set hive.support.concurrency=true, then unlocking the object(s).
          • select * from t still takes a test lock at the beginning, but does not hold it for the duration of the fetch. In order to get a completely dirty read (skipping the lock), use set hive.support.concurrency=false for that session.

          These don't require fixes or followups; they're just usage notes.

          John Sichi added a comment -

          The method name "getLocks" is used to mean two different things: in one context it means "get me a list of locks which already exist", whereas in another (private method in Driver) it means "acquire these locks". The second one should be renamed to acquireLocks to avoid confusion.

          Namit Jain added a comment -

          another - hopefully, final patch

          Namit Jain added a comment -

          made lock manager creation lazy - ran the relevant tests,
          am running all tests now.

          John Sichi added a comment -

          With latest patch I am still seeing this in ZK log for each show tables:

          2010-08-23 12:31:53,003 - INFO [NIOServerCxn.Factory:2181:NIOServerCnxn@607] - Connected to /fe80:0:0:0:0:0:0:1%1:59209 lastZxid 0
          2010-08-23 12:31:53,003 - INFO [NIOServerCxn.Factory:2181:NIOServerCnxn@639] - Creating new session 0x12aa06e84940003
          2010-08-23 12:31:53,009 - INFO [SyncThread:0:NIOServerCnxn@992] - Finished init of 0x12aa06e84940003 valid:true
          2010-08-23 12:31:53,086 - INFO [ProcessThread:-1:PrepRequestProcessor@384] - Processed session termination request for id: 0x12aa06e84940003
          2010-08-23 12:31:53,088 - INFO [SyncThread:0:NIOServerCnxn@857] - closing session:0x12aa06e84940003 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/fe80:0:0:0:0:0:0:1%1:2181 remote=/fe80:0:0:0:0:0:0:1%1:59209]

          John Sichi added a comment -

          I think you need to check to see whether there are actually any locks to be acquired, and if not, skip checkLockManager.

          Namit Jain added a comment -

           You must have set hive.support.concurrency to true.

           The only case that you are trying to optimize is 'show tables', 'desc ..' etc.
           The problem is that 'show locks' also comes under the same category - no locks need to be acquired.
           I didn't want to special-case 'show locks'.

          John Sichi added a comment -

          +1. Will commit when tests pass.

          John Sichi added a comment -

          Oops, ant test is giving me this build error:

          [javac] /data/users/jsichi/open/commit-trunk/ql/src/test/org/apache/hadoop/hive/ql/QTestUtil.java:939: cannot find symbol
          [javac] symbol : class MiniZooKeeperCluster
          [javac] location: class org.apache.hadoop.hive.ql.QTestUtil.QTestSetup
          [javac] private MiniZooKeeperCluster zooKeeperCluster = null;
          [javac] ^

          I had only moved zookeeper-3.2.2.jar from hbase-handler/lib to top-level lib (but not the hbase-*.jar). I guess we need to move all of them.

          John Sichi added a comment -

          Committed. Thanks Namit!


            People

            • Assignee:
              Namit Jain
              Reporter:
              Namit Jain
            • Votes:
              0
              Watchers:
              19

              Dates

              • Created:
                Updated:
                Resolved:

                Development