HBase: HBASE-6774

Immediate assignment of regions that don't have entries in HLog

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.95.2
    • Fix Version/s: None
    • Component/s: master, regionserver
    • Labels: None

      Description

      The algorithm today, after a failure detection, is:

      • split the logs
      • when all the logs are split, assign the regions

      But some regions can have no entries at all in the HLog. There are many reasons for this:

      • reference or historical tables: bulk-written occasionally, then read-only.
      • sequential rowkeys: in this case, most of the regions will be read-only, but they can sit on a regionserver with a lot of writes.
      • tables flushed often for safety reasons; I'm thinking about meta here.

      For meta, we can imagine flushing very often. Hence, in many cases, the recovery time for meta will be just the failure detection time.

      There are different possible algorithms:
      Option 1)
      A new task is added, in parallel with the split. This task reads all the HLogs. If there is no entry for a region, that region is assigned immediately.
      Pro: simple.
      Cons: we will need to read all the files, adding an extra read.
      Option 2)
      The master writes in ZK the number of log files, per region.
      When the regionserver starts the split, it reads the full block (64M) and decreases the log file counter of each region it finds. If a counter reaches 0, the assignment starts. At the end of its split, the regionserver decreases the counters as well. This allows starting the assignment even before all the HLogs are split, and would make some regions available even if we have an issue in one of the log files.
      Pro: parallel.
      Cons: adds work for the regionserver. Requires reading the whole file before starting to write.
      Option 3)
      Add some metadata at the end of the log file. The last log file won't have metadata, since if we are recovering, it's because the server crashed. But the others will, and the last log file should be smaller (half a block on average).
      Option 4) Still some metadata, but in a different file. Cons: writes are increased (but not by much; we just need to write each region once). Pros: if we lose the HLog files (major failure, no replica available) we can still continue with the regions that were not written at this stage.

      I think it should be done, even if none of the algorithms above is totally convincing yet. It's linked as well to locality and short-circuit reads: with these two points, reading the file twice becomes much less of an issue, for example. My current preference would be to open the file twice in the regionserver: once for splitting, as today, and once for a quick read looking for untouched regions. Who knows, maybe it would even be faster this way: the quick-read thread would warm up the different caches for the splitting thread.
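As a rough illustration of option 2, the per-region counter could look like the sketch below. Everything here is invented for illustration (in the real system the counters would live in ZK, written by the master and decremented by the split workers):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of option 2: the master records, per region, how many log files
// might contain entries for it; split workers decrement a region's counter
// each time a file is ruled out, and the region becomes assignable at zero.
class PendingLogCounter {
    private final Map<String, AtomicInteger> pending = new ConcurrentHashMap<>();

    // Master side: `region` may have entries in `logFiles` files.
    void init(String region, int logFiles) {
        pending.put(region, new AtomicInteger(logFiles));
    }

    // Split-worker side: one more file is known to hold no entries for
    // `region`. Returns true when the region can be assigned.
    boolean fileRuledOut(String region) {
        AtomicInteger c = pending.get(region);
        return c != null && c.decrementAndGet() == 0;
    }
}
```

The atomic decrement is what lets the assignment start as soon as the last relevant file is processed, without waiting for the whole split to finish.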

      1. HBase-6774-approach.pdf
        88 kB
        Himanshu Vashishtha

        Issue Links

          Activity

          Nicolas Liochon added a comment -

          After thinking again about this one, here is another possible solution:

          • put the memstore state in ZooKeeper
          • when we create a new memstore, we asynchronously write the state in ZK (region with empty memstore & region server name)
          • When the first put is written in the WAL, we synchronously write to ZK that this region now has a non-empty memstore.
          • then the other puts don't need any ZK writes or synchronisation
          • on memstore flush, we asynchronously update the state in ZK to empty memstore region.
          • on crash, the master checks the region memstore states. If region is assigned but its memstore is empty, we can reassign the region immediately. If there is no data in ZK, or this data says the memstore is not empty, the master does nothing.

          This is high level, I obviously need to tune it for multiple memstore case and study all error cases. But it seems doable.

          So we would have a maximum of 100K znodes (1 per region) in ZK, with one viewer (the master), and one writer (the region server).
          These objects would be written on memstore creation & flush, so not very often.
          If we don't have the znode in ZK, we split as today. We could lose the whole ZK data without any impact.
          This can be made optional (and maybe even activated per table: it could be activated only for reference tables and meta. Heavily written tables would not do it. This lowers the number of znodes to write into ZK.)
          Region servers are already connected to zookeeper, we don't add any ZK connection.

          Pros:

          • does the job: regions that were not written to will be reassigned immediately
          • adds a safety net if we can't split the logs: tables that were not written to can be made available immediately
          • optional, and configurable per table
          • should not decrease write performance; only the first put is impacted (by about 10-15ms). With a block size of 128MB or more, it's acceptable imho.
          • adds no workload (read or write) on HDFS
          • no dependency on ZK content: we continue to work if the ZK content 'disappears'.

          Cons:

          • adds workload on ZooKeeper: but it's configurable per table, so we can limit it to whatever we want. We can even imagine heuristics (wait before creating the znode; don't create it if a put occurs within 10 seconds, for example)
          • as always, any new feature adds complexity to the whole thing... It could nearly be done with coprocessors (likely not the master part, however).
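The proposal above amounts to a small state machine per region. A minimal sketch, with invented names and the actual ZK writes abstracted away:

```java
// Model of the znode-per-region proposal: the regionserver mirrors the
// memstore state into ZK, and the master reassigns immediately only when
// ZK positively reports an empty memstore; a missing znode or a non-empty
// memstore falls back to today's behavior (wait for log splitting).
class MemstoreZnode {
    enum State { EMPTY, NON_EMPTY, MISSING }

    private State state = State.MISSING; // no znode yet: split as today

    void onMemstoreCreate() { state = State.EMPTY; }     // async ZK write
    void onFirstPut()       { state = State.NON_EMPTY; } // sync ZK write
    void onFlush()          { state = State.EMPTY; }     // async ZK write

    // Master side, on regionserver crash.
    boolean canAssignImmediately() { return state == State.EMPTY; }
}
```

Note that only the `onFirstPut` transition has to be synchronous: losing an async "empty" update is safe (the master just splits as today), while losing the "non-empty" update would not be.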
          Jimmy Xiang added a comment -

          Can we just split logs and assign regions in parallel? When opening a region, we check whether the region is involved in log splitting somehow. If not, open it. Otherwise, hold there until the log splitting is done for that region.
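This gating idea could be sketched roughly as follows (names invented; log splitting publishes the set of regions with pending edits, and region opens block on a latch):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Sketch of the gating: log splitting first publishes the set of regions
// that have pending edits; opening any other region proceeds immediately,
// while touched regions wait until splitting completes.
class SplitGate {
    private final Set<String> pendingEdits = ConcurrentHashMap.newKeySet();
    private final CountDownLatch splitDone = new CountDownLatch(1);

    void markPending(String region) { pendingEdits.add(region); }
    void splitFinished() { splitDone.countDown(); }

    // Called on region open: blocks only for regions involved in splitting.
    void awaitOpenable(String region) {
        if (pendingEdits.contains(region)) {
            try {
                splitDone.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```

The "somehow" in the comment is exactly the hard part: something has to populate `pendingEdits` cheaply, which is what the rest of the thread debates.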

          stack added a comment -

          I like Jimmy's suggestion as a first step at lowering MTTR hereabouts.

          What if we wrote on the end of a WAL a list of all regions mentioned?

          On crash, we'd look at the tail of all WALs, and fully scan any WALs that were not properly closed, to get the list of regions with edits. This could be done in the master. Before a region opens, it could query the master to ask whether it needs to pick up edits from a split. (Maybe only do this if the assign is because of a regionserver crash: add a marker to the assign message.)
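This trailer idea could be modeled like the hypothetical sketch below (names invented): only WALs without a trailer, i.e. the ones still open at crash time, need a full scan.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the trailer idea: a properly closed WAL carries a trailer
// listing the regions it mentions; WALs without one must be fully scanned
// to find their regions. The union is the set of regions with edits.
class WalTrailerScan {
    static class Wal {
        final List<String> trailerRegions;  // null = no trailer, must scan
        final List<String> scannedRegions;  // what a full scan would find
        Wal(List<String> trailer, List<String> scanned) {
            this.trailerRegions = trailer;
            this.scannedRegions = scanned;
        }
    }

    static Set<String> regionsWithEdits(List<Wal> wals) {
        Set<String> out = new HashSet<>();
        for (Wal w : wals) {
            out.addAll(w.trailerRegions != null ? w.trailerRegions
                                                : w.scannedRegions);
        }
        return out;
    }
}
```

Any region absent from this set has no edits to replay and can be assigned immediately.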

          Writing stuff to zk could work, but it would be better if we could avoid having to do this.

          Nicolas Liochon added a comment -

          Yes, everything is in the "somehow" of "we check if the region is involved in log splitting somehow".

          The advantage of doing that in ZK is that we don't have to open all the WALs, with the risk of going to a dead datanode (a bad datanode often means a 60-second delay). And we don't have to fully read the last one (and if we finally implement multi WALs, we will have all of those WALs to fully read). For the others, technically, reading backward may be difficult to optimize. As well, if the WAL is corrupted, we save the regions that were not written to.

          This said, I agree that writing to ZK is not an easy decision. I think on the long term, having a widely shared real time status on the region is interesting, but we need the middleware (ZK here) to support this (lots of znodes with lots of readers). It's my famous ZOOKEEPER-1147.

          Devaraj told me an idea from Enis for the .meta. case: just do a specific WAL for it. It can be generalized with the multiwals as well.

          All these solutions are not incompatible between themselves anyway...

          stack added a comment -

          If multiwals, yeah, should dedicate one for .META.

          Agree all suggestions are not incompatible: i.e. we should do Jimmy's suggestion (You may have suggested similar a while back IIRC).

          I like the issues you raise with the solution I suggest. While we could read the last WAL while splitting, just reading metadata off the end of all WALs concerned would take, say, about a second each unless we did it in parallel... and if there are tens of WALs, that's tens of seconds before we could open a region even when all is functioning without hiccups (add hiccups and a bad DN, and it goes up significantly).

          I do like not having to have another subsystem in the mix doing log splitting. I suppose we already have an optional dependency on zk farming out the work.

          Could the regionserver send the master the regions mentioned in a WAL and let it do the accounting? Or it could send sequenceids per flush to the master and let it figure out up to which entry it can skip edits. The master could do the writing to zk instead of every regionserver doing it for every WAL roll. We are already sending over seqids on heartbeat so we can skip stale edits on crash; could we expand this functionality to a by-region dimension? (Haven't thought it through... just making a suggestion.)

          Nicolas Liochon added a comment -

          If multiwals, yeah, should dedicate one for .META.

          Is someone working on multiwals implementation, or is it still in the "currently studied" state?

          we should do Jimmy's suggestion. (You may have suggested similar a while back IIRC)

          Yes, it's option 3) in this jira's description. I was not totally satisfied, which is why I tried to find something different. I agree it's more a different balance than a better solution.

          Could the regionserver send the master the regions mentioned in a WAL and let it do accounting or it could send sequenceids by flush

          I haven't thought about the master option; it could be a solution. I need to think about it.

          Nicolas Liochon added a comment -

          For the master based solution
          If we go for the regionserver -> master -> zookeeper solution, it's not perfect imho, because we just add an agent in the middle.

          The master could store the region information, without going to ZK
          -> Faster than the solution with ZK, because we would not write to disk
          -> If we lose the master, we lose the data, but it's not an issue (just that the recovery will be slower: we will have to read all the logs)
          -> The master becomes an element of the write path (for the first write in a memstore). I'm not at ease with that.

          At the end of the day, I agree with what Stack said previously: let's not add a new component in the write path. This is valid for both the master & ZK.

          So we're left with the other options:

          • specific WAL for .meta.
          • adding meta data at the end of the WAL.

          I'm currently looking at them.

          Devaraj Das added a comment -

          I am starting to prototype the specific wal for .meta. approach (leveraging the implementation of FSHlog) to get a feel for the complexity, etc. Will keep folks posted (and probably raise a separate jira as well).

          Himanshu Vashishtha added a comment -

          Nicolas Liochon Devaraj Das: I am interested to know whether there is any progress on this issue (making regions available which do not have a WAL entry, i.e., not waiting for log splitting to finish). Faced this when working on a read intensive workload. As Nkeywal commented earlier, it is quite useful for some use-cases. There is already a separate WAL for .META., thanks to Devaraj. If you guys are OK, I would like to work on this.

          Nicolas Liochon added a comment -

          Ok for me, of course. Thanks for this. I don't have an ideal solution in mind; I guess there is some design work to do here, but maybe Devaraj is more advanced than me. I'll assign the jira to you in case you don't have the karma for this.

          Devaraj Das added a comment -

          I am fine with that, Himanshu Vashishtha. I guess we should start with a proposal and agree on one (this jira has had multiple proposals).

          Lars Hofhansl added a comment -

          This mingles (somewhat, at least) with HBASE-8375, which I just opened. One of the options proposed there is "unlogged tables" (tables that never write WAL entries). All regions of those tables could be assigned immediately.

          Himanshu Vashishtha added a comment -

          Hello,

          I have come up with a proposal. Suggestions are welcome.

          Thanks.

          ramkrishna.s.vasudevan added a comment -
          This is to cover all the regions that were updated after the last ServerLoad report and shutdown event. Once it has read the WAL files (usually only the last one) which have sequenceIds greater than max_completeSequenceId, it sends an open request for regions which has allWALEntriesFlushed set to true, as they don’t need to wait for log splitting/replaying to complete

          I am not clear in this area; I may be missing something in my understanding. Please correct me if I am wrong.
          Suppose a region has allWALEntriesFlushed set to true, but it got some additional WAL entries in the HLog just before the report and the abrupt shutdown event happened.
          Why do you say they don't need to wait for log splitting?
          Also, did you see Jeffrey's latest work on log splitting? His proposal also uses the LatestCompleteFlushSeqId.
          Thanks for the write-up.

          Himanshu Vashishtha added a comment -

          Hey Ram,

          Thanks for reading it through.

          When you say they don't need to wait for Log Splitting?

          So in the case where there are some mutations after the last server report and the shutdown event, we need to look at the last WAL. Once we have read it and updated the region:allWalEntriesFlushed mapping for WALEdits which have logSeqNum > max_completeSequenceId, we can open those regions which still have allWalEntriesFlushed set to true.

          Yes, I have looked at Jeffrey's work and will review it more this week. I don't think he proposed any change in the usage of latestCompleteFlushSeqId, though. Also, that jira will make regions available for writes, while this is about making "pristine" regions available for reads.
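The bookkeeping described above can be sketched as a pure function. This is a hypothetical illustration; `allWalEntriesFlushed` and `max_completeSequenceId` follow the terminology in the attached doc, everything else is invented:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the allWalEntriesFlushed bookkeeping: start by assuming every
// region is fully flushed, then clear the flag for any region touched by a
// WAL edit newer than max_completeSequenceId; surviving regions can be
// opened without waiting for log splitting.
class FlushedRegionTracker {
    static Set<String> openableRegions(Set<String> allRegions,
                                       List<Map.Entry<String, Long>> edits,
                                       long maxCompleteSequenceId) {
        Set<String> openable = new HashSet<>(allRegions);
        for (Map.Entry<String, Long> e : edits) { // (region, logSeqNum)
            if (e.getValue() > maxCompleteSequenceId) {
                openable.remove(e.getKey()); // has unflushed edits
            }
        }
        return openable;
    }
}
```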

          Enis Soztutar added a comment -

          If I understand this correctly, this allWALEntriesFlushed does not seem to contain reliable information. With this proposal, it seems that we have to read the last WAL files to update allWalEntriesFlushed to make up for the fact that the last heartbeat might not complete, etc. But heartbeats themselves are not reliable either. We cannot assume that by just reading the last WAL file, allWalEntriesFlushed will be correct, since we might have been missing heartbeats for some time. The only reliable way is to read the WAL backwards, until for each region we make sure that we have read up to latestCompleteFlushSeqId. Which makes allWALEntriesFlushed redundant.
          If a region has not gotten any updates for some time, its latestCompleteFlushSeqId won't be updated at all, since there will be no flushes. To reassign such a region, we have to ensure that all WALs are read.

          Himanshu Vashishtha added a comment -

          Hey Enis,

          Thanks for asking these questions.

          There is a max_completeSequenceId-per-regionserver field in the attached doc, which is updated after receiving the heartbeat from a regionserver. When the master processes the server shutdown event, it will use the max_completeSequenceId for the regionserver to determine how much of the WAL it has missed and needs to read before finalizing allWALEntriesFlushed. The goal is to process all WALEdits which have walEdit#key#logSequenceId > max_completeSequenceId. If that means reading the second-to-last WAL also, it will process that too. The invariant is to read the latest WAL files first, until we reach the point where some WALEdits in the WAL are such that WALEdit#key#logSequenceId < max_completeSequenceId. We no longer need to read older WALs then.

          If a region has not got any update for some time, its latestCompleteFlushSeqId wont be updated at all, since there will be no flushes. To reassign this region, we have to ensure that all wals are read.

          It uses max_completeSequenceId to read the remaining WAL. Once it has read all the WALEdits after max_completeSequenceId, allWALEntriesFlushed will have the correct information, and it can be used to decide whether to assign a region or not.

          The only reliable way is to read up the wal backwards,

          I am not sure whether a SequenceFile can be read backwards, or how efficient that would be. That's why I propose reading a WAL file from its head and reusing the existing WALReader code.

          As soon as any region is flushed, master will have the most updated information for all regions for that regionserver once it receives the next heartbeat.

          Consider a rogue scenario: a regionserver sends a report with max_completeSequenceId = 100. There is a write-heavy workload, the WAL is rolled, and then the server aborts; the master missed all its heartbeats before the rs aborted. Based on max_completeSequenceId, we need to read the last 2 WAL files (1 + 1): the new one, and the one at which the master got the last heartbeat (it has some entries > 100). Since we are reading the most recent ones first, it is easy to determine whether we need older WALs or not. Let's call those files f1 and f2, where f1 is the latest.
          The master reads f1 first and sees that the first waledit#key#logSequenceId > 100, so it enqueues f2 as well, as there might be some entries at f2's tail which were missed.
          Once it has read f1 and f2, and updated allWALEntriesFlushed for the regions, the master can decide which regions can be assigned right away.

          Hope this helps.
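The file-selection rule in this scenario can be sketched as follows; a hypothetical illustration (names invented) of walking the WALs newest first and stopping once a file's first sequence id falls at or below max_completeSequenceId:

```java
// Sketch of the file-selection rule: walk the WAL files newest first,
// always reading the current file, and stop going further back once a
// file's first sequence id is already <= max_completeSequenceId (all
// older files contain only flushed edits).
class WalSelection {
    // firstSeqIds: first logSeqNum of each WAL file, newest file first.
    static int filesToRead(long[] firstSeqIds, long maxCompleteSequenceId) {
        int n = 0;
        for (long first : firstSeqIds) {
            n++; // this file may hold edits newer than the last report
            if (first <= maxCompleteSequenceId) {
                break; // older files are fully covered by flushes
            }
        }
        return n;
    }
}
```

In the rogue scenario above (f1 starting past seqId 100, f2 starting below it, max = 100), this selects exactly the 2 files described.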

          Show
          Himanshu Vashishtha added a comment - Hey Enis, Thanks for asking these questions. There is a max_completeSequenceId per regionserver field in the attached doc, which is updated after receiving the heartbeat from a regionserver. When master processes the server shutdown event, it will use the max_completeSequenceId for the regionserver in order to determine how much WAL is relevant (it has missed) and need to read before finalizing allWALEntriesFlushed. The goal is to process all WALEdits which have walEdit#key#logSequenceId > max_completeSequenceId. If that means reading second last WAL also, it will process that too. The invariant is to read latest WAL files first, until we reach the point where some waledits in the wal are s.t. WALedit#key#logSequenceId < max_completeSequenceId. We no longer need to read older WALs then. If a region has not got any update for some time, its latestCompleteFlushSeqId wont be updated at all, since there will be no flushes. To reassign this region, we have to ensure that all wals are read. It uses max_completeSequenceId to read the remaining WAL. Once it has read all the WALEdits after max_completeSequenceId, allWALEntriesFlushed will have the correct information, and it can be used to assign a region or not. The only reliable way is to read up the wal backwards, I am not sure whether a sequenceFile can be read backwards, or how efficient it would be. That's why I propose to read a WAL file from its head and re-use the existing WALReader code. As soon as any region is flushed, master will have the most updated information for all regions for that regionserver once it receives the next heartbeat. Consider a rogue scenario: A regionserver sends a report and the max_completeSequenceId = 100. There is a write heavy workload and WAL is rolled and then server abort. And master missed all its heartbeats before the rs aborted. 
Based on max_completeSequenceId, we need to read last 2 WAL files (1 + 1): 1 new one, and 1 at which master got the last heartbeat (it has some entries > 100). Since we are reading most current ones first, it is easy to determine whether we need to older WALs or not. Let's call those files f1 and f2 where f1 is the latest. It reads f1 first and see that the first waledit#key#logSequenceId > 100, so it en-queues f2 also as there might be some entries at f2's tail which are missed. Once it has read f1 and f2, and updated the allWALEntriesFlushed for the regions, master can decide which regions can be assigned right away. Hope this helps.
          Enis Soztutar added a comment -

          Thanks for the explanation. It seems that this can work, but the relative gain may not be enough to justify it. The other proposal, writing the list of region names in the WAL header and reading them to determine which split tasks must complete before assignment, seems cleaner to me.

          Hide
          Himanshu Vashishtha added a comment -

          Thanks, Enis.

          Yes, the WAL approach is also an option, but I think both have their own pros and cons. I proposed the ServerLoad approach because it is self-contained, doesn't involve any changes to WAL/SequenceFile, etc., and reuses the existing ServerLoad object.
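The ServerLoad idea, as described, amounts to the master tracking the highest complete-flush sequence id reported per regionserver. A minimal sketch, assuming hypothetical names (FlushTracker, onHeartbeat) that do not exist in HBase:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the ServerLoad-based idea: the master records, per regionserver,
// the highest complete-flush sequence id seen in any heartbeat. All names
// here are assumptions for illustration, not actual HBase code.
public class FlushTracker {
    private final Map<String, Long> maxCompleteSeqIdByServer = new ConcurrentHashMap<>();

    /** Called on each regionserver heartbeat carrying its latest flush seqId. */
    public void onHeartbeat(String serverName, long completeSequenceId) {
        // Keep the maximum so a stale or reordered heartbeat cannot regress it.
        maxCompleteSeqIdByServer.merge(serverName, completeSequenceId, Math::max);
    }

    /** Used during server-shutdown handling to bound how much WAL must be read. */
    public long maxCompleteSequenceId(String serverName) {
        return maxCompleteSeqIdByServer.getOrDefault(serverName, -1L);
    }
}
```

The self-contained appeal is visible here: the WAL format is untouched, and the only cost is an extra field piggybacked on the existing heartbeat.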

          In the WAL metadata case, some metadata must be appended at the end of each WAL file. This involves adding a custom key-value while closing the WAL file, plus a check while reading every record (whether it is a meta record or not, etc.).
          Since the metadata is added at the end, the master needs to open a reader and seek to the end of the file. This metadata must be read for all the log files, sequentially starting from the oldest WAL file, in order to track a region timeline. This is in addition to reading the last WAL file.
          For an application with high write rates, a regionserver may have a large number of WALs to replay.
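The per-region timeline built from trailer metadata could look roughly like the following. This is a sketch of the proposal only, under the assumption that each closed WAL's trailer maps region name to the highest logSequenceId written to it; none of these types exist in HBase.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the trailer-metadata idea: each closed WAL
// ends with a meta record mapping region -> highest logSequenceId written
// to it. Merging trailers oldest-to-newest yields a per-region timeline.
public class WalTrailerMerge {
    /** Merge per-file trailers (oldest first); a newer trailer's entry for a
     *  region supersedes any older one, since its seqIds are higher. */
    public static Map<String, Long> regionTimeline(List<Map<String, Long>> trailersOldestFirst) {
        Map<String, Long> timeline = new HashMap<>();
        for (Map<String, Long> trailer : trailersOldestFirst) {
            timeline.putAll(trailer);
        }
        return timeline;
    }
}
```

A region absent from every trailer and from the last (trailer-less) WAL has no pending edits and could be assigned immediately, which is the point of the comparison with the ServerLoad approach.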

          Another point: IMHO this feature should be configurable, as some workloads may not require it (writes distributed over the whole key space, etc.). With the WAL approach it becomes a bit tricky to make the feature optional, since it inserts metadata into the WAL. Once a WAL file can contain meta entries, every LogReader must be aware of them, be it the ReplicationLogReaders or the LogSplitter, since they might be reading old logs, etc.

          "It seems that this can work, but the relative gain may not be that much to justify it."

          This is just an alternative to the WAL approach, and I think it is less intrusive. But I am open to both and would like to hear more of your opinions on the points above.


            People

            • Assignee: Himanshu Vashishtha
            • Reporter: Nicolas Liochon
            • Votes: 0
            • Watchers: 14