Hadoop HDFS
HDFS-311

Modifications to enable multiple types of logging

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Attachments

    1. HDFS-311-design-document.pdf
      135 kB
      Luca Telloli
    2. HDFS-311-complete.patch
      98 kB
      Luca Telloli
    3. HDFS-311.patch
      55 kB
      Luca Telloli
    4. HDFS-311.patch
      82 kB
      Luca Telloli
    5. HADOOP-5188.patch
      55 kB
      Luca Telloli
    6. HADOOP-5188.patch
      54 kB
      Luca Telloli
    7. HADOOP-5188.patch
      46 kB
      Luca Telloli
    8. HADOOP-5188.patch
      57 kB
      Luca Telloli
    9. HADOOP-5188.patch
      52 kB
      Luca Telloli
    10. HADOOP-5188.pdf
      53 kB
      Luca Telloli
    11. HADOOP-5188.patch
      70 kB
      Luca Telloli

      Issue Links

        Activity

        Ivan Kelly added a comment -

        This JIRA has been superseded by HDFS-1580 and its subtasks.

        Diego Marron added a comment -

        Hello guys, my name is Diego and I'm interested in finishing this modification.
        I was thinking of following the design document uploaded by Luca in August 2009 (HDFS-311-design-document.pdf).

        As far as I can see, we need a new class (Device) and to modify FSImage to work with an ArrayList of devices instead of StorageDirectory, then make all the changes necessary to make everything work.

        The doc also talks about 2 Device subclasses: JournalDevice (device for journaling) and ImageDevice (device for namespace storage: fsimage) from which we will get the appropriate I/O streams.

        So if we want to implement a new journaling device, it should extend JournalDevice and take care of initializing its I/O streams.

        Do you agree with what this doc proposes, or do you think there is a better way to accomplish it?

        Thank you guys!!
        Diego.

        Luca Telloli added a comment -

        Attaching a design document for this patch as a result of previous conversations with the HDFS team.
        The proposed design builds upon the LogDevice design proposed last May, taking care of the other sensitive issues related to the file system (checkpointing and timestamps above all).

        Everyone feel free to proofread it, pinpoint errors or suggest design enhancements.

        Any future development on this issue should be based on this document or its derivatives.

        Luca Telloli added a comment -

        Sorry, earlier I meant to say that I'm attaching the bookkeeper + zookeeper jars to HDFS-234 for download (although it's probably better to download the distribution from http://hadoop.apache.org/zookeeper/ ).

        Luca Telloli added a comment -

        PATCH SPLIT!

        The patch named HDFS-311.patch contains only modifications related to this JIRA and it's mainly for review/test.

        The patch named HDFS-311-complete.patch contains all modifications needed for abstracting logging (HDFS-311) and using BookKeeper logging (HDFS-234), plus HDFS-397. This second patch is meant to be applied against trunk by people interested in using BookKeeper for logging operations.
        Again: this patch contains HDFS-397, so you don't need to apply it manually. You'll need the BookKeeper and ZooKeeper jars, which I'm currently attaching to this JIRA.

        Luca Telloli added a comment -

        Usage:

        • This patch requires HDFS-397 to be applied first
        • To use bookkeeper, you should specify the same values in dfs.name.edits.dir and fs.checkpoint.name.edits
        • The value should be specified as a URI, for instance bookkeeper://127.0.0.1:2181?nbookies=3&quorumsize=2&mode=verifiable ; at the moment, the three parameters are the number of bookies (nbookies), the size of the quorum (quorumsize) and the quorum mode (mode); a hypothetical parsing sketch follows this list. More information on BookKeeper is available on the ZooKeeper website.
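
        As an illustration only (not from the attached patch), a minimal sketch of parsing such a journal URI with java.net.URI might look like this; the class and method names are hypothetical:

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not part of the patch) showing how the three query
// parameters of a bookkeeper:// journal URI could be extracted.
public class BookKeeperUriExample {
    static Map<String, String> parseQuery(URI uri) {
        Map<String, String> params = new HashMap<>();
        String query = uri.getQuery(); // e.g. "nbookies=3&quorumsize=2&mode=verifiable"
        if (query != null) {
            for (String pair : query.split("&")) {
                String[] kv = pair.split("=", 2);
                params.put(kv[0], kv.length > 1 ? kv[1] : "");
            }
        }
        return params;
    }

    public static void main(String[] args) throws Exception {
        URI u = new URI("bookkeeper://127.0.0.1:2181?nbookies=3&quorumsize=2&mode=verifiable");
        Map<String, String> p = parseQuery(u);
        System.out.println(u.getHost() + ":" + u.getPort()); // ZooKeeper address: 127.0.0.1:2181
        System.out.println(p.get("nbookies"));               // number of bookies: 3
    }
}
```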

        Drawbacks:

        • The patch does not work with BackupNode yet; it works only with SecondaryNameNode
        • The patch passes all unit tests except TestDFSUpgrade and TestDistributedUpgrade. Since these two tests rely on an upgrade scenario, I decided to post the patch anyway
        • The patch includes the old HADOOP-5189 (now HDFS-234)

        Short list of modifications:
        EditLogOutputStream:

        • two abstract methods have been added
        • abstract void setLatestCheckpointTime(long time) throws IOException;
        • abstract void format(FSImage fsimage) throws IOException;

        EditLogInputStream:

        • one abstract method has been added: abstract long getLatestCheckpointTime() throws IOException;

        Journals: a map <URI, JournalType>, used to retrieve all Journals of a specific type. It's similar to StorageDirectory in that it provides a custom iterator

        FSImage:

        • readCheckpointTime and writeCheckpointTime have been made static. This should be thread-safe and allows EditLogOutputStream to access these methods
        • processIOError(ArrayList<StorageDirectory> sds, boolean propagate) has been changed to processIOError(List<URI> faultyJournals, boolean propagate)

        FSEditLog:
        processIOError(ArrayList<StorageDirectory> sds, boolean propagate) has been changed to processIOError(List<URI> faultyJournals, boolean propagate)
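
        To make the shape of these changes concrete, the following is a minimal sketch rather than the committed code; FSImage and JournalType are placeholders, and the filtered iterator is illustrative:

```java
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Minimal sketch of the additions listed above; FSImage and JournalType
// are placeholders, not the real HDFS classes.
class FSImage { /* placeholder */ }
enum JournalType { FILE, BACKUP, BOOKKEEPER }

abstract class EditLogOutputStream {
    // The two abstract methods added by the patch:
    abstract void setLatestCheckpointTime(long time) throws IOException;
    abstract void format(FSImage fsimage) throws IOException;
}

abstract class EditLogInputStream {
    // The one abstract method added by the patch:
    abstract long getLatestCheckpointTime() throws IOException;
}

// "Journals" as described: a URI -> JournalType map offering a custom,
// type-filtered iterator over the registered journals.
class Journals extends HashMap<URI, JournalType> {
    Iterator<URI> iterator(JournalType wanted) {
        return entrySet().stream()
                .filter(e -> e.getValue() == wanted)
                .map(Map.Entry::getKey)
                .iterator();
    }
}
```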

        Luca Telloli added a comment -

        To reopen this issue, I'm currently trying to solve the following problem.

        For each logging stream I need to set a checkpoint time. This can be done by including a setCheckpointTime() method in the EditLogOutputStream abstract class, but it has a series of implications, which I'll list:

        1. at format time we'll need to instantiate a stream of each type to call the method (or we can make the method static)

        2. for file streams, the method will have to call the FSImage.format(sd) method: this means that the method should be static or the code should be copied as-is into EditLogFileOutputStream

        3. making FSImage.format static implies that saveFSImage should be static as well, which in turn implies that the variable namespaceID should be static too

        I'm wondering whether this is an acceptable solution; I'd like to know if anyone has a better one. In general I think that some methods of FSImage can be made static without affecting thread safety, but I'd like confirmation on this.

        Luca Telloli added a comment - - edited

        Here are a few comments to reply to the latest questions:

        @Flavio:

        • Append: in a few places throughout the current code, file edit streams are closed and later reopened. Your proposal of using multiple ledgers is good, but at the same time it sounds more hacky than clean. You would have to save somewhere a persistent list of opened ledgers and their roles (were they equivalent to edits? to edits.new?) and allow this role to change (when edits.new is "renamed" as edits).
        • LogDevice not fully used: currently it's used only for output streams. Input streams do not currently use it.
        • BackupNode and Bookkeeper: my feeling is that the expectation is to spawn a new process when using Bookkeeper as logger

        @Ben:
        I don't think we're down the road of mixed API. As far as I remember there's only one duplicated method, and the duplication should disappear when the integration of LogDevice is completed.
        On the other hand, Konstantin keeps saying that to implement HADOOP-5189 we only need to implement the Input/Output stream classes, but I'm still not fully convinced.

        FINALLY, since there's no general agreement on LogDevice, for the moment I'm unlinking this patch and HADOOP-5189, to work independently on them.

        Benjamin Reed added a comment -

        just to follow up flavio's comment a bit: is it necessary to introduce the LogDevice? i believe it can be done without the LogDevice; it just will not be as clean. right? i would rather have it be done with a more crude approach than have it done half way. i agree with flavio that it is confusing to have a mix of old and new api lying around. that stuff rarely goes away. this issue is really just about enabling multiple types of logging, not about making the namenode code more modular. i think the latter needs to be done, but it is going to impact a lot of code to do it right, and i don't think it will happen before the next release.

        Flavio Junqueira added a comment - - edited

        Luca, I was wondering if you could clarify a few points of your last post:

        1. I'm not sure I follow the argument about append. More specifically, I'm not sure if your argument is about implementation complexity or just abstraction. To implement append with BookKeeper, can't we simply open another ledger upon a call to append? With respect to abstraction, it sounds right that it would be better not to have to implement "append" for BookKeeper logging, because no such operation is supported directly by the service.
        2. On your argument about not fully using LogDevice, perhaps you want to elaborate on it a little more. Why is it ok to leave it partially implemented? I'm especially concerned about someone looking at this code later and having a hard time understanding it because there is a mix of mechanisms implemented. If the remainder of the abstraction can be implemented right away, then it might be ok to postpone in the interest of progress.
        3. I didn't follow the correlation between the BackupNode and logging with BookKeeper. Are you saying that if we don't have LogDevice, then we would have to spawn a new process to perform BookKeeper logging as with the BackupNode? Independent of the answer to this question, please clarify.
        Luca Telloli added a comment -

        Konstantin, to follow up on Flavio's comment, let me use BookKeeper as an example.

        To use BookKeeper as a log device I need to open a client. If I open it in EditLogOutputStream, that client would be destroyed upon a close() call and recreated each time a new EditLogOutputStream is created: I don't think this is the right way to proceed. LogDevice would create an instance of such a client and keep it persistent throughout the life of a server process (namenode, etc.). This motivates the insufficiency of the current edit streams.

        So far HDFS has been working only with files for logging. BackupNode is a special type of process that, although it registers a stream, in the end uses files again. I don't think that every other log device should spawn a new process similar to BackupNode for logging, although I assume that in production users will choose only one device.

        Finally, it would be hard to integrate BookKeeper with the current streams because BookKeeper ledgers do not support append. Files do, BookKeeper doesn't (instead of appending, we create another "ledger", the BookKeeper log file). It's part of the diversity of the two "devices", and it's another argument in favor of the need for an abstraction.
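
        For concreteness, here is a hedged sketch of the "roll to a new ledger instead of appending" idea using the present-day BookKeeper client API (the API available in 2009 differed); the ensemble size, quorum size, ZooKeeper address and password are illustrative:

```java
import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.LedgerHandle;

// Hedged illustration of rolling to a fresh ledger instead of appending;
// parameters are placeholders, not values from the patch.
public class LedgerRollExample {
    public static void main(String[] args) throws Exception {
        BookKeeper bk = new BookKeeper("127.0.0.1:2181");
        // Ledger playing the role of "edits": once closed it is immutable.
        LedgerHandle edits = bk.createLedger(3, 2,
                BookKeeper.DigestType.CRC32, "pwd".getBytes());
        edits.addEntry("OP_ADD ...".getBytes());
        edits.close(); // seals the ledger; no further appends possible
        // "Appending" means opening a fresh ledger (the "edits.new" role)
        // and durably recording its id so the roles can later be swapped.
        LedgerHandle editsNew = bk.createLedger(3, 2,
                BookKeeper.DigestType.CRC32, "pwd".getBytes());
        long idToPersist = editsNew.getId(); // must be stored somewhere durable
        editsNew.close();
        bk.close();
    }
}
```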

        LogDevice has so far passed all unit tests and guarantees compatibility with the current implementation, so it seems a safe abstraction. I believe it's also a good one, but I'd be delighted to hear a concrete proposal to improve or replace it.

        Other points:
        1. HADOOP-5188 does introduce LogDevice and I agree it's not fully used yet throughout the code. But if we want to keep it, we'll make the modifications eventually.

        2. Method EditLogFileOutputStream.switchEditStreams() should eventually contain part of the rollEditLog() code. In the case of the BackupNode you probably know better what should be there, and if you want to contribute to it, you're very welcome.

        3. The use of instanceof has been reduced and will disappear in my next version, since I introduced a getType() method. In the end, reflection is used only to instantiate the correct log device implementation.

        Flavio Junqueira added a comment -

        I just realized Luca and Konstantin had a similar discussion on May 21-22 about the LogDevice abstraction. Luca gave an argument for the necessity of the LogDevice, which is similar to the one I raised above. Deciding whether we'll have the LogDevice abstraction or not is critical for us to work on HADOOP-5189, and we would like to reach agreement soon so that we can make sure to have it ready for 0.21. Unfortunately, it sounds like we are deadlocked. How should we proceed?

        Of course it would be great to hear more comments and suggestions.

        Flavio Junqueira added a comment -

        Konstantin, could you be a little more specific on some of your comments? In particular, the following is not clear to me:

        1. "But my main concern is that LogDevice (or LoggingDevice) is not the right abstraction.
          And it seems to me that all necessary abstraction are already there you just need to use them."

        You say that all the necessary abstraction is there. But where is "there"? What abstraction exactly do you believe the namenode currently has that enables one to keep the configuration and properties of a device during the execution of the namenode? From what Luca is pointing out, it seems that the abstractions you talk about are still implemented in a way that is very oriented to files, and LogDevice enables one to abstract away how the log devices behave and are implemented. Please shed some light on this one.

        1. "As I said before Edits Streams should be responsible for keeping information about logging methods. And this is completely orthogonal to the number of BookKeeper clients you need to instantiate."

        I'm not sure what you're trying to say here. What does it mean to keep information about logging methods? I believe they keep the methods used to access logs. Now, I agree that the number of bookies one needs to use is a separate issue, concretely in the case of BookKeeper. However, we need to initialize ledgers, and for that we need to remember configuration parameters for BookKeeper, and we need a different initialization procedure. Where should we keep config parameters for BookKeeper, and where should we instantiate and keep a BookKeeper object?

        Also, I agree with your comment on the use of "instanceof".

        Konstantin Shvachko added a comment -

        I looked at the code. It still has a lot of unnecessary changes (like those in Storage) and unused methods, like EditLogBackupOutputStream.switchEditStreams() or getLastInputStreamInstance(). Did you rename LoggingDevice to LogDevice (then the comments should be renamed too)?
        But my main concern is that LogDevice (or LoggingDevice) is not the right abstraction.
        And it seems to me that all the necessary abstractions are already there; you just need to use them.
        As I said before, Edits Streams should be responsible for keeping information about logging methods. And this is completely orthogonal to the number of BookKeeper clients you need to instantiate.
        You still use instanceof to distinguish between LogDevices. This in particular indicates there is something wrong with your abstraction. instanceof can usually be replaced by virtual methods or enums.
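
        A tiny sketch of the pattern being suggested here, with illustrative names, might look like:

```java
// Sketch of replacing instanceof checks with an enum returned by a
// virtual method; all names are illustrative, not from the patch.
enum JournalType { FILE, BACKUP, BOOKKEEPER }

abstract class LogDevice {
    // Each subclass reports its own type; callers switch on getType()
    // (or, better, call a virtual method) instead of writing
    // `if (dev instanceof FileLogDevice) ...`.
    abstract JournalType getType();
}

class FileLogDevice extends LogDevice {
    @Override
    JournalType getType() { return JournalType.FILE; }
}
```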

        > When I say "another patch" I'm talking about a collection of patches

        This sounds really scary.

        Luca Telloli added a comment -

        Ben, thanks for your quick fix list, it has been very useful. I'm posting another patch that addresses your concerns. Let me comment on some of the issues you outlined:

        • in src/hdfs/org/apache/hadoop/hdfs/server/namenode/EditLogBackupOutputStream.java and src/hdfs/org/apache/hadoop/hdfs/server/namenode/EditLogFileOutputStream.java get rid of the // TODO comment and explain why it is empty

        Method switchEditStreams() for Backup streams and file stream is currently implemented somewhere else in the code. That code should eventually end up here, but that should probably be part of a different patch, although strictly related to this one.

        • there are two processIOError in src/hdfs/org/apache/hadoop/hdfs/server/namenode/FSImage.java. shouldn't there only be one?

        Part of the process of switching to LogDevice might involve duplicating code until all the dependencies on StorageDirectory are cleared out. This should eventually disappear in the future and should be the subject of another patch.

        • what is the FIXME getNewInputStreamInstance?

        The current patch only uses LogDevice for output streams; when it comes to input the code is basically the same as it was before. This again might be part of another patch.

        When I say "another patch" I'm talking about a collection of patches that would depend on this one, once committed.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12409049/HADOOP-5188.patch
        against trunk revision 779383.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 15 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/417/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/417/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/417/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/417/console

        This message is automatically generated.

        Benjamin Reed added a comment -

        -1 this patch is not ready for submission. here are some things to fix:

        • you left a log.info message in src/hdfs/org/apache/hadoop/hdfs/server/namenode/BackupStorage.java
        • in src/hdfs/org/apache/hadoop/hdfs/server/namenode/EditLogBackupOutputStream.java and src/hdfs/org/apache/hadoop/hdfs/server/namenode/EditLogFileOutputStream.java get rid of the // TODO comment and explain why it is empty
        • you should remove the commented line in src/hdfs/org/apache/hadoop/hdfs/server/namenode/FSDirectory.java
        • there are two processIOError in src/hdfs/org/apache/hadoop/hdfs/server/namenode/FSImage.java. shouldn't there only be one?
        • you are commenting out lines instead of removing them. why?
        • in getCheckpointEditsDirs you can't just be doing stack traces. log the exception and handle it.
        • don't do e.printStackTrace(), pass it as a parameter to your log message
        • what is the FIXME getNewInputStreamInstance?
        Luca Telloli added a comment -

        Attaching new version of the patch. It passes core-tests and hdfs-tests, release audit and findbugs.

        Fixed a bug related to the use of absolute versus relative paths when comparing directories with URIs.
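
        For illustration only (the paths and class are hypothetical), the class of bug being described can look like this:

```java
import java.io.File;
import java.net.URI;

// Illustration (hypothetical paths) of the absolute- vs relative-path
// pitfall: the same directory spelled two ways compares unequal as a URI
// unless both are resolved to an absolute, normalized form first.
public class UriCompareExample {
    public static void main(String[] args) {
        URI relative = new File("name/edits").toURI();   // absolutized against the working directory
        URI absolute = new File("/tmp/name/edits").toURI();
        System.out.println(relative);                    // file:/<current dir>/name/edits
        System.out.println(relative.equals(absolute));   // false unless the current dir is /tmp
    }
}
```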

        Reduced the use of reflection throughout the patch by using a type field inside each LoggingDevice.

        Luca Telloli added a comment -

        1 - ok

        2 - correct

        Flavio Junqueira added a comment -

        I'd like to make a comment and ask a question:

        1- There are two files missing license headers. Before submitting a patch, please make sure that all new files have a license header. Also, according to Hudson, this patch introduces new potential findbugs warnings. Please verify whether they can be fixed;

        2- Using BookKeeper as an example, I was wondering where we will store the configuration parameters necessary to bootstrap and run it as a log device. For example, we need to specify the number of bookies and the ledger mode. If I understand it correctly, the idea is to pass these parameters in a URI, and then parse the URI. Once the namenode parses the URI, it will have the values for the corresponding variables and will keep them somewhere. Is the idea to keep such configuration parameters in the corresponding log device implementation for BookKeeper?

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12408600/HADOOP-5188.patch
        against trunk revision 777761.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 15 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        -1 release audit. The applied patch generated 494 release audit warnings (more than the trunk's current 491 warnings).

        -1 core tests. The patch failed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/383/testReport/
        Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/383/artifact/trunk/current/releaseAuditDiffWarnings.txt
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/383/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/383/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/383/console

        This message is automatically generated.

        Luca Telloli added a comment - - edited

        Konstantin, thanks for your comments! Let me address each of your points separately:

        1. I opened HADOOP-5832 by mistake using the subtask option, without knowing that it would generate a new JIRA. My bad.

        Personally I think that there's no clear cut between the two JIRAs: with the URI processing I'm introducing LoggingDevice, an abstraction for logging. Without it, HADOOP-5832 would be trivial: you'd read a string, treat it as a URI, retrieve the path fragment and create a StorageDirectory. So, unless you think it's necessary to have separate patches, I'd rather link them.

        2. I think edit streams are opened and closed many times during the life of a server process (namenode/secondary namenode/backup node), while the LoggingDevices are ideally created only once. While this might not make a difference for files, which can be opened/closed many times, it might better fit other types of logging that want to store persistent information inside their "LoggingDevice" implementation. For instance, in the case of BookKeeper, I want to create a BookKeeper client only once through the life of a NameNode and use it for input/output streams when needed. Does this sound reasonable?

        3. From what I see in FSImage.loadFSEdits(StorageDirectory sd) EditLogFileInputStream needs to access a StorageDirectory object. Assuming that we keep the LoggingDevice abstraction and both edit streams (input and output) need to access it, it sounds better to leave it inside FileLoggingDevice and retrieve it with the getNewInputStreamInstance() method.

        4. In the current patch you can specify values in dfs.name.dir as URI, as you can see in the FSNamesystem.getStorageDirs(...) method, where arguments of type URI are processed as well as path strings, both adding the correct value to the dirNames ArrayList. In general, values of this property will be mapped directly to storage directories, since StorageDirectory is the unit of storage for file system images. The mapping is different for the other property: dfs.name.edits.dir, where the mapping is not one-to-one with StorageDirectories, since other types of device can be used.

        Konstantin Shvachko added a comment -
        1. Isn't that supposed to belong to HADOOP-5832?
        2. I don't understand why you need the LoggingDevice. It is just a wrapper around the configuration variable, which is now a URI.
          So you can just create edits streams based on the URI itself rather than on different types of LoggingDevice-s.
          You currently instantiate LoggingDevice-s based on the URIs, so you may as well instantiate the streams directly based on them.
        3. I would suggest moving StorageDirectory directly inside EditLogFileOutputStream rather than LoggingDevice. This might be a separate issue.
        4. Do you also plan to make "dfs.name.dir" a URI? It looks like you cannot avoid that, since if "dfs.name.edits.dir" is not specified then the "dfs.name.dir" directories automatically become edits directories as well as name dirs.
        Luca Telloli added a comment -

        Feedback from Hudson

        Luca Telloli added a comment - - edited

        This patch adds the following features:

        • support for URIs in the property dfs.name.edits.dir (HADOOP-5832)
        • abstraction of logging systems

        The abstraction might still be incomplete, but I think it's in good shape.

        To achieve the first point (URIs in dfs.name.edits.dir), the patch introduces a new abstraction called LoggingDevice.
        The trunk implementation currently assumes that every attribute inside the property dfs.name.edits.dir is a StorageDirectory, which is false if other types of logging are specified. LoggingDevice should help with that; additionally, a LoggingDevice would incorporate all the parameters needed for its instantiation by reading them from the initialization URI.

        Ideally, a new logging device should:

        • extend class LoggingDevice
        • implement interfaces EditLogInput/OutputStream

        Additionally, in the current implementation there is one data structure that keeps the set of edit streams (FSEditLog.editStreams) and one structure that keeps the list of StorageDirectories. The two structures are kept in sync: whenever an edit stream fails, the corresponding storage directory is removed, and whenever a storage directory fails, the corresponding edit stream is removed.

        My hope is that, by using logging devices, we could keep only one data structure.
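
        As a rough skeleton only, assuming the factory-method names mentioned elsewhere in this thread (getNewInputStreamInstance/getNewOutputStreamInstance), a pluggable logging device might look like:

```java
import java.io.IOException;
import java.net.URI;

// Rough skeleton of the proposed abstraction; the stream classes are
// placeholders for the real HDFS ones, and the factory-method names come
// from other comments in this thread.
abstract class EditLogInputStream { /* placeholder */ }
abstract class EditLogOutputStream { /* placeholder */ }

abstract class LoggingDevice {
    protected final URI uri; // carries all device parameters, e.g. bookkeeper://host:port?...

    protected LoggingDevice(URI uri) { this.uri = uri; }

    // Each device hands out its own stream implementations; long-lived
    // state (such as a BookKeeper client) can live in the device itself.
    abstract EditLogInputStream getNewInputStreamInstance() throws IOException;
    abstract EditLogOutputStream getNewOutputStreamInstance() throws IOException;
}
```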

        Finally:

        • the patch passes all NameNode unit tests
        • the patch applies against current trunk (as of today)
        • I'm starting the submit patch process to get more feedback from users
        Luca Telloli added a comment -

        In this newer version I'm trying to modify the open() method to make use of the class LoggingDevice. Basically the idea would be the following:

        CustomLoggingDevice would implement LoggingDevice, and contain all the information needed to configure the custom device when the getNewInputStreamInstance() and getNewOutputStreamInstance() methods are called. The two methods should return an instance of the implementation of the EditLogInput/OutputStream class for the specific type of logging.

        open() should then use these methods to instantiate a new EditLogOutputStream for each logging device.

        Unfortunately, as it is, this patch doesn't pass all unit tests, and I guess the problem is related to the fact that storage directories are used in three modes: IMAGE, IMAGE_AND_EDITS, and EDITS. I'd get rid of the mixed mode if possible, but I don't know exactly how to proceed, so any help would be appreciated.

        This patch contains HADOOP-5721.

        Luca Telloli added a comment -

        the attached patch supports:

        • parsing dfs.name.edits.dir arguments as URIs
        • introducing a new abstraction named LoggingDevice. The subclass named FileLoggingDevice should incorporate StorageDirectory (work in progress). The idea is that FileLoggingDevice should be treated as the current StorageDirectory (when the directory is of type EDITS or IMAGE_AND_EDITS) while other types of logging should be treated according to their respective properties.
          Two methods (namely getInputDeviceClassName and getOutputDeviceClassName) should return the class names of the EditLogInput/OutputStreams associated with this type of device.

        the patch successfully passes the NameNode unit tests

        Flavio Junqueira added a comment -

        It is perhaps not clear what problem point number 3 of the concrete tasks is trying to solve. If I understand it correctly, there is an opcode (OP_JSPOOL_START) that the backup node uses to mark an edits stream and determine the last synchronization point. I believe the idea is to leverage this mechanism to close one ledger and create another ledger when using BookKeeper. In general, this mechanism flags to any log device where the synchronization points are, and lets each device handle them accordingly. The important property here is that all devices will observe the same synchronization points in the same positions.

        One problem with the implementation of this marking mechanism is that it is a little obscure in the code, so you need to know that it is there. What I believe the design document is proposing is to add an interface method that forces a developer implementing the adapter for some new log device to get an error at compile time if they don't implement such a method. For the streams that do not support such markers, the method call would be a simple no-op. Also, an implementation of the method isOperationSupported() might be unnecessary if we do have such an interface method.

        Comments or clarifications?
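
        A minimal sketch of the interface method being proposed, assuming illustrative names (markSyncPoint() and the ledger helpers are mine, not from any patch):

            import java.io.IOException;

            abstract class EditLogOutputStream {
                // Mark a synchronization point (cf. OP_JSPOOL_START); a no-op by
                // default, so streams without marker support need no extra code.
                void markSyncPoint() throws IOException {
                    // no-op
                }
            }

            class BookKeeperEditLogOutputStream extends EditLogOutputStream {
                @Override
                void markSyncPoint() throws IOException {
                    closeCurrentLedger(); // seal the current ledger at the marker
                    openNextLedger();     // continue logging in a fresh one
                }

                private void closeCurrentLedger() throws IOException { /* ... */ }
                private void openNextLedger() throws IOException { /* ... */ }
            }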

        Luca Telloli added a comment -

        Attaching a draft of a design document with some concrete proposals for advancing on this issue.

        Konstantin Shvachko added a comment -

        I think you are trying to substitute the EditLogOutputStream abstraction with an EditLog abstraction. I'll try to explain:

        • The FSImage object deals (or is supposed to deal) with everything related to the file system persistent image.
        • FSEditLog should be dealing with everything related to journaling.
        • There are different ways of journaling and this should be reflected by an abstract EditLogOutputStream class.

        Your approach will lead to FSImage having an array of EditLog(s). And you will have to introduce an FSImage.logSync() method so that it would loop over all EditLogs and call their respective EditLog.logSync() methods. But this is exactly what the current EditLog.logSync() method does. It loops through the EditLogOutputStreams and calls flushAndSync() on them. The same goes for the other operations: logEdit(), processIOError().
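
        In code, that common logic is roughly the following; a simplified sketch of the loop, not the exact FSEditLog source:

            // One generic loop over every registered output stream, whatever
            // its concrete type; the journaling specifics live in each stream.
            for (EditLogOutputStream eStream : editStreams) {
                eStream.flush(); // flush() in turn calls the stream's flushAndSync()
            }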

        So the idea is that EditLog should combine common logic for all journaling streams (logging types). The specifics of journaling should be contained within implementations of EditLogOutputStream.

        I agree with Ben - FSEditLog was originally written for file-based journals and still contains code specific to this type. And it may be optimized.
        I can see that waiting for the whole batch of edits to complete makes the BookKeeper stream less efficient.
        But that does not mean that FSEditLog should be overloaded; it just means that one method, logSync(), should be generalized to allow an efficient implementation of BK streams as well as other (file and backup) streams.

        So the proposal: let's put an if statement in logSync() for now, which checks whether all streams are BookKeeper streams and, if so, does not go into the synchronized sections in logSync() (avoids waiting), or alternatively calls a BK-specific method.
        I say this because if the name-node uses BK logging together with other logging types, then logging will go at the speed of the slowest journal. So the only case when BK can benefit from an optimized logSync() is when there are no stream types other than BK.
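
        A minimal sketch of that interim check, assuming a BookKeeperEditLogOutputStream class and two hypothetical helper methods for the two paths:

            public void logSync() throws IOException {
                boolean allBookKeeper = true;
                for (EditLogOutputStream s : editStreams) {
                    if (!(s instanceof BookKeeperEditLogOutputStream)) {
                        allBookKeeper = false;
                        break;
                    }
                }
                if (allBookKeeper) {
                    logSyncBookKeeper(); // hypothetical: no shared-buffer waiting
                } else {
                    logSyncDefault();    // existing batched, synchronized path
                }
            }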
        Hope that makes sense.

        Benjamin Reed added a comment -

        What is needed to get this change committed? We really need an EditLog interface that is more general. The current FSEditLog is much too specific to one implementation of file-based logging. Plus, if we want to write a more efficient file-based EditLog, it is almost impossible with the current implementation.

        Flavio Junqueira added a comment -

        Konstantin says that if we don't synchronize on logSync, then edits can be written multiple times and the edit log would be corrupted. Although this statement sounds correct for the current implementation of FSEditLog, it is not mandatory for every possible implementation of an EditLog, and that's why Luca is proposing to change it. One example is the implementation in the patch. With this implementation, there is no shared buffer, and each thread has its own buffer and writes that buffer to a BookKeeper ledger, which obviates the need for synchronization on logSync.

        I understand that there might be a concern about the semantics of the file system, but the way I see it, there is no such problem. Having a buffer for each thread and writing it to BookKeeper does not violate the file system semantics because BookKeeper respects the order of calls to add edits to a ledger, and calls back to the Namenode in the same order. That is, the namenode will receive callbacks in the same order they have been submitted, so if the namenode processes return values using the same order of callbacks, then the execution order will be respected. Note that there is no change with respect to the order of callback processing, and if the Namenode processes them correctly today, then I don't see why it wouldn't with the proposed solution.

        Back to the logSync issue, in my interpretation, Konstantin is trying to enforce the semantics defined for the file system by not allowing logSync to be overridden. This is a valid argument, but the lack of flexibility precludes solutions that do not require logSync implemented in the way it is currently, so I'd rather have a more flexible implementation.

        Luca Telloli added a comment -

        Konstantin> No, I don't think the logSync() method should be modified, at least not the way you did in your patch HADOOP-5189. Synchronization (locking) in the logSync() method prevents different handler threads from writing the same information (edits) multiple times into the same edits file. Suppose you have 1 file stream (one edits file) and 2 threads call logSync() at the same time. If you remove locking as you did in your patch then both threads may write the same edits transactions into the same file twice. This will simply corrupt the edits file.

        Konstantin> logSync() may be modified to write in parallel into different streams. If you have 2 file streams then it makes sense to write into 2 files in parallel. But the information should be written into each file only once and in the right order. And your patch breaks that, afaiu.

        I agree with you that the logSync() in the patch would not work with the file-based logging, and that's why I put in my comments that this patch breaks it, but the purpose of the patch is to work with BookKeeper!

        In the prototype included in the patch we do indeed remove synchronisation in logSync(), but at the same time each thread has its own local buffer (a ThreadLocal variable) where it writes log modifications, and this guarantees that requests are processed in the order they are received from the client, thus they do not violate the semantics of the file system. As a proof, I ran tests where I check the number of bytes written with different logging systems (including BookKeeper), and that number is the same, which reassures me on this issue.
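
        A sketch of that per-thread buffering, assuming BookKeeper's LedgerHandle.asyncAddEntry() API; the surrounding field and method names are illustrative:

            import java.io.ByteArrayOutputStream;

            private final ThreadLocal<ByteArrayOutputStream> threadBuf =
                new ThreadLocal<ByteArrayOutputStream>() {
                    @Override
                    protected ByteArrayOutputStream initialValue() {
                        return new ByteArrayOutputStream();
                    }
                };

            void logSync() {
                ByteArrayOutputStream buf = threadBuf.get();
                // BookKeeper preserves the order of asyncAddEntry() calls and
                // invokes the callbacks in that same order, so the edits keep
                // the order in which the handler threads produced them.
                ledger.asyncAddEntry(buf.toByteArray(), addCallback, null);
                buf.reset();
            }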

        Konstantin> My main objection to your approach, though, is abstracting the whole EditLog class rather than just the EditLog*Streams. This leads to replication of a lot of code. Say, both classes EditLog and BKEditLogThreadBuf have the loadFSEdits() methods, which are literally identical, right? We should avoid that at all costs.

        For BookKeeper integration I really need to be able to override logSync(), so I have to keep going with the current abstract class unless a different way of overriding that method is provided.

        In general, I think the abstract class has exactly the opposite effect of what you state, allowing the common parts to be placed in the root class, but I agree that I might not have provided you with the cleanest code in this sense; I'll do some cleanup and post a new version. In the specific case of loadFSEdits, if the two are identical then the code could simply be removed from BKEditLogThreadBuf.

        I also think that the abstract classes for the Input/Output streams are cool and do not conflict with having an abstract EditLog class; in fact, we used them in each test prototype.

        I'll try to have a patch that applies against trunk, and I'd also like to address dhruba's request for multiple logging systems; maybe I could try with HADOOP-4539? Does that patch support multiple logging through multiple instances of the EditLogInput/OutputStream classes? I'd like to avoid duplicating efforts but, in my case, I really need to be able to override the current logSync().

        Konstantin Shvachko added a comment -

        Sorry it took me a little more than a couple of days.

        Luca> will your patch support multiple logging systems at a time?

        Yes, HADOOP-4539 supports multiple logging systems. Namely, it supports traditional file streams and (what I call) backup streams. In my view, you should be able to extend EditLogOutputStream and it will work with BookKeeper.

        Luca> 0.19 didn't allow me to modify the logSync() method. Will the patch you'll post fix the issue?

        No, I don't think the logSync() method should be modified, at least not the way you did in your patch HADOOP-5189.
        Synchronization (locking) in the logSync() method prevents different handler threads from writing the same information (edits) multiple times into the same edits file. Suppose you have 1 file stream (one edits file) and 2 threads call logSync() at the same time. If you remove locking as you did in your patch then both threads may write the same edits transactions into the same file twice. This will simply corrupt the edits file.

        logSync() may be modified to write in parallel into different streams. If you have 2 file streams then it makes sense to write into 2 files in parallel. But the information should be written into each file only once and in the right order. And your patch breaks that, afaiu.

        My main objection to your approach, though, is abstracting the whole EditLog class rather than just the EditLog*Streams. This leads to replication of a lot of code. Say, both classes EditLog and BKEditLogThreadBuf have the loadFSEdits() methods, which are literally identical, right? We should avoid that at all costs.

        dhruba borthakur added a comment -

        I had an offline discussion with Konstantin. It appears that making it an abstract class makes sense to me. Konstantin might have more comments on this one.

        Luca Telloli added a comment -

        Raghu, in reply to your comment https://issues.apache.org/jira/browse/HADOOP-5189?focusedCommentId=12672454#action_12672454 it's hard for me to tell if the current interface is "inadequate", but I can definitely tell it was insufficient for our needs, and this brings us back to the logSync() issue, which I'll detail here for the sake of completeness.

        In the current implementation, during sync/flush one process among others gets elected to be the "flusher" while the others wait for it to complete; the way the NN batches multiple entries into each flush thanks to the interleaving of threads is a very well written piece of code, but it is not good for our purposes.

        What we do with BookKeeper is to allow multiple parallel ordered asynchronous writes to be pipelined on the BookKeeper side rather than the NN side, and to achieve this we need to hack the logSync() method to avoid unnecessary synchronizations. While the NN uses multiple synchronized threads, BookKeeper uses queues, which we believe should help the overall performance.

        In short: BK implements the two interfaces and, in addition, overrides logSync(), which is not part of the interfaces. So I went the "abstract class" way; do you think that's reasonable?

        Luca Telloli added a comment - - edited

        Konstantin,
        the EditLogInput/OutputStream classes in 0.19 didn't allow me to modify the logSync() method, which is something crucial for the integration with BookKeeper. Will the patch you'll post fix the issue? What I did is basically to implement EditLog as an abstract class and override what I needed. This way the code in FSEditLog was also much cleaner. What do you think?

        dhruba, yes, I'm referring to FSImage.java. By the way, Konstantin, will your patch support multiple logging systems at a time?

        Konstantin Shvachko added a comment -

        I think Dhruba's proposal
        https://issues.apache.org/jira/browse/HADOOP-5189#action_12671968
        to overload EditLogInput/OutputStream rather than EditLog is exactly the direction we should be moving towards.
        I am planning to upload a patch for HADOOP-4539 in the next couple of days. This will give an example of how the generic EditLogStreams can be used in this context. I found the stream functionality quite sufficient, but the framework does not completely support the abstraction, so I understand Luca's struggle with that and I am making some changes. I hope this will simplify your part too.

        dhruba borthakur added a comment -

        > I thought about having multiple logging systems in parallel and the simplest solution I could think of is a vector of EditLogs inside FSImage. How does it sound?

        I am assuming that you are referring to the code in FsImage.java (and not the fsimage file that a namenode creates).

        Luca Telloli added a comment -

        It would be good to have a better abstraction for logging. I had this need when trying to integrate BookKeeper (https://issues.apache.org/jira/browse/ZOOKEEPER-276) with HDFS.

        My biggest problem during integration was related to the logSync() method, which I had to modify to take advantage of BookKeeper features. The approach I took was to create an abstract class EditLog, incorporating most of the logic of FSEditLog, and have FSEditLog extend it. I'm not sure this is the best solution in the long term, so any design comment is welcome. In particular, some methods that are implemented as synchronised in the EditLog class might be used in a non-synchronised way when using BookKeeper, so the actual synchronisation might be left to the subclass rather than the superclass.

        Since after this patch multiple logging types might be available, I added an abstract method getLogNameType() to return the name of the logging system in use.

        == USAGE ==
        Applies against the hadoop-0.19.0 release (http://svn.apache.org/repos/asf/hadoop/core/tags/release-0.19.0/)
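
        A minimal sketch of the shape this comment describes, with illustrative method bodies (the point is the split: common logic in the abstract class, synchronization decided by each subclass):

            import java.io.IOException;

            abstract class EditLog {
                // Name of the logging system in use, e.g. "file" or "bookkeeper".
                abstract String getLogNameType();

                // Left abstract so each backend chooses its own synchronization:
                // the file-based subclass keeps the current locking, while a
                // BookKeeper subclass can rely on BK's ordering guarantees.
                abstract void logSync() throws IOException;
            }

            class FSEditLog extends EditLog {
                String getLogNameType() { return "file"; }
                synchronized void logSync() throws IOException {
                    // existing batched flush-and-sync logic
                }
            }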


          People

          • Assignee: Luca Telloli
          • Reporter: Luca Telloli
          • Votes: 0
          • Watchers: 13
