Hadoop HDFS
  1. Hadoop HDFS
  2. HDFS-1448

Create multi-format parser for edits logs file, support binary and XML formats initially

    Details

    • Type: New Feature New Feature
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.22.0
    • Fix Version/s: 0.23.0
    • Component/s: tools
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Hide
      Offline edits viewer feature adds oev tool to hdfs script. Oev makes it possible to convert edits logs to/from native binary and XML formats. It uses the same framework as Offline image viewer.

      Example usage:

      $HADOOP_HOME/bin/hdfs oev -i edits -o output.xml
      Show
      Offline edits viewer feature adds oev tool to hdfs script. Oev makes it possible to convert edits logs to/from native binary and XML formats. It uses the same framework as Offline image viewer. Example usage: $HADOOP_HOME/bin/hdfs oev -i edits -o output.xml

      Description

      Create multi-format parser for edits logs file, support binary and XML formats initially.

      Parsing should work from any supported format to any other supported format (e.g. from binary to XML and from XML to binary).

      The binary format is the format used by FSEditLog class to read/write edits file.

      Primary reason to develop this tool is to help with troubleshooting, the binary format is hard to read and edit (for human troubleshooters).

      Longer term it could be used to clean up and minimize parsers for fsimage and edits files. Edits parser OfflineEditsViewer is written in a very similar fashion to OfflineImageViewer. Next step would be to merge OfflineImageViewer and OfflineEditsViewer and use the result in both FSImage and FSEditLog. This is subject to change, specifically depending on adoption of avro (which would completely change how objects are serialized as well as provide ways to convert files to different formats).

      1. CDH-4355.txt
        1 kB
        Colin Patrick McCabe
      2. editsStored
        1.00 MB
        Erik Steffl
      3. HDFS-1448-0.22.patch
        94 kB
        Erik Steffl
      4. HDFS-1448-0.22-1.patch
        110 kB
        Erik Steffl
      5. HDFS-1448-0.22-2.patch
        110 kB
        Erik Steffl
      6. HDFS-1448-0.22-3.patch
        138 kB
        Erik Steffl
      7. HDFS-1448-0.22-4.patch
        146 kB
        Erik Steffl
      8. HDFS-1448-0.22-5.patch
        147 kB
        Erik Steffl
      9. Viewer hierarchy.pdf
        44 kB
        Jakob Homan

        Issue Links

          Activity

          Hide
          Erik Steffl added a comment -

          Patch HDFS-1448-0.22.patch implements binary and XML parser for edits file.

          Binary file editsStored should go to src/test/hdfs/org/apache/hadoop/hdfs/tools/offlineEditsViewer/editsStored.

          Unit test: ant -Dtestcase=TestOfflineEditsViewer -Dtest.output=yes test

          Usage:

          $HADOOP_HOME/bin/hdfs oev -i rrr.xml -o rrr.bin -v

          File type: *.xml file is parsed as XML file, all other files are treated as binary.

          -i name specifies input file name (xml or binary)
          -o name specifies output file name (xml or binary)
          -v means verbose, prints the filenames and XML output to screen (if XML output, otherwise nothing)

          Show
          Erik Steffl added a comment - Patch HDFS-1448 -0.22.patch implements binary and XML parser for edits file. Binary file editsStored should go to src/test/hdfs/org/apache/hadoop/hdfs/tools/offlineEditsViewer/editsStored. Unit test: ant -Dtestcase=TestOfflineEditsViewer -Dtest.output=yes test Usage: $HADOOP_HOME/bin/hdfs oev -i rrr.xml -o rrr.bin -v File type: *.xml file is parsed as XML file, all other files are treated as binary. -i name specifies input file name (xml or binary) -o name specifies output file name (xml or binary) -v means verbose, prints the filenames and XML output to screen (if XML output, otherwise nothing)
          Hide
          Jeff Hammerbacher added a comment -

          Would it be possible to write the edits log as an Avro data file? We'd then get these tools for free.

          Show
          Jeff Hammerbacher added a comment - Would it be possible to write the edits log as an Avro data file? We'd then get these tools for free.
          Hide
          Jeff Hammerbacher added a comment -

          Sorry, saw that you mentioned Avro near the end of the description. Would still be curious to hear what you would need out of Avro to make it a reasonable choice here.

          Show
          Jeff Hammerbacher added a comment - Sorry, saw that you mentioned Avro near the end of the description. Would still be curious to hear what you would need out of Avro to make it a reasonable choice here.
          Hide
          Erik Steffl added a comment -

          More thoughts on using avro etc.

          At this point there is no method to serialization of elements of edits (same goes for fsimage). Some elements are written to DataOutputStream directly (using DataOutputStream methods), some are Writable, some use other ways to serialize/deserialize, sometime themselves, sometime FSEditLog does it.

          We can probably use avro (in either way as described below) only after we replace the FSEditLog by the new parser (at this point parser has to mimic all the weird ways data structures are serialized/deserialized by FSEditLog).

          The parser in the patch is external to all this, i.e. I looked at what data is being written/read and built parser using that information (see Tokenizer.java for supported datatypes (different Tokens), EditsLoaderCurrent.java for grammar of the edits file). Otherwise I would not be able to support different data formats.

          Eventually all elements should be read/written in some organized way (e.g. using avro) but that's outside of scope of the current patch, which is targeting troubleshooting at this point.

          As far as I can tell there are two ways to handle this:

          • keep parser (and later on, writer) of edits (and fsimage) separated from the rest of the code, i.e. parser knows how to serialize/deserialize data structures that it reads/writes
          • use avro (or other serialization method) consistently across all data structures that need to be serializable, parser then knows high level grammar of the file and lets avro do the rest (or maybe avro would know how to parse the whole file, not sure)

          Keeping parser separate would make it easier to keep the file structure documented (grammar is centralized in the parser). Letting data structures serialize themselves makes it easier to add/change/remove data structures (no need to change data structures AND parser).

          Show
          Erik Steffl added a comment - More thoughts on using avro etc. At this point there is no method to serialization of elements of edits (same goes for fsimage). Some elements are written to DataOutputStream directly (using DataOutputStream methods), some are Writable, some use other ways to serialize/deserialize, sometime themselves, sometime FSEditLog does it. We can probably use avro (in either way as described below) only after we replace the FSEditLog by the new parser (at this point parser has to mimic all the weird ways data structures are serialized/deserialized by FSEditLog). The parser in the patch is external to all this, i.e. I looked at what data is being written/read and built parser using that information (see Tokenizer.java for supported datatypes (different Tokens), EditsLoaderCurrent.java for grammar of the edits file). Otherwise I would not be able to support different data formats. Eventually all elements should be read/written in some organized way (e.g. using avro) but that's outside of scope of the current patch, which is targeting troubleshooting at this point. As far as I can tell there are two ways to handle this: keep parser (and later on, writer) of edits (and fsimage) separated from the rest of the code, i.e. parser knows how to serialize/deserialize data structures that it reads/writes use avro (or other serialization method) consistently across all data structures that need to be serializable, parser then knows high level grammar of the file and lets avro do the rest (or maybe avro would know how to parse the whole file, not sure) Keeping parser separate would make it easier to keep the file structure documented (grammar is centralized in the parser). Letting data structures serialize themselves makes it easier to add/change/remove data structures (no need to change data structures AND parser).
          Hide
          Jakob Homan added a comment -

          Code review:

          • General:
            • All classes should be categorized with audience and stability
            • No need for all the brackets in messages. Breaks with what passes for our style.
            • Do we need to write to disk for tests? Just write to output stream
          • editsStored.xml
            • Indent/format to make more human readable
          • TestOfflineImageViewer.java
            • Convert getBuildDir and getCacheDir to fields, rather than re-evaluating the method each call
            • Would it be better to split the single, large test into four smaller tests, with more descriptive names?
            • The commented-out code should be removed. If it's useful for manual testing, it can be included in a static main in the test
            • Style: runOev method calls don't follow code convention, they can be all on one line
            • The methods runOevXmlToBinary/runOevBinaryToXml can be refactored to remove common code, which is most of it.
            • There is no need for a separate printToScreen variable in those methods
            • fileEqualIgnoreTrailingZeroes: Since largeFilename is just aliased to filename1, there is no need for filename1. Just use that name as the method parameter.
            • loadFile(): I'm surprised we don't have a utility method in the test package to do this. It's a general operation and this method may be better located there.
            • A larger problem is that this test doesn't use asserts to verify correctness, which will make working with it difficult. The exceptions should be converted to fully described JUnit asserts.
          • OfflineEditsViewerHelper.java
            • Class needs Javadoc
            • Is it necessary to copy the edits file? Instead, can we just leave it in place and test it there? A better option, though I don't believe supported by MiniDFSCluster, would be if we could just write the edits to a memory buffer and avoid the disk altogether.
            • Commented out code: fc = FileContext.getFileContext(cluster.getURI(), config);
          • Tokenizer.java
            • Tokenizer works specifically with EditsElements, may be good to give it more specific name. Same comment for Token.
            • I'm torn on the individual Token* classes. I'd rather if there were a way of directly integrating them into edits, but that's a bridge too far for this patch. Scala case classes would be quite helpful here...
            • Several referances to static method encode/decodeBase64 via instance variable
              *EditsLoaderCurrent.java
            • The duplicated edits enums should be re-factored into shared class rather than duplicated.
            • Style: case OP_CLOSE doesn't need to be surrounded by braces, as do several other cases.
            • The more involved cases should be refactored into separate classes to aid readability. This may be reasonable for all the cases to be consistent.
            • OP_UPDATE_MASTER_KEY this seems to be the only place we check for the possibility of working with an unsupported version. Is there a reason for this?
            • The pattern:
              v.visit(t.read(new Tokenizer.Token{Whatever}(EditsElement.LENGTH)));

              is repeated quite a lot. Can this be refactored into a helper method to aid in readability?

            • By doing a static import of the various Tokenizer classes (which can be made static) such as:
              import static org.apache.hadoop.hdfs.tools.offlineEditsViewer.Tokenizer.TokenInt;

              you can avoid the extra reference to Tokenizer in the visit calls.

            • I'm not sure that the statistics functionality adds any value to this class. It may be better to create a separate statistics viewer that provides this information.
            • Several unnecessary imports
          • EditsVisitor.java
            • The DepthCounter duplicates the same class in the oiv. May as well create a common utility class and share it.
            • Commented out code:
               // abstract void visit(EditsElement element, String value) throws IOException;  
            • Unnecessary import of DeprecatedUTF8
          • EditsVisitorXml.java
            • Consistent naming with oiv would be XmlEditsVisitor
            • I believe this class is quite ripe for a shared generic implementation with oiv's Xml viewer. This is discussed more below.
            • unnecessary import of Base64 class
          • OfflineEditsViewer.java
            • Typo: This class implements and offline edits viewer, tool that (and -> an)
            • No need to mention note about OfflineImageViewer.
            • The command line parsing and options shares quite a bit of code with the oiv and may be easy to merge.
          • EditsVisitorBinary.java
            • The printToScreen option is ignored and doesn't make sense for this viewer. It may be fine to keep the option, but we should probably add documentation about it being ignored by some visitors
            • No need for commented-out debugging code
          • Tokenizers.java
            • Since the class is a factory perhaps TokenizerFactory is a better name?
            • The file type determination can be simplified by checking for .endsWith(".xml")
            • Typo:* Tokenizers (factory) for different (implementatios -> implementations) of Tokenizer
          • EditsVisitorText.java
            • Consistent naming with oiv would be TextEditsVisitor
            • It may be reasonably easy to merge the oiv's TextVisitor, which this class duplicates, at this stage.
          • TokenizerBinary.java
            • Several unnecessary imports

          I'll have more comments after these revisions.

          Overall, this is a good effort of extending the oiv's approach. However, there does end up being quite a lot of duplicated code. I believe I've come up with a type hierarchy that we can use (attached) that would eliminate all of the duplicated code. The green classes are extensions of the oiv to write and read the image, which we had discussed creating when the oiv was written, but haven't done yet.

          For this patch, I think there are three options:

          1. Modulo patch revisions as outlined above and forthcoming, commit the patch as is and immediately open a new JIRA to de-duplicate code and merge the viewers together. This has the advantage of providing a greenfield for the new JIRA that doesn't have worry about creating the oev at the same time of merging it with the existing infrastructure.
          2. Implement the de-dupe and merging in this patch and commit that. This has the advantage of no duplicate code going into the source, but will increase the size and complexity of this patch.
          3. Re-factor the duplicated code into private methods and have both viewers call that, and immediately open a new JIRA to de-dupe and merge, as in #1. This has the advantage of no duplicate code going into the source, but the refactoring could get messy.

          Since there is no upcoming release where the oev would be exposed and thus cause deprecation lock-in, I favor #1. Regardless, we should probably open a JIRA to finish off merging the oiv and edits loading code.

          Show
          Jakob Homan added a comment - Code review: General: All classes should be categorized with audience and stability No need for all the brackets in messages. Breaks with what passes for our style. Do we need to write to disk for tests? Just write to output stream editsStored.xml Indent/format to make more human readable TestOfflineImageViewer.java Convert getBuildDir and getCacheDir to fields, rather than re-evaluating the method each call Would it be better to split the single, large test into four smaller tests, with more descriptive names? The commented-out code should be removed. If it's useful for manual testing, it can be included in a static main in the test Style: runOev method calls don't follow code convention, they can be all on one line The methods runOevXmlToBinary/runOevBinaryToXml can be refactored to remove common code, which is most of it. There is no need for a separate printToScreen variable in those methods fileEqualIgnoreTrailingZeroes: Since largeFilename is just aliased to filename1, there is no need for filename1. Just use that name as the method parameter. loadFile(): I'm surprised we don't have a utility method in the test package to do this. It's a general operation and this method may be better located there. A larger problem is that this test doesn't use asserts to verify correctness, which will make working with it difficult. The exceptions should be converted to fully described JUnit asserts. OfflineEditsViewerHelper.java Class needs Javadoc Is it necessary to copy the edits file? Instead, can we just leave it in place and test it there? A better option, though I don't believe supported by MiniDFSCluster, would be if we could just write the edits to a memory buffer and avoid the disk altogether. Commented out code: fc = FileContext.getFileContext(cluster.getURI(), config); Tokenizer.java Tokenizer works specifically with EditsElements, may be good to give it more specific name. Same comment for Token. I'm torn on the individual Token* classes. I'd rather if there were a way of directly integrating them into edits, but that's a bridge too far for this patch. Scala case classes would be quite helpful here... Several referances to static method encode/decodeBase64 via instance variable *EditsLoaderCurrent.java The duplicated edits enums should be re-factored into shared class rather than duplicated. Style: case OP_CLOSE doesn't need to be surrounded by braces, as do several other cases. The more involved cases should be refactored into separate classes to aid readability. This may be reasonable for all the cases to be consistent. OP_UPDATE_MASTER_KEY this seems to be the only place we check for the possibility of working with an unsupported version. Is there a reason for this? The pattern: v.visit(t.read(new Tokenizer.Token{Whatever}(EditsElement.LENGTH))); is repeated quite a lot. Can this be refactored into a helper method to aid in readability? By doing a static import of the various Tokenizer classes (which can be made static) such as: import static org.apache.hadoop.hdfs.tools.offlineEditsViewer.Tokenizer.TokenInt; you can avoid the extra reference to Tokenizer in the visit calls. I'm not sure that the statistics functionality adds any value to this class. It may be better to create a separate statistics viewer that provides this information. Several unnecessary imports EditsVisitor.java The DepthCounter duplicates the same class in the oiv. May as well create a common utility class and share it. Commented out code: // abstract void visit(EditsElement element, String value) throws IOException; Unnecessary import of DeprecatedUTF8 EditsVisitorXml.java Consistent naming with oiv would be XmlEditsVisitor I believe this class is quite ripe for a shared generic implementation with oiv's Xml viewer. This is discussed more below. unnecessary import of Base64 class OfflineEditsViewer.java Typo: This class implements and offline edits viewer, tool that (and -> an) No need to mention note about OfflineImageViewer. The command line parsing and options shares quite a bit of code with the oiv and may be easy to merge. EditsVisitorBinary.java The printToScreen option is ignored and doesn't make sense for this viewer. It may be fine to keep the option, but we should probably add documentation about it being ignored by some visitors No need for commented-out debugging code Tokenizers.java Since the class is a factory perhaps TokenizerFactory is a better name? The file type determination can be simplified by checking for .endsWith(".xml") Typo:* Tokenizers (factory) for different (implementatios -> implementations) of Tokenizer EditsVisitorText.java Consistent naming with oiv would be TextEditsVisitor It may be reasonably easy to merge the oiv's TextVisitor, which this class duplicates, at this stage. TokenizerBinary.java Several unnecessary imports I'll have more comments after these revisions. Overall, this is a good effort of extending the oiv's approach. However, there does end up being quite a lot of duplicated code. I believe I've come up with a type hierarchy that we can use (attached) that would eliminate all of the duplicated code. The green classes are extensions of the oiv to write and read the image, which we had discussed creating when the oiv was written, but haven't done yet. For this patch, I think there are three options: Modulo patch revisions as outlined above and forthcoming, commit the patch as is and immediately open a new JIRA to de-duplicate code and merge the viewers together. This has the advantage of providing a greenfield for the new JIRA that doesn't have worry about creating the oev at the same time of merging it with the existing infrastructure. Implement the de-dupe and merging in this patch and commit that. This has the advantage of no duplicate code going into the source, but will increase the size and complexity of this patch. Re-factor the duplicated code into private methods and have both viewers call that, and immediately open a new JIRA to de-dupe and merge, as in #1. This has the advantage of no duplicate code going into the source, but the refactoring could get messy. Since there is no upcoming release where the oev would be exposed and thus cause deprecation lock-in, I favor #1. Regardless, we should probably open a JIRA to finish off merging the oiv and edits loading code.
          Hide
          Konstantin Shvachko added a comment -

          I like Jacobs idea to isolate duplicate code common to I and E viewers. The hierarchy of classes looks good. I especially like the potential of reusing the same code for the actual image and edits loading and the viewers.
          Implementing plan #1 seems to be the shortest path to the final success. If we go with plan 1, it would be good to minimize public methods - easier for refactoring.

          Show
          Konstantin Shvachko added a comment - I like Jacobs idea to isolate duplicate code common to I and E viewers. The hierarchy of classes looks good. I especially like the potential of reusing the same code for the actual image and edits loading and the viewers. Implementing plan #1 seems to be the shortest path to the final success. If we go with plan 1, it would be good to minimize public methods - easier for refactoring.
          Hide
          Erik Steffl added a comment -

          Patch HDFS-1448-0.22-1.patch addresses the review comments https://issues.apache.org/jira/browse/HDFS-1448?focusedCommentId=12920717&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12920717

          It also has few minor updates:

          • uses MiniDFSCluster.Builder instead of a constructor
          • handles edits that were not properly closed (e.g. namenode crushed) better
          • tests check that all op codes were tested
          Show
          Erik Steffl added a comment - Patch HDFS-1448 -0.22-1.patch addresses the review comments https://issues.apache.org/jira/browse/HDFS-1448?focusedCommentId=12920717&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12920717 It also has few minor updates: uses MiniDFSCluster.Builder instead of a constructor handles edits that were not properly closed (e.g. namenode crushed) better tests check that all op codes were tested
          Hide
          Erik Steffl added a comment -

          HDFS-1448-0.22-2.patch is identical to HDFS-1448-0.22-1.patch except for removed temporary troubleshooting code that I forgot to remove before (in TestOfflineEditsViewer.java).

          Show
          Erik Steffl added a comment - HDFS-1448 -0.22-2.patch is identical to HDFS-1448 -0.22-1.patch except for removed temporary troubleshooting code that I forgot to remove before (in TestOfflineEditsViewer.java).
          Hide
          Jakob Homan added a comment -

          Patch review for HDFS-1448-0.22-2.patch.

          • BinaryTokenizer.java
            • Providing a constructor that takes a stream rather than a String could aid in testing (my goal is for all testing to be able to be done via streams without going to the file system).
          • EditsLoader.java
            • printStatistics - does this method add value that couldn't otherwise be achieved as a separate viewer? I'm still not sold on providing this information for every run. The vast majority of oev instances won't be interested in it, but will still have to pay the penalty of compiling the statistics. Conversely, the information could be of interest specifically (ie, tell me about this edits log), and then the user will have to run some other viewer just get it. This same information can be gathered as a separate visitor, as mentioned in the first review.
          • EditsLoaderCurrent.java
            • Comments on 157-160 should be moved one line down so they apply to the check they're describing.
            • The prior review had asked for the various switch statements to be moved into separate functions to aid in readability, testing and code maintability. Does the new code, with its individual functors and the extra code necessary to implement this scheme provide any functionality not given by the original suggestion? If not, it seems to be a large amount of extra code and wiring without any benefit. The goal of the comment in the original review was to reduce complexity and improve readability, which I'm not sure this new approach accomplishes.
          • EditsOpCode.java
            • The content of this file duplicates the constants created in FSEditLog.java. While it would be best to avoid all duplication, that may not be possible in this patch, as discussed above. However, to minimize duplication, I suggest refactoring out the constants from FSEditsLog into a separate class and referring to those constants in the enum definition. Bonus points would be to have the Enum in the same file as the constants.
              EditsVisitor.java
            • What's the use case for the getTokenizer() method? It doesn't seem to be called anywhere. Unused methods should be removed.
          • EditsVisitorFactory.java
            • Does the three lines of regular expression parsing necessary to determine if a file ends in .xml provide any extra benefit than simply using filename.endWith(".xml"), as was proposed in the original review? If not, we should prefer the shorter, simpler code.
            • It may be good to support .XML, .Xml, etc., and therefore call toLower on the string before checking for the file extension.
            • Typo: "different implementatios"
          • OfflineEditsViewer.java
            • The only method calling public setEditsLoader() is OfflineEditsViewer is go(). As such, it should be made private.
          • Tokenizer.java
          • TokenizerFactory.java
            • Same comments for as for EditsFactory.java
          • XmlTokenizer.java
            • A quick survey of our exception handling shows that it is preferable to nest exceptions rather than taking the message from one, swallowing it and throwing a new exception: http://dl.dropbox.com/u/565949/exceptions.txt We should do the same here.
            • Do we need to handle all the empty cases in the switch? At the very least, there are a lot of empty comments that should be returned.
            • public Token read(Token t) throws IOException { this method returns the same Token that it accepts, which has a bit of code smell. I wonder if there's a way to avoid the confusion of mutating the parameter and then returning it, or, if not, explicitly documenting this behavior.
              OfflineEditsViewerHelper.java
            • As an aside: Whatever form elements of the edits file eventually take, it would be nice if they were self-testing and could provide this information automatically, rather than needing to call each one here, decoupled from the implementation.
            • As noted in the first review, it is unnecessary and inefficient to shell out to copy the edits file. This file is used a source for the test; you can find where it is (as is done in order to accomplish the copying) and explicitly clean it up after the test has completed.
              I suggest refactoring
              public void generateEdits(String dfsDir, String editsFilename)

              to

              public String generateEdits(String dfsDir) // return path to edits

              and providing a shutdown method on OfflineEditsViewer that cleans up the cluster when the unit test has completed.

            • 137: No need for IOException to be on a separate line
            • Nit: 208: Void type is preferable to Object for doAs blocks
            • 218: Same comment about nesting exceptions (as well as anywhere else I may have missed).
          • TestOfflineEditsViewer.java
            • Remove commented-out lines 27,28
            • Nit: Line 174, save a few characters with a while loop
            • 134: Exception should be converted to an assert since this is a unit test
          Show
          Jakob Homan added a comment - Patch review for HDFS-1448 -0.22-2.patch. BinaryTokenizer.java Providing a constructor that takes a stream rather than a String could aid in testing (my goal is for all testing to be able to be done via streams without going to the file system). EditsLoader.java printStatistics - does this method add value that couldn't otherwise be achieved as a separate viewer? I'm still not sold on providing this information for every run. The vast majority of oev instances won't be interested in it, but will still have to pay the penalty of compiling the statistics. Conversely, the information could be of interest specifically (ie, tell me about this edits log), and then the user will have to run some other viewer just get it. This same information can be gathered as a separate visitor, as mentioned in the first review. EditsLoaderCurrent.java Comments on 157-160 should be moved one line down so they apply to the check they're describing. The prior review had asked for the various switch statements to be moved into separate functions to aid in readability, testing and code maintability. Does the new code, with its individual functors and the extra code necessary to implement this scheme provide any functionality not given by the original suggestion? If not, it seems to be a large amount of extra code and wiring without any benefit. The goal of the comment in the original review was to reduce complexity and improve readability, which I'm not sure this new approach accomplishes. EditsOpCode.java The content of this file duplicates the constants created in FSEditLog.java. While it would be best to avoid all duplication, that may not be possible in this patch, as discussed above. However, to minimize duplication, I suggest refactoring out the constants from FSEditsLog into a separate class and referring to those constants in the enum definition. Bonus points would be to have the Enum in the same file as the constants. EditsVisitor.java What's the use case for the getTokenizer() method? It doesn't seem to be called anywhere. Unused methods should be removed. EditsVisitorFactory.java Does the three lines of regular expression parsing necessary to determine if a file ends in .xml provide any extra benefit than simply using filename.endWith(".xml"), as was proposed in the original review? If not, we should prefer the shorter, simpler code. It may be good to support .XML, .Xml, etc., and therefore call toLower on the string before checking for the file extension. Typo: "different implementatios" OfflineEditsViewer.java The only method calling public setEditsLoader() is OfflineEditsViewer is go(). As such, it should be made private. Tokenizer.java Spacing between fields and methods does not follow our coding standard. Consensus on naming convention for tokens of Foo appears to be FooToken, rather than TokenFoo (see: http://www.google.com/codesearch?hl=en&sa=N&q=file:^.*Token*.java ), as well as our own 'Token'y classes. We should follow that here. TokenizerFactory.java Same comments for as for EditsFactory.java XmlTokenizer.java A quick survey of our exception handling shows that it is preferable to nest exceptions rather than taking the message from one, swallowing it and throwing a new exception: http://dl.dropbox.com/u/565949/exceptions.txt We should do the same here. Do we need to handle all the empty cases in the switch? At the very least, there are a lot of empty comments that should be returned. public Token read(Token t) throws IOException { this method returns the same Token that it accepts, which has a bit of code smell. I wonder if there's a way to avoid the confusion of mutating the parameter and then returning it, or, if not, explicitly documenting this behavior. OfflineEditsViewerHelper.java As an aside: Whatever form elements of the edits file eventually take, it would be nice if they were self-testing and could provide this information automatically, rather than needing to call each one here, decoupled from the implementation. As noted in the first review, it is unnecessary and inefficient to shell out to copy the edits file. This file is used a source for the test; you can find where it is (as is done in order to accomplish the copying) and explicitly clean it up after the test has completed. I suggest refactoring public void generateEdits(String dfsDir, String editsFilename) to public String generateEdits(String dfsDir) // return path to edits and providing a shutdown method on OfflineEditsViewer that cleans up the cluster when the unit test has completed. 137: No need for IOException to be on a separate line Nit: 208: Void type is preferable to Object for doAs blocks 218: Same comment about nesting exceptions (as well as anywhere else I may have missed). TestOfflineEditsViewer.java Remove commented-out lines 27,28 Nit: Line 174, save a few characters with a while loop 134: Exception should be converted to an assert since this is a unit test
          Hide
          Erik Steffl added a comment -

          HDFS-1448-0.22-3.patch implements changes suggested in review https://issues.apache.org/jira/browse/HDFS-1448?focusedCommentId=12931182&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12931182

          Few notes:

          XmlTokenizer.java public Token read(Token t): documented rationale in Tokenizer.java

          TestOfflineEditsViewer.java Nit: Line 174, save a few characters with a while loop: function changed a bit, don't think it applies now

          Show
          Erik Steffl added a comment - HDFS-1448 -0.22-3.patch implements changes suggested in review https://issues.apache.org/jira/browse/HDFS-1448?focusedCommentId=12931182&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12931182 Few notes: XmlTokenizer.java public Token read(Token t): documented rationale in Tokenizer.java TestOfflineEditsViewer.java Nit: Line 174, save a few characters with a while loop: function changed a bit, don't think it applies now
          Hide
          Jakob Homan added a comment -

          Looking good. Thanks for the changes.

          Review of latest patch (line numbers refer to patch):

          • 544: typo: judgement -> judgment
          • 785: Commented out code
          • 822-824: The call to the MiniDFSCluster builder can be simplified and the builder variable eliminated.
          • 3045: Usage should show option of using long option, not just -i/-o.
          • 3045, cont.: Usage should also document other command line options and statistics viewer.
          • 3045ish: For specifying the processor, we should probably follow the oiv convention of -p
          • 3867: The call to logEdit is broken in a bit of an odd place. Does the call not fit on one line?
          • FSEditLog: Would the code be cleaner with static imports of the FSEDitLogOpCodes?
          • The oev will need comparable documentation to what is provided for the oiv.
          Show
          Jakob Homan added a comment - Looking good. Thanks for the changes. Review of latest patch (line numbers refer to patch): 544: typo: judgement -> judgment 785: Commented out code 822-824: The call to the MiniDFSCluster builder can be simplified and the builder variable eliminated. 3045: Usage should show option of using long option, not just -i/-o. 3045, cont.: Usage should also document other command line options and statistics viewer. 3045ish: For specifying the processor, we should probably follow the oiv convention of -p 3867: The call to logEdit is broken in a bit of an odd place. Does the call not fit on one line? FSEditLog: Would the code be cleaner with static imports of the FSEDitLogOpCodes? The oev will need comparable documentation to what is provided for the oiv.
          Show
          Erik Steffl added a comment - HDFS-1448 -0.22-4.patch address the points in review from 09/Dec/10 08:06 PM https://issues.apache.org/jira/browse/HDFS-1448?focusedCommentId=12970037&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12970037
          Hide
          Erik Steffl added a comment -

          HDFS-1448-0.22-5.patch fixes findBugs warnings,

          test-patch results:

          [exec] -1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 16 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          [exec]
          [exec] -1 release audit. The applied patch generated 1 release audit warnings (more than the trunk's current 0 warnings).

          The release audit warning is caused by editsStored.xml, since it's a reference file for test it does not have apache license included. As far as I can tell this is acceptable since it's a file that must be exactly same as expected output of the Offline edits viewer. Let me know if there's a (better) standard way to deal with this scenario.

          Show
          Erik Steffl added a comment - HDFS-1448 -0.22-5.patch fixes findBugs warnings, test-patch results: [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 16 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] -1 release audit. The applied patch generated 1 release audit warnings (more than the trunk's current 0 warnings). The release audit warning is caused by editsStored.xml, since it's a reference file for test it does not have apache license included. As far as I can tell this is acceptable since it's a file that must be exactly same as expected output of the Offline edits viewer. Let me know if there's a (better) standard way to deal with this scenario.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          +1 patch looks good.

          Show
          Tsz Wo Nicholas Sze added a comment - +1 patch looks good.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          I have committed it. Thanks, Erik!

          Erik, please add Release Note for the new feature.

          Show
          Tsz Wo Nicholas Sze added a comment - I have committed it. Thanks, Erik! Erik, please add Release Note for the new feature.
          Hide
          Erik Steffl added a comment -

          Release note added.

          Show
          Erik Steffl added a comment - Release note added.
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #539 (See https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/539/)

          Show
          Hudson added a comment - Integrated in Hadoop-Hdfs-trunk-Commit #539 (See https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/539/ )
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #643 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/643/)

          Show
          Hudson added a comment - Integrated in Hadoop-Hdfs-trunk #643 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/643/ )
          Hide
          Colin Patrick McCabe added a comment -

          This augments OEV to print out the highest generation stamp as part of its stats collection pass.

          Show
          Colin Patrick McCabe added a comment - This augments OEV to print out the highest generation stamp as part of its stats collection pass.
          Hide
          Colin Patrick McCabe added a comment -

          Created a new issue, https://issues.apache.org/jira/browse/HDFS-2971, for some OEV / OIV improvements

          Show
          Colin Patrick McCabe added a comment - Created a new issue, https://issues.apache.org/jira/browse/HDFS-2971 , for some OEV / OIV improvements

            People

            • Assignee:
              Erik Steffl
              Reporter:
              Erik Steffl
            • Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development