Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-1016

Make the format of the Job History be JSON instead of Avro binary

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: None
    • Labels:
      None

      Description

      I forgot that one of the features that would be nice is to off load the job history display from the JobTracker. That will be a lot easier, if the job history is stored in JSON. Therefore, I think we should change the storage now to prevent incompatibilities later.

      1. MAPREDUCE-1016.patch
        3 kB
        Doug Cutting
      2. MAPREDUCE-1016.patch
        29 kB
        Doug Cutting

        Issue Links

          Activity

          Hide
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #127 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/127/)

          Show
          Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk #127 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/127/ )
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #96 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/96/)
          . Make the job history log format JSON.

          Show
          Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk-Commit #96 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/96/ ) . Make the job history log format JSON.
          Hide
          Doug Cutting added a comment -

          I just committed this.

          Show
          Doug Cutting added a comment - I just committed this.
          Hide
          Sharad Agarwal added a comment -

          Patch looks good. Could you update the eclipse .classpath template as well when you commit?

          Show
          Sharad Agarwal added a comment - Patch looks good. Could you update the eclipse .classpath template as well when you commit?
          Hide
          Hong Tang added a comment -

          If you think we should have a more general, header-based format, please file a separate issue.

          I created MAPREDUCE-1135 to track my proposal.

          Show
          Hong Tang added a comment - If you think we should have a more general, header-based format, please file a separate issue. I created MAPREDUCE-1135 to track my proposal.
          Hide
          Doug Cutting added a comment -

          Can someone please review this before it goes stale?

          Show
          Doug Cutting added a comment - Can someone please review this before it goes stale?
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12422414/MAPREDUCE-1016.patch
          against trunk revision 825469.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/78/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/78/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/78/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/78/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12422414/MAPREDUCE-1016.patch against trunk revision 825469. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/78/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/78/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/78/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/78/console This message is automatically generated.
          Hide
          Doug Cutting added a comment -

          Here's a new version of this patch that also upgrades to Avro 1.2.0.

          In Avro 1.2, generated classes are no longer all nested within an interface, which mandated a systematic update to references to the generated classes.

          Show
          Doug Cutting added a comment - Here's a new version of this patch that also upgrades to Avro 1.2.0. In Avro 1.2, generated classes are no longer all nested within an interface, which mandated a systematic update to references to the generated classes.
          Hide
          Doug Cutting added a comment -

          > In this jira, comparing with the current code, I am only proposing breaking down the first line (version string) into two lines and add an empty line after the schema (and add a few keywords).

          I think that is overkill. The point of the change to the log format was to make it trivial for external tools to process these logs. We don't want to make folks have to write a header processor. Most folks will probably just skip the first two lines and use a json parser. If they want to do some error-checking, they can check that the first line is "Avro-Json". Further, if they want, they can read the schema and use Avro. If we permit arbitrary headers then folks have to skip up until a blank line, which harder, although not much. But if they want to check the format or use Avro then they have to write a header parser, and logic to process the header, which I think is an unneeded imposition.

          The magic signature for this format is "Avro-Json\n". The format is:

          • first line is magic
          • second line is an Avro schema
          • subsequent lines are json-encoded instances of that schema

          I see no need for a more complex format at this time. The only change from the original Json-based format when Avro was added was changing the magic line to not have a version number and adding a line with the schema, which is, in essence, the version number. If you think we should have a more general, header-based format, please file a separate issue.

          Show
          Doug Cutting added a comment - > In this jira, comparing with the current code, I am only proposing breaking down the first line (version string) into two lines and add an empty line after the schema (and add a few keywords). I think that is overkill. The point of the change to the log format was to make it trivial for external tools to process these logs. We don't want to make folks have to write a header processor. Most folks will probably just skip the first two lines and use a json parser. If they want to do some error-checking, they can check that the first line is "Avro-Json". Further, if they want, they can read the schema and use Avro. If we permit arbitrary headers then folks have to skip up until a blank line, which harder, although not much. But if they want to check the format or use Avro then they have to write a header parser, and logic to process the header, which I think is an unneeded imposition. The magic signature for this format is "Avro-Json\n". The format is: first line is magic second line is an Avro schema subsequent lines are json-encoded instances of that schema I see no need for a more complex format at this time. The only change from the original Json-based format when Avro was added was changing the magic line to not have a version number and adding a line with the schema, which is, in essence, the version number. If you think we should have a more general, header-based format, please file a separate issue.
          Hide
          Hong Tang added a comment -

          I think we're over-engineering this. Can we take this to a separate issue?

          Sure. In this jira, comparing with the current code, I am only proposing breaking down the first line (version string) into two lines and add an empty line after the schema (and add a few keywords). This could make the format easily extended to a more elaborated schema in the future without any special code handling backward/forward compatibility.

          That would require the parsing of schema string to figure out that the schema matches across the files before merging, no ?

          Yes/No. The one stored in TFile would be a default one. If a job history has a different schema than that default, then, for that entry, we can keep the "Schema: " line in the data.

          Can we somehow make it available remotely via url ?

          One of the principal advantage of Avro versus Protocol Buffer is that it is completely self-contained, This seems a step backwards.

          Show
          Hong Tang added a comment - I think we're over-engineering this. Can we take this to a separate issue? Sure. In this jira, comparing with the current code, I am only proposing breaking down the first line (version string) into two lines and add an empty line after the schema (and add a few keywords). This could make the format easily extended to a more elaborated schema in the future without any special code handling backward/forward compatibility. That would require the parsing of schema string to figure out that the schema matches across the files before merging, no ? Yes/No. The one stored in TFile would be a default one. If a job history has a different schema than that default, then, for that entry, we can keep the "Schema: " line in the data. Can we somehow make it available remotely via url ? One of the principal advantage of Avro versus Protocol Buffer is that it is completely self-contained, This seems a step backwards.
          Hide
          Doug Cutting added a comment -

          > I'd like to propose the following that may make it easier toward future extension without requiring too much special case code to support backward compatibility [ ... ]

          I think we're over-engineering this. Can we take this to a separate issue?

          Show
          Doug Cutting added a comment - > I'd like to propose the following that may make it easier toward future extension without requiring too much special case code to support backward compatibility [ ... ] I think we're over-engineering this. Can we take this to a separate issue?
          Hide
          Sharad Agarwal added a comment -

          Or we can strip out the schema string and put it in a TFile metadata section.

          That would require the parsing of schema string to figure out that the schema matches across the files before merging, no ?
          I agree that there is a benefit of making the files self contained. Ideally I would like to remove the redundancy and make it self contained by pointing to a schema file in the similar way as done for schema references in xml files. Can we somehow make it available remotely via url ? svn view url comes to my mind. Does that sound too awkward?

          Show
          Sharad Agarwal added a comment - Or we can strip out the schema string and put it in a TFile metadata section. That would require the parsing of schema string to figure out that the schema matches across the files before merging, no ? I agree that there is a benefit of making the files self contained. Ideally I would like to remove the redundancy and make it self contained by pointing to a schema file in the similar way as done for schema references in xml files. Can we somehow make it available remotely via url ? svn view url comes to my mind. Does that sound too awkward?
          Hide
          Hong Tang added a comment -

          No, this is probably harder to manage. What we could do is (not sure if this is close to what Doug said) that we can periodically group a bunch of job history files into a single TFile. Hopefully compression would take care of the redundancy in the schema string. Or we can strip out the schema string and put it in a TFile metadata section.

          Show
          Hong Tang added a comment - No, this is probably harder to manage. What we could do is (not sure if this is close to what Doug said) that we can periodically group a bunch of job history files into a single TFile. Hopefully compression would take care of the redundancy in the schema string. Or we can strip out the schema string and put it in a TFile metadata section.
          Hide
          Sharad Agarwal added a comment -

          Line 3: "Schema: <schema string>"

          Adding event schema to each file may be too verbose, especially for small jobs. We can add the mapreduce release version instead. The event schema will always be corresponding to the mapreduce version.

          Show
          Sharad Agarwal added a comment - Line 3: "Schema: <schema string>" Adding event schema to each file may be too verbose, especially for small jobs. We can add the mapreduce release version instead. The event schema will always be corresponding to the mapreduce version.
          Hide
          Hong Tang added a comment -

          I'd like to propose the following that may make it easier toward future extension without requiring too much special case code to support backward compatibility.

          Line 1: The version string. We, for now, would require an exact match. And it should be "Avro". In the future, it could be "Avro/2.0".
          Line 2: "Encoder: Json" (Or "Encoder: Binary")
          Line 3: "Schema: <schema string>"
          Line 4: Empty line
          Remaining file: encoded history-event objects.

          Essentially, the structure of this file mimic the HTTP style:

          Line 1: Protocol + Version. For JobHistoryParser to determine whether it can handle the input, and if so, what backend deserialization framework to invoke.
          Line 2 until the empty line: Header. These are parameters that are needed to instantiate the deserializer.
          Remainder of the file: serialized data.

          Does that make sense?

          Show
          Hong Tang added a comment - I'd like to propose the following that may make it easier toward future extension without requiring too much special case code to support backward compatibility. Line 1: The version string. We, for now, would require an exact match. And it should be "Avro". In the future, it could be "Avro/2.0". Line 2: "Encoder: Json" (Or "Encoder: Binary") Line 3: "Schema: <schema string>" Line 4: Empty line Remaining file: encoded history-event objects. Essentially, the structure of this file mimic the HTTP style: Line 1: Protocol + Version. For JobHistoryParser to determine whether it can handle the input, and if so, what backend deserialization framework to invoke. Line 2 until the empty line: Header. These are parameters that are needed to instantiate the deserializer. Remainder of the file: serialized data. Does that make sense?
          Hide
          Doug Cutting added a comment -

          > Are you saying that maybe we will never need to have a "avro-json-2.0"?

          I think it unlikely but not impossible. If Avro incompatibly changed it's schema format then we'd probably change it to "Avro2-json" or somesuch. We'd need to change the version string if we changed the schema in some critical way that's impossible to detect by examining the schema itself, but I have a hard time imagining a change like that, and we could always instead, e.g., change the name or package of the top-level class in the schema to make such a change easy to detect. So it's unlikely but not impossible that we'll need to change the version string, and, if we do, we can append a version number to it. The version string already future-proofs us sufficiently.

          > Also, what is your opinion on whether job-history format would be fixed on a specific encoder permanently?

          Nothing's permanent. It seems possible we someday might want to permit binary too, so I added the encoding to the version string. But for now I think the proposal is to be json-only.

          What seem more probable to me is that we might someday use a standard container format like TFile or Avro's data file. These can be differentiated by use of "magic": Avro data file always begins with 'AVR\0', the json file always begins with "Avro-Json\n", etc. So it's good that the file begins with a fixed string rather than just the schema, but exactly what string it begins with is less critical.

          Show
          Doug Cutting added a comment - > Are you saying that maybe we will never need to have a "avro-json-2.0"? I think it unlikely but not impossible. If Avro incompatibly changed it's schema format then we'd probably change it to "Avro2-json" or somesuch. We'd need to change the version string if we changed the schema in some critical way that's impossible to detect by examining the schema itself, but I have a hard time imagining a change like that, and we could always instead, e.g., change the name or package of the top-level class in the schema to make such a change easy to detect. So it's unlikely but not impossible that we'll need to change the version string, and, if we do, we can append a version number to it. The version string already future-proofs us sufficiently. > Also, what is your opinion on whether job-history format would be fixed on a specific encoder permanently? Nothing's permanent. It seems possible we someday might want to permit binary too, so I added the encoding to the version string. But for now I think the proposal is to be json-only. What seem more probable to me is that we might someday use a standard container format like TFile or Avro's data file. These can be differentiated by use of "magic": Avro data file always begins with 'AVR\0', the json file always begins with "Avro-Json\n", etc. So it's good that the file begins with a fixed string rather than just the schema, but exactly what string it begins with is less critical.
          Hide
          Hong Tang added a comment -

          Maybe I miss something. I am considering a situation where the JobHistory event schemas are significantly changed and we want to prevent older version of the software from reading job-history logs produced by the newer version of the software. This is consistent with your comments here: https://issues.apache.org/jira/browse/MAPREDUCE-980?focusedCommentId=12757389&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12757389.

          Again, this would be part of the JobHistory format specification and I think it would make sense to normalize the format of the version string. Are you saying that maybe we will never need to have a "avro-json-2.0"?

          Also, what is your opinion on whether job-history format would be fixed on a specific encoder permanently?

          Show
          Hong Tang added a comment - Maybe I miss something. I am considering a situation where the JobHistory event schemas are significantly changed and we want to prevent older version of the software from reading job-history logs produced by the newer version of the software. This is consistent with your comments here: https://issues.apache.org/jira/browse/MAPREDUCE-980?focusedCommentId=12757389&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12757389 . Again, this would be part of the JobHistory format specification and I think it would make sense to normalize the format of the version string. Are you saying that maybe we will never need to have a "avro-json-2.0"? Also, what is your opinion on whether job-history format would be fixed on a specific encoder permanently?
          Hide
          Doug Cutting added a comment -

          Note that if we add or remove a field we should not need to change the version. Even if we use a totally different schema, we could determine that by looking at the schema. So we should really only need to change the version string if we stop using Avro.

          Note that even if this turns out to be wrong, we can always add a version number to the version string later. But adding one now seems like belt-and-suspenders: the schema is the version.

          Show
          Doug Cutting added a comment - Note that if we add or remove a field we should not need to change the version. Even if we use a totally different schema, we could determine that by looking at the schema. So we should really only need to change the version string if we stop using Avro. Note that even if this turns out to be wrong, we can always add a version number to the version string later. But adding one now seems like belt-and-suspenders: the schema is the version.
          Hide
          Hong Tang added a comment -

          I think we are misusing version string here. I suggest the version string be "

          {serializer}-{encoder}-{version}". E.g. "avro-json-1.0". EventReader should compare only the {serializer}

          and

          {version}

          part of the string to determine whether a job history file matches (or is compatible with) the schema of the HistoryInfo object it is about to create - and it would use the

          {encoder}

          string to decide what decoder should be used.

          Show
          Hong Tang added a comment - I think we are misusing version string here. I suggest the version string be " {serializer}-{encoder}-{version}". E.g. "avro-json-1.0". EventReader should compare only the {serializer} and {version} part of the string to determine whether a job history file matches (or is compatible with) the schema of the HistoryInfo object it is about to create - and it would use the {encoder} string to decide what decoder should be used.
          Hide
          Doug Cutting added a comment -

          Test failure is unrelated and failing for other tasks.

          No new tests are required for this change.

          Show
          Doug Cutting added a comment - Test failure is unrelated and failing for other tasks. No new tests are required for this change.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12420284/MAPREDUCE-1016.patch
          against trunk revision 817740.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/123/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/123/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/123/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/123/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12420284/MAPREDUCE-1016.patch against trunk revision 817740. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/123/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/123/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/123/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/123/console This message is automatically generated.
          Hide
          Doug Cutting added a comment -

          Here's a patch that implements this.

          Show
          Doug Cutting added a comment - Here's a patch that implements this.
          Hide
          Owen O'Malley added a comment -

          The use case that I'm considering is that we have an AJAX client that reads the job history via hftp, parses it and displays it to the user. The only work the JobTracker needs to do is give out the URL for getting the job history file. The rest is done in JavaScript on the client's browser. Until we have an Avro library in JavaScript, it will be easier to do if the job history is in JSON.

          Show
          Owen O'Malley added a comment - The use case that I'm considering is that we have an AJAX client that reads the job history via hftp, parses it and displays it to the user. The only work the JobTracker needs to do is give out the URL for getting the job history file. The rest is done in JavaScript on the client's browser. Until we have an Avro library in JavaScript, it will be easier to do if the job history is in JSON.
          Hide
          Jothi Padmanabhan added a comment -

          I think storing it in JSON is a good idea too. Nevertheless, I think the best way for other history consumers/tools to insulate themselves from the underlying changes/incompatibilities would be to use the JobHistoryParsing API's. At the least, they will abstract out the underlying storage format.

          Show
          Jothi Padmanabhan added a comment - I think storing it in JSON is a good idea too. Nevertheless, I think the best way for other history consumers/tools to insulate themselves from the underlying changes/incompatibilities would be to use the JobHistoryParsing API's. At the least, they will abstract out the underlying storage format.
          Hide
          Arun C Murthy added a comment -

          +1

          Show
          Arun C Murthy added a comment - +1
          Hide
          Milind Bhandarkar added a comment -

          Oh Thank you Owen ! Thank you, thank you, thank you !

          Show
          Milind Bhandarkar added a comment - Oh Thank you Owen ! Thank you, thank you, thank you !

            People

            • Assignee:
              Doug Cutting
              Reporter:
              Owen O'Malley
            • Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development