Hadoop Common
HADOOP-342

Design/Implement a tool to support archival and analysis of logfiles.

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.5.0
    • Component/s: None
    • Labels:
      None

      Description

      Requirements:

      a) Create a tool to support archival of logfiles (from diverse sources) in Hadoop's DFS.
      b) The tool should also support analysis of the logfiles via grep/sort primitives. The tool should allow for fairly generic pattern 'grep's and let users 'sort' the matching lines (from grep) on 'columns' of their choice.

      E.g. from hadoop logs: Look for all log-lines with 'FATAL' and sort them based on timestamps (column x) and then on column y (column x, followed by column y).

      Design/Implementation:

      a) Log Archival

      Archival of logs from diverse sources can be accomplished using the distcp tool (HADOOP-341).

      b) Log analysis

      The idea is to enable users of the tool to perform analysis of logs via grep/sort primitives.

      This can be accomplished via a relatively simple Map-Reduce job where the map greps for the given pattern via RegexMapper, and the implicit sort (reduce) is used with a custom Comparator which performs the user-specified comparison (on columns).

      The sort/grep specs can be fairly powerful by letting the user of the tool use java's in-built regex patterns (java.util.regex).
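The grep-then-sort pipeline described above can be illustrated in plain Java. This is a standalone sketch of the logic only — in the actual tool the grep runs in the map phase via RegexMapper and the sort happens in the reduce phase with a custom Comparator; the class and method names here are illustrative, not the patch's API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Standalone illustration of the grep-then-sort logic; in the real tool
// the grep is the map (RegexMapper) and the sort is the reduce with a
// custom Comparator. Names here are illustrative.
public class GrepSortSketch {

    // "map" step: keep only lines matching the user's regex pattern.
    public static List<String> grep(List<String> lines, String pattern) {
        Pattern p = Pattern.compile(pattern);
        List<String> matches = new ArrayList<>();
        for (String line : lines) {
            if (p.matcher(line).find()) {
                matches.add(line);
            }
        }
        return matches;
    }

    // "reduce" step: sort matches on the given 0-based column,
    // splitting each line on the separator.
    public static void sortByColumn(List<String> lines, String separator, int column) {
        lines.sort((a, b) -> a.split(separator, -1)[column]
                .compareTo(b.split(separator, -1)[column]));
    }

    public static void main(String[] args) {
        List<String> logs = new ArrayList<>(List.of(
                "2006-07-13 12:01:00\tFATAL\tdisk failure",
                "2006-07-13 11:59:00\tFATAL\tlost heartbeat",
                "2006-07-13 12:00:00\tINFO\tstartup"));
        List<String> fatal = grep(logs, "FATAL");     // keep FATAL lines
        sortByColumn(fatal, "\t", 0);                 // sort by timestamp column
        System.out.println(fatal.get(0));             // the earliest FATAL line prints first
    }
}
```

The lexicographic string comparison suffices here because ISO-style timestamps sort correctly as text; real column comparison (numeric columns, multiple priorities) is what the custom Comparator adds.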

      Attachments

      1. logalyzer2.patch (10 kB, Arun C Murthy)
      2. logalyzer.patch (10 kB, Arun C Murthy)


          Activity

          eric baldeschwieler added a comment -

          Sounds good. Will the output be one or more DFS files? stdout? Will it be text or a sequence file?
          Keeping everything in text seems most appropriate.

          Contributing a generic sorter sounds very valuable. Could you spec it in more detail?

          Arun C Murthy added a comment -

          Should have clarified this: the plan is to let the user specify an output directory in which a single text file will contain the output of the 'analysis'.

          Generic Sorter:

          The generic sorter basically lets the user specify a column separator and a spec for the priority of columns.
          The Comparator's compare function (implementing WritableComparable) then splits each line on the user-specified separator and compares the two resulting column sequences in the given priority order.

          E.g. -sortColumnSpec 2,0,1 -separator \t
          (0-based columns)
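A minimal sketch of such a column-priority comparison in plain Java (the class and method names are hypothetical, not the patch's actual API):

```java
// Illustrative sketch of the column-priority comparison described above;
// ColumnComparatorSketch and its compare method are hypothetical names.
public class ColumnComparatorSketch {

    // Compare two lines by splitting on 'separator' and comparing the
    // columns in the order given by 'columnSpec' (0-based indices).
    public static int compare(String a, String b, String separator, int[] columnSpec) {
        String[] colsA = a.split(separator, -1);
        String[] colsB = b.split(separator, -1);
        for (int col : columnSpec) {
            // Treat a missing column as the empty string.
            String va = col < colsA.length ? colsA[col] : "";
            String vb = col < colsB.length ? colsB[col] : "";
            int cmp = va.compareTo(vb);
            if (cmp != 0) {
                return cmp;    // first differing priority column decides
            }
        }
        return 0;              // equal on all specified columns
    }

    public static void main(String[] args) {
        // Equivalent of: -sortColumnSpec 2,0,1 -separator \t
        int[] spec = {2, 0, 1};
        String a = "x\ty\t1";
        String b = "x\tz\t1";
        // Column 2 ties ("1"), column 0 ties ("x"), column 1 decides: "y" < "z"
        System.out.println(compare(a, b, "\t", spec) < 0); // prints "true"
    }
}
```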

          If there is enough interest, I can push this into mapred.lib. Appreciate any suggestions.

          thanks,
          Arun

          Arun C Murthy added a comment -

          Here's the 'logalyzer' tool.

          Doug: I felt that it made sense to create an org.apache.hadoop.tools package for logalyzer and other such tools in the future... let me know if you prefer it to be in some other package and I'll update it accordingly.

          thanks,
          Arun

          Doug Cutting added a comment -

          I will look at your patch more closely soon.

          I think it would be good, rather than copy the logs into DFS, to use HTTP to retrieve the map input. Ideally, map tasks would be assigned to nodes where the log data is local.

          This could be implemented as an InputFormat that is parameterized by date. For example, one might specify something like:

          job.setInputFormat(LogInputFormat.class);
          job.set("log.input.start", "2006-07-13 12:00:00");
          job.set("log.input.end", "2006-07-13 15:00:00");

          The set of hosts can be determined automatically to be all hosts in the cluster. One could also specify a job id, in which case the job's start and end time would be used, or a start job id and end job id.

          We might implement parts of this by enhancing the web server run on each tasktracker, e.g., to directly support access to logs by date range.

          Does this make sense?

          eric baldeschwieler added a comment -

          It does make sense. It might make sense to do it as a second pass though.

          We've got lots of logs from various sources we want this tool to work on. In many cases loading them into hadoop is a logical first step.

          We should make sure the loading (or HTTP scanning) is distinct from the query tools.

          Arun C Murthy added a comment -

          I concur with the need for (optional?) HTTP-based map input... I'll start on it.
          (I have some ideas about generalising this infrastructure, which I'm in the process of compiling and will send over in a separate email.)

          Eric: Apologies for not clarifying this earlier: logalyzer (as-is) can be used in either mode independently or together, i.e. it can be used for archival, for analysis (assuming logs are already in a given directory), or both.

          Doug: Can we get logalyzer as-is into the tree right away, and meanwhile I'll get on to the HTTP-based map input enhancement? There is some interest in using it right away... hope it isn't too much of a problem.

          thanks,
          Arun

          Arun C Murthy added a comment -

          Summary of logalyzer usage:

          Logalyzer.0.0.1
          Usage:
          Logalyzer [-archive -logs <urlsFile>] -archiveDir <archiveDirectory> -grep <pattern> -sort <column1,column2,...> -separator <separator> -analysis <outputDirectory>

          Usage Scenarios:
          ---------------------------

          a) Archive only:

          $ java org.apache.hadoop.tools.Logalyzer -archive -logs <urlsFile> -archiveDir <archiveDirectory>

          Fetch the logs specified in <urlsFile> (an arbitrary combination of dfs- and http-based logs) and archive them in <archiveDirectory> (in the dfs).

          Archival of logs from diverse sources is accomplished using the distcp tool (HADOOP-341).

          b) Analyse only:

          $ java org.apache.hadoop.tools.Logalyzer -archiveDir <archiveDirectory> -grep <pattern> -sort <column1,column2,...> -separator <separator> -analysis <outputDirectory>

          Analyse the logs in <archiveDirectory>, i.e. grep/sort-with-separator, and store the output of the 'analysis' (as a single text file) in <outputDirectory>.

          This is accomplished via a Map-Reduce job where the map greps for the given pattern via RegexMapper, and the implicit sort (reduce) is used with a custom Comparator which performs the user-specified comparison (on columns).

          c) Archive and analyse

          $ java org.apache.hadoop.tools.Logalyzer -archive -logs <urlsFile> -archiveDir <archiveDirectory> -grep <pattern> -sort <column1,column2,...> -separator <separator> -analysis <outputDirectory>

          Perform both a) and b) tasks.


          Arun

          Doug Cutting added a comment -

          This patch requires HADOOP-341.

          Arun C Murthy added a comment -

          Here's a new patch for logalyzer incorporating changes in distcp (HADOOP-341).

          thanks,
          Arun

          Doug Cutting added a comment -

          I just committed this. Thanks!


            People

            • Assignee: Unassigned
            • Reporter: Arun C Murthy
            • Votes: 0
            • Watchers: 1
