Hive / HIVE-662

Add a method to parse apache weblogs

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.4.0
    • Labels:
      None

      Description

      Apache weblogs are one of the more common formats that people parse using Hadoop. Unfortunately, the method currently provided to process these logs in Hive has some issues and seems to be on its way out; see HIVE-519 and the comments on HIVE-520. We should replace that method with something that works better and that can be supported in the future.

          Activity

          Johan Oskarsson added a comment -

          What is the best route to take here? I would assume a custom serde is the way to go?

          Zheng Shao added a comment -

          Yes, I will work on adding a serde how-to and some examples into the new contrib directory HIVE-639 today.

          Zheng Shao added a comment -

          Fixed as a result of HIVE-167. HIVE-167 adds RegexSerDe which allows us to do the following:

          CREATE TABLE serde_regex(
            host STRING,
            identity STRING,
            user STRING,
            time STRING,
            request STRING,
            status STRING,
            size STRING,
            referer STRING,
            agent STRING)
          ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
          WITH SERDEPROPERTIES (
            "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
            "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
          )
          STORED AS TEXTFILE;
          
          LOAD DATA LOCAL INPATH "../data/files/apache.access.log" INTO TABLE serde_regex;
          LOAD DATA LOCAL INPATH "../data/files/apache.access.2.log" INTO TABLE serde_regex;
          
          SELECT * FROM serde_regex ORDER BY time;
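
To make it easier to see what the "input.regex" above captures, here is a minimal sketch that applies the same pattern (after HiveQL string unescaping) to a single log line and maps the nine capture groups onto the nine column names. It is an illustration, not part of the Hive patch: it uses Python's re module as a stand-in for the Java regex engine behind RegexSerDe, it assumes the whole line has to match, and the sample log lines and the names INPUT_REGEX, COLUMNS and parse_line are made up for this example.

```python
import re

# The "input.regex" value from the table definition, written as it reads
# after HiveQL string unescaping (\\[ becomes \[ and \" becomes ").
INPUT_REGEX = (
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") '
    r'(-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|"[^"]*") ([^ "]*|"[^"]*"))?'
)

# Column names in the same order as the capture groups.
COLUMNS = ["host", "identity", "user", "time", "request",
           "status", "size", "referer", "agent"]


def parse_line(line):
    """Return a dict of column -> captured text, or None if the line does not match.

    Assumption: the whole line must match the pattern (hypothetical helper,
    not Hive code).
    """
    m = re.fullmatch(INPUT_REGEX, line)
    if m is None:
        return None
    return dict(zip(COLUMNS, m.groups()))


# Made-up sample line in combined log format.
sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
          '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"')

row = parse_line(sample)
for name in COLUMNS:
    print(name, "=", row[name])
```

Note that the brackets around the timestamp and the double quotes around the request, referer and agent are part of the captured values, so they end up in the STRING columns as-is. The trailing referer and agent groups are optional, so a line in the shorter common log format still matches, with those two groups left unset.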
          
          
          Zheng Shao added a comment -

          The example above is from contrib/src/test/queries/clientnegative/serde_regex.q


            People

            • Assignee:
              Zheng Shao
              Reporter:
              Johan Oskarsson
            • Votes:
              1
              Watchers:
              2
