Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.9.0
    • Component/s: Storage - Other
    • Labels:
      None

      Description

Add an HTTPD logparser-based format plugin. The author has been kind enough to release the logparser project under the Apache License. It can be found here:

      <dependency>
      <groupId>nl.basjes.parse.httpdlog</groupId>
      <artifactId>httpdlog-parser</artifactId>
      <version>2.0</version>
      </dependency>


          Activity

Sudheesh Katkam added a comment - edited

          Fixed in 818f945, 46c0f2a and 4a82bc1

Parth Chandra added a comment -

          Jacques Nadeau I'm inclined to merge this latest version in. Thoughts?

ASF GitHub Bot added a comment -

          Github user cgivre commented on the issue:

          https://github.com/apache/drill/pull/607

          @chunhui-shi
I've actually been thinking about writing a generic log parser for Drill in which the user would provide a regex with groups and a list of fields. For instance, consider the following log file:

          ```
          070823 21:00:32 1 Connect root@localhost on test1
          070823 21:00:48 1 Query show tables
          070823 21:00:56 1 Query select * from category
          070917 16:29:01 21 Query select * from location
          070917 16:29:12 21 Query select * from location where id = 1 LIMIT 1
          ```
          You can't really split this by space or tab, and dissecting it with various string slicing functions would lead to some very complex and ugly queries. But with the following regex:
```
^(\d{6}\s\d{2}:\d{2}:\d{2})\s+(\d+)\s(\w+)\s+(.+)$
```
          You can extract all the fields and query them.
          With respect to the HTTPD log parser, the log parser accepts a format string in the configuration (https://issues.apache.org/jira/browse/DRILL-3423) and with that you can parse any kind of HTTPD log.
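
For illustration, extracting those groups in plain Java might look like this (a standalone sketch with made-up field names, not Drill plugin code):

```
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogLineExample {
  // One capture group per field: timestamp, thread id, command, argument.
  private static final Pattern LINE =
      Pattern.compile("^(\\d{6}\\s\\d{2}:\\d{2}:\\d{2})\\s+(\\d+)\\s(\\w+)\\s+(.+)$");

  public static void main(String[] args) {
    Matcher m = LINE.matcher("070823 21:00:48 1 Query show tables");
    if (m.matches()) {
      System.out.println("time    = " + m.group(1)); // 070823 21:00:48
      System.out.println("id      = " + m.group(2)); // 1
      System.out.println("command = " + m.group(3)); // Query
      System.out.println("args    = " + m.group(4)); // show tables
    }
  }
}
```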

Parth Chandra added a comment -

          Charles Givre I'll take a look at trying to get the merge conflicts resolved. Give me a bit of time though.
          The UDF sounds great. What license is the one that Niels has? It might be compatible with Apache. (see https://www.apache.org/legal/resolved.html#category-a)

Charles Givre added a comment - edited

          I'm still working out how to use git, but I thought I'd explain what I've done so far:
1. I cleaned up the field names so that they are a little more user-friendly, i.e. HTTP_PATH:request_firstline_uri_path is now request_firstline_uri_path.
2. I wrote two UDFs which I'll include with the PR, parse_url() and parse_query(). Parse_query splits up a query string and returns a map of the key-value pairs, e.g. parse_query('url?arg1=x&arg2=y') would return:
          {
          'arg1': 'x',
          'arg2': 'y'
          }
Parse_url takes a URL and returns a map of its various components. It is basically a wrapper for java.net.URL, returning the port, path, query string, protocol, host and reference. Is this acceptable to everyone? I think, more or less, it follows what Jacques Nadeau described in his earlier comments.
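
For reference, a minimal standalone sketch of what such a java.net.URL wrapper could look like (illustrative only, not the code in the PR; the Drill UDF boilerplate is omitted):

    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class UrlParseSketch {

      // Rough equivalent of parse_url(): break a URL into its components.
      static Map<String, String> parseUrl(String spec) throws MalformedURLException {
        URL url = new URL(spec);
        Map<String, String> out = new LinkedHashMap<>();
        out.put("protocol", url.getProtocol());
        out.put("host", url.getHost());
        out.put("port", String.valueOf(url.getPort()));
        out.put("path", url.getPath());
        out.put("querystring", url.getQuery());
        out.put("ref", url.getRef());
        return out;
      }

      // Rough equivalent of parse_query(): split "arg1=x&arg2=y" into a map.
      static Map<String, String> parseQuery(String query) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String pair : query.split("&")) {
          int eq = pair.indexOf('=');
          if (eq > 0) {
            out.put(pair.substring(0, eq), pair.substring(eq + 1));
          }
        }
        return out;
      }
    }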

          I'd really also like to add a user agent UDF parser as well. Niels has one that looked good, but it isn't under the Apache license.

Charles Givre added a comment -

          I’d be happy to. I’m traveling the next few days, but when I get back, I’ll do that.
          Thanks,

Parth Chandra added a comment -

Charles Givre Since it looks like you're working on this, can I suggest that when you put together your PR, you include Jacques' original commit from his branch as well as Jim's changes on top of that, so that all parties get due credit?
Also, Jacques' original branch has a ComplexWriterFacade class which might be useful in writing the complex fields.
          If you could post a link to your branch, we can assist more.

Jim Scott added a comment -

Jacques, if you can point me to an example of creating a function within Drill that I could easily adapt, I might very well be able to use the parsing functionality that exists within the parser to create the complex data types, which would then nicely surface that functionality of the parser to all of Drill.

Jim Scott added a comment -

          Jacques,

Given everything you have said here, I can see value in making some changes. I think that in order to move in that direction, however, there are a considerable number of details not yet covered; I have tried to capture them all below. I agree on the ideas of the functions, and have listed the ones you suggested here in addition to others that would need to be covered. These issues must be resolved in order to move in this direction.

          Considerations

User must specify a name that Drill understands, and that can be mapped into a name the parser understands

Option – There needs to be a mapping for every available format string so that the user can query that field (see the table of mappings – the user will reference fields with underscores, not dots).

          Format String Variable Name Type
          %a connection.client.ip IP
          %{c}a connection.client.peerip IP
          %A connection.server.ip IP
          %B response.body.bytes BYTES
          %b response.body.bytesclf BYTES
          %{Foobar}C request.cookies.* HTTP.COOKIE
          %D server.process.time MICROSECONDS
          %{Foobar}e server.environment.* VARIABLE
          %f server.filename FILENAME
          %h connection.client.host IP
          %H request.protocol PROTOCOL
%{Foobar}i request.header.* HTTP.HEADER
          %k connection.keepalivecount NUMBER
          %l connection.client.logname NUMBER
          %L request.errorlogid STRING
          %m request.method HTTP.METHOD
          %{Foobar}n server.module_note.* STRING
          %{Foobar}o response.header.* HTTP.HEADER
          %p request.server.port.canonical PORT
          %{canonical}p connection.server.port.canonical PORT
          %{local}p connection.server.port PORT
          %{remote}p connection.client.port PORT
          %P connection.server.child.processid NUMBER
          %{pid}P connection.server.child.processid NUMBER
          %{tid}P connection.server.child.threadid NUMBER
          %{hextid}P connection.server.child.hexthreadid NUMBER
          %q request.querystring HTTP.QUERYSTRING
          %r request.firstline HTTP.FIRSTLINE
          %R request.handler STRING
          %s request.status.original STRING
          %>s request.status.last STRING
          %t request.receive.time TIME.STAMP
          %{msec}t request.receive.time.begin.msec TIME.EPOCH
          %{begin:msec}t request.receive.time.begin.msec TIME.EPOCH
          %{end:msec}t request.receive.time.end.msec TIME.EPOCH
          %{usec}t request.receive.time.begin.usec TIME.EPOCH.USEC
          %{begin:usec}t request.receive.time.begin.usec TIME.EPOCH.USEC
          %{end:usec}t request.receive.time.end.usec TIME.EPOCH.USEC
          %{msec_frac}t request.receive.time.begin.msec_frac TIME.EPOCH
          %{begin:msec_frac}t request.receive.time.begin.msec_frac TIME.EPOCH
          %{end:msec_frac}t request.receive.time.end.msec_frac TIME.EPOCH
          %{usec_frac}t request.receive.time.begin.usec_frac TIME.EPOCH.USEC_FRAC
          %{begin:usec_frac}t request.receive.time.begin.usec_frac TIME.EPOCH.USEC_FRAC
          %{end:usec_frac}t request.receive.time.end.usec_frac TIME.EPOCH.USEC_FRAC
          %T response.server.processing.time SECONDS
          %u connection.client.user STRING
          %U request.urlpath URI
          %v connection.server.name.canonical STRING
          %V connection.server.name STRING
          %X response.connection.status HTTP.CONNECTSTATUS
          %I request.bytes BYTES
          %O response.bytes BYTES
          %{cookie}i request.cookies HTTP.COOKIES
          %{set-cookie}o response.cookies HTTP.SETCOOKIES
          %{user-agent}i request.user-agent HTTP.USERAGENT
          %{referer}i request.referer HTTP.URI

          There are fields which could be parsed and selected by the user that are complex (URL, URI, query string)

          Option – Provide a function to parse urls into map

          { 
            protocol: "...", 
            user: "...", 
            password: "...", 
            host: "...", 
            port: "...", 
            path: "...", 
            query: "...", 
            fragment: "..."
          }
          

          Option – Provide a function to parse a query string into (users can use kvgen on this if they need to)

          {
            "fieldName1": "fieldValue1", 
            "fieldName2": "fieldValue2", 
            ... 
          }
          

          There are fields which could be parsed and selected by the user that are arbitrary (cookies, headers, etc..)

          Option – Cookies are named and contain (domain, expires, path, value)

          [ 
            name: {
              domain: "...", 
              expires: "...", 
              path: "...", 
              value: "..."
            }, 
            ... 
          ]
          

          Issue to Address
There are details in the format string represented by Foobar (e.g. header names) that cannot necessarily be identified beforehand; these must be accounted for, or else the parser won't be completely effective and the user will not be able to query headers, etc. that exist in the log.

          Other Possible Issues

          Who is going to write the functions to expose the functionality for all Drill queries?

Tomer Shiran added a comment -

          I agree we shouldn't expand a date into multiple parts when we already have a date/timestamp type.

For the functions you mentioned, I think we should look at the functions available in Python (urllib and friends), JavaScript, or relational databases.

Jacques Nadeau added a comment - edited

          Here is my alternative proposal:

          With the log format above:

          "%h %t \"%r\" %>s %b \"%{Referer}i\""
          

          I propose a user gets the following fields (in order)

          remote_host (varchar)
          request_receive_time (timestamp)
          request_method (varchar)
          request_uri (varchar)
          response_status (int)
          response_bytes (bigint)
          header_referer (varchar)

          Additionally, I think we should provide two new functions:

          parse_url(varchar url)
          parse_url_query(varchar querystring, varchar pairDelimiter, varchar keyValueDelimiter)

          parse_url(varchar) would provide an output of map type similar to:

{
  protocol: ...,
  user: ...,
  password: ...,
  host: ...,
  port: ...,
  path: ...,
  query: ...,
  fragment: ...
}
          

          parse_url_query(...) would return an array of key values:

          [
            {key: "...", value: "..."},
            {key: "...", value: "..."},
            {key: "...", value: "..."},
            {key: "...", value: "..."}
          ]
          

In response to your proposal: I don't think it makes sense to return many fields for a date field; Drill already provides functionality to get parts of a date. I also don't think it makes sense to prefix a field with its datatype — we don't do that anywhere else in Drill. We should also expose parsing as an optional behavior in Drill. Note also that my proposal substantially reduces the number of fields exposed to the user. I think this proposal has much better usability in the context of SQL.

If you want to take advantage of the underlying format's capabilities, you can treat that as a pushdown of a particular function (date part or the URL parsing functions above).
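
To make the delimiter parameters concrete, here is a sketch of the generic splitting parse_url_query(...) implies (illustrative Java, not a Drill UDF):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Pattern;

    public class QuerySplitSketch {
      // parse_url_query("a=1&b=2", "&", "=") -> [[a, 1], [b, 2]]
      static List<String[]> parseUrlQuery(String qs, String pairDelim, String kvDelim) {
        List<String[]> out = new ArrayList<>();
        for (String pair : qs.split(Pattern.quote(pairDelim))) {
          String[] kv = pair.split(Pattern.quote(kvDelim), 2);
          out.add(new String[] {kv[0], kv.length > 1 ? kv[1] : null});
        }
        return out;
      }
    }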

Jim Scott added a comment -

I have made some modifications so that fields previously ending in :map now end with _$ for maps of data.

          When the parser has fields like:
          TIME.DAY:request.receive.time.day_utc
          They will now be identified as:
          TIME_DAY:request_receive_time_day__utc
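
If I read the example right, the rename amounts to doubling any existing underscores and then turning dots into underscores; a quick sanity check in Java (my reading, not code from the patch):

    String parserName = "TIME.DAY:request.receive.time.day_utc";
    String drillName = parserName.replace("_", "__").replace(".", "_");
    // drillName == "TIME_DAY:request_receive_time_day__utc"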

          The type remapping capability is to prefix the field name with a # like:
          #HTTP_URI:request_firstline_uri_query_myvariable

          Additionally, due to these changes, I have removed the fields mapping completely from the bootstrap and the user configuration which should make this easier for the user.

          I believe the documentation for this plugin will be very straightforward and yield a solid user experience.

Jim Scott added a comment -

To start fresh on this topic: my understanding of the capabilities of this parser grew tenfold while building this implementation. I do feel that it is already built in such a way that it will deliver the most flexibility and power to the user. That being said, I'm open to discussing the whys and why-nots on this, because I think this is one of the most important file formats we can add to Drill.

          On to the present...
          I think we will be best served by using these examples with enough description so that we are being very specific and not speaking in generalities.

As of right now, with this logFormat: "%h %t \"%r\" %>s %b \"%{Referer}i\""
          this query: select * from dfs.`jimslogfile.log`
          with NO user configuration

          Drill will yield these fields to the user:
          TIME_STAMP:request_receive_time
          TIME_DAY:request_receive_time_day
          TIME_MONTHNAME:request_receive_time_monthname
          TIME_MONTH:request_receive_time_month
          TIME_WEEK:request_receive_time_weekofweekyear
          TIME_YEAR:request_receive_time_weekyear
          TIME_YEAR:request_receive_time_year
          TIME_HOUR:request_receive_time_hour
          TIME_MINUTE:request_receive_time_minute
          TIME_SECOND:request_receive_time_second
          TIME_MILLISECOND:request_receive_time_millisecond
          TIME_ZONE:request_receive_time_timezone
          TIME_EPOCH:request_receive_time_epoch
          TIME_DAY:request_receive_time_day_utc
          TIME_MONTHNAME:request_receive_time_monthname_utc
          TIME_MONTH:request_receive_time_month_utc
          TIME_WEEK:request_receive_time_weekofweekyear_utc
          TIME_YEAR:request_receive_time_weekyear_utc
          TIME_YEAR:request_receive_time_year_utc
          TIME_HOUR:request_receive_time_hour_utc
          TIME_MINUTE:request_receive_time_minute_utc
          TIME_SECOND:request_receive_time_second_utc
          TIME_MILLISECOND:request_receive_time_millisecond_utc
          IP:connection_client_host
          HTTP_FIRSTLINE:request_firstline
          HTTP_METHOD:request_firstline_method
          HTTP_URI:request_firstline_uri
          HTTP_PROTOCOL:request_firstline_uri_protocol
          HTTP_USERINFO:request_firstline_uri_userinfo
          HTTP_HOST:request_firstline_uri_host
          HTTP_PORT:request_firstline_uri_port
          HTTP_PATH:request_firstline_uri_path
          HTTP_QUERYSTRING:request_firstline_uri_query
          STRING:request_firstline_uri_query:map
          HTTP_REF:request_firstline_uri_ref
          HTTP_PROTOCOL:request_firstline_protocol
          HTTP_PROTOCOL_VERSION:request_firstline_protocol_version
          HTTP_URI:request_referer
          HTTP_PROTOCOL:request_referer_protocol
          HTTP_USERINFO:request_referer_userinfo
          HTTP_HOST:request_referer_host
          HTTP_PORT:request_referer_port
          HTTP_PATH:request_referer_path
          HTTP_QUERYSTRING:request_referer_query
          STRING:request_referer_query:map
          HTTP_REF:request_referer_ref
          STRING:request_status_last
          BYTES:response_body_bytesclf

I believe the benefit of this is that the user will be able to easily refine and figure out what they are looking for, which will allow them to then optimize the parsing by adding specific fields to the configuration file. This could be copy-and-paste style if we change the plugin configuration to use _ instead of . as mentioned in my previous comment, which I would be good with, as it would certainly make it easier for the user and reduce the likelihood of configuration mistakes.

Removing the first part of the field name ("HTTP_URI:") would clean up the names, but while it is cleaner, it doesn't simplify the user experience in my opinion. I also don't believe that allowing a user to map those fields to different names improves the user experience; I would actually argue that it would detract from it by introducing the possibility of confusion or mistakes (we know users mess up configurations all the time, and these are difficult for beginners to troubleshoot).

          With respect to nesting the data in maps, I think the only time we would want to do that is when there is a wildcard they are trying to capture. The reason being, to me, when I think about parsing a log line in any application, I expect to get a flat, tabular type of result set. I wouldn't be expecting complex data structures to come back.

Jacques Nadeau added a comment -

          I think everyone is focusing too much on what the parser is capable of doing. That should be the last thing we focus on. We should start with the user API. Let's take an example log file format and decide what the output table should look like. Then let's talk about how we could vary things to provide more flexibility.

          I proposed a particular format. When you guys saw it, you thought that we needed more flexibility. I then proposed a modification to provide flexibility around mapping between log file fields and table fields.

          Niels Basjes, I appreciate your statements about the flexibility of the plugin and agree it is very powerful. What we need to figure out is what is the right way to expose that power in a SQL context. It doesn't make sense for Drill to support custom dissectors. If someone wanted to provide that capability, they would implement a Drill UDF (a similarly easy thing to implement).

          Jim Scott, with regards to your comment "This model makes it extremely difficult to support mapping of data types", my whole suggestion there was to expose more flexibility by using the mapping suggestion above. I'm thinking that maybe I wasn't clear enough in my recommendation and you misunderstood what I was suggesting.

          So let's start with what a user would want. Then figure out how to implement that. I think that will make this discussion substantially less conceptual.

Jim Scott added a comment -

Regarding configuration, I would like to reconsider allowing the user to configure the fields they want parsed with _ instead of dots, so that what they select in their query matches directly with what is in the plugin configuration. From a code perspective this is very easy to change, and it could yield a smaller learning curve for the user.

e.g. the config would allow them to specify
STRING:request_status_last instead of STRING:request.status.last
Their query would then match:
select `STRING:request_status_last` from dfs.`file.httpd.log`

Niels Basjes added a comment -

          I would also like to stress that I created this parser to be pluggable and flexible by design.

If you look, for example, at the 'UserData' cookie in this example, the value is really
Username:JULINHO:Homepage:1:ReReg:0:Trialist:0:Language:en:Ccode:br:ForceReReg:0.
If a company needs to analyze clicks by Username or Language, they can create their own Dissector to pull this specific thing apart and use it just like all the other values.
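
To make that concrete, the alternating key:value layout of that cookie value could be pulled apart with plain string handling (an illustrative snippet, not the parser's actual Dissector API):

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class UserDataSketch {
      // "Username:JULINHO:Homepage:1:..." -> {Username=JULINHO, Homepage=1, ...}
      static Map<String, String> splitUserData(String value) {
        Map<String, String> out = new LinkedHashMap<>();
        String[] parts = value.split(":");
        for (int i = 0; i + 1 < parts.length; i += 2) {
          out.put(parts[i], parts[i + 1]);
        }
        return out;
      }
    }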

Jim Scott added a comment -

          Jacques,

          I'm not sure I follow this comment "We should also avoid the use of dot delimiters being automatically generated by Drill."

Am I correct that your concern is specifically with the configuration of the plugin and the mapping of the field names?

Here is the problem I have with creating the mappings in the configuration:
1. There are WAY more ways the parser can parse a field than are logical for us to create mappings for (e.g. a time field will yield both a timezone-based result and a UTC-based one).
2. By providing a mapping within the Drill plugin, we have to expose every default for anything that may show up in the log parser (e.g. if a new feature shows up in the log parser, we wouldn't be able to expose it until we make a change in the plugin).

Regarding wildcard maps of data, I can just as easily remove the :map from the end of the field name. I'm indifferent, really; I put it on there to make it blatantly obvious.

          As for creating maps like this example:

                      case "IP:connection.client.ip":
                        add(parser, path, writer.rootAsMap().map("client").varChar("ip"));
                        break;
                      case "IP:connection.client.peerip":
                        add(parser, path, writer.rootAsMap().map("client").varChar("peer_ip"));
                        break;
                      case "IP:connection.server.ip":
                        add(parser, path, writer.rootAsMap().map("server").varChar("ip"));
          

This model makes it extremely difficult to support mapping of data types; it assumes that those fields are varChar and nothing else. Also, based on the life cycle of creating maps within Drill, I don't think this is the most logical approach to take. Putting the technical details aside, as a user I don't know that I benefit from nesting the data into maps. While from a data structure perspective I understand why someone might want to do this, from a query perspective I think it makes querying the data more difficult.

ASF GitHub Bot added a comment -

          Github user kingmesal commented on the pull request:

          https://github.com/apache/drill/pull/234#issuecomment-153927870

          I think this looks better now. Let me know if I am otherwise mistaken.

Jacques Nadeau added a comment -

          The pull request (#234) is very different from what we discussed on this JIRA. I think it adds flexibility at the cost of usability. This is done by removing a standard simplified mapping from the initial plugin code. Since Drill supports map and arrays, we shouldn't use complicated field names to express hierarchical relationships. We should also avoid the use of dot delimiters being automatically generated by Drill. I think we need to resolve these concerns before merging this patch. As such, I'm moving this to the 1.4 release.

ASF GitHub Bot added a comment -

          Github user jacques-n commented on the pull request:

          https://github.com/apache/drill/pull/234#issuecomment-153838703

Can you please repost this pull request with a clean rebase and separate commits? We need to see two commits on top of a master branch: mine, and then your enhancements on top. That way I can easily focus on your enhancements. (Also, we don't allow merges on Drill, so you'll need to use rebase.) This ensures that the random merge changes currently present don't end up as part of your pull request.

ASF GitHub Bot added a comment -

          GitHub user kingmesal opened a pull request:

          https://github.com/apache/drill/pull/234

          DRILL-3423

This pull request is waiting for Maven Central to have the latest parser library (2.3), which it depends on: http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22httpdlog-parser%22

          This was passing all tests in my build environment.

          This storage format plugin has complete support for:

          • Full Parsing Pushdown
          • Type Remapping
          • Maps (e.g. like query string parameters)
          • Multiple log formats in the same storage definition

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/kingmesal/drill DRILL-3423

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/drill/pull/234.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #234



Jacques Nadeau added a comment -

          The idea of the outputPath is where to position that value in the output record.

          For example, the first three default I used were:

                      case "IP:connection.client.ip":
                        add(parser, path, writer.rootAsMap().map("client").varChar("ip"));
                        break;
                      case "IP:connection.client.peerip":
                        add(parser, path, writer.rootAsMap().map("client").varChar("peer_ip"));
                        break;
                      case "IP:connection.server.ip":
                        add(parser, path, writer.rootAsMap().map("server").varChar("ip"));
          

          those would become:

mapping: [
  {httpdPath: "IP:connection.client.ip", outputPath: "client.ip"},
  {httpdPath: "IP:connection.client.peerip", outputPath: "client.peer_ip"},
  {httpdPath: "IP:connection.server.ip", outputPath: "server.ip"}
]
          

          In this example, "client.ip" would mean put this value in a field called ip inside a map called client.
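
A sketch of how such an outputPath could be resolved against the writer (my reading of the proposal; the writer/add calls follow the snippets above, and the helper name is hypothetical):

    // Hypothetical helper: walk "client.ip" down nested maps, ending in a varchar field.
    // MapWriter is Drill's BaseWriter.MapWriter from the complex writer API.
    private void addMapped(Parser<ComplexWriterFacade> parser, String httpdPath, String outputPath) {
      String[] segments = outputPath.split("\\.");
      MapWriter map = writer.rootAsMap();
      for (int i = 0; i < segments.length - 1; i++) {
        map = map.map(segments[i]); // descend into (or create) the nested map
      }
      add(parser, httpdPath, map.varChar(segments[segments.length - 1]));
    }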

Jim Scott added a comment -

          I like that idea.

One question though: what is the outputPath variable used for? I get the purpose and use of httpdPath and mapping, I'm just not sure I get outputPath.

          Thanks!

Jacques Nadeau added a comment -

          Based on Niels Basjes comments, I think it makes sense to add an additional property as part of the HttpdLogFormatConfig that provides a set of mappings.

          @JsonTypeName("httpd")
          public static class HttpdLogFormatConfig implements FormatPluginConfig {
            public String format;
          
            @JsonTypeName("text") @JsonInclude(Include.NON_DEFAULT)
            public List<Mapping> mapping
          }
          
          public static class HttpdLogMapping {
            public String httpdPath;
            public String outputPath;
          }
          
          

Then we can set it to a default mapping (like I originally did) unless the user has special needs. Drill's primary mechanism for remapping is SQL projections and views, as opposed to exposing it at the format plugin level. We would need to parse the outputPath according to a Drill map path and use the type detection as Niels suggested. I'm probably not going to make much progress on this for some time, so if someone else wants to move it forward, they should feel free.

Jacques Nadeau added a comment -

This is great feedback. I'll try to think about the best way to incorporate it. My thought is that we could have a default target mapping as I implemented it, but support an alternative if the user decided to provide one.

Jim Scott added a comment -

Jacques Nadeau Have you given Niels' suggestions any thought? It would be great to see this get into Drill sooner rather than later.

Niels Basjes added a comment -

I took a quick stab at the addAsParseTarget method (I didn't even run or test this) to show you how to use the casting options that the parser has built in.
Effectively, I would move the mappings out into a property file and then do:

                       EnumSet<Casts> casts = parser.getCasts(path);
                        if (casts.contains(Casts.DOUBLE)) {
                          add(parser, path, writer.rootAsMap().map(mapping.key()).float8(mapping.value()));
                        } else if (casts.contains(Casts.LONG)) {
                          add(parser, path, writer.rootAsMap().map(mapping.key()).bigInt(mapping.value()));
                        } else {
                          add(parser, path, writer.rootAsMap().map(mapping.key()).varChar(mapping.value()));
                        }
          

          Here is the entire snippet I hacked at

                private void addParseTargetMapping(Map<String,Pair<String,String>> mappings, String path, String map, String name){
                  mappings.put(path, new Pair<String,String>(map, name));
                }
                public void addAsParseTarget(Parser<ComplexWriterFacade> parser) {
                  try {
                    Map<String,Pair<String,String>> mappings = new HashMap<>();
                    
                    // TODO: Move this to a property file
                    addParseTargetMapping(mappings, "NUMBER:connection.keepalivecount", "client","keepalivecount");
                    addParseTargetMapping(mappings, "NUMBER:connection.client.logname" ,"request","logname");
                    addParseTargetMapping(mappings, "STRING:request.errorlogid" ,"request","errorlogid");
                    addParseTargetMapping(mappings, "HTTP.METHOD:request.method" ,"request","method");
                    addParseTargetMapping(mappings, "PORT:request.server.port.canonical" ,"server","canonical_port");
                    addParseTargetMapping(mappings, "PORT:connection.server.port.canonical" ,"server","canonical_port");
                    addParseTargetMapping(mappings, "PORT:connection.client.port" ,"client","port");
                    addParseTargetMapping(mappings, "NUMBER:connection.server.child.processid" ,"server","process_id");
                    addParseTargetMapping(mappings, "NUMBER:connection.server.child.threadid" ,"server","thread_id");
                    addParseTargetMapping(mappings, "STRING:connection.server.child.hexthreadid" ,"connection","hex_thread_id");
                    addParseTargetMapping(mappings, "HTTP.QUERYSTRING:request.querystring" ,"","");
                    addParseTargetMapping(mappings, "HTTP.FIRSTLINE:request.firstline" ,"","");
                    addParseTargetMapping(mappings, "STRING:request.handler" ,"request","handler");
                    addParseTargetMapping(mappings, "STRING:request.status.original" ,"request","status_original");
                    addParseTargetMapping(mappings, "STRING:request.status.last" ,"request","status_last");
                    addParseTargetMapping(mappings, "TIME.STAMP:request.receive.time" ,"request","timestamp");
                    addParseTargetMapping(mappings, "TIME.EPOCH:request.receive.time.begin.msec" ,"request","begin_msec");
                    addParseTargetMapping(mappings, "TIME.EPOCH:request.receive.time.end.msec" ,"request","end_msec");
                    addParseTargetMapping(mappings, "TIME.EPOCH.USEC:request.receive.time.begin.usec" ,"request","begin_usec");
                    addParseTargetMapping(mappings, "TIME.EPOCH.USEC:request.receive.time.end.usec" ,"request","end_usec");
                    addParseTargetMapping(mappings, "TIME.EPOCH:request.receive.time.begin.msec_frac" ,"request","begin_msec_frac");
                    addParseTargetMapping(mappings, "TIME.EPOCH:request.receive.time.end.msec_frac" ,"request","end_msec_frac");
                    addParseTargetMapping(mappings, "TIME.EPOCH.USEC_FRAC:request.receive.time.begin.usec_frac" ,"request","begin_usec_frac");
                    addParseTargetMapping(mappings, "TIME.EPOCH.USEC_FRAC:request.receive.time.end.usec_frac" ,"request","end_usec_frac");
                    addParseTargetMapping(mappings, "SECONDS:response.server.processing.time" ,"response","processing_time");
                    addParseTargetMapping(mappings, "STRING:connection.client.user" ,"client","user");
                    addParseTargetMapping(mappings, "URI:request.urlpath" ,"request","url");
                    addParseTargetMapping(mappings, "STRING:connection.server.name.canonical" ,"server","canonical_name");
                    addParseTargetMapping(mappings, "STRING:connection.server.name" ,"server","name");
                    addParseTargetMapping(mappings, "HTTP.CONNECTSTATUS:response.connection.status" ,"response","connection_status");
                    addParseTargetMapping(mappings, "BYTES:request.bytes" ,"request","bytes");
                    addParseTargetMapping(mappings, "BYTES:response.bytes" ,"response","bytes");
                    addParseTargetMapping(mappings, "HTTP.COOKIES:request.cookies" ,"request","cookies");
                    addParseTargetMapping(mappings, "HTTP.SETCOOKIES:response.cookies" ,"response","cookies");
                    addParseTargetMapping(mappings, "HTTP.USERAGENT:request.user-agent" ,"request","useragent");
                    addParseTargetMapping(mappings, "HTTP.URI:request.referer" ,"request","referer");
                    addParseTargetMapping(mappings, "HTTP.METHOD:method" ,"request","method");
                    addParseTargetMapping(mappings, "HTTP.URI:uri" ,"request","uri");
                    addParseTargetMapping(mappings, "HTTP.PROTOCOL:protocol" ,"request","protocol");
                    addParseTargetMapping(mappings, "HTTP.PROTOCOL.VERSION:protocol.version" ,"request","protocol_version");
                    addParseTargetMapping(mappings, "HTTP.METHOD:request.firstline.method" ,"request","method");
                    addParseTargetMapping(mappings, "HTTP.URI:request.firstline.uri" ,"request","uri");
                    addParseTargetMapping(mappings, "HTTP.PROTOCOL:request.firstline.protocol" ,"request","protocol");
                    addParseTargetMapping(mappings, "HTTP.PROTOCOL.VERSION:request.firstline.protocol.version" ,"request","protocol_version");
          
                    // Bind a Drill writer for every path the parser can produce; mapped
                    // paths go into a named map, unmapped ones land at the root with '.' -> '_'.
                    for (final String path : parser.getPossiblePaths()) {
                      EnumSet<Casts> casts = parser.getCasts(path);
                      Pair<String, String> mapping = mappings.get(path);
          
                      if (mapping == null) {
                        final String noPeriodPath = path.replace(".", "_");
                        if (casts.contains(Casts.DOUBLE)) {
                          add(parser, path, writer.rootAsMap().float8(noPeriodPath));
                        } else if (casts.contains(Casts.LONG)) {
                          add(parser, path, writer.rootAsMap().bigInt(noPeriodPath));
                        } else {
                          add(parser, path, writer.rootAsMap().varChar(noPeriodPath));
                        }
                      } else {
                        if (casts.contains(Casts.DOUBLE)) {
                          add(parser, path, writer.rootAsMap().map(mapping.key()).float8(mapping.value()));
                        } else if (casts.contains(Casts.LONG)) {
                          add(parser, path, writer.rootAsMap().map(mapping.key()).bigInt(mapping.value()));
                        } else {
                          add(parser, path, writer.rootAsMap().map(mapping.key()).varChar(mapping.value()));
                        }
                      }
                    }
          
          
                  } catch (MissingDissectorsException | SecurityException | NoSuchMethodException | InvalidDissectorException e) {
                    throw handleAndGenerate("Failure while setting up log mappings.", e);
                  }
                }
              }
            }
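
          Following up on the TODO in the snippet, a minimal sketch of externalizing the mapping table; the whitespace-separated file layout and the loader class are assumptions for illustration, not part of any patch:

                import java.io.BufferedReader;
                import java.io.InputStream;
                import java.io.InputStreamReader;
                import java.nio.charset.StandardCharsets;
                import java.util.HashMap;
                import java.util.Map;

                public class MappingLoader {
                  // Each line: <parserPath> <drillMap> <drillField>, e.g.
                  // HTTP.METHOD:request.method request method
                  public static Map<String, String[]> load(InputStream in) throws Exception {
                    Map<String, String[]> mappings = new HashMap<>();
                    try (BufferedReader reader =
                             new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
                      String line;
                      while ((line = reader.readLine()) != null) {
                        line = line.trim();
                        if (line.isEmpty() || line.startsWith("#")) {
                          continue; // skip blanks and comments
                        }
                        String[] parts = line.split("\\s+");
                        if (parts.length != 3) {
                          continue; // malformed line; silently skipped in this sketch
                        }
                        mappings.put(parts[0], new String[] {parts[1], parts[2]});
                      }
                    }
                    return mappings;
                  }
                }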
          
          nielsbasjes Niels Basjes added a comment -

          Found a nasty typo: NUBMER (right before connection.server.child.processid)

          jnadeau Jacques Nadeau added a comment -

          Q1: I should provide better comments in the code. Vector memory allocations work in powers of two, and a VarChar vector uses n+1 slots when allocating data. As such, if we make batches 4095 records in size, then varchar allocations will be 4096 slots and we will have minimal wastage from power-of-two rounding. If we chose 4096, then varchar allocations would need 4097 slots, so the underlying allocation would be rounded up to 8192 with virtually half of it wasted.
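
          For illustration, a minimal sketch of that arithmetic (this is not Drill's allocator code, just the rounding rule described above):

                public class BatchSizeDemo {
                  // Round n up to the next power of two.
                  static long nextPowerOfTwo(long n) {
                    long highest = Long.highestOneBit(n);
                    return (highest == n) ? n : highest << 1;
                  }

                  public static void main(String[] args) {
                    for (int batch : new int[] {4095, 4096}) {
                      long slots = batch + 1;                 // offset slots a VarChar vector needs
                      long allocated = nextPowerOfTwo(slots); // power-of-two rounding
                      System.out.printf("batch=%d slots=%d allocated=%d wasted=%d%n",
                          batch, slots, allocated, allocated - slots);
                    }
                  }
                }

          This prints waste 0 for a 4095-record batch and 4095 wasted slots for a 4096-record batch.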

          Q2: My plan was actually to write a blog post around this plugin so people could use it as a model. (One of the reasons I actually kept in a single file.) I wanted to get something up for feedback but will be working on adding javadocs to clarify things.

          Q3: Good point. We should implement a new FormatMatcher for access logs that recognizes this pattern. Can you provide a couple of examples and maybe propose a format matching algorithm?
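
          As a starting point for discussion, a hedged sketch of the filename side of such a matcher; the accepted patterns (access/error prefix plus an optional date or sequence suffix) are assumptions to be refined against real samples:

                import java.util.regex.Pattern;

                // Hypothetical filename check for rotated httpd logs such as
                // access_log.2015-07-01, error.log-20150701, or access.log.1.gz.
                public class RotatedLogNameMatcher {
                  private static final Pattern ROTATED = Pattern.compile(
                      "^(access|error)[._-]log([._-](\\d{8}|\\d{4}-\\d{2}-\\d{2}|\\d+))?(\\.gz)?$");

                  public static boolean looksLikeHttpdLog(String fileName) {
                    return ROTATED.matcher(fileName).matches();
                  }
                }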

          jnadeau Jacques Nadeau added a comment -

          Q1: The main reason is that Drill is targeting analysts rather than developers. We are very focused on separating out data definition from business rules. The user should have to provide no more information than is necessary to interact with a new data source. In the case of an Apache HTTPD log, the only thing that is needed is a format string. From there, a user can use the SQL interface to create alternative views, etc (things that support their particular business needs). The future goal is to make more formats self-describing directly (as we have already done with Parquet) or indirectly using what we call a .drill file. This is the same pattern that we use for JSON, Avro, HBase, etc. It allows non-technical users to interact with new data quickly and easily. (Note that this also works better in Drill because we have first-class capabilities around complex data and the JSON document model.)
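
          For illustration, a format-string-only configuration could look roughly like this in a storage plugin's formats section (the attribute names here are assumptions for the sketch, not the final plugin's schema):

                "httpd" : {
                  "type" : "httpd",
                  "logFormat" : "%h %l %u %t \"%r\" %>s %b"
                }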

          Q2: This has to do with the most efficient way to write into Drill and the fact that we want to manage the write path to provide a clean and consistent complex data model for the underlying format.

          nielsbasjes Niels Basjes added a comment -

          I'll have a closer look tonight but my first question is why your code explicitly names many of the possible fields. In none of the plugins/udfs I've written so far (PIG, Hive, Java) do I assume anything about the output fields of the parser.
          Why do you do that in this situation?
          In addition: The "casts" the parser exposes specify if the field can be a string, long or double. No need to hardcode that part too.

          kingmesal Jim Scott added a comment -

          Jacques,

          Couple comments about the code:
          1. There is a next() method with the magic number 4095 in it. Curious as to the relevance.
          2. This plugin seems like it would be a great way to showcase how easy it is to plugin new functionality for file formats. If you could add some comments about what you have going on in the code it would help tremendously.
          3. In the bootstrap setup section you add this new format. I'm wondering what might be done within Drill to support the common practices for log rotation file formats. They usually start with error or access and then have a date pattern after them.

          Regarding the URI breakdown, the functionality for breaking a field down already exists in this library, so exposing it as a UDF makes sense, as it could then be used regardless of whether the log came from an Apache server or not.
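
          For instance, a minimal sketch of the kind of breakdown such a UDF could expose (plain JDK parsing shown here, not the library's dissectors):

                import java.net.URI;

                // Illustrative only: decomposing a url field into queryable parts.
                public class UriBreakdownDemo {
                  public static void main(String[] args) {
                    URI uri = URI.create("http://example.com:8080/shop/item?sku=42&ref=mail");
                    System.out.println("host  = " + uri.getHost());   // example.com
                    System.out.println("port  = " + uri.getPort());   // 8080
                    System.out.println("path  = " + uri.getPath());   // /shop/item
                    System.out.println("query = " + uri.getQuery());  // sku=42&ref=mail
                  }
                }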

          Jim

          jnadeau Jacques Nadeau added a comment -

          Initial feature branch can be seen here:

          https://github.com/jacques-n/drill/tree/DRILL-3423

          Jim Scott and Niels Basjes, would love to have some initial feedback. Note that I don't use all the dissectors provided by the library since that makes the table view less consumable. I propose that we add Complex outputting UDFs for the key things that are interesting (such as uri and query string breakdown).
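
          As a rough sketch of the query-string half of that idea (a hypothetical helper, not the proposed UDF itself; its natural Drill output would be a map):

                import java.io.UnsupportedEncodingException;
                import java.net.URLDecoder;
                import java.util.LinkedHashMap;
                import java.util.Map;

                // Hypothetical helper: split "a=1&b=2" into key/value pairs.
                public class QueryStringDemo {
                  public static Map<String, String> parse(String qs) throws UnsupportedEncodingException {
                    Map<String, String> params = new LinkedHashMap<>();
                    for (String pair : qs.split("&")) {
                      String[] kv = pair.split("=", 2);
                      params.put(URLDecoder.decode(kv[0], "UTF-8"),
                          kv.length > 1 ? URLDecoder.decode(kv[1], "UTF-8") : "");
                    }
                    return params;
                  }

                  public static void main(String[] args) throws UnsupportedEncodingException {
                    System.out.println(parse("sku=42&ref=mail&q=drill%20rocks"));
                    // {sku=42, ref=mail, q=drill rocks}
                  }
                }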

          jnadeau Jacques Nadeau added a comment -

          I'm trying to incorporate part of the remapping feature, but the initial integration will primarily stay focused on the basics to provide some value; we can look to enhance it later.

          The good news is that much of this feature's functionality is also available via SQL projections and UDFs.

          kingmesal Jim Scott added a comment -

          Niels mentioned the following and I wanted to find out if there are any updates on this ticket regarding this functionality.

          A feature that the library supports that is really important for main use cases is the type-remapping feature as described in: https://github.com/nielsbasjes/logparser/blob/master/README.md
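
          For reference, the remapping usage described in that README looks roughly like this; the log format, record class, and remapped field below are placeholders, and the exact method signature should be checked against the library:

                import nl.basjes.parse.core.Parser;
                import nl.basjes.parse.httpdlog.HttpdLoglineParser;

                public class RemappingSketch {
                  public static void main(String[] args) {
                    String logFormat = "%h %l %u %t \"%r\" %>s %b";
                    // A real record class would have annotated setters; Object is a stand-in here.
                    Parser<Object> parser = new HttpdLoglineParser<>(Object.class, logFormat);
                    // Reinterpret the query parameter 'g' as a URI so its sub-fields
                    // (host, path, query, ...) become parse targets too.
                    parser.addTypeRemapping("request.firstline.uri.query.g", "HTTP.URI");
                  }
                }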

          nielsbasjes Niels Basjes added a comment -

          The main source repo of this library: https://github.com/nielsbasjes/logparser/


            People

            • Assignee:
              kingmesal Jim Scott
              Reporter:
              jnadeau Jacques Nadeau
            • Votes:
              1
              Watchers:
              10
