
HIVE-693: Add an AWS S3 log format deserializer

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.5.0
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    Attachments

    1. HIVE-693.1.patch
      19 kB
      Zheng Shao
    2. HIVE-693.2.patch
      22 kB
      Zheng Shao
    3. inputs3.q
      0.5 kB
      Andraz Tori
    4. s3.log
      3 kB
      Andraz Tori
    5. s3deserializer.diff
      15 kB
      Andraz Tori
    6. S3LogDeserializer.java
      8 kB
      Andraz Tori
    7. S3LogStruct.java
      0.7 kB
      Andraz Tori

      Activity

      Andraz Tori added a comment -

      Deserializer implementation.

      While it works, the code is by no means release-ready; it has to be cleaned up first. But it is better than nothing as a starting point for someone looking to integrate an S3 log deserializer.

      I was quite amazed to find out that no one else needed this/published this.

      Zheng Shao added a comment -

      Cool. This looks like a good example for deserializing into a user-predefined Java class.

      Can you put the Hive commands to create table, load data, and query data into a comment here?

      Can you clean up the code a bit and move the test into a JUnit class?

      It would be great if you could move the deserializer into the contrib package as well.
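      (For reference, a minimal sketch of the commands being asked for here, using the contrib class name the SerDe ends up with later in this issue; the jar path, table name, and data path are assumptions, not taken from the patches.)

      -- Hypothetical jar path and table name; the SerDe supplies the columns,
      -- so no column list is given in CREATE TABLE.
      ADD JAR build/contrib/hive_contrib.jar;

      CREATE TABLE s3_log
      ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.s3.S3LogDeserializer'
      STORED AS TEXTFILE;

      LOAD DATA LOCAL INPATH 's3.log' INTO TABLE s3_log;

      SELECT * FROM s3_log LIMIT 10;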

      Andraz Tori added a comment -

      Here's a patch with expected inputs and outputs so unit tests can be created.

      I am still new to the Hive source tree, so someone else should take care of moving it to contrib.

      Andraz Tori added a comment -

      The patch...

      Andraz Tori added a comment -

      ... forgot to add s3.log for the previous patch.

      Are there any chances of getting this into 0.4?

      Ashish Thusoo added a comment -

      Hi Andraz,

      Can you add the new serde to hive/contrib/src? Ideally we would want the core serde directory to contain just the lazy serde, which is native to Hive, and have all the others in contrib. Also, you would have to take the main part out of the serde class and put it in the hive/serde/test directory; you can look at some of the examples there to see how it is done for other serdes. If you can get those things done, we can bring it into Hive contrib, and we can get it into 0.4 as well.

      Andraz Tori added a comment -

      I am developing against 0.3 [which is what we have on our servers], which doesn't have a contrib directory. So it would be much easier for someone actually familiar with the trunk to do it.

      Moving it to contrib should be trivial for someone with a buildable trunk checkout.

      The tests are attached in a separate patch (inputs and outputs), so the main class can easily be deleted if it bothers anyone. I have found it extremely helpful to have everything in one place during development.

      Zheng Shao added a comment -

      HIVE-693.1.patch: Andraz, I've moved all data and code to contrib. Can you review and comment?

      Please note that when you upgrade from Hive 0.3 to Hive 0.4 to use this new serde, you will need to manually go through the metastore tables and replace the name of the SerDe class (since it has changed to org.apache.hadoop.hive.contrib.serde2.s3.S3LogDeserializer).
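      (A hedged sketch of what "manually go through the metastore tables" could look like against a JDBC-backed metastore, where the SERDES table's SLIB column holds the SerDe class name. The old, pre-contrib class name below is an assumption; back up and verify your metastore schema before running anything like this.)

      -- Assumes the standard metastore schema (SERDES.SLIB stores the SerDe class name);
      -- the old class name in the WHERE clause is assumed, not confirmed in this issue.
      UPDATE SERDES
         SET SLIB = 'org.apache.hadoop.hive.contrib.serde2.s3.S3LogDeserializer'
       WHERE SLIB = 'org.apache.hadoop.hive.serde2.s3.S3LogDeserializer';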

      Andraz Tori added a comment -

      Actually, the input.q was a bit old, sorry for that; here's the fixed one.

      Everything else seems OK.

      Ashish Thusoo added a comment -

      Zheng,

      I noticed that in contrib/build.xml we have the same logFile name in the gen-test target as the one that appears in ql. This will overwrite the logs. We should fix that, perhaps as part of this patch?

      Andraz Tori added a comment -

      Zheng, I have two questions:

      1) How does one "manually go through the metastore tables"?

      2) What should be done to optimize the code so it executes faster? Are there any optimizations you can spot at first sight? Are there any benchmarking tools for testing different implementations of deserializers?

      My main doubts are about regex speed and about creating so many new strings every time.

      Zheng Shao added a comment -

      Incorporated Ashish's comments.

      Also removed the column definition, since the columns will come directly from the serde.

      @Andraz: For the speed improvement: instead of using regex, you can read the data in as org.apache.hadoop.io.Text and do the split yourself. Each field can be stored in a Text as well, and the Text objects can be reused across rows. This way the processing will be much faster (see the sketch below).
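      (A minimal sketch of the reuse pattern described above; the class and delimiter handling are illustrative, not the actual S3LogDeserializer code.)

      import org.apache.hadoop.io.Text;

      // Illustrative only: split an incoming Text row on a delimiter byte and copy
      // each field into a preallocated Text, so no per-row String/Text objects are created.
      public class ReusableTextSplitter {
        private final Text[] fields;

        public ReusableTextSplitter(int numFields) {
          fields = new Text[numFields];
          for (int i = 0; i < numFields; i++) {
            fields[i] = new Text();   // allocated once, reused for every row
          }
        }

        public Text[] split(Text row, byte delimiter) {
          byte[] bytes = row.getBytes();   // valid bytes are [0, row.getLength())
          int length = row.getLength();
          int field = 0;
          int start = 0;
          for (int i = 0; i <= length && field < fields.length; i++) {
            if (i == length || bytes[i] == delimiter) {
              fields[field].set(bytes, start, i - start);   // reuse the Text, no new String
              field++;
              start = i + 1;
            }
          }
          return fields;
        }
      }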

      Namit Jain added a comment -

      Should CREATE TABLE be allowed when the specified serde does not implement the serializer?

      If one tries to insert into this table, it will fail.
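      (In other words, with a deserializer-only SerDe like this one, reads work but writes do not; the table names below are hypothetical.)

      SELECT count(1) FROM s3_log;                                   -- fine: only deserialization is needed
      INSERT OVERWRITE TABLE s3_log SELECT * FROM some_other_table;  -- expected to fail: the SerDe cannot serialize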

      Namit Jain added a comment -

      Committed. Thanks Zheng and Andraz.

      Andraz Tori added a comment -

      Amazon changed the format of the logs at the beginning of February 2010, so the new regex is:

      static Pattern regexpat = Pattern.compile("(\\S+) (\\S+) \\[(.*?)\\] (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) \"(.*?)\" (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) \"(.*?)\" \"(.*)\"(?: -)?");

      (The only difference is the addition of (?: -)? at the end.)

      Since Amazon hasn't yet documented the last field, I don't know if it is ok to do a catch-all regex for that field instead of the very specific one I've added.
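      (A small illustrative check of the changed tail only; the simplified pattern and log fragments below are made up for the example and are not the full S3 log regex.)

      import java.util.regex.Matcher;
      import java.util.regex.Pattern;

      // Illustrative only: the added (?: -)? accepts both the old ending
      // (quoted user-agent) and the new one with an undocumented trailing " -".
      public class TrailingFieldCheck {
        public static void main(String[] args) {
          Pattern tail = Pattern.compile("\"(.*)\"(?: -)?$");
          String[] lines = {"... \"Mozilla/5.0\"", "... \"Mozilla/5.0\" -"};
          for (String line : lines) {
            Matcher m = tail.matcher(line);
            if (m.find()) {
              System.out.println("agent = " + m.group(1));   // Mozilla/5.0 in both cases
            }
          }
        }
      }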

      Andraz Tori added a comment -

      Anyone care to apply this minor fix before the next version?

      Namit Jain added a comment -

      Can you file a new patch and a fix for the same? We can take a look at it immediately.

      Carl Steinbach added a comment -

      Filed HIVE-1483 to cover the required update to the regex.


        People

        • Assignee: Andraz Tori
        • Reporter: Zheng Shao
        • Votes: 0
        • Watchers: 3
