Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: UDF
    • Labels:
      None

      Description

      Here are some UD(A)Fs that could be incorporated into the Hive distribution:

      UDFArgMax - Find the 0-indexed index of the largest argument. e.g., ARGMAX(4, 5, 3) returns 1.
      UDFBucket - Find the bucket in which the first argument belongs. e.g., BUCKET(x, b_1, b_2, b_3, ...) will return the smallest i such that x > b_{i} but <= b_{i+1}. Returns 0 if x is smaller than all the buckets.
      UDFFindInArray - Finds the 1-based index of the first argument within the array given as the second argument. Returns 0 if not found. Returns NULL if either argument is NULL. E.g., FIND_IN_ARRAY(5, array(1,2,5)) will return 3. FIND_IN_ARRAY(5, array(1,2,3)) will return 0.
      UDFGreatCircleDist - Finds the great circle distance (in km) between two lat/long coordinates (in degrees).
      UDFLDA - Performs LDA inference on a vector given fixed topics.
      UDFNumberRows - Number successive rows starting from 1. Counter resets to 1 whenever any of its parameters changes.
      UDFPmax - Finds the maximum of a set of columns. e.g., PMAX(4, 5, 3) returns 5.
      UDFRegexpExtractAll - Like REGEXP_EXTRACT except that it returns all matches in an array.
      UDFUnescape - Returns the string unescaped (using C/Java style unescaping).
      UDFWhich - Given a boolean array, return the indices which are TRUE.
      UDFJaccard
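
The semantics of a few of the scalar functions listed above can be illustrated in plain Java. This is only a sketch of the described behavior, not the code in the attached tarballs and not a Hive UDF class. The class and method names are hypothetical. Tie-breaking in ARGMAX, the result of BUCKET when x exceeds every boundary, and the use of the haversine formula with a 6371 km Earth radius are assumptions that the description does not confirm.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of the semantics listed above; these are NOT the attached Hive UDF classes.
final class ScalarUdfSketch {

    // ARGMAX: 0-based index of the largest argument (first wins on ties, an assumption).
    static int argMax(double... values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("at least one value is required");
        }
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[best]) {
                best = i;
            }
        }
        return best;
    }

    // PMAX: the largest of the arguments.
    static double pMax(double... values) {
        return values[argMax(values)];
    }

    // BUCKET: number of boundaries strictly below x, assuming ascending boundaries.
    // Returns 0 when x <= b_1; returning n when x > b_n is an assumption.
    static int bucket(double x, double... boundaries) {
        int i = 0;
        while (i < boundaries.length && x > boundaries[i]) {
            i++;
        }
        return i;
    }

    // FIND_IN_ARRAY: 1-based position of needle, 0 if absent, null if either argument is null.
    static Integer findInArray(Integer needle, List<Integer> haystack) {
        if (needle == null || haystack == null) {
            return null;
        }
        return haystack.indexOf(needle) + 1;
    }

    // Great-circle distance in km via the haversine formula (formula and radius are assumptions).
    static double greatCircleDistKm(double lat1, double lon1, double lat2, double lon2) {
        final double earthRadiusKm = 6371.0;
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.pow(Math.sin(dLat / 2), 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.pow(Math.sin(dLon / 2), 2);
        return 2 * earthRadiusKm * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    }

    public static void main(String[] args) {
        System.out.println(argMax(4, 5, 3));                         // 1
        System.out.println(pMax(4, 5, 3));                           // 5.0
        System.out.println(bucket(5, 1, 4, 10));                     // 2
        System.out.println(findInArray(5, Arrays.asList(1, 2, 5)));  // 3
        System.out.println(findInArray(5, Arrays.asList(1, 2, 3)));  // 0
    }
}
```

The printed values for ARGMAX, PMAX and FIND_IN_ARRAY match the examples given in the list.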

      UDAFCollect - Takes all the values associated with a row and converts them into a list. Make sure to have: set hive.map.aggr = false;
      UDAFCollectMap - Like collect except that it takes tuples and generates a map.
      UDAFEntropy - Compute the entropy of a column.
      UDAFPearson (BROKEN!!!) - Computes the Pearson correlation between two columns.
      UDAFTop - TOP(KEY, VAL) - returns the KEY associated with the largest value of VAL.
      UDAFTopN (BROKEN!!!) - Like TOP except returns a list of the keys associated with the N (passed as the third parameter) largest values of VAL.
      UDAFHistogram
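
For the aggregations, the intended results of ENTROPY, TOP and TOP_N can be sketched in plain Java over in-memory lists. This is an illustration of the described semantics only. It is not the attached UDAF code (TOP_N is marked broken above) and it does not use the Hive aggregation API. The class and method names are hypothetical, and the logarithm base for the entropy (base 2 here) is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the aggregation semantics; NOT the attached Hive UDAF classes.
final class AggregateSketch {

    // ENTROPY: Shannon entropy of the empirical distribution of a column (base 2 is an assumption).
    static <T> double entropy(List<T> column) {
        Map<T, Integer> counts = new HashMap<>();
        for (T value : column) {
            counts.merge(value, 1, Integer::sum);
        }
        double total = column.size();
        double h = 0.0;
        for (int count : counts.values()) {
            double p = count / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // TOP_N: keys associated with the n largest values, largest first.
    static <K> List<K> topN(List<K> keys, List<Double> vals, int n) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < keys.size(); i++) {
            order.add(i);
        }
        // Sort row indices by value, descending; the sort is stable, so ties keep input order.
        order.sort(Comparator.comparing((Integer i) -> vals.get(i)).reversed());
        List<K> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, order.size()); i++) {
            result.add(keys.get(order.get(i)));
        }
        return result;
    }

    // TOP: the key associated with the largest value, or null for empty input.
    static <K> K top(List<K> keys, List<Double> vals) {
        List<K> first = topN(keys, vals, 1);
        return first.isEmpty() ? null : first.get(0);
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "c");
        List<Double> vals = Arrays.asList(1.0, 3.0, 2.0);
        System.out.println(top(keys, vals));      // b
        System.out.println(topN(keys, vals, 2));  // [b, c]
        System.out.println(entropy(Arrays.asList("x", "x", "y", "y")));
    }
}
```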

      1. udfs.tar.gz
        7 kB
        Jonathan Chang
      2. udfs.tar.gz
        11 kB
        Jonathan Chang
      3. core.tar.gz
        19 kB
        Jonathan Chang
      4. ext.tar.gz
        19 kB
        Jonathan Chang
      5. UDFFindInString.java
        2 kB
        Jonathan Chang
      6. UDFEndsWith.java
        2 kB
        Jonathan Chang
      7. UDFStartsWith.java
        2 kB
        Jonathan Chang
      8. UDFTrim.java
        3 kB
        Jonathan Chang
      9. UDFRtrim.java
        3 kB
        Jonathan Chang
      10. UDFLtrim.java
        3 kB
        Jonathan Chang

        Issue Links

          Activity

          Jonathan Chang created issue -
          Jonathan Chang added a comment -

          Here is a tarball of the poorly documented/tested udfs.

          Jonathan Chang made changes -
          Field Original Value New Value
          Attachment udfs.tar.gz [ 12452223 ]
          Jeff Hammerbacher made changes -
          Link This issue is related to HIVE-1549 [ HIVE-1549 ]
          Jeff Hammerbacher added a comment -

          UDAFPearson looks quite similar to CORR(X, Y) proposed in HIVE-1549

          Carl Steinbach made changes -
          Component/s UDF [ 12313585 ]
          Terje Marthinussen added a comment -

          Was just quickly looking at this and noticed that

          grep lib com/facebook/hive/udf/*java
          com/facebook/hive/udf/UDAFHistogram.java:import com.facebook.hive.udf.lib.Counter;
          com/facebook/hive/udf/UDFJaccard.java:import com.facebook.hive.udf.lib.SetOps;

          however, there is no com.facebook.hive.udf.lib included.

          Jonathan Chang made changes -
          Attachment udfs.tar.gz [ 12456413 ]
          Jonathan Chang added a comment -

          Sorry about that. I've uploaded a new tarball which should contain the lib directory along with some new UDFs. UDAFPearson has also been removed since it's been obsoleted by CORR.

          Jonathan Chang added a comment -

          Some UDFs for tomorrow's contributor meeting. Summary of contents:

          Core:

          Basic functionality - CAST, HEX2DEC, MAP_GET
          Date/time functions - DAY_OF_WEEK, DST_OFFSET
          Multiple row manipulations - EXPLODE_INDEX, EXPLODE_MAP, REPEAT_ROWS
          Extensions of existing aggregations - COUNT_WHERE, SUM_WHERE,
          WEIGHTED_AVG, WEIGHTED_PERCENTILE
          Aggregations for collecting - COLLECT, COLLECT_MAP, COLLECT_WHERE,
          HISTOGRAM, UNION_MAP, UNION_SET
          Basic mathematical operations - ARG_MIN, ARG_MAX, BUCKET, IS_FINITE, PMAX,
          PMIN, PSUM
          Generally useful aggregations - ALL, ANY, CHOOSE_ONE, TOP, TOP_N
          JSON functionality - JSON_AS_ARRAY, JSON_AS_MAP, MAKE_JSON_ARRAY,
          MAKE_JSON_OBJ
          Generally useful array ops - ARRAY_CONCAT, ARRAY_INTERSECT, ARRAY_JOIN,
          ARRAY_SORT, ARRAY_SUBSET, ARRAY_UNION

          Ext:

          Maintaining state across rows - CUMPROD, CUMSUM, FILL, NUMBER_ROWS, PREV
          Probability (narrowly focused) - CHOOSE, ENTROPY, KMEANS, LDA,
          MAP_ENTROPY, PPOIS, RPOIS, SAMPLE, LINEAR_REGRESSION
          Narrowly focused string ops - MD5, LEVENSHTEIN, LONGEST,
          NORMALIZE_UNICODE, UNESCAPE, URL_QUOTE, GROUP_LONGEST, TITLECASE,
          REGEXP_EXTRACT_ALL
          More esoteric array/map ops - ARRAY_AGGREGATE, ARRAY_COUNT_OVERLAP,
          ARRAY_EXCLUDE, ARRAY_SLICE, FIND_SEQUENCE_IN_ARRAY, MAP_EXCLUDE
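
To make the "maintaining state across rows" category concrete, here is a minimal plain-Java sketch of NUMBER_ROWS as described in the issue (count from 1, reset to 1 whenever any parameter changes), together with a running sum in the spirit of CUMSUM. The CUMSUM behavior is inferred from its name only and is an assumption. The class names are hypothetical and this is not the code in ext.tar.gz. Functions of this kind only give meaningful results when rows reach the function in a defined order.

```java
import java.util.Arrays;

// Plain-Java sketch of row-to-row state; NOT the classes shipped in ext.tar.gz.
final class RowStateSketch {

    // NUMBER_ROWS: numbers successive rows from 1, resetting when any parameter changes.
    static final class NumberRows {
        private Object[] previousKey; // null until the first row has been seen
        private long counter;

        long evaluate(Object... key) {
            if (previousKey == null || !Arrays.equals(previousKey, key)) {
                counter = 1;                 // first row, or the parameters changed
                previousKey = key.clone();
            } else {
                counter++;
            }
            return counter;
        }
    }

    // CUMSUM: running total over the rows seen so far (behavior assumed from the name).
    static final class CumSum {
        private double total;

        double evaluate(double x) {
            total += x;
            return total;
        }
    }

    public static void main(String[] args) {
        NumberRows numberRows = new NumberRows();
        System.out.println(numberRows.evaluate("a")); // 1
        System.out.println(numberRows.evaluate("a")); // 2
        System.out.println(numberRows.evaluate("b")); // 1
    }
}
```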

          Jonathan Chang made changes -
          Attachment core.tar.gz [ 12484694 ]
          Attachment ext.tar.gz [ 12484695 ]
          cyril liao added a comment -

          com.facebook.hive.udf.lib.UDFUtils is not included.

          Would you please upload it?

          Jonathan Chang added a comment -

          Sure. Can you let me know which functions are not included? I think part of the resolution of this issue will be to port some of the UDFUtils-specific stuff to the UDF development package on github.

          Jonathan Chang added a comment -

          Or maybe we should open another issue to track augmenting the other package with the necessary functionality?

          John Sichi added a comment -

          I'm way behind on the PDK (probably not gonna make it for 0.8), but I'm planning to rework the UDFUtils into annotations as part of it.

          Cyril, I think they are mostly used for validation purposes, in which case you can just comment out the calls for now if you want to use the UDF without validation.

          cyril liao added a comment -

          Neither core.tar.gz nor ext.tar.gz contains a class named com.facebook.hive.udf.lib.UDFUtils, which is used by many UDFs.
          The package com.facebook.hive.udf.lib includes only Counter and SetOps.

          John Sichi made changes -
          Link This issue is related to HIVE-2523 [ HIVE-2523 ]
          John Sichi made changes -
          Link This issue depends on HIVE-2524 [ HIVE-2524 ]
          Jonathan Chang added a comment -

          5 string UDFs with Apache header

          Jonathan Chang made changes -
          Attachment UDFFindInString.java [ 12542017 ]
          Attachment UDFEndsWith.java [ 12542018 ]
          Attachment UDFStartsWith.java [ 12542019 ]
          Attachment UDFTrim.java [ 12542020 ]
          Attachment UDFRtrim.java [ 12542021 ]
          Attachment UDFLtrim.java [ 12542022 ]
          Gavin made changes -
          Link This issue depends on HIVE-2524 [ HIVE-2524 ]
          Gavin made changes -
          Link This issue depends upon HIVE-2524 [ HIVE-2524 ]
          Jeff Wu added a comment -

          Trying to compile these and load the jar but the package com.facebook.hive.udf.tests isn't included. Can someone attach that?

          Jonathan Chang added a comment -

          I think they should be migrated to use the equivalent facilities in the PDK?

          Jonathan Chang added a comment -

          For the time being you can remove those packages (and the corresponding annotations) without affecting the functionality.

          Brenden Matthews added a comment -

          Where's the rest of the source?

          Jonathan Chang added a comment -

          What are you looking for in particular?

          Brenden Matthews added a comment -

          There's a bunch of code missing. Your code doesn't build without modifications.

          I've made a copy of this which seems to work (minus the broken parts) here:

          https://github.com/brndnmtthws/facebook-hive-udfs

          Edward Capriolo added a comment -

          The annotations and other things you are seeing are part of an internal testing framework at FB that was never open sourced. The Hive plugin developer kit had similar annotations, but they were removed. So the UDFs likely compile fine, but the test cases will not.

          Jonathan Chang added a comment -

          Awesome. Thanks for setting up that github (so much nicer than JIRA ducks). Lemme see if I can upload the bare minimum files to get the annotations compiling. I have some open tasks internally to rewrite the testing framework — it should be much easier to integrate into a standard testing flow once we do so.

          Brenden Matthews added a comment -

          There were a number of things missing. For example, there are some validation utilities which aren't present, such as this one:

          https://github.com/brndnmtthws/facebook-hive-udfs/commit/bc6b3cc5ab4c413458f17ec3d981306b2b670946#L7L34

          Jonathan Chang added a comment -

          Ok, lemme work on getting everything uploaded. Is it better to work off the github, or would you prefer I continue to upload tarballs here?

          Brenden Matthews added a comment -

          Why don't we get everything into a good state on github first, then we can submit it here. You can just shoot me a pull request when you want a review.

          Thanks Jonathan!

          Jonathan Chang added a comment -

          Done. Just to put the discussion here as well, I think the first step will be to get a few simple string-related ones in along with the corresponding tests (ignoring the fact that the tests are no-ops for now). If this goes smoothly, we can go subpackage by subpackage, adding the necessary infrastructural bits a little at a time.

          Jonathan Chang added a comment -

          Ok, I've put a self-contained, compiling repo here: https://github.com/slycoder/hive-udfs

          What are the next steps?


            People

            • Assignee:
              Jonathan Chang
              Reporter:
              Jonathan Chang
            • Votes:
              3
              Watchers:
              20

              Dates

              • Created:
                Updated:
