HIVE-1027

Create UDFs for XPath expression evaluation

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.6.0
    • Component/s: Query Processor
    • Labels:
      None

      Description

      Create UDFs for evaluating XPath expressions against XML documents.

      Examples:

      > SELECT xpath_double ('<a><b class="odd">1</b><b class="even">2</b><b class="odd">4</b><c>8</c></a>', 'sum(a/b[@class="odd"])') FROM src LIMIT 1 ;
      5.0
      > SELECT xpath_string ('<a><b>b1</b><b>b2</b></a>', 'a/b[2]') FROM src LIMIT 1 ;
      b2
      > SELECT xpath ('<a><b>b1</b><b>b2</b><b>b3</b><c>c1</c><c>c2</c></a>', 'a/c/text()') FROM src LIMIT 1 ;
      ["c1","c2"]

      Included functions are: xpath_short, xpath_int, xpath_long, xpath_float, xpath_double/xpath_number, xpath_string, xpath
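
For readers unfamiliar with the underlying API, the scalar examples above can be reproduced with the JDK's javax.xml.xpath package. This is only a minimal sketch: the class and method names (XPathScalarSketch, evalDouble, evalString) are hypothetical and are not taken from the attached patches.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper class; not the code from the attached patches.
class XPathScalarSketch {

    // Evaluates expr against xml and returns the result as an XPath number.
    public static double evalDouble(String xml, String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (Double) xpath.evaluate(
                expr, new InputSource(new StringReader(xml)), XPathConstants.NUMBER);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Cannot evaluate: " + expr, e);
        }
    }

    // Evaluates expr against xml and returns the XPath string value of the result.
    public static String evalString(String xml, String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (String) xpath.evaluate(
                expr, new InputSource(new StringReader(xml)), XPathConstants.STRING);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Cannot evaluate: " + expr, e);
        }
    }

    public static void main(String[] args) {
        String xml = "<a><b class=\"odd\">1</b><b class=\"even\">2</b>"
            + "<b class=\"odd\">4</b><c>8</c></a>";
        System.out.println(evalDouble(xml, "sum(a/b[@class=\"odd\"])")); // 5.0
        System.out.println(evalString("<a><b>b1</b><b>b2</b></a>", "a/b[2]")); // b2
    }
}
```

Creating a new XPath object per call keeps the sketch simple; a real UDF would more likely reuse one instance per task.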

      1. HIVE-1027_3.patch
        80 kB
        Ning Zhang
      2. hive-1027-v3.patch
        78 kB
        Patrick Angeles
      3. hive-1027-v2.patch
        83 kB
        Patrick Angeles
      4. hive-1027.patch
        74 kB
        Patrick Angeles
      5. udf_xpath.patch
        69 kB
        Patrick Angeles

        Activity

        Ning Zhang added a comment -

        That's great. Thank you very much Patrick.

        Patrick Angeles added a comment -

        Here it is. I created a 'master' UDF guide page that is linked to via a list item in the 'Hive Users Guide':

        http://wiki.apache.org/hadoop/Hive/HiveUDFGuide

        Side note: the wiki site is painfully slow

        Ning Zhang added a comment -

        Patrick, can you update the wiki page for these new UDFs?

        Ning Zhang added a comment -

        Committed to trunk (0.6.0). Thanks for the contribution Patrick!

        Ning Zhang added a comment -

        Attaching a patch that reverses the type casting changes in FunctionRegistry.java for Hadoop 0.17.2.1.

        Patrick Angeles added a comment -

        Cleaned up patch.

        The Eclipse code cleanup hook generated by 'ant eclipse-files' was automatically removing casts. This seems to affect only the build against Hadoop 0.17.2.1.

        Ning Zhang added a comment -

        Patrick, the patch didn't build correctly with Hadoop 0.17.2.1 (ant -Dhadoop.version=0.17.2.1 clean package). I think it is caused by the change to remove type castings in FunctionRegistry.java (e.g., line 410 etc.). Can you take a look and fix that?

        Ning Zhang added a comment -

        Thanks Patrick! I will take a look and get back soon.

        Patrick Angeles added a comment -

        Hi Ning,

        Attaching a new patch against trunk.

        Ning Zhang added a comment -

        Hi Patrick,

        Could you regenerate the patch? It has some conflicts with the current trunk. I'll make sure to review it quickly and commit it after it is regenerated.

        Thanks,
        Ning

        Patrick Angeles added a comment -

        > Thanks for the detailed explanations. It seems we are supporting XPath 1.0 here. When you say "xpath() returns multiple nodes(list)", do you mean it returns a
        > serialized XML string representing the list of nodes such as <a>a1</a><a>a2</a> ...? In that case, do you have a test case for composing xpath() functions? For
        > example, a subquery returns an XML string from the result of xpath() and the outer query passes that as input to another xpath*() function?

        No, xpath() always returns a Hive array of strings. If the expression results in a non-text value (e.g., another XML node), the function returns an empty array. So really, there are only two uses for xpath(): to get a list of node text values or to get a list of attribute values. For example:

        > select xpath('<a><b>b1</b><b>b2</b></a>','a/*') from src limit 1 ;
        []
        > select xpath('<a><b>b1</b><b>b2</b></a>','a/*/text()') from src limit 1 ; // note the text() at the end of the expression
        ["b1","b2"]
        > select xpath('<a><b id="foo">b1</b><b id="bar">b2</b></a>','//@id') from src limit 1 ;
        ["foo","bar"]

        This behavior can be changed, but I feel that going down the path of returning nested results is suboptimal. I'm open to ideas, however.
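
The flat-array behavior described above can be approximated in plain Java by evaluating the expression as a node set and keeping only text and attribute nodes. This is a sketch under assumptions: XPathListSketch and evalList are hypothetical names, and the filtering rule is inferred from the examples in this comment rather than copied from the patch.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; the filtering rule is an assumption based on the examples.
class XPathListSketch {

    // Returns the values of matching text and attribute nodes; other node types are skipped.
    public static List<String> evalList(String xml, String expr) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                expr, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Node node = nodes.item(i);
                short type = node.getNodeType();
                if (type == Node.TEXT_NODE || type == Node.ATTRIBUTE_NODE) {
                    values.add(node.getNodeValue());
                }
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Cannot evaluate: " + expr, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(evalList("<a><b>b1</b><b>b2</b></a>", "a/*"));        // []
        System.out.println(evalList("<a><b>b1</b><b>b2</b></a>", "a/*/text()")); // [b1, b2]
        System.out.println(
            evalList("<a><b id=\"foo\">b1</b><b id=\"bar\">b2</b></a>", "//@id")); // [foo, bar]
    }
}
```

Element nodes such as the two <b> elements matched by 'a/*' are dropped, which is why that query yields an empty list here.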

        > For (4), I'm not sure whether we should interpret an empty list as an empty string, etc. We can definitely define the mapping from the XML model to the relational model this way,
        > but it doesn't distinguish the case where the xpath_string() result is an empty list from the case where it is a single node whose value is empty (e.g., <a/> vs. no <a>
        > element).
        Agreed. Unfortunately, the Java XPath API on which this is built returns an empty string in both cases. I can internally change it so that it queries for a node instead of a string and then extracts the string from the node. I get the feeling that this is less performant, but I have no facts to back this up.
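
The node-based alternative mentioned above could look roughly like the following. This is an illustrative sketch, not the committed behavior: XPathNodeSketch and evalStringOrNull are hypothetical names, and returning null for "no match" is an assumption about how the distinction might be surfaced.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class illustrating the "query for a node first" idea.
class XPathNodeSketch {

    // Returns null when nothing matches, otherwise the text of the first matching node.
    public static String evalStringOrNull(String xml, String expr) {
        try {
            Node node = (Node) XPathFactory.newInstance().newXPath().evaluate(
                expr, new InputSource(new StringReader(xml)), XPathConstants.NODE);
            return node == null ? null : node.getTextContent();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Cannot evaluate: " + expr, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(evalStringOrNull("<a/>", "a")); // empty string: <a> exists but is empty
        System.out.println(evalStringOrNull("<b/>", "a")); // null: there is no <a> element
    }
}
```

With XPathConstants.NODE the API yields null for an empty result, so the two cases that XPathConstants.STRING collapses into "" become distinguishable.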

        > Also, all this information should be exposed to the wider community (not only developers). Can you also add all of this to Hive's wiki page?
        Absolutely... I will update the Hive Wiki once this is committed.

        Ning Zhang added a comment -

        Thanks for the detailed explanations. It seems we are supporting XPath 1.0 here. When you say "xpath() returns multiple nodes(list)", do you mean it returns a serialized XML string representing the list of nodes such as <a>a1</a><a>a2</a> ...? In that case, do you have a test case for composing xpath() functions? For example, a subquery returns an XML string from the result of xpath() and the outer query passes that as input to another xpath*() function?

        For (4), I'm not sure whether we should interpret an empty list as an empty string, etc. We can definitely define the mapping from the XML model to the relational model this way, but it doesn't distinguish the case where the xpath_string() result is an empty list from the case where it is a single node whose value is empty (e.g., <a/> vs. no <a> element).

        Also, all this information should be exposed to the wider community (not only developers). Can you also add all of this to Hive's wiki page?

        Patrick Angeles added a comment -

        1) In general, XPath queries return a list of nodes. What does xpath_double (e.g.) return if the XPath expression evaluates to multiple nodes?

        Only xpath() returns multiple nodes (list).

        xpath_string() returns the text of the first matching node (and its subnodes, if any).

        • xpath_string('<a>aa<b>b1</b><b>b2</b></a>','a') returns 'aab1b2'
        • xpath_string('<a>aa<b>b1</b><b>b2</b></a>','b') returns 'b1'

        xpath_double()/float() return the numeric value of the text of the first matching node, or NaN if the text value is not numeric.
        xpath_int()/long()/short() return the numeric value of the text of the first matching node, or 0 if the text value is not numeric, or MAX_INT, MAX_LONG, MAX_SHORT respectively if the value overflows.
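
The integer rules above line up with how Java narrows a double: casting NaN to int or long yields 0, and values beyond the range saturate at the type's maximum. The sketch below illustrates that; whether the patch relies on these casts is an assumption, and XPathNumberSketch and its methods are hypothetical names. A plain (short) cast does not saturate, because Java narrows through int first, so the sketch clamps explicitly for short.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper class; the conversions are an assumption, not the patch's code.
class XPathNumberSketch {

    // Evaluates expr as an XPath number; non-numeric text yields NaN per XPath 1.0.
    public static double evalNumber(String xml, String expr) {
        try {
            return (Double) XPathFactory.newInstance().newXPath().evaluate(
                expr, new InputSource(new StringReader(xml)), XPathConstants.NUMBER);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Cannot evaluate: " + expr, e);
        }
    }

    // Java narrowing: NaN becomes 0, too-large values become Integer.MAX_VALUE.
    public static int toInt(double value) {
        return (int) value;
    }

    // Java narrowing: NaN becomes 0, too-large values become Long.MAX_VALUE.
    public static long toLong(double value) {
        return (long) value;
    }

    // Explicit clamp, since (short) on a large double would wrap instead of saturating.
    public static short toShort(double value) {
        long asLong = (long) value;
        return (short) Math.max((long) Short.MIN_VALUE, Math.min((long) Short.MAX_VALUE, asLong));
    }

    public static void main(String[] args) {
        double notNumeric = evalNumber("<a><b>x</b></a>", "a/b");
        System.out.println(notNumeric);        // NaN
        System.out.println(toInt(notNumeric)); // 0
        System.out.println(toInt(1e20) == Integer.MAX_VALUE);
        System.out.println(toShort(1e6) == Short.MAX_VALUE);
    }
}
```

Values that are too negative saturate at the corresponding minimum in the same way, a case the comment above does not address.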

        2) Is the XPath query parsed for every input row, or only parsed once?

        The XPath expression is compiled and cached. It is reused if the next expression matches the previous one; otherwise, it is recompiled. So the XML is always parsed for every input row, but the XPath expression is precompiled and reused in the vast majority of use cases.
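
A single-entry cache of this kind can be sketched as follows. The class name CachedXPathSketch, its fields, and the compile counter are all hypothetical and exist only to make the reuse visible; the patch's actual structure may differ.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical single-entry cache; not the code from the attached patches.
class CachedXPathSketch {
    private final XPath xpath = XPathFactory.newInstance().newXPath();
    private String lastExpr;          // text of the most recently compiled expression
    private XPathExpression compiled; // compiled form of lastExpr
    private int compileCount;         // instrumentation for this example only

    // Recompiles only when the expression text differs from the previous call.
    public String evalString(String xml, String expr) {
        try {
            if (!expr.equals(lastExpr)) {
                compiled = xpath.compile(expr);
                lastExpr = expr;
                compileCount++;
            }
            // The XML itself is parsed again on every call.
            return compiled.evaluate(new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Cannot evaluate: " + expr, e);
        }
    }

    public int getCompileCount() {
        return compileCount;
    }

    public static void main(String[] args) {
        CachedXPathSketch cache = new CachedXPathSketch();
        System.out.println(cache.evalString("<a><b>b1</b></a>", "a/b")); // b1
        System.out.println(cache.evalString("<a><b>b9</b></a>", "a/b")); // b9
        System.out.println(cache.getCompileCount());                     // 1
    }
}
```

Because a query normally passes the same constant expression for every row, one cached entry is enough for the common case.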

        3a) Do you support DTD and XMLSchema?

        Not sure how these would apply, as the Java XPath API is schema agnostic (no validation is performed). However, malformed XML (e.g., '<a><b>1</b></aa>') will result in a runtime exception being thrown.
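
To illustrate the malformed-input case: when javax.xml.xpath is handed an InputSource that is not well-formed, evaluation fails with an XPathExpressionException. The sketch below rethrows that as an unchecked exception; the class name, method name, and the choice of IllegalStateException are assumptions, not the patch's actual error handling.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper class; the exception type thrown by the real UDFs may differ.
class MalformedXmlSketch {

    // Returns the string value of expr, or throws an unchecked exception on bad XML or XPath.
    public static String evalOrFail(String xml, String expr) {
        try {
            return XPathFactory.newInstance().newXPath()
                .evaluate(expr, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Invalid XML or XPath expression", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(evalOrFail("<a><b>1</b></a>", "a/b")); // 1
        try {
            evalOrFail("<a><b>1</b></aa>", "a/b"); // closing tag does not match
        } catch (IllegalStateException e) {
            System.out.println("rejected malformed XML");
        }
    }
}
```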

        3b) What about namespace and backward axes in XPath?

        Namespaces are not currently supported, but could easily be added later.

        Backward axes are supported:

        > select xpath ('<a><b id="1"><c/></b><b id="2"><c/></b></a>','/descendant::c/ancestor::b/@id') from t1 limit 1 ;
        ["1","2"]

        4) If the XPath expression evaluates to an empty list, do you return NULL or an empty string (in the case of xpath())?

        When no match is found:
        xpath() returns an empty list.
        xpath_string() returns an empty string.
        xpath_int(), float(), etc. will return 0.
        xpath_boolean() will return false.

        Ning Zhang added a comment -

        This is cool stuff. Just some questions:
        1) In general, XPath queries return a list of nodes. What does xpath_double (e.g.) return if the XPath expression evaluates to multiple nodes?
        2) Is the XPath query parsed for every input row, or only parsed once?
        3) Do you support DTD and XMLSchema? What about namespaces and backward axes in XPath?
        4) If the XPath expression evaluates to an empty list, do you return NULL or an empty string (in the case of xpath())?

        Namit Jain added a comment -

        +1

        looks good - will commit if the tests pass

        Patrick Angeles added a comment -

        updated patch... includes show_functions.q.out

        Patrick Angeles added a comment -

        Updated patch (this one includes show_functions.q.out).

        Patrick Angeles added a comment -

        Code uses the built-in javax.xml.xpath library.


          People

          • Assignee:
            Patrick Angeles
            Reporter:
            Patrick Angeles
          • Votes:
            0
            Watchers:
            1
