Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Hadoop does not have a good XML parser for XML input. The XML parser in the streaming class is fairly difficult to work with and doesn't have proper test cases around it.

        Activity

        Alan Ho added a comment - - edited

        This patch adds XML input support to Hadoop. It relies on J2EE 5 libraries, namely the libraries required for StAX parsing.

        The libraries it depends on are:
        appserv-ws.jar
        javaee.jar
        webservices-rt.jar
        webservices-tools.jar

        I've attached the jar files separately.

        Alan Ho added a comment -

        JAR files for StAX parsing

        Doug Cutting added a comment -

        What license are these jar files distributed under?

        Alan Ho added a comment -

        I believe they are under the GPL: http://java.sun.com/javaee/downloads/licenses/JavaEE5SDKU3.thirdpartylicensereadme.txt

        Alan Ho added a comment -

        The two jars webservices-rt.jar and webservices-tools.jar seem to be too big to attach. I'm not sure what the best way is to get these libraries in. Please note that these libraries are part of the J2SE 1.6 distribution.

        James P. White added a comment - - edited

        As Doug will say, GPL-licensed material cannot be hosted here. But I doubt those JARs are necessarily GPL. They are probably CDDL, because I think they come from Glassfish.

        https://glassfish.dev.java.net/public/faq/GF_FAQ_2.html#terms

        The java.net JAX-WS RI is dual licensed CDDL and GPL:

        https://jax-ws.dev.java.net/

        It might also be Sun's Binary License:

        https://sjsxp.dev.java.net/

        In any case, you'll need to identify exactly where these files come from and which license applies. If the license is not Apache-compatible (the GPL, for example, is not), then the build will have to download them separately.

        Note that the J2SE license prohibits redistributing anything less than the entire distribution.

        http://java.sun.com/javase/6/jdk-6u3-license.txt

        Doug Cutting added a comment -

        Apache's policy for 3rd party libraries is stated here:

        http://people.apache.org/~cliffs/3party.html

        In particular, redistribution of GPL libraries is prohibited, while CDDL libraries are permitted in binary form only.

        If these are included in Java 6, perhaps we should make this issue depend on HADOOP-2325 (upgrading Hadoop to require Java 6).

        Alan Ho added a comment -

        I believe James is right. J2EE 5 is basically bundled with the Glassfish application server, which means that all the binaries are under CDDL. Regarding upgrading to Java 6, I personally haven't upgraded yet because it isn't supported on Mac OS X.

        Milind Bhandarkar added a comment -

        Now that Hadoop requires Java 6, can this be included?

        Alan Ho added a comment -

        I'd like to. Doug, who should I ask to get it merged in? I've already uploaded the patch, and I should probably run a test to make sure it works with Java 6.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12371782/javaee.jar
        against trunk revision 739416.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        -1 patch. The patch command could not apply the patch.

        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3787/console

        This message is automatically generated.

        Tom White added a comment -

        I think the Hudson run failed because it tried to use the latest attachment, which in this case is a jar file, rather than the patch file. Also, I couldn't get the patch to apply. Could you regenerate it please?

        Alan Ho added a comment -

        Sure. It's been a while since I've worked on the system. Do you want me to apply the patch to the trunk?

        Tom White added a comment -

        Yes please.

        BTW, how does this code compare with the approach described at http://www.nabble.com/Re%3A-map-reduce-function-on-xml-string-p15835195.html?

        Alan Ho added a comment -

        The approach described there doesn't use a StAX parser, and it probably isn't going to be as robust to failure or as extensible as one that does. If you look at the code, my patch lets you specify the XML element name, namespace prefix, and namespace URI when identifying the correct tag. My patch also makes it easier to massage the XML while reading in the data.

        Initially, when I tried to create an XML parser, I tried to hack something up like the approach described there. But after trying to parse real-world data (e.g. a dump of Wikipedia), I threw up my hands and decided to use a proper pull parser.
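
        To illustrate the idea of identifying a tag by element name and namespace URI with a StAX pull parser, here is a minimal sketch. It is not the code from the attached patch: the class name StaxElementScan, the method textOf, and the example namespace URI are hypothetical, and it assumes the matched elements contain only text. It uses only javax.xml.stream, which ships with Java 6 and later.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration only; not taken from the patch on this issue.
public class StaxElementScan {

    /**
     * Returns the text of every element whose namespace URI and local name
     * match the target. QName equality ignores the prefix, so documents may
     * bind the namespace to any prefix. Assumes matched elements are text-only
     * (getElementText throws if it meets a child element).
     */
    public static List<String> textOf(String xml, QName target) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        List<String> out = new ArrayList<String>();
        try {
            while (reader.hasNext()) {
                // Pull the next event; only start tags are of interest here.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getName().equals(target)) {
                    // Reads up to the matching end tag and returns the text.
                    out.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return out;
    }

    public static void main(String[] args) throws XMLStreamException {
        // The namespace URI below is made up for the example.
        String xml = "<w:pages xmlns:w='http://example.org/wiki'>"
                   + "<w:title>Hadoop</w:title><title>ignored</title>"
                   + "<w:title>StAX</w:title></w:pages>";
        System.out.println(textOf(xml, new QName("http://example.org/wiki", "title")));
    }
}
```

        Because the match is on namespace URI plus local name, the un-prefixed title element in the sample input is skipped, which is the kind of distinction a string-matching reader has trouble making.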


          People

          • Assignee: Unassigned
          • Reporter: Alan Ho
          • Votes: 2
          • Watchers: 5
