
Support XML Querying (selects/projections, no writing)

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: Future
    • Fix Version/s: Future
    • Component/s: None
    • Labels:

      Description

      Support querying of XML documents as read-only selects.
      Writing should be implemented as a separate feature, since it brings its own set of challenges.

      We should consider reading trivial, schema-less XML documents, DTD-oriented ones, and schema-defined ones.

      Also, we should consider direct querying vs. using converter tools to change the representation from XML to JSON, CSV, etc.

      Design and Implementation discussion, notes, ideas and implementation suggestions should be captured here:
      https://docs.google.com/document/d/1oS-cObSaTlAmuW_XghDLmHbBEorLl0z-axaHnjy7vg0/edit?usp=sharing
      (no vandalism, please)

        Activity

        githubbot ASF GitHub Bot added a comment -

        Github user magpierre closed the pull request at:

        https://github.com/apache/drill/pull/451

        githubbot ASF GitHub Bot added a comment -

        GitHub user magpierre opened a pull request:

        https://github.com/apache/drill/pull/451

        Drill 3878

        Please review my fix for JIRA DRILL-3878, which provides XML support for Apache Drill.
        The fix utilizes the existing support for JSON by converting XML to JSON using a simple SAX parser built for the purpose.
        The parser tries to produce acceptable JSON documents that are then fed into the JSONRecordReader for further processing.

        To add XML support to Apache Drill, place the built package in the 3rdparty folder of the built Apache Drill environment and start Drill.
        Then add:

        "xml": { "type": "xml", "extensions": [ "xml" ], "keepPrefix": true }

        to the type section in dfs.
        (Setting keepPrefix to false removes the namespace prefix from tag names in Apache Drill, since a namespace can be given a different prefix in different documents and the prefix is not really part of the tag name.)
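
As a rough illustration of the keepPrefix behaviour described above, the prefix handling could look like the sketch below. The function name tag_name and its keep_prefix parameter are hypothetical and are not taken from the pull request; the sketch only assumes that a prefixed tag has the form "prefix:name".

```python
def tag_name(qname: str, keep_prefix: bool = True) -> str:
    """Return the tag name, optionally without its namespace prefix."""
    if keep_prefix:
        return qname
    # "ns:order" becomes "order"; names without a prefix are returned unchanged.
    return qname.rsplit(":", 1)[-1]


if __name__ == "__main__":
    print(tag_name("ns:order", keep_prefix=True))
    print(tag_name("ns:order", keep_prefix=False))
```

With the prefix removed, two documents that bind the same namespace to different prefixes would yield the same column names.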

        The parser tries to be nice to Drill and its JSON reader by avoiding mixed types, arranging recurring values in arrays, and removing empty elements. This is done to minimize the number of JSON errors caused by the different nature of XML and Drill.

        Convention in JSON
        Attributes are named with an @ prefix followed by the attribute name, and store simple values.
        All other values are stored as objects with a #value field.
        This conforms somewhat to Apache Spark XML, but I need to store all values in objects to avoid as many problems with maps of differing types as possible.
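
The convention above can be sketched with a small SAX handler. This is not the parser from the pull request (which is written in Java and uses its own classes); it is a minimal Python illustration that assumes only what the description states: attributes go under "@name", element text goes under "#value", repeated elements are collected into arrays, and empty elements are dropped. The class and function names are hypothetical.

```python
import json
import xml.sax


class XmlToJsonHandler(xml.sax.ContentHandler):
    """Builds a nested dict following the '@attribute' / '#value' convention."""

    def __init__(self):
        super().__init__()
        self.root = {}
        # Stack of (element name, dict being filled, list of text chunks).
        self._stack = [("", self.root, [])]

    def startElement(self, name, attrs):
        # Attributes become simple values under "@<attribute name>".
        node = {"@" + key: value for key, value in attrs.items()}
        self._stack.append((name, node, []))

    def characters(self, content):
        self._stack[-1][2].append(content)

    def endElement(self, name):
        _, node, chunks = self._stack.pop()
        text = "".join(chunks).strip()
        if text:
            # Element text always lives inside an object, never as a bare scalar.
            node["#value"] = text
        if not node:
            return  # Drop elements with no attributes, text, or children.
        parent = self._stack[-1][1]
        if name not in parent:
            parent[name] = node
        elif isinstance(parent[name], list):
            parent[name].append(node)
        else:
            # Second occurrence: consolidate the repeated element into an array.
            parent[name] = [parent[name], node]


def xml_to_json(xml_text: str) -> str:
    """Parse an XML string and return the converted document as JSON text."""
    handler = XmlToJsonHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return json.dumps(handler.root)


if __name__ == "__main__":
    sample = '<orders><order id="1"><item>pen</item><item>ink</item><note/></order></orders>'
    print(xml_to_json(sample))
```

Note that in this sketch an element that occurs once is emitted as an object and one that occurs several times as an array, so it does not by itself guarantee uniform types across documents.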

        Current limitations:
        DTD tags are currently not handled.
        The schema is not validated against XSDs.

        Also: since I am not a Drill developer, I might have broken every possible rule of syntax, format, layout, and test frameworks, as well as of how to submit pull requests.

        You can merge this pull request into a Git repository by running:

        $ git pull https://github.com/magpierre/drill DRILL-3878

        Alternatively you can review and apply these changes as the patch at:

        https://github.com/apache/drill/pull/451.patch

        To close this pull request, make a commit to your master/trunk branch
        with (at least) the following in the commit message:

        This closes #451


        commit 844f34a16e75719535ff94c54d5337746ea18c20
        Author: MPierre <magnus.pierre@icloud.com>
        Date: 2015-11-05T14:42:06Z

        Initial commit

        XML support in Apache Drill

        commit 592b3af06c2ff45198136577561f2ec1f7caaee0
        Author: MPierre <magnus.pierre@icloud.com>
        Date: 2015-11-05T21:21:42Z

        Fixed some minor outstanding bugs

        EasyRecordReader have a new field userName, and I forgot to change
        jsonProcessor to protected from private.

        commit 8fad811edab43d3499b41bb66cb419248d11208f
        Author: MPierre <magnus.pierre@icloud.com>
        Date: 2015-11-09T08:59:08Z

        Merge remote-tracking branch 'apache/master' into DRILL-3878

        commit 38f4884fe9b8456c1cde5de44c1e54177301a974
        Author: MPierre <magnus.pierre@icloud.com>
        Date: 2016-03-16T11:33:15Z

        Syncing to latest release of drill

        commit 909c5dec8bdb01bfe0ed358ebc64c959785738df
        Author: MPierre <magnus.pierre@icloud.com>
        Date: 2016-03-16T11:34:10Z

        syncing to latest release of drill

        commit 597d9657d613fa35df2c10dff23681545b13e531
        Author: MPierre <magnus.pierre@icloud.com>
        Date: 2016-03-18T08:55:51Z

        Cleaned up deliver

        Cleaned up the output generated by the SAX Parser, and removed all
        unnecessary code.

        commit 0cfaa31ab9af89833417288a290d21d0ce88c4ac
        Author: MPierre <magnus.pierre@icloud.com>
        Date: 2016-03-18T10:29:51Z

        Merge remote-tracking branch 'apache/master' into DRILL-3878

        commit aaaff05eb921125ad64854c89c179292c4441fb7
        Author: MPierre <magnus.pierre@icloud.com>
        Date: 2016-03-24T13:05:53Z

        Adjusted output from Parser to fit Drill better

        I have adjusted the SAX parser to produce JSON that Drill likes. Among
        the things corrected is to remove empty objects from the tree built.
        And to consolidate repeating values in arrays.

        commit ba19a356d850224c01b9e807183377b46cf7e545
        Author: MPierre <magnus.pierre@icloud.com>
        Date: 2016-03-24T13:10:57Z

        Fixed small typo

        commit 8ba6705be42c7847d469611ab070b869e0c76d8c
        Author: MPierre <magnus.pierre@icloud.com>
        Date: 2016-03-24T21:17:30Z

        Further enhancements of the output format to fit Drill

        commit e2273f13b8e0136a33c1576c4667f16e23e1631c
        Author: MPierre <magnus.pierre@icloud.com>
        Date: 2016-03-24T21:22:41Z

        Removed comment

        commit c1b6ff8375a7e3c8161167d1a5f2b34ba165e750
        Author: MPierre <magnus.pierre@icloud.com>
        Date: 2016-03-29T12:48:53Z

        Merge remote-tracking branch 'apache/master' into DRILL-3878


        magnusp Magnus Pierre added a comment -

        The initial code to be used as inspiration for a more complete solution is here: https://github.com/magpierre/drill/tree/DRILL-3878

        magnusp Magnus Pierre added a comment -

        Hello,
        I have a simple implementation of a format converter that converts XML to JSON and runs it through Drill's JSONRecordReader, which works fine for the test data I have available. The concept works well and the performance is decent, but it builds the complete JSON document in memory before handing it over to the JSONRecordReader, which is an issue for larger documents. Currently I am using a home-grown SAX parser that builds the JSON document using the org.json classes. However, there are DOM variants that can also do XSD validation and so on. To be able to plug directly into JSONRecordReader without having to duplicate the code, embeddedInfo, hadoopPath, and stream need either to be changed from private to protected, or getters and setters need to be provided.
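
One possible way to ease the memory concern mentioned above is to emit each record as soon as its end tag is seen, instead of building the whole document first. The sketch below is not part of the pull request; it assumes flat records with unique child names, and the function name iter_records and the record_tag parameter are hypothetical.

```python
import io
import json
import xml.etree.ElementTree as ET


def iter_records(source, record_tag):
    """Yield one JSON string per <record_tag> element as soon as it is complete."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != record_tag:
            continue
        # Attributes go under "@name", child text under "#value".
        record = {"@" + key: value for key, value in elem.attrib.items()}
        for child in elem:
            if child.text and child.text.strip():
                record[child.tag] = {"#value": child.text.strip()}
        yield json.dumps(record)
        # Free the finished element's children and text; the emptied
        # element itself stays attached to its parent.
        elem.clear()


if __name__ == "__main__":
    data = io.BytesIO(
        b'<rows><row id="1"><name>a</name></row><row id="2"><name>b</name></row></rows>'
    )
    for line in iter_records(data, "row"):
        print(line)
```

Each yielded string could then be handed to a JSON reader one record at a time, so peak memory depends on the largest record rather than on the whole file.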

        Regarding XSDs, I am considering whether the dfs configuration could take an additional per-workspace option for the XML file type that holds a list/array of XSDs, so that any document in that workspace must adhere to the referenced XSDs; otherwise it would not be considered by Drill.

        I will fill in the document, but I believe adding information in the jira itself makes it more visible to other people in the community.

        Best regards,
        Magnus


          People

          • Assignee: Unassigned
          • Reporter: ebegoli Edmon Begoli
          • Votes: 3
          • Watchers: 6

              Time Tracking

              • Estimated: 3,360h
              • Remaining: 3,360h
              • Logged: Not Specified
