Solr
  1. Solr
  2. SOLR-133

change XmlUpdateRequestHandler to use StAX instead of XPP

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3
    • Component/s: None
    • Labels:
      None

      Description

      there has been discussion of using StAX for XML parsing of updates instead of XPP ... opening an issue to track it as a possible improvement (orriginally mentioned in SOLR-61, but that task was more specificly about refactoring the existing code)

      1. SOLR-133.diff
        24 kB
        Thorsten Scherler
      2. SOLR-133.diff
        21 kB
        Thorsten Scherler
      3. SOLR-133-XmlUpdateRequestHandler-StAX-139.patch
        23 kB
        Ryan McKinley
      4. SOLR-133-XmlUpdateRequestHandler-StAX-139.patch
        49 kB
        Ryan McKinley
      5. SOLR-133-XmlUpdateRequestHandler-StAX-139.patch
        26 kB
        Ryan McKinley
      6. SOLR-133-XmlUpdateRequestHandler-StAX-139.patch
        25 kB
        Ryan McKinley
      7. SOLR-133-XmlUpdateRequestHandler-StAX-139.patch
        27 kB
        Ryan McKinley
      8. SOLR-133-XmlUpdateRequestHandler-StAX-139.patch
        27 kB
        Ryan McKinley
      9. SOLR-133-XmlUpdateRequestHandler-StAX-139.patch
        28 kB
        Ryan McKinley
      10. SOLR-133-XmlUpdateRequestHandler-StAX-139.patch
        27 kB
        Ryan McKinley

        Issue Links

          Activity

          Hide
          Hoss Man added a comment -

          This bug was modified as part of a bulk update using the criteria...

          • Marked "Resolved" and "Fixed"
          • Had no "Fix Version" versions
          • Was listed in the CHANGES.txt for 1.3 as of today 2008-03-15

          The Fix Version for all 29 issues found was set to 1.3, email notification was suppressed to prevent excessive email.

          For a list of all the issues modified, search jira comments for this (hopefully) unique string: batch20070315hossman1

          Show
          Hoss Man added a comment - This bug was modified as part of a bulk update using the criteria... Marked "Resolved" and "Fixed" Had no "Fix Version" versions Was listed in the CHANGES.txt for 1.3 as of today 2008-03-15 The Fix Version for all 29 issues found was set to 1.3, email notification was suppressed to prevent excessive email. For a list of all the issues modified, search jira comments for this (hopefully) unique string: batch20070315hossman1
          Hide
          Ryan McKinley added a comment -

          this is the default implementation since r552198

          Show
          Ryan McKinley added a comment - this is the default implementation since r552198
          Hide
          Ryan McKinley added a comment -

          updated to apply with trunk + SOLR-262

          Changed it to sit next to XmlUpdateRequestHandler – this way it can be evaluated next to XmlUpdateRequestHandler rather then replace it.

          The UpdateRequestProcessor interface needs review, but I think it can be done CTR...

          Show
          Ryan McKinley added a comment - updated to apply with trunk + SOLR-262 Changed it to sit next to XmlUpdateRequestHandler – this way it can be evaluated next to XmlUpdateRequestHandler rather then replace it. The UpdateRequestProcessor interface needs review, but I think it can be done CTR...
          Hide
          Ryan McKinley added a comment -

          dooh – wrong issue

          Show
          Ryan McKinley added a comment - dooh – wrong issue
          Hide
          Ryan McKinley added a comment -

          Extracts the request parsing and update handling into two parts.

          This adds an "UpdateRequestProcessor" that handles the actual updating. This offers a good place for authentication / document transformation etc. This can all be reuse if we have a JSONUpdate handler. The UpdateRequestProcessor can be changed using an init param in solrconfig,xml:

          <requestHandler name="/update" class="solr.XmlUpdateRequestHandler" >
          <str name="update.processor.class">org.apache.solr.handler.UpdateRequestProcessor</str>
          </requestHandler>

          Moved the XPP version to XppUpdateRequestHandler and mapped it to:
          <requestHandler name="/update/xpp" class="solr.XppUpdateRequestHandler" />

          My initial (not accurate) tests don't show any significant time difference between the two – we should keep both in the code until we are confident the new one is stable.

          • - - - -

          Thorsten - can you check if the STAX includes are all in good shape? Is it ok to use:
          import javanet.staxutils.BaseXMLInputFactory;

          Show
          Ryan McKinley added a comment - Extracts the request parsing and update handling into two parts. This adds an "UpdateRequestProcessor" that handles the actual updating. This offers a good place for authentication / document transformation etc. This can all be reuse if we have a JSONUpdate handler. The UpdateRequestProcessor can be changed using an init param in solrconfig,xml: <requestHandler name="/update" class="solr.XmlUpdateRequestHandler" > <str name="update.processor.class">org.apache.solr.handler.UpdateRequestProcessor</str> </requestHandler> Moved the XPP version to XppUpdateRequestHandler and mapped it to: <requestHandler name="/update/xpp" class="solr.XppUpdateRequestHandler" /> My initial (not accurate) tests don't show any significant time difference between the two – we should keep both in the code until we are confident the new one is stable. - - - - Thorsten - can you check if the STAX includes are all in good shape? Is it ok to use: import javanet.staxutils.BaseXMLInputFactory;
          Hide
          Ryan McKinley added a comment -

          Updated to apply against trunk and removed the SOLR-139 dependency.

          Not it just relies on SOLR-193.

          Show
          Ryan McKinley added a comment - Updated to apply against trunk and removed the SOLR-139 dependency. Not it just relies on SOLR-193 .
          Hide
          Ryan McKinley added a comment -

          From Yonik on SOLR-231

          >> Solr should assume UTF-8 encoding unless the contentType says otherwise.
          >
          > In general yes (when Solr is asked for a Reader).
          > For XML, we should probably give the parser an InputStream.
          > http://www.nabble.com/double-curl-calls-in-post.sh--tf2287469.html#a6369448
          >

          Show
          Ryan McKinley added a comment - From Yonik on SOLR-231 >> Solr should assume UTF-8 encoding unless the contentType says otherwise. > > In general yes (when Solr is asked for a Reader). > For XML, we should probably give the parser an InputStream. > http://www.nabble.com/double-curl-calls-in-post.sh--tf2287469.html#a6369448 >
          Hide
          Ryan McKinley added a comment -

          If you have SOLR-193 + SOLR-139, i think most things are good...

          We could easily remove the SOLR-139 dependency.

          I think after solr1.2 (assuming it is sometime soon), this and SOLR-193 would be good to include.

          Show
          Ryan McKinley added a comment - If you have SOLR-193 + SOLR-139 , i think most things are good... We could easily remove the SOLR-139 dependency. I think after solr1.2 (assuming it is sometime soon), this and SOLR-193 would be good to include.
          Hide
          Thorsten Scherler added a comment -

          What is missing with this issue, where can I give a helping had.

          Show
          Thorsten Scherler added a comment - What is missing with this issue, where can I give a helping had.
          Hide
          Ryan McKinley added a comment -

          applies (almost cleanly) with trunk + SOLR-193 + SOLR-139

          Show
          Ryan McKinley added a comment - applies (almost cleanly) with trunk + SOLR-193 + SOLR-139
          Hide
          Ryan McKinley added a comment -

          Applies cleaner with trunk - it still depends on SOLR-193 and SOLR-139, so "clean" may not be the best description.

          Show
          Ryan McKinley added a comment - Applies cleaner with trunk - it still depends on SOLR-193 and SOLR-139 , so "clean" may not be the best description.
          Hide
          Ryan McKinley added a comment -

          updated to work with most recent SOLR-139

          Show
          Ryan McKinley added a comment - updated to work with most recent SOLR-139
          Hide
          Ryan McKinley added a comment -

          fixed the document parser to handle fields with CDATA.

          switch (event) {
          // Add everything to the text
          case XMLStreamConstants.SPACE:
          case XMLStreamConstants.CDATA:
          case XMLStreamConstants.CHARACTERS:
          text.append( parser.getText() );
          break;
          ...

          Show
          Ryan McKinley added a comment - fixed the document parser to handle fields with CDATA. switch (event) { // Add everything to the text case XMLStreamConstants.SPACE: case XMLStreamConstants.CDATA: case XMLStreamConstants.CHARACTERS: text.append( parser.getText() ); break; ...
          Hide
          Hoss Man added a comment -

          haven't reviewed any patches, but to address some of J.J.s earlier points: i assume the goal would be to move to STaX because it is reportadly as fast as XPP but is also the new standard for "fast" stream based processing.

          As for replacing other XML parsing code in the solr code base – XML parsing can probably be divided into two lumps:
          1) processing input streams for updates (or more generally: "requests with ContentStreams" based on some of Ryan's recent patches)
          2) config file parsing

          while STaX sounds like it makes a lot of sense for #1, sticking with DOM parsing for #2 seems like a good idea ... using XPath to access arbitrary sections of config information is extremely handy and the performance issues with initialization from the config DOM doens't seem like a big issue.

          Show
          Hoss Man added a comment - haven't reviewed any patches, but to address some of J.J.s earlier points: i assume the goal would be to move to STaX because it is reportadly as fast as XPP but is also the new standard for "fast" stream based processing. As for replacing other XML parsing code in the solr code base – XML parsing can probably be divided into two lumps: 1) processing input streams for updates (or more generally: "requests with ContentStreams" based on some of Ryan's recent patches) 2) config file parsing while STaX sounds like it makes a lot of sense for #1, sticking with DOM parsing for #2 seems like a good idea ... using XPath to access arbitrary sections of config information is extremely handy and the performance issues with initialization from the config DOM doens't seem like a big issue.
          Hide
          Ryan McKinley added a comment -

          Thorsten - this looks good. I cleaned it up a bit and modified it to use SOLR-139. The big changes I made are:

          • It uses two spaces (not tabs or 4 spaces)
          • It overwrites the existing XmlUpdateRequestHandler rather then adding a parallel one. (We should either use StAX or XPP, but not both)
          • It breaks out the xml parsing so that parsing a single document is an easily testable chunk:

          SolrDocument readDoc(XMLStreamReader parser)

          • It adds a test to make sure it reads documents correctly
          • Since it is the XmlUpdateRequestHandler all the other tests that insert documents use it.
          Show
          Ryan McKinley added a comment - Thorsten - this looks good. I cleaned it up a bit and modified it to use SOLR-139 . The big changes I made are: It uses two spaces (not tabs or 4 spaces) It overwrites the existing XmlUpdateRequestHandler rather then adding a parallel one. (We should either use StAX or XPP, but not both) It breaks out the xml parsing so that parsing a single document is an easily testable chunk: SolrDocument readDoc(XMLStreamReader parser) It adds a test to make sure it reads documents correctly Since it is the XmlUpdateRequestHandler all the other tests that insert documents use it.
          Hide
          Thorsten Scherler added a comment -

          @Larrea
          1) standards-based
          2) agree
          3) agree
          4) agree

          StAX is become a standard. Not as fast as SAX but nearly. IMO the StAX implementation is as easy to follow as the xpp, personally I think even easier.

          Show
          Thorsten Scherler added a comment - @Larrea 1) standards-based 2) agree 3) agree 4) agree StAX is become a standard. Not as fast as SAX but nearly. IMO the StAX implementation is as easy to follow as the xpp, personally I think even easier.
          Hide
          Thorsten Scherler added a comment -

          Fixing bugs from first version.

          Adding workaround for problem with direct use of the handler (never gets a stream).
          http://www.mail-archive.com/solr-dev@lucene.apache.org/msg02759.html
          by patching the SolrUpdateServlet

          Please test, it works fine for me.

          Show
          Thorsten Scherler added a comment - Fixing bugs from first version. Adding workaround for problem with direct use of the handler (never gets a stream). http://www.mail-archive.com/solr-dev@lucene.apache.org/msg02759.html by patching the SolrUpdateServlet Please test, it works fine for me.
          Hide
          Thorsten Scherler added a comment -

          It seems the diff does not show the other libs you need to compile.

          You can download them from:
          https://svn.apache.org/repos/asf/forrest/trunk/whiteboard/dispatcher/lib/

          Show
          Thorsten Scherler added a comment - It seems the diff does not show the other libs you need to compile. You can download them from: https://svn.apache.org/repos/asf/forrest/trunk/whiteboard/dispatcher/lib/
          Hide
          Thorsten Scherler added a comment -

          Refactoring the XmlUpdateRequestHandler to use constant variables that can be reused by the Stax implementation. Adding a stax implementation for the XmlUpdateRequestHandler. Till now I get an error about missing content stream.

          NOTE:
          To make the version compile you need to download the JSR 173 API from
          http://www.ibiblio.org/maven2/stax/stax-api/1.0/stax-api-1.0.jar
          and copy it to $SOLR_HOME/lib/.

          Show
          Thorsten Scherler added a comment - Refactoring the XmlUpdateRequestHandler to use constant variables that can be reused by the Stax implementation. Adding a stax implementation for the XmlUpdateRequestHandler. Till now I get an error about missing content stream. NOTE: To make the version compile you need to download the JSR 173 API from http://www.ibiblio.org/maven2/stax/stax-api/1.0/stax-api-1.0.jar and copy it to $SOLR_HOME/lib/.
          Hide
          J.J. Larrea added a comment -

          It would be useful if there first were some consensus as to what the goals are for making a change to the XML Update Handler; some possibilities I can think of include:

          1) To use standards-based rather than non-standards-based technologies as much as possible
          2) To use as few different XML technologies (and coding styles related to the technology) as possible
          3) To reduce as much as possible the complexity of code needed for interpreting XML command and/or configuration streams
          4) To lower resource consumption and limitations for XML handling, e.g. stream-based rather than random-access

          By all means add to that list, prioritize, and remove goals which are not seen as important.

          Then it seems to me the question would be how many of those goals are addressed by changing XML Update Handler to stAX, vs. other technologies. One might at the same time also want to look at other places where SOLR decodes XML such as config files, to see if there can be more commonality rather than continued isolation.

          Show
          J.J. Larrea added a comment - It would be useful if there first were some consensus as to what the goals are for making a change to the XML Update Handler; some possibilities I can think of include: 1) To use standards-based rather than non-standards-based technologies as much as possible 2) To use as few different XML technologies (and coding styles related to the technology) as possible 3) To reduce as much as possible the complexity of code needed for interpreting XML command and/or configuration streams 4) To lower resource consumption and limitations for XML handling, e.g. stream-based rather than random-access By all means add to that list, prioritize, and remove goals which are not seen as important. Then it seems to me the question would be how many of those goals are addressed by changing XML Update Handler to stAX, vs. other technologies. One might at the same time also want to look at other places where SOLR decodes XML such as config files, to see if there can be more commonality rather than continued isolation.

            People

            • Assignee:
              Ryan McKinley
              Reporter:
              Hoss Man
            • Votes:
              1 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development