SOLR-3246

UpdateRequestProcessor to extract Solr XML from rich documents

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: update
    • Labels:
      None

      Description

      This would be an update request processor that saves a file containing the XML that represents the document to an external directory. The original idea was to add it to the processing chain of the ExtractingRequestHandler in order to store an already parsed version of the docs. Storing pre-parsed documents makes reindexing the entire index faster (avoiding the Tika phase and just sending the XML to the standard update handler).
      As a side effect, extracting the XML can make debugging rich docs easier.
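
As a rough illustration of the idea, the following standalone Java sketch renders a document's fields as a Solr XML update message (`<add><doc>...</doc></add>`) and writes it to an output directory. It is not taken from the attached patch: the class name `SolrXmlDumpSketch`, its methods, and the one-file-per-document layout are assumptions made for the example, and it handles only single-valued string fields.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper; not part of the SOLR-3246 patch. */
class SolrXmlDumpSketch {

    /** Escapes the characters that are special in XML text and attribute values. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Renders one document as a Solr XML update message. */
    static String toAddXml(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("<field name=\"").append(escape(e.getKey())).append("\">")
              .append(escape(e.getValue())).append("</field>");
        }
        return sb.append("</doc></add>").toString();
    }

    /** Writes the XML for one document to outputDir/fileName.xml and returns the path. */
    static Path dump(Path outputDir, String fileName, Map<String, String> fields) throws IOException {
        Files.createDirectories(outputDir);
        Path target = outputDir.resolve(fileName + ".xml");
        Files.write(target, toAddXml(fields).getBytes(StandardCharsets.UTF_8));
        return target;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", "doc-1");
        doc.put("content", "Text extracted by Tika & friends");
        Path written = dump(Files.createTempDirectory("dumps"), "doc-1", doc);
        System.out.println(new String(Files.readAllBytes(written), StandardCharsets.UTF_8));
    }
}
```

A file produced this way could later be posted to the update handler as-is, which is what would let a reindex skip the Tika extraction step.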

      1. SOLR-3246.patch
        19 kB
        Emmanuel Espina
      2. SOLR-3246.patch
        23 kB
        Emmanuel Espina

        Activity

        Emmanuel Espina added a comment (edited)

        Added some changes to let the user select the output format. The patch contains only an XML writer, but others, such as CSV, can be added.

        Emmanuel Espina added a comment -

        The output format could probably be set in a way similar to how response writers are configured. The XMLWritingUpdateProcessor would then become simply WritingUpdateProcessor, and the writer could be selected with a configuration parameter that has a default (either XML or CSV). That would be:

        <updateRequestProcessorChain name="writer">
          <processor class="org.apache.solr.update.processor.WritingUpdateProcessorFactory">
            <str name="outputDir">./dacDumps</str>
            <str name="writer">xml</str>
            <str name="groupFiles">100</str>
          </processor>
        </updateRequestProcessorChain>

        Also, another parameter could control whether one, n, or an unlimited number of documents are added to the same file.
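
One way to picture the two ideas in this comment (choosing a writer by name with a default, and grouping documents per file) is the standalone Java sketch below. Everything in it is hypothetical: the class `WriterSelectionSketch`, the registry, and the reading of `groupFiles` as "documents per file, with zero or less meaning unlimited" are assumptions for illustration and are not confirmed by the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch; the names here do not come from the SOLR-3246 patch. */
class WriterSelectionSketch {

    /** Output formats keyed by the value of the "writer" parameter. */
    static final Map<String, Function<Map<String, String>, String>> WRITERS = new HashMap<>();
    static {
        // Deliberately naive formatters: escaping and quoting are omitted for brevity.
        WRITERS.put("xml", doc -> {
            StringBuilder sb = new StringBuilder("<doc>");
            doc.forEach((k, v) ->
                sb.append("<field name=\"").append(k).append("\">").append(v).append("</field>"));
            return sb.append("</doc>").toString();
        });
        WRITERS.put("csv", doc -> String.join(",", doc.values()));
    }

    /** Returns the writer for the given name, falling back to "xml" when none is configured. */
    static Function<Map<String, String>, String> select(String name) {
        String key = (name == null) ? "xml" : name;
        Function<Map<String, String>, String> writer = WRITERS.get(key);
        if (writer == null) {
            throw new IllegalArgumentException("Unknown writer: " + name);
        }
        return writer;
    }

    /** Splits docs into per-file batches; perFile <= 0 is assumed to mean "everything in one file". */
    static <T> List<List<T>> group(List<T> docs, int perFile) {
        List<List<T>> files = new ArrayList<>();
        if (docs.isEmpty()) {
            return files;
        }
        int size = (perFile <= 0) ? docs.size() : perFile;
        for (int i = 0; i < docs.size(); i += size) {
            files.add(new ArrayList<>(docs.subList(i, Math.min(i + size, docs.size()))));
        }
        return files;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("a", "b", "c", "d", "e");
        System.out.println(group(ids, 2).size() + " files for 5 documents at 2 per file");
    }
}
```

Keeping the format behind a name-to-function lookup mirrors how a response writer is picked by name, so adding a CSV writer would not require touching the processor itself.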

        Jan Høydahl added a comment -

        We wrote a data dumper in a project as a patched ExtractingUpdateRequestHandler. It writes a CSV format (including Base64-encoded binary input) to a single file. We were thinking about rewriting it as an UpdateProcessor, which would then work much like yours. The benefit of the CSV format is that it is much more compact. Also, a file system may struggle with too many files in one folder.
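
The CSV approach described here can be sketched in a few lines of standalone Java. This is not the dumper mentioned in the comment: the class `CsvDumpSketch`, the three-column layout (id, extracted text, raw input), and the quoting rule are assumptions chosen for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical sketch of a one-line-per-document CSV dump; the column layout is assumed. */
class CsvDumpSketch {

    /** Wraps a value in double quotes and doubles any embedded quotes. */
    static String quote(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    /** Builds one CSV row: quoted id, quoted text, and the raw binary input as Base64. */
    static String toCsvLine(String id, String text, byte[] rawInput) {
        // Standard Base64 output contains no commas, quotes, or line breaks,
        // so the encoded column needs no further quoting.
        String encoded = Base64.getEncoder().encodeToString(rawInput);
        return String.join(",", quote(id), quote(text), encoded);
    }

    public static void main(String[] args) {
        byte[] raw = "hi".getBytes(StandardCharsets.UTF_8);
        System.out.println(toCsvLine("doc-1", "extracted text", raw));
    }
}
```

Appending such rows to one file keeps the dump compact and avoids creating one file per document.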

        Emmanuel Espina added a comment -

        Initial code for this component (with a very simple test)

        Emmanuel Espina added a comment -

        This is similar to https://issues.apache.org/jira/browse/SOLR-903, but this would be a server-side component.


          People

          • Assignee:
            Unassigned
            Reporter:
            Emmanuel Espina
          • Votes:
            1
            Watchers:
            1
