SOLR-1837

Reconstruct a Document (stored fields, indexed fields, payloads)

    Details

      Description

      One Solr feature I've been sorely in need of is the ability to inspect an index for any particular document. While the analysis page is good when you have specific content and a specific field/type you want to test the analysis process for, it is not currently possible to easily see what is actually sitting in the index once a document has been indexed.

      One can use the Lucene Index Browser (Luke), but this has several limitations (GUI only, doesn't understand the Solr schema, doesn't display many non-text fields in human-readable format, doesn't show payloads, some bugs lead to missing terms, exposes features dangerous to use in a production Solr environment, slow or difficult to check from a remote location, etc.). However, the document reconstruction feature of Luke provides the base for what can become a much more powerful tool when coupled with Solr's understanding of a schema.

      1. SOLR-1837.patch
        24 kB
        Trey Grainger
      2. SOLR-1837_WithHandler.patch
        30 kB
        John Wooden

        Activity

        Trey Grainger added a comment -

        I've been working on implementing the document reconstruction feature over the past week and have created an additional admin page which exposes it. The functionality is essentially a reworking of the Lucene document reconstruction functionality in Luke, but with improvements to handle the problems listed in the JIRA issue description above.

        I'll be pushing up a patch soon and will look forward to any additional recommendations after others have had a chance to try it out.

        Andrzej Bialecki added a comment -

        Re: bugs in Luke that result in missing terms - I recently fixed one such bug, and indeed it was located in the DocReconstructor - if you are aware of others then please report them using the Luke issue tracker.

        Document reconstruction is a very IO-intensive operation, so I would advise against using it on a production system, and also it produces inexact results (because analysis is usually a lossy operation).

        Trey Grainger added a comment -

        Quoting Andrzej: "Re: bugs in Luke that result in missing terms - I recently fixed one such bug, and indeed it was located in the DocReconstructor - if you are aware of others then please report them using the Luke issue tracker."

        I just pulled down the most recent Luke code, and it does look like that recent fix was made to cover the bug I saw. Unfortunately, the fix results in a null reference for me on my index. I'll open an issue, as it looks like all that's needed is an extra null check.

        Quoting Andrzej: "Document reconstruction is a very IO-intensive operation, so I would advise against using it on a production system, and also it produces inexact results (because analysis is usually a lossy operation)."

        I hear you about it being IO-intensive. There are also other admin tools in Solr which do similarly intensive operations (the schema browser, for example, which generates a list of all fields and a distribution of terms within those fields). The intent of the tool is for one-off debugging, not for any kind of automated querying, but I'll try to do some tests to see to what degree this tool is affecting our current production systems (I have not seen any noticeable effect thus far).

        Also, regarding the process being lossy: in this case, that is kind of the point of the tool (in my use), namely to see what has actually been put into the index vs. what was in the document sent to the engine. For example, if I index a field with the text "Wi-fi hotspots are a life-saver" with payloads on parts of speech, as well as stemming, I want to be able to see something like:
        "wi [1] / fi [1] | wifi [1] / hotspot [1] / are [2] / a [3] / life [1] / saver [1] | lifesaver [1]"

        With no payloads, this would simply be
        "wi / fi | wifi / hotspots | hotspot / are / a / life / saver | lifesaver"
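
A minimal sketch of how such a display string could be assembled, assuming that " / " separates consecutive token positions and " | " separates terms that share a position (this reading of the notation is an assumption, as are the function name and the position numbers in the hypothetical token list below; it is not code from the patch):

```python
from collections import defaultdict


def render_field(tokens, show_payloads=True):
    """Render (position, term, payload) triples as one display string.

    Terms at different positions are joined with " / "; terms that share a
    position (for example a word part and its catenated form) with " | ".
    """
    by_position = defaultdict(list)
    for position, term, payload in tokens:
        if show_payloads and payload is not None:
            by_position[position].append(f"{term} [{payload}]")
        else:
            by_position[position].append(term)
    return " / ".join(" | ".join(by_position[p]) for p in sorted(by_position))


# Hypothetical indexed tokens for "Wi-fi hotspots are a life-saver".
# The positions and payload values are illustrative only.
tokens = [
    (0, "wi", 1), (1, "fi", 1), (1, "wifi", 1),
    (2, "hotspot", 1), (3, "are", 2), (4, "a", 3),
    (5, "life", 1), (6, "saver", 1), (6, "lifesaver", 1),
]

print(render_field(tokens))
print(render_field(tokens, show_payloads=False))
```

Grouping by position first keeps the renderer independent of the order in which terms are read back from the index.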

        So I had initially named the tool the Solr Document Reconstructor, after the name you gave to the tool in Luke. Based on your comments, I think it might be less confusing for me to call it something like "Document Inspector", since it is not truly reconstructing the original document.

        I'll try to get what I have pushed up today so you can check it out if you want. Thanks for your great work on that tool!

        Trey Grainger added a comment -

        Here's what I have thus far. The only bug I currently know about is that Solr multi-valued fields (e.g. <field name="x">value1</field><field name="x">value2</field>) currently display as concatenated together instead of as an array of separate fields in the "stored fields" view.
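
As an illustration of the intended behavior (not the patch's code), the sketch below groups repeated field names into a list of values instead of concatenating them. The function name and the sample pairs are hypothetical:

```python
def group_stored_fields(pairs):
    """Collect (name, value) pairs into {name: [values...]}.

    Repeated names keep every value in its own list slot, in the order the
    values were seen, rather than being joined into one string.
    """
    grouped = {}
    for name, value in pairs:
        grouped.setdefault(name, []).append(value)
    return grouped


# Hypothetical stored fields of one document; "x" is multi-valued.
stored = [("x", "value1"), ("x", "value2"), ("id", "12345")]
grouped = group_stored_fields(stored)
print(grouped)
```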

        I've referred to the tool in the admin interface as the "Document Inspector" instead of "Document Reconstructor" to prevent confusion over lost/changed/added terms due to index-time analysis.

        Any feedback appreciated.

        Hoss Man added a comment -

        Bulk updating 240 Solr issues to set the Fix Version to "next" per the process outlined in this email...

        http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

        Selection criteria was "Unresolved" with a Fix Version of 1.5, 1.6, 3.1, or 4.0. Email notifications were suppressed.

        A unique token for finding these 240 issues in the future: hossversioncleanup20100527

        Hoss Man added a comment -

        1) Solr is moving away from having any JSPs at all – we've been focusing on having RequestHandlers return all of the data in a structured, machine-parsable form (with UIs being made possible using XSLT or AJAX).

        2) It seems like instead of adding a new request handler (or JSP) for this, it would make more sense to add this as optional info that could be requested when using LukeRequestHandler's "id" (or docId) functionality...

        http://wiki.apache.org/solr/LukeRequestHandler#id
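
For reference, a small sketch of how the existing LukeRequestHandler lookup is addressed, using its "id" (uniqueKey value) and "docId" (internal Lucene document number) parameters. The base URL, the /admin/luke registration path, and the helper name are assumptions about a default setup; no parameter for requesting reconstructed terms is shown, because none exists yet:

```python
from urllib.parse import urlencode


def luke_doc_url(base_url, unique_key=None, lucene_doc_id=None):
    """Build a LukeRequestHandler URL for a single document lookup.

    Exactly one of unique_key ("id" parameter) or lucene_doc_id ("docId"
    parameter) must be given.
    """
    if (unique_key is None) == (lucene_doc_id is None):
        raise ValueError("pass exactly one of unique_key or lucene_doc_id")
    if unique_key is not None:
        params = {"id": unique_key}
    else:
        params = {"docId": lucene_doc_id}
    return f"{base_url.rstrip('/')}/admin/luke?{urlencode(params)}"


# Assumed local Solr base URL, for illustration only.
url = luke_doc_url("http://localhost:8983/solr", unique_key="12345")
print(url)
```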

        Robert Muir added a comment -

        Bulk move 3.2 -> 3.3

        Robert Muir added a comment -

        3.4 -> 3.5

        Hoss Man added a comment -

        Bulk move of fixVersion=3.6 -> fixVersion=4.0 for issues that have no assignee and have not been updated recently.

        Email notification suppressed to prevent mass-spam.
        Pseudo-unique token identifying these issues: hoss20120321nofix36

        Jan Høydahl added a comment -

        Trey Grainger, is this something you still want to pursue for Solr4/5, perhaps as an extension to LukeReqHandler?

        John Wooden added a comment -

        I've updated this patch to use a handler rather than JSP. Patch is also confirmed working with 4.2.1.

        Performance is still quite slow. The SolrDocReconstructor class hasn't changed much since the prior version.

        – How to use –

        1. Add the handler to your config:

        <requestHandler name="/admin/docinspector" class="solr.DocumentReconstructorHandler" />

        2. Sample call:

        /solr/coreX/admin/docinspector?documentid=12345

        3. Wait. Time required varies by size of document and index. A large document in a large index may allow enough time for a doughnut & coffee run.

        4. Sample output:

        <response>
          <lst name="responseHeader">
            <int name="status">0</int>
            <int name="QTime">x</int>
          </lst>
          <str name="DocumentID">12345</str>
          <lst name="Fields">
            <lst name="Stored">
              <str name="documentid">12345</str>
              <str name="isstarter.b_s">true</str>
              <str name="jerseynumber.i_is">16</str>
              <str name="schema">test</str>
              <str name="solrdt">2013-07-03T19:06:42.069Z</str>
            </lst>
            <lst name="Indexed">
              <str name="documentid">12345</str>
              <str name="dodges.i_i">28 | 0 | 0 | 0</str>
              <str name="hits.i_i">17 | 0 | 0 | 0</str>
              <str name="jerseynumber.i_is">16 | 0 | 0 | 0</str>
              <str name="schema">test</str>
              <str name="solrdt">2013-07-03T19:06:42.069Z | 2013-07-03T19:06:42.048Z | 2013-07-03T19:05:40.096Z | 2013-07-03T14:46:48.064Z | 2013-06-01T13:49:27.424Z | 2004-11-03T19:53:47.776Z | 1970-01-01T00:00:00Z | 1970-01-01T00:00:00Z</str>
            </lst>
          </lst>
        </response>
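
The extra values in the "Indexed" section look like the lower-precision terms that trie-encoded numeric and date fields add to the index. This is an inference from the sample output rather than something stated in the patch. The sketch below assumes a precision step of 8 bits, 32-bit integers, and 64-bit millisecond timestamps, and simply clears the low bits of the value for each step; it handles only non-negative values and does not reproduce Lucene's actual term encoding. All names in it are hypothetical:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def trie_prefix_values(value, bits, precision_step=8):
    """Return `value` with its low `shift` bits cleared for each shift of
    0, precision_step, 2 * precision_step, ... below `bits`."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    return [value & ~((1 << shift) - 1) for shift in range(0, bits, precision_step)]


def format_millis(millis):
    """Format epoch milliseconds as an ISO-8601 UTC string, omitting the
    fractional part when it is zero."""
    moment = EPOCH + timedelta(milliseconds=millis)
    text = moment.strftime("%Y-%m-%dT%H:%M:%S")
    fraction = millis % 1000
    return f"{text}.{fraction:03d}Z" if fraction else f"{text}Z"


# Stored values taken from the sample output above.
indexed_at = datetime(2013, 7, 3, 19, 6, 42, 69000, tzinfo=timezone.utc)
millis = (indexed_at - EPOCH) // timedelta(milliseconds=1)

int_terms = " | ".join(str(v) for v in trie_prefix_values(16, bits=32))
date_terms = " | ".join(format_millis(v) for v in trie_prefix_values(millis, bits=64))
print(int_terms)
print(date_terms)
```

Under these assumptions the two printed lines match the "jerseynumber.i_is" and "solrdt" entries in the "Indexed" section, which suggests the inspector is listing every indexed term for the field, including the reduced-precision ones.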

        Steve Rowe added a comment -

        Bulk move 4.4 issues to 4.5 and 5.0

        Uwe Schindler added a comment -

        Move issue to Solr 4.9.


          People

          • Assignee: Unassigned
          • Reporter: Trey Grainger
          • Votes: 3
          • Watchers: 5

          Time Tracking

          • Original Estimate: 168h
          • Remaining Estimate: 168h
          • Time Spent: Not Specified
