Solr
SOLR-380

There's no way to convert search results into page-level hits of a "structured document".

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 4.9, 5.0
    • Component/s: search
    • Labels: None

      Description

      "Paged-Text" FieldType for Solr

      A chance to dig into the guts of Solr. The problem: If we index a monograph in Solr, there's no way to convert search results into page-level hits. The solution: have a "paged-text" fieldtype which keeps track of page divisions as it indexes, and reports page-level hits in the search results.

      The input would contain page milestones: <page id="234"/>. As Solr processed the tokens (using its standard tokenizers and filters), it would concurrently build a structural map of the item, indicating which term position marked the beginning of which page: <page id="234" firstterm="14324"/>. This map would be stored in an unindexed field in some efficient format.

      At search time, Solr would retrieve term positions for all hits that are returned in the current request, and use the stored map to determine page ids for each term position. The results would imitate the results for highlighting, something like:

      <lst name="pages">
        <lst name="doc1">
           <int name="pageid">234</int>
           <int name="pageid">236</int>
         </lst>
         <lst name="doc2">
           <int name="pageid">19</int>
         </lst>
      </lst>
      <lst name="hitpos">
         <lst name="doc1">
           <lst name="234">
             <int name="pos">14325</int>
           </lst>
         </lst>
         ...
      </lst>
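
      A minimal sketch of the proposed lookup (all names hypothetical; the issue deliberately leaves the stored map format open): with page-start term positions kept sorted, resolving a hit's term position to a page id is a binary search.

      import java.util.Arrays;

      /** Hypothetical in-memory form of the stored page map. */
      public class PageMap {
          private final int[] firstTerm; // term position of each page's first token, sorted ascending
          private final int[] pageIds;   // parallel array of page ids

          public PageMap(int[] firstTerm, int[] pageIds) {
              this.firstTerm = firstTerm;
              this.pageIds = pageIds;
          }

          /** Resolve a hit's term position to the page containing it (assumes firstTerm[0] == 0). */
          public int pageFor(int termPosition) {
              int i = Arrays.binarySearch(firstTerm, termPosition);
              if (i < 0) i = -i - 2; // not an exact page start: take the page that begins before it
              return pageIds[i];
          }
      }

      For the sample above, a map entry of firstterm="14324" for page 234 resolves hit position 14325 to page id 234.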

      Attachments

      1. SOLR-380-XmlPayload.patch (155 kB) - Tricia Jenkins
      2. SOLR-380-XmlPayload.patch (92 kB) - Tricia Jenkins
      3. xmlpayload-src.jar (5.74 MB) - Tricia Jenkins
      4. xmlpayload.jar (10 kB) - Tricia Jenkins
      5. xmlpayload-example.zip (8.55 MB) - Tricia Jenkins

          Activity

          Tricia Jenkins created issue -
          Peter Binkley added a comment -

          I've been wondering about what's required to get this output added to the response. It appears that a response writer isn't the answer: those are for different formats (xml, json, etc.). Is everything we need included in the FieldType methods (write(), etc.)? The highlighting functionality is probably a good model for what we want to do.

          Ryan McKinley added a comment -

          I don't totally understand how a field type solves your problem (I'm sure it can... i just don't quite follow)

          But - If you want your search results to return pages, why not just index each page as a new SolrDocument?

          Peter Binkley added a comment -

           The problem with the page-as-SolrDocument approach is that you then have to group the pages back under their container documents to present a unified result to the user (like this: http://tinyurl.com/yt2a25 ). I want the primary unit of granularity in search results to be the book, and the pages to be only a secondary layer. I also want to be able to do proximity searches that bridge page boundaries, have relevance ranking consider the whole book text and not just that page, etc.: i.e. treat the text as continuous for searching purposes. So I gain a lot by treating the book as the SolrDocument; I just need that extra bit of work to resolve the page positions to have it all.

          Peter Binkley added a comment -

          formatted the xml for clarity

          Pieter Berkel added a comment -

          There was a recent discussion surrounding a similar problem on solr-user:
          http://www.nabble.com/Structured-Lucene-documents-tf4234661.html#a12048390

          The idea was to use dynamic fields (e.g. page_1, page_2, page_3... page_N) to store the text of each page in a single document. The problem is that currently Solr does not support "glob" style field expansion in query parameters (e.g. qf=page_* ) so you would end up having to specify the entire list of page fields in your query, which is impractical. There is already an open issue related to this particular problem (SOLR-247) but nobody has had time to look into it.

          In terms of returning term position information, this seems somehow (albeit loosely) related to highlighting, is there any way you could use the existing functionality to achieve your goal? (definitely would be a hack though)

          Erik Hatcher added a comment -

          > The idea was to use dynamic fields (e.g. page_1, page_2, page_3... page_N) to store the text of each page in a single document. The problem is that currently Solr does not support "glob" style field expansion in query parameters (e.g.
          > qf=page_* ) so you would end up having to specify the entire list of page fields in your query, which is impractical. There is already an open issue related to this particular problem (SOLR-247) but nobody has had time to look into it.

          In this case, a copyField from page_* into an unstored "contents" would do the trick, which would also facilitate querying across pages. A position increment gap could also prohibit phrase queries across "pages", optionally.

          Peter Binkley added a comment -

          Both these methods (page_* fields or unstored "contents" field) would make it difficult to discover from the search results which pages matched the query, though, wouldn't they? They would both need extra work to populate a structure like the "pages" and "hitpos" elements in the sample xml above. Would that extra work be more efficient than the document-map approach we've proposed above?

          The highlighting functionality is definitely the model to follow for handling term positions.

           Tricia Jenkins added a comment - edited

           The discussion from http://www.nabble.com/Structured-Lucene-documents-tf4234661.html#a12048390 gives one solution (which is more of a workaround in my opinion), but I don't think it is practical. The number of pages in the monographs we index varies greatly (10s to 1000s of pages). So while specifying each page_* (page_1,page_2,page_3,...,page_N) as a field to highlight will work, I don't think it is the cleanest solution, because you have to infer page numbers from the highlighted samples. Furthermore, in order to get the highlighted samples you need to know the values of the * in a dynamic field, which sort of defeats the purpose of the dynamic field. If you wanted to use the position numbers themselves (for example, using positions and OCR information to create highlighting on an original image), they are not available in the results.

          In answer to your question Peter, one must enable highlighting and list all the page_* fields for highlighter snippets. In the following example I have a dynamic field fulltext_*, copyfield fulltext, and defaultSearchField=fulltext:
          http://tinyurl.com/3xdshk
          (essentially shows the parameters and their values for this example – pay attention to the hl.fl parameter)
          gives the normal results, with the following at the end:

          <lst name="highlighting">
           <lst name="News.EFP.186500">
            <arr name="fulltext_1">
             <str>
               was <em>employed</em> on the G. T. R. as fireman met his death in an accident on that road some yeara ago but three
             </str>
            </arr>
            <arr name="fulltext_4">
             <str>
               ^f 6r-Ke.w¥eaf!fl': Mr.-BradV whb is <em>employed</em> in Windsor, was also at his borne for jSew Year
             </str>
            </arr>
            <arr name="fulltext_6">
             <str>
               <em>employed</em> at the Walkerville brewery op to a short time ago,when illness ecessilater! his resignation. He
             </str>
            </arr>
            <arr name="fulltext_7">
             <str>
               . have entered intoan agreement to <em>employ</em> the powerful tug Lntz to keep th>e Detroit river between
             </str>
            </arr>
           </lst>
          </lst>

          You will notice that only the pages with hits on them appear in the highlight section. From this point it would take a little work to parse the /arr[@name] to get the * from fulltext_* for each document match.
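
           A hedged sketch of that parsing step (assuming the fulltext_* naming used in this example): a regex over the highlight field names recovers the page number.

           import java.util.regex.Matcher;
           import java.util.regex.Pattern;

           public class PageFieldParser {
               private static final Pattern PAGE_FIELD = Pattern.compile("fulltext_(\\d+)");

               /** Returns the page number encoded in a dynamic field name, or -1 if the name doesn't match. */
               public static int pageNumber(String fieldName) {
                   Matcher m = PAGE_FIELD.matcher(fieldName);
                   return m.matches() ? Integer.parseInt(m.group(1)) : -1;
               }
           }

           For instance, pageNumber("fulltext_4") returns 4.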

           I agree that the highlighter is a good model of what we want to do. But the difficulty I'm finding is the upfront part, where we need to store the position-to-page mapping in one field while at the same time analyzing the full page text into another field for searching.

          I don't think defining a FieldType will allow us to do this. The FieldType looks like it is useful in controlling what the output of your defined field is (write()), and how it is sorted, but not how Fields with your FieldType will be indexed or queried.

          Would someone more familiar with the innards of Solr recommend I pursue the SOLR-247 problem, or continue hunting for a solution in the manner that I've been pursuing in this issue?

          Peter Binkley added a comment -

           Thanks for clarifying how the highlighting would let you see the page numbers. On that model, all we would need would be to enhance the highlighting report to make it show the term positions rather than (or as well as) the terms.

          But I'm not ready to give up on the map idea yet. I hadn't dug far enough into FieldTypes, evidently. Could we maybe index the text in the normal way, with a token filter that ignores the milestones, and then copyfield the text to a FieldType whose only job is to build and store the map? Provided that the two were tokenizing and filtering in the same way, the position counts would remain in sync; the mapping FieldType would just require a final filter that counted the incoming tokens and took note of the milestones, and generated the map as a series of tokens in whatever format we decide to store the map in.

          (And Tricia, would you mind tinyfying that url, so the page doesn't get stretched?)

          Mike Klaas added a comment -

           In my opinion the best solution is to create one Solr document per page and denormalize the container data across each page.

          If I had to implement it the other way, I would probably index the pages as a multivalued field with a large position increment gap (say 1000), store term vectors, and use the position information from the term vectors to determine the page hits (e.g., pos 4668 -> page 5; pos 668 -> page 1; pos 9999 -> page 10). Assumes < 1000 tokens per page, of course.
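
           A sketch of that page lookup (assuming, as in the example, that each page effectively starts at a multiple of the 1000-position gap and pages are numbered from 1):

           public class GapPageLookup {
               private static final int GAP = 1000; // must match the field's positionIncrementGap

               /** pos 4668 -> page 5; pos 668 -> page 1; pos 9999 -> page 10 */
               public static int pageFor(int pos) {
                   return pos / GAP + 1;
               }
           }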

          Incidentally, this discussion doesn't really belong here. It would be better to sketch out ideas on solr-user, then move to JIRA to track a resulting patch (if it gets that far). I actually don't think that there is anything to add to Solr here--it seems more of a question of customization.

          Peter Binkley added a comment -

          OK, taking the discussion to solr-user until we nail down what we're doing.

          Tricia Jenkins added a comment -

           This is a draft. Note that the Payload and Token classes in particular have changed since lucene-core-2.2.0.jar. Users of this patch will need to replace lucene-core-2.2.0.jar with lucene-core-2.3-dev.jar. I have created a test for XmlPayloadCharTokenizer but not attached it here, because LuceneTestCase is not in Solr's classpath in any form and it would break the build.

           The code works in theory and passes tests to that effect. However, in practice, when I deploy the war created from the "dist" ant target, several problems result from adding documents (which seems to work, using a <![CDATA[...]]> section to contain the structured document, and post.jar):

           • after adding an XmlPayload-tokenized document, q=*:* causes a 500 error:
             HTTP Status 500 - read past EOF
             java.io.IOException: read past EOF
               at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146)
               at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
               at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
               at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:153)
               at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:408)
               at org.apache.lucene.index.MultiSegmentReader.document(MultiSegmentReader.java:129)
               at org.apache.lucene.index.IndexReader.document(IndexReader.java:436)
               at ...
          • use of the highlight fields produces the same error.
           • queries that should match an XmlPayload-tokenized document do not (//result[@numFound='0']), though queries matching non-XmlPayload-tokenized documents continue to return the expected results.
          • trying to view the index using Luke (Lucene Index Toolbox, v 0.7.1 (2007-06-20) ) returns: Unknown format version: -4
          • Solr Statistics confirm that all the documents have been added.

           I will continue to finish this functionality, but any suggestions or other input are welcome. You can see how the functionality is intended to be used in src/test/org/apache/solr/highlight/XmlPayloadTest.java.

          Tricia Jenkins made changes -
          Attachment lucene-core-2.3-dev.jar [ 12369349 ]
          Attachment SOLR-380-XmlPayload.patch [ 12369348 ]
          Tricia Jenkins added a comment -

           Functionality is improved. Tests are more complete. I have included an example (much like the example included in Solr) which demonstrates the changes needed to solrconfig.xml and schema.xml, as well as some xml documents to start playing with.

          TODO:

          • Still have to track down what happens when filters are applied to the Tokenizer.
          • Implement error handling for bad xml input.
          Tricia Jenkins made changes -
          Attachment SOLR-380-XmlPayload.patch [ 12369631 ]
          Tricia Jenkins made changes -
          Link This issue depends on SOLR-386 [ SOLR-386 ]
          Tricia Jenkins made changes -
          Attachment lucene-core-2.3-dev.jar [ 12369349 ]
          Tricia Jenkins added a comment -

           After a lengthy absence I've returned to this issue with a bit of a new perspective. I recognize that what we have described really is a customization of Solr (albeit one I have seen in at least two organizations) and as such should be built as a plug-in (http://wiki.apache.org/solr/SolrPlugins) which can reside in your solr.home lib directory. Now that Solr has Lucene 2.3 and payloads, my solution is much easier to apply than before.

          I'll try to explain it here and then attach the src, deployable jar, and example for your use/reuse.

          I assume that your structured document can be represented by xml:

          <book title="One, Two, Three">
             <page label="1">one</page>
             <page label="2">two</page>
             <page label="3">three</page>
          </book>
          

           But we don't have a tokenizer that can make sense of xml. So I wrote a tokenizer, XmlPayloadWhitespaceTokenizer, which parallels the existing WhitespaceTokenizer. XmlPayloadWhitespaceTokenizer extends XmlPayloadCharTokenizer, which does the same thing as CharTokenizer in Lucene but expects the content to be wrapped in xml tags. The tokenizer keeps track of the xpath associated with each token and stores this as a payload.
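
           To illustrate just the payload mechanics (this is not Tricia's tokenizer, only a sketch against the Lucene 2.3-era token API): a TokenFilter that stamps each token with a byte[] payload, here a fixed xpath string.

           import java.io.IOException;
           import org.apache.lucene.analysis.Token;
           import org.apache.lucene.analysis.TokenFilter;
           import org.apache.lucene.analysis.TokenStream;
           import org.apache.lucene.index.Payload;

           /** Stamps every token with the given xpath as its payload. */
           public class XPathPayloadFilter extends TokenFilter {
               private final Payload payload;

               public XPathPayloadFilter(TokenStream input, String xpath) {
                   super(input);
                   this.payload = new Payload(xpath.getBytes());
               }

               public Token next(Token result) throws IOException {
                   Token token = input.next(result);
                   if (token != null) {
                       token.setPayload(payload);
                   }
                   return token;
               }
           }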

           To use my Tokenizer in Solr, I add the deployable jar I created containing XmlPayloadWhitespaceTokenizer to my solr.home lib directory and add a structured-text field type "text_st" to my schema.xml:

          <!-- A text field that uses the XmlPayloadWhitespaceTokenizer to store xpath info about the structured document -->
            <fieldType name="text_st" class="solr.TextField" positionIncrementGap="100">
              <analyzer type="index">
                <tokenizer class="solr.XmlPayloadWhitespaceTokenizerFactory"/>
                <!-- in this example, we will only use synonyms at query time
                <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
                -->
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
                <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
              </analyzer>
              <analyzer type="query">
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
                <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
              </analyzer>
            </fieldType>
          

          I also add a field "fulltext_st" of type "text_st".

          We can visualize what happens to the input text above using the Solr Admin web-app analysis.jsp modified by SOLR-522.

           term position     1                                                2                                                3
           term text         one                                              two                                              three
           term type         word                                             word                                             word
           source start,end  3,6                                              7,10                                             11,16
           payload           /book[title='One, Two, Three']/page[label='1']   /book[title='One, Two, Three']/page[label='2']   /book[title='One, Two, Three']/page[label='3']

           Note that I've removed the hex representation of the payload for clarity.

           The other side of this problem is how to present the results in a meaningful way. Taking FacetComponent and HighlightComponent as my muse, I created a pluggable SearchComponent called PayloadComponent. This component recognizes two parameters: "payload" and "payload.fl". If payload=true, the component will find the terms from your query in the payload.fl field, retrieve the payload in these tokens, and re-combine this information to display the xpath of each search result in a given document and the number of times the term occurs in that xpath.
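
           A rough sketch of the retrieval side, using the Lucene 2.3 TermPositions payload API (simplified: unlike the real component, it aggregates counts over the whole index rather than per matched document):

           import java.io.IOException;
           import java.util.HashMap;
           import java.util.Map;
           import org.apache.lucene.index.IndexReader;
           import org.apache.lucene.index.Term;
           import org.apache.lucene.index.TermPositions;

           /** Counts occurrences of one term per payload value (e.g. per xpath). */
           public class PayloadCounter {
               public static Map<String, Integer> countByPayload(IndexReader reader, String field, String text)
                       throws IOException {
                   Map<String, Integer> counts = new HashMap<String, Integer>();
                   TermPositions tp = reader.termPositions(new Term(field, text));
                   try {
                       while (tp.next()) {
                           for (int i = 0; i < tp.freq(); i++) {
                               tp.nextPosition();
                               if (tp.isPayloadAvailable()) {
                                   byte[] data = tp.getPayload(new byte[tp.getPayloadLength()], 0);
                                   String xpath = new String(data);
                                   Integer c = counts.get(xpath);
                                   counts.put(xpath, c == null ? 1 : c.intValue() + 1);
                               }
                           }
                       }
                   } finally {
                       tp.close();
                   }
                   return counts;
               }
           }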

           Again, to use my SearchComponent in Solr, I add the deployable jar I created containing PayloadComponent to my solr.home lib directory and add a search component "payload" to my solrconfig.xml:

          <searchComponent name="payload" class="org.apache.solr.handler.component.PayloadComponent"/>
           
            <requestHandler name="/search" class="org.apache.solr.handler.component.SearchHandler">
              <lst name="defaults">
                <str name="echoParams">explicit</str>
              </lst>
              <arr name="last-components">
                <str>payload</str>
              </arr>
            </requestHandler>
          

          Then the result of http://localhost:8983/solr/search?q=came&payload=true&payload.fl=fulltext_st includes something like this:

          <lst name="payload">
           <lst name="payload_context">
            <lst name="Book.IA.0001">
             <lst name="fulltext_st">
              <int name="/book[title='Crooked Man'][url='http://ia310931.us.archive.org//load_djvu_applet.cgi?file=0/items/crookedmanotherr00newyiala/crookedmanotherr00newyiala.djvu'][author='unknown']/page[id='3']">1</int>
             </lst>
            </lst>
            <lst name="Book.IA.37729">
             <lst name="fulltext_st">
              <int name="/book[title='Charles Dicken's A Christmas Carol'][url=''][author='Dickens, Charles']/stave[title='Marley's Ghost'][id='One']/page[id='13']">1</int>
             </lst>
            </lst>
            <lst name="Book.IA.0002">
             <lst name="fulltext_st">
              <int name="/book[title='Jack and Jill and Old Dame Gill']/page[id='2']">1</int>
              <int name="/book[title='Jack and Jill and Old Dame Gill']/page[id='4']">1</int>
              <int name="/book[title='Jack and Jill and Old Dame Gill']/page[id='6']">1</int>
              <int name="/book[title='Jack and Jill and Old Dame Gill']/page[id='7']">1</int>
              <int name="/book[title='Jack and Jill and Old Dame Gill']/page[id='13']">1</int>
             </lst>
            </lst>
           </lst>
          </lst>
          

           The documents here are borrowed from the Internet Archive and can be found in the xmlpayload-example.zip attached to this issue.

          Then you have everything you need to write an xsl which will take your normal Solr results and supplement them with context from your structured document.

          There may be some issues with filters that aren't payload aware. The only one that concerned me to this point is the WordDelimiterFilter. You can find a quick and easy patch at SOLR-532.

           The other thing that you might run into if you use curl or post.jar is that the XmlUpdateRequestHandler is a bit anal about well-formed xml, and throws an exception if it finds anything but the expected <doc> and <field> tags. To work around this, either escape your structured document's xml like this:

          <add>
           <doc>
            <field name="id">0001</field>
            <field name="title">One, Two, Three</field>
            <field name="fulltext_st">
             &lt;book title="One, Two, Three"&gt;
              &lt;page label="1"&gt;one&lt;/page&gt;
              &lt;page label="2"&gt;two&lt;/page&gt;
              &lt;page label="3"&gt;three&lt;/page&gt;
             &lt;/book&gt;
            </field>
           </doc>
          </add>
          

          or hack XmlUpdateRequestHandler to accept your "unexpected XML tag doc/".
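
           For the escaping route, a minimal helper is enough (a sketch; '&' must be escaped first to avoid double-escaping):

           /** Minimal XML escaping for embedding a structured document inside a <field>. */
           public class XmlEscape {
               public static String escape(String xml) {
                   return xml.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
               }
           }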

          Cool?

          Tricia Jenkins added a comment -

           xmlpayload-src.jar contains the source files, junit tests, and ant build file for these plugins.

          jar xf xmlpayload-src.jar
          

          will unpack this.

          Tricia Jenkins made changes -
          Attachment xmlpayload-src.jar [ 12380806 ]
          Tricia Jenkins added a comment -

           xmlpayload.jar is the deployable jar that can be dropped into your solr.home lib directory (it contains only .class files).

          Tricia Jenkins made changes -
          Attachment xmlpayload.jar [ 12380807 ]
          Tricia Jenkins added a comment -

          xmlpayload-example.zip contains a specialized version of the Solr example to demonstrate the plugins.

          Tricia Jenkins made changes -
          Attachment xmlpayload-example.zip [ 12380808 ]
          Tricia Jenkins made changes -
          Link This issue depends on SOLR-386 [ SOLR-386 ]
          Tricia Jenkins made changes -
          Link This issue relates to SOLR-532 [ SOLR-532 ]
          Tricia Jenkins added a comment -

           SOLR-532 deals with the WordDelimiterFilter, which is not payload aware.
           SOLR-522 improves analysis.jsp to visualize payload-savvy tokenizers and token filters.

          Tricia Jenkins made changes -
          Link This issue relates to SOLR-522 [ SOLR-522 ]
          Erik Hatcher added a comment -

          Cool?

          Very! Wow Tricia - thanks for documenting that so thoroughly. This particular feature is sure to be of great interest to many.

          Shalin Shekhar Mangar made changes -
          Fix Version/s 1.4 [ 12313351 ]
          Laurent Hoss added a comment -

           Hi Tricia,
           Looks nice, I've been searching for such a feature in Lucene (and Solr) for years!
           But before getting too excited, I'd better ask the right questions before doing a real test, as we don't even use Solr yet (though I really want to).

           In fact we currently have a home-grown solution for a similar problem:
           In our case we want to restrict boolean searches to paragraphs or sentences of a document, and implemented this (like many others) by indexing extra docs for paragraphs etc. (with duplication of many meta-data fields of the parent document).
           Besides multiplying index size, the mapping from the found paragraphs to their base documents involved a lot of custom coding, and only recently have we implemented fast counting of the base docs for the found paragraph docs, using a 'baseDocId' FieldCache (essentially a 'group by' in SQL lingo).

           This leads to the following requirements and questions:

           • What is the performance of your PayloadComponent, compared to the standard SearchHandler?
             We especially need very fast count functionality, to dynamically compute statistics/charts needing 100's of queries.
             For this we just need the hit count of documents/paragraphs without the xpath payload info, which would otherwise generate a really big XML response for a 100K-doc result set!
           • We want to find only documents where a (boolean) query matches within one of the paragraph_* fields, and not where the query matches over the combined content of multiple paragraphs, as discussed here:
             http://www.nabble.com/Redundant-indexing-*-4-only-solution-(for-par-sen-and-case-sensitivity)-td13684315.html#a13685041 and
             http://www.nabble.com/What-is-the-best-way-to-index-xml-data-preserving-the-mark-up--td13641104.html#a13657470
             > The problem is that a search for sentence:foo AND sentence:bar is matching if foo matches in any sentence of the paragraph, and bar also matches in any sentence of the paragraph.

           Do you think this is a good option for us?
           ps: We should probably put up a Wiki page for this topic, after I've seen at least 10 people asking for possible solutions.. ok, maybe often with slightly different requirements!

           One whole other way of solving this would be to use the SpanQuery package together with the nice-looking Qsol (http://myhardshadow.com/about.php), although I'm not sure about its performance, especially with (really) long boolean queries!

          Tricia Jenkins added a comment -

          Hi Laurent,

           Thanks for your interest in my Solr PayloadComponent plugin. I want to address all of the questions you pose in your comment, but won't have time until early February. I apologize for the inconvenience, but my priorities lie elsewhere right now. Feel free to look at the code and play in the meantime. The code that's up there is basically proof of concept. I've been slowly working at improving the robustness of the code and improving performance, so hopefully there will be an improved version before the end of March.

          I'm sure there would be many people who would appreciate a Wiki page for this topic. Why don't you go ahead and set that up? I'll be happy to add my two cents when I'm available.

          All the best,
          Tricia

          Shalin Shekhar Mangar added a comment -

          Marking for 1.5

          Shalin Shekhar Mangar made changes -
          Fix Version/s 1.5 [ 12313566 ]
          Fix Version/s 1.4 [ 12313351 ]
          Shairon Toledo added a comment -

           I have a project that involves words extracted by OCR: each page has words, and each word has its geometry, used to blink a highlight for the end user.
           I've been trying to represent this document structure in xml:

          <document>
             <page num="1">
              <term top='111' bottom='222' right='333' left='444'>foo</term> 
              <term top='211' bottom='322' right='833' left='944'>bar</term> 
              <term top='311' bottom='422' right='733' left='144'>baz</term> 
              <term top='411' bottom='522' right='633' left='244'>qux</term> 
             </page>
             <page num="2">
          	<term .... />
             </page>
             
          </document>
          
          

Using the field 'fulltext_st', the content is stored as escaped XML:

          <field name="fulltext_st">
          	&lt;document &gt;
          	&lt;page top='111' bottom='222' right='333' left='444' word='foo' num='1'&gt;foo&lt;/page&gt;
          	&lt;page top='211' bottom='322' right='833' left='944' word='bar' num='1'&gt;bar&lt;/page&gt;
          	&lt;page top='311' bottom='422' right='733' left='144' word='baz' num='1'&gt;baz&lt;/page&gt;
          	&lt;page top='411' bottom='522' right='633' left='244' word='qux' num='1'&gt;qux&lt;/page&gt;
          	&lt;/document&gt;
          </field>
          

I can get all the terms in my search results with their payloads, but if I do a phrase query I can't fetch any results.

          Example:

          search?q=foo

          <lst name="fulltext_st">
          	<int name="/document/page[word='foo'][num='1'][top='111'][bottom='222'][right='333'][left='444']">1</int>
          </lst>
          

          search?q=foo+bar

          <lst name="fulltext_st">
          	<int name="/document/page[word='foo'][num='1'][top='111'][bottom='222'][right='333'][left='444']">1</int>
          	<int name="/document/page[word='baz'][num='1'][top='211'][bottom='322'][right='833'][left='944']">1</int>
          </lst>
          

search?q="foo bar"

          *nothing*
          

I was wondering if I could get your thoughts on whether xmlpayload supports this sort of thing, or how easily I could update the code to make it work.

Thank you in advance.

          Lance Norskog added a comment -

          Please ask this on solr-user. Issues are for discussing implementations.

          Lucene payloads are supported by Solr, and a rectangle per term can be stored as a payload. This allows the text to be indexed as a text field, and all queries including phrases will work as normal.
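
A minimal sketch of that approach (assuming Lucene/Solr 4.x; the RectPayloadFilter name and the term|top,bottom,left,right token convention are illustrative, not from any patch attached here). An upstream step would glue each OCR word to its coordinates; this filter then strips the coordinates off the indexed term and stores them as a 16-byte payload:

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

/** Hypothetical filter: turns "foo|111,222,444,333" into the term "foo"
 *  with a 16-byte payload holding top, bottom, left, right. */
public final class RectPayloadFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

  public RectPayloadFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    String token = termAtt.toString();
    int bar = token.indexOf('|');
    if (bar >= 0) {
      String[] coords = token.substring(bar + 1).split(",");
      byte[] buf = new byte[16];
      for (int i = 0; i < 4; i++) {
        int v = Integer.parseInt(coords[i].trim());
        buf[4 * i]     = (byte) (v >>> 24);
        buf[4 * i + 1] = (byte) (v >>> 16);
        buf[4 * i + 2] = (byte) (v >>> 8);
        buf[4 * i + 3] = (byte) v;
      }
      payloadAtt.setPayload(new BytesRef(buf));
      termAtt.setLength(bar); // index only the bare term, so phrase queries behave normally
    }
    return true;
  }
}

Because only the bare term is indexed, "foo bar" matches as an ordinary phrase, and the rectangles come back through the payload APIs at search time. Solr's stock DelimitedPayloadTokenFilterFactory covers the delimiter-splitting half out of the box, though packing four ints per term would need a custom encoder.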

          Chris Harris added a comment -

          This is an interesting patch. One current limitation seems to be that proximity search queries (PhraseQueries and SpanQueries) may result in false positives. For example, if I query

          "audit trail"~10

then I think I'd expect Solr to return only the page #s where audit and trail are near one another. (Yes, what I just said leaves some wiggle room for implementation details.) The current code, in contrast, looks like it will report all the pages where "audit" and "trail" occur, regardless of their proximity to each other.

          Has anyone thought about how to add proximity awareness?
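
A minimal sketch of one answer, using the Lucene 4.x Spans API (the field name "fulltext" is an assumption, and the page-map lookup is the one from the issue description). SpanNearQuery enumerates only genuine near-matches, so walking its spans instead of raw term positions would avoid the false positives described above:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermContext;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;

public final class NearMatchPositions {
  /** Prints the positions of true "audit trail"~10 matches in the "fulltext" field. */
  public static void walk(IndexReader reader) throws IOException {
    SpanNearQuery near = new SpanNearQuery(new SpanQuery[] {
        new SpanTermQuery(new Term("fulltext", "audit")),
        new SpanTermQuery(new Term("fulltext", "trail"))
    }, 10, false); // slop 10, unordered, like "audit trail"~10

    Map<Term, TermContext> contexts = new HashMap<Term, TermContext>();
    for (AtomicReaderContext leaf : reader.leaves()) {
      Spans spans = near.getSpans(leaf, leaf.reader().getLiveDocs(), contexts);
      while (spans.next()) {
        int doc = leaf.docBase + spans.doc();
        int pos = spans.start(); // first term position of an actual near-match
        // To report a page: find the page whose stored firstterm is the largest
        // value <= pos (binary search over the per-document page map).
        // spans.isPayloadAvailable() / spans.getPayload() would also expose any
        // per-term payloads (e.g. the rectangles discussed above) at the match.
        System.out.println("doc=" + doc + " matchStartPos=" + pos);
      }
    }
  }
}

Note this is specific to 4.x; later Lucene versions moved span enumeration onto SpanWeight.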

          Hoss Man added a comment -

          Bulk updating 240 Solr issues to set the Fix Version to "next" per the process outlined in this email...

          http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

The selection criteria were "Unresolved" with a Fix Version of 1.5, 1.6, 3.1, or 4.0. Email notifications were suppressed.

          A unique token for finding these 240 issues in the future: hossversioncleanup20100527

          Hoss Man made changes -
          Fix Version/s Next [ 12315093 ]
          Fix Version/s 1.5 [ 12313566 ]
          Hoss Man made changes -
          Fix Version/s 3.2 [ 12316172 ]
          Fix Version/s Next [ 12315093 ]
          Robert Muir added a comment -

          Bulk move 3.2 -> 3.3

          Robert Muir made changes -
          Fix Version/s 3.3 [ 12316471 ]
          Fix Version/s 3.2 [ 12316172 ]
          Robert Muir made changes -
          Fix Version/s 3.4 [ 12316683 ]
          Fix Version/s 4.0 [ 12314992 ]
          Fix Version/s 3.3 [ 12316471 ]
          Robert Muir added a comment -

          3.4 -> 3.5

          Robert Muir made changes -
          Fix Version/s 3.5 [ 12317876 ]
          Fix Version/s 3.4 [ 12316683 ]
          Simon Willnauer made changes -
          Fix Version/s 3.6 [ 12319065 ]
          Fix Version/s 3.5 [ 12317876 ]
          Hoss Man added a comment -

Bulk move of fixVersion=3.6 -> fixVersion=4.0 for issues that have no assignee and have not been updated recently.

Email notifications suppressed to prevent mass spam.
Pseudo-unique token identifying these issues: hoss20120321nofix36

          Hoss Man made changes -
          Fix Version/s 3.6 [ 12319065 ]
          Robert Muir made changes -
          Fix Version/s 4.1 [ 12321141 ]
          Fix Version/s 4.0 [ 12314992 ]
          Mark Miller made changes -
          Fix Version/s 4.2 [ 12323893 ]
          Fix Version/s 5.0 [ 12321664 ]
          Fix Version/s 4.1 [ 12321141 ]
          Robert Muir made changes -
          Fix Version/s 4.3 [ 12324128 ]
          Fix Version/s 5.0 [ 12321664 ]
          Fix Version/s 4.2 [ 12323893 ]
          Tricia Jenkins made changes -
          Link This issue relates to SOLR-4722 [ SOLR-4722 ]
          Uwe Schindler made changes -
          Fix Version/s 4.4 [ 12324324 ]
          Fix Version/s 4.3 [ 12324128 ]
          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Steve Rowe made changes -
          Fix Version/s 5.0 [ 12321664 ]
          Fix Version/s 4.5 [ 12324743 ]
          Fix Version/s 4.4 [ 12324324 ]
          Adrien Grand made changes -
          Fix Version/s 4.6 [ 12325000 ]
          Fix Version/s 5.0 [ 12321664 ]
          Fix Version/s 4.5 [ 12324743 ]
          Uwe Schindler made changes -
          Fix Version/s 4.7 [ 12325573 ]
          Fix Version/s 4.6 [ 12325000 ]
          David Smiley made changes -
          Fix Version/s 4.8 [ 12326254 ]
          Fix Version/s 4.7 [ 12325573 ]
          Uwe Schindler added a comment -

          Move issue to Solr 4.9.

          Uwe Schindler made changes -
          Fix Version/s 4.9 [ 12326731 ]
          Fix Version/s 5.0 [ 12321664 ]
          Fix Version/s 4.8 [ 12326254 ]

            People

• Assignee:
  Unassigned
• Reporter:
  Tricia Jenkins
• Votes:
  7
• Watchers:
  12
