SOLR-380 (Solr)

There's no way to convert search results into page-level hits of a "structured document".

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 4.9, Trunk
    • Component/s: search
    • Labels:
      None

      Description

      "Paged-Text" FieldType for Solr

      A chance to dig into the guts of Solr. The problem: if we index a monograph in Solr, there is no way to convert search results into page-level hits. The solution: a "paged-text" fieldtype that keeps track of page divisions as it indexes and reports page-level hits in the search results.

      The input would contain page milestones: <page id="234"/>. As Solr processed the tokens (using its standard tokenizers and filters), it would concurrently build a structural map of the item, indicating which term position marked the beginning of which page: <page id="234" firstterm="14324"/>. This map would be stored in an unindexed field in some efficient format.
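
The indexing step described above can be sketched in a few lines of Python. This is only an illustration, not the attached patch: it assumes a plain whitespace tokenizer in place of Solr's analysis chain, and the names `PAGE_MILESTONE` and `build_page_map` are hypothetical. It records, for each page milestone, the position of the first term that follows it.

```python
import re

# Milestone syntax as given in the description: <page id="234"/>
PAGE_MILESTONE = re.compile(r'<page\s+id="([^"]+)"\s*/>')


def build_page_map(text):
    """Return (tokens, page_map).

    page_map is a list of (page_id, first_term_position) pairs in
    document order. Positions count whitespace-separated tokens,
    which is an assumption; real analyzers may assign positions
    differently.
    """
    tokens = []
    page_map = []
    # Splitting on a capturing group keeps the milestones in the output.
    for part in re.split(r'(<page\s+id="[^"]+"\s*/>)', text):
        milestone = PAGE_MILESTONE.fullmatch(part)
        if milestone:
            # The next token to be emitted is the first term of this page.
            page_map.append((milestone.group(1), len(tokens)))
        else:
            tokens.extend(part.lower().split())
    return tokens, page_map


if __name__ == "__main__":
    sample = '<page id="1"/>The quick brown <page id="2"/>fox jumps'
    print(build_page_map(sample))
```

In this sketch the map is a plain list of pairs; the "efficient format" for storing it is left open, as it is in the description.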

      At search time, Solr would retrieve term positions for all hits that are returned in the current request, and use the stored map to determine page ids for each term position. The results would imitate the results for highlighting, something like:

      <lst name="pages">
        <lst name="doc1">
          <int name="pageid">234</int>
          <int name="pageid">236</int>
        </lst>
        <lst name="doc2">
          <int name="pageid">19</int>
        </lst>
      </lst>
      <lst name="hitpos">
        <lst name="doc1">
          <lst name="234">
            <int name="pos">14325</int>
          </lst>
        </lst>
        ...
      </lst>
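
The search-time lookup could work roughly as follows. This is a sketch under assumptions, not the behaviour of any existing Solr component: the page map is taken to be a list of (page id, first term position) pairs sorted by position, the hit positions are assumed to be already available per document, and the function names are hypothetical. A binary search finds the last page whose first term is at or before the hit position.

```python
from bisect import bisect_right


def page_for_position(page_map, pos):
    """Return the id of the page containing term position pos.

    page_map is a list of (page_id, first_term_position) pairs sorted
    by position. Returns None if pos precedes the first milestone.
    """
    first_terms = [first for _, first in page_map]
    i = bisect_right(first_terms, pos) - 1
    return page_map[i][0] if i >= 0 else None


def page_hits(page_maps, hit_positions):
    """Group hit positions by page for each document.

    page_maps:     {doc_id: page_map}
    hit_positions: {doc_id: [term positions of hits]}
    The result mirrors the "pages" and "hitpos" lists in the sample
    response above.
    """
    pages, hitpos = {}, {}
    for doc_id, positions in hit_positions.items():
        by_page = {}
        for pos in sorted(positions):
            page_id = page_for_position(page_maps[doc_id], pos)
            if page_id is not None:
                by_page.setdefault(page_id, []).append(pos)
        pages[doc_id] = list(by_page)
        hitpos[doc_id] = by_page
    return {"pages": pages, "hitpos": hitpos}


if __name__ == "__main__":
    # The first-term values other than 14324 are invented for illustration.
    maps = {
        "doc1": [("234", 14324), ("235", 14600), ("236", 14910)],
        "doc2": [("19", 0)],
    }
    hits = {"doc1": [14325, 15000], "doc2": [7]}
    print(page_hits(maps, hits))
```

Because the map is sorted by first term position, each lookup is logarithmic in the number of pages, so the cost per request is dominated by retrieving the hit positions themselves.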

      1. SOLR-380-XmlPayload.patch
        92 kB
        Tricia Jenkins
      2. SOLR-380-XmlPayload.patch
        155 kB
        Tricia Jenkins
      3. xmlpayload.jar
        10 kB
        Tricia Jenkins
      4. xmlpayload-example.zip
        8.55 MB
        Tricia Jenkins
      5. xmlpayload-src.jar
        5.74 MB
        Tricia Jenkins

        Issue Links

          Activity

          Tricia Jenkins created issue -
          Tricia Jenkins made changes -
          Field Original Value New Value
          Summary
            Original Value: The problem: If we index a monograph in Solr, there's no way to convert search results into page-level hits. The solution: have a "paged-text" fieldtype which keeps track of page divisions as it indexes, and reports page-level hits in the search results.
            New Value: There's no way to convert search results into page-level hits of a "structured document".
          Description Rewrapped only: the original value was the Description above with email-style ">" quoting and hard line breaks; the new value is the same text without them.
          Peter Binkley made changes -
          Description Whitespace only: the indentation of the sample XML was replaced with &nbsp; entities; the text is the same as the Description above.
          Tricia Jenkins made changes -
          Attachment lucene-core-2.3-dev.jar [ 12369349 ]
          Attachment SOLR-380-XmlPayload.patch [ 12369348 ]
          Tricia Jenkins made changes -
          Attachment SOLR-380-XmlPayload.patch [ 12369631 ]
          Tricia Jenkins made changes -
          Link This issue depends on SOLR-386 [ SOLR-386 ]
          Tricia Jenkins made changes -
          Attachment lucene-core-2.3-dev.jar [ 12369349 ]
          Tricia Jenkins made changes -
          Attachment xmlpayload-src.jar [ 12380806 ]
          Tricia Jenkins made changes -
          Attachment xmlpayload.jar [ 12380807 ]
          Tricia Jenkins made changes -
          Attachment xmlpayload-example.zip [ 12380808 ]
          Tricia Jenkins made changes -
          Link This issue depends on SOLR-386 [ SOLR-386 ]
          Tricia Jenkins made changes -
          Link This issue relates to SOLR-532 [ SOLR-532 ]
          Tricia Jenkins made changes -
          Link This issue relates to SOLR-522 [ SOLR-522 ]
          Shalin Shekhar Mangar made changes -
          Fix Version/s 1.4 [ 12313351 ]
          Shalin Shekhar Mangar made changes -
          Fix Version/s 1.5 [ 12313566 ]
          Fix Version/s 1.4 [ 12313351 ]
          Hoss Man made changes -
          Fix Version/s Next [ 12315093 ]
          Fix Version/s 1.5 [ 12313566 ]
          Hoss Man made changes -
          Fix Version/s 3.2 [ 12316172 ]
          Fix Version/s Next [ 12315093 ]
          Robert Muir made changes -
          Fix Version/s 3.3 [ 12316471 ]
          Fix Version/s 3.2 [ 12316172 ]
          Robert Muir made changes -
          Fix Version/s 3.4 [ 12316683 ]
          Fix Version/s 4.0 [ 12314992 ]
          Fix Version/s 3.3 [ 12316471 ]
          Robert Muir made changes -
          Fix Version/s 3.5 [ 12317876 ]
          Fix Version/s 3.4 [ 12316683 ]
          Simon Willnauer made changes -
          Fix Version/s 3.6 [ 12319065 ]
          Fix Version/s 3.5 [ 12317876 ]
          Hoss Man made changes -
          Fix Version/s 3.6 [ 12319065 ]
          Robert Muir made changes -
          Fix Version/s 4.1 [ 12321141 ]
          Fix Version/s 4.0 [ 12314992 ]
          Mark Miller made changes -
          Fix Version/s 4.2 [ 12323893 ]
          Fix Version/s 5.0 [ 12321664 ]
          Fix Version/s 4.1 [ 12321141 ]
          Robert Muir made changes -
          Fix Version/s 4.3 [ 12324128 ]
          Fix Version/s 5.0 [ 12321664 ]
          Fix Version/s 4.2 [ 12323893 ]
          Tricia Jenkins made changes -
          Link This issue relates to SOLR-4722 [ SOLR-4722 ]
          Uwe Schindler made changes -
          Fix Version/s 4.4 [ 12324324 ]
          Fix Version/s 4.3 [ 12324128 ]
          Steve Rowe made changes -
          Fix Version/s 5.0 [ 12321664 ]
          Fix Version/s 4.5 [ 12324743 ]
          Fix Version/s 4.4 [ 12324324 ]
          Adrien Grand made changes -
          Fix Version/s 4.6 [ 12325000 ]
          Fix Version/s 5.0 [ 12321664 ]
          Fix Version/s 4.5 [ 12324743 ]
          Uwe Schindler made changes -
          Fix Version/s 4.7 [ 12325573 ]
          Fix Version/s 4.6 [ 12325000 ]
          David Smiley made changes -
          Fix Version/s 4.8 [ 12326254 ]
          Fix Version/s 4.7 [ 12325573 ]
          Uwe Schindler made changes -
          Fix Version/s 4.9 [ 12326731 ]
          Fix Version/s 5.0 [ 12321664 ]
          Fix Version/s 4.8 [ 12326254 ]

            People

            • Assignee:
              Unassigned
              Reporter:
              Tricia Jenkins
            • Votes:
              4
              Watchers:
              11
