Solr / SOLR-3439

Make SolrCell easier to use out of the box

    Details

      Description

      Currently, SolrCell is configured to map Tika "content" (the main body of a document) to the "text" field which is the indexed-only (not stored) catch-all for default queries. That searches fine, but doesn't show the document content in the results, sometimes leading users to think that something is wrong. Sure, the user can easily add the field (and this is documented), but it would be a better user experience to have such a basic feature work right out of the box without any config editing and without the need for the user to read the fine print in the documentation.

      I propose that we add the "content" field to the example schema in the section of fields already defined to support SolrCell metadata.

      1. SOLR-3439.patch
        53 kB
        Jan Høydahl
      2. filetypes.zip
        107 kB
        Jan Høydahl
      3. SOLR-3439.patch
        53 kB
        Jan Høydahl
      4. SOLR-3439.patch
        19 kB
        Jan Høydahl
      5. SOLR-3439.patch
        18 kB
        Jan Høydahl
      6. SOLR-3439.patch
        16 kB
        Jan Høydahl
      7. SOLR-3439.patch
        16 kB
        Jan Høydahl
      8. SOLR-3439.patch
        2 kB
        Jack Krupansky
      9. Lincoln-Gettysburg-Address.pdf
        196 kB
        Jack Krupansky
      10. Lincoln-Gettysburg-Address.docx
        12 kB
        Jack Krupansky

        Issue Links

          Activity

          Uwe Schindler made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Jan Høydahl added a comment -

          Any suggestions for what we should tell people to index to test SolrCell? I think the most fun is indexing my own docs folders. I was thinking that instead of bundling some synthetic docs in exampledocs, we could use a dump of the web site/wiki, javadocs or some other real docs?

          Jan Høydahl made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Jan Høydahl added a comment -

          Committed r1369433 to trunk and r1369478 to branch_4x

          Jan Høydahl committed 1369478 (162 files)
          Reviews: none

          SOLR-3439: Make SolrCell easier to use out of the box (merge from trunk)

          Lucene branch_4x
          Jan Høydahl committed 1369476 (1 file)
          Jan Høydahl committed 1369433 (102 files)
          Reviews: none

          SOLR-3439: Make SolrCell easier to use out of the box

          Lucene trunk
          Jan Høydahl made changes -
          Attachment SOLR-3439.patch [ 12539129 ]
          Jan Høydahl added a comment -

          Updated patch

          • Adds a "url" field to schema intended for HTML/web docs. Displayed in result if found
          • If "url" field is filled, it is used as href on the title link, else fallback to file:///resourcename or to plain "id"
          • Detects file type from content_type field with fallback to filename suffix

          Will commit shortly
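
The fallback rules in the bullets above can be sketched roughly as follows. This is only an illustration, not the Velocity template code from the patch: the function names, the `MIME_TO_TYPE` table, and the way the `file:///` link is assembled are assumptions made for the example. Only the field names `url`, `resourcename`, `content_type` and `id` come from the thread.

```python
import os

# Hypothetical MIME-type table for illustration; the patch may map types differently.
MIME_TO_TYPE = {
    "application/pdf": "pdf",
    "application/msword": "doc",
    "application/vnd.ms-excel": "xls",
    "application/vnd.ms-powerpoint": "ppt",
}


def detect_file_type(doc):
    """Return a short file-type label: content_type first, filename suffix as fallback."""
    content_type = doc.get("content_type")
    # The field may be multi-valued; use the first value if so.
    if isinstance(content_type, list):
        content_type = content_type[0] if content_type else None
    if content_type:
        # Drop parameters such as "; charset=UTF-8" before the lookup.
        mime = content_type.split(";")[0].strip().lower()
        if mime in MIME_TO_TYPE:
            return MIME_TO_TYPE[mime]
    # Fallback: the suffix of the stored file name, if any.
    name = doc.get("resourcename") or ""
    suffix = os.path.splitext(name)[1].lstrip(".").lower()
    return suffix or None


def title_href(doc):
    """Pick the link target: url, else file:///resourcename, else the plain id."""
    if doc.get("url"):
        return doc["url"]
    if doc.get("resourcename"):
        # Assumption: strip leading slashes so the prefix is not doubled.
        return "file:///" + doc["resourcename"].lstrip("/")
    return doc.get("id")
```

A document with neither a known `content_type` nor a `resourcename` yields no type label at all, in which case a template would presumably show no icon.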

          Jan Høydahl made changes -
          Attachment filetypes.zip [ 12537926 ]
          Jan Høydahl added a comment -

          If you want to apply the patch with the beautiful icons then unzip filetypes.zip in your root and it will place the imgs in ./solr/webapp/web/img/filetypes/

          Jan Høydahl made changes -
          Attachment SOLR-3439.patch [ 12537922 ]
          Jan Høydahl added a comment - - edited

          Found another file icon set in the public domain which I've included in /solr/img/filetypes. They are smaller and nicer (only 721 kb altogether), see http://www.splitbrain.org/projects/file_icons

          Any other comments before commit?

          Erik Hatcher added a comment -

          > Anyone have a problem with sourcing the file-type icons over the internet via http?

          Yes. We shouldn't be pulling in things remotely for the UI. Very often this stuff is used behind firewalls or offline.

          I haven't applied this patch to see what these icons are exactly, but surely there are freely available ones we can use.

          Jan Høydahl added a comment -

          Anyone have a problem with sourcing the file-type icons over the internet via http?

          Best would be to include in "webapp/img/fileicons", but I'm not sure we're allowed to distribute them since they are under the AGPL: https://github.com/teambox/Free-file-icons

          Jan Høydahl made changes -
          Attachment SOLR-3439.patch [ 12537783 ]
          Jan Høydahl added a comment -

          Cosmetic fixes to icons for docx and pptx, as well as file:/// prefix for solrcell files

          Think this is getting ready for committing?

          Jan Høydahl made changes -
          Link This issue relates to SOLR-3672 [ SOLR-3672 ]
          Jan Høydahl made changes -
          Attachment SOLR-3439.patch [ 12537567 ]
          Jan Høydahl added a comment -

          New patch:

          • Uses filename from resource.name -> new field "resourcename", with copyField to "text" and included in qf
          • Handles HTML escaping of toggle all fields
          • Show file type icon before title, first detected from filename, then from contenttype
          • Do not show author, content_type, resourcename if empty
          • Refactored the "toggle explain/allFields" section into own file
          Jan Høydahl added a comment -

          > 1. Any reason to limit it to 5.0 and not backport to 4.0?

          It is already marked with 4.0 and 5.0.

          > The one hope I hold out is that maybe we should modify the post tool to recognize that the file type is not ".xml" and then send rich documents to SolrCell with an explicit literal to initialize the "filename" field - which itself needs to be added.

          I have a new patch using the result from resource.name, which is the official way to send the file name to the ExtractingRequestHandler (ERH). It propagates out as Tika metadata resourceName, which is then lowercased to the field resourcename.

          > It would be nice to include my sample Word and PDF documents, or other equivalent sample rich documents

          Agree. There should be an exampledocs folder with rich docs. Or that we simply describe in the tutorial how to index Solr's documentation as PDFs and JavaDocs from HTML.
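
As a rough illustration of passing the file name through resource.name, the sketch below only builds the request URL for the extract handler. The base URL, the `extract_url` helper, and the choice to send just the base name of the path are assumptions for the example; `resource.name` and `literal.id` are the request parameters discussed in this thread and in the SolrCell documentation.

```python
import os
from urllib.parse import urlencode


def extract_url(solr_base, doc_id, file_path):
    """Build an /update/extract URL that passes the file name via resource.name."""
    params = {
        "literal.id": doc_id,                          # unique key for the new document
        "resource.name": os.path.basename(file_path),  # file name hint handed on to Tika
        "commit": "true",
    }
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)


# Assumed local example instance; adjust host, port and core path as needed.
print(extract_url("http://localhost:8983/solr", "doc1",
                  "docs/Lincoln-Gettysburg-Address.pdf"))
```

The file itself would then be sent as the request body to that URL, for example with curl or the post tool.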

          Jack Krupansky added a comment -

          I haven't actually tried the new patch (and may not be able to until next weekend), but looking at the patch itself, overall it looks like it is headed in the right direction. A couple of quick comments:

          1. Any reason to limit it to 5.0 and not backport to 4.0?
          2. I beat my head against the wall hoping to get the file name automatically, but if you stream a file via curl or something similar that info is not passed along. The one hope I hold out is that maybe we should modify the post tool to recognize that the file type is not ".xml" and then send rich documents to SolrCell with an explicit literal to initialize the "filename" field - which itself needs to be added.
          3. Feel free to give yourself equal attribution since you have done so much additional work.
          4. It would be nice to include my sample Word and PDF documents, or other equivalent sample rich documents to include in the exampledocs directory since we don't have any readily accessible rich documents (a couple may be "hidden" elsewhere.)
          5. I'll volunteer to do at least some of the wiki update once the patch is committed (or there is at least agreement to commit it.)

          Jan Høydahl made changes -
          Attachment SOLR-3439.patch [ 12537532 ]
          Jan Høydahl added a comment -

          New patch with hl.encoder=html which fixes the html encoding of fallback field

          Jan Høydahl made changes -
          Fix Version/s 5.0 [ 12321664 ]
          Jan Høydahl made changes -
          Attachment SOLR-3439.patch [ 12537531 ]
          Jan Høydahl added a comment -

          New patch with these improvements:

          • The new "content" field is now indexed="false", for performance reasons - you can always search using "text"
          • Included changes to /browse RH:
            • Added the SolrCell fields to qf
            • Added facets for author and content_type
            • Turned on highlighting for "content"
          • Changes to Velocity templates
            • Detects whether result doc is product, product-join doc or rich-text doc
            • The richtext display shows the "title" instead of "name", with fallback to ID if title is missing
            • We display a nice little icon for PDF, DOC, PPT, XLS
            • For rich-text, we display highlighted content field, with HTML-encoded fallback if no hits
            • Fixed #field() macro to display all snippets of highlighting and to HTML-encode fallback result
            • Hide facets for which there are no results

          I have tested with a mix of office docs and the other example docs and it looks nice here. Please test it.

          Todo:

          • It would be natural to display file name for SolrCell docs - where should we pick it from?
          • Should fix SOLR-2730 to avoid HTMLencoding hack in template
          • Should download the filetype graphics locally instead of linking to github..
          Jan Høydahl made changes -
          Summary
            Original Value: Add "content" field to example schema to make SolrCell easier to use out of the box
            New Value: Make SolrCell easier to use out of the box
          Description
            Original Value:
            Currently, SolrCell is configured to map Tika "content" (the main body of a document) to the "text" field which is the indexed-only (not stored) catch-all for default queries. That searches fine, but doesn't show the document content in the results, sometimes leading users to think that something is wrong. Sure, the user can easily add the field (and this is documented), but it would be a better user experience to have such a basic feature work right out of the box without any config editing and without the need for the user to read the fine print in the documentation.

            I propose that we add the "content" field to the example schema in the section of fields already defined to support SolrCell metadata. It would be stored and indexed.

            I further propose that a copyField be added for the "title", "description", (and maybe a couple of others) and "content" fields to add them to the "text" field for searching. Again, trying to improve the out of the box user experience. It also simplifies testing - less setup.
            New Value:
            Currently, SolrCell is configured to map Tika "content" (the main body of a document) to the "text" field which is the indexed-only (not stored) catch-all for default queries. That searches fine, but doesn't show the document content in the results, sometimes leading users to think that something is wrong. Sure, the user can easily add the field (and this is documented), but it would be a better user experience to have such a basic feature work right out of the box without any config editing and without the need for the user to read the fine print in the documentation.

            I propose that we add the "content" field to the example schema in the section of fields already defined to support SolrCell metadata.
          Jan Høydahl made changes -
          Assignee Jan Høydahl [ janhoy ]
          Jan Høydahl added a comment -

          I propose we do not wait for SOLR-3442 but use the solution from the proposed patch. It hugely improves the ootb experience for indexing ordinary full-text documents, and it is non-disruptive to the "products" example.

          Hoss Man made changes -
          Fix Version/s 4.0 [ 12322455 ]
          Fix Version/s 4.0-ALPHA [ 12314992 ]
          Hoss Man added a comment -

          Bulk fixing the version info for 4.0-ALPHA and 4.0; all affected issues have "hoss20120711-bulk-40-change" in a comment.

          Jack Krupansky added a comment -

          Thinking about the overall intent of example, including raw performance, and the issues of trying to make it one-size-fits-all, I've semi-convinced myself to semi-withdraw this proposal. I still think it's a good idea, but it does have drawbacks that make it less appealing than I first thought. So, unless more voices cry out for it, I'll abandon it.

          I might offer up a revised proposal to add a commented out field definition in example to indicate what is needed to make example fully functional for SolrCell.

          Jack Krupansky added a comment -

          Based on the discussion here and on SOLR-3442, I would offer two alternative proposals:

          1. If SOLR-3442 is implemented (default user query parser in example becomes edismax), add the "content" field as stored and indexed, add "content" to the edismax "qf", but don't add the copyField(s).

          2. If SOLR-3442 is NOT implemented, add the "content" field as stored but NOT indexed, and add the copyField ("content" to "text"). Regardless of query parser, this will assure that "content" is both searchable and returnable, but without "double indexing".

          I'll wait a bit to see how SOLR-3442 evolves. But if it doesn't look likely in a reasonable timeframe, I'll revise my patch for alternative #2 which provides the desired functionality with minimal impact.

          But for now, I'll assume that SOLR-3442 is the more likely and preferable approach.
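
For reference, alternative #2 above amounts to two schema entries: a "content" field that is stored but not indexed, plus a copyField from "content" to "text". The sketch below merely generates those two elements so their attributes are explicit. The `content_field_elements` helper and the `text_general` field type are assumptions for illustration; the thread does not say which type the patch uses.

```python
import xml.etree.ElementTree as ET


def content_field_elements(field_type="text_general"):
    """Build the two schema elements for alternative #2 (field type is assumed)."""
    # Stored so it can be returned and highlighted; not indexed, to avoid double indexing.
    field = ET.Element("field", {
        "name": "content",
        "type": field_type,
        "indexed": "false",
        "stored": "true",
    })
    # Searching still works because the text is copied into the catch-all field.
    copy = ET.Element("copyField", {"source": "content", "dest": "text"})
    return field, copy


for element in content_field_elements():
    print(ET.tostring(element, encoding="unicode"))
```

Alternative #1 would instead set `indexed` to true, drop the copyField, and list "content" in the edismax `qf`.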

          Jan Høydahl added a comment -

          > That said, I am a little reluctant to change the overall pattern/approach simply to add one field. Maybe the pattern change should be a separate issue.

          SOLR-3442

          Jack Krupansky added a comment -

          The concept of copyField is implicitly a judgment that a query of the merged fields is significantly better than the dismax query of the separate fields. But, is that really the case?

          And it is common to boost various document components differently, such as the title.

          That said, I am a little reluctant to change the overall pattern/approach simply to add one field. Maybe the pattern change should be a separate issue.

          Jan Høydahl added a comment -

          Really, the copyField thing in today's example schema is an anti-pattern since we teach people to duplicate all their content while most people would be better off using DisMax. I have had several customers who build their whole search on the model from example schema and then get into performance problems due to the 2x index increase.

          How would you feel if we instead get rid of all the copyFields and configure the default handler with &defType=edismax&qf=name,features,manu,content.... Then we can leave a copyField section commented out in the schema with an explanation of what use cases it is good for.
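
A minimal sketch of what such request parameters could look like when built client-side. The `edismax_query` helper is hypothetical, and the field list is just the one named in the comment above. The comment writes the list with commas; this sketch joins the fields with spaces, which is how `qf` is normally written as far as I know, so treat that detail as an assumption.

```python
from urllib.parse import urlencode


def edismax_query(q, fields):
    """Return a query string that searches several fields via edismax instead of a copyField."""
    params = {
        "q": q,
        "defType": "edismax",
        "qf": " ".join(fields),  # assumed whitespace-separated field list
    }
    return urlencode(params)


print(edismax_query("gettysburg", ["name", "features", "manu", "content"]))
```

In the proposal these values would live as defaults on the request handler in solrconfig.xml rather than being sent with every request.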

          Jack Krupansky made changes -
          Attachment SOLR-3439.patch [ 12525780 ]
          Jack Krupansky added a comment -

          Preliminary patch. "content" is both stored and indexed, with multiple copy fields.

          Jack Krupansky added a comment -

          Right, so if it is the double indexing that is a serious concern, maybe having "content" stored but not indexed is a reasonable compromise. It would be searchable due to the copyField but not double-indexed. This would still give a reasonably friendly out of the box experience (default search works and content is returned), and obviously they can hand-tune for more specific control.

          But if "content" is stored but not indexed, the user can't simply add "content" to "qf" - they need to make it indexed, which is what my preliminary patch does.

          Yonik Seeley added a comment -

          > For non-SolrCell applications, will copyField of the empty "content" field be a significant performance drag?

          No, but if it's used, it can be a big performance drag (indexing content twice). I'm not sure how important it is to be searched by "default"... i.e. with edismax, someone would just need to add "content" to the qf parameter.

          Jack Krupansky added a comment -

          We could have the copyFields default to being commented out, but then the "content" would not be searched by default. Or we could not index the "content" field, but then it can't be searched by itself.

          For non-SolrCell applications, will copyField of the empty "content" field be a significant performance drag?

          Or is it only the apps that use SolrCell where there are concerns about the copyField impact?

          I agree that performance should be a consideration, but I suspect that these couple of copyFields (I'll post the preliminary patch as soon as the tests finish running) are small potatoes in the overall performance picture.

          Yonik Seeley added a comment -

          I agree with adding a stored content field, but I don't think we should add any more copyFields.
          One of the biggest "out of the box" experience items that people make their decision based on is performance - so we shouldn't make the example schema/config slower.

          Jack Krupansky made changes -
          Attachment Lincoln-Gettysburg-Address.pdf [ 12525777 ]
          Attachment Lincoln-Gettysburg-Address.docx [ 12525776 ]
          Jack Krupansky added a comment -

          Test documents for SolrCell. Both have a bunch of metadata fields defined. The PDF was generated from the Word doc.

          We can consider them for inclusion in exampledocs, but for now they are posted here for reference and anybody wanting to test this issue.

          Jack Krupansky added a comment -

          I'll post a preliminary patch tomorrow.

          Jan Høydahl made changes -
          Field Original Value New Value
          Fix Version/s 4.0 [ 12314992 ]
          Jan Høydahl added a comment -

          I agree that this makes sense, and will not have any cost.

          We could also make the Velocity GUI smart enough to detect whether the document is a "product" document, and output name, manufacturer, price, inStock etc.. OR whether it is a Tika doc or HTML in which case it prints the title, dynamic teaser, document size, document type/MIME etc.

          Finally we could add some PDFs to the exampledocs folder!

          Do you want to attempt a first patch?

          Jack Krupansky created issue -

            People

            • Assignee:
              Jan Høydahl
              Reporter:
              Jack Krupansky
            • Votes:
              1
              Watchers:
              3
