Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4
    • Component/s: update
    • Labels: None

      Description

      I have developed a RichDocumentRequestHandler, based on the CSVRequestHandler, that supports streaming a PDF, Word, PowerPoint, or Excel document into Solr.

      There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
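
      A typical indexing request (using the parameters documented on that wiki page) looks like:

      http://localhost:8983/solr/update/rich?stream.file=myfile.doc&stream.type=doc&id=100&stream.fieldname=text&fieldnames=subject,author&subject=mysubject&author=eric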

      Attachments

      1. libs.zip
        4.78 MB
        Eric Pugh
      2. rich.patch
        81 kB
        Chris Harris
      3. rich.patch
        79 kB
        Chris Harris
      4. rich.patch
        79 kB
        Chris Harris
      5. rich.patch
        79 kB
        Chris Harris
      6. rich.patch
        404 kB
        Chris Harris
      7. rich.patch
        68 kB
        Chris Harris
      8. rich.patch
        4 kB
        Eric Pugh
      9. schema_update.patch
        3 kB
        Yonik Seeley
      10. SOLR-284.patch
        5 kB
        Grant Ingersoll
      11. SOLR-284.patch
        33 kB
        Yonik Seeley
      12. SOLR-284.patch
        124 kB
        Chris Harris
      13. SOLR-284.patch
        127 kB
        Chris Harris
      14. SOLR-284.patch
        124 kB
        Chris Harris
      15. SOLR-284.patch
        123 kB
        Chris Harris
      16. SOLR-284.patch
        123 kB
        Chris Harris
      17. SOLR-284.patch
        134 kB
        Grant Ingersoll
      18. SOLR-284.patch
        127 kB
        Grant Ingersoll
      19. SOLR-284-no-key-gen.patch
        6 kB
        Grant Ingersoll
      20. solr-word.pdf
        21 kB
        Grant Ingersoll
      21. source.zip
        17 kB
        Eric Pugh
      22. test.zip
        8 kB
        Eric Pugh
      23. test-files.zip
        1022 kB
        Chris Harris
      24. test-files.zip
        1.01 MB
        Eric Pugh
      25. un-hardcode-id.diff
        4 kB
        Chris Harris


          Activity

          Eric Pugh added a comment -

          Patch file for adding new handler and test cases.

          Eric Pugh added a comment -

          test files to go in test/test-files for unit testing.

          Ryan McKinley added a comment -

          I haven't run this patch, but have a few questions...

          What is the general approach for extracting a Lucene document (a list of fields) from a PDF? Word? PowerPoint?

          Is this just access to a few common fields like author, keywords, text, etc? Is this something that realistically would need to be custom for each case?

          Perhaps it makes sense to add a contrib section for this sort of stuff. It seems weird to add 10 library dependencies to the core distribution. How does Nutch handle this?

          Eric Pugh added a comment -

          new jars to go in trunk/lib for pdf and office parsing...

          Eric Pugh added a comment -

          So, I was not attempting to "boil the ocean" and provide the ultimate solution. Our need was just to take all the raw text and index it in a field, and pass in a bunch of other data fields to be indexed.

          We are parsing a large number of unstructured documents, that may or may not have common fields populated, but fortunately we don't really need them. Our users aren't searching by author, but by content.

          I think there are only 5 additional libraries, and one (poi-scratchpad) may be able to be removed...

          Yonik also mentioned using Tika, as a framework for creating a common interface to these types of rich documents, but Tika is still in incubation and has no code in it!

          I originally had separate handlers for each data type, and that was really icky, so I condensed it into the RichDocumentRequestHandler. I could also merge the CSVRequestHandler into it by pulling the CSV-parsing logic out into a CSVParser. However, the CSVRequestHandler has very complex and rich semantics that these unstructured documents don't really need.

          Eric Pugh added a comment -

          Updated patch file, properly handling missing stream.types, and cleaning up error messages a bit.

          Eric Pugh added a comment -

          Updated to SVN revision 555996

          Eric Pugh added a comment -

          Update patches for revision 572774

          Eric Pugh added a comment -

          Java Source code for RichDocumentRequestHandler and friends.

          Eric Pugh added a comment -

          add the test code for richdocumenthandler.

          Eric Pugh added a comment -

          test code, this time with granted license!

          Grant Ingersoll added a comment -

          In regards to Tika not having any code, you may also find http://aperture.sourceforge.net does many of the same things for handling different file formats, etc.

          Juri Kuehn added a comment -

          Hi Eric, thank you for this handler; it works like a charm!
          I need to use non-numeric ids, which are fine with Solr but are rejected by RichDocumentRequestHandler. I'm not familiar with the Solr code, so I patched RichDocumentRequestHandler.java not to convert the id to an int, which hasn't caused trouble so far:

          RichDocumentRequestHandler.java.patch
          Index: RichDocumentRequestHandler.java
          ===================================================================
          --- RichDocumentRequestHandler.java	(revision 0)
          +++ RichDocumentRequestHandler.java	(working copy)
          @@ -133,7 +133,7 @@
           	String streamFieldname;
           	String[] fieldnames;
           	SchemaField[] fields;
          -	int id;
          +	String id;
           	  
           	final AddUpdateCommand templateAdd;
           
          @@ -153,7 +153,7 @@
           	    String fn = params.get(FIELDNAMES);
           	    fieldnames = fn != null ? commaSplit.split(fn,-1) : null;
           	    
          -	    id = params.getInt(ID);
          +	    id = params.get(ID);
           
           		templateAdd = new AddUpdateCommand();
           		templateAdd.allowDups = false;
          @@ -202,7 +202,7 @@
           	 * @param desc
           	 *            TODO
           	 */
          -	void doAdd(int id, String text, DocumentBuilder builder, AddUpdateCommand template)
          +	void doAdd(String id, String text, DocumentBuilder builder, AddUpdateCommand template)
           	throws IOException {
           
           	  // first, create the lucene document
          @@ -225,7 +225,7 @@
           	  handler.addDoc(template);
           	}
           
          -	void addDoc(int id, String text) throws IOException {
          +	void addDoc(String id, String text) throws IOException {
           		templateAdd.indexedId = null;
           		doAdd(id, text, builder, templateAdd);
           	}
          

          Tests were OK; maybe you can apply it to your sources.

          Best regards,
          Juri

          Eric Pugh added a comment -

          Juri,

          Thanks for the vote on the issue! The next time I update this patch to work with the latest code, I'll apply your change. Since this is still a pending patch, I am not actively maintaining it. There is only one other patch with more votes, so hopefully this one will be added soon. I'd love to hear what your use case for this patch is.

          https://issues.apache.org/jira/browse/SOLR?report=com.atlassian.jira.plugin.system.project:popularissues-panel

          Eric

          Jonathan Hipkiss added a comment -

          This is crucial functionality if Solr is to be accepted as a solution in any organisation. A search engine that can't parse Microsoft or other closed formats is useless to most organisations.
          This is a MUST!

          Pompo Stenberg added a comment -

          I wrote a simple patch for RichDocumentRequestHandler to accept multivalued fields. Just POST the same field name multiple times, e.g. category=TVs&category=Radios

          RichDocumentRequestHandler.java.patch
          Index: RichDocumentRequestHandler.java
          ===================================================================
          --- RichDocumentRequestHandler.java	(revision 0)
          +++ RichDocumentRequestHandler.java	(working copy)
          @@ -211,7 +211,10 @@
           	  for (int i =0; i < fields.length;i++){
           	    String fieldName = fields[i].getName();
              
          -  	    builder.addField(fieldName,params.get(fieldName),1.0f);
          +           String[] values = params.getParams(fieldName);
          +           for(String value : values) {
          +             	    builder.addField(fieldName,value,1.0f);
          +           }
           	      
           	  }
          

          Seems to work for me.

          Best Regards,
          Pompo

          Chris Harris added a comment -

          I'm thinking it would be handy if RichDocumentRequestHandler could support indexing text and HTML files, in addition to the fancier formats (pdf, doc, etc.). That way I could use RichDocumentRequestHandler for all my indexing needs (except commits and optimizes), rather than use it for some doc types but still have to use XmlUpdateRequestHandler for text and HTML docs. Would anyone else find this useful?

          I skimmed the source, and adding support for text files looks trivial. (It's just a pass-through.) And if you had this, then I guess you'd have at least one version of HTML support for free; in particular, you could upload your HTML file to RichDocumentRequestHandler, telling the handler that the document is in plain text format, and then strip off the HTML tags later by using the HTMLStripStandardTokenizer in your schema.xml.

          Alternatively, RichDocumentRequestHandler could provide its own explicit HTML to text conversion. There would probably be some advantages to this, but I'm not sure exactly what they would be. One, I guess, would be that you could use tokenizers that didn't make use of HTMLStripReader.
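
          As a rough sketch (not part of the patch), the explicit conversion could just wrap the input in Solr's HTMLStripReader; the helper class below is illustrative, assuming HTMLStripReader wraps a plain Reader as it did in trunk at the time:

          import org.apache.solr.analysis.HTMLStripReader;

          import java.io.IOException;
          import java.io.Reader;
          import java.io.StringReader;

          public class HtmlToText {
            // Strip tags from an HTML string, keeping only the text content.
            public static String strip(String html) throws IOException {
              Reader stripped = new HTMLStripReader(new StringReader(html));
              StringBuilder sb = new StringBuilder();
              char[] buf = new char[1024];
              int n;
              while ((n = stripped.read(buf)) != -1) {
                sb.append(buf, 0, n);
              }
              return sb.toString();
            }
          }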

          Eric Pugh added a comment -

          Chris, I like what you are thinking... Really this is sort of becoming the AllDocumentsUnderTheSunRequestHandler, but what that highlights is that the current solution really doesn't do what we need, which is making it dirt simple to add new handlers...

          While there are some efforts under way to provide the "uber" solution, adding another hack/method to RichDocumentRequestHandler is cool with me. Since it's just a patch file, feel free to take it, munge it, and post it back as the "current" patch. If you do, make sure to add to the docs on the wiki at http://wiki.apache.org/solr/UpdateRichDocuments.

          Heck, you may want to rip in Pompo's fix as well!

          Eric Pugh added a comment -

          Oh, and don't forget to vote for it as well:

          https://issues.apache.org/jira/browse/SOLR?report=com.atlassian.jira.plugin.system.project:popularissues-panel

          It's the current leading vote getter!

          Kristoffer Dyrkorn added a comment -

          Very handy!

          It could be beneficial to have an option to save the extracted text as XML (so it can be stored) just before adding it to the Solr index. Thus, if the Solr schema needs to be changed (in a way that triggers a full reindex), the content can then be quickly re-fed from a "near source".

          Chris Harris added a comment -

          Replacing rich.patch. The new one:

          1) Rolls together into one handy package all of these:

          • the old rich.patch
          • the contents of source.zip and test.zip
          • Pompo's multivalued fields patch.

          Note: It does not include the contents of libs.zip or test-files.zip. I'm not sure what the protocol is around those larger files.

          Note: The old rich.patch included a change to Config.java that searched for an alternative config file in "src/test/test-files/solr/conf/". I've removed that change because I think it's debugging code that we don't want in an official patch. Let me know if I'm wrong, though.

          2) Makes things work against the latest revision in trunk, r646483. (The patch had stopped applying against recent trunk.)

          I haven't added any new test cases, but the old ones all pass.

          I grant my modifications to ASF according to the Apache License. Someone might want to check that the underlying contributions have been appropriately licensed as well.

          Michel Benevento added a comment -

          Hi, just new here. I am working on Rich Document support for solr-ruby and acts_as_solr. If you are interested, see preliminary results at: http://wiki.apache.org/solr/solr-ruby/BrainStorming

          For acts_as_solr I need the ID field to be a String, the same as Juri Kuehn above, who supplied the fix for this.

          Is there a specific reason it was not added to the latest rich.patch? I would appreciate it if it could be.

          Thanks,
          Michel

          Chris Harris added a comment -

          Here's a new version of rich.patch. My previous attempt didn't actually include all the necessary files! (Curses upon you, TortoiseSVN.) This one also includes preliminary support for plaintext and HTML files. (HTML support is done by running the input through the HTMLStripReader.)

          Chris Harris added a comment -

          New version of test-files.zip. Contains new file, simple.txt, that is used by a new unit test for plaintext files.

          Grant Ingersoll added a comment -

          Why not just use Tika (or Aperture, but its license isn't as friendly)? It doesn't make sense to reinvent the wheel here.

          Chris Harris added a comment -

          I'm not sure this patch entirely reinvents the wheel, as it does most of the heavy lifting with preexisting components, namely PDFBox, POI, and Solr's own HTMLStripReader. It also has the advantage of already existing, whereas tying Solr to Tika or Aperture would take additional effort.

          Tika or Aperture do look really nice, though. The most obvious advantage these projects have over this patch is that they can already extract text from more file formats than this patch, and that the developers will probably continue to add more file formats over time. Are you thinking of additional advantages on top of this, Grant? Do you have any cool ideas about how Tika/Aperture's metadata extraction facilities might be integrated into Solr? Is there a potentially interesting interface between Aperture's crawling facilities and Solr?

          Grant Ingersoll added a comment -

          I think Tika will actually take less effort, as you only need one interface, as I understand it. You don't need separate handlers for each type; we just need to write the interface between Solr and Tika.

          Nutch is already using Tika.

          +1

          Yes, someone else maintains the code. We just maintain the interface and upgrade when appropriate.

          Well, metadata makes for nice fields to sort, filter, and facet on, right?

          I think it is more likely that you will see Nutch integration w/ Solr (in fact, there is already a patch for it), but yeah, I think it makes sense to consider Solr as a sink for any crawler.

          Some of this also overlaps w/ the Data Import Request Handler on SOLR-469. I don't think we want to get Solr into the crawling game, but we also shouldn't prevent it from playing nicely with crawlers (not saying it doesn't already).

          --------------------------
          Grant Ingersoll

          Lucene Helpful Hints:
          http://wiki.apache.org/lucene-java/BasicsOfPerformance
          http://wiki.apache.org/lucene-java/LuceneFAQ

          Chris Harris added a comment -

          Attaching another patch revision. I've been totally asleep at the wheel today, and my previous one contained not only the feature described in this JIRA issue but also the Data Import RequestHandler patch (SOLR-469). Hopefully I've finally made a patch that's actually correct. I can at least promise that the unit tests pass when applied to r654253.

          Otis Gospodnetic added a comment -

          +1 for Tika
          But also +1 for committing this in the meantime – wow, lots of watchers and voters!

          Grant Ingersoll added a comment -

          I don't agree on committing it. If Tika is the right solution, then we should work towards Tika. Not saying this isn't good, just saying it's going to create more maintenance than we want, and then we just end up deprecating it in the near future.

          Chris Harris added a comment -

          I'm on the fence about whether this patch makes sense to include in Solr right now. One thing I'm wondering, though: can we assess the odds, at this point, that a Tika-based handler could offer the same public interface that the handler in this patch presents? That is, even if the underlying implementation were switched to Tika at some point, could we avoid changing the URL schema and such that Solr clients would use to interact with it?

          If it's likely that the public interface could indeed remain the same for the first Tika-based handler release (or at least more or less the same), would this alleviate any of Grant's concerns?

          Also, would putting this handler into a contrib directory rather than in the main code base, as has been mentioned on the mailing list, make committing it any less problematic?

          Mike Klaas added a comment -

          Removing from 1.3. No committer has taken ownership.

          (It might make sense as a contrib, but I can see the argument for not duplicating Tika.)

          Rogério Pereira Araújo added a comment -

          Who is working on the Tika-based handler? Can that work be started now, or isn't Tika mature enough yet?

          Otis Gospodnetic added a comment -

          I don't think anyone is working on it (publicly), so you are welcome to contribute it.

          Chris Harris added a comment -

          Trivial update to merge cleanly against r685275.

          Chris Harris added a comment -

          The patch, as it currently stands, treats a field called "id" as a special case. First, it is a required field. Second, unlike any other field, you don't need to declare it in the fieldnames parameter. Finally, since the field is read via SolrParams.getInt(), it is required to be an int.

          This special-case treatment seems a little too particular to me; not everyone wants to have a field called "id", and not everyone who does wants that field to be an int. So what I propose is to eliminate the special treatment of "id". See un-hardcode-id.diff for what this might mean in particular. (That file is not complete; to correctly make this change, I'd have to update the test cases.)

          This is a breaking change, because if you are using an id field, you'll now have to specifically indicate that fact in the fieldnames parameter. Thus, instead of

          http://localhost:8983/solr/update/rich?stream.file=myfile.doc&stream.type=doc&id=100&stream.fieldname=text&fieldnames=subject,author&subject=mysubject&author=eric

          you'll have to put

          http://localhost:8983/solr/update/rich?stream.file=myfile.doc&stream.type=doc&id=100&stream.fieldname=text&fieldnames=id,subject,author&subject=mysubject&author=eric

          I think asking users of this patch to make this slight change in their client code is not an unreasonable burden, but I'm curious what Eric and others have to say.

          Eric Pugh added a comment -

          So, in typical open source fashion, I wrote the original patch to scratch my own itch, which meant that it was okay for the id to be hardcoded. However, even when I first posted the patch to this JIRA issue, I felt a little "icky" about the id field. It seemed like a code smell to have this magic id! So, from that standpoint, I think the changes that Chris has posted look great.

          I think it's a good example of a patch getting better and better every time someone else uses it!

          Now, if only this almost-14-month-old patch could be applied! With 28 votes and 16 active watchers, clearly somebody out there finds this useful!

          And at this point it is miles better than what I first posted! Keep up the great work and the great contributions back!

          Chris Harris added a comment -

          While we're on the subject of breaking changes, I'm now seeing some merit in replacing the fieldnames parameter with a field-specifying prefix.

          Currently when you want to set a non-body field, you introduce the field name in the fieldnames parameter and then specify its value in another parameter, like so:

          /update/rich/...fieldnames=f1,f2,f3&f1=val1&f2=val2&f3=val3

          The alternative would be to signal the fields f1, f2, and f3 by a field prefix, like so:

          /update/rich/...f.f1=val1&f.f2=val2&f.f3=val3

          Because the f prefix says "this is a field", there's no need for the fieldnames parameter.
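
          A minimal sketch of how the handler might collect such prefixed parameters (the prefix constant and helper method are illustrative, not an actual patch):

          import org.apache.solr.common.SolrInputDocument;
          import org.apache.solr.common.params.SolrParams;

          import java.util.Iterator;

          public class PrefixedFieldParams {
            static final String FIELD_PREFIX = "f."; // assumed prefix from this proposal

            static void addFields(SolrParams params, SolrInputDocument doc) {
              Iterator<String> names = params.getParameterNamesIterator();
              while (names.hasNext()) {
                String name = names.next();
                if (name.startsWith(FIELD_PREFIX)) {
                  String field = name.substring(FIELD_PREFIX.length());
                  for (String value : params.getParams(name)) { // multivalued-safe
                    doc.addField(field, value);
                  }
                }
              }
            }
          }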

          This isn't an Earth-shattering improvement, but there are three things I like about it:

          1. The URLs are shorter

          2. If you rename a field (e.g. rename f3 to g3), you can no longer accidentally half-update the URL in the client code, as can happen today:

          /update/rich/...fieldnames=f1,f2,g3&f1=val1&f2=val2&f3=val3

          3. Currently there are certain reserved words (e.g. "fieldnames", "commit") that you can't use, because they have special meaning to the handler. But with this change they become legitimate field names. For example, maybe I want each of my documents to have a "commit" field that describes who made the most recent relevant commit in a version control system.

          /update/rich/...commit=true&f.commit=chris

          I can't think of any downsides right now, other than breaking people's code. (I do admit that is a downside.)

          Any comments?

          Chris Harris added a comment -

          A couple of Tika things:

          I glanced at Tika yesterday, and it looks like switching this patch over to it wouldn't be too hard. (The only thing half-worthy of note is that org.apache.tika.parser.Parser.parse outputs XHTML [via a SAX interface], which we would probably then need to turn into plaintext.) I haven't yet looked into Eric's code to see if it does anything special that Tika doesn't do.
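
          For the curious, a minimal sketch of that conversion using Tika's convenience classes (hedged: the Tika API was still settling at the time, so treat these signatures as illustrative):

          import org.apache.tika.metadata.Metadata;
          import org.apache.tika.parser.AutoDetectParser;
          import org.apache.tika.sax.BodyContentHandler;

          import java.io.InputStream;

          public class TikaToText {
            public static String extract(InputStream in) throws Exception {
              BodyContentHandler text = new BodyContentHandler(-1); // -1 = no write limit
              Metadata metadata = new Metadata();
              new AutoDetectParser().parse(in, text, metadata); // flattens the XHTML to plain text
              return text.toString();
            }
          }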

          I also noticed something else, though. Earlier comments say that Nutch uses Tika, but when I looked through Nutch trunk this seemed to only sort of be the case. In particular, Nutch definitely uses the stuff in the org.apache.tika.mime namespace, to do things like auto-detect content types, but it doesn't seem to use the stuff in org.apache.tika.parser to do the actual document parsing; instead, it uses its own separate org.apache.nutch.parse.Parser class (and subclasses thereof). For example, org.apache.nutch.parse.html.HtmlParser does not delegate to org.apache.tika.parser.html.HtmlParser but rather does its own direct manipulation of the tagsoup and/or nekohtml libraries. (Things are similar with the Nutch PDF parser.) Nor does there seem to be an alternative class along the lines of org.apache.nutch.parse.TikaBasedParserThatCanParseLotsOfDifferentContentTypesIncludingHtml. And the string "org.apache.tika.parser" doesn't seem to occur in the Nutch source.

          I'm wondering if anyone knows why Nutch does not seem to make use of all of Tika's functionality. Are they planning to switch everything over to Tika eventually?

          Chris Harris added a comment -

          This update is just to make a tiny refactoring, bringing all the handler's parsing classes under

          src\java\org\apache\solr\handler\rich

          and all the testing classes under

          src\test\org\apache\solr\handler\rich

          All tests pass.

          Chris Harris added a comment -

          THIS IS A BREAKING CHANGE TO RICH.PATCH! CLIENT URLs NEED TO BE UPDATED!

          All unit tests pass.

          Changes:

          • As suggested earlier, the "id" parameter is no longer treated as a special case; it is not required, and it does not need to be an int. If you do use a field called "id", you must now declare it in the fieldnames parameter, as you would any other field.
          • Do updates with UpdateRequestProcessor and SolrInputDocument, rather than UpdateHandler and DocumentBuilder. (The latter pair appear to be obsolete.)
          • Previously, if you declared a field in the fieldnames parameter but then did not specify a value for that field, you would get a NullPointerException. Now you can specify any nonnegative number of values for a declared field, including zero. (I've added a unit test for this.)
          • In SolrPDFParser, properly close PDDocument when PDF parsing throws an exception
          • Log the stream type in the solr log, rather than on the console
          • Some not-very-thorough conversion of tabs to spaces

          As an aside, I've noticed that I failed in my earlier efforts to incorporate Juri Kuehn's change to allow the id field to be non-integer. Sorry about that, Juri; that was not at all intentional.

          Grant Ingersoll added a comment -

          FYI, I intend to integrate Tika now that it has graduated from incubation and is a full-fledged Lucene sub-project. I will do my best to be back-compatible with this patch, but make no guarantees as of now, since I have not reviewed this patch in a long time.

          Grant Ingersoll added a comment -

          Some initial thoughts on moving forward:

          I think we can add some generic functionality here via the request params:

          1. Tika can provide a lot of metadata about a document. By metadata, I mean things like the actual author, pages, etc. as provided by the document, not the hardcoded metadata described at http://wiki.apache.org/solr/UpdateRichDocuments. The hardcoded metadata is also useful and should be retained. With these, we then need a way to map fields from Tika's metadata to Solr fields (a sketch follows at the end of this comment). If no mapping is specified, it tries to use the Tika metadata name as the field name. If that doesn't exist, then we can rely on dynamic fields, or we can allow for a param that passes in the name of a default field to map to.

          2. We can auto detect the mime type or allow for it to be passed in. Thus, stream.type becomes optional, but is still useful.

          3. Tika provides a mechanism for implementing your own SAX ContentHandler and passing that in. I will likely make this pluggable so that people can provide their own. I think this would allow people to make even further refinements to the content (i.e. splitting on paragraphs or other things like that?)

          I should have a start of a patch today or tomorrow.
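
          To make item 1 concrete, here is a hypothetical sketch of the metadata mapping (the "fmap." prefix and the class/method names are illustrative, not the final parameters):

          import org.apache.solr.common.SolrInputDocument;
          import org.apache.solr.common.params.SolrParams;
          import org.apache.tika.metadata.Metadata;

          public class MetadataMapper {
            static void addMetadata(Metadata md, SolrParams params, SolrInputDocument doc) {
              for (String name : md.names()) {
                // explicit mapping, e.g. ...&fmap.Author=author; fall back to the Tika name
                String field = params.get("fmap." + name, name);
                for (String value : md.getValues(name)) { // Tika metadata can be multivalued
                  doc.addField(field, value);
                }
              }
            }
          }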

          Grant Ingersoll added a comment -

          3. Tika provides a mechanism for implementing your own SAX ContentHandler and passing that in. I will likely make this pluggable so that people can provide their own. I think this would allow people to make even further refinements to the content (i.e. splitting on paragraphs or other things like that?)

          Now that I'm digging in more, this actually isn't needed. The ProcessorChain can be used for this stuff.

          Eric Pugh added a comment -

          Grant, I am really excited that you are looking at this patch!

          While I am proud of it, and very proud of the number of organizations that have used it and the people who have improved it (thanks Chris!), it was just written to scratch an itch, so feel free to rip it apart to come up with a better solution for Solr. The ability for Solr to ingest more formats is, I think, the key aspect, not how this patch works.

          Grant Ingersoll added a comment -

          I'll separate out my two patches.

          Grant Ingersoll added a comment -

          Question for the people watching this:

          Would you prefer a new wiki page and keep the old one for those using Chris/Eric's patch, or would you rather I overwrite/edit the current one?

          FWIW, some of the parameters will be the same, but I'm also adding in quite a bit more: boosting, XPath expression support (Tika returns everything as XHTML, so it then becomes possible to restrict which parts you want to pay attention to), extraction only (i.e. no indexing), support for metadata extraction and indexing, support for sending in "literals" (which are like the current fieldnames parameter), and likely some other pieces.
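
          As a rough illustration of the parameter style (names here are illustrative, not final):

          http://localhost:8983/solr/update/extract?literal.id=doc1&fmap.content=text&boost.text=2.0&extractOnly=false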

          FYI: out of the box, Tika has support for the formats listed at http://incubator.apache.org/tika/formats.html, and I know they are adding more as well, like Flash, etc.

          It should also be noted that if you are just indexing metadata about a file, it makes more sense to do the work on the client side.

          Erik Hatcher added a comment -

          I'd rather see the old (err, current) wiki page replaced/renamed, and kept current with the latest patch/commit from this issue. Nice work Grant!

          Chris Harris added a comment -

          Grant,

          I don't really care if you take over the old wiki page's name or start a new one; maybe it depends on whether the updated handler is still going to have a similar name or be called something else. I do think, though, that it might be handy to have some wiki page (and maybe some JIRA issue) to maintain the older patch on a temporary basis.

          Thanks,
          Chris

          Grant Ingersoll added a comment -

          OK, I've created http://wiki.apache.org/solr/ExtractingRequestHandler and linked it from the old page. I will have a preliminary patch up today.

          Grant Ingersoll added a comment -

          First crack at this. You'll need to download http://people.apache.org/~gsingers/extraction-libs.tar as it is too big to fit in JIRA.

          There's probably lots wrong with it, so be gentle! See http://wiki.apache.org/solr/ExtractingRequestHandler to get started.

          Grant Ingersoll added a comment -

          Things to do:

          1. Documentation
          2. Way more testing, esp. unit tests of the various parameters
          3. Update NOTICES and LICENSE.txt for the new dependencies.

          Grant Ingersoll added a comment -

          Captured fields weren't being indexed properly.

          Grant Ingersoll added a comment -

          Fix issue with literal mapping

          Grant Ingersoll added a comment -

          Separated out ID generation to make it easier to override.

          I think this is pretty close to being ready to commit, so please review. I'm wrapped up next week, so I probably won't commit until the end of next week (after 11/21) so please review and provide feedback. Also, Tika is about to release 0.2, so I may just wait to add that in.

          Added in NOTICE and LICENSE information.

          Grant Ingersoll added a comment -

          Still to do:

          1. More unit tests

          2. We need to do the crypto notice for Solr once this is committed. See https://issues.apache.org/jira/browse/NUTCH-621 for examples. I will link a new issue for this so as not to hold up this patch from being committed. It just needs to be done before releasing 1.4

          Grant Ingersoll added a comment -

          Let's name the patch right, eh?

          Grant Ingersoll added a comment -

          Fix an issue w/ XPath and extract only. See http://tika.markmail.org/message/kknu3hw7argwiqin

          Chris Harris added a comment -

          Is the latest patch supposed to contain a file "solr-word.pdf"? I don't see one, and my "ant test" is failing along these lines:

          org.apache.solr.common.SolrException: java.io.FileNotFoundException: solr-word.pdf (The system cannot find the file specified)
          at org.apache.solr.handler.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:160)
          at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
          at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
          at org.apache.solr.core.SolrCore.execute(SolrCore.java:1313)
          at org.apache.solr.util.TestHarness.queryAndResponse(TestHarness.java:331)
          at org.apache.solr.handler.ExtractingRequestHandlerTest.loadLocal(ExtractingRequestHandlerTest.java:97)
          at org.apache.solr.handler.ExtractingRequestHandlerTest.testExtraction(ExtractingRequestHandlerTest.java:27)

          Chris Harris added a comment -

          A few comments on the ExtractingDocumentLoader:

          • I think I like where this is going.
          • Currently the default is ext.ignore.und.fl (IGNORE_UNDECLARED_FIELDS) == false, which means that if Tika returns a metadata field and you haven't made an explicit mapping from the Tika fieldname to your Solr fieldname, then Solr will throw an exception and your document add will fail. This doesn't sound very robust for a production environment, unless Tika will only ever use a finite list of metadata field names. (That doesn't sound plausible, though I admit I haven't looked into it.) Even in that case, I think I'd rather not have to set up a mapping for every possible field name in order to get started with this handler. Would true perhaps be a better default?
          • ext.capture / CAPTURE_FIELDS: Do you have a use case in mind for this feature, Grant? The example in the patch is of routing text from <div> tags to one Solr field while routing text from other tags to a different Solr field. I'm kind of curious when this would be useful, especially keeping in mind that, in general, Tika source documents are not HTML, and so when <div> tags are generated they're as much artifacts of Tika as reflecting anything in the underlying document. (You could maybe ask a similar question about ext.inx.attr / INDEX_ATTRIBUTES.)
          Grant Ingersoll added a comment -

          Here's the solr-word PDF.

          Grant Ingersoll added a comment -

          I think I like where this is going.

          Great! I think the nice thing is as Tika grows, we'll get many more formats all for free. For instance, I saw someone working on a Flash extractor.

          Currently the default is ext.ignore.und.fl (IGNORE_UNDECLARED_FIELDS) == false, which means that if Tika returns a metadata field and you haven't made an explicit mapping from the Tika fieldname to your Solr fieldname, then Solr will throw an exception and your document add will fail. This doesn't seem sound very robust for a production environment, unless Tika will only ever use a finite list of metadata field names. (That doesn't sound plausible, though I admit I haven't looked into it.) Even in that case, I think I'd rather not have to set up a mapping for every possible field name in order to get started with this handler. Would true perhaps be a better default?

          I guess I was thinking that most people will probably start out with this by sending their docs through the engine and see what happens. I think an exception helps them see sooner what they are missing. That being said, I don't feel particularly strong about it. It's easy enough to set it to true in the request handler mappings. From what I see of Tika, though, the possible values for metadata is fixed within a version. Perhaps the bigger issue is what happens when someone updates Tika to a newer version with newer Metadata options.

          ext.capture / CAPTURE_FIELDS: Do you have a use case in mind for this feature, Grant? The example in the patch is of routing text from <div> tags to one Solr field while routing text from other tags to a different Solr field. I'm kind of curious when this would be useful, especially keeping in mind that, in general, Tika source documents are not HTML, and so when <div> tags are generated they're as much artifacts of Tika as reflecting anything in the underlying document. (You could maybe ask a similar question about ext.inx.attr / INDEX_ATTRIBUTES.)

          For capture fields, it's similar to a copy field function. Say, for example, you want the whole document in one field, but also want to be able to search within paragraphs. Then you could use a capture field on a <p> tag to do that. Thus, you get the best of both worlds. The Tika output is XHTML.

          Also, since extraction is happening on the server side, I want to make sure we have lots of options for dealing with the content. I don't know where else one would have options to muck with the content post-extraction, but pre-indexing. Hooking into the processor chain is too late, since then the Tika structure is gone. That's my reasoning, anyway.

          Similarly, for index attributes. When extracting from an HTML file, and it comes across anchor tags (<a>) it will provide the attributes of the tags as XML attributes. So, one may want to extract out the links separately from the main content and put them into a separate field.
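
          As a concrete sketch (parameter spellings follow the names used in this thread; the endpoint path is an assumption):

          # capture <p> text into its own field while keeping the full body,
          # and index attributes such as href on <a> tags
          curl "http://localhost:8983/solr/update/extract?ext.capture=p&ext.map.p=paragraph&ext.idx.attr=true" \
            --data-binary @page.html -H "Content-Type: text/html"

          Text inside <p> tags would then land in the paragraph field in addition to the default content field, and anchor attributes would become available for separate indexing.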

          Hoss Man added a comment -

          if Tika returns a metadata field and you haven't made an explicit mapping from the Tika fieldname to your Solr fieldname, then Solr will throw an exception and your document add will fail. This doesn't seem sound very robust for a production environment, unless Tika will only ever use a finite list of metadata field names.

          I'm not familiar with the state of the patch, but i'm assuming that (by default) all of the metadata fields produced by tika have a common naming convention – either in terms of a common prefix or a common suffix. in which case people can always make a dynamicField declaration to ignore all metadata fields not already explicitly declared.
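
          For reference, the example schema already has the pieces for that; assuming a hypothetical common tika_ prefix, a catch-all declaration would look like:

          <dynamicField name="tika_*" type="ignored" multiValued="true"/>

          where "ignored" is the example schema's no-index, no-store field type.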

          Grant Ingersoll added a comment -

          I'm not familiar with the state of the patch, but i'm assuming that (by default) all of the metadata fields produced by tika have a common naming convention - either in terms of a common prefix or a common suffix. in which case people can always make a dynamicField declaration to ignore all metadata fields not already explicitly declared.

          No, they don't, but that is a good idea for Tika.

          Chris Harris added a comment -

          The 2008-11-15 01:12 PM version of SOLR-284.patch contains modifications to client/java/solrj/src/org/apache/solr/client/solrj/util/ClientUtils.java related to date handling. That's not intentional, is it?

          Grant Ingersoll added a comment -

          The 2008-11-15 01:12 PM version of SOLR-284.patch contains modifications to client/java/solrj/src/org/apache/solr/client/solrj/util/ClientUtils.java related to date handling. That's not intentional, is it?

          Yes, it is intentional. The user will need to be able to pass in/configure their own Date formats for their documents and the implementation has to be able to map those to Solr's canonical date format. Thus, I moved the date handling stuff to a "common" DateUtils class (and deprecated it in ClientUtils) because it is needed on the server side too. Unfortunately, it looks like I did some reformatting on the class as a whole, too. Sorry 'bout that.

          Grant Ingersoll added a comment -

          I like how Erik has given names to contribs, etc.: Flare, Celeritas, etc. So, I thought I would give one too:

          I was typing the javadocs and wrote "Solr Content Extraction Library", which then led me to "Solr Cell" as the project name: http://en.wikipedia.org/wiki/Solar_cell It's also nice, b/c a solar cell's job is to convert the raw energy of the Sun into electricity, and this contrib module's job is to convert the "raw" content of a document into something usable by Solr.

          I know, I know, get a life... Still, it beats "ExtractingRequestHandler" as a name!

          Erik Hatcher added a comment -

          I'm not familiar with the state of the patch, but i'm assuming that (by default) all of the metadata fields produced by tika have a common naming convention - either in terms of a common prefix or a common suffix. in which case people can always make a dynamicField declaration to ignore all metadata fields not already explicitly declared.

          Tika doesn't need to do this explicitly.... you know all fields coming out of your call to the Tika API will be Tika fields. Solar Cell (I'm on board with that nickname, Grant - now you're catching on) could thus map all Tika output fields to tika_*, where * is the Tika-outputted field name. And with field name mapping this default would be overridden, say tika_title mapped to "title". Just some off the cuff thoughts.

          Grant Ingersoll added a comment -

          Tika doesn't need to do this explicitly.... you know all fields coming out of your call to the Tika API will be Tika fields. Solar Cell could map all Tika output fields to tika_* where * is the Tika outputted field name. And with field name mapping this default would be overridden, say tika_title mapped to "title".

          I can add in an option to have it do this mapping.

          Chris Harris added a comment -

          The 2008-11-15 01:12 PM SOLR-284.patch wasn't applying cleanly to trunk r720403 for me. (One of the hunks for client/java/solrj/src/org/apache/solr/client/solrj/util/ClientUtils.java wouldn't apply.) With this very small update, it does apply cleanly.

          Chris Harris added a comment -

          On r720403, I'm noticing that before I apply this patch tests pass, whereas after I apply this patch the following tests fail:

          solr.client.solrj.embedded.JettyWebappTest
          solr.client.solrj.embedded.LargeVolumeJettyTest
          solr.client.solrj.embedded.SolrExampleJettyTest
          solr.client.solrj.response.TestSpellCheckResponse

          In each case Solr outputs this exception: "On Solr startup: SEVERE: org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.handler.ExtractingRequestHandler'"

          I'm not sure the best way to get the ExtractingRequestHandler into the classpath here.

          Sort of related, I've noticed that ExtractingRequestHandler doesn't currently get built into the .war file when you run "ant example", in contrast to DataImportHandler, which does get put into the .war by means of this target in its build.xml (among other targets):

          <target name="dist" depends="build">
            <copy todir="../../build/web">
              <fileset dir="src/main/webapp" includes="**" />
            </copy>
            <mkdir dir="../../build/web/WEB-INF/lib"/>
            <copy file="target/${fullnamever}.jar" todir="${solr-path}/build/web/WEB-INF/lib"></copy>
            <copy file="target/${fullnamever}.jar" todir="${solr-path}/dist"></copy>
          </target>

          Should ExtractingRequestHandler's build.xml perhaps have an analogous "dist" target, along these lines:

          <target name="dist" depends="build">
            <mkdir dir="../../build/web/WEB-INF/lib"/>
            <copy file="build/${fullnamever}.jar" todir="${solr-path}/build/web/WEB-INF/lib"></copy>
            <copy file="build/${fullnamever}.jar" todir="${solr-path}/dist"></copy>
          </target>

          Chris Harris added a comment -

          Small change to the 2008-11-26 09:18 AM SOLR-284.patch (my previous one), this time adding an "example" ant target to contrib/javascript/build.xml. (Without this, top-level "ant example" was failing.)

          Chris Harris added a comment -

          This should be the last change for today.

          This change adds a resource.name parameter that you can pass to the handler. (I'm guessing you'll probably typically pass a filename, though Tika does use the more general term "resource name".) If you provide it, Tika can take advantage of it when applying its heuristics to determine the MIME type.

          Affected files:

          • ExtractingParams.java
          • ExtractingDocumentLoader.java
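
          A hypothetical invocation (endpoint path assumed):

          curl "http://localhost:8983/solr/update/extract?resource.name=test.pdf" \
            --data-binary @test.pdf -H "Content-Type: application/octet-stream"

          Here the .pdf extension gives Tika something to go on even when the Content-Type header is missing or too generic to identify the format.
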
          Grant Ingersoll added a comment -

          Sort of related, I've noticed that ExtractingRequestHandler doesn't currently get built into the .war file when you run "ant example", in contrast to DataImportHandler, which does get put into the .war by means of this target in its build.xml (among other targets):

          Yes, it does NOT get put into the WAR on purpose. Unfortunately, I think the DIH does this wrong (but it's probably too late now). A contrib should be optional, as not everyone wants it/needs it. Solr Cell works solely by putting it into the Solr Home lib directory and then hooking it into the config.

          Chris Harris added a comment -

          Changes since my previous upload:

          • sync CHANGES.txt with trunk
          • test cases for adding plain text data
          • you aren't forced to map a field if you use the resource.name parameter
          Chris Harris added a comment -

          As I mentioned before, tests for these

          solr.client.solrj.embedded.JettyWebappTest
          solr.client.solrj.embedded.LargeVolumeJettyTest
          solr.client.solrj.embedded.SolrExampleJettyTest
          solr.client.solrj.embedded.TestSpellCheckResponse

          were failing, with Solr throwing a ClassNotFoundException for one of the extracting document loader (i.e. Solr Cell) classes.

          This revision fixes this by removing all references to this Tika handler from /trunk/example/conf/solrconfig.xml and /trunk/example/conf/schema.xml. Note that these references still exist (and are still used for testing) in /trunk/contrib/extraction/src/test/resources/solr/conf.

          There are probably other ways to make these tests pass, perhaps involving changing the setUp() methods for the above mentioned tests' java files. (For example, maybe you could fiddle with the path parameter passed to the WebAppContext constructor in JettyWebappTest.java? I don't really know anything about this embedded stuff.) I like the current approach, though, because it avoids further changes to code that's logically independent of this handler.

          Chris Harris added a comment -

          Currently this patch deploys the Tika libs to /trunk/example/solr/lib. I'm curious where the Tika handler's lib/ directory is supposed to go in a multicore deployment. I created my own multicore setup more or less like this:

          • ant example
          • Copy /trunk/example to /trunk/solr-10000
          • Copy /trunk/solr-10000/multicore/* to /trunk/solr-10000/solr.

          (Solr-10000 means "copy of Solr I plan to run on port 10000.")

          This seems to be the easiest way to set things up so that I can cd to /trunk/solr-10000 and run start.jar to get multicore Solr running.

          Or rather, that would get multicore Solr running, except that Solr gets a can't-find-the-Tika-classes exception. So I guess /trunk/solr-10000/solr/lib is not where the lib directory goes for multicore deployment.

          So I tried putting Tika libs instead in /trunk/solr-10000/solr/core0/lib, and that loaded fine. That doesn't seem like the right place for the directory, though; it seems like each core shouldn't have to have its own separate copy of the Tika libs.

          So where do the Tika libs go?

          Grant Ingersoll added a comment -

          I think in multicore you can specify a shared library directory in solr.xml, so you could put the Tika stuff in that dir.

          As for the tests, I didn't know the tests had a dependency on the example directory. That doesn't seem good. I'm with a client all this week, but will try to get to it this weekend.
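
          For anyone following along, the shared library is declared on the <solr> element in solr.xml, roughly like this (paths are illustrative):

          <solr persistent="false" sharedLib="lib">
            <cores adminPath="/admin/cores">
              <core name="core0" instanceDir="core0"/>
              <core name="core1" instanceDir="core1"/>
            </cores>
          </solr>

          That way the Tika jars can live in a single lib directory next to solr.xml instead of being copied into every core.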

          Grant Ingersoll added a comment -

          Committed revision 723977.

          Committed Chris' patch, w/ the modification that I put the ext prefix on the resource.name and stream.type.

          I also added an ext.metadata.prefix option, which can be used to map the Tika metadata to a dynamic field, as Erik described above.

          See the Wiki page for details: http://wiki.apache.org/solr/ExtractingRequestHandler

          Thanks for everyone's input and work!
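
          A sketch of how the new option fits together (prefix and field type are illustrative): ext.metadata.prefix=attr_ turns a Tika metadata name like Content-Type into attr_Content-Type, which a dynamic field can then catch:

          <dynamicField name="attr_*" type="text" indexed="true" stored="true" multiValued="true"/>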

          Ryan McKinley added a comment -

          Looks like there are a bunch of duplicate .jar files in lib. You could remove these and use the ones that are already in /lib

          Index: contrib/extraction/lib/commons-io-1.4.jar
          Index: contrib/extraction/lib/commons-codec-1.3.jar
          Index: contrib/extraction/lib/commons-lang-2.1.jar
          Index: contrib/extraction/lib/commons-logging-1.0.4.jar
          Index: contrib/extraction/lib/junit-3.8.1.jar

          Grant Ingersoll added a comment -

          Thanks, Ryan, I will remove them.

          Grant Ingersoll added a comment -

          Forgot a couple of things on this:

          1. To hook into the release/javadoc mechanism.
          2. In order to facilitate separation of the javadocs and other things, I'm going to move the code to the o.a.s.handler.extraction package.
          3. Need to publish the Maven artifacts.

          Rogério Pereira Araújo added a comment -

          Grant, lemme know how I can help.

          Grant Ingersoll added a comment -

          OK, I just committed:

          1. Upgraded to Tika 0.2 official release
          2. Put in POM support
          3. Hooked in various other build things.

          Lance Norskog added a comment -

          The ExtractingRequestHandler has its own UUID generation code.

          Should the schema designer just use the UUID field type or decide to have no unique key field? This seems more modular and follows other aspects of the design.

          Grant Ingersoll added a comment -

          Should the schema designer just use the UUID field type or decide to have no unique key field? This seems more modular and follows other aspects of the design.

          I guess I usually prefer having a unique key field, as it always gives you that one last handle to grab onto to find a specific document. However, I'm not sure I follow what you mean about having no unique key field being more modular.

          I put in the code b/c I figured it was better to generate an ID than to outright reject the document: unlike when adding XML, sending large files can be really expensive, so I wanted it to handle as many edge cases as possible and still accept a document.

          Here's the code:

          SchemaField uniqueField = schema.getUniqueKeyField();
          if (uniqueField != null) {
            String uniqueFieldName = uniqueField.getName();
            SolrInputField uniqFld = document.getField(uniqueFieldName);
            if (uniqFld == null) {
              String uniqId = generateId(uniqueField);
              if (uniqId != null) {
                document.addField(uniqueFieldName, uniqId);
              }
            }
          }
          Hoss Man added a comment -

          I put in the code b/c I figured it was better to generate an ID than to outright reject the document,

          Hmmm ... that means that if i have a schema with a uniqueKey field, and i forget to specify a uniqueKey value when indexing my document, the handler will "silently succeed" in adding a document with a key i have no control over instead of failing in a way that will make me aware of my mistake – and i have no way of configuring solr to prevent that kind of "silent success"

          If i wanted that behavior, i could configure the schema with a UUIDField as the uniqueKey and take advantage of the default, but as it is now i have no way to prevent it.

          I would think consistency and flexibility are more important, and would remove that "generateId" functionality along the lines of Lance's suggestion.
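
          For reference, the schema-side approach Hoss describes looks roughly like this:

          <fieldType name="uuid" class="solr.UUIDField" indexed="true"/>
          ...
          <field name="id" type="uuid" indexed="true" stored="true" default="NEW"/>
          ...
          <uniqueKey>id</uniqueKey>

          With default="NEW", Solr generates a fresh UUID for any document that arrives without one, so the opt-in behavior lives in the schema rather than being hardwired into the handler.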

          Grant Ingersoll added a comment -

          Hmmm ... that means that if i have a schema with a uniqueKey field, and i forget to specify a uniqueKey value when indexing my document, the handler will "silently succeed" in adding a document with a key i have no control over instead of failing in a way that will make me aware of my mistake - and i have no way of configuring solr to prevent that kind of "silent success"

          Actually, there is a mechanism for avoiding it, and it is documented at http://wiki.apache.org/solr/ExtractingRequestHandler#head-6cda7b8832bb2ccaf6b0b57a6ef524b553db489e

          I could, however, see adding a flag to specify whether one wants "silent success" or not. I think the use case for content extraction is different than the normal XML message path. Often times, these files are quite large and the cost of sending them to the system is significant.

          Another thing that might be interesting to do is to actually return the generated id in the response.

          Chris Harris added a comment -

          I could, however, see adding a flag to specify whether one wants "silent success" or not. I think the use case for content extraction is different than the normal XML message path. Often times, these files are quite large and the cost of sending them to the system is significant.

          In my own use case of the handler, I imagine the fail-on-missing-key policy would be the more helpful policy. This is because I want to be in control of my own key, and if Solr fails as soon as I don't provide one, that's going to help me find the bug in my indexing code right away, whereas "silent success" will allow that bug to fester. I'm not sure there would be significant countervailing advantages to the other policy. It's true that transferring a large file when you're just going to get an error message wastes some time, but I feel like in debugging there's potential to waste a lot more time.

          My first choice would be for fail-on-missing-key to be the default, followed by having an easy-to-set flag. In any case, though, it would be nice not to have to create a custom SolrContentHandler just to get this one sanity check.

          Grant Ingersoll added a comment -

          I guess I'm fine with it. So, should we remove key generation altogether?

          Grant Ingersoll added a comment -

          Remove Key Generation. Will commit shortly

          Grant Ingersoll added a comment -

          I removed the auto key generation: Committed revision 741907. I think this can officially close out this patch.

          Yonik Seeley added a comment -

          Not sure if I should open a new issue or keep improvements here.
          I think we need to improve the OOTB experience with this...
          http://search.lucidimagination.com/search/document/302440b8a2451908/solr_cell

          Ideas for improvement:

          • auto-mapping names of the form Last-Modified to a more solrish field name like last_modified
          • drop "ext." from parameter names, and revisit naming to try and unify with other update handlers like CSV
            note: in the future, one could see generic functionality like boosting fields, setting field value defaults, etc, being handled by a generic component or update processor... all the better reason to drop the ext prefix.
          • I imagine that metadata is normally useful, so we should
            1. predefine commonly used metadata fields in the example schema... there's really no cost to this
            2. use mappings to normalize any metadata names (if such normalization isn't already done in Tika)
            3. ignore or drop fields that have little use
            4. provide a way to handle new attributes w/o dropping them or throwing an error
          • enable the handler by default - lazy, to avoid a dependency on having all the tika libs available (see the sketch below)
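
          On the last bullet, Solr's requestHandler syntax already supports lazy loading, so the example config could register the handler along these lines (class name and defaults shown are illustrative, not a committed config):

          <requestHandler name="/update/extract"
                          class="org.apache.solr.handler.extraction.ExtractingRequestHandler"
                          startup="lazy">
            <lst name="defaults">
              <str name="ext.map.Last-Modified">last_modified</str>
            </lst>
          </requestHandler>

          startup="lazy" defers loading the class until the first request, so a plain Solr install still starts cleanly even when the Tika jars aren't on the classpath.
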
          Yonik Seeley added a comment -

          Oops, there is an "ext.metadata.prefix" that I missed on the first pass. This should be defaulted to handle unknown attributes.

          Yonik Seeley added a comment -

          ext.capture seems problematic in that one needs a separate ext.map statement to move what you capture... but it doesn't seem to work well if you already have fieldnames that might match something you are trying to capture.

          perhaps something of the form
          capture.targetfield=expression
          would work better?

          Yonik Seeley added a comment -

          I just tried setting ext.idx.attr=false, and I didn't see any change after indexing a PDF.
          Perhaps we don't even need this option if we map attributes to an ignored_ field that is ignored?
          In any case, the default seems like it should generate / index attributes.

          Yonik Seeley added a comment -

          Another comment on parameter naming: period is more like a scoping operator, and less like a word separator. Hence ext.ignore.und.fl is more readable as ext.ignore_undefined or something.

          Yonik Seeley added a comment -

          Apologies for not reviewing this sooner after it was committed - but this is the last/best chance to improve the interface before 1.4 is released (and this is very important new functionality).

          Since the "ext." prefix seems unnecessary and removing it is already a name change, we might as well revisit the names themselves anyway. Here are my first thoughts on it:

          //////// generic type stuff that could be reused by other update handlers
          boost.myfield=2.3
          literal.myfield=Hello
          map.origfield=newfield
          uprefix=attr_ 
            // map any unknown fields using a standard prefix... good for
            // dynamic field mapping.
          
          //////// more solr cell specific
          capture.target_field=div
            // does capture + field-map in single step... avoids name clashes
          xpath=xpath_expr
            // future: could do xpath.targetfield=xpath_expr
          extract_only=true  // periods aren't word separators, but scoping operators
           // in the future, this could be replaced with a generic update operation
           // to return the document(s) instead of indexing them.
          resource.name=test.pdf
          
          New idea:
            nicenames=true // Last-Modified -> last_modified
          
          
          REMOVED:
          ext.ignore.und.fl 
            // throwing an exception when a field-type doesn't exist is generic
            // and not needed.  we should never silently ignore.
          ext.idx.attr
            // do we ever want this to be false?  we can ignore all attributes
            // with field mappings if we want to
          ext.metadata.prefix
            // seems like we only want to map unknown fields, not all fields
          ext.def.fl 
            // we can use a standard field name for indexing main content
            // and use map to move it if desired. "content"? 
          

          Do people view this as an improvement?

          Grant Ingersoll added a comment -

          I just tried setting ext.idx.attr=false, and I didn't see any change after indexing a PDF.

          This is often needed for HTML, where it is used to index the attributes of tags. Same would go for XML.
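
          (Illustrative example, not from the wiki docs: given HTML such as <div><a href="http://example.com/page">a link</a></div>, attribute indexing is what lets the href value be captured into its own field, e.g. for faceting, rather than only the link text.)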

          Grant Ingersoll added a comment -

          I will review your comments more tomorrow. Still waist deep in boxes from the move!

          Chris Harris added a comment -

          > Apologies for not reviewing this sooner after it was committed - but this is the last/best chance to improve the interface before 1.4 is released (and this is very important new functionality).

          My only request is that, if you're changing how field mapping works and maybe removing ext.ignore.und.fl, you make sure it stays easy to say, "Tika, I don't care about any of your parsed metadata. Please leave it out of my Solr index." In my current use case I already know all the metadata I want, and including the Tika-parsed fields would result in index bloat. (My temptation would be to make excluding Tika-parsed fields the default, though it sounds like other people have the opposite inclination.)

          Grant Ingersoll added a comment -

          ext.ignore.und.fl

          I think this should be kept and this is a case where we should silently ignore. Parsing rich data is a different beast than normal Solr XML or other structured content. There are a lot of times where you only want to get specific fields and there can be a large number of fields. It is burdensome to have to add the ignores for all the metadata. Not to mention different types may have different metadata. So, -1 on removing.

          ext.idx.attr

          Yes, we may want it to be false. That's why I put it in! It can be used to control whether things like HREF are extracted into other fields. Think faceting.

          ext.metadata.prefix

          This is not a mapping thing so much as a way to handle metadata fields separately from the main text fields. I'm not sure if it differs from the uprefix approach you are proposing, except that you can know exactly what is metadata and what isn't.

          Other questions that Yonik brought up:

          1. I don't think trying to auto-map is a good idea. New file formats will have new ways of doing this; it's better to have the user handle it.
          2. Fine with dropping ext for common names
          3. Metadata is often not useful and I don't think we need to do work as suggested. See Eric's comment above.
          4. Enabling by default is fine.

          Yonik Seeley added a comment -

          > My only request is that, if you're changing how field mapping works and maybe removing ext.ignore.und.fl, you make sure it stays easy to say, "Tika, I don't care about any of your parsed metadata.

          Map unknown fields to an ignored fieldtype.
          uprefix=ignored_
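
          That relies on an "ignored" fieldtype in the schema, along these lines (a sketch following the example schema's conventions):

          <fieldtype name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField"/>
          <dynamicField name="ignored_*" type="ignored" multiValued="true"/>

          Anything mapped to an ignored_ field is then silently dropped at index time.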

          Eric Pugh added a comment -

          I am out of the office 6/29 - 6/30. For urgent issues, please contact Jason Hull at jhull@opensourceconnections.com or phone at (434) 409-8451.

          Yonik Seeley added a comment -

          > It is burdensome to have to add the ignores for all the metadata.

          It would be easy to change the default from index to ignore:
          uprefix=ignored_ // ignored_ will be defined in the schema as indexed=false, stored=false
          uprefix=attr_

          Actually, that brings up another random question... when we get the metadata back from Tika, is it typed (can we tell that number of pages is an integer?)

          Yonik Seeley added a comment - edited

          >> I just tried setting ext.idx.attr=false, and I didn't see any change after indexing a PDF.
          > This is often needed for HTML, where it is used to index the attributes of tags. Same would go for XML.

          That's confusing given that the examples on the wiki show PDFs being indexed with ext.idx.attr=true

          It also confused me since the docs say "Index the Tika XHTML attributes into separate fields, named after the attribute." and the docs also say "Tika does everything by producing an XHTML stream that it feeds to a SAX ContentHandler".
          That led me to believe that ext.idx.attr was for all tika generated metadata (or maybe it is, but tika doesn't generally use attributes?)

          It's also rather confusing just what rules can be applied to what. For example, does ext.metadata.prefix work on stuff produced by ext.idx.attr?
          edit: nope, I just tried, and that does not work.

          Chris Harris added a comment -

          > My only request is that, if you're changing how field mapping works and maybe removing ext.ignore.und.fl, you make sure it stays easy to say, "Tika, I don't care about any of your parsed metadata.

          > Map unknown fields to an ignored fieldtype.
          > uprefix=ignored_

          That seems fine.

          Tangentially, I wonder how fast Tika's metadata extraction is compared to its main body text extraction. If the latter doesn't dwarf the former, there might be value in adding a "Solr, don't even ask Tika to extract metadata at all; just have it extract the body text" flag; this could potentially speed things up for people who don't need the metadata. Maybe it would make sense to benchmark things before adding such a flag, though. I also don't have a good sense of how many people will want to use the metadata feature vs how many won't.

          Yonik Seeley added a comment -

          The current ext.metadata.prefix parameter adds the prefix to all attributes, even those that have already been mapped (so last_modified appears instead as attr_last_modified). Seems like one really wants a prefix prepended only to those fields that are not explicitly mapped (or don't appear in the schema)... this is what the proposed "uprefix" (unknown field prefix) would do.
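
          Worked example (field names illustrative): with uprefix=attr_, a field that is explicitly mapped or already in the schema, like last_modified, stays last_modified, while an unknown Tika field such as Company would be indexed as attr_Company.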

          Yonik Seeley added a comment -

          The date.format thing is interesting.... but shouldn't that really be part of a Date fieldType that can accept all those formats?
          Transforming in the update handler only means that you could add a literal.mydate=date1 via the update handler, and then fail to query it (because the date parsing was specific to the update handler.)

          Perhaps we could add this to the new trie field for dates?
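
          (For reference, the handler-side date handling under discussion is configured per-handler in solrconfig.xml, roughly like this - a sketch of the date.formats option, if I have the element placement right:

          <requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
            <lst name="date.formats">
              <str>yyyy-MM-dd</str>
            </lst>
          </requestHandler>

          which is exactly the update-handler-local parsing being argued here to belong in the field type instead.)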

          Grant Ingersoll added a comment -

          > The date.format thing is interesting.... but shouldn't that really be part of a Date fieldType that can accept all those formats?

          Agreed, I was just wanting more Date Field Type capabilities the other day. It would be nice to be able to specify two things on the Date fieldType:
          1. Input formats accepted like what the ExtractingRequestHandler offers
          2. Output granularity. That is, one may not want to store seconds, etc., so Solr should drop the precision. Note, this is different from Trie in that it is only indexing one token.

          Probably should handle on a separate issue.

          Yonik Seeley added a comment -

          OK, here's my first crack at cleaning things up a little before release. Changes:

          • there were no tests for XML attribute indexing
          • capture had no unit tests
          • boost had no unit tests
          • ignoring unknown fields had no unit test
          • metadata prefix had no unit test
          • logging ignored fields at the INFO level for each document loaded is too verbose
          • removed handling of undeclared fields and let downstream components
            handle this.
          • avoid the String catenation code for single valued fields when Tika only
            produces a single value (for performance)
          • remove multiple literal detection handling for single valued fields - let a downstream component handle it
          • map literal values just as one would with generated metadata, since the user may be just supplying the extra metadata. also apply transforms (date formatting currently)
          • fixed a bug where null field values were being added (and later dropped by Solr... hence it was never caught).
          • avoid catching previously thrown SolrExceptions... let them fly through
          • removed some unused code (id generation, etc)
          • added lowernames option to map field names to lowercase/underscores
          • switched builderStack from synchronized Stack to LinkedList
          • fixed a bug that caused content to be appended with no whitespace in between
          • made extracting request handler lazy loading in example config
          • added ignored_ and attr_ dynamic fields in example schema

          Interface:

          The default field is always "content" - use map to change it to something else
          lowernames=true/false  // if true, map names like Content-Type to content_type
          map.<fname>=<target_field>
          boost.<fname>=<boost>
          literal.<fname>=<literal_value>
          xpath=<xpath_expr>  - only generate content for the matching xpath expr
          extractOnly=true/false - if true, just return the extracted content
          capture=<xml_element_name>  // separate out these elements 
          captureAttr=<xml_element_name>   // separate out the attributes for these elements
          uprefix=<prefix>  // unknown field prefix - any unknown fields will be prepended with this value
          stream.type
          resource.name
          

          To try to make things more uniform, all fields, whether "content", metadata, attributes, or literals, go through the same process.
          1) map to lowercase if lowernames=true
          2) apply map.field rules
          3) if the resulting field is unknown, prefix it with uprefix
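
          Worked example (parameter values illustrative): with lowernames=true, map.content_type=media_type, and uprefix=attr_, the Tika field "Content-Type" first becomes content_type, is then mapped to media_type, and, if media_type were not defined in the schema, would finally be indexed as attr_media_type.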

          Hopefully people will agree that this is an improvement in general. I think in the future we'll need more advanced options, esp around dealing with links in HTML and more powerful xpath constructs, but that's for after 1.4 IMO.
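
          A request exercising this interface might look like the following (a sketch; the /update/extract path, port, and field names are assumed for illustration):

          curl "http://localhost:8983/solr/update/extract?literal.id=doc1&lowernames=true&uprefix=attr_&capture=div&map.div=div_text&commit=true" -F "file=@test.pdf"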

          Yonik Seeley added a comment -

          OK, I've committed the above. I'll work on updating the wiki, including clarifying things that didn't make sense the first time I looked at this.

          Yonik Seeley added a comment -

          Attaching a schema update to define some common useful metadata fields to improve the OOTB experience.
          Any concerns or suggestions for improvements? I'd like to commit shortly to get it into 1.4

          Grant Ingersoll added a comment - edited

          I don't think we should drop ext.def.fl (the name can change, but the functionality is useful) and am going to reopen this. Namely, it is often the case that one wants all values that aren't explicitly mapped to go into a default field, and I don't think that is possible using uprefix. Since not all metadata fields are knowable up front, there is currently no way to express this in the ExtractingRequestHandler.
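
          (Sketch of the intent, using the parameter named in the patch below: adding defaultField=text to the request would send any extracted field that is not explicitly mapped into the catch-all "text" field. The target field name here is illustrative.)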

          Grant Ingersoll added a comment -

          Adds in defaultField parameter and tests.

          Yonik Seeley added a comment -

          > it is often the case where one wants all values that aren't explicitly mapped to go into a default field

          What's the real use-case, to be able to search all metadata? One could use a dynamic copyField into a single indexed field. That also helps if one still wants to keep all of the stored values for the metadata in separate fields.
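
          E.g., something along these lines in schema.xml (a sketch; the type name is assumed from the example schema):

          <dynamicField name="attr_*" type="textgen" indexed="true" stored="true" multiValued="true"/>
          <copyField source="attr_*" dest="text"/>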

          Grant Ingersoll added a comment -

          > What's the real use-case, to be able to search all metadata? One could use a dynamic copyField into a single indexed field. That also helps if one still wants to keep all of the stored values for the metadata in separate fields.

          Yeah, that works too, but it is convoluted, and I may not care about storing the attributes, nor do I want to deal with copyFields and the extra performance cost. It just seems easier to have a default field capability. Then one can just have everything go to it.

          Grant Ingersoll added a comment -

          Yonik, any objections to me committing the current patch given your concerns?

          David Smiley added a comment -

          Grant, your response confuses me. How does a copyField necessitate storing the fields? And how is the copyField slower than this feature mapping to a common attribute which ends up with an equivalent outcome?

          Grant Ingersoll added a comment -

          > How does a copyField necessitate storing the fields?

          Yonik suggested his approach helps with stored values for the metadata.

          > And how is the copyField slower than this feature mapping to a common attribute which ends up with an equivalent outcome?

          As I understand Yonik's response, he was suggesting that I use uprefix combined with copyFields. That involves two field entries, when I only care about the one catch-all. copyFields do have a cost, especially when you don't need them.

          At any rate, with the patch I put up, you have the option of doing it either way.

          Grant Ingersoll added a comment -

          Committed revision 815293.

          Chris Harris added a comment -

          This caught me by surprise, so I'm noting it here in case it helps anyone else:

          In SVN r815830 (September 16, 2009), Grant renamed the field name mapping argument "map" to "fmap". The reason was to make naming more consistent with the CSV handler. For more info on this see the following thread:

          http://www.nabble.com/Fwd%3A-CSV-Update---Need-help-mapping-csv-field-to-schema%27s-ID-td25463942.html
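
          So, for example, what was previously map.content=text on a request or in solrconfig.xml is now fmap.content=text (target field name illustrative).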

          Chris Harris added a comment -

          Grant and company: I just noticed that the example solrconfig.xml at the head of SVN trunk still uses map, not fmap. (In particular, there's "map.content", "map.a", and "map.div".) I assume this should be fixed for the 1.4 release. Interestingly, this doesn't seem to make any unit tests fail.

          Yonik Seeley added a comment -

          > example solrconfig.xml at the head of SVN trunk still uses map, not fmap.

          Thanks, I just fixed this.

          Grant Ingersoll added a comment -

          Bulk close for Solr 1.4


            People

            • Assignee: Grant Ingersoll
            • Reporter: Eric Pugh
            • Votes: 32
            • Watchers: 15
