Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 4.9, 5.0
    • Component/s: clients - java
    • Labels:
      None

      Description

      It is often cost-prohibitive to send full, rich documents over the wire. The contrib/extraction library has server-side integration with Tika, but it would be nice to have a client-side implementation as well. It should support either metadata plus content, or metadata only.
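
The two modes requested here (metadata plus content, or metadata only) can be sketched on the client side roughly as follows. This is a minimal illustration, not the attached code: `ClientExtractionSketch`, `Extractor`, `Extracted`, and `ExtractionMode` are hypothetical names, and the `Extractor` interface only stands in for a real parser such as Tika so that the example runs with the JDK alone.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: client-side extraction with a metadata-only switch. */
class ClientExtractionSketch {

    /** Which parts of the parsed file are sent to the server. */
    enum ExtractionMode { METADATA_AND_CONTENT, METADATA_ONLY }

    /** Result of parsing one file: body text plus metadata name/value pairs. */
    static final class Extracted {
        final String content;
        final Map<String, String> metadata;

        Extracted(String content, Map<String, String> metadata) {
            this.content = content;
            this.metadata = metadata;
        }
    }

    /** Stand-in for a real parser (assumption: this is not Tika's API). */
    interface Extractor {
        Extracted extract(InputStream in) throws IOException;
    }

    /** Builds the field map that would be sent over the wire. */
    static Map<String, Object> toFields(InputStream in, Extractor extractor, ExtractionMode mode) {
        try {
            Extracted parsed = extractor.extract(in);
            Map<String, Object> fields = new LinkedHashMap<String, Object>(parsed.metadata);
            if (mode == ExtractionMode.METADATA_AND_CONTENT) {
                // The bulky body text is included only when explicitly requested.
                fields.put("content", parsed.content);
            }
            return fields;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Toy extractor: the first line is treated as the title, the rest as the body. */
    static Extracted toyExtract(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        String text = new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        int newline = text.indexOf('\n');
        String title = newline < 0 ? text : text.substring(0, newline);
        String body = newline < 0 ? "" : text.substring(newline + 1);
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("title", title);
        metadata.put("stream_size", String.valueOf(buffer.size()));
        return new Extracted(body, metadata);
    }

    public static void main(String[] args) {
        byte[] file = "Quarterly report\nRevenue grew.".getBytes(StandardCharsets.UTF_8);
        System.out.println(toFields(new ByteArrayInputStream(file),
                ClientExtractionSketch::toyExtract, ExtractionMode.METADATA_AND_CONTENT));
        System.out.println(toFields(new ByteArrayInputStream(file),
                ClientExtractionSketch::toyExtract, ExtractionMode.METADATA_ONLY));
    }
}
```

In metadata-only mode the body never leaves the client, which is where the bandwidth saving described above would come from.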

      1. clientextraction.tar.gz
        15 kB
        Tomás Fernández Löbbe

        Issue Links

          Activity

          Uwe Schindler made changes -
          Fix Version/s 4.9 [ 12326731 ]
          Fix Version/s 5.0 [ 12321664 ]
          Fix Version/s 4.8 [ 12326254 ]
          Uwe Schindler added a comment -

          Move issue to Solr 4.9.

          David Smiley made changes -
          Fix Version/s 4.8 [ 12326254 ]
          Fix Version/s 4.7 [ 12325573 ]
          Uwe Schindler made changes -
          Fix Version/s 4.7 [ 12325573 ]
          Fix Version/s 4.6 [ 12325000 ]
          Adrien Grand made changes -
          Fix Version/s 4.6 [ 12325000 ]
          Fix Version/s 5.0 [ 12321664 ]
          Fix Version/s 4.5 [ 12324743 ]
          Steve Rowe made changes -
          Fix Version/s 5.0 [ 12321664 ]
          Fix Version/s 4.5 [ 12324743 ]
          Fix Version/s 4.4 [ 12324324 ]
          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Uwe Schindler made changes -
          Fix Version/s 4.4 [ 12324324 ]
          Fix Version/s 4.3 [ 12324128 ]
          Robert Muir made changes -
          Fix Version/s 4.3 [ 12324128 ]
          Fix Version/s 5.0 [ 12321664 ]
          Fix Version/s 4.2 [ 12323893 ]
          Mark Miller made changes -
          Fix Version/s 4.2 [ 12323893 ]
          Fix Version/s 5.0 [ 12321664 ]
          Fix Version/s 4.1 [ 12321141 ]
          Robert Muir made changes -
          Fix Version/s 4.1 [ 12321141 ]
          Fix Version/s 4.0 [ 12314992 ]
          Hoss Man made changes -
          Fix Version/s 3.6 [ 12319065 ]
          Hoss Man added a comment -

          Bulk of fixVersion=3.6 -> fixVersion=4.0 for issues that have no assignee and have not been updated recently.

          email notification suppressed to prevent mass-spam
          pseudo-unique token identifying these issues: hoss20120321nofix36

          Simon Willnauer made changes -
          Fix Version/s 3.6 [ 12319065 ]
          Fix Version/s 3.5 [ 12317876 ]
          Jan Høydahl made changes -
          Link This issue requires SOLR-2842 [ SOLR-2842 ]
          Robert Muir made changes -
          Fix Version/s 3.5 [ 12317876 ]
          Fix Version/s 3.4 [ 12316683 ]
          Robert Muir added a comment -

          3.4 -> 3.5

          Robert Muir made changes -
          Fix Version/s 3.4 [ 12316683 ]
          Fix Version/s 4.0 [ 12314992 ]
          Fix Version/s 3.3 [ 12316471 ]
          Robert Muir made changes -
          Fix Version/s 3.3 [ 12316471 ]
          Fix Version/s 3.2 [ 12316172 ]
          Robert Muir added a comment -

          Bulk move 3.2 -> 3.3

          Hoss Man made changes -
          Fix Version/s 3.2 [ 12316172 ]
          Fix Version/s Next [ 12315093 ]
          Tomás Fernández Löbbe made changes -
          Attachment clientextraction.tar.gz [ 12468441 ]
          Tomás Fernández Löbbe added a comment -

          I'll upload the code I mentioned a couple of days ago in case somebody wants it. I added it as a new contrib, which is why I'm uploading a tar file instead of a patch. It uses the same libraries as the extraction contrib. It still doesn't work with dates and has lots of pending items, but I think we should decide how to implement this before I continue coding.
          You will see that the "lib" directory is empty. That's because:
          1) I can't upload a file with all the jars because its size would be more than 10 MB, the maximum upload size in Jira.
          2) It uses the same jars as the "extraction" contrib, so to use clientextraction, simply copy the jars from the extraction contrib.

          Tomás Fernández Löbbe added a comment -

          Now I get what you mean about the UpdateRequestProcessor (I thought you were talking about a different/new component). I like the idea of reusing the code, but I don't like the idea of adding complexity to SolrJ. Is it worth porting the UpdateRequestProcessorChain to SolrJ? I definitely wouldn't like to have a configuration file in the SolrJ API.

          Jan Høydahl added a comment -

          Nope, I have not started on 1763 yet.

          Tomás Fernández Löbbe added a comment -

          I'm sorry, I saw some comments about the UpdateProcessors, but I couldn't find enough documentation. Is this a new component? Is it documented somewhere?
          I saw you've been working on SOLR-1763. Do you have anything for that yet?

          Jan Høydahl added a comment -

          I linked this issue to SOLR-1763, as they attempt to solve the same thing, on client vs server side.

          Instead of creating two solutions, we should base both on the same code base and config, so that it is easy to switch between them. Perhaps someone starts with server-side extraction but then wants to optimize performance by going client-side. The switch should be intuitive.

          Thus, should we consider porting the whole UpdateProcessorChain to SolrJ? How cool would it be to choose whether to execute an UP on the client or the server side simply by a configuration change? I realize that some UPs may depend on SolrCore or have other difficult dependencies, but it should be possible to work around that, no?
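
The idea of running the same processor chain on either side can be illustrated with a small chain-of-responsibility sketch. Everything in it is hypothetical: `DocProcessor`, `ProcessorChain`, and the `runOnClient` flag are illustrative names, not Solr's update processor API, and a plain `Map` stands in for `SolrInputDocument`. It only shows that steps with no dependency on server state could run unchanged wherever the chain is executed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a processor chain that can run on the client. */
class ProcessorChainSketch {

    /** One processing step; mutates the document it is given. */
    interface DocProcessor {
        void process(Map<String, Object> doc);
    }

    /** Runs the configured steps in order on a copy of the document. */
    static final class ProcessorChain {
        private final List<DocProcessor> steps;

        ProcessorChain(DocProcessor... steps) {
            this.steps = new ArrayList<>(Arrays.asList(steps));
        }

        Map<String, Object> run(Map<String, Object> doc) {
            Map<String, Object> copy = new LinkedHashMap<String, Object>(doc);
            for (DocProcessor step : steps) {
                step.process(copy);
            }
            return copy;
        }
    }

    /** Example step: trims surrounding whitespace from every string value. */
    static DocProcessor trimStrings() {
        return doc -> doc.replaceAll((k, v) -> v instanceof String ? ((String) v).trim() : v);
    }

    /** Example step: adds a field only when the document lacks it. */
    static DocProcessor defaultValue(String field, Object value) {
        return doc -> doc.putIfAbsent(field, value);
    }

    /**
     * Hypothetical configuration switch: apply the chain locally, or pass the
     * document through untouched so that the server can apply it instead.
     */
    static Map<String, Object> prepare(Map<String, Object> doc, ProcessorChain chain, boolean runOnClient) {
        if (runOnClient) {
            return chain.run(doc);
        }
        return new LinkedHashMap<String, Object>(doc);
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("title", "  Hello  ");
        ProcessorChain chain = new ProcessorChain(trimStrings(), defaultValue("lang", "en"));
        System.out.println(prepare(doc, chain, true));
        System.out.println(prepare(doc, chain, false));
    }
}
```

Steps that need something like SolrCore would not fit this shape as-is, which is the dependency concern raised in the comment.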

          Jan Høydahl made changes -
          Link This issue relates to SOLR-1763 [ SOLR-1763 ]
          Tomás Fernández Löbbe added a comment -

          I have a possible implementation for this issue. I created a class, SolrFileInputDocument, that extends SolrInputDocument. The main difference is that it adds the methods:

          public void addFile(InputStream file)

          and

          public void addFile(InputStream file, Metadata metadata)

          These two methods use Tika to extract the content and end up creating fields (this.addField(...)) on the parent class SolrInputDocument. SolrFileInputDocument accepts a Map instance that maps the extracted metadata to Solr fields, something like this:

          Map<String, String> map = new HashMap<String, String>();
          map.put("content", "text");
          map.put("keywords", "cat");
          map.put("creator", "manu");
          SolrFileInputDocument document = new SolrFileInputDocument(map);

          I added the classes to another "contrib" directory. I don't know if this is the right place, but I didn't want to add a dependency on Tika that might not always be needed. Using this code in a client application would require adding the SolrJ jar plus the "clientextraction" jar.

          I still haven't done anything to keep the "prefix" feature of the ExtractingRequestHandler (which I don't think is going to be difficult), and I still don't handle non-text fields like dates, but I could do it if you think this is a good approach.

          Do you think this could work? I can upload the code tomorrow.
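
The mapping behaviour described in this comment can be sketched without Tika or SolrJ on the classpath. The class below is a hypothetical stand-in, not the attached code: `MappedFileDocumentSketch` and `addExtracted` are invented names, the parser output is assumed to be already available as a string plus a metadata map, and dropping unmapped metadata is an assumption made for brevity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the SolrFileInputDocument idea; not the attached code. */
class MappedFileDocumentSketch {
    // Extracted name (for example "keywords") -> target field name (for example "cat").
    private final Map<String, String> fieldMapping;
    private final Map<String, List<Object>> fields = new LinkedHashMap<>();

    MappedFileDocumentSketch(Map<String, String> fieldMapping) {
        this.fieldMapping = new LinkedHashMap<>(fieldMapping);
    }

    /** Simplified counterpart of addField: values for a field accumulate in a list. */
    void addField(String name, Object value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /**
     * Copies already-extracted content and metadata into mapped fields.
     * Assumption: names without a mapping are silently dropped.
     */
    void addExtracted(String content, Map<String, String> metadata) {
        String contentField = fieldMapping.get("content");
        if (contentField != null && content != null) {
            addField(contentField, content);
        }
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            String target = fieldMapping.get(entry.getKey());
            if (target != null) {
                addField(target, entry.getValue());
            }
        }
    }

    List<Object> getFieldValues(String name) {
        List<Object> values = fields.get(name);
        return values == null ? Collections.<Object>emptyList() : Collections.unmodifiableList(values);
    }

    public static void main(String[] args) {
        Map<String, String> map = new LinkedHashMap<>();
        map.put("content", "text");
        map.put("keywords", "cat");
        map.put("creator", "manu");
        MappedFileDocumentSketch document = new MappedFileDocumentSketch(map);

        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("keywords", "solr, tika");
        metadata.put("creator", "alice");
        metadata.put("Content-Type", "application/pdf"); // no mapping, so it is dropped
        document.addExtracted("Hello body", metadata);

        System.out.println(document.getFieldValues("text"));
        System.out.println(document.getFieldValues("cat"));
        System.out.println(document.getFieldValues("manu"));
    }
}
```

A real implementation would still need the two pieces the comment lists as pending: a prefix for unmapped names and typed handling of non-text values such as dates.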

          Hoss Man made changes -
          Fix Version/s Next [ 12315093 ]
          Fix Version/s 1.5 [ 12313566 ]
          Hoss Man added a comment -

          Bulk updating 240 Solr issues to set the Fix Version to "next" per the process outlined in this email...

          http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

          Selection criteria was "Unresolved" with a Fix Version of 1.5, 1.6, 3.1, or 4.0. email notifications were suppressed.

          A unique token for finding these 240 issues in the future: hossversioncleanup20100527

          Grant Ingersoll made changes -
          Field Original Value New Value
          Fix Version/s 1.5 [ 12313566 ]
          Grant Ingersoll created issue -

            People

            • Assignee:
              Unassigned
              Reporter:
              Grant Ingersoll
            • Votes:
              1
              Watchers:
              5

              Dates

              • Created:
                Updated:

                Development