Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.10
    • Fix Version/s: 1.2
    • Component/s: general
    • Labels:
      None

      Description

      It would be cool to be able to run Tika as a network service that accepts a binary document as input and produces the extracted content (as XHTML, text, or just metadata) as output. A bit like TIKA-169, but without the dependency on a servlet container.

      I'd like to be able to set up and run such a server like this:

      $ java -jar tika-app.jar --port 1234

      We should also add a NetworkParser class that acts as a local client for such a service. This way a lightweight client could use the full set of Tika parsing functionality even with just the tika-core jar within its classpath.
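A rough idea of what such a lightweight client could look like, using only JDK classes. This is a minimal sketch, not the NetworkParser proposed above: the class name `NetworkParserSketch`, the `/tika` path, the use of HTTP PUT, and the chosen headers are all assumptions made for illustration. The stub server is a stand-in so the example runs without Tika; it only reports how many bytes it received.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class NetworkParserSketch {

    /** Sends the document bytes with HTTP PUT and returns the response body as UTF-8 text. */
    public static String extract(String endpoint, byte[] document) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
            conn.setRequestMethod("PUT");
            conn.setDoOutput(true);
            // Hypothetical headers: a real service defines which types it accepts and returns.
            conn.setRequestProperty("Content-Type", "application/octet-stream");
            conn.setRequestProperty("Accept", "text/plain");
            try (OutputStream out = conn.getOutputStream()) {
                out.write(document);
            }
            try (InputStream in = conn.getInputStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Starts a stand-in server on a free port. It is NOT Tika: it only reports the byte count. */
    public static HttpServer startStubServer() {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/tika", exchange -> {
                byte[] body = exchange.getRequestBody().readAllBytes();
                byte[] reply = ("received " + body.length + " bytes").getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, reply.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(reply);
                }
            });
            server.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpServer server = startStubServer();
        try {
            String url = "http://127.0.0.1:" + server.getAddress().getPort() + "/tika";
            // Prints "received 5 bytes" because the stub only counts the uploaded bytes.
            System.out.println(extract(url, "hello".getBytes(StandardCharsets.UTF_8)));
        } finally {
            server.stop(0);
        }
    }
}
```

The point of the sketch is that the client side needs nothing beyond the JDK; all parsing dependencies would live on the server.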

      1. TIKA-593.Mattmann.032712.patch.txt
        38 kB
        Chris A. Mattmann
      2. TIKA-593.Mattmann.032712.patch.2.txt
        38 kB
        Chris A. Mattmann
      3. TIKA-593.Mattmann.032612.patch.txt
        13 kB
        Chris A. Mattmann
      4. TIKA-593.Mattmann.032612.patch.2.txt
        19 kB
        Chris A. Mattmann
      5. TIKA-593_pom.diff
        1 kB
        Ingo Renner

        Activity

        Maxim Valyanskiy added a comment -

        I made an HTTP server with Jersey (JAX-RS) and embedded Glassfish (or Grizzly?) for text, metadata and binary attachment extraction. It has a very simple REST-style interface.

        I think we can contribute it to the Tika project. I can also try to replace Glassfish and Jersey with Tomcat and Apache Wink if required.

        What do you think?

        Chris A. Mattmann added a comment -

        My 2c on this:

        +1 to using JAX RS

        RE: the actual implementation, I used Apache CXF for OODT and it's basically a single dependency: a jar drop-in (or a Maven pom.xml update). Wink is still incubating, right?

        Maxim Valyanskiy added a comment -

        I uploaded my implementation to GitHub for preview: https://github.com/maxcom/tikaserver-ex (please look at the README file for build instructions and usage examples). Comments are welcome.

        Otis Gospodnetic added a comment -

        Out of curiosity, why not just have a simple webapp (war) that uses Tika that reads the InputStream and spits back the data in whatever format is needed/specified? Sure, it requires a servlet container, but is that really a big deal? Just asking because it seems a tiny bit simpler than using Netty or Mina or HttpComponents or embedded Jetty or Grizzly.

        Maxim Valyanskiy added a comment -

        Added my implementation in revision 1074088

        Jukka Zitting added a comment -

        Reopening until we figure out what to do with the references to the dev.java.net repositories.
        Earlier we had problems with such references to non-standard Maven repositories and I
        wouldn't like to have this issue block another release.

        In revision 1079922 I removed the tika-server component from the default build, which
        should allow us to release Tika even with the dev.java.net dependencies in place (we just
        can't deploy tika-server to Maven central then).

        There were also some test failures, apparently due to a dependency version mismatch.
        See https://builds.apache.org/hudson/job/Tika-trunk/483/org.apache.tika$tika-server/ for details.

        Maxim Valyanskiy added a comment -

        I updated the tika-server component. I replaced the Grizzly web server with Jetty (which is available from the Maven Central repository). I'm going to try to replace Jersey with Wink later.

        Chris A. Mattmann added a comment -
        • pushing to 1.0. assign to me. I'd like to shepherd this through with CXF. I'll make time in the next week.
        Chris A. Mattmann added a comment -
        • push out to 1.1: prep for 1.0.
        Chris A. Mattmann added a comment -

        Here I go, I am going to try and integrate Apache CXF and swap out Jersey here. Wish me luck!

        Chris A. Mattmann added a comment -

        With the latest version (1.1-trunk) of Tika, I tried building tika-server and got this unit test failure:

        -------------------------------------------------------
         T E S T S
        -------------------------------------------------------
        Running org.apache.tika.server.TikaResourceTest
        Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 8.374 sec
        Running org.apache.tika.server.UnpackerResourceTest
        Tests run: 6, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 10.617 sec <<< FAILURE!
        Running org.apache.tika.server.MetadataResourceTest
        Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.346 sec
        
        Results :
        
        Tests in error: 
          testExeDOCX(org.apache.tika.server.UnpackerResourceTest)
        
        Tests run: 11, Failures: 0, Errors: 1, Skipped: 0
        

        Here are the surefire-reports:

        [chipotle:~/src/tika/trunk] mattmann% more tika-server/target/surefire-reports/org.apache.tika.server.UnpackerResourceTest.txt 
        -------------------------------------------------------------------------------
        Test set: org.apache.tika.server.UnpackerResourceTest
        -------------------------------------------------------------------------------
        Tests run: 6, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 10.613 sec <<< FAILURE!
        testExeDOCX(org.apache.tika.server.UnpackerResourceTest)  Time elapsed: 1.418 sec  <<< ERROR!
        com.sun.jersey.api.client.UniformInterfaceException: PUT http://localhost:9998/unpacker returned a response status of 204 No Content
                at com.sun.jersey.api.client.ClientResponse.getEntity(ClientResponse.java:528)
                at com.sun.jersey.api.client.ClientResponse.getEntity(ClientResponse.java:506)
                at com.sun.jersey.api.client.WebResource.handle(WebResource.java:674)
                at com.sun.jersey.api.client.WebResource.put(WebResource.java:221)
        

        I'm looking into it.

        Ingo Renner added a comment -

        Got it running after fixing Eclipse's complaints in pom.xml and updating the mvn dependencies

        Chris A. Mattmann added a comment -

        Ingo, I applied your patch to my local copy, but am still seeing test failures:

        INFO: Stopping the Grizzly Web Container...
        Tests run: 6, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 9.341 sec <<< FAILURE!
        
        Results :
        
        Tests in error: 
          testExeDOCX(org.apache.tika.server.UnpackerResourceTest): PUT http://localhost:9998/unpacker returned a response status of 204 No Content
        
        Tests run: 11, Failures: 0, Errors: 1, Skipped: 0
        
        
        Chris A. Mattmann added a comment -
        • push out to 1.2
        Maxim Valyanskiy added a comment -

        I found that the Jersey dependencies are on Maven Central now (https://blogs.oracle.com/theaquarium/entry/jersey_moving_forward_contributions_maven).

        I'm going to synchronize tika-server with our production code and enable it in the default build after the Tika 1.1 release, if there are no objections.

        Maxim Valyanskiy added a comment - - edited
          testExeDOCX(org.apache.tika.server.UnpackerResourceTest): PUT http://localhost:9998/unpacker returned a response status of 204 No Content
        

        I could not reproduce this problem on the current trunk version.

        Chris A. Mattmann added a comment -

        Hey Max, I don't have objections to moving forward with re-enabling the module. How about we use CXF like I suggested, though? I will try a commit to the POM shortly that adds the CXF JAX-RS dependencies. Let's try that.

        Chris A. Mattmann added a comment -
           <dependency>
              <groupId>org.apache.cxf</groupId>
              <artifactId>cxf-rt-frontend-jaxrs</artifactId>
              <version>2.3.1</version>
           </dependency>
        

        Max, see above. That will take care of the transitive dependencies for JAX-RS, including the API, etc.
        I'm not sure of a replacement for the test portions of the Jersey code. If you are +1 with the above, I'd like
        to commit it to the tika-server/pom.xml file.

        Maxim Valyanskiy added a comment -

        Chris, it is not enough just to change the dependency. There is a TikaServerCli class that configures the Jetty + Jersey combo, and it depends on Jersey classes.

        Chris A. Mattmann added a comment -
        • Max FYI my current progress. I'm trying to get the unit tests rewritten but they are failing right now. Check out MetadataResource to see. The cool part is that we reduce a bunch of the Maven dependencies with CXF and we are eating our own dog food. I will go to the CXF lists tomorrow with my question about the failing unit tests.
        Chris A. Mattmann added a comment -
        • ok tests passing, mostly. Will finish tomorrow morning!
        Chris A. Mattmann added a comment -
        • a lot closer. Unpacker tests are failing. Max, how did Jersey deal with the Map<String,byte[]> that you are returning in UnpackerResource? I don't see any @Providers in Jersey that natively know how to deal with this data structure, nor do I see any @Provider classes that you have written to take care of it. How was Jersey dealing with this?
        Maxim Valyanskiy added a comment -

        Chris, there are two providers in my code that process these Maps, ZipWriter and TarWriter:

        https://github.com/apache/tika/blob/trunk/tika-server/src/main/java/org/apache/tika/server/ZipWriter.java
        https://github.com/apache/tika/blob/trunk/tika-server/src/main/java/org/apache/tika/server/TarWriter.java

        I now think it was not a good idea to use the Map class directly; it would be better to introduce a more specific interface.
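For readers unfamiliar with the idea, the core of turning a Map<String,byte[]> into an archive can be sketched with plain JDK classes. This is only an illustration under assumptions: it is not the ZipWriter linked above (which is a JAX-RS provider), and the class and method names here (`MapZipSketch`, `writeZip`, `readZip`) are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class MapZipSketch {

    /** Writes each map entry as one file in a ZIP archive. Note: this closes the target stream. */
    public static void writeZip(Map<String, byte[]> parts, OutputStream target) {
        try (ZipOutputStream zip = new ZipOutputStream(target)) {
            for (Map.Entry<String, byte[]> part : parts.entrySet()) {
                zip.putNextEntry(new ZipEntry(part.getKey())); // key becomes the file name
                zip.write(part.getValue());                    // value becomes the file content
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads a ZIP archive back into a map, mainly to check the round trip. */
    public static Map<String, byte[]> readZip(byte[] archive) {
        Map<String, byte[]> parts = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archive))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // readAllBytes stops at the end of the current entry.
                parts.put(entry.getName(), zip.readAllBytes());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return parts;
    }

    public static void main(String[] args) {
        Map<String, byte[]> parts = new LinkedHashMap<>();
        parts.put("image1.png", new byte[] {1, 2, 3});
        parts.put("notes.txt", "hello".getBytes());
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        writeZip(parts, buffer);
        System.out.println(readZip(buffer.toByteArray()).keySet()); // prints [image1.png, notes.txt]
    }
}
```

A JAX-RS framework only picks such a writer for a given return type if it is registered as a provider, which is why a bare Map return value is easy to misread.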

        Chris A. Mattmann added a comment -

        Thanks, Max, see latest patch. I'm close now:

        -------------------------------------------------------
         T E S T S
        -------------------------------------------------------
        Running org.apache.tika.server.MetadataResourceTest
        Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.281 sec
        Running org.apache.tika.server.TikaResourceTest
        Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.167 sec
        Running org.apache.tika.server.UnpackerResourceTest
        Tests run: 10, Failures: 1, Errors: 2, Skipped: 0, Time elapsed: 2.012 sec <<< FAILURE!
        
        Results :
        
        Failed tests:   test415(org.apache.tika.server.UnpackerResourceTest): expected:<415> but was:<406>
        
        Tests in error: 
          testTarDocPicture(org.apache.tika.server.UnpackerResourceTest): Invalid byte 0 at offset 0 in '{NUL}{NUL}{NUL}{NUL}{NUL}{NUL}{NUL}' len=8
          testText(org.apache.tika.server.UnpackerResourceTest): Stream closed
        
        

        1 failure, 2 errors, and the rest pass. Any ideas?

        Chris A. Mattmann added a comment -

        Hey Max, in r1305940 I committed the latest patch, with those 3 tests disabled in UnpackerResourceTest for now. We can fix them and wrap this up; until then, I'll leave the issue open. Help is welcome!

        Chris A. Mattmann added a comment -

        OK, I give up for now. I disabled the 415 test that isn't passing. After researching this for hours, and working with Paul Ramirez (thanks for the help Paul), we basically found the following things to be true:

        • Jersey automatically sets Accept to something like */*, which IMHO is more sensible than CXF, which sets it to an XML accept type (which causes the resource to not even find the path in test415).
        • For whatever reason, if you set Accept to "xxx/xxx", CXF does not check it up front the way Jersey seemed to; it lets the call get all the way to the UnpackerResource#unpack method, where the Tika AutoDetectParser then fails. Jersey seemed to have caught this. I have no clue why. We mucked around with different accept and type calls and got it to send 200 OK back and parse fine (e.g., if you set the accept to */* and type to APPLICATION_MSWORD), but that defeats the purpose of the test. If you send in xxx/xxx, it seems like the JAX-RS service should send back a 415.

        I need some massive help from anyone who knows CXF to figure this out. I have to step away from this for now. For now all tests pass, they are cleaned up using the CXF client (with HttpClient removed), and I disabled test415. Any help to get 415 working with CXF is welcome, even if we have to modify UnpackerResource to do the check. I know that Sergey (from CXF-ville) is watching this one, so I would love some help here!
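As background on the two status codes involved: in HTTP, 415 (Unsupported Media Type) refers to the request's Content-Type, while 406 (Not Acceptable) refers to the Accept header. The following is a deliberately simplified sketch of that distinction, not CXF's or Jersey's actual matching algorithm. The consumes/produces values are hypothetical and are not taken from UnpackerResource; the sketch handles a single Accept value and ignores q-values and parameters.

```java
import java.util.List;

public class MediaTypeCheckSketch {

    /** True if the concrete type (e.g. "application/msword") is covered by the pattern. */
    static boolean matches(String pattern, String type) {
        if (pattern.equals("*/*") || pattern.equals(type)) {
            return true;
        }
        // Handle "major/*" patterns such as "text/*".
        return pattern.endsWith("/*")
                && type.startsWith(pattern.substring(0, pattern.length() - 1));
    }

    /**
     * Returns the status a media-type check would give:
     * 415 if the request body type is not consumed, 406 if nothing acceptable is produced.
     */
    public static int status(List<String> consumes, List<String> produces,
                             String contentType, String accept) {
        boolean canConsume = consumes.stream().anyMatch(p -> matches(p, contentType));
        if (!canConsume) {
            return 415; // Unsupported Media Type: problem with Content-Type
        }
        boolean canProduce = produces.stream().anyMatch(p -> matches(accept, p));
        if (!canProduce) {
            return 406; // Not Acceptable: problem with Accept
        }
        return 200;
    }

    public static void main(String[] args) {
        // Hypothetical declarations for illustration only.
        List<String> consumes = List.of("application/msword");
        List<String> produces = List.of("application/zip");

        System.out.println(status(consumes, produces, "xxx/xxx", "*/*"));                        // 415
        System.out.println(status(consumes, produces, "application/msword", "application/xml")); // 406
        System.out.println(status(consumes, produces, "application/msword", "*/*"));             // 200
    }
}
```

Under these assumptions, a client that defaults Accept to an XML type can see 406 where the test expected 415, which is consistent with the "expected:<415> but was:<406>" failure reported earlier.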

        Sergey Beryozkin added a comment - - edited

        Hi, here is the thread on the CXF list to do with handling 415:

        http://cxf.547215.n5.nabble.com/TIKA-593-odd-behavior-related-to-CXF-JAX-RS-services-and-415-Http-response-codes-tt5600131.html.

        Re Accept: I think that the client code needs to have an idea about the format of the data it expects back, so CXF WebClient will try to set some specific default. FYI, proxy-based clients will analyze @Produces/@Consumes. Also, the idea of the client having to know what it expects back is finding its way into the JAX-RS 2.0 client API too.

        Update: WebClient (trunk/2.5.3-SNAPSHOT) will only default Accept to application/xml if a specific custom class is expected to be populated on return; if Response is expected back, then no action is taken.

        Maxim Valyanskiy added a comment -

        I do not completely understand your discussion about 415, but the test failed because TikaExceptionMapper was not added to providers list (by the way, maybe CXF supports classpath scanning like Jersey?).

        Maxim Valyanskiy added a comment -

        > The cool part is that we reduce a bunch of the Maven dependencies with CXF and we are eating our own dog food.

        The CXF implementation looks much heavier than Jersey; look at the output of "mvn dependency:tree".

        Sergey Beryozkin added a comment -

        > CXF implementation looks much heavier than Jersey, look at "mvn dependency:tree"

        I guess I'd be happy to discuss this issue elsewhere, but since you brought it up, here is the 2.6.0-SNAPSHOT one:

        [INFO] org.apache.cxf:cxf-rt-frontend-jaxrs:jar:2.6.0-SNAPSHOT
        [INFO] +- org.apache.aries.blueprint:org.apache.aries.blueprint.core:jar:0.3.1:provided
        [INFO] | +- org.apache.aries.blueprint:org.apache.aries.blueprint.api:jar:0.3.1:provided
        [INFO] | +- org.apache.aries:org.apache.aries.util:jar:0.3:provided
        [INFO] | +- org.slf4j:slf4j-api:jar:1.6.2:provided (version managed from 1.5.11)
        [INFO] | +- org.apache.aries.testsupport:org.apache.aries.testsupport.unit:jar:0.3:provided
        [INFO] | - org.apache.aries.proxy:org.apache.aries.proxy.api:jar:0.3:provided
        [INFO] +- org.osgi:org.osgi.core:jar:4.2.0:provided
        [INFO] +- junit:junit:jar:4.8.2:test
        *************
        [INFO] +- org.apache.cxf:cxf-api:jar:2.6.0-SNAPSHOT:compile
        [INFO] | +- org.codehaus.woodstox:woodstox-core-asl:jar:4.1.2:runtime
        [INFO] | | - org.codehaus.woodstox:stax2-api:jar:3.1.1:runtime
        [INFO] | +- org.apache.ws.xmlschema:xmlschema-core:jar:2.0.1:compile
        [INFO] | +- org.apache.geronimo.specs:geronimo-javamail_1.4_spec:jar:1.7.1:compile
        [INFO] | - wsdl4j:wsdl4j:jar:1.6.2:compile
        [INFO] +- org.apache.cxf:cxf-rt-core:jar:2.6.0-SNAPSHOT:compile
        [INFO] | - com.sun.xml.bind:jaxb-impl:jar:2.1.13:compile
        **************
        [INFO] +- org.springframework:spring-core:jar:3.0.7.RELEASE:provided
        [INFO] | +- org.springframework:spring-asm:jar:3.0.7.RELEASE:provided
        [INFO] | - commons-logging:commons-logging:jar:1.1.1:provided
        [INFO] +- org.springframework:spring-context:jar:3.0.7.RELEASE:provided
        [INFO] | +- org.springframework:spring-aop:jar:3.0.7.RELEASE:provided
        [INFO] | | - aopalliance:aopalliance:jar:1.0:provided
        [INFO] | +- org.springframework:spring-beans:jar:3.0.7.RELEASE:provided
        [INFO] | - org.springframework:spring-expression:jar:3.0.7.RELEASE:provided
        *****************
        [INFO] +- javax.ws.rs:jsr311-api:jar:1.1.1:compile
        [INFO] +- org.apache.cxf:cxf-rt-bindings-xml:jar:2.6.0-SNAPSHOT:compile
        [INFO] +- org.apache.cxf:cxf-rt-transports-http:jar:2.6.0-SNAPSHOT:compile
        ******************
        [INFO] +- org.apache.geronimo.specs:geronimo-servlet_3.0_spec:jar:1.0:provided
        [INFO] +- org.apache.cxf:cxf-rt-transports-local:jar:2.6.0-SNAPSHOT:test
        [INFO] +- org.apache.cxf:cxf-rt-databinding-jaxb:jar:2.6.0-SNAPSHOT:test
        [INFO] | - com.sun.xml.bind:jaxb-xjc:jar:2.1.13:test
        [INFO] +- org.apache.cxf:cxf-rt-transports-http-jetty:jar:2.6.0-SNAPSHOT:test
        [INFO] | +- org.eclipse.jetty:jetty-server:jar:7.5.4.v20111024:test
        [INFO] | | +- org.eclipse.jetty:jetty-continuation:jar:7.5.4.v20111024:test
        [INFO] | | - org.eclipse.jetty:jetty-http:jar:7.5.4.v20111024:test
        [INFO] | | - org.eclipse.jetty:jetty-io:jar:7.5.4.v20111024:test
        [INFO] | | - org.eclipse.jetty:jetty-util:jar:7.5.4.v20111024:test
        [INFO] | - org.eclipse.jetty:jetty-security:jar:7.5.4.v20111024:test
        [INFO] +- org.slf4j:slf4j-jdk14:jar:1.6.2:test
        [INFO] +- org.easymock:easymock:jar:3.1:test
        [INFO] | +- cglib:cglib-nodep:jar:2.2.2:test
        [INFO] | - org.objenesis:objenesis:jar:1.2:test
        [INFO] - org.apache.cxf:cxf-rt-databinding-aegis:jar:2.6.0-SNAPSHOT:test

        Note the strong dependencies surrounded by '************'; this is all we have.
        FYI, the wsdl4j dependency can most likely be excluded; a few people have confirmed it.

        I agree it is more monolithic in 2.5.x. We are doing a major split starting from 2.6.

        Sergey Beryozkin added a comment -

        > I do not completely understand your discussion about 415, but the test failed because TikaExceptionMapper was not added to providers list.

        What do you not understand?
        FYI, I do not understand how having TikaExceptionMapper registered can result in a 415 being returned. I'm looking at it and seeing no traces of 415; can you clarify, please?

        > (by the way, maybe CXF supports classpath scanning like Jersey?)

        No, it does not yet. That was a deliberate decision: IMHO random class scanning is impractical in many cases and causes more trouble than it's worth, and it is of little use when custom providers have to be configured in a per-endpoint way, as is the case with most interesting applications. However, I do accept that for simple mappers it can make sense, though I'm not sure which is simpler: restricting the packages to scan, or just registering the required providers. I prefer the latter, but please see:

        https://issues.apache.org/jira/browse/CXF-4199

        In CXF 2.6.0 the FIQL search extensions were moved to a new module, so it is time to optionally support it.

        Chris A. Mattmann added a comment -

        Hey Max:

        > The cool part is that we reduce a bunch of the Maven dependencies with CXF and we are eating our own dog food.

        > CXF implementation looks much heavier than Jersey, look at "mvn dependency:tree"

        I guess here I was talking more about only having to rely on one Maven dependency tag in the tika-server pom.xml, for cxf-rt-frontend-jaxrs, rather than Jersey server + core and the other dependencies we used to have. If you look at the pom.xml, the deps are now reduced. That's what I was thinking (maybe a side effect?)

        Maxim Valyanskiy added a comment - - edited

        > FYI I do not understand how having TikaExceptionMapper registered can result in 415 being returned, I'm looking at it and seeing no traces of 415, can you clarify please ?

        I'll try to explain. Tika server's resources can handle any input mime-type. When we do not specify a mime-type in our PUT request (or specify something generic like 'application/octet-stream'), Tika uses its own mime-type detector to detect the type and choose a parser.

        When we do specify a mime-type, it skips the detection stage and chooses the parser that handles the specified document type.

        When we can't handle the specified mime-type, when we can't detect it, or when we do not have a parser for that type, we throw WebApplicationException(Response.Status.UNSUPPORTED_MEDIA_TYPE), i.e. a 415 code.

        The Tika parser framework wraps that exception in a TikaException.

        TikaExceptionMapper unwraps it:

            if (e.getCause() != null && e.getCause() instanceof WebApplicationException) {
              return ((WebApplicationException) e.getCause()).getResponse();
            }
        

        That exception mapper was lost in the transition from Jersey to CXF, so we got a 500 error instead of a 415.
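
        The chain described here (a 415 raised during parsing, wrapped by the parser framework, then unwrapped by the mapper) can be sketched in plain Java. This is only an illustration under stated assumptions: StatusException and WrappedParseException are hypothetical stand-ins for WebApplicationException and TikaException, and parse() and handle() are invented helpers that imitate a runtime with or without the mapper registered. None of it is the actual tika-server code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class ExceptionMapperSketch {

    /** Hypothetical stand-in for WebApplicationException: carries an HTTP status. */
    static final class StatusException extends RuntimeException {
        final int status;

        StatusException(int status) {
            super("HTTP " + status);
            this.status = status;
        }
    }

    /** Hypothetical stand-in for TikaException: wraps whatever parsing threw. */
    static final class WrappedParseException extends Exception {
        WrappedParseException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Imitates the parser framework: any runtime failure is wrapped. */
    static void parse(String mediaType, Set<String> supported) throws WrappedParseException {
        try {
            if (!supported.contains(mediaType)) {
                throw new StatusException(415); // unsupported media type
            }
        } catch (RuntimeException e) {
            throw new WrappedParseException("parse failed", e);
        }
    }

    /** The unwrapping step: reuse the inner status if there is one, else 500. */
    static int toStatus(WrappedParseException e) {
        if (e.getCause() instanceof StatusException) {
            return ((StatusException) e.getCause()).status;
        }
        return 500;
    }

    /** Imitates request handling with or without the mapper registered. */
    static int handle(String mediaType, Set<String> supported, boolean mapperRegistered) {
        try {
            parse(mediaType, supported);
            return 200;
        } catch (WrappedParseException e) {
            // Without a registered mapper the wrapped exception surfaces as a generic 500.
            return mapperRegistered ? toStatus(e) : 500;
        }
    }

    public static void main(String[] args) {
        Set<String> supported = new HashSet<>(Arrays.asList("application/pdf"));
        System.out.println(handle("xxx/xxx", supported, true));         // prints 415
        System.out.println(handle("xxx/xxx", supported, false));        // prints 500
        System.out.println(handle("application/pdf", supported, true)); // prints 200
    }
}
```

        The point of the sketch is that the 415 never disappears; it is only hidden inside the wrapper, so whether the client sees 415 or 500 depends entirely on whether something unwraps the cause.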

        PS: maybe we can speak Russian on jabber?

        Chris A. Mattmann added a comment -

        > but the test failed because TikaExceptionMapper was not added to providers list

        Max, you're totally right! In r1306883, I committed some cleanup, removing the FIXME and uncommenting @Test, and all tests pass. I'm going to mark this issue as resolved now!

        We can track further progress and updates in other issues. Thanks for the help here!

        Maxim Valyanskiy added a comment -

        We have another problem with the Tika server. We combine all dependency jars into one big jar that can be run via 'java -jar tika-server.jar'. It includes Tika with all parsers, the web server, etc.

        When I try to run it, I get the following exception:

        SEVERE: Can't start
        org.apache.cxf.service.factory.ServiceConstructionException
        	at org.apache.cxf.jaxrs.JAXRSServerFactoryBean.create(JAXRSServerFactoryBean.java:190)
        	at org.apache.tika.server.TikaServerCli.main(TikaServerCli.java:92)
        Caused by: org.apache.cxf.BusException: No DestinationFactory was found for the namespace http://cxf.apache.org/transports/http.
        	at org.apache.cxf.transport.DestinationFactoryManagerImpl.getDestinationFactory(DestinationFactoryManagerImpl.java:126)
        	at org.apache.cxf.endpoint.ServerImpl.initDestination(ServerImpl.java:88)
        	at org.apache.cxf.endpoint.ServerImpl.<init>(ServerImpl.java:72)
        	at org.apache.cxf.jaxrs.JAXRSServerFactoryBean.create(JAXRSServerFactoryBean.java:151)
        	... 1 more
        
        

        I think that something is wrong in the bundle-plugin configuration.

        Chris A. Mattmann added a comment -
        • for the intent of this issue, I think that the functionality is complete. We can open up new issues to track further improvements and bugs. Thanks Max, Sergey, Ingo, and others who have contributed, and to Jukka for the original idea and spec!
        Chris A. Mattmann added a comment -
        • crap, just saw Max's comment. I'm going to let this sit for a while and make sure Max and I can fully run this, before closing the issue. We're close though!
        Sergey Beryozkin added a comment - - edited

        Max,

        > ... That exception mapper was lost after transition from Jersey to CXF, so we had 500-error instead of 415.

        Right. The good thing is that we know the cause, and as I indicated, we will get to optional class scanning support in CXF.

        The funny side of it is that Chris and I spent a lot of time wondering how Jersey magically turns away "xxx/xxx" with a 415. We initially thought Jersey blocked it even before dispatching, but as it happens it was also passing it through.

        Update: Max, sure, we can talk on Jabber; please share your id with me on #cxf or post it here.

        Chris A. Mattmann added a comment -

        > The funny side to it is that we spent a lot of time with Chris thinking how Jersey magically turns away "xxx/xxx" with 415, we thought initially Jersey blocked it even before dispatching, but as it happens it was also passing it through

        I know! I went down the wrong rabbit hole. Thanks for the help, both of you, heh...

        Sergey Beryozkin added a comment -

        > I think that something is wrong in bundle-plugin configuration

        The packaged jar contains duplicate entries for different packages in /org/apache/cxf/, and probably for others. Maybe you should use the Maven Shade plugin; here is an example from CXF:

        http://svn.apache.org/repos/asf/cxf/trunk/osgi/bundle/all/pom.xml
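
        One way to check a packaged jar for duplicate entries like these is to read it sequentially and report every entry name that occurs more than once. The sketch below uses only java.util.zip; the class and method names (JarDuplicateCheck, entryNames, duplicates) are invented for illustration, and the jar path is taken from the first command-line argument.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class JarDuplicateCheck {

    /** Reads entry names in archive order, so repeated names are preserved. */
    static List<String> entryNames(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            for (ZipEntry entry; (entry = zip.getNextEntry()) != null; ) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    /** Returns the names that occur more than once, in first-repeat order. */
    static Set<String> duplicates(Iterable<String> names) {
        Set<String> seen = new HashSet<>();
        Set<String> repeated = new LinkedHashSet<>();
        for (String name : names) {
            if (!seen.add(name)) {
                repeated.add(name);
            }
        }
        return repeated;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java JarDuplicateCheck.java <path-to-jar>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            for (String name : duplicates(entryNames(in))) {
                System.out.println("duplicate entry: " + name);
            }
        }
    }
}
```

        An empty result only rules out repeated entry names; it says nothing about resources that were overwritten rather than merged while the jar was being assembled.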

        Maxim Valyanskiy added a comment -

        Added the Shade plugin; the jar now runs.

        Chris A. Mattmann added a comment -

        I updated the documentation here: http://wiki.apache.org/tika/TikaJAXRS

        Thanks Max!

        Sergey Beryozkin added a comment -

        Max, Chris, thanks !

        Markus Jelsma added a comment -

        Great work!

        Chris A. Mattmann added a comment -

        Thanks, guys!

        Maxim Valyanskiy added a comment -

        I updated the documentation in the wiki.


          People

          • Assignee:
            Chris A. Mattmann
            Reporter:
            Jukka Zitting
          • Votes:
            2
            Watchers:
            10