Droids / DROIDS-8

[Patch] Create tied integration with Apache Tika (for parser and handler)

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.1.0
    • Component/s: tika
    • Labels:
      None

      Description

      http://incubator.apache.org/tika/
      Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.

      1. LinkExtractor.java
        4 kB
        Javier Puerto
      2. DROIDS-8-droids-tika.patch
        20 kB
        Ryan McKinley
      3. tikaparser.diff
        18 kB
        Javier Puerto
      4. tikaparser.diff
        4 kB
        Javier Puerto

          Activity

          Richard Frovarp added a comment -

          No longer an issue.

          Thorsten Scherler added a comment -

          AFAIR the patches are applied to trunk, right?

          Can we close this issue?

          Javier Puerto added a comment -

          Thanks for the code formatting; I think I had Eclipse configured wrongly.

          >Could the LinkExtractor live in core? It does not appear to depend on tika. If so, how does it relate to:
          >org.apache.droids.parse.html.HtmlParser?
          +1
          It does not depend on Tika and could be used in other tasks with SAX.
          Please move it to the package you suggest.

          Ryan McKinley added a comment -

          Thanks Javier – I just committed your updated version.

          Could the LinkExtractor live in core? It does not appear to depend on tika. If so, how does it relate to:
          org.apache.droids.parse.html.HtmlParser?

          Javier Puerto added a comment -

          Rewrote the LinkExtractor class to be clearer and applied some improvements.

          Ryan McKinley added a comment -

          Here is a patch that adds the tika parser to a new module: droids-tika

          This does not include the changes to:
          core/java/regex-urlfilter.txt
          core/java/org/apache/droids/helper/factories/ParserFactory.java
          dynamics/java/org/apache/droids/droids-core-context.xml

          Javier Puerto added a comment -

          >One question:
          >- Is there a reason why you used the EchoHandler and not the XHTMLContentHandler?
          No, it could be implemented with the XHTMLContentHandler. I will try it later.

          >It seems that you have unrelated changes in the patch:
          >- core/java/regex-urlfilter.txt (the complete file)
          >- in dynamics/java/org/apache/droids/droids-core-context.xml
          >...
          >- <property name="locations" value="classpath:org/apache/droids/droids-core.properties"/>
          >+ <property name="locations" value="classpath:org/apache/droids/droids-test.properties"/>
          Oops, those are testing-only changes: regex-urlfilter.txt was changed to fit the web, and the Spring context points at the test properties file.

          There are a few TODOs in the patch; I want to post a more complete version soon.

          Jukka Zitting added a comment -

          > I like the idea that LinkExtractor is a handler very much.

          Me too. I think it would be a good idea to add an abstracted version of the class directly in Tika, as I believe it would be useful also outside Droids.

          Thorsten Scherler added a comment -

          I had a look at your patch.
          Thanks for your contribution.

          I like the idea that LinkExtractor is a handler very much.

          One question:

          • Is there a reason why you used the EchoHandler and not the XHTMLContentHandler?

          It seems that you have unrelated changes in the patch:

          • core/java/regex-urlfilter.txt (the complete file)
          • in dynamics/java/org/apache/droids/droids-core-context.xml
            ...
            - <property name="locations" value="classpath:org/apache/droids/droids-core.properties"/>
            + <property name="locations" value="classpath:org/apache/droids/droids-test.properties"/>
            ...
            the block around org.apache.droids.handle.Save

          Javier Puerto added a comment -

          I sent the file and then went away for the weekend. Silly mistake.

          Thorsten Scherler added a comment -

          Hi Javier,

          I had a quick look at the diff, but it seems some files are missing.

          I can see org.apache.droids.parse.html.TikaParser in the Spring context, but in the diff you provided the class is not marked as added.

          Can you please run
          svn st
          and see whether the class is marked with "?" (unversioned)?

          If so, you need to run
          svn add

          and create the patch again.

          Javier Puerto added a comment -

          A first Tika parser implementation. It works with the default crawler and worker.

          The parser only wraps the Tika parser with the LinkExtractor to get the OutLinks and with the EchoHandler to save the parsed data, then returns a ParseImpl.
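
As an illustration of the approach described above, here is a minimal sketch of a link-collecting SAX handler. It is not the LinkExtractor.java attached to this issue: the class name `LinkCollector`, the `extract` helper, and the use of the JDK's built-in SAX parser in place of the Tika parser are all assumptions made so the example runs without extra dependencies.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for a link extractor: collects href values of <a> elements. */
class LinkCollector extends DefaultHandler {
    private final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Without namespace awareness localName may be empty, so fall back to qName.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if ("a".equalsIgnoreCase(name)) {
            String href = atts.getValue("href");
            if (href != null && !href.isEmpty()) {
                links.add(href);
            }
        }
    }

    List<String> getLinks() {
        return links;
    }

    /** Parses a well-formed XHTML string (assumption: the JDK parser replaces Tika here). */
    static List<String> extract(String xhtml) {
        LinkCollector collector = new LinkCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), collector);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input", e);
        }
        return collector.getLinks();
    }
}
```

Because the collector only implements SAX callbacks, it has no compile-time dependency on the parser that feeds it, which is what would let the same handler sit behind Tika or any other SAX source.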

          Javier Puerto added a comment -

          I was thinking along the same lines. We must implement a controller that iterates over a list of handlers through a common interface. But I am unsure whether to use a ByteArrayInputStream or a Writer, because the Tika output is text (what about the encoding?).

          The TeeContentHandler is great, but the handlers run in parallel, not in a chain; it could be useful in the last stage of the process, when the data needs no more transformations.

          The stages could be:
          1 Parse [and LinkExtraction?]
          2 Handler
          3 Action
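
To illustrate the distinction between parallel and chained handlers, here is a minimal sketch of a fan-out handler that forwards each SAX event to several independent handlers, which is roughly what a tee does. `FanOutHandler`, `ElementCounter`, and `TextCollector` are hypothetical names, not Tika or Droids classes, and the JDK SAX parser stands in for Tika.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical fan-out handler: every event goes to every target, unchanged. */
class FanOutHandler extends DefaultHandler {
    private final ContentHandler[] targets;

    FanOutHandler(ContentHandler... targets) {
        this.targets = targets;
    }

    // Only the three events used in this sketch are forwarded;
    // a complete tee would forward every ContentHandler callback.
    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        for (ContentHandler target : targets) {
            target.startElement(uri, localName, qName, atts);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        for (ContentHandler target : targets) {
            target.endElement(uri, localName, qName);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        for (ContentHandler target : targets) {
            target.characters(ch, start, length);
        }
    }

    /** Parses the XML once and feeds all targets in the same pass. */
    static void parse(String xml, ContentHandler... targets) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new FanOutHandler(targets));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input", e);
        }
    }
}

/** Counts start tags. */
class ElementCounter extends DefaultHandler {
    private int count;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        count++;
    }

    int getCount() {
        return count;
    }
}

/** Concatenates character data. */
class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public String toString() {
        return text.toString();
    }
}
```

Each target sees the same untransformed events, so no target can alter what the others receive; a chain would instead wrap one handler around the next so that each stage works on the previous stage's output.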

          Thorsten Scherler added a comment -

          Here are some more links related to the issue:
          http://java.sun.com/j2se/1.4.2/docs/api/org/xml/sax/ContentHandler.html
          http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/ContentHandler.html

          To get a string representation of the handler, Tika does:

          http://svn.apache.org/repos/asf/incubator/tika/trunk/src/main/java/org/apache/tika/utils/ParseUtils.java
          ...
          return handler.toString();

          Generally there are a lot of specialized handlers that can be used:
          http://svn.apache.org/repos/asf/incubator/tika/trunk/src/main/java/org/apache/tika/sax/

          The Tika documentation gives this example, which writes to System.out:
          ContentHandler handler = new BodyContentHandler(System.out);

          The best fit is to use the XHTMLContentHandler with a BufferedOutputStream, then convert the output to an InputStream and work with this (a valid XML document) in the next stages (linkExtraction/handler).

          One can also create a linkExtractorHandler that returns the Outlinks from the doc. This, however, would happen in the parser stage, meaning there is no separate LinkExtractor.

          Looking at http://svn.apache.org/repos/asf/incubator/tika/trunk/src/main/java/org/apache/tika/sax/TeeContentHandler.java we may actually prefer this. We would use both handlers and pass the XHTML as a stream to the handler.
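
The serialize-then-reread idea above can be sketched with the JDK alone. This is only an illustration under stated assumptions: a plain JAXP `TransformerHandler` is used as the serializing handler instead of Tika's XHTMLContentHandler, a ByteArrayOutputStream replaces the BufferedOutputStream, and `SaxProducer`, `SaxToStream`, and `emitSamplePage` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

/** Hypothetical callback for anything that can emit SAX events (a parser, for example). */
@FunctionalInterface
interface SaxProducer {
    void emit(ContentHandler handler) throws SAXException;
}

class SaxToStream {
    /** Serializes the produced SAX events to bytes and exposes them as a stream. */
    static ByteArrayInputStream toInputStream(SaxProducer producer) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler serializer = factory.newTransformerHandler();
            Transformer settings = serializer.getTransformer();
            // Fix the output format explicitly so the bytes do not depend on defaults.
            settings.setOutputProperty(OutputKeys.METHOD, "xml");
            settings.setOutputProperty(OutputKeys.INDENT, "no");
            settings.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            settings.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            serializer.setResult(new StreamResult(buffer));
            producer.emit(serializer);
        } catch (TransformerConfigurationException | SAXException e) {
            throw new IllegalStateException("Could not serialize SAX events", e);
        }
        return new ByteArrayInputStream(buffer.toByteArray());
    }

    /** Emits a tiny page by hand; a real parser would take this role. */
    static void emitSamplePage(ContentHandler handler) throws SAXException {
        AttributesImpl none = new AttributesImpl();
        AttributesImpl href = new AttributesImpl();
        href.addAttribute("", "href", "href", "CDATA", "/x");
        handler.startDocument();
        handler.startElement("", "html", "html", none);
        handler.startElement("", "body", "body", none);
        handler.startElement("", "a", "a", href);
        handler.characters("x".toCharArray(), 0, 1);
        handler.endElement("", "a", "a");
        handler.endElement("", "body", "body");
        handler.endElement("", "html", "html");
        handler.endDocument();
    }
}
```

Calling `SaxToStream.toInputStream(SaxToStream::emitSamplePage)` yields a stream that later stages can parse again. The trade-off is that the whole document is buffered in memory and parsed twice, which is one reason a tee over the live SAX events may be preferable.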

          Thorsten Scherler added a comment -

          I do not want to create a dependency on Tika in the API. If your solution does not produce this dependency, it is good as gold.

          As I see it the integration should be:

          ->Parser produces SAX and metadata
          ->LinkExtractor [new component -> see LABS-149] uses SAX events to extract links/tasks
          -> Handler (can use SAX, metadata, original stream)

          Does that make sense?

          Rafael Pintor added a comment -

          Hello Thor

          I think it would be good to develop a class that implements your Parse interface using Tika. Then we would have the option of using either the Tika parser or the simpler parser that exists now. What do you think?

          Bye

          Thorsten Scherler added a comment -

          http://svn.apache.org/repos/asf/incubator/tika/trunk/src/site/apt/documentation.apt

          Some documentation about the concept of the parser API in Tika.

          Thanks Jukka!

          Thorsten Scherler added a comment -

          I will use Tika trunk until I can come up with something regarding the link extraction that I consider stable.

          After that, a release would come in very handy.

          Thanks Jukka, will keep you updated.

          Thorsten Scherler added a comment -

          Nutch uses tika mainly for mimetypes as I understand their code.

          Seems NUTCH-608 is the principal work ATM.

          Jukka Zitting added a comment -

          I can look at setting up nightly builds, and doing a 0.2 release with all the stuff you need would also be an option.

          Thorsten Scherler added a comment -

          I saw TIKA-128, thanks very much for that. ATM I planned to play around with 0.1 but may soon switch to trunk. Do you have nightly builds so that it integrates smoothly with the dependency management?

          Thanks Jukka for your work on tika.

          Jukka Zitting added a comment -

          Will you be using the Tika 0.1-incubating release, or the latest trunk? The release is already reasonably good, but if you have any issues (like TIKA-128) I would be eager to resolve them in the trunk.

          You may also want to check out Nutch where they've already started integrating Tika.


            People

            • Assignee: Unassigned
            • Reporter: Thorsten Scherler
            • Votes: 1
            • Watchers: 1
