Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 0.2
    • Fix Version/s: None
    • Component/s: general
    • Labels: None

      Description

      Tika servlet that uses a file or directory path to build a list of XML documents. The next version will allow file upload.
      Usage:
      //Extract document content and metadata
      http://localhost:8080/tikaServlet/?filePath=C:\test&start=0&rows=10
      //Extract metadata
      http://localhost:8080/tikaServlet/?filePath=C:\test&start=0&rows=10&extract=metadata
      //Extract document content
      http://localhost:8080/tikaServlet/?filePath=C:\test&start=0&rows=10&extract=content
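
      For illustration, a handler along these lines could look roughly as follows. This is a minimal sketch, not the code in the attached tikaServlet.war: the output element names are invented, directory listing and the start/rows paging are omitted, and the Tika calls follow the 0.2-era Parser API.

        import java.io.FileInputStream;
        import java.io.IOException;
        import java.io.InputStream;
        import java.io.PrintWriter;

        import javax.servlet.ServletException;
        import javax.servlet.http.HttpServlet;
        import javax.servlet.http.HttpServletRequest;
        import javax.servlet.http.HttpServletResponse;

        import org.apache.tika.metadata.Metadata;
        import org.apache.tika.parser.AutoDetectParser;
        import org.apache.tika.sax.BodyContentHandler;

        public class TikaServlet extends HttpServlet {

            @Override
            protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                    throws ServletException, IOException {
                String filePath = req.getParameter("filePath"); // file to process
                String extract = req.getParameter("extract");   // "metadata", "content", or absent for both

                // Parse the document with Tika's auto-detecting parser.
                Metadata metadata = new Metadata();
                BodyContentHandler handler = new BodyContentHandler();
                InputStream in = new FileInputStream(filePath);
                try {
                    new AutoDetectParser().parse(in, handler, metadata);
                } catch (Exception e) {
                    throw new ServletException(e);
                } finally {
                    in.close();
                }

                // Write a simple XML response (element names are illustrative only;
                // values are not XML-escaped in this sketch).
                resp.setContentType("text/xml;charset=UTF-8");
                PrintWriter out = resp.getWriter();
                out.println("<document>");
                if (!"content".equals(extract)) {                // metadata unless content-only
                    for (String name : metadata.names()) {
                        out.println("  <meta name=\"" + name + "\">" + metadata.get(name) + "</meta>");
                    }
                }
                if (!"metadata".equals(extract)) {               // text unless metadata-only
                    out.println("  <content>" + handler.toString() + "</content>");
                }
                out.println("</document>");
            }
        }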

      Attachments

      1. tikaServlet.war (1017 kB) - Rida Benjelloun

        Activity

        Rida Benjelloun added a comment -

        Please add the Tika libs to the WEB-INF/lib directory.

        Jukka Zitting added a comment -

        Nice, though I wonder what the use case is. Do we need this?

        Rida Benjelloun added a comment -

        Using Tika as a servlet will allow non-Java programmers to use Tika. I did this implementation for a team of .NET programmers who want to extract content and metadata from documents without using Java.
        The only thing they have to know is an API to process XML in their specific programming language.

        Grant Ingersoll added a comment -

        I haven't looked at the patch, but I would suggest that it be viewed as a contrib dependency. I think one of the things that will really help adoption is if it is easy to remove any dependencies that aren't needed for a given application. That is, we probably have a core Tika, and then Tika contribs.

        Also, note that SOLR-284 is related.

        Jukka Zitting added a comment -

        Another alternative for cross-platform use is the CLI feature:

        1. Extracting structured text content from a file
          java -jar tika-0.2-standalone.jar --xml /path/to/file
        2. Extracting plain text content from a file
          java -jar tika-0.2-standalone.jar --text /path/to/file
        3. Extracting metadata from a file
          java -jar tika-0.2-standalone.jar --metadata /path/to/file

        This way you don't need a separate server process and there won't be any concerns about unauthorized users getting access to your files.

        I'm a bit concerned about any web service that allows the client to retrieve the contents of any file on the local file system. Would it make more sense to always require the client to upload the files they want parsed?

        Also, the file system traversal feature seems a bit outside the scope of Tika, though having something like this in a contrib area might be nice.
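
        For what it's worth, an upload-only variant of the servlet would not need any file system access at all: it could parse the request body directly, roughly as in the sketch below (same imports and 0.2-era Parser API as the sketch in the description; only the handler method is shown).

          // Sketch: parse a document POSTed in the request body; no server-side paths are read.
          @Override
          protected void doPost(HttpServletRequest req, HttpServletResponse resp)
                  throws ServletException, IOException {
              Metadata metadata = new Metadata();
              BodyContentHandler handler = new BodyContentHandler();
              InputStream in = req.getInputStream();               // the uploaded bytes
              try {
                  new AutoDetectParser().parse(in, handler, metadata);
              } catch (Exception e) {
                  throw new ServletException(e);
              } finally {
                  in.close();
              }
              resp.setContentType("text/plain;charset=UTF-8");
              resp.getWriter().write(handler.toString());          // extracted plain text
          }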

        Grant Ingersoll added a comment -

        Also, the file system traversal feature seems a bit outside the scope of Tika, though having something like this in a contrib area might be nice.

        I believe Droids (crawling) has integrated Tika already as well. But, yeah, as optional contribs, those make sense. We will have lots of dependencies on extraction libraries as it is, so I really think it makes sense to stay as lean as possible elsewhere. Before you know it, Tika will be a 50-100 MB download, and that will slow adoption...

        Ingo Renner added a comment -

        I see a servlet making quite some sense - think of Solr, but with only the extraction request handler... That way you could have a central metadata / text extraction server without needing to install Java + Tika on all the hosts where you might need it, e.g. in a replicated CMS environment.

        So the scenario would be that a CMS tries to extract text and metadata from a file, but does not have a local Tika at hand. It would then send the file to a Tika server and get the results back in XML or JSON, like Solr does.
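
        A sketch of the CMS-side call in that scenario might look like the following; the host name and endpoint are hypothetical, and the response is simply returned as a string in whatever format (XML, JSON) the server produces.

          import java.io.BufferedReader;
          import java.io.FileInputStream;
          import java.io.InputStream;
          import java.io.InputStreamReader;
          import java.io.OutputStream;
          import java.net.HttpURLConnection;
          import java.net.URL;

          public class RemoteExtractClient {

              // POST a local file to a remote Tika server and return the raw response body.
              public static String extract(String file) throws Exception {
                  URL url = new URL("http://tika.example.com/extract");   // hypothetical endpoint
                  HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                  conn.setRequestMethod("POST");
                  conn.setDoOutput(true);
                  conn.setRequestProperty("Content-Type", "application/octet-stream");

                  // Stream the file to the server.
                  byte[] buffer = new byte[8192];
                  InputStream in = new FileInputStream(file);
                  OutputStream out = conn.getOutputStream();
                  try {
                      for (int n; (n = in.read(buffer)) != -1; ) {
                          out.write(buffer, 0, n);
                      }
                  } finally {
                      out.close();
                      in.close();
                  }

                  // Read back the extracted result (XML or JSON, depending on the server).
                  StringBuilder result = new StringBuilder();
                  BufferedReader reader = new BufferedReader(
                          new InputStreamReader(conn.getInputStream(), "UTF-8"));
                  try {
                      String line;
                      while ((line = reader.readLine()) != null) {
                          result.append(line).append('\n');
                      }
                  } finally {
                      reader.close();
                  }
                  return result.toString();
              }
          }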

        Jukka Zitting added a comment -

        Resolving as Won't Fix, since the extractOnly option of Solr's ExtractingRequestHandler already provides similar functionality.
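
        For reference, that Solr request looks roughly like this, assuming ExtractingRequestHandler is mapped at the default /update/extract path and remote streaming is enabled so that stream.file may point at a local file:

        //Extract content and metadata without indexing
        http://localhost:8983/solr/update/extract?extractOnly=true&stream.file=/path/to/file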


          People

          • Assignee: Jukka Zitting
          • Reporter: Rida Benjelloun
          • Votes: 1
          • Watchers: 0
