  Tika / TIKA-1308

Support an in-memory parse mode (no temp files) so that Tika can run on GAE


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.5
    • Fix Version/s: 1.17, 2.0.0-BETA, 2.1.0
    • Component/s: parser

    Description

      I am trying to use Tika on GAE and wrote a simple servlet to extract metadata from a JPEG:

      // Download the image named by the request parameter into memory.
      String urlStr = req.getParameter("imageUrl");
      byte[] oldImageData = IOUtils.toByteArray(new URL(urlStr));

      // Parse the in-memory bytes with Tika's auto-detecting parser.
      ByteArrayInputStream bais = new ByteArrayInputStream(oldImageData);
      Metadata metadata = new Metadata();
      BodyContentHandler ch = new BodyContentHandler();
      AutoDetectParser parser = new AutoDetectParser();
      parser.parse(bais, ch, metadata, new ParseContext());
      bais.close();
      

      This fails with the following exception:

      Caused by: java.lang.SecurityException: Unable to create temporary file
      	at java.io.File.createTempFile(File.java:1986)
      	at org.apache.tika.io.TemporaryResources.createTemporaryFile(TemporaryResources.java:66)
      	at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:533)
      	at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242
      

      I checked the code: org.apache.tika.parser.jpeg.JpegParser.parse(InputStream, ContentHandler, Metadata, ParseContext) creates a temp file from the input stream.

      I understand why Tika creates a temp file from the stream: so that it can parse the content multiple times.

      But as GAE and other cloud platforms become more popular, would it be possible to avoid creating the temp file? Instead, Tika could copy the original stream into a byte array stream, which would still let it parse the content multiple times.
      This limits the file size, since Tika would keep the whole file in memory, but it would make Tika work on GAE and possibly on other cloud platforms.
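
To make the idea concrete, here is a minimal sketch of buffering a stream in memory so it can be read more than once, using only the JDK. The class `InMemoryReplay`, its methods, and the `maxBytes` cap are hypothetical names made up for illustration; none of this is Tika code, and a real change would have to be wired into the parsers themselves.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper, not part of Tika: buffers a stream in memory so it can be read repeatedly. */
final class InMemoryReplay {
    private final byte[] data;

    private InMemoryReplay(byte[] data) {
        this.data = data;
    }

    /** Reads the whole stream into memory and fails once it would exceed maxBytes. */
    static InMemoryReplay buffer(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            if (out.size() + n > maxBytes) {
                throw new IOException("Input exceeds in-memory limit of " + maxBytes + " bytes");
            }
            out.write(chunk, 0, n);
        }
        return new InMemoryReplay(out.toByteArray());
    }

    /** Returns a fresh stream positioned at the start; each call is an independent pass. */
    InputStream open() {
        return new ByteArrayInputStream(data);
    }

    /** Number of buffered bytes. */
    int size() {
        return data.length;
    }

    public static void main(String[] args) throws IOException {
        InputStream source = new ByteArrayInputStream(new byte[] {(byte) 0xFF, (byte) 0xD8, 0x01});
        InMemoryReplay replay = buffer(source, 1024);
        // Two independent passes over the same bytes, with no file system access.
        System.out.println(replay.open().read());
        System.out.println(replay.open().read());
    }
}
```

The explicit cap is one way to handle the file size limit mentioned above: the caller decides how much memory a single document may use, and larger inputs fail early instead of exhausting the heap.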

      We could add a parameter to parser.parse that indicates whether to parse in memory only.
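
As a rough illustration of what such a switch could look like, the sketch below lets the caller pick the buffering strategy with a flag. `SpoolingSketch`, `BufferMode`, `Reopenable`, and `spool` are all hypothetical names, and this is JDK-only code rather than Tika's API; it only shows the shape of the choice between a temp file and an in-memory copy.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical sketch, not Tika's API: a caller-supplied flag picks the buffering strategy. */
final class SpoolingSketch {

    /** Hypothetical option a caller could pass alongside the parse request. */
    enum BufferMode { TEMP_FILE, IN_MEMORY }

    /** A source that can be opened more than once. */
    interface Reopenable {
        InputStream open() throws IOException;
    }

    static Reopenable spool(InputStream in, BufferMode mode) throws IOException {
        if (mode == BufferMode.IN_MEMORY) {
            // Copy everything into a byte array; no file system access is needed.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            byte[] data = out.toByteArray();
            return () -> new ByteArrayInputStream(data);
        }
        // Temp-file path: needs a writable file system, which the GAE sandbox in this report denies.
        Path tmp = Files.createTempFile("spool-", ".tmp");
        tmp.toFile().deleteOnExit();
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return () -> Files.newInputStream(tmp);
    }
}
```

A caller on a restricted platform would request `BufferMode.IN_MEMORY`, while everyone else could keep the temp-file behaviour as the default.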

          People

            Assignee: Unassigned
            Reporter: jefferyyuan (yuanyun.cn)
            Votes: 0
            Watchers: 6
