  1. Tika
  2. TIKA-522

AutoDetectParser treats HTML/XML files as Audio

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 0.7
    • Fix Version/s: 0.8
    • Component/s: parser
    • Labels: None
    • Environment: Windows 7 x64, Java v6.0.170.4, jdk1.6.0_21, Eclipse 20100617-1415

    Description

      I am crawling an SMB share. I've used the steps outlined in Tika samples to initialize; given a File object in f, my code is:

      parser = new AutoDetectParser();
      context.set(Parser.class, parser);
      // Get the URL
      URL url = f.toURI().toURL();
      // Extract Metadata
      Metadata metadata = new Metadata();
      BodyContentHandler handler = new BodyContentHandler(-1); // -1 = infinite size for XML string buffer (per file)
      // Get the input stream
      InputStream input = MetadataHelper.getInputStream(url, metadata);
      // Parse the document
      parser.parse(input, handler, metadata, context);

      If I place a breakpoint right after the parser.parse call, the metadata identifies my input as an audio file. If I instead step through the parse in the debugger, the input is correctly tagged as text/html. This looks like a timing-related problem.

      I have a half-baked workaround: I call Thread.sleep(5000) just after the context.set call, and in 3 sequential test runs that worked fine. The problem is that this code was working several days ago without the sleep (perhaps my computer was busy with other things and the timing issue did not show up then).
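Three passing runs are weak evidence for an intermittent failure. A minimal sketch of a tally harness follows, assuming the detection step can be wrapped in a function from bytes to a type name. The class `DetectionTally`, the method `tally`, and the stand-in detector are hypothetical and are not part of Tika; in real use the function would wrap whatever detection call is under test.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

/** Hypothetical harness: runs a detector many times over the same bytes and counts each answer. */
class DetectionTally {

    /**
     * Calls the detector `runs` times, each time on a fresh copy of `data`,
     * and returns how often each distinct result was seen.
     */
    static Map<String, Integer> tally(Function<byte[], String> detector, byte[] data, int runs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i < runs; i++) {
            String type = detector.apply(data.clone());
            counts.merge(type, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        byte[] html = "<html><body>hello</body></html>".getBytes(StandardCharsets.UTF_8);

        // Stand-in detector for illustration only (an assumption, NOT Tika's logic).
        Function<byte[], String> standIn =
                bytes -> bytes.length > 0 && bytes[0] == '<' ? "text/html" : "application/octet-stream";

        // More than one key in the result would mean detection is not deterministic for this input.
        System.out.println(tally(standIn, html, 100));
    }
}
```

If the tally over in-memory bytes always shows a single type while parsing from the share does not, that would point at how the bytes are read rather than at the detection rules themselves.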

      I have downloaded and am building today's 0.8 from svn to see if that helps, though I am concerned about the impact on the rest of my testing if I have to switch to 0.8. Just understanding what is going on would be a huge help.

      UPDATE: I was able to reproduce this once under the debugger. MimeTypes.detect calls org.apache.tika.mime.MimeTypes.getMimeType on the input stream to determine the MIME type from the first 8k of data. I did not trace into getMimeType, but I did see it return "audio/mpeg" for an HTML file one time, and "text/html" most other times. I can supply the HTML file if desired.
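One way to check whether a failing run saw different leading bytes than a passing run is to capture the prefix and fingerprint it. The sketch below is a diagnostic aid only, under two assumptions: the 8192-byte limit simply mirrors the "first 8k" mentioned above, and the possibility that a single read from a network stream returns fewer bytes than requested is a hypothesis, not something this report confirms. The class `PrefixDump` and its methods are hypothetical names.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

/** Hypothetical diagnostic: reads a stream's leading bytes completely and fingerprints them. */
class PrefixDump {

    /**
     * Reads up to `limit` bytes, looping because a single read() call
     * is allowed to return fewer bytes than were asked for.
     */
    static byte[] readPrefix(InputStream in, int limit) throws IOException {
        byte[] buf = new byte[limit];
        int total = 0;
        while (total < limit) {
            int n = in.read(buf, total, limit - total);
            if (n < 0) {
                break; // end of stream before the limit was reached
            }
            total += n;
        }
        return Arrays.copyOf(buf, total);
    }

    /** Returns the SHA-256 digest of `data` as lowercase hex, for comparing runs. */
    static String sha256Hex(byte[] data) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest(data)) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path of the file under test.
        try (InputStream in = new FileInputStream(args[0])) {
            byte[] prefix = readPrefix(in, 8192);
            System.out.println(prefix.length + " bytes, sha256=" + sha256Hex(prefix));
        }
    }
}
```

Logging the length and digest on every run, next to the detected type, would show whether an "audio/mpeg" result coincides with a short or different prefix.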

          People

            Assignee: kkrugler Kenneth William Krugler
            Reporter: dennisad Dennis Adler
            Votes: 0
            Watchers: 1
