TIKA-2524

Create/integrate a parser for XPS

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.16
    • Fix Version/s: 2.0, 1.18
    • Component/s: parser

      Description

      When we parse XPS files with AutoDetectParser, we always get an empty string back.
      If we call DefaultDetector.detect(), it correctly reports the MediaType as "application/vnd.ms-xpsdocument".

      However, this page
      https://tika.apache.org/1.16/formats.html
      suggests that XPS (application/vnd.ms-xpsdocument) is supported.

      Our code:
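      import java.io.InputStream;
      import org.apache.tika.io.TikaInputStream;
      import org.apache.tika.metadata.Metadata;
      import org.apache.tika.parser.AutoDetectParser;
      import org.apache.tika.sax.BodyContentHandler;
      // Inside the test method; both streams are closed once parsing finishes.
      Metadata metadata = new Metadata();
      BodyContentHandler handler = new BodyContentHandler();
      AutoDetectParser parser = new AutoDetectParser();
      try (InputStream bis = this.getClass().getResourceAsStream("/" + EXPECTED_LOCATION + "doc_xps.xps");
           TikaInputStream tikaStream = TikaInputStream.get(bis)) {
          parser.parse(tikaStream, handler, metadata);
      }
      String parsedText = handler.toString(); // comes back empty for XPS input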

      I will attach doc_xps.xps if I can
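
      For anyone who needs text out of an XPS file in the meantime, here is a rough sketch that sidesteps Tika entirely. It relies only on the XPS package layout: the file is a ZIP container, pages are stored as .fpage XML parts, and each text run is a Glyphs element that carries its text in a UnicodeString attribute. The names XpsTextSketch and extractText are made up for illustration, and this is not meant to describe how a Tika parser is or should be implemented. It assumes Java 9+ (for readAllBytes) and the JDK's bundled XML parser.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika.
final class XpsTextSketch {

    /** Returns the text of every Glyphs run found in .fpage parts, one run per line. */
    static String extractText(InputStream xps) throws Exception {
        StringBuilder text = new StringBuilder();

        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations, since the file may come from an untrusted source.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);

        // Collects the UnicodeString attribute of each Glyphs element.
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("Glyphs".equals(localName)) {
                    String run = atts.getValue("UnicodeString");
                    if (run != null && !run.isEmpty()) {
                        text.append(run).append('\n');
                    }
                }
            }
        };

        try (ZipInputStream zip = new ZipInputStream(xps)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName().toLowerCase(Locale.ROOT);
                if (entry.isDirectory() || !name.endsWith(".fpage")) {
                    continue; // only fixed-page parts are inspected
                }
                // Buffer the part so the SAX parser cannot close the ZIP stream.
                byte[] page = zip.readAllBytes();
                factory.newSAXParser().parse(new ByteArrayInputStream(page), handler);
            }
        }
        return text.toString();
    }
}
```

      Calling XpsTextSketch.extractText(...) with the same resource stream as in the code above should return the raw text runs. Pages come out in ZIP entry order, which is not guaranteed to match the document's page order, and glyph positions are ignored, so words split across runs are not rejoined.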

        Attachments

        1. A3S3TDRXL6DN2AN3NU2OE5L7KGFY6DZA.xps
          237 kB
          Tim Allison
        2. doc_xps.xps
          87 kB
          Peter Davies
        3. WithBiDi.xps
          136 kB
          Nick Burch

            People

            • Assignee: Tim Allison (tallison@apache.org)
            • Reporter: Peter Davies (pete_openanswers)