Tika / TIKA-2524

Create/integrate a parser for XPS

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.16
    • Fix Version/s: 1.18, 2.0.0
    • Component/s: parser

    Description

      When we parse XPS files with AutoDetectParser, we always get an empty string.
      DefaultDetector.detect(), by contrast, correctly detects the MediaType as "application/vnd.ms-xpsdocument".

      However, the supported formats page
      https://tika.apache.org/1.16/formats.html
      suggests that XPS (application/vnd.ms-xpsdocument) is supported.

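      Our code:
      import java.io.InputStream;
      import org.apache.tika.io.TikaInputStream;
      import org.apache.tika.metadata.Metadata;
      import org.apache.tika.parser.AutoDetectParser;
      import org.apache.tika.sax.BodyContentHandler;

      // Load the test file from the classpath and run it through the auto-detecting parser.
      InputStream bis = this.getClass().getResourceAsStream("/" + EXPECTED_LOCATION + "doc_xps.xps");
      Metadata metadata = new Metadata();
      BodyContentHandler handler = new BodyContentHandler();
      AutoDetectParser parser = new AutoDetectParser();
      TikaInputStream tikaStream = TikaInputStream.get(bis);
      parser.parse(tikaStream, handler, metadata);
      String parsedText = handler.toString(); // always empty for XPS input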

      I will attach doc_xps.xps if I can
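
For anyone who needs text out of an XPS file before a fixed release is available, here is a rough stdlib-only sketch. It assumes the usual XPS layout: a ZIP package whose page parts end in ".fpage" and carry their text in the UnicodeString attribute of Glyphs elements. The class name XpsTextSketch and the method extractText are made up for illustration, and this is not how the Tika fix is implemented. Pages come out in ZIP entry order, which may differ from reading order, and glyph runs that have no UnicodeString are skipped.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: pulls glyph text out of an XPS package.
public class XpsTextSketch {

    public static String extractText(InputStream xps) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations so untrusted files cannot pull in external entities.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);

        StringBuilder text = new StringBuilder();
        DefaultHandler glyphCollector = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (!"Glyphs".equals(localName)) {
                    return;
                }
                String run = atts.getValue("UnicodeString");
                if (run == null || run.isEmpty()) {
                    return;
                }
                // Assumption from the XPS format: a leading "{}" escapes a literal "{".
                if (run.startsWith("{}")) {
                    run = run.substring(2);
                }
                text.append(run).append('\n');
            }
        };

        try (ZipInputStream zip = new ZipInputStream(xps)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.getName().toLowerCase(Locale.ROOT).endsWith(".fpage")) {
                    continue;
                }
                // Buffer the entry so the SAX parser cannot close the ZIP stream.
                byte[] page = zip.readAllBytes();
                SAXParser parser = factory.newSAXParser();
                parser.parse(new ByteArrayInputStream(page), glyphCollector);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.print(extractText(in));
        }
    }
}
```

Running it as "java XpsTextSketch doc_xps.xps" prints one line per glyph run. It ignores metadata, embedded images, and text layout, so it is only a stopgap and no substitute for a real parser.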

      Attachments

        1. A3S3TDRXL6DN2AN3NU2OE5L7KGFY6DZA.xps (237 kB, Tim Allison)
        2. doc_xps.xps (87 kB, Peter Davies)
        3. WithBiDi.xps (136 kB, Nick Burch)

          People

            Assignee: Tim Allison (tallison)
            Reporter: Peter Davies (pete_openanswers)
            Votes: 0
            Watchers: 4
