TIKA-1768: Document headers and footers in metadata

Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Version: 1.13

    Description

      I have a use case where I need document headers and footers to be explicitly marked as such in Tika's output metadata fields. As far as I can see, there is no easy built-in way to do this.

      The attached patch adds a HeaderFooterContentHandler, which puts headers and footers into their own metadata fields. This works out of the box with Word file formats.

      The patch also includes some tweaks that let the Excel and PowerPoint parsers/extractors explicitly mark headers and footers as such in the output XHTML, so that the aforementioned content handler can spot them. Unit tests have been added, and existing ones modified, to verify that the parsers and the content handler work together correctly.
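
      To illustrate the general idea (this is not the code from the attached patch), here is a minimal sketch of a SAX handler that collects header and footer text into separate fields. It assumes the XHTML marks these regions as div elements with class="header" or class="footer"; that markup, the class name HeaderFooterSketch, and the field names "header" and "footer" are all assumptions made for the example. It uses only the JDK's SAX classes, so it runs without Tika on the classpath.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: not the HeaderFooterContentHandler from the patch.
class HeaderFooterSketch extends DefaultHandler {
    // Collected text, keyed by the assumed field names "header" and "footer".
    private final Map<String, List<String>> fields = new LinkedHashMap<>();
    private final StringBuilder buffer = new StringBuilder();
    private String current; // region being captured, or null when outside one
    private int depth;      // element nesting depth inside the captured region

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (current != null) {
            depth++; // nested element inside a header/footer region
            return;
        }
        String cls = atts.getValue("class");
        // Assumed markup: <div class="header"> and <div class="footer">.
        if ("div".equals(qName) && ("header".equals(cls) || "footer".equals(cls))) {
            current = cls;
            depth = 1;
            buffer.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // The region ends when its opening div is closed.
        if (current != null && --depth == 0) {
            fields.computeIfAbsent(current, k -> new ArrayList<>())
                  .add(buffer.toString().trim());
            current = null;
        }
    }

    // Parses an XHTML string and returns the captured header/footer text.
    static Map<String, List<String>> extract(String xhtml) throws Exception {
        HeaderFooterSketch handler = new HeaderFooterSketch();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.fields;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body>"
            + "<div class=\"header\">Annual Report</div>"
            + "<p>Body text</p>"
            + "<div class=\"footer\">Page <b>1</b></div>"
            + "</body></html>";
        System.out.println(extract(xhtml));
    }
}
```

      In a real Tika setup, a handler along these lines would presumably wrap or decorate the content handler passed to the parser and write into the Metadata object rather than a plain map; how the patch actually does this is not shown in this issue text.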

      Attachments

        1. headers_footers.patch
          90 kB
          Aeham Abushwashi
        2. HeaderAndFooterTestFiles.zip
          95 kB
          Aeham Abushwashi

          People

            Assignee: Unassigned
            Reporter: Aeham Abushwashi
            Votes: 1
            Watchers: 2
