Tika / TIKA-1768

Document headers and footers in metadata

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.13
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      I have a use case where I need document headers and footers to be explicitly marked as such in Tika's output metadata fields. As far as I can see, there is no easy built-in way to do this.

      The attached patch adds a HeaderFooterContentHandler which enables headers and footers to be added to their own metadata fields. This works out of the box with Word file formats.

      Also included in the patch are some tweaks that let the Excel and PowerPoint parsers/extractors explicitly mark headers and footers as such in the output XHTML, so that the aforementioned content handler can spot them. Unit tests have been added, and existing ones modified, to verify that the parsers and the content handler work together correctly.
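
      To illustrate the idea, here is a minimal sketch of a SAX handler that collects header and footer text into separate fields. It is not the HeaderFooterContentHandler from the patch: the class name HeaderFooterSketch, the field names "header" and "footer", and the assumption that parsers mark these regions with a class attribute of that value are all hypothetical, and the sketch uses only the JDK's SAX classes rather than Tika's Metadata API.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not the handler from the attached patch.
// Assumes header/footer regions are marked as class="header" / class="footer".
class HeaderFooterSketch extends DefaultHandler {
    // Collected text, keyed by the assumed field names "header" and "footer".
    private final Map<String, StringBuilder> fields = new LinkedHashMap<>();
    private String current = null; // region currently being read, or null
    private int depth = 0;         // element nesting depth inside that region

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (current != null) {
            depth++; // nested element inside a header or footer
            return;
        }
        String cls = atts.getValue("class");
        if ("header".equals(cls) || "footer".equals(cls)) {
            current = cls;
            depth = 1;
            StringBuilder sb = fields.computeIfAbsent(cls, k -> new StringBuilder());
            if (sb.length() > 0) {
                sb.append('\n'); // separate repeated regions of the same kind
            }
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null && --depth == 0) {
            current = null; // left the header or footer region
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            fields.get(current).append(ch, start, length);
        }
    }

    // Parses an XHTML string and returns the collected header/footer text.
    static Map<String, String> extract(String xhtml) {
        HeaderFooterSketch handler = new HeaderFooterSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
        Map<String, String> out = new LinkedHashMap<>();
        handler.fields.forEach((k, v) -> out.put(k, v.toString().trim()));
        return out;
    }
}
```

      Calling HeaderFooterSketch.extract(...) on XHTML that contains one div with class="header" and one with class="footer" returns a map with one entry per region, while body text outside those regions is ignored. In a real Tika setup the collected text would presumably be written to the Metadata object instead of a map.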

      1. HeaderAndFooterTestFiles.zip
        95 kB
        Aeham Abushwashi
      2. headers_footers.patch
        90 kB
        Aeham Abushwashi

        Activity

        aeham.abushwashi Aeham Abushwashi added a comment -

        Any updates on this enhancement (and patch)?

        udittmer Ulf Dittmer added a comment -

        As a semi-related issue, I'd like to see an option to have parsers ignore headers and footers.

        aeham.abushwashi Aeham Abushwashi added a comment -

        Is this likely to be fixed in the next release of Tika?

        aeham.abushwashi Aeham Abushwashi added a comment -

        Updated patch

        aeham.abushwashi Aeham Abushwashi added a comment -

        I have attached the data files separately because they cannot be extracted from the patch file. Please unzip them and copy them to tika-parsers\src\test\resources\test-documents before running the new tests.


          People

          • Assignee:
            Unassigned
            Reporter:
            aeham.abushwashi Aeham Abushwashi
          • Votes:
            1
            Watchers:
            2
