Tika / TIKA-906

Headers, footers, and footnotes not extracted from Pages documents

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0
    • Fix Version/s: 1.2
    • Component/s: parser
    • Labels:
    • Environment: Windows 7

      Description

      Tika does not extract anything from the header or footer area and also does not extract footnotes.

        Activity

        gvalenc Gabriel Valencia added a comment -

        Contains header text, footer text (including automatic page numbering), and some footnotes.

        gagravarr Nick Burch added a comment -

        Support added in r1331618. We can now get headers, footers and footnotes, assuming a file only has one set of each, with the default names. (If a file has multiple styles with different ones, the code will likely just end up with the last one.)

        Note that we are rapidly approaching the point where the current model for the parser won't cope. At that point, we'll need to start holding things like styles, headers and footers properly, track more state as we process the file (a single state level isn't really enough), and be aware of the styles applied to text.
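
To illustrate the kind of state tracking described in this comment, here is a minimal sketch using the JDK's SAX API. It is not the code from r1331618 or any part of Tika's Pages parser. The class `RegionCollector` and its `collect` method are hypothetical, and the element names `sf:header` and `sf:footnote` are assumptions (only `sf:footer` is named in this issue). It keeps every header, footer and footnote it sees instead of only the last one.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; this is not Tika's Pages parser code.
class RegionCollector extends DefaultHandler {
    // sf:footer appears in this issue; sf:header and sf:footnote are assumed names.
    private static final Set<String> REGIONS =
            new HashSet<>(Arrays.asList("sf:header", "sf:footer", "sf:footnote"));

    // One buffer per open region; text always goes to the innermost open region.
    private final Deque<StringBuilder> open = new ArrayDeque<>();
    // Every occurrence is kept (in the order regions close), not just the last one.
    private final Map<String, List<String>> found = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (REGIONS.contains(qName)) {
            open.push(new StringBuilder());
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!open.isEmpty()) {
            open.peek().append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (REGIONS.contains(qName)) {
            String text = open.pop().toString().trim();
            found.computeIfAbsent(qName, k -> new ArrayList<>()).add(text);
        }
    }

    // Parses an XML string with the default (non-namespace-aware) SAX parser,
    // so element names are matched on their qualified name, e.g. "sf:footer".
    static Map<String, List<String>> collect(String xml) {
        RegionCollector collector = new RegionCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), collector);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
        return collector.found;
    }
}
```

Calling `RegionCollector.collect(xml)` returns a map from element name to the list of texts found for it. Using a stack of buffers rather than a single "current region" flag is one possible answer to the "single state level isn't really enough" concern; tracking styles would need more than this.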

        gvalenc Gabriel Valencia added a comment -

        This document also had automatic page numbering in the footer, but that doesn't get parsed. It's contained in the sf in the sf:footer as an sf:page-number. However, it only has one of them even though there are 2 pages. I guess the rest are automatically added by Pages.
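
As a rough illustration of the observation above (one page-number element in the file, one rendered number per page), the sketch below expands a single footer template into one footer per page. Everything in it is an assumption for illustration: the class `FooterExpander`, the `{page-number}` marker standing in for the `sf:page-number` element, and the idea that the page count is known to the caller. It does not describe how Pages or Tika handle this.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not taken from Tika.
class FooterExpander {
    // Hypothetical marker standing in for the single sf:page-number element.
    static final String PAGE_MARKER = "{page-number}";

    // Returns one footer string per page, with the marker replaced by 1..pageCount.
    static List<String> perPage(String footerTemplate, int pageCount) {
        List<String> footers = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            footers.add(footerTemplate.replace(PAGE_MARKER, Integer.toString(page)));
        }
        return footers;
    }
}
```

For a two-page document, `FooterExpander.perPage("Page {page-number}", 2)` yields the footers "Page 1" and "Page 2". Whether an extractor should emit the footer once or once per page is a separate design choice.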

        gvalenc Gabriel Valencia added a comment -

        Going to reopen in light of the automatic page number issue.

        chrismattmann Chris A. Mattmann added a comment -
        • push to 1.3
        chrismattmann Chris A. Mattmann added a comment -
        • push to 1.3
        davemeikle Dave Meikle added a comment -

        Support for AutoPageNumbers added in r1358856.

        rgauss Ray Gauss II added a comment -

        AutoPageNumberUtilsTest.java is missing a license header, which is causing RAT to fail.

        Shall I add the header?

        mikemccand Michael McCandless added a comment -

        Shall I add the header?

        +1

        davemeikle Dave Meikle added a comment -

        Sorry - I missed the header the first time. Added it now in r1367301.

        Thanks for spotting it, Ray.
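
For readers unfamiliar with the header being discussed, the sketch below starts with the standard ASF source header and then shows a deliberately simple check for it. The class `LicenseHeaderCheck`, its method, and the 2048-character window are hypothetical; Apache RAT performs its own matching, and this is only an approximation of the idea.

```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

// Hypothetical helper; this is not how Apache RAT is implemented.
class LicenseHeaderCheck {
    // First phrase of the ASF header, used here as a simple marker.
    static final String MARKER = "Licensed to the Apache Software Foundation (ASF)";

    // Returns true if the marker appears near the top of the source text.
    static boolean hasAsfHeader(String source) {
        int limit = Math.min(source.length(), 2048); // assumed window size
        return source.substring(0, limit).contains(MARKER);
    }
}
```

A file that begins with the comment block above would pass this check; a file that starts directly with a `package` line would not.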


          People

          • Assignee: Unassigned
          • Reporter: gvalenc Gabriel Valencia
          • Votes: 0
          • Watchers: 5
