Tika / TIKA-904

Pages documents created in Layout mode not supported

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0
    • Fix Version/s: 1.2
    • Component/s: parser
    • Labels:
    • Environment: Windows 7

      Description

      Pages supports Layout editing mode, which provides free-form editing as opposed to standard line-by-line word processing. You use text boxes and other embedded objects to add content. Tika only extracts metadata from these documents, not the actual content.

      1. testPagesCanvasJIRA.pages
        65 kB
        Gabriel Valencia
      2. TIKA-904.patch
        6 kB
        Michael McCandless

        Activity

        Michael McCandless added a comment -

        Patch w/ test case & fix... it looks like we also have to output characters we find inside the sl:page-group tag... but ignore it if it's inside sf:ghost-text.
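
        The rule described above could be sketched roughly as follows. This is not the code from TIKA-904.patch; it is a minimal, standalone SAX handler using only the JDK, and the class name PageGroupTextSketch, the method extract, and the depth-counter approach are all assumptions made for illustration. Element names are matched on the prefixed qName as written in this issue.

        ```java
        import java.io.StringReader;
        import javax.xml.parsers.SAXParserFactory;
        import org.xml.sax.Attributes;
        import org.xml.sax.InputSource;
        import org.xml.sax.helpers.DefaultHandler;

        // Hypothetical sketch, not Tika's actual Pages parser.
        class PageGroupTextSketch {

            /** Collects characters inside sl:page-group, skipping sf:ghost-text. */
            static class Handler extends DefaultHandler {
                private final StringBuilder text = new StringBuilder();
                private int pageGroupDepth = 0; // > 0 while inside sl:page-group
                private int ghostTextDepth = 0; // > 0 while inside sf:ghost-text

                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attrs) {
                    if ("sl:page-group".equals(qName)) {
                        pageGroupDepth++;
                    } else if ("sf:ghost-text".equals(qName)) {
                        ghostTextDepth++;
                    }
                }

                @Override
                public void endElement(String uri, String localName, String qName) {
                    if ("sl:page-group".equals(qName)) {
                        pageGroupDepth--;
                    } else if ("sf:ghost-text".equals(qName)) {
                        ghostTextDepth--;
                    }
                }

                @Override
                public void characters(char[] ch, int start, int length) {
                    // Emit only text that is in a page group and not ghost text.
                    if (pageGroupDepth > 0 && ghostTextDepth == 0) {
                        text.append(ch, start, length);
                    }
                }

                String text() {
                    return text.toString();
                }
            }

            /** Parses the given XML string and returns the collected text. */
            static String extract(String xml) throws Exception {
                SAXParserFactory factory = SAXParserFactory.newInstance();
                factory.setNamespaceAware(false); // match on prefixed qNames
                Handler handler = new Handler();
                factory.newSAXParser()
                       .parse(new InputSource(new StringReader(xml)), handler);
                return handler.text();
            }
        }
        ```

        Counters are used instead of booleans so that nested occurrences of either element, if they exist, do not switch the filter off too early.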

        Gabriel Valencia added a comment -

        Looks like free-form docs have everything under sl:document -> sl:drawables -> sl:page-group (1 or more), under which are the sf:drawable-shape instances. If you dig into these, you eventually get to sf:text-body.

        Unlike the regular structure, the main sf:text-storage has nothing in it. However, the above use of page-groups is also how a regular document stores embedded text boxes. So it seems the only difference is that free-form docs only make use of the sl:drawables section.
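
        One way to check this kind of structural claim on a given file is to list the ancestor path of every sf:text-body element. The sketch below assumes the document's XML has already been obtained as a string; the class name TextBodyLocator and the method textBodyPaths are hypothetical, and only JDK classes are used.

        ```java
        import java.io.StringReader;
        import java.util.ArrayDeque;
        import java.util.ArrayList;
        import java.util.Deque;
        import java.util.List;
        import javax.xml.parsers.SAXParserFactory;
        import org.xml.sax.Attributes;
        import org.xml.sax.InputSource;
        import org.xml.sax.helpers.DefaultHandler;

        // Hypothetical diagnostic helper, not part of Tika.
        class TextBodyLocator {

            /** Returns one "a -> b -> c" path per sf:text-body element found. */
            static List<String> textBodyPaths(String xml) throws Exception {
                final Deque<String> stack = new ArrayDeque<>(); // open elements, root first
                final List<String> paths = new ArrayList<>();

                DefaultHandler handler = new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes attrs) {
                        stack.addLast(qName);
                        if ("sf:text-body".equals(qName)) {
                            paths.add(String.join(" -> ", stack));
                        }
                    }

                    @Override
                    public void endElement(String uri, String localName, String qName) {
                        stack.removeLast();
                    }
                };

                SAXParserFactory factory = SAXParserFactory.newInstance();
                factory.setNamespaceAware(false); // keep prefixed qNames such as sl:page-group
                factory.newSAXParser()
                       .parse(new InputSource(new StringReader(xml)), handler);
                return paths;
            }
        }
        ```

        Running this over a regular document and a free-form one, then comparing the printed paths, would show whether text bodies appear only under sl:drawables in the free-form case.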

        Nick Burch added a comment -

        Any chance you could compare two simple documents, one with regular structure and one with free-form, and see how the xml structure of the part that holds the text differs? It is especially important to know whether the free-form document also stores its text in sf tags inside sf:text-body, and whether sf:page-start elements show up.

        Gabriel Valencia added a comment -

        Sample Layout editing mode document


          People

          • Assignee: Michael McCandless
          • Reporter: Gabriel Valencia
          • Votes: 0
          • Watchers: 1
