Tika / TIKA-904

Pages documents created in Layout mode not supported

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0
    • Fix Version/s: 1.2
    • Component/s: parser
    • Labels:
    • Environment:

      Windows 7

      Description

      Pages supports a Layout editing mode, which provides free-form editing as opposed to standard line-by-line word processing. Content is added through text boxes and other embedded objects. For these documents Tika extracts only the metadata, not the actual content.

      1. TIKA-904.patch
        6 kB
        Michael McCandless
      2. testPagesCanvasJIRA.pages
        65 kB
        Gabriel Valencia

        Activity

        Gabriel Valencia created issue -
        Gabriel Valencia added a comment -

        Sample Layout editing mode document

        Gabriel Valencia made changes -
        Field Original Value New Value
        Attachment testPagesCanvasJIRA.pages [ 12524883 ]
        Gabriel Valencia made changes -
        Labels iwork
        Gabriel Valencia made changes -
        Issue Type Bug [ 1 ] Improvement [ 4 ]
        Gabriel Valencia made changes -
        Labels iwork iWork
        Nick Burch added a comment -

        Any chance you could compare two simple documents, one with regular structure and one free-form, and see how the XML structure of the part that holds the text differs? It is especially important to know whether the free-form document also stores its text in sf tags inside sf:text-body, and whether sf:page-start elements show up.

        Gabriel Valencia added a comment -

        Looks like free-form docs have everything under sl:document -> sl:drawables -> sl:page-group (1 or more), under which are the sf:drawable-shape instances. If you dig into these, you eventually get to sf:text-body.

        Unlike the regular structure, the main sf:text-storage has nothing in it. However, the above use of page-groups is also how a regular document stores embedded text boxes. So it seems the only difference is that free-form docs only make use of the sl:drawables section.
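
        The layout described above could be checked with a small DOM sketch like the one below. It is only an illustration, not Tika code: the class and method names are hypothetical, the sample XML is heavily simplified, and placing the main sf:text-storage as a direct child of sl:document is an assumption made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of Tika.
final class LayoutModeSketch {

    /** Returns the first direct child element with the given prefixed name, or null. */
    static Element child(Element parent, String qName) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && qName.equals(n.getNodeName())) {
                return (Element) n;
            }
        }
        return null;
    }

    /**
     * Heuristic: the document looks free-form when the main sf:text-storage
     * (assumed here to sit directly under sl:document) holds no text while
     * sl:drawables contains at least one sf:text-body.
     */
    static boolean looksFreeForm(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Not namespace-aware, so node names keep their literal "sl:"/"sf:" prefixes.
            factory.setNamespaceAware(false);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element root = doc.getDocumentElement();

            Element storage = child(root, "sf:text-storage");
            boolean mainEmpty = storage == null
                    || storage.getTextContent().trim().isEmpty();

            Element drawables = child(root, "sl:drawables");
            boolean hasTextBoxes = drawables != null
                    && drawables.getElementsByTagName("sf:text-body").getLength() > 0;

            return mainEmpty && hasTextBoxes;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        // Simplified, made-up sample; the namespace URIs are placeholders.
        String freeForm = "<sl:document xmlns:sl='urn:example:sl' xmlns:sf='urn:example:sf'>"
                + "<sf:text-storage/>"
                + "<sl:drawables><sl:page-group><sf:drawable-shape>"
                + "<sf:text-body>Box text</sf:text-body>"
                + "</sf:drawable-shape></sl:page-group></sl:drawables>"
                + "</sl:document>";
        System.out.println(looksFreeForm(freeForm));
    }
}
```

        A regular document with embedded text boxes would fail the first check, since its main sf:text-storage is not empty.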

        Michael McCandless made changes -
        Assignee Michael McCandless [ mikemccand ]
        Michael McCandless added a comment -

        Patch with test case and fix. It looks like we also have to output the characters we find inside the sl:page-group tag, but ignore them when they are inside sf:ghost-text.
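
        The rule in this comment can be illustrated with a standalone SAX handler. This is a minimal sketch and not the contents of TIKA-904.patch: the class name, the sample XML, and its placeholder namespace URIs are assumptions made for the example, and it uses only the JDK's SAX parser rather than Tika's own handler classes.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration, not the actual Tika Pages handler.
final class PageGroupTextSketch {

    /** Collects character data inside sl:page-group, skipping sf:ghost-text. */
    static final class Handler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();
        private int pageGroupDepth = 0;  // > 0 while inside sl:page-group
        private int ghostTextDepth = 0;  // > 0 while inside sf:ghost-text

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("sl:page-group".equals(qName)) {
                pageGroupDepth++;
            } else if ("sf:ghost-text".equals(qName)) {
                ghostTextDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("sl:page-group".equals(qName)) {
                pageGroupDepth--;
            } else if ("sf:ghost-text".equals(qName)) {
                ghostTextDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Emit text only inside a page group and outside any ghost text.
            if (pageGroupDepth > 0 && ghostTextDepth == 0) {
                out.append(ch, start, length);
            }
        }
    }

    static String extract(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Not namespace-aware, so qName carries the literal "sl:"/"sf:" prefix.
            factory.setNamespaceAware(false);
            Handler handler = new Handler();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.out.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        // Simplified, made-up sample; real Pages XML nests more elements.
        String xml = "<sl:document xmlns:sl='urn:example:sl' xmlns:sf='urn:example:sf'>"
                + "<sl:drawables><sl:page-group><sf:drawable-shape><sf:text-body>"
                + "Hello<sf:ghost-text>placeholder</sf:ghost-text> world"
                + "</sf:text-body></sf:drawable-shape></sl:page-group></sl:drawables>"
                + "</sl:document>";
        System.out.println(extract(xml)); // prints "Hello world"
    }
}
```

        Depth counters are used instead of booleans so that nested elements of the same name would not switch the state off early.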

        Michael McCandless made changes -
        Attachment TIKA-904.patch [ 12527666 ]
        Michael McCandless made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Fix Version/s 1.2 [ 12320169 ]
        Resolution Fixed [ 1 ]

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Gabriel Valencia
          • Votes:
            0
            Watchers:
            1
