Tika / TIKA-905

Embedded text boxes and shapes with text not supported

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.0
    • Fix Version/s: 1.2
    • Component/s: parser
    • Labels:
    • Environment:

      Windows 7

      Description

      This is similar to TIKA-904 but for normal word processing documents. In those, text contained in text boxes and shapes is not extracted.

        Activity

        Gabriel Valencia created issue -
        Gabriel Valencia added a comment -

        Contains various embedded objects including text boxes and shapes with text

        Gabriel Valencia made changes -
        Field Original Value New Value
        Attachment testPagesEmbeddedJIRA.pages [ 12524887 ]
        Gabriel Valencia made changes -
        Labels iwork
        Gabriel Valencia added a comment -

        I'm new to JIRA, so please change if I'm wrong. I figure this should be an improvement, not a bug.

        Gabriel Valencia made changes -
        Issue Type Bug [ 1 ] Improvement [ 4 ]
        Gabriel Valencia made changes -
        Labels iwork iWork
        Nick Burch added a comment -

        Are you able to identify where in the file these text boxes occur, and what sort of tags hold the text? If the text boxes don't occur in the main text area, can you identify how to link back from the main text to the text box? (You might find it helpful to review how annotations work, which we now support as of r1331640, for an idea of how this might work)

        Gabriel Valencia added a comment -

        Check out my comment in TIKA-904. They are all contained in sl:document -> sl:drawables -> sl:page-group (1 or more) -> sf:drawable-shape (1 or more) -> sf:text -> sf:text-storage -> sf:text-body -> sf.

        You get one sf:drawable-shape for each text box.
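
        The element path described above can be walked with a plain SAX handler. What follows is only a rough sketch under stated assumptions, not the code Tika actually uses: the class and method names are hypothetical, the parser is left non-namespace-aware so elements are matched on their qualified names (sf:drawable-shape, sf:text-body), and the sf:p element in the sample is a placeholder, since the path in the comment is cut off after "sf". It only checks the sf:drawable-shape and sf:text-body levels, not the full path from sl:document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: collects the text of each sf:drawable-shape.
public class DrawableShapeTextSketch {

    /** Returns one string per sf:drawable-shape, holding the text found under its sf:text-body. */
    public static List<String> extractShapeText(String xml) {
        final List<String> shapes = new ArrayList<>();

        DefaultHandler handler = new DefaultHandler() {
            private boolean inShape;        // inside sf:drawable-shape
            private boolean inBody;         // inside sf:text-body of that shape
            private StringBuilder current;  // text collected for the current shape

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("sf:drawable-shape".equals(qName)) {
                    inShape = true;
                    current = new StringBuilder();
                } else if (inShape && "sf:text-body".equals(qName)) {
                    inBody = true;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Keep character data from any descendant of sf:text-body.
                if (inBody) {
                    current.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("sf:text-body".equals(qName)) {
                    inBody = false;
                } else if (inShape && "sf:drawable-shape".equals(qName)) {
                    // One entry per text box, as described in the comment.
                    shapes.add(current.toString().trim());
                    inShape = false;
                }
            }
        };

        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(false); // match on qualified names such as "sf:text-body"
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample XML", e);
        }
        return shapes;
    }

    public static void main(String[] args) {
        // Made-up sample that mirrors the described nesting; sf:p is a placeholder name.
        String xml =
            "<sl:document><sl:drawables><sl:page-group>"
          + "<sf:drawable-shape><sf:text><sf:text-storage><sf:text-body>"
          + "<sf:p>First box</sf:p>"
          + "</sf:text-body></sf:text-storage></sf:text></sf:drawable-shape>"
          + "<sf:drawable-shape><sf:text><sf:text-storage><sf:text-body>"
          + "<sf:p>Second box</sf:p>"
          + "</sf:text-body></sf:text-storage></sf:text></sf:drawable-shape>"
          + "</sl:page-group></sl:drawables></sl:document>";
        System.out.println(extractShapeText(xml));
    }
}
```

        Collecting all character data below sf:text-body, rather than naming a specific child element, avoids guessing at the part of the path that is missing from the comment.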

        Michael McCandless added a comment -

        Looks like this was fixed with TIKA-904.

        Michael McCandless made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Fix Version/s 1.2 [ 12320169 ]
        Resolution Duplicate [ 3 ]

          People

          • Assignee:
            Unassigned
            Reporter:
            Gabriel Valencia
          • Votes:
            0
            Watchers:
            1
