  1. Tika
  2. TIKA-1840

No way to link slide notes to slide in PPT output.

    Details

    • Type: Improvement
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.11
    • Fix Version/s: 1.16
    • Component/s: parser
    • Labels:
      None

      Description

      I'm integrating Apache Tika into my project, and I want to extract text from PowerPoint slides, in both PPT and PPTX formats.

      I've noticed that with the PPT format, the slide notes are all aggregated at the end of the XML output, and there is no way to identify which note belongs to which slide.

      I began looking at the code and found the following:

      // TODO Find the Notes for this slide and extract inline
      

      in HSLFExtractor.java on line 140

      I would like to implement this part and contribute it.
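
      The idea behind that TODO can be sketched without Tika or POI at all. What follows is a minimal, self-contained model and not the code in HSLFExtractor.java: the `Slide` class and the `render` method are hypothetical names chosen for illustration, and the element and class names in the output are assumptions rather than Tika's actual XHTML. It only shows each slide's notes being written directly after that slide's text, so a consumer can tell which note belongs to which slide.

```java
import java.util.Arrays;
import java.util.List;

public class InlineNotesSketch {

    /** Hypothetical stand-in for a parsed slide; not a POI or Tika type. */
    static final class Slide {
        final String text;
        final String notes; // null when the slide has no notes

        Slide(String text, String notes) {
            this.text = text;
            this.notes = notes;
        }
    }

    /**
     * Writes each slide's notes right after the slide body instead of
     * collecting all notes at the end. No XML escaping is done here;
     * this is a sketch of the ordering only.
     */
    static String render(List<Slide> slides) {
        StringBuilder out = new StringBuilder();
        int number = 1;
        for (Slide slide : slides) {
            out.append("<div class=\"slide\" id=\"slide-").append(number).append("\">");
            out.append("<p>").append(slide.text).append("</p>");
            if (slide.notes != null && !slide.notes.isEmpty()) {
                // Notes are nested inside the slide they belong to.
                out.append("<div class=\"slide-notes\">").append(slide.notes).append("</div>");
            }
            out.append("</div>\n");
            number++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Slide> deck = Arrays.asList(
                new Slide("Intro", "Welcome the audience"),
                new Slide("Agenda", null));
        System.out.print(render(deck));
    }
}
```

      Nesting the notes inside the slide's own element is one possible design; a sibling element carrying the slide number would link them equally well.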

        Activity

        githubbot ASF GitHub Bot added a comment -

        GitHub user zetisam opened a pull request:

        https://github.com/apache/tika/pull/72

        fix for TIKA-1840 contributed by zetisam

        You can merge this pull request into a Git repository by running:

        $ git pull https://github.com/zetisam/tika TIKA-1840

        Alternatively you can review and apply these changes as the patch at:

        https://github.com/apache/tika/pull/72.patch

        To close this pull request, make a commit to your master/trunk branch
        with (at least) the following in the commit message:

        This closes #72


        commit 52b82bddef7c7ae8a430c9871594295e71882055
        Author: Sam Heijens <sam.heijens@zeticon.com>
        Date: 2016-01-22T10:09:48Z

        fix for TIKA-1840 contributed by zetisam


        githubbot ASF GitHub Bot added a comment -

        Github user asfgit closed the pull request at:

        https://github.com/apache/tika/pull/72

        chrismattmann Chris A. Mattmann added a comment -

        Committed in master and sync'ed to Github. Since we are rolling with 1.12 and this is super close, I figured we can merge it and improve iteratively. Nick Burch. Thanks Sam!

        [chipotle:~/tmp/tika1.12] mattmann% git merge TIKA-1840
        Updating efb645e..1bc6176
        Fast-forward
         CHANGES.txt                                                                    |  3 +++
         tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java | 14 ++++++++++++--
         2 files changed, 15 insertions(+), 2 deletions(-)
        [chipotle:~/tmp/tika1.12] mattmann% git push -u origin master
        Counting objects: 94, done.
        Delta compression using up to 4 threads.
        Compressing objects: 100% (21/21), done.
        Writing objects: 100% (29/29), 2.44 KiB | 0 bytes/s, done.
        Total 29 (delta 11), reused 0 (delta 0)
        To https://git-wip-us.apache.org/repos/asf/tika.git
           efb645e..1bc6176  master -> master
        Branch master set up to track remote branch master from origin.
        [chipotle:~/tmp/tika1.12] mattmann% 
        
        gagravarr Nick Burch added a comment -

        Re-opening as the applied patch causes the notes text to be included twice, which isn't ideal, so further work still remains. (Details on the github request)
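
        One way to avoid that kind of duplication is to remember which notes were already written inline and skip them when the trailing notes are emitted. The sketch below is a guess at the general shape of such a fix, not what was committed: `NotesPage`, its integer `id`, and `trailingNotes` are hypothetical names, and the real HSLF notes objects are not modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NotesDedupSketch {

    /** Hypothetical notes record identified by an integer id. */
    static final class NotesPage {
        final int id;
        final String text;

        NotesPage(int id, String text) {
            this.id = id;
            this.text = text;
        }
    }

    /** Returns the text of notes that were NOT already emitted inline. */
    static List<String> trailingNotes(List<NotesPage> allNotes, Set<Integer> emittedInline) {
        List<String> remaining = new ArrayList<>();
        for (NotesPage page : allNotes) {
            if (!emittedInline.contains(page.id)) {
                remaining.add(page.text);
            }
        }
        return remaining;
    }

    public static void main(String[] args) {
        List<NotesPage> all = Arrays.asList(
                new NotesPage(1, "notes for slide 1"),
                new NotesPage(2, "orphaned notes"));
        // Assume notes with id 1 were already written next to their slide.
        Set<Integer> inline = new HashSet<>(Arrays.asList(1));
        System.out.println(trailingNotes(all, inline));
    }
}
```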

        chrismattmann Chris A. Mattmann added a comment -

        no worries Nick, I set the fix version to 1.13 (for the updates mentioned on Github).


          People

          • Assignee: Chris A. Mattmann (chrismattmann)
          • Reporter: Sam H (zetisam)
          • Votes: 0
          • Watchers: 4
