Tika / TIKA-1315

Basic list support in WordExtractor

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.6
    • Fix Version/s: 1.10
    • Component/s: parser
    • Labels:
      None

      Description

      Hello guys, I am really sorry to post an issue like this, but I have no other way of contacting you, and I don't quite understand how you manage forks and pull requests (I don't think you do that). I also don't know your coding style.

      In my project I needed Tika to parse numbered lists from Word .doc documents, but Tika doesn't support that. I looked for a solution and found one here: http://developerhints.blog.com/2010/08/28/finding-out-list-numbers-in-word-document-using-poi-hwpf/ . I adapted this solution to Apache Tika with a few fixes and improvements. Feel free to use any of it so it can help people who struggle with lists in Tika like I did.
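
      To illustrate the general idea of recovering list numbers, here is a minimal sketch. It is not the attached ListUtils and does not use any POI classes: the class name ListNumberTracker and the method nextLabel are hypothetical, and the caller is assumed to have already read a list id and a 0-based nesting level from each list paragraph. It also assumes plain decimal numbering ("1.", "1.1."), whereas real documents can use other number formats.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical helper (not the attached ListUtils): keeps one counter per
 * list and nesting level and builds decimal labels such as "2.1.".
 */
final class ListNumberTracker {
    // Word lists have at most nine nesting levels.
    private static final int MAX_LEVELS = 9;

    // list id -> current counter for each level (0 means "not started yet")
    private final Map<Integer, int[]> counters = new HashMap<>();

    /** Returns the label for the next paragraph of the given list at the given 0-based level. */
    String nextLabel(int listId, int level) {
        if (level < 0 || level >= MAX_LEVELS) {
            throw new IllegalArgumentException("level out of range: " + level);
        }
        int[] c = counters.computeIfAbsent(listId, id -> new int[MAX_LEVELS]);

        // Advance this level and restart every deeper level.
        c[level]++;
        for (int i = level + 1; i < MAX_LEVELS; i++) {
            c[i] = 0;
        }

        StringBuilder label = new StringBuilder();
        for (int i = 0; i <= level; i++) {
            // Assumption: a parent level that was skipped counts as 1.
            if (c[i] == 0) {
                c[i] = 1;
            }
            label.append(c[i]).append('.');
        }
        return label.toString();
    }
}
```

      Calling nextLabel once per list paragraph, in document order, gives the text that can be put in front of the paragraph content. Each list id has its own counters, so two interleaved lists do not disturb each other.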

      Attached files are:
      Updated test
      Fixed WordExtractor
      Added ListUtils

        Attachments

        1. complex_list_test.doc
          52 kB
          Moritz Dorka
        2. ListNumbering.patch
          7 kB
          Moritz Dorka
        3. ListManager.tar.bz2
          8 kB
          Moritz Dorka
        4. WordParserTest.java.patch
          0.6 kB
          Filip Bednárik
        5. WordExtractor.java.patch
          3 kB
          Filip Bednárik
        6. ListUtils.java
          8 kB
          Filip Bednárik

              People

              • Assignee:
                tallison Tim Allison
                Reporter:
                drndos Filip Bednárik
              • Votes:
                1
                Watchers:
                7