Tika / TIKA-1140

Better table representation, cell spanning in Word Extractor

Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None

    Description

      The current version of the Word extractor has access to various table features, but most of them are unused. Possible improvements include support for borders, fixed cell widths, and cell spanning.
      Some of these features are already used in POI's HTML converter, so that code can be reused in Tika.
      The attached patch is one possible solution. Some of its code, especially the border type and color detection, is based on the 2007 version of the .doc format specification, so further changes will likely be needed to support older formats.
      The patch already includes the unit test changes required by the changed document structure.
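
      To illustrate the cell-spanning idea, here is a minimal, self-contained sketch. It is not taken from the attached patch or from POI. It assumes each table row is given as an array of cell widths in twips (1/20 pt), collects every distinct vertical cell edge across the table, and derives a colspan for each cell from the number of edges it crosses. The class and method names (TableSpanSketch, cellEdges, colspans, toHtmlRow) are hypothetical.

```java
import java.util.Arrays;
import java.util.TreeSet;

/** Hypothetical sketch of colspan detection; not Tika or POI code. */
class TableSpanSketch {

    /** Collects every distinct vertical cell edge (in twips) across all rows, sorted ascending. */
    static int[] cellEdges(int[][] rowCellWidths) {
        TreeSet<Integer> edges = new TreeSet<>();
        for (int[] widths : rowCellWidths) {
            int x = 0;
            edges.add(x);
            for (int w : widths) {
                x += w;
                edges.add(x);
            }
        }
        return edges.stream().mapToInt(Integer::intValue).toArray();
    }

    /** A cell's colspan is the number of grid columns between its left and right edge. */
    static int[] colspans(int[] widths, int[] edges) {
        int[] spans = new int[widths.length];
        int x = 0;
        for (int i = 0; i < widths.length; i++) {
            int left = Arrays.binarySearch(edges, x);
            x += widths[i];
            int right = Arrays.binarySearch(edges, x);
            spans[i] = right - left;
        }
        return spans;
    }

    /** Renders one row as empty XHTML cells with a fixed width and, where needed, a colspan. */
    static String toHtmlRow(int[] widths, int[] edges) {
        int[] spans = colspans(widths, edges);
        StringBuilder sb = new StringBuilder("<tr>");
        for (int i = 0; i < widths.length; i++) {
            sb.append("<td");
            if (spans[i] > 1) {
                sb.append(" colspan=\"").append(spans[i]).append('"');
            }
            // Integer division: 20 twips per point.
            sb.append(" style=\"width:").append(widths[i] / 20).append("pt\"></td>");
        }
        return sb.append("</tr>").toString();
    }

    public static void main(String[] args) {
        // Assumed sample table: three rows whose cells cover the same total width.
        int[][] rows = {{1000, 1000, 1000}, {2000, 1000}, {3000}};
        int[] edges = cellEdges(rows);
        for (int[] row : rows) {
            System.out.println(toHtmlRow(row, edges));
        }
    }
}
```

      A real implementation would read the cell widths and borders from the parsed Word table instead of hard-coded arrays, and would also need to handle vertically merged cells, which this sketch ignores.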

      Attachments

        1. word_table.patch
          16 kB
          Denis Kildishev

            People

              Assignee: Unassigned
              Reporter: Denis Kildishev (kildishev)
              Votes: 0
              Watchers: 2
