Tika / TIKA-644

Parsing a Microsoft Word document with style "Heading X" where X > 6 creates invalid HTML with tags <h7>, <h8>, etc.


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.9
    • Fix Version/s: 0.10
    • Component/s: parser

      Description

      org.apache.tika.parser.microsoft.WordExtractor translates heading styles to "h" tags even when the heading level is greater than 6, which makes the generated XHTML invalid. The XHTML DTD defines heading elements only for levels 1 to 6:
      <!ENTITY % heading "h1|h2|h3|h4|h5|h6">

      Changing line 380 from:
      tag = "h"+num;
      to:
      tag = "h"+Math.min(num, 6);

      will resolve this, because any heading level above 6 is then emitted as <h6>.
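
      The proposed change can be tried in isolation with a minimal sketch like the one below. The class name HeadingTagSketch, the method headingTag, and the constant MAX_HEADING_LEVEL are hypothetical names introduced for illustration; they are not taken from WordExtractor, and the surrounding parser logic is not reproduced here.

```java
// Standalone sketch of the clamping proposed in this issue.
// HeadingTagSketch, headingTag and MAX_HEADING_LEVEL are hypothetical names,
// not part of Tika's WordExtractor.
class HeadingTagSketch {
    // Highest heading level defined by the XHTML DTD (h1 to h6).
    private static final int MAX_HEADING_LEVEL = 6;

    // Maps a Word "Heading N" level to an XHTML tag name,
    // clamping levels above 6 down to h6.
    static String headingTag(int num) {
        return "h" + Math.min(num, MAX_HEADING_LEVEL);
    }

    public static void main(String[] args) {
        // Levels 5 and 6 pass through unchanged; 7 to 9 are clamped to h6.
        for (int level = 5; level <= 9; level++) {
            System.out.println("Heading " + level + " -> <" + headingTag(level) + ">");
        }
    }
}
```

      As in the proposed one-line fix, only the upper bound is clamped; the sketch assumes the incoming level is at least 1.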


            People

            • Assignee: Nick Burch (nick)
            • Reporter: chris hudson (chud)


                Time Tracking

                • Original Estimate: 5m
                • Remaining Estimate: 5m
                • Time Spent: Not Specified