Tika / TIKA-3963

HTML author isn't mapped to its dc:creator counterpart


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.6.0
    • Fix Version/s: 2.8.0
    • Component/s: metadata
    • Labels: None
    • Environment: Tika server on Windows; Curl client on WSL Ubuntu instance

    Description

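      The 2.x migration doc mentions that author is generally, and automatically, mapped to its dc:creator equivalent when returned by Tika 2.x. That doesn't seem to be happening for HTML files: in the /rmeta/text output below, the value is returned under the plain "author" key and no "dc:creator" key is present. Can this be fixed?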

      author.html

      $ curl -X PUT --upload-file /mnt/c/tmp/author.html --header "Content-Disposition: attachment; filename=\"author.html\"" --header "Accept:Application/json" http://localhost:9998/rmeta/text | python -m json.tool
        % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                       Dload  Upload   Total   Spent    Left  Speed
      100  1152  100   716  100   436    685    417  0:00:01  0:00:01 --:--:--  1102
      [
          {
              "Content-Encoding": "UTF-8",
              "Content-Length": "436",
              "Content-Type": "text/html; charset=UTF-8",
              "X-TIKA:Parsed-By": [
                  "org.apache.tika.parser.DefaultParser",
                  "org.apache.tika.parser.html.HtmlParser"
              ],
              "X-TIKA:Parsed-By-Full-Set": [
                  "org.apache.tika.parser.DefaultParser",
                  "org.apache.tika.parser.html.HtmlParser"
              ],
              "X-TIKA:content": "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nAll meta information goes in the head section...\n\n\n",
              "X-TIKA:content_handler": "ToTextContentHandler",
              "X-TIKA:embedded_depth": "0",
              "X-TIKA:parse_time_millis": "886",
              "author": "John Doe",
              "description": "Free Web tutorials",
              "keywords": "HTML,CSS,XML,JavaScript",
              "resourceName": "author.html",
              "title": "OldMetaTitle",
              "viewport": "width=device-width, initial-scale=1.0"
          }
      ]
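
Until a fixed version is in use, the missing key could be filled in on the client side. The following is only a sketch of that idea, not how Tika resolved the issue: the helper name add_dc_aliases and the LEGACY_TO_DC table are hypothetical, and the single author to dc:creator pair is assumed from the title of this issue. It takes the parsed /rmeta JSON (a list of metadata objects) and copies a legacy key to its Dublin Core name only when the latter is absent.

```python
import json

# Hypothetical mapping table (not part of Tika): legacy HTML meta name -> Dublin Core key.
# Only the pair discussed in this issue is listed.
LEGACY_TO_DC = {"author": "dc:creator"}


def add_dc_aliases(rmeta_records, mapping=LEGACY_TO_DC):
    """Return copies of the /rmeta records with missing Dublin Core keys
    filled in from their legacy counterparts. Existing keys are never overwritten."""
    patched = []
    for record in rmeta_records:
        record = dict(record)  # shallow copy so the caller's data is left untouched
        for legacy_key, dc_key in mapping.items():
            if legacy_key in record and dc_key not in record:
                record[dc_key] = record[legacy_key]
        patched.append(record)
    return patched


# A trimmed-down version of the response shown in this report.
sample = json.loads(
    '[{"author": "John Doe", "title": "OldMetaTitle", "resourceName": "author.html"}]'
)
result = add_dc_aliases(sample)
print(json.dumps(result, indent=4))
```

The original "author" key is kept alongside the copy, so consumers that already read it keep working once the server-side mapping is in place.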

      Attachments

        1. author.html
          0.4 kB
          Josh Burchard

            People

              Assignee: Unassigned
              Reporter: Josh Burchard (jmbox80)