TIKA-2755

Allow Tika to skip extraction of <img> tags in HTML

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.19.1
    • Fix Version/s: None
    • Component/s: server
    • Labels: None

    Description

      We are using Tika Server to extract text from HTML files. Tika emits the alt text of each <img> tag in the HTML as [image: this is the alt text of the image]. This text is then indexed in Solr and shows up in the results when we generate document summaries at query time (via Solr's highlighting functionality).

      If you PUT the attached HTML file to /tika, it returns the following response:

      [image: Return to the homepage]
      This is a test
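
For reference, the request described above could be built roughly as follows. This is only a sketch: `build_tika_request` is a hypothetical helper, the base URL assumes a Tika Server running locally on its default port 9998, and the sample HTML is an illustrative stand-in rather than the attached file.

```python
import urllib.request


def build_tika_request(html: bytes, base_url: str = "http://localhost:9998"):
    """Build (but do not send) a PUT request to Tika Server's /tika endpoint."""
    return urllib.request.Request(
        url=base_url + "/tika",
        data=html,
        method="PUT",
        headers={
            "Content-Type": "text/html",  # tell Tika the input is HTML
            "Accept": "text/plain",       # ask for plain-text output
        },
    )


# Illustrative stand-in for the attachment (assumption, not the real file).
sample_html = b'<html><body><img src="logo.png" alt="Return to the homepage"><p>This is a test</p></body></html>'
request = build_tika_request(sample_html)

# With a server running, the extracted text could be read like this:
# text = urllib.request.urlopen(request).read().decode("utf-8")
```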

      It would be nice to get just this instead:

      This is a test 
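
Until such an option exists, one possible client-side workaround is to strip the markers from the extracted text before indexing it. The sketch below assumes every marker has the form `[image: ...]`, sits on a single line, and contains no `]` in its alt text; `strip_image_markers` is a hypothetical helper and not part of Tika.

```python
import re

# Matches "[image: alt text]" plus trailing spaces or tabs.
# Assumption: the alt text contains no "]" and no line break.
_IMAGE_MARKER = re.compile(r"\[image:[^\]\n]*\][ \t]*")


def strip_image_markers(text: str) -> str:
    """Remove [image: ...] markers and drop lines that held only markers."""
    kept = []
    for line in text.splitlines():
        cleaned = _IMAGE_MARKER.sub("", line)
        # Skip lines emptied by the removal; keep lines that were already blank.
        if cleaned.strip() == "" and line.strip() != "":
            continue
        kept.append(cleaned)
    return "\n".join(kept)


extracted = "[image: Return to the homepage]\nThis is a test"
print(strip_image_markers(extracted))  # prints "This is a test"
```

Post-filtering like this would also remove any literal "[image: ...]" text that appears in the page body, which is one reason an extraction-time switch in Tika would be preferable.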

      Attachments

        1. TestForImageTag.html
          0.1 kB
          Harinder

          People

            Assignee: Unassigned
            Reporter: Hanjan Harinder
            Votes: 1
            Watchers: 4
