  1. James Server
  2. JAMES-2910

HTML could be indexed directly in ElasticSearch

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.4.0
    • Fix Version/s: 3.5.0
    • Component/s: elasticsearch, guice
    • Labels: None

    Description

      When Tika is disabled, the DefaultTextExtractor is used, which does not perform HTML text extraction.

      In that configuration, raw HTML markup is indexed along with the message text, which reduces search precision and greatly increases the index size.

      Proposal:

      CassandraGuice should default to JsoupTextExtractor when Tika is disabled.

      This way, HTML text extraction still takes place without Tika.
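
      The selection rule proposed above can be illustrated with a small, dependency-free sketch. It is not the James code: the `Extractor` interface, the `choose` method and both extractor constants are hypothetical names, and a naive regex-based tag stripper stands in for what a real HTML parser such as Jsoup would do (it does not decode entities and should not be used on real mail). It only shows the difference between indexing raw HTML and indexing the extracted text.

```java
import java.util.Locale;
import java.util.regex.Pattern;

final class TextExtractionSketch {

    /** Hypothetical extractor contract, not the actual James interface. */
    interface Extractor {
        String extract(String content, String contentType);
    }

    private static final Pattern SCRIPT_OR_STYLE =
        Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Pass-through behaviour: markup reaches the index unchanged. */
    static final Extractor PASS_THROUGH = (content, contentType) -> content;

    /**
     * Naive stand-in for an HTML parser: drops script/style blocks and tags
     * from text/html content, and leaves every other content type untouched.
     */
    static final Extractor HTML_AWARE = (content, contentType) -> {
        if (!contentType.toLowerCase(Locale.ROOT).startsWith("text/html")) {
            return content;
        }
        String withoutScripts = SCRIPT_OR_STYLE.matcher(content).replaceAll(" ");
        String withoutTags = TAG.matcher(withoutScripts).replaceAll(" ");
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    };

    /** Proposed rule: without Tika, fall back to an HTML-aware extractor. */
    static Extractor choose(boolean tikaEnabled, Extractor tikaBacked) {
        return tikaEnabled ? tikaBacked : HTML_AWARE;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello <b>world</b></p></body></html>";
        // PASS_THROUGH is only a placeholder for a Tika-backed extractor here.
        Extractor extractor = choose(false, PASS_THROUGH);
        System.out.println(extractor.extract(html, "text/html"));
    }
}
```

      With Tika disabled, the sketch hands "Hello world" to the indexer instead of the full markup, which is the effect the proposal is after.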

          People

            Assignee: Unassigned
            Reporter: Benoit Tellier (btellier)