Nutch / NUTCH-1749

Optionally exclude title from content field

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.7
    • Fix Version/s: 1.15
    • Component/s: parser
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      The HTML parser plugin inserts the document title into the document content. Since the title alone can be retrieved via DOMContentUtils.getTitle() and the content via DOMContentUtils.getText(), there is no need to duplicate the title in the content. When the title is included in the content, it becomes difficult or impossible to extract the document body without it. This matters when a user wants to index or display the body and the title separately.

      Attached is a patch that prevents the HTML parser plugin from including the title in the document content.
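
      To illustrate the idea (this is not the attached patch, and it does not use Nutch's DOMContentUtils), here is a minimal sketch of a DOM text walk that can optionally skip the title element. It relies only on the JDK's XML parser and assumes well-formed XHTML input; the class name TitleFreeText, the methods collectText and extractText, and the skipTitle flag are all hypothetical names chosen for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not part of Nutch or the attached patch.
final class TitleFreeText {

    // Appends the text found under `node` to `sb`.
    // When skipTitle is true, the <title> element and its children are ignored.
    static void collectText(StringBuilder sb, Node node, boolean skipTitle) {
        if (skipTitle
                && node.getNodeType() == Node.ELEMENT_NODE
                && "title".equalsIgnoreCase(node.getNodeName())) {
            return; // drop the title subtree entirely
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (sb.length() > 0) {
                    sb.append(' '); // separate text fragments with one space
                }
                sb.append(text);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(sb, children.item(i), skipTitle);
        }
    }

    // Parses well-formed XHTML (an assumption of this sketch) and returns its text.
    static String extractText(String xhtml, boolean skipTitle) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            StringBuilder sb = new StringBuilder();
            collectText(sb, doc.getDocumentElement(), skipTitle);
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>My Title</title></head>"
                + "<body><p>Body text.</p></body></html>";
        System.out.println(extractText(page, false)); // title and body together
        System.out.println(extractText(page, true));  // body only
    }
}
```

      Making the behaviour a flag rather than an unconditional change matches the "optionally" in the issue title: existing setups that depend on the title appearing in the content could keep the old behaviour.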

              People

              • Assignee: Unassigned
              • Reporter: Greg Padiasek (gregp)
              • Votes: 0
              • Watchers: 3
