Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.3.1, 1.13
    • Fix Version/s: 2.4, 1.14
    • Component/s: parser
    • Labels: None

      Description

      (initially reported with patch/pull-request by Vipul Behl, see #190)

      The parser (parse-tika and parse-html) could be improved to add line breaks between paragraphs, instead of writing the whole document into a single line.
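
      A minimal sketch of the idea, not the committed patch: while walking the DOM to collect text, emit a line break whenever a block-level element starts or ends, so each paragraph lands on its own line. The class name ParagraphBreakSketch, the method extractText, and the set of block-level element names are all assumptions made for illustration; the actual change lives in DOMContentUtils and may differ. The sketch uses only the JDK's built-in XML parser, so it expects well-formed XHTML input.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class ParagraphBreakSketch {

  // Hypothetical list of block-level elements (an assumption, not taken from the patch).
  private static final Set<String> BLOCK_ELEMENTS =
      Set.of("p", "div", "br", "li", "h1", "h2", "h3", "tr");

  /** Extracts text from well-formed XHTML, one line per block-level element. */
  public static String extractText(String xhtml) throws Exception {
    Node root = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)))
        .getDocumentElement();
    StringBuilder sb = new StringBuilder();
    walk(root, sb);
    return sb.toString().trim();
  }

  private static void walk(Node node, StringBuilder sb) {
    if (node.getNodeType() == Node.TEXT_NODE) {
      // Collapse runs of whitespace inside a text node to single spaces.
      String text = node.getNodeValue().replaceAll("\\s+", " ").trim();
      if (!text.isEmpty()) {
        // Separate adjacent text nodes by one space unless a break was just written.
        if (sb.length() > 0 && !Character.isWhitespace(sb.charAt(sb.length() - 1))) {
          sb.append(' ');
        }
        sb.append(text);
      }
      return;
    }
    boolean block = node.getNodeType() == Node.ELEMENT_NODE
        && BLOCK_ELEMENTS.contains(node.getNodeName().toLowerCase(Locale.ROOT));
    if (block) {
      appendBreak(sb); // break before the block starts
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      walk(children.item(i), sb);
    }
    if (block) {
      appendBreak(sb); // break after the block ends
    }
  }

  // Adds a newline only if text exists and the buffer does not already end with one.
  private static void appendBreak(StringBuilder sb) {
    if (sb.length() > 0 && sb.charAt(sb.length() - 1) != '\n') {
      sb.append('\n');
    }
  }
}
```

      With this approach, two adjacent paragraph elements produce two lines of text instead of one, while inline elements such as bold text stay on the same line as their surrounding paragraph.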

          Activity

          wastl-nagel Sebastian Nagel added a comment -

          A fix for 1.x is ready: https://github.com/apache/nutch/pull/196

          If there are no objections I would commit it later today to bring the Jenkins builds back to normal.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Nutch-trunk #3432 (See https://builds.apache.org/job/Nutch-trunk/3432/)
          Fix for NUTCH-2397 (improved solution contributed by Vipul Behl, closes (snagel: https://github.com/apache/nutch/commit/48c38b03f3cfb73402431f262990a6d091570e9a)

          • (edit) src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
          • (edit) src/plugin/parse-zip/src/test/org/apache/nutch/parse/zip/TestZipParser.java
          • (edit) src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
          markus17 Markus Jelsma added a comment -

          Thanks Sebastian!

          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel opened a new pull request #198: NUTCH-2397: Parser to add paragraph line breaks
          URL: https://github.com/apache/nutch/pull/198

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          wastl-nagel Sebastian Nagel added a comment -

          Patch/pull-request for 2.x...

          githubbot ASF GitHub Bot added a comment -

          sebastian-nagel closed pull request #198: NUTCH-2397: Parser to add paragraph line breaks
          URL: https://github.com/apache/nutch/pull/198

          wastl-nagel Sebastian Nagel added a comment -

          Also committed to 2.x, thanks Kaidul Islam for review!

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1592 (See https://builds.apache.org/job/Nutch-nutchgora/1592/)
          NUTCH-2397: Parser to add paragraph line breaks (snagel: https://github.com/apache/nutch/commit/aaa8099c8fe3761869f4c881fb66b2c11a2e350b)

          • (edit) src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
          • (edit) src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java

            People

            • Assignee: Unassigned
            • Reporter: wastl-nagel Sebastian Nagel
            • Votes: 0
            • Watchers: 4
