Nutch / NUTCH-802

Problems managing outlinks with large url length

    Details

    • Patch Info:
      Patch Available

      Description

      Nutch can stall while collecting outlinks if the URL of an outlink is too long.

      The maximum URL lengths accepted by the main web servers are:

      • Apache: 4,000 bytes
      • Microsoft Internet Information Server (IIS): 16,384 bytes
      • Perl HTTP::Daemon: 8,000 bytes

      URLs longer than 4,000 bytes are problematic, so a length limit should be configurable in the nutch-default.xml configuration file.

      I attached a patch.

        Activity

        Andrzej Bialecki added a comment -

        Submitting a patch is not "fixing"; the issue is fixed when the patch is accepted and applied.

        Andrzej Bialecki added a comment -

        We already have a general way to control this and other aspects of URLs, namely URLFilters. I agree that this functionality could be useful, but in the form of a URLFilter (or by adding this control to e.g. urlfilter-basic or urlfilter-validator).
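
A minimal sketch of what such a length check could look like, assuming the usual filter contract of returning the URL when it is accepted and null when it is rejected. The class name `UrlLengthFilter` and the `maxBytes` parameter are hypothetical. The class is standalone and does not implement the real Nutch plugin interface, so treat it as an illustration of the idea rather than a drop-in plugin.

```java
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical, standalone sketch of a length-based URL filter.
 * It is not part of Nutch; it only mimics a filter(String) method
 * that returns the URL if accepted and null if rejected.
 */
public class UrlLengthFilter {
    private final int maxBytes;

    public UrlLengthFilter(int maxBytes) {
        if (maxBytes <= 0) {
            throw new IllegalArgumentException("maxBytes must be positive");
        }
        this.maxBytes = maxBytes;
    }

    /** Returns the URL unchanged if it fits within maxBytes, otherwise null. */
    public String filter(String url) {
        if (url == null) {
            return null;
        }
        // Measure encoded bytes rather than chars, since the limits
        // quoted in the description are given in bytes.
        int length = url.getBytes(StandardCharsets.UTF_8).length;
        return length <= maxBytes ? url : null;
    }

    public static void main(String[] args) {
        UrlLengthFilter filter = new UrlLengthFilter(4000);

        // Build an artificially long URL for the demonstration.
        StringBuilder longUrl = new StringBuilder("http://example.com/?q=");
        for (int i = 0; i < 5000; i++) {
            longUrl.append('a');
        }

        System.out.println(filter.filter("http://example.com/page")); // accepted
        System.out.println(filter.filter(longUrl.toString()));        // rejected, prints null
    }
}
```

The 4,000-byte value used in `main` is the threshold named in the description; in a real plugin it would presumably be read from configuration.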

        Markus Jelsma added a comment -

        What are we going to do with this? Mark as won't fix? I also prefer regex as the solution.
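
The regex approach mentioned here can be sketched in plain Java, assuming a single pattern that matches any URL longer than 4,000 characters. The class name `LongUrlRegexRule` is hypothetical, and the sketch counts characters rather than bytes, which is the usual trade-off of a regex-based rule.

```java
import java.util.regex.Pattern;

/**
 * Hypothetical sketch: detect over-long URLs with one regular expression.
 * Counts characters, not bytes, so it only approximates a byte limit.
 */
public class LongUrlRegexRule {
    // Threshold taken from the issue description.
    static final int MAX_CHARS = 4000;

    // Matches any string of MAX_CHARS + 1 or more characters.
    static final Pattern TOO_LONG = Pattern.compile(".{" + (MAX_CHARS + 1) + ",}");

    /** Returns true when the whole URL is longer than MAX_CHARS characters. */
    static boolean isTooLong(String url) {
        return TOO_LONG.matcher(url).matches();
    }

    public static void main(String[] args) {
        StringBuilder longUrl = new StringBuilder("http://example.com/?q=");
        while (longUrl.length() <= MAX_CHARS) {
            longUrl.append('a');
        }
        System.out.println(isTooLong("http://example.com/")); // false
        System.out.println(isTooLong(longUrl.toString()));    // true
    }
}
```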

        Lewis John McGibbney added a comment -

        From recent user list correspondence it would appear that the community is comfortable working with urlfilters as well.

        I agree, Markus.

        Lewis John McGibbney added a comment -

        +1 for marking as won't fix. No-one seems to have touched this in ages. If someone wishes to address it in the future they can open a new issue with the more appropriate solution.

        Tejas Patil added a comment -

        Agree with Markus and Lewis. Hence marking this one as won't fix. If someone wishes to address it in the future, they can open a new issue with the more appropriate solution.


          People

          • Assignee:
            Andrzej Bialecki
            Reporter:
            Pablo Aragón
          • Votes:
            0
            Watchers:
            2
