Nutch / NUTCH-1475

Index-More Plugin -- A better fall back value for date field

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.1, 1.5.1
    • Fix Version/s: 1.7, 2.2.1
    • None
    • All

    • Patch Available

    Description

      Among other fields, the index-more plugin for Nutch 2.x provides "last modified" and "date" fields for the Solr index. The "last modified" field holds the last modified date from the HTTP headers; if the header is not available, the field is left empty. Currently, the "date" field is the same as the "last modified" field unless that field is empty, in which case getFetchTime is used as a fallback. I think getFetchTime is not a good fallback: it is the next fetch time, often a month or more in the future, which doesn't make sense for the date field. Users do not expect web pages or documents with future dates. A more sensible fallback would be the current date at the time the page is indexed.

      This is possible by simply changing line 97 of https://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java from

      time = page.getFetchTime(); // use fetch time

      to

      time = new Date().getTime();

      Users interested in the getFetchTime value can still get it from the "tstamp" field.
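
The difference between the two fallbacks can be sketched in isolation. This is a minimal, self-contained illustration and not the plugin's actual code: the class name, the helper methods, and the use of -1 as a marker for a missing last modified header are all assumptions made for the example.

```java
import java.util.Date;

public class DateFallbackSketch {
    // Hypothetical sentinel meaning "no usable last modified header was found".
    static final long UNKNOWN = -1L;

    // Behaviour described as current: fall back to the (next) fetch time.
    static long dateWithFetchTimeFallback(long lastModified, long nextFetchTime) {
        return lastModified != UNKNOWN ? lastModified : nextFetchTime;
    }

    // Behaviour proposed in this issue: fall back to the time of indexing.
    static long dateWithIndexTimeFallback(long lastModified, long indexTime) {
        return lastModified != UNKNOWN ? lastModified : indexTime;
    }

    public static void main(String[] args) {
        long now = new Date().getTime();
        // Assume, for illustration, a next fetch scheduled 30 days from now.
        long nextFetch = now + 30L * 24 * 60 * 60 * 1000;

        long oldDate = dateWithFetchTimeFallback(UNKNOWN, nextFetch);
        long newDate = dateWithIndexTimeFallback(UNKNOWN, now);

        System.out.println("fetch-time fallback: " + new Date(oldDate));
        System.out.println("index-time fallback: " + new Date(newDate));
    }
}
```

With a missing header, the first helper yields a date in the future, while the second yields the moment of indexing. When a last modified value is present, both return it unchanged.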

      Attachments

        1. index-more-1xand2x.patch
          1 kB
          James Sullivan
        2. index-more-2x.patch
          0.9 kB
          James Sullivan
        3. index-more-2x.patch
          0.7 kB
          James Sullivan
        4. NUTCH-1475-trunk-v1.patch
          0.9 kB
          Sebastian Nagel
        5. NUTCH-1475-trunk-v2.patch
          1.0 kB
          Sebastian Nagel

          People

            Assignee: Sebastian Nagel (snagel)
            Reporter: James Sullivan (sully)

              Time Tracking

                Original Estimate: 1h
                Remaining Estimate: 1h
                Time Spent: Not Specified