Nutch / NUTCH-1547

BasicIndexingFilter - Problem to index full title

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.6, 2.1
    • Fix Version/s: 1.7, 2.2
    • Component/s: indexer
    • Labels: None

    Description

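      I ran into this issue when trying to index the entire title, as can be done for the content, by setting indexer.max.title.length to -1 in nutch-default.xml. I think a value of -1 should mean "no limit" for the title, the same as it does for the content. Instead, indexing fails with the exception shown in the stack trace below.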

      To fix it, replace line 90 of BasicIndexingFilter.java:

      if (title.length() > MAX_TITLE_LENGTH) { // truncate title if needed

      with this one:

      if (MAX_TITLE_LENGTH > -1 && title.length() > MAX_TITLE_LENGTH) { // truncate title if needed

      Stack Trace:

      java.lang.StringIndexOutOfBoundsException: String index out of range: -1
      at java.lang.String.substring(String.java:1937)
      at org.apache.nutch.indexer.basic.BasicIndexingFilter.filter(BasicIndexingFilter.java:91)
      at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:109)
      at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:272)
      at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:53)
      at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519)
      at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
      at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
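
The trace points at String.substring being called with an end index of -1, which is what the unguarded check leads to when the limit is -1. The following is a minimal, standalone sketch of the guarded check proposed above, not the actual Nutch code: the class TitleTruncation and the helper truncateTitle are hypothetical names used only for illustration, and the substring(0, limit) call is an assumption inferred from the stack trace.

```java
class TitleTruncation {

    // Hypothetical helper (not part of Nutch) that mirrors the guarded check
    // proposed in this issue. A negative limit such as -1 means "no limit".
    static String truncateTitle(String title, int maxTitleLength) {
        if (maxTitleLength > -1 && title.length() > maxTitleLength) {
            // Truncate only when a non-negative limit is configured and exceeded.
            return title.substring(0, maxTitleLength);
        }
        return title;
    }

    public static void main(String[] args) {
        String title = "A fairly long page title";

        // With -1 the title is kept whole.
        System.out.println(truncateTitle(title, -1));

        // With a non-negative limit the title is cut to that many characters.
        System.out.println(truncateTitle(title, 8));

        // Without the guard, title.length() > -1 is always true, so the
        // truncation runs with an end index of -1 and substring throws.
        try {
            title.substring(0, -1);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("Unguarded truncation failed: " + e.getClass().getSimpleName());
        }
    }
}
```

With the guard in place, a limit of 0 still truncates the title to an empty string; only negative values disable truncation.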

      Cheers.

      Attachments

        1. NUTCH-1547-2x.patch
          1 kB
          lufeng
        2. NUTCH-1547.patch
          1 kB
          lufeng

          People

            Assignee: lufeng (amuseme.lu)
            Reporter: Gustavo Rauber (rauber)
            Votes: 0
            Watchers: 5

              Time Tracking

                Estimated: 1h
                Remaining: 1h
                Logged: Not Specified