Nutch / NUTCH-1190

MoreIndexingFilter refactor: move data formats used to parse "lastModified" to a config file.

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.18
    • Component/s: indexer, plugin
    • Labels: None
    • Environment: jdk6
    • Patch Info: Patch Available

    Description

      There are many issues about missing date formats:
      NUTCH-871
      NUTCH-912
      NUTCH-1015

      The date formats can be diverse, so why not move them to an extra config file?
      I moved all the date formats from "MoreIndexingFilter.java" to a file named "date-styles.txt" (placed in "conf"), which is loaded on startup.

        public void setConf(Configuration conf) {
          this.conf = conf;
          MIME = new MimeUtil(conf);

          // Load the date formats from conf/date-styles.txt, one pattern per line.
          URL res = conf.getResource("date-styles.txt");
          if (res == null) {
            LOG.error("Can't find resource: date-styles.txt");
          } else {
            try {
              List<String> lines = FileUtils.readLines(new File(res.getFile()));
              List<String> styles = new ArrayList<String>();
              for (String line : lines) {
                String dateStyle = StringUtils.trim(line);
                // Skip blank lines and "#" comment lines.
                if (StringUtils.isBlank(dateStyle) || dateStyle.startsWith("#")) {
                  continue;
                }
                styles.add(dateStyle);
              }
              dateStyles = styles.toArray(new String[styles.size()]);
            } catch (IOException e) {
              // Keep the cause in the log so a broken config file can be diagnosed.
              LOG.error("Failed to load resource: date-styles.txt", e);
            }
          }
        }
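
      The filtering rules above (trim each line, skip blank lines, skip lines starting with "#") can be tried in isolation with the sketch below. It uses only the JDK; the class name DateStyleLoader, the method readStyles, and the sample file contents are hypothetical and are not taken from the patch. It reads from a Reader rather than a File, which is one possible way to keep the logic independent of where the resource lives.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the patch.
final class DateStyleLoader {

  /** Reads one date pattern per line, skipping blank lines and "#" comments. */
  static List<String> readStyles(Reader source) throws IOException {
    List<String> styles = new ArrayList<String>();
    BufferedReader reader = new BufferedReader(source);
    String line;
    while ((line = reader.readLine()) != null) {
      String style = line.trim();
      if (style.isEmpty() || style.startsWith("#")) {
        continue; // nothing to keep on this line
      }
      styles.add(style);
    }
    return styles;
  }

  public static void main(String[] args) throws IOException {
    // Assumed sample contents; the real date-styles.txt is attached to the issue.
    String sample = "# RFC 1123 style\n"
        + "EEE, dd MMM yyyy HH:mm:ss zzz\n"
        + "\n"
        + "  yyyy-MM-dd HH:mm:ss  \n";
    for (String style : readStyles(new StringReader(sample))) {
      System.out.println(style);
    }
  }
}
```

      With the sample input, only the two pattern lines survive, each with surrounding whitespace removed.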
      

      Then parse "lastModified" like this (sample):

        private long getTime(String date, String url) {
          ......
          Date parsedDate = DateUtils.parseDate(date, dateStyles);
          time = parsedDate.getTime();
          ......
          return time;
        }
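
      To illustrate what parsing against a list of patterns means, here is a minimal JDK-only sketch that tries each pattern in order and returns the first full match. It stands in for the DateUtils.parseDate call and is not the library's implementation; the class name MultiPatternDateParser, the method parseTime, the two patterns, and the GMT default for zone-less input are all assumptions made for the example.

```java
import java.text.ParseException;
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, not part of the patch.
final class MultiPatternDateParser {

  /** Returns epoch millis for the first pattern that consumes the whole input. */
  static long parseTime(String date, String[] patterns) throws ParseException {
    for (String pattern : patterns) {
      SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ENGLISH);
      // Assumption: treat input without an explicit zone as GMT.
      format.setTimeZone(TimeZone.getTimeZone("GMT"));
      ParsePosition pos = new ParsePosition(0);
      Date parsed = format.parse(date, pos);
      if (parsed != null && pos.getIndex() == date.length()) {
        return parsed.getTime();
      }
    }
    throw new ParseException("Unable to parse date: " + date, 0);
  }

  public static void main(String[] args) throws ParseException {
    // Assumed patterns; in the patch they come from date-styles.txt.
    String[] patterns = { "EEE, dd MMM yyyy HH:mm:ss zzz", "yyyy-MM-dd HH:mm:ss" };
    System.out.println(parseTime("Thu, 01 Jan 1970 00:00:01 GMT", patterns));
    System.out.println(parseTime("1970-01-01 00:00:02", patterns));
  }
}
```

      Because patterns are tried in order, more specific patterns are probably best listed before looser ones in the config file.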
      

      This patch also includes the patch for NUTCH-1140.
      Find more details in the patch file.

      Attachments

        1. NUTCH-1190-trunk.patch
          8 kB
          Lewis John McGibbney
        2. date-styles.txt
          0.6 kB
          Zhang JinYan
        3. MoreIndexingFilter.patch
          6 kB
          Zhang JinYan

            People

              Unassigned Unassigned
              yearn20m Zhang JinYan