Nutch / NUTCH-1190

MoreIndexingFilter refactor: move date formats used to parse "lastModified" to a config file.

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.4
    • Fix Version/s: 1.10
    • Component/s: indexer
    • Labels:
      None
    • Environment:

      jdk6

    • Patch Info:
      Patch Available

      Description

      There are many issues about missing date formats:
      NUTCH-871
      NUTCH-912
      NUTCH-1015

      The date formats can be diverse, so why not move them to a separate config file?
      I moved all the date formats from "MoreIndexingFilter.java" to a file named "date-styles.txt" (placed in "conf"), which is loaded on startup.

        public void setConf(Configuration conf) {
          this.conf = conf;
          MIME = new MimeUtil(conf);

          // Load the date formats from conf/date-styles.txt, one pattern per line.
          URL res = conf.getResource("date-styles.txt");
          if (res == null) {
            LOG.error("Can't find resource: date-styles.txt");
          } else {
            try {
              List<String> lines = FileUtils.readLines(new File(res.getFile()));
              List<String> styles = new ArrayList<String>();
              for (String line : lines) {
                String dateStyle = StringUtils.trim(line);
                // Skip blank lines and "#" comment lines.
                if (StringUtils.isBlank(dateStyle) || dateStyle.startsWith("#")) {
                  continue;
                }
                styles.add(dateStyle);
              }
              dateStyles = styles.toArray(new String[styles.size()]);
            } catch (IOException e) {
              // Pass the exception along so the cause shows up in the log.
              LOG.error("Failed to load resource: date-styles.txt", e);
            }
          }
        }
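
      The filtering step above can be sketched as a small standalone program. This is not the code from the patch: the class name DateStyleLoader, the method load, and the sample patterns are assumptions made for illustration. Reading from a Reader rather than a File would also cover the case where the resource sits inside a jar, where new File(res.getFile()) may not resolve to a real path.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch or of the attached patch.
public class DateStyleLoader {

  /** Reads one date pattern per line, skipping blank lines and "#" comments. */
  public static String[] load(Reader in) throws IOException {
    List<String> styles = new ArrayList<String>();
    BufferedReader reader = new BufferedReader(in);
    String line;
    while ((line = reader.readLine()) != null) {
      String style = line.trim();
      if (style.isEmpty() || style.startsWith("#")) {
        continue;
      }
      styles.add(style);
    }
    return styles.toArray(new String[styles.size()]);
  }

  public static void main(String[] args) throws IOException {
    // Assumed sample content; the real date-styles.txt is attached to the issue.
    String sample = "# RFC 1123 style\n"
        + "EEE, dd MMM yyyy HH:mm:ss zzz\n"
        + "\n"
        + "  yyyy-MM-dd HH:mm:ss  \n";
    for (String style : load(new StringReader(sample))) {
      System.out.println(style);
    }
  }
}
```

      Building a new list, instead of removing entries from the list being iterated, avoids the index bookkeeping (i--) that the in-place version needs.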
      

      Then parse "lastModified" like this (sample):

        private long getTime(String date, String url) {
          ......
          Date parsedDate = DateUtils.parseDate(date, dateStyles);
          time = parsedDate.getTime();
          ......
          return time;
        }
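
      To illustrate what trying a list of patterns in order looks like, here is a minimal sketch using only java.text.SimpleDateFormat. It is not the patch code and not the commons-lang implementation of DateUtils.parseDate: the class MultiFormatDateParser, the fixed English locale, the UTC default time zone, and the two sample patterns are all assumptions.

```java
import java.text.ParseException;
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, not part of Nutch or of the attached patch.
public class MultiFormatDateParser {

  /** Tries each pattern in order and returns the first full match as epoch millis. */
  public static long parse(String date, String[] patterns) throws ParseException {
    for (String pattern : patterns) {
      SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ENGLISH);
      // Assumption: dates without an explicit zone are interpreted as UTC.
      format.setTimeZone(TimeZone.getTimeZone("UTC"));
      ParsePosition pos = new ParsePosition(0);
      Date parsed = format.parse(date, pos);
      // Accept the result only if the pattern consumed the whole input.
      if (parsed != null && pos.getIndex() == date.length()) {
        return parsed.getTime();
      }
    }
    throw new ParseException("Unable to parse the date: " + date, -1);
  }

  public static void main(String[] args) throws ParseException {
    // Assumed sample patterns, in the order they would be tried.
    String[] patterns = { "EEE, dd MMM yyyy HH:mm:ss zzz", "yyyy-MM-dd HH:mm:ss" };
    System.out.println(parse("Thu, 01 Jan 1970 00:00:00 GMT", patterns));
    System.out.println(parse("1970-01-02 00:00:00", patterns));
  }
}
```

      Because the patterns are tried in order, the more specific formats would normally be listed first in the config file.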
      

      This patch also contains the patch for NUTCH-1140.
      Find more details in the patch file.

      1. date-styles.txt (0.6 kB, Zhang JinYan)
      2. MoreIndexingFilter.patch (6 kB, Zhang JinYan)
      3. NUTCH-1190-trunk.patch (8 kB, Lewis John McGibbney)

            People

            • Assignee: Unassigned
            • Reporter: Zhang JinYan
            • Votes: 0
            • Watchers: 2
