Nutch / NUTCH-1190

MoreIndexingFilter refactor: move date formats used to parse "lastModified" to a config file.

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.4
    • Fix Version/s: 1.10
    • Component/s: indexer
    • Labels:
      None
    • Environment:

      jdk6

    • Patch Info:
      Patch Available

      Description

      There are many issues about missing date formats:
      NUTCH-871
      NUTCH-912
      NUTCH-1015

      The date formats can be diverse, so why not move them to a separate config file?
      I moved all the date formats from "MoreIndexingFilter.java" to a file named "date-styles.txt" (placed in "conf"), which is loaded on startup.

        public void setConf(Configuration conf) {
          this.conf = conf;
          MIME = new MimeUtil(conf);

          // Load the date patterns from conf/date-styles.txt, one pattern per line.
          URL res = conf.getResource("date-styles.txt");
          if (res == null) {
            LOG.error("Can't find resource: date-styles.txt");
          } else {
            try {
              // Note: res.getFile() only works if the resource is a plain file on disk.
              List lines = FileUtils.readLines(new File(res.getFile()));
              for (int i = 0; i < lines.size(); i++) {
                String dateStyle = (String) lines.get(i);
                // Drop blank lines.
                if (StringUtils.isBlank(dateStyle)) {
                  lines.remove(i);
                  i--;
                  continue;
                }
                dateStyle = StringUtils.trim(dateStyle);
                // Drop comment lines starting with '#'.
                if (dateStyle.startsWith("#")) {
                  lines.remove(i);
                  i--;
                  continue;
                }
                lines.set(i, dateStyle);
              }
              dateStyles = new String[lines.size()];
              lines.toArray(dateStyles);
            } catch (IOException e) {
              // Pass the exception along so the cause shows up in the log.
              LOG.error("Failed to load resource: date-styles.txt", e);
            }
          }
        }
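
For illustration only, the same filtering (skip blank lines and "#" comments, trim the rest) can be sketched with nothing but the JDK. This is not the code from the attached patch: the class name `DateStyleLoader`, the method `load`, and the sample file content are made up for the example. Reading from a `Reader` instead of a `File` is an assumption that may help when the resource is not a plain file on disk; Hadoop's `Configuration.getConfResourceAsReader` could probably supply such a reader.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the attached patch.
class DateStyleLoader {

  /** Reads one date pattern per line, skipping blank lines and '#' comments. */
  static List<String> load(Reader in) throws IOException {
    List<String> styles = new ArrayList<String>();
    BufferedReader reader = new BufferedReader(in);
    String line;
    while ((line = reader.readLine()) != null) {
      String style = line.trim();
      if (style.isEmpty() || style.startsWith("#")) {
        continue; // nothing to keep on this line
      }
      styles.add(style);
    }
    return styles;
  }

  public static void main(String[] args) throws IOException {
    // Made-up content; the real date-styles.txt is in the attachment.
    String sample = "# HTTP dates\n"
        + "EEE, dd MMM yyyy HH:mm:ss zzz\n"
        + "\n"
        + "  yyyy-MM-dd HH:mm:ss  \n";
    for (String style : load(new StringReader(sample))) {
      System.out.println(style);
    }
  }
}
```

Collecting the kept lines into a new list avoids removing elements from the list while iterating over it.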
      

      Then parse "lastModified" like this (sample):

        private long getTime(String date, String url) {
          ......
          Date parsedDate = DateUtils.parseDate(date, dateStyles);
          time = parsedDate.getTime();
          ......
          return time;
        }
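
As a rough, JDK-only sketch of what trying several patterns in turn amounts to (the patch itself relies on `DateUtils.parseDate`): the class `MultiPatternDateParser`, the method `parseTime`, the `-1` return value for "no pattern matched", and the GMT default are all assumptions made for this example, not behaviour taken from the patch.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper for illustration; not part of the attached patch.
class MultiPatternDateParser {

  /**
   * Returns epoch milliseconds for the first pattern that matches the
   * whole input, or -1 if no pattern matches (assumed sentinel).
   */
  static long parseTime(String date, String[] dateStyles) {
    for (String style : dateStyles) {
      SimpleDateFormat format = new SimpleDateFormat(style, Locale.ENGLISH);
      // Assumed default zone for inputs that carry no zone of their own.
      format.setTimeZone(TimeZone.getTimeZone("GMT"));
      format.setLenient(false);
      ParsePosition pos = new ParsePosition(0);
      Date parsed = format.parse(date, pos);
      // Accept only if the pattern consumed the entire string.
      if (parsed != null && pos.getIndex() == date.length()) {
        return parsed.getTime();
      }
    }
    return -1L;
  }

  public static void main(String[] args) {
    String[] styles = {"EEE, dd MMM yyyy HH:mm:ss zzz", "yyyy-MM-dd"};
    System.out.println(parseTime("Thu, 01 Jan 1970 00:00:00 GMT", styles));
    System.out.println(parseTime("not a date", styles));
  }
}
```

With the patterns kept in a file, supporting a new date format would only need a new line in "date-styles.txt" rather than a code change.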
      

      This patch also contains the patch for NUTCH-1140.
      Find more details in the patch file.

      1. date-styles.txt
        0.6 kB
        Zhang JinYan
      2. MoreIndexingFilter.patch
        6 kB
        Zhang JinYan
      3. NUTCH-1190-trunk.patch
        8 kB
        Lewis John McGibbney

        Issue Links

          Activity

          Zhang JinYan created issue -
          Zhang JinYan made changes -
          Field Original Value New Value
          Attachment MoreIndexingFilter.patch [ 12501786 ]
          Attachment date-styles.txt [ 12501787 ]
          Zhang JinYan made changes -
          Description Added the note that "date-styles.txt" is placed in "conf"; the text is otherwise unchanged.
          Lewis John McGibbney made changes -
          Fix Version/s 1.7 [ 12323281 ]
          Fix Version/s 2.2 [ 12323285 ]
          Lewis John McGibbney made changes -
          Link This issue blocks NUTCH-1015 [ NUTCH-1015 ]
          Lewis John McGibbney made changes -
          Attachment NUTCH-1190-trunk.patch [ 12564583 ]
          Gavin made changes -
          Link This issue blocks NUTCH-1015 [ NUTCH-1015 ]
          Gavin made changes -
          Link This issue is depended upon by NUTCH-1015 [ NUTCH-1015 ]
          Lewis John McGibbney made changes -
          Fix Version/s 2.3 [ 12324325 ]
          Fix Version/s 1.7 [ 12323281 ]
          Fix Version/s 2.2 [ 12323285 ]
          Sebastian Nagel made changes -
          Fix Version/s 1.8 [ 12324326 ]
          Lewis John McGibbney made changes -
          Fix Version/s 1.9 [ 12324611 ]
          Fix Version/s 2.3 [ 12324325 ]
          Fix Version/s 1.8 [ 12324326 ]
          Julien Nioche made changes -
          Fix Version/s 1.10 [ 12327187 ]
          Fix Version/s 1.9 [ 12324611 ]

            People

            • Assignee:
              Unassigned
              Reporter:
              Zhang JinYan
            • Votes:
              0
              Watchers:
              2
