Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: update
    • Labels:

      Description

      An update processor that analyzes a URL and writes the results to other fields: length, number of levels, isTopLevel (true/false), host part, path part, canonicalized URL, etc.

      Kindly donated by Oslo University

      1. SOLR-2826.patch
        18 kB
        Jan Høydahl
      2. SOLR-2826_remove_dead_code.patch
        2 kB
        Jan Høydahl

        Activity

        Jan Høydahl added a comment -

        Here's the code. This code has been running in production for months.

        Sample config:

        <processor class="org.apache.solr.update.processor.URLClassifyProcessorFactory">
          <bool name="enabled">true</bool>
          <str name="inputField">myUrl</str>
          <str name="domainOutputField">host</str>
          <str name="canonicalUrlOutputField">canonicalurl</str>
        </processor>
        

        This reads the URL from the field "myUrl", analyzes it, and writes the host name to "host", a canonical (normalized) version of the URL to "canonicalurl", the URL length to "url_length", and the number of levels in the URL to "url_levels". If the URL is a top-level URL, it writes "1" to the field "url_toplevel"; if it looks like a landing page, e.g. index.html, it writes "1" to the field "url_landingpage"...
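To make the field mapping above concrete, here is a minimal, self-contained sketch of this kind of URL analysis. It is not the code from the attached patch: the class name `UrlClassifySketch`, the helper methods, the list of landing-page names, and the exact rules for counting levels and canonicalizing are all assumptions made for illustration. Only the output field names come from the description above.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-alone sketch; it does not use or mirror the Solr classes in the patch.
final class UrlClassifySketch {

    // Assumed landing-page file names; the list used by the real processor is not shown in this issue.
    private static final Set<String> LANDING_PAGE_NAMES =
        new HashSet<>(Arrays.asList("index.html", "index.htm", "default.html"));

    // Analyzes an absolute URL and returns the derived values keyed by output field name.
    static Map<String, Object> classify(String url) {
        URI uri = URI.create(url).normalize(); // normalize() resolves "." and ".." path segments
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("Expected an absolute URL with a host: " + url);
        }
        String path = (uri.getPath() == null || uri.getPath().isEmpty()) ? "/" : uri.getPath();
        int levels = countLevels(path);

        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("host", uri.getHost().toLowerCase(Locale.ROOT));
        fields.put("canonicalurl", canonicalize(uri, path));
        fields.put("url_length", url.length());
        fields.put("url_levels", levels);
        // The description only says "1" is written when the condition holds;
        // what happens otherwise is not stated, so this sketch writes nothing.
        if (levels == 0 && uri.getQuery() == null) {
            fields.put("url_toplevel", 1);
        }
        if (isLandingPage(path)) {
            fields.put("url_landingpage", 1);
        }
        return fields;
    }

    // Assumption: a "level" is one non-empty path segment.
    static int countLevels(String path) {
        int count = 0;
        for (String segment : path.split("/")) {
            if (!segment.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    // Assumption: a landing page is any path whose last segment is a known index file name.
    static boolean isLandingPage(String path) {
        String last = path.substring(path.lastIndexOf('/') + 1).toLowerCase(Locale.ROOT);
        return LANDING_PAGE_NAMES.contains(last);
    }

    // Assumed canonical form: lower-case scheme and host, no user info, no fragment.
    static String canonicalize(URI uri, String path) {
        try {
            return new URI(uri.getScheme().toLowerCase(Locale.ROOT), null,
                uri.getHost().toLowerCase(Locale.ROOT), uri.getPort(),
                path, uri.getQuery(), null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

Under these assumptions, `UrlClassifySketch.classify("http://example.com/")` yields a map with `host`, `canonicalurl`, `url_length`, `url_levels` and `url_toplevel`, but no `url_landingpage` entry. The actual processor may count levels or normalize URLs differently.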

        Jan Høydahl added a comment -

        Unless there are comments in a day or two, I'll commit this, as it has good tests and is proven in production.

        Koji Sekiguchi added a comment -

        Can we get rid of landingPageSuffixesSet in the URLClassifyProcessor constructor?

        public URLClassifyProcessor(SolrParams parameters,
            SolrQueryRequest request,
            SolrQueryResponse response,
            UpdateRequestProcessor nextProcessor) {
          super(nextProcessor);
          
          HashSet<String> landingPageSuffixesSet = new HashSet<String>();
          for(String s : landingPageSuffixes) {
            landingPageSuffixesSet.add(s);
          }
          this.initParameters(parameters);
        }
        
        Jan Høydahl added a comment -

        This patch removes the dead, unused code and fixes a typo

        Jan Høydahl added a comment -

        Dead code patch checked in to trunk and branch_3x


          People

          • Assignee:
            Jan Høydahl
            Reporter:
            Jan Høydahl
          • Votes:
            1
            Watchers:
            0
