Nutch / NUTCH-2231

Jexl support in generator job


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.11
    • Fix Version/s: 1.12
    • Component/s: None
    • Labels:
      None

      Description

      The generator should support Jexl expressions. This would make it much easier to implement focused crawlers that rely on information stored in the CrawlDB. With the HostDB it is already possible to restrict the generator to selecting only interesting records, but doing so is very cumbersome and involves domain-blacklist URL filtering.

      With Jexl support, it is no hassle!

      Crawl only English records:

      bin/nutch generate crawl/crawldb/ crawl/segments/ -expr "(lang == 'en')"
      

      Crawl only HTML records:

      bin/nutch generate crawl/crawldb/ crawl/segments/ -expr "(Content_Type == 'text/html' || Content_Type == 'application/xhtml+xml')"
      

      Keep in mind:

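      • Jexl doesn't allow a hyphen/minus in a field identifier, so hyphens in field names are transformed to underscores (for example, Content-Type becomes Content_Type)
      • String literals must be in quotes; within a literal, only the surrounding quote character needs to be escaped with a backslash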
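
The two notes above can be sketched in shell. This is only an illustration under stated assumptions: `jexl_field` is a hypothetical helper, not part of Nutch, that derives the identifier to use in an expression from a metadata key. The command is printed rather than executed, so nothing is run against a CrawlDB.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): derive the identifier to use in
# a Jexl expression by replacing hyphens in a field name with underscores.
jexl_field() {
  printf '%s' "$1" | tr '-' '_'
}

field=$(jexl_field "Content-Type")

# String literals use single quotes; the whole expression is later wrapped
# in double quotes for the shell, so no backslash escaping is needed here.
jexl_expr="(${field} == 'text/html')"

# Print the generate command instead of running it.
printf '%s\n' "bin/nutch generate crawl/crawldb/ crawl/segments/ -expr \"${jexl_expr}\""
```

Wrapping the expression in double quotes and its string literals in single quotes mirrors the examples in the description and avoids the need for backslashes.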

        Attachments

        1. NUTCH-2231.patch
          12 kB
          Markus Jelsma
        2. NUTCH-2231.patch
          13 kB
          Markus Jelsma


              People

              • Assignee:
                markus17 Markus Jelsma
                Reporter:
                markus17 Markus Jelsma