NUTCH-1959: Improving CommonCrawlFormat implementations

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.9
    • Fix Version/s: 1.10
    • Component/s: None
    • Labels: None

      Description

      CommonCrawlFormat is an interface for Java classes that implement methods for writing data into Common Crawl format. AbstractCommonCrawlFormat is an abstract class that implements CommonCrawlFormat and provides abstract methods for "CommonCrawl formatter" classes.
      The attached patch includes some improvements for CommonCrawlFormat-based classes:

      • CommonCrawlFormat and AbstractCommonCrawlFormat now provide only the getJsonData() method, which returns the JSON data.
      • AbstractCommonCrawlFormat also provides the abstract methods that each subclass must implement in order to handle JSON objects.
      • CommonCrawlFormatSimple is a StringBuilder-based formatter that now also escapes JSON string values.
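
      To make the design above easier to discuss, here is a minimal, self-contained sketch of the same pattern. It is not the code in the patch: the type names (JsonFormat, AbstractJsonFormat, SimpleJsonFormat), the abstract method names, and the url/headers fields are all assumptions chosen for illustration. Only the overall shape is taken from the description: a single getJsonData() entry point, abstract JSON-handling methods in the base class, and a StringBuilder-based subclass that escapes string values.

```java
import java.util.Map;

// Hypothetical stand-in for the interface: a single entry point.
interface JsonFormat {
  String getJsonData();
}

// Hypothetical stand-in for the abstract class: it fixes the document
// layout and delegates the low-level JSON writing to subclasses.
abstract class AbstractJsonFormat implements JsonFormat {
  private final String url;
  private final Map<String, String> headers;

  AbstractJsonFormat(String url, Map<String, String> headers) {
    this.url = url;
    this.headers = headers;
  }

  // Methods each formatter has to implement (names are assumptions).
  protected abstract void startObject(String key);
  protected abstract void closeObject();
  protected abstract void writeKeyValue(String key, String value);
  protected abstract String generateJson();

  @Override
  public String getJsonData() {
    startObject(null);                 // root object
    writeKeyValue("url", url);
    startObject("headers");            // nested object
    for (Map.Entry<String, String> e : headers.entrySet()) {
      writeKeyValue(e.getKey(), e.getValue());
    }
    closeObject();
    closeObject();
    return generateJson();
  }
}

// StringBuilder-based formatter that escapes JSON string values.
final class SimpleJsonFormat extends AbstractJsonFormat {
  private final StringBuilder sb = new StringBuilder();

  SimpleJsonFormat(String url, Map<String, String> headers) {
    super(url, headers);
  }

  // Escapes quotes, backslashes and control characters in a JSON string.
  static String escape(String s) {
    StringBuilder out = new StringBuilder(s.length() + 8);
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      switch (c) {
        case '"':  out.append("\\\""); break;
        case '\\': out.append("\\\\"); break;
        case '\n': out.append("\\n");  break;
        case '\r': out.append("\\r");  break;
        case '\t': out.append("\\t");  break;
        case '\b': out.append("\\b");  break;
        case '\f': out.append("\\f");  break;
        default:
          if (c < 0x20) out.append(String.format("\\u%04x", (int) c));
          else out.append(c);
      }
    }
    return out.toString();
  }

  // Adds a separating comma unless this is the first member of an object.
  private void writeKey(String key) {
    if (sb.length() > 0 && sb.charAt(sb.length() - 1) != '{') sb.append(',');
    if (key != null) sb.append('"').append(escape(key)).append("\":");
  }

  @Override protected void startObject(String key) { writeKey(key); sb.append('{'); }
  @Override protected void closeObject() { sb.append('}'); }

  @Override protected void writeKeyValue(String key, String value) {
    writeKey(key);
    sb.append('"').append(escape(value)).append('"');
  }

  // Returns the accumulated JSON and resets the buffer for reuse.
  @Override protected String generateJson() {
    String json = sb.toString();
    sb.setLength(0);
    return json;
  }
}
```

      With this shape, callers only ever see getJsonData(), so a formatter backed by a JSON library could replace the StringBuilder-based one without touching calling code.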

      This patch aims to provide a better interface for implementing and extending CommonCrawlFormat classes.

      I would really appreciate your feedback.
      Thanks a lot,
      Giuseppe

        Attachments

        1. NUTCH-1959.patch
          41 kB
          Giuseppe Totaro
        2. NUTCH-1959.v02.patch
          41 kB
          Lewis John McGibbney

              People

              • Assignee: Chris A. Mattmann (chrismattmann)
              • Reporter: Giuseppe Totaro (gostep)
              • Votes: 0
              • Watchers: 4
