NUTCH-1959: Improving CommonCrawlFormat implementations

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.9
    • Fix Version/s: 1.10
    • Component/s: None
    • Labels: None

    Description

      CommonCrawlFormat is an interface for Java classes that implement methods for writing data into Common Crawl format. AbstractCommonCrawlFormat is an abstract class that implements CommonCrawlFormat and provides abstract methods for "CommonCrawl formatter" classes.
      The attached patch includes some improvements for CommonCrawlFormat-based classes:

      • CommonCrawlFormat and AbstractCommonCrawlFormat now expose only the getJsonData() method, which returns the JSON data.
      • AbstractCommonCrawlFormat also declares the abstract methods that each subclass has to implement in order to handle JSON objects.
      • CommonCrawlFormatSimple is a StringBuilder-based formatter that now also escapes JSON string values.
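
      To illustrate the escaping mentioned in the last point, here is a minimal sketch of how a StringBuilder-based formatter might escape JSON string values. The class and method names (JsonEscapeSketch, escape) are hypothetical and are not taken from the patch; the rules follow the standard JSON string grammar.

```java
// Hypothetical helper, not the code from the patch: escapes a Java string
// so that it can be placed between double quotes in a JSON document.
final class JsonEscapeSketch {
    private JsonEscapeSketch() {}

    static String escape(String value) {
        StringBuilder sb = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;  // double quote
                case '\\': sb.append("\\\\"); break;  // backslash
                case '\b': sb.append("\\b");  break;  // backspace
                case '\f': sb.append("\\f");  break;  // form feed
                case '\n': sb.append("\\n");  break;  // line feed
                case '\r': sb.append("\\r");  break;  // carriage return
                case '\t': sb.append("\\t");  break;  // tab
                default:
                    if (c < 0x20) {
                        // Remaining control characters become a four-digit hex escape.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }
}
```

      A formatter that builds its output by hand needs a step like this, because an unescaped quote or newline inside a value would produce invalid JSON.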

      This patch aims to provide a better interface for implementing and extending CommonCrawlFormat classes.
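
      The layering described above could look roughly like the following sketch. Only getJsonData() is named in this issue; every other name here (FormatSketch, AbstractFormatSketch, SimpleFormatSketch, startObject, writeKeyValue, closeObject, generateJson) is an assumption made for illustration and may differ from what the patch actually contains.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for CommonCrawlFormat: a single public entry point.
interface FormatSketch {
    String getJsonData();
}

// Hypothetical stand-in for AbstractCommonCrawlFormat: it fixes the order in
// which the document is built and leaves the JSON primitives to subclasses.
abstract class AbstractFormatSketch implements FormatSketch {
    private final Map<String, String> fields;

    AbstractFormatSketch(Map<String, String> fields) {
        this.fields = fields;
    }

    protected abstract void startObject();
    protected abstract void writeKeyValue(String key, String value);
    protected abstract void closeObject();
    protected abstract String generateJson();

    @Override
    public final String getJsonData() {
        startObject();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            writeKeyValue(e.getKey(), e.getValue());
        }
        closeObject();
        return generateJson();
    }
}

// Hypothetical StringBuilder-based subclass. Escaping of keys and values is
// left out here to keep the sketch short; a real formatter would need it.
final class SimpleFormatSketch extends AbstractFormatSketch {
    private final StringBuilder sb = new StringBuilder();
    private boolean first = true;

    SimpleFormatSketch(Map<String, String> fields) {
        super(fields);
    }

    @Override
    protected void startObject() {
        sb.setLength(0); // reset so repeated calls do not accumulate output
        first = true;
        sb.append('{');
    }

    @Override
    protected void writeKeyValue(String key, String value) {
        if (!first) {
            sb.append(',');
        }
        first = false;
        sb.append('"').append(key).append("\":\"").append(value).append('"');
    }

    @Override
    protected void closeObject() {
        sb.append('}');
    }

    @Override
    protected String generateJson() {
        return sb.toString();
    }
}
```

      With this shape, a new formatter (for example one backed by a JSON library) only has to supply the primitive write methods, while callers keep using getJsonData().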

      I would really appreciate your feedback.
      Thanks a lot,
      Giuseppe

      Attachments

        1. NUTCH-1959.v02.patch
          41 kB
          Lewis John McGibbney
        2. NUTCH-1959.patch
          41 kB
          Giuseppe Totaro

            People

              Assignee: Chris A. Mattmann (chrismattmann)
              Reporter: Giuseppe Totaro (gostep)
              Votes: 0
              Watchers: 4
