  1. Nutch
  2. NUTCH-1997

Add CBOR "magic header" to CommonCrawlDataDumper output

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.10
    • Component/s: tool
    • Labels:

      Description

      For each file extracted from Nutch crawled data, CommonCrawlDataDumper wraps a single string value, representing the JSON text, into CBOR.
      For instance, using the Unix hexdump tool, we can see that, as expected, the first byte of every file is "0x7A": the first three bits are "011", the major type for text strings, and the following 5 bits are "11010", meaning that a uint32_t encodes the length of the following text. The next 4 bytes (uint32_t) encode the correct length of the text, as described in RFC 7049. Therefore, no CBOR tag is currently included in the file (a list of CBOR tags is available here).
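The bit layout described above can be checked by hand with a small sketch. The class and method names below are hypothetical and are not part of Nutch; the sketch only assumes the RFC 7049 layout of an initial byte (3 bits of major type, 5 bits of additional information) and a big-endian uint32_t length. The pattern "011" followed by "11010" corresponds to the byte 0x7A.

```java
// Hypothetical helper, not part of Nutch: splits a CBOR initial byte into
// its major type and additional information, and reads a 4-byte length.
final class CborInitialByte {

    // Upper 3 bits: major type (3 means text string).
    static int majorType(int initialByte) {
        return (initialByte & 0xFF) >>> 5;
    }

    // Lower 5 bits: additional information (26 means a uint32_t length follows).
    static int additionalInfo(int initialByte) {
        return initialByte & 0x1F;
    }

    // Reads the big-endian uint32_t length that follows a 0x7A initial byte.
    static long lengthOfText(byte[] data) {
        if (data.length < 5 || (data[0] & 0xFF) != 0x7A) {
            throw new IllegalArgumentException("expected a 0x7A text string header");
        }
        return ((data[1] & 0xFFL) << 24)
             | ((data[2] & 0xFFL) << 16)
             | ((data[3] & 0xFFL) << 8)
             |  (data[4] & 0xFFL);
    }
}
```

Applied to the first five bytes of a dumped file, `lengthOfText` should match the size of the JSON text that follows the header.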
      To support CBOR detection with Apache Tika (as described in TIKA-1610), it would be great if the CommonCrawlDataDumper tool could add the self-describing CBOR "magic header" (Tag 55799) to its CBOR-encoded output files.
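As a rough illustration of the request, the following sketch prepends the self-describe tag to a CBOR text string. It is not the attached patch: the class and method names are hypothetical, and the encoder always uses the 0x7A (uint32_t length) form only to mirror the output described above. The one fact it relies on from RFC 7049 is that Tag 55799 serializes as the three bytes 0xD9 0xD9 0xF7.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the NUTCH-1997 patch.
final class SelfDescribingCborSketch {

    // RFC 7049 self-describe tag 55799: 0xD9 (tag, 2-byte value follows) + 0xD9F7.
    static final byte[] SELF_DESCRIBE_TAG = {(byte) 0xD9, (byte) 0xD9, (byte) 0xF7};

    // Wraps the JSON text in a CBOR text string with a uint32_t length (0x7A).
    static byte[] encodeTextString(String json) {
        byte[] utf8 = json.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream(utf8.length + 5);
        out.write(0x7A);                 // major type 3, additional info 26
        out.write(utf8.length >>> 24);   // length, big-endian
        out.write(utf8.length >>> 16);
        out.write(utf8.length >>> 8);
        out.write(utf8.length);
        out.write(utf8, 0, utf8.length); // the JSON text itself
        return out.toByteArray();
    }

    // Prepends the three tag bytes so that detectors can recognize the file as CBOR.
    static byte[] withMagicHeader(byte[] cbor) {
        byte[] result = new byte[SELF_DESCRIBE_TAG.length + cbor.length];
        System.arraycopy(SELF_DESCRIBE_TAG, 0, result, 0, SELF_DESCRIBE_TAG.length);
        System.arraycopy(cbor, 0, result, SELF_DESCRIBE_TAG.length, cbor.length);
        return result;
    }
}
```

With this layout, a tagged file starts with D9 D9 F7 7A, and a decoder that does not care about the tag can skip the first three bytes and read the text string as before.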
      Thanks a lot to Luke sh for this great research, and to Chris A. Mattmann for supporting me on this work.

        Attachments

        1. NUTCH-1997.patch
          2 kB
          Giuseppe Totaro

              People

              • Assignee: Chris A. Mattmann (chrismattmann)
              • Reporter: Giuseppe Totaro (gostep)
              • Votes: 0
              • Watchers: 4
