[NUTCH-1997] Add CBOR "magic header" to CommonCrawlDataDumper output - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.10
Component/s: tool
Labels:
- memex
- patch

Description

For each file extracted from Nutch crawled data, CommonCrawlDataDumper wraps a single string value, representing the JSON text, into CBOR.
For instance, using the Unix hexdump tool, we can see that, as expected, the first byte of every file is "0x7A" (the first three bits are "011", the major type for text strings, and the following 5 bits are "11010", meaning that a uint32_t encodes the length of the following text), and the following 4 bytes (a uint32_t) encode the correct length of that text (as described in RFC7049). Therefore, no CBOR tag is currently included in the file (a list of CBOR tags is available here).
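The byte layout described above (a 3-bit major type, 5 bits of additional information, then a 4-byte big-endian length) can be checked by hand. The following is a minimal sketch, not code from Nutch: `parse_cbor_head` and the sample payload are hypothetical names introduced only for illustration, and only the 4-byte length form is handled.

```python
import struct


def parse_cbor_head(data: bytes):
    """Return (major_type, additional_info, length) for the first data item.

    Hypothetical helper: only the 4-byte length form (additional info 26)
    is handled, since that is the form discussed in this issue.
    """
    initial = data[0]
    major_type = initial >> 5      # top 3 bits of the initial byte
    additional = initial & 0x1F    # low 5 bits of the initial byte
    if additional != 26:
        raise ValueError("this sketch only handles a uint32 length")
    (length,) = struct.unpack(">I", data[1:5])  # big-endian uint32
    return major_type, additional, length


# Hypothetical sample: a text string (major type 3) with a 4-byte length,
# standing in for one dumped file.
payload = b'{"url": "http://example.org/"}'
sample = bytes([(3 << 5) | 26]) + struct.pack(">I", len(payload)) + payload

print(hex(sample[0]))           # initial byte: 0b011_11010
print(parse_cbor_head(sample))  # major type, additional info, text length
```

Because the initial byte carries major type 3 rather than major type 6, a file laid out this way starts directly with the text string and has no tag in front of it.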
To support CBOR detection in Apache Tika (as described in TIKA-1610), it would be great if the CommonCrawlDataDumper tool could add the self-describing CBOR "magic header" (Tag 55799) to its CBOR-encoded output files.
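As a rough illustration of what the requested header looks like on the wire, the sketch below encodes Tag 55799 and prepends it to a CBOR text string. It is not the attached patch and does not use any Nutch or Tika API; `encode_tag_head`, `encode_text`, and `with_magic_header` are hypothetical helpers, and the text encoder is assumed to always use the 4-byte length form described in this issue.

```python
import struct

SELF_DESCRIBE_TAG = 55799  # self-describing CBOR tag from RFC 7049


def encode_tag_head(tag: int) -> bytes:
    """Encode a tag number that needs 16 bits: major type 6, additional info 25."""
    if not 256 <= tag <= 0xFFFF:
        raise ValueError("this sketch only handles tags in the 16-bit range")
    return bytes([(6 << 5) | 25]) + struct.pack(">H", tag)


# The three "magic" bytes a detector can look for at the start of a file.
MAGIC = encode_tag_head(SELF_DESCRIBE_TAG)


def encode_text(text: str) -> bytes:
    """Encode a text string using the 4-byte length form (assumption)."""
    body = text.encode("utf-8")
    return bytes([(3 << 5) | 26]) + struct.pack(">I", len(body)) + body


def with_magic_header(cbor_item: bytes) -> bytes:
    """Prepend the self-describing tag unless it is already present."""
    return cbor_item if cbor_item.startswith(MAGIC) else MAGIC + cbor_item


tagged = with_magic_header(encode_text('{"url": "http://example.org/"}'))
print(tagged[:4].hex())  # tag bytes followed by the text-string initial byte
```

The tag does not change the meaning of the enclosed item; a decoder that understands tags reads the same text string, while a detector gets a fixed byte sequence at offset 0 to match against.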
Thanks a lot Lukeliush for this great research. Thanks chrismattmann for supporting me on this work.

Attachments

NUTCH-1997.patch
22/Apr/15 21:59
2 kB
Giuseppe Totaro

Issue Links

is related to

TIKA-1610 CBOR Parser and detection [improvement]

Resolved

People

Assignee: Chris A. Mattmann

Reporter: Giuseppe Totaro

Votes: 0

Watchers: 5

Dates

Created: 22/Apr/15 18:33

Updated: 13/Mar/24 14:51

Resolved: 25/Apr/15 15:57