Description
For each file extracted from Nutch crawled data, CommonCrawlDataDumper wraps a single string value, representing the JSON text, into CBOR.
For instance, using the Unix hexdump tool, we can see that, as expected, the first byte of every file is "0x7A" (the first three bits are "011", which is the major type for text strings, and the following 5 bits are "11010", meaning that a uint32_t encodes the length of the following text), and the following 4 bytes (a uint32_t) encode the correct length (as described in RFC 7049). Therefore, no CBOR tag is currently included in the file (a list of CBOR tags is available here).
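The byte-level reading above can be checked with a small script. The following is only a minimal sketch in Python, not part of Nutch or Tika: the function name `describe_cbor_head` and the sample JSON text are hypothetical, and the sample bytes are built by hand to mimic the layout described above rather than taken from a real dump.

```python
# Hypothetical helper: decodes only the head (initial byte plus length
# argument) of a CBOR data item, following RFC 7049 section 2.1.
MAJOR_TYPES = {
    0: "unsigned integer", 1: "negative integer", 2: "byte string",
    3: "text string", 4: "array", 5: "map", 6: "tag", 7: "simple/float",
}

def describe_cbor_head(data: bytes):
    """Return (major_type, additional_info, argument, head_length)."""
    if not data:
        raise ValueError("empty input")
    initial = data[0]
    major = initial >> 5      # top 3 bits
    info = initial & 0x1F     # low 5 bits
    if info < 24:
        # The argument is stored directly in the initial byte.
        return major, info, info, 1
    if info <= 27:
        # 24, 25, 26, 27 mean a 1, 2, 4 or 8 byte big-endian argument.
        size = 1 << (info - 24)
        if len(data) < 1 + size:
            raise ValueError("truncated CBOR head")
        argument = int.from_bytes(data[1:1 + size], "big")
        return major, info, argument, 1 + size
    # 28 to 30 are reserved, 31 marks indefinite length: no argument.
    return major, info, None, 1

# Hand-built sample (assumption): 0x7A, a uint32 length, then the text.
text = '{"url": "http://example.org/"}'.encode("utf-8")
sample = bytes([0x7A]) + len(text).to_bytes(4, "big") + text

major, info, length, head_len = describe_cbor_head(sample)
print(MAJOR_TYPES[major], info, length, head_len)
```

For a file laid out as described, the major type is 3 (text string), the additional information is 26, and the decoded argument equals the byte length of the JSON text.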
In order to add support for CBOR detection using Apache Tika (as described in TIKA-1610), it would be great if the CommonCrawlDataDumper tool were able to add the self-describing CBOR "magic header" (Tag 55799) to its CBOR-encoded output files.
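To illustrate what the requested header looks like on the wire, here is a minimal sketch in Python. It is not the CommonCrawlDataDumper implementation: the helper names `encode_text_uint32` and `add_magic_header` are hypothetical. The only fact relied on is that RFC 7049 encodes tag 55799 as the three bytes 0xD9 0xD9 0xF7 placed in front of the tagged data item.

```python
# Tag 55799 (self-describe CBOR): major type 6 with a 2-byte argument,
# i.e. initial byte 0xD9 followed by 0xD9F7 (= 55799).
SELF_DESCRIBE_CBOR = bytes([0xD9, 0xD9, 0xF7])

def encode_text_uint32(text: str) -> bytes:
    """Hypothetical stand-in for the current output: a text string
    whose length is always written as a uint32 (initial byte 0x7A)."""
    body = text.encode("utf-8")
    return bytes([0x7A]) + len(body).to_bytes(4, "big") + body

def add_magic_header(cbor_bytes: bytes) -> bytes:
    """Prepend the self-describing tag unless it is already present."""
    if cbor_bytes.startswith(SELF_DESCRIBE_CBOR):
        return cbor_bytes
    return SELF_DESCRIBE_CBOR + cbor_bytes

untagged = encode_text_uint32('{"url": "http://example.org/"}')
tagged = add_magic_header(untagged)
print(tagged[:4].hex())
```

With the tag in place, a detector only needs to compare the first three bytes of a file against 0xD9 0xD9 0xF7, and the original text string follows unchanged from the fourth byte on.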
Thanks a lot Lukeliush for this great research. Thanks chrismattmann for supporting me on this work.
Attachments
Issue Links
- is related to TIKA-1610 CBOR Parser and detection [improvement] (Resolved)