Tika / TIKA-4048

Gzipped WARC not identifying all assets

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • None
    • 2.9.0
    • None
    • None

    Description

      The WARC parser works for non-gzipped WARC files, but for gzipped WARC files it appears that not all embedded files are identified.

      Processing a WARC.GZ file should return the same JSON output as the plain WARC file, with the addition of the GZ file metadata. However, in the attached JSON outputs, the JPEG present in the plain WARC output is missing from the WARC.GZ.json file.

       

      Additionally, the warc: metadata is not being returned for all files, although this may be by design.

      Attached are two JSON files, one for the gzipped WARC file and one for the plain WARC file, along with the two original files.
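
      One way to check the discrepancy described above is to compare the content types reported in the two JSON outputs. The following is a minimal sketch, not part of the original report. It assumes each JSON file is a list of metadata objects with a "Content-Type" key (as recursive JSON output from Tika typically is), and the sample records are hand-written stand-ins, not the contents of the attached files.

```python
from collections import Counter


def content_type_counts(records):
    """Count Content-Type values across a list of metadata records."""
    counts = Counter()
    for record in records:
        value = record.get("Content-Type", "unknown")
        # A metadata value may be a single string or a list of strings.
        values = value if isinstance(value, list) else [value]
        counts.update(values)
    return counts


def missing_from_gz(plain_records, gz_records):
    """Return content types that occur more often in the plain output
    than in the gzipped output, with the size of the shortfall."""
    plain = content_type_counts(plain_records)
    gz = content_type_counts(gz_records)
    # Counter subtraction keeps only positive differences.
    return dict(plain - gz)


# Hypothetical records standing in for the two attached JSON outputs.
plain = [
    {"Content-Type": "application/warc"},
    {"Content-Type": "text/html"},
    {"Content-Type": "image/jpeg"},
]
gz = [
    {"Content-Type": "application/gzip"},
    {"Content-Type": "application/warc"},
    {"Content-Type": "text/html"},
]

print(missing_from_gz(plain, gz))
```

      To run this against the real attachments, load each file with json.load and pass the two lists to missing_from_gz. The extra container record for the GZ file appears only in the gzipped output, so it does not show up as a difference in this direction.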

      Attachments

        1. rec-20230518121844489398-5335604b8b23.warc
          7 kB
          Gregory Lepore
        2. rec-20230518121844489398-5335604b8b23.warc.gz
          6 kB
          Gregory Lepore
        3. rec-20230518121844489398-5335604b8b23.warc.gz.json
          12 kB
          Gregory Lepore
        4. rec-20230518121844489398-5335604b8b23.warc.json
          13 kB
          Gregory Lepore
        5. Screenshot 2023-05-30 at 3.49.19 PM.png
          64 kB
          Tim Allison
        6. Screenshot 2023-05-30 at 3.50.41 PM.png
          93 kB
          Tim Allison

          People

            Assignee: Tim Allison (tallison)
            Reporter: Gregory Lepore (greg@rhobard.com)
            Votes: 0
            Watchers: 5
