Commons Compress / COMPRESS-450

Enable skipping past invalid tar header entries

Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.16.1
    • Fix Version/s: None
    • Component/s: Archivers

    Description

      In TarArchiveInputStream::getNextTarEntry(), if reading or parsing the header fails, an IOException is thrown. State (e.g. currEntry) is not cleared, so no further entries or data can be read from the archive afterwards.

      In our use case, we sometimes encounter corrupt tar archives in which the data following a header (one that specifies a non-zero data size) is completely or partly missing. For example, the data for hdr_b is missing in this stream:

      ...[hdr_a][data_a1]...[data_an][hdr_b][hdr_c][data_c1][data_c2]...[data_cn]...

      We have no influence on how these archives are created, so we cannot fix the problem at the source. However, it would be useful to resume reading the tar file at the next valid header, so that at least most of the data can be retrieved. In other words, the behaviour would be similar to that of GNU tar:

      • If reading/parsing the header fails and either no header has been read successfully before or the previous header read attempt also failed, then fail completely.
      • Otherwise, if reading/parsing the header fails, throw an exception. The next call to getNextTarEntry() then reads blocks until it finds one with a valid header checksum and tries to parse that block as a header.
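
      The "read blocks until one has a valid header checksum" step can be sketched as below. This is a minimal illustration in plain Java, not the attached implementation: the class and method names (TarHeaderScan, looksLikeHeader, nextValidHeader, writeChecksum) are hypothetical, and it assumes the usual tar layout of 512-byte blocks with an 8-byte octal checksum field at offset 148 that counts as spaces when the sum is computed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Commons Compress.
public class TarHeaderScan {
    static final int BLOCK_SIZE = 512;
    static final int CHKSUM_OFFSET = 148; // checksum field position in a tar header
    static final int CHKSUM_LEN = 8;

    /** Sums all bytes as unsigned values, counting the checksum field as spaces. */
    static long computeChecksum(byte[] block) {
        long sum = 0;
        for (int i = 0; i < BLOCK_SIZE; i++) {
            boolean inField = i >= CHKSUM_OFFSET && i < CHKSUM_OFFSET + CHKSUM_LEN;
            sum += inField ? ' ' : (block[i] & 0xff);
        }
        return sum;
    }

    /** Parses the stored octal checksum; returns -1 if the field has no octal digits. */
    static long storedChecksum(byte[] block) {
        int i = CHKSUM_OFFSET;
        int end = CHKSUM_OFFSET + CHKSUM_LEN;
        while (i < end && block[i] == ' ') {
            i++; // skip leading padding
        }
        long value = 0;
        int digits = 0;
        while (i < end && block[i] >= '0' && block[i] <= '7') {
            value = value * 8 + (block[i] - '0');
            digits++;
            i++;
        }
        return digits == 0 ? -1 : value;
    }

    /** True if the stored checksum matches the computed one. */
    static boolean looksLikeHeader(byte[] block) {
        return storedChecksum(block) == computeChecksum(block);
    }

    /**
     * Reads 512-byte blocks until one passes the checksum test.
     * Returns that block, or null if the stream ends first.
     */
    static byte[] nextValidHeader(InputStream in) throws IOException {
        byte[] block = new byte[BLOCK_SIZE];
        while (in.readNBytes(block, 0, BLOCK_SIZE) == BLOCK_SIZE) {
            if (looksLikeHeader(block)) {
                return block;
            }
        }
        return null;
    }

    /** Demo helper: stores the checksum as six octal digits, NUL, space. */
    static void writeChecksum(byte[] block) {
        byte[] digits = String.format("%06o", computeChecksum(block))
                .getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(digits, 0, block, CHKSUM_OFFSET, digits.length);
        block[CHKSUM_OFFSET + 6] = 0;
        block[CHKSUM_OFFSET + 7] = ' ';
    }
}
```

      A matching checksum only shows that a block is plausibly a header. As noted in the issues listed further down, a data block that happens to be a header of a nested tar archive would pass this test as well.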

      The attached version of TarArchiveInputStream does this.

      Some issues with this approach:

      • In the example stream above, hdr_c and the blocks following it (how many depends on the data size specified in hdr_b) will already have been returned/read as data for b. However, that is also the case in the current version of TarArchiveInputStream.
      • So (at least) file c is lost, and the next entry to be picked up will likely be hdr_d (or an even later one).
      • Data blocks that look like a tar header at first sight but, in the current context, are not might be misinterpreted as headers (this can occur, for example, with a tar archive stored inside the main tar archive).
      • Currently, the code just throws an IOException with a different error message, as I didn't want to change the behaviour too much. It would be much better to have a dedicated exception (a subclass of IOException) for a "header parse" error, to distinguish it from a general I/O error while reading the underlying stream.
      • I'm not too sure about what to do in case of a "fatal" error (skip to the end of file?)
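
      The dedicated exception suggested above could look roughly like the following. This is only a sketch: the name TarHeaderParseException and the offset field are assumptions made for illustration and do not exist in Commons Compress or in the attached file.

```java
import java.io.IOException;

// Hypothetical exception type; name and fields are illustrative only.
public class TarHeaderParseException extends IOException {
    private static final long serialVersionUID = 1L;

    // Byte offset in the archive at which the bad header block started.
    private final long offset;

    public TarHeaderParseException(String message, long offset, Throwable cause) {
        super(message + " at offset " + offset, cause);
        this.offset = offset;
    }

    public long getOffset() {
        return offset;
    }
}
```

      Because it extends IOException, existing callers that catch IOException would keep working, while callers that want to recover can catch the subclass first and then call getNextTarEntry() again.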

      Still, this approach has been useful for us, and it may benefit others as well.

      Attachments

        1. TarArchiveInputStream.java
          26 kB
          Tijmen R

          People

            Assignee: Unassigned
            Reporter: Tijmen R