TIKA-697

Tika reports the content type of AR archives as "text/plain"

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1
    • Component/s: None
    • Labels:
      None
    • Environment:

      Linux (CentOS 5.6)

      Description

      The Tika.detect(InputStream) method returns "text/plain" for AR archives created with the Linux "Create Archive" option of Nautilus (available via right-clicking on a file).

      The Apache Commons Compress "autodetection" code of the ArchiveStreamFactory looks at the first 12 bytes of the stream and correctly identifies the type as AR.

        Activity

        PNS created issue -
        Nick Burch added a comment -

        I've added a couple of test documents in r1161038.

        I think from these that we want to look for the pattern "!<arch>\n" i.e. 21 3c 61 72 63 68 3e 0a
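
As an illustration of the pattern above, here is a minimal standalone sketch that checks whether a stream begins with those 8 bytes and then restores the stream position. The class and method names (ArMagicCheck, startsWithArMagic) are hypothetical and are not part of Tika; the sketch only assumes a stream that supports mark/reset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not Tika code: checks for the ar global header "!<arch>\n".
final class ArMagicCheck {
    // 21 3c 61 72 63 68 3e 0a
    private static final byte[] AR_MAGIC =
            "!<arch>\n".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the stream starts with the ar magic; the stream is reset afterwards. */
    static boolean startsWithArMagic(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(AR_MAGIC.length);
        try {
            byte[] head = new byte[AR_MAGIC.length];
            int n = 0;
            // read() may return fewer bytes than requested, so loop until full or EOF
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    break;
                }
                n += r;
            }
            return n == AR_MAGIC.length && Arrays.equals(head, AR_MAGIC);
        } finally {
            in.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream ar = new ByteArrayInputStream(
                "!<arch>\nmember".getBytes(StandardCharsets.US_ASCII));
        InputStream text = new ByteArrayInputStream(
                "plain text".getBytes(StandardCharsets.US_ASCII));
        System.out.println(startsWithArMagic(ar));   // prints "true"
        System.out.println(startsWithArMagic(text)); // prints "false"
    }
}
```

Because the stream is reset in the finally block, a caller can still hand the same stream to a parser after the check.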

        PNS added a comment - edited

        Detection of Unix AR archive types (see http://en.wikipedia.org/wiki/Ar_(Unix)) is very simple and can indeed be done by checking for the 8 "magic" bytes (0x21, 0x3C, 0x61, 0x72, 0x63, 0x68, 0x3E, 0x0A).

        What needs to be changed in the Tika code is at least the TextDetector.detect() method, so that it returns an AR media type if the first 8 bytes of the archive are the AR signature.

        The AR MediaType needs to be added in class org.apache.tika.mime.MediaType and it will probably be a custom one, since apparently there is no IANA-registered MIME type for AR (see http://en.wikipedia.org/wiki/List_of_archive_formats and http://www.iana.org/assignments/media-types/index.html).

        Assuming the existence of a statement like

        	public static final MediaType APPLICATION_AR = application("x-ar");
        

        in class org.apache.tika.mime.MediaType, the following is a quick implementation of the proposed changes in the TextDetector.detect() method:

        	// Code immediately after the static initialization block of the IS_CONTROL_BYTE[] array

        	// AR global header: "!<arch>\n"
        	private static final byte[] AR_HEADER = new byte[]
        		{0x21, 0x3C, 0x61, 0x72, 0x63, 0x68, 0x3E, 0x0A};

        	@Override
        	public MediaType detect(InputStream input, Metadata metadata)
        			throws IOException {
        		if (input == null) {
        			return MediaType.OCTET_STREAM;
        		}

        		input.mark(NUMBER_OF_BYTES_TO_TEST);
        		// Local rather than a field, so concurrent detect() calls do not share state
        		boolean checkArHeader = true;
        		try {
        			for (int i = 0; i < NUMBER_OF_BYTES_TO_TEST; i++) {
        				int ch = input.read();
        				if (ch == -1) {
        					if (i > 0) {
        						return MediaType.TEXT_PLAIN;
        					} else {
        						// See https://issues.apache.org/jira/browse/TIKA-483
        						return MediaType.OCTET_STREAM;
        					}
        				} else if (ch < IS_CONTROL_BYTE.length && IS_CONTROL_BYTE[ch]) {
        					return MediaType.OCTET_STREAM;
        				} else if (checkArHeader) {
        					// See https://issues.apache.org/jira/browse/TIKA-697
        					if (AR_HEADER[i] != ch) {
        						// Mismatch: stop comparing against the AR header
        						checkArHeader = false;
        					} else if (i == AR_HEADER.length - 1) {
        						// All 8 header bytes matched
        						return MediaType.APPLICATION_AR;
        					}
        				}
        			}
        			return MediaType.TEXT_PLAIN;
        		} finally {
        			input.reset();
        		}
        	}

        

        Essentially, the additions are just the new MediaType.APPLICATION_AR constant, the 2 new variables (AR_HEADER, checkArHeader) and the "else if (checkArHeader)" control block.

        I have tested the above with numerous combinations of files and it works as expected.

        Alex Ott added a comment -

        I think the following magic in tika-mimetypes.xml will be enough (instead of modifying the Tika code):

        <mime-type type="application/x-unix-archive">
        <magic priority="50">
        <match value="0x213C617263683E0A" type="string" offset="0" />
        </magic>
        <glob pattern="*.a"/>
        </mime-type>
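
The hex value in the match element above is the same 8-byte signature quoted earlier in this issue. The sketch below decodes it and compares it with "!<arch>\n". It assumes a 0x-prefixed value is meant to be read as raw hex bytes; the HexMagic class is a hypothetical helper for illustration, not Tika's own parsing code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not Tika code: decodes a "0x"-prefixed hex string into bytes.
final class HexMagic {
    static byte[] decode(String value) {
        if (!value.startsWith("0x") || value.length() % 2 != 0) {
            throw new IllegalArgumentException(
                    "expected a 0x prefix followed by an even number of hex digits");
        }
        byte[] out = new byte[(value.length() - 2) / 2];
        for (int i = 0; i < out.length; i++) {
            // Each pair of hex digits after the prefix becomes one byte
            String pair = value.substring(2 + 2 * i, 4 + 2 * i);
            out[i] = (byte) Integer.parseInt(pair, 16);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] magic = decode("0x213C617263683E0A");
        byte[] expected = "!<arch>\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(Arrays.equals(magic, expected)); // prints "true"
    }
}
```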

        Alex Ott added a comment -

        This patch adds a signature for Unix Archive files (.a).

        I think the signature for .deb files should also be updated accordingly.

        Alex Ott made changes -
        Field Original Value New Value
        Attachment tika-697.diff [ 12502742 ]
        PNS added a comment -

        Even better, but maybe we need to add "*.ar" as a glob pattern, too?

        Alex Ott added a comment -

        No problem, just add:

        <glob pattern="*.ar"/>

        after

        <glob pattern="*.a"/>

        ... But I have never actually seen that file extension.

        Nick Burch added a comment -

        Thanks for this

        I've tweaked the existing mime magic in r1206896, which should now correctly detect the file format (the previous one had an erroneous = at the start, and lacked the \n). I've also added the alternate extension and alternate mimetype.

        In r1206898 I've also added mime magic for .deb, based on the working one for archive. Ideally we should also add a very small .deb file to the test suite
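
The exact .deb magic committed in r1206898 is not shown in this thread. As a rough sketch of the idea, a .deb package is an ar archive whose first member is conventionally named debian-binary, so its leading bytes are the ar signature followed by that name. The DebMagicCheck class below is hypothetical and rests on that assumption; it tests the more specific .deb prefix before falling back to plain ar.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not Tika code: distinguishes .deb from a generic ar archive
// by looking at the leading bytes of the file.
final class DebMagicCheck {
    private static final byte[] AR_PREFIX =
            "!<arch>\n".getBytes(StandardCharsets.US_ASCII);
    // Assumption: the first ar member of a .deb is named "debian-binary"
    private static final byte[] DEB_PREFIX =
            "!<arch>\ndebian-binary".getBytes(StandardCharsets.US_ASCII);

    private static boolean startsWith(byte[] head, byte[] prefix) {
        if (head.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (head[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns "deb", "ar", or "unknown" for the given leading bytes. */
    static String classify(byte[] head) {
        if (startsWith(head, DEB_PREFIX)) {
            return "deb"; // the more specific match wins
        }
        if (startsWith(head, AR_PREFIX)) {
            return "ar";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        byte[] deb = "!<arch>\ndebian-binary   ".getBytes(StandardCharsets.US_ASCII);
        byte[] ar = "!<arch>\nlibfoo.o/       ".getBytes(StandardCharsets.US_ASCII);
        System.out.println(classify(deb)); // prints "deb"
        System.out.println(classify(ar));  // prints "ar"
    }
}
```

Checking the longer prefix first matters because every .deb also matches the plain ar signature.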

        Nick Burch made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Fix Version/s 1.1 [ 12318849 ]
        Resolution Fixed [ 1 ]

          People

          • Assignee:
            Unassigned
            Reporter:
            PNS
          • Votes:
            0
            Watchers:
            0
