  1. Tika
  2. TIKA-1513

Add mime detection and parsing for dbf files

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0, 1.14
    • Component/s: None
    • Labels:
      None

      Description

      I just came across an Apache licensed dbf parser that is available on maven.

      Let's add dbf parsing to Tika.

      Any other recommendations for alternate parsers?

        Activity

        grossws Konstantin Gribov added a comment -

        Is this lib alive? Last commits were in mid 2014, some issues are from late 2013, PRs are from mid 2014.

        At the very least, extensive testing is needed. Do we have any freely accessible dbf files to use in tests?

        tallison@mitre.org Tim Allison added a comment -

        I share your concern. There are ~2600 .dbase3 files in govdocs1.

        lfcnassif Luis Filipe Nassif added a comment -

        I have found https://github.com/iryndin/jdbf. Seems to be more active and to support more field types (memo, picture, etc) and more dbf formats.
        Its pom file declares Apache v2 license. But I could not find it on maven.

        tallison@mitre.org Tim Allison added a comment -

        Any interest in encouraging iryndin to push to maven?

        lfcnassif Luis Filipe Nassif added a comment -

        I can if the community thinks that jdbf is a better option.

        grossws Konstantin Gribov added a comment -

        Tim Allison, I think it's a good idea, even if Tika won't use it as a dependency.

        tallison@mitre.org Tim Allison added a comment -

        From a brochure-level evaluation, I'd prefer jdbf. If we want to carry out an evaluation on the 2600 govdocs1 files, we'll have to implement wrappers for both. I propose the following:

        1) I'll build a parser with jamel. If basic functional tests look decent on the 2600 files from govdocs1, I'll commit that to Tika, and we'll have basic mavenized dbf support.

        2) After that I'll build a parser with jdbf and we can compare output on the govdocs1 files. If jdbf results are equal or better, we can try to persuade iryndin to push to maven.

        ??

        lfcnassif Luis Filipe Nassif added a comment -

        I talked to iryndin and he liked the idea of pushing jdbf to Maven Central. Can someone with experience in that help him?

        gagravarr Nick Burch added a comment -

        If it's the project themselves pushing it to central, then the docs to follow are http://central.sonatype.org/pages/ossrh-guide.html and http://central.sonatype.org/pages/apache-maven.html#performing-a-release-deployment-with-the-maven-release-plugin

        tallison@mitre.org Tim Allison added a comment -

        Thank you, Luis Filipe Nassif and Nick Burch!

        I think I'll work on the sqlite parser integration first and then turn to this...maybe this will be in maven by then?

        iryndin Ivan Ryndin added a comment -

        Hi guys!
        I have started working on pushing jdbf to Maven Central.
        I think this will take me 1-2 weeks: review the code once more, create javadocs, and update the POM file according to the instructions.
        I'll drop a note here when it is ready.

        tallison@mitre.org Tim Allison added a comment -

        Ivan Ryndin, No rush on our side (well, at least mine). I look forward to testing jdbf and potentially integrating it into Tika. What's your level of interest in ongoing support? Thank you for pushing it into the public Maven repo!

        iryndin Ivan Ryndin added a comment -

        Well, I plan ongoing support of JDBF, though I left the project which it was done for (linux-hosted java webapp where there was need to read/write DBFs as one of exchange formats).

        What do you mean by rushing onto your side? Do you invite me to work on some TIKA issues?

        tallison@mitre.org Tim Allison added a comment -

        Great! Well, yes, we're always looking to build the community. Please join on!

        What I meant, though, was that I probably won't get to this for a few weeks myself, so your estimate of 1-2 weeks is great.

        Thank you, again!

        iryndin Ivan Ryndin added a comment -

        Well, okay, let my first job for the TIKA project be pushing JDBF to Maven Central, and then let's discuss my further steps.

        tallison@mitre.org Tim Allison added a comment -

        Ivan Ryndin, on codepage detection in dbf...in one of the specs I read, it looks like there is a byte in the header that may or may not be set that specifies the codepage for the table. Are you, by chance, parsing that?

        If we wanted to integrate our charset detector, would we call getBytes() on the first X DbfRecords, run those through our detector and then reprocess the stream with that charset?

        I installed OpenOffice so that I could create test dbf documents, but the results have been pretty poor.

        iryndin Ivan Ryndin added a comment -

        There is no reliable way to detect the codepage of DBF files. I haven't come across DBF specs where the codepage is specified by some special byte.
        The only way to determine the codepage is trial and error.

        One possibly interesting approach would be to detect the codepage in a way similar to language detection, that is, with a statistics-based method such as n-gram models. I haven't come across any ready-to-use framework that detects codepages this way. However, I'm not sure it is worth implementing.

        tallison@mitre.org Tim Allison added a comment -

        Ah, ok. These are the links that I came across: general structure (with mention of codepage mark at byte 29) and mappings for the code page byte and here. I realize that there is always a difference between specs and reality.

        On charset detection, y, ngram naive bayes would be fun, but for now we'll use the built in charset detection that comes with Tika.
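A minimal sketch of reading the code page mark at byte 29, for readers who want to try it against their own files. The class name `DbfCodePageMark` is hypothetical (it is not part of Tika or jdbf), and the mark-to-code-page table is a partial assumption taken from the Visual FoxPro-style mappings referred to above, so it should be checked against real files before anyone relies on it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalInt;

/** Hypothetical helper; not part of Tika or jdbf. */
final class DbfCodePageMark {
    /** Offset of the code page mark in the dbf header, as mentioned in the comment above. */
    static final int CODE_PAGE_MARK_OFFSET = 29;

    // Partial table: mark byte -> code page number (assumed, FoxPro-style mapping).
    private static final Map<Integer, Integer> MARK_TO_CODE_PAGE = new HashMap<>();
    static {
        MARK_TO_CODE_PAGE.put(0x01, 437);   // US MS-DOS
        MARK_TO_CODE_PAGE.put(0x02, 850);   // International MS-DOS
        MARK_TO_CODE_PAGE.put(0x03, 1252);  // Windows ANSI
        MARK_TO_CODE_PAGE.put(0x65, 866);   // Russian MS-DOS
        MARK_TO_CODE_PAGE.put(0xC8, 1250);  // Eastern European Windows
        MARK_TO_CODE_PAGE.put(0xC9, 1251);  // Russian Windows
    }

    /** Returns the code page number, or empty if the mark is 0x00, unknown, or the header is too short. */
    static OptionalInt codePageOf(byte[] header) {
        if (header == null || header.length <= CODE_PAGE_MARK_OFFSET) {
            return OptionalInt.empty();
        }
        int mark = header[CODE_PAGE_MARK_OFFSET] & 0xFF; // treat the byte as unsigned
        Integer codePage = MARK_TO_CODE_PAGE.get(mark);
        return codePage == null ? OptionalInt.empty() : OptionalInt.of(codePage);
    }
}
```

An empty result would be the signal to fall back to some other strategy, such as Tika's built-in charset detection.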

        iryndin Ivan Ryndin added a comment -

        Yeah, I saw these articles. Probably, this code page byte exists only in files produced with Visual FoxPro.
        I haven't seen this byte set to anything other than 0x00 in the DBF files I work with.

        tallison@mitre.org Tim Allison added a comment -

        Hi Ivan Ryndin, I wanted to check in to see how the cleanup/mavenizing is going. Thank you!

        tallison@mitre.org Tim Allison added a comment -

        From govdocs1, it looks like a first byte of 0x03 is a safe way to identify these files.

        This was useful.

        Two mime type questions:
        1) What should we use as the canonical mime type for .dbf files? Proposal: application/x-dbf.

        2) What mimes should the parser "accept", or what should we include in the aliases?
        From filext.com:

        • application/dbase
        • application/x-dbase
        • application/dbf
        • application/x-dbf
        • zz-application/zz-winassoc-dbf

        First attempt at mime definition:

          <mime-type type="application/x-dbf">
            <magic priority="100">
              <match value="0x03" type="string" offset="0"/>
            </magic>
            <glob pattern="*.dbf"/>
            <glob pattern="*.dbase"/>
          </mime-type>
        
        lfcnassif Luis Filipe Nassif added a comment -

        Hi Tim,

        I am ok with 1) and 2). But I think a one-byte magic can result in many false positives, especially with binary files. My current approach is detection by extension only. That needed a declaration of text/plain as a supertype.

        tallison@mitre.org Tim Allison added a comment -

        Y, I was concerned by that generally. Are you getting false positives with 0x03 specifically? I didn't find any in govdocs1, but I realize that corpus has limitations.

        Will add text/plain as supertype. Thank you!

        lfcnassif Luis Filipe Nassif added a comment -

        No, I have not tried 0x03. How many files are detected as octet-stream in govdocs1? I wouldn't like to hit an issue similar to TIKA-1554 again (I am indexing ALL desktop files). I will test 0x03 and report the results here. Can we at least decrease the magic priority to 10 or 20 for now?

        tallison@mitre.org Tim Allison added a comment -

        Completely agree.

        Only 2,386 files.

        This is the table of the file extensions for files identified as application/octet-stream.

        File Extension   Count
        dbase3           1664
        wp               362
        unk              285
        gls              60
        ileaf            4
        sys              3
        chp              2
        lnk              2
        mac              2
        squeak           1
        bin              1
        Would very much appreciate what you find, and yes, we can certainly decrease the priority...I had my priorities backwards. Sorry.

        Obviously, if you find false positives, we'll back off to file suffix. I, too, was less than enthusiastic about a single byte mime id'er.

        Thank you!

        tallison@mitre.org Tim Allison added a comment - - edited

        In looking at this, I wonder if we could add 0x00 at 30 and 31?

        In govdocs1, files that start with 0x03:

        file suffix   count
        dbase3        2601
        gls           60
        bin           1

        In commoncrawl:

        file suffix   count
        dbf           532
        ndx           40
        dct           33
        tfm           12
        ctg           11
        _bf           2
        cti           2
        stp           2
        NO_SUFFIX     2
        a04           1
        a05           1
        fw            1
        mxp           1
        pyc           1
        txt           1
        lfcnassif Luis Filipe Nassif added a comment -

        Hi Tim,

        I've processed a forensic disk copy with 533,949 files. I got 137 files detected as application/x-dbf using the 0x03 signature, all false positives. Not so good. Many of them are deleted/recovered files pointing to binary data.

        The reference you've posted (http://www.dbf2002.com/dbf-file-format.html) states that byte at offset 0x00 can have other values depending on file version or software vendor. And some of them are supported by jdbf. So I think 0x03 is also too restrictive.
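To make the "too restrictive" point concrete, here is a rough sketch of a first-byte check that accepts more than 0x03. The class `DbfSignature` is hypothetical, and the set of values is an assumption drawn from commonly cited descriptions of the dBASE/FoxPro family; it is neither exhaustive nor a statement of which variants jdbf actually handles.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper; not part of Tika or jdbf. */
final class DbfSignature {
    // Assumed, partial list of first-byte values seen in dbf-family files.
    private static final Set<Integer> KNOWN_VERSION_BYTES =
            Collections.unmodifiableSet(new HashSet<>(Arrays.asList(
                    0x02,  // FoxBASE
                    0x03,  // dBASE III, no memo
                    0x30,  // Visual FoxPro
                    0x31,  // Visual FoxPro, autoincrement
                    0x43,  // dBASE IV SQL table, no memo
                    0x83,  // dBASE III+ with memo
                    0x8B,  // dBASE IV with memo
                    0xF5   // FoxPro 2.x with memo
            )));

    /** True if the first byte of the file is one of the assumed version bytes. */
    static boolean isKnownVersionByte(int firstByte) {
        return KNOWN_VERSION_BYTES.contains(firstByte & 0xFF); // mask so signed bytes work too
    }
}
```

Accepting more first-byte values can only increase the number of false positives for a one-byte check, so a set like this is probably only useful as the first step of a stricter validation.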

        tallison@mitre.org Tim Allison added a comment - - edited

        Oh, broken files, y, that would explain your concern. And, y, that's pretty bad.

        Would you be able to run "file" against a handful of your false positives to see what "file" says those files are?

        This is the definition in my magic file, but it is commented out...not sure how "file" is actually working...

        #0      byte       0x03
        #!:mime application/x-dbf
        #>8     leshort   >0
        #>>12   leshort    0    FoxBase+, FoxPro, dBaseIII+, dBaseIV, no memo
        
        davemeikle Dave Meikle added a comment -
        • Pushed to 1.11 following 1.10 release
        tallison@mitre.org Tim Allison added a comment -

        Hi Ivan Ryndin, I wanted to check in to see if you've had a chance to make any progress on this. I've let it go to the backburner for a bit.

        Thank you!

        nicholasc Nick C added a comment -

        I ended up building a detector that tries to validate the dbf header instead of just looking for 0x03 which caused false positives. If you're interested I'll submit a patch.

        gagravarr Nick Burch added a comment -

        Is it based on JDBF, or did you write it from scratch?

        nicholasc Nick C added a comment -

        I wrote the detector from scratch a couple months ago because 0x03 caused too many false positives. For the parser I ended up using jdbf but found some bugs. One was that the parser would error if inputStream.read(...) returned less than the number of required bytes (The code needs to use something like IOUtils.readFully)
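For anyone unfamiliar with the short-read problem: `InputStream.read(byte[], int, int)` is allowed to return fewer bytes than requested, which is common with zip or network streams. A minimal JDK-only sketch of the loop that a `readFully`-style helper performs follows; the class name `ReadFullySketch` is made up for illustration, and this is not the jdbf or commons-io code.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical illustration of a readFully loop. */
final class ReadFullySketch {
    /**
     * Fills buf completely, looping over short reads.
     * Throws EOFException if the stream ends before buf is full.
     */
    static void readFully(InputStream in, byte[] buf) throws IOException {
        int off = 0;
        while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) {
                throw new EOFException(
                        "stream ended after " + off + " of " + buf.length + " bytes");
            }
            off += n; // a short read just advances the offset; keep looping
        }
    }
}
```

The JDK's own `DataInputStream.readFully(byte[])` behaves the same way, so wrapping the stream in a `DataInputStream` is another option that avoids a new dependency.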

        The logic I used was

        • Validate the signature
        • Validate the header last update date (Is the month between 1 and 12 and is the day valid for that month)
        • Validate the header size by dividing by 32 and making sure there aren't more than 255 fields
        • Calculate the file size using the record count, header length and record length from the header, making sure it's less than 4GB. If I can get the input stream length without reading the entire stream (TikaInputStream.hasLength or metadata.content_length) I make sure the calculated size matches (or is within 2 bytes).
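The four checks above can be sketched roughly as follows. This is not the code from the detector being discussed: the class `DbfHeaderCheck` is hypothetical, and the field offsets (date at bytes 1-3 with a 1900 year base, record count at 4-7, header length at 8-9, record length at 10-11, all little-endian) are assumed from the usual dBASE III header description.

```java
import java.time.DateTimeException;
import java.time.LocalDate;

/** Hypothetical sketch of header-based dbf validation; not Tika code. */
final class DbfHeaderCheck {
    private static final int FIXED_HEADER = 32;     // fixed part of the header
    private static final int FIELD_DESCRIPTOR = 32; // size of each field descriptor
    private static final long MAX_SIZE = 4L * 1024 * 1024 * 1024;

    /** streamLength < 0 means "length unknown"; the size comparison is then skipped. */
    static boolean looksLikeDbf(byte[] h, long streamLength) {
        if (h == null || h.length < FIXED_HEADER) return false;
        // 1. Signature (only 0x03 here; other version bytes exist).
        if ((h[0] & 0xFF) != 0x03) return false;
        // 2. Last-update date must be a real calendar date.
        try {
            LocalDate.of(1900 + (h[1] & 0xFF), h[2] & 0xFF, h[3] & 0xFF);
        } catch (DateTimeException e) {
            return false;
        }
        // 3. Header length implies between 1 and 255 field descriptors.
        long recordCount = u32le(h, 4); // unsigned, so kept in a long
        int headerLength = u16le(h, 8);
        int recordLength = u16le(h, 10);
        if (headerLength < FIXED_HEADER + FIELD_DESCRIPTOR) return false;
        int fieldCount = (headerLength - FIXED_HEADER) / FIELD_DESCRIPTOR;
        if (fieldCount > 255 || recordLength < 1) return false;
        // 4. Computed size must be below 4GB and close to the real length, if known.
        long expected = headerLength + recordCount * recordLength;
        if (expected >= MAX_SIZE) return false;
        return streamLength < 0 || Math.abs(streamLength - expected) <= 2;
    }

    private static int u16le(byte[] b, int off) {
        return (b[off] & 0xFF) | ((b[off + 1] & 0xFF) << 8);
    }

    private static long u32le(byte[] b, int off) {
        return (b[off] & 0xFFL) | ((b[off + 1] & 0xFFL) << 8)
                | ((b[off + 2] & 0xFFL) << 16) | ((b[off + 3] & 0xFFL) << 24);
    }
}
```

In this sketch the size comparison is strict, so a truncated file or one with trailing bytes would be rejected whenever the stream length is known; passing a negative length skips that comparison.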

        I'll put the code up on github tomorrow and get a list of the jdbf bugs.

        tallison@mitre.org Tim Allison added a comment -

        Great. Thank you!

        tallison@mitre.org Tim Allison added a comment -

        Ivan Ryndin, any interest in working on this?

        nicholasc Nick C added a comment - - edited

        Some of my checks may be a little strict, because you can have extra bytes at the end of the file and after the field headers (I haven't personally seen any files like that, though). In those cases, hopefully the file extension glob matches. I put in some TODOs that can be changed to call jdbf for validating the DBF file type and field type. Feel free to do what you want with the code:
        https://gist.github.com/fxfixer/e54f86095a548cbfb8aeb948ff77a41b

        I used the jdbf v3 branch and here are the bugs I noticed. If Ivan Ryndin is interested I'll create a pull request.
        Calls to input.read(byte[]…) should use IOUtils.readFully. (Sometimes if the dbf is in a zip file, the read call returns less than the requested bytes)
        DBFMetadataReader.readHeader()

        • Needs to call IOUtils.readFully when reading headerBytes
        • NPE if DbfFileTypeEnum.fromInt returns null (Maybe throw an unsupported exception?)
        • Reads record count as int instead of unsigned int

        DBFRecordIterator

        • Unnecessary call to Arrays.fill to set byte[] bytes to 0 (Not really a bug)
        • Needs to call IOUtils.readFully when reading recordBuffer;

        Some encoding names are not correct in CharsetHelper.getCharsetByByte
        936 = cp936 // Chinese (PRC, Singapore) Windows
        932 = cp932 // Japanese Windows
        1255 = Windows-1255 // Hebrew Windows
        1256 = Windows-1256 // Arabic Windows
        1250 = Windows-1250 // Eastern European Windows
        1251 = Windows-1251 // Russian Windows
        1254 = Windows-1254 // Turkish Windows
        1253 = Windows-1253 // Greek Windows
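As an illustration of the encoding-name list above, here is a small sketch that resolves a code page number to a Java `Charset` using those names. `DbfCharsets` and `forCodePage` are hypothetical names, not the jdbf `CharsetHelper` API, and whether each name resolves depends on the charsets shipped with the running JVM, so the lookup returns an empty `Optional` rather than throwing.

```java
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical lookup; not the jdbf CharsetHelper. */
final class DbfCharsets {
    // Code page number -> charset name, taken from the list above.
    private static final Map<Integer, String> NAMES = new HashMap<>();
    static {
        NAMES.put(936, "cp936");          // Chinese (PRC, Singapore) Windows
        NAMES.put(932, "cp932");          // Japanese Windows
        NAMES.put(1255, "Windows-1255");  // Hebrew Windows
        NAMES.put(1256, "Windows-1256");  // Arabic Windows
        NAMES.put(1250, "Windows-1250");  // Eastern European Windows
        NAMES.put(1251, "Windows-1251");  // Russian Windows
        NAMES.put(1254, "Windows-1254");  // Turkish Windows
        NAMES.put(1253, "Windows-1253");  // Greek Windows
    }

    /** Empty if the code page is not in the table or the JVM lacks the charset. */
    static Optional<Charset> forCodePage(int codePage) {
        String name = NAMES.get(codePage);
        if (name == null) {
            return Optional.empty();
        }
        try {
            return Charset.isSupported(name)
                    ? Optional.of(Charset.forName(name))
                    : Optional.empty();
        } catch (IllegalArgumentException e) { // e.g. IllegalCharsetNameException
            return Optional.empty();
        }
    }
}
```

Checking `Charset.isSupported` first keeps a wrong or unavailable name from surfacing as an exception in the middle of parsing.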

        tallison@mitre.org Tim Allison added a comment - - edited

        Nick Burch, would you mind taking a look at the detector? Is there a way that we can convert this to a mime definition? Or should we add a DBFDetector?

        Nick C, it looks great to me. I agree that we'll probably want to relax some of the length checks (just make sure they're > 0 or something reasonable)...we wouldn't want this to fail on truncated dbfs, and as you've pointed out, there can be extra bytes at the end of the file. If there's any way to avoid adding the dependency, that'd be great...although, I very much appreciate the concern for overflow!

        In your experience, do we need to validate the fieldentry or can we stop sooner? If we do, then I suspect there's no way to convert to a mime definition, but I suspect much of the earlier stuff could easily be translated.

        Oh, and please make sure to add an Apache license header...unless Nick B can easily translate this to a mime definition.

        Thank you!

        tallison@mitre.org Tim Allison added a comment -

        Is there any interest in forking jdbf into Tika? Or, Nick C, do you have any interest in hosting it and pushing it to Maven? I'd far, far prefer to update Ivan Ryndin's code in place and avoid forking if possible.

        nicholasc Nick C added a comment -

        I added the license header. I think some of the checks could be removed. I'll do some testing to see how far into the code the false positives I had stop matching, and determine if I can make it simple enough to be a mime definition. It would be nice if Tika's mime definition allowed for more complex matching like the Linux magic db.

        I also don't mind forking it into Tika or hosting it. A lot of the classes seem to be unused in jdbf v3 so it could be slimmed down to just a couple.

        tallison@mitre.org Tim Allison added a comment -

        It would be nice if Tika's mime definition allowed for more complex matching like the Linux magic db.

        Well, you know there's still plenty of time to get that into Tika 2.0.

        I'll do some testing to see how far in the code the false positives I had stop matching and determine if I can make it simple enough to be a mime definition

        Great. Thank you, again. Ballpark, how many dbfs do you have to dev with? Do you want some from our test corpus?

        nicholasc Nick C added a comment -

        Well, you know there's still plenty of time to get that into Tika 2.0

        Maybe I'll add that to my to-do list. I have been wanting to work on improving the RTF parser to handle tables/html and generate valid xhtml (multiple lists seem to cause issues).

        Ballpark, how many dbfs do you have to dev with? Do you want some from our test corpus?

        At least 200. I would like more to test with though.

        tallison@mitre.org Tim Allison added a comment -

        At least 200. I would like more to test with though.

        I think I rm'd the bz2 I shared with Ivan up above. I'll see what I can dig up.

        tallison@mitre.org Tim Allison added a comment -

        Nope. Didn't remove them. There are roughly 3k files that ended with dbf or dbase3 in govdocs1 and an earlier version of our slice of commoncrawl.
        The files may not actually be dbfs, and they're likely truncated (at least those that came from commoncrawl).

        Give this a shot.

        Thank you, Rackspace!

        nicholasc Nick C added a comment - - edited

        Did some more testing and simplified the rules enough that they could be made into a regex. It's not pretty, but it works. It checks the signature/version, month (1-12), day (1-31), header length >= 65, record length > 1, and the first field's type (which could be stricter).

        <magic priority="100">
        <match value="(?s)^[\\x02\\x03\\x30\\x31\\x32\\x43\\x63\\x83\\x8B\\xCB\\xF5\\xE5\\xFB].[\\x01-\\x0C][\\x01-\\x1F].{4}(?:.[^\\x00]|[\\x41-\\xFF].)(?:[^\\x00\\x01].|.[^\\x00]).{31}[A-Z@+]" type="regex" offset="0"/>
        </magic>
        
        tallison@mitre.org Tim Allison added a comment -

        I'll add this before running the final 1.13 regression tests and see what happens. Thank you!

        nicholasc Nick C added a comment - - edited

        I was running this on more data and ran into a text file that matched. It started with a 2 (0x32) followed by 3 newlines. I had to make a small change that checks for a null byte before the field type (field names are null terminated).

        <magic priority="100">
        <match value="(?s)^[\\x02\\x03\\x30\\x31\\x32\\x43\\x63\\x83\\x8B\\xCB\\xF5\\xE5\\xFB].[\\x01-\\x0C][\\x01-\\x1F].{4}(?:.[^\\x00]|[\\x41-\\xFF].)(?:[^\\x00\\x01].|.[^\\x00]).{31}(?&lt;=[\\x00][^\\x00]{0,10})[A-Z@+]" type="regex" offset="0"/>
        </magic>
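
        To make the effect of that lookbehind concrete, here is a minimal Java sketch (not Tika's own detection code) that runs both versions of the regex over two synthetic buffers. The class name, the helper methods, and the sample bytes are all made up for illustration; the only parts taken from this thread are the two regexes. Bytes are decoded as ISO-8859-1 so that each \xNN escape lines up with one raw byte.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.regex.Pattern;

/** Sketch only: compares the first and revised detection regexes on synthetic bytes. */
final class DbfMagicRegexSketch {

    // Everything up to (but not including) the first field's type byte, offsets 0-42.
    private static final String PREFIX =
            "(?s)^[\\x02\\x03\\x30\\x31\\x32\\x43\\x63\\x83\\x8B\\xCB\\xF5\\xE5\\xFB]"
            + ".[\\x01-\\x0C][\\x01-\\x1F].{4}"
            + "(?:.[^\\x00]|[\\x41-\\xFF].)"
            + "(?:[^\\x00\\x01].|.[^\\x00])"
            + ".{31}";

    static final Pattern FIRST = Pattern.compile(PREFIX + "[A-Z@+]");
    static final Pattern REVISED =
            Pattern.compile(PREFIX + "(?<=[\\x00][^\\x00]{0,10})[A-Z@+]");

    /** ISO-8859-1 maps every byte to the char with the same value. */
    static boolean matches(Pattern pattern, byte[] head) {
        return pattern.matcher(new String(head, StandardCharsets.ISO_8859_1)).find();
    }

    /** Hypothetical text file: '2', three newlines, then letters, uppercase at offset 43. */
    static byte[] sampleTextFile() {
        byte[] b = new byte[64];
        Arrays.fill(b, (byte) 'a');
        b[0] = '2';
        b[1] = b[2] = b[3] = '\n';
        b[43] = 'C';
        return b;
    }

    /** Hypothetical dbf header with a single character field named NAME. */
    static byte[] sampleDbfHeader() {
        byte[] b = new byte[64];              // zero-filled
        b[0] = 0x03;                          // signature/version
        b[1] = 116; b[2] = 5; b[3] = 12;      // year, month, day
        b[8] = 0x41;                          // header length 65 (little-endian low byte)
        b[10] = 0x0B;                         // record length 11 (little-endian low byte)
        System.arraycopy("NAME".getBytes(StandardCharsets.US_ASCII), 0, b, 32, 4);
        b[43] = 'C';                          // first field's type
        return b;
    }

    public static void main(String[] args) {
        System.out.println("text, first regex:   " + matches(FIRST, sampleTextFile()));
        System.out.println("text, revised regex: " + matches(REVISED, sampleTextFile()));
        System.out.println("dbf,  revised regex: " + matches(REVISED, sampleDbfHeader()));
    }
}
```

        With these samples the text buffer satisfies the first regex but not the revised one, because no null byte appears in the 11 bytes before offset 43, while the synthetic header satisfies both.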
        
        tallison@mitre.org Tim Allison added a comment -

        I won't commit this until we get our corpus results back...perhaps I'll redo the run with this if there's time.

        Coincidentally, on this comparison, it looks like DROID is identifying ~3k files in our corpus as some version of dbase.

        In your spare time, if you could document that work of art, that'd be handy. Thank you, again.

        nicholasc Nick C added a comment -

        Sounds good. I'll be running this on more files this week and will report back if I notice any false positives. If you want, you can make the field type check stricter, which could prevent other false positives (replace [A-Z@+] with [BCDFGILMNOPQTVWXY@+]).

        Details                                                                 Regex
        Enable dotall mode (so dots match newlines)                             (?s)
        Signature/version                                                       ^[\x02\x03\x30\x31\x32\x43\x63\x83\x8B\xCB\xF5\xE5\xFB]
        Year (no check)                                                         .
        Month (1-12)                                                            [\x01-\x0C]
        Day (1-31)                                                              [\x01-\x1F]
        Record count (uint32, no check)                                         .{4}
        Header length (ushort), at least 65                                     (.[^\x00]|[\x41-\xFF].)
        Record length (ushort), greater than 1                                  ([^\x00\x01].|.[^\x00])
        Skip rest of file header and the 11-byte field name                     .{31}
        Make sure field name is null terminated (regex zero-width lookbehind)   (?<=[\x00][^\x00]{0,10})
        Field type                                                              [BCDFGILMNOPQTVWXY@+]

        Full Regex

        (?s)^[\x02\x03\x30\x31\x32\x43\x63\x83\x8B\xCB\xF5\xE5\xFB].[\x01-\x0C][\x01-\x1F].{4}(?:.[^\x00]|[\x41-\xFF].)(?:[^\x00\x01].|.[^\x00]).{31}(?<=[\x00][^\x00]{0,10})[BCDFGILMNOPQTVWXY@+]
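
        For readers who find the regex hard to follow, the same checks can be written as a plain byte-level test. This is a rough Java sketch under stated assumptions, not the code that went into Tika: the class and method names are hypothetical, and it assumes the usual dbf layout of little-endian lengths at offsets 8 and 10, the first field name at offsets 32-42, and its type at offset 43. The header-length alternation in the regex accepts a low byte of 0x41, so the sketch uses >= 65.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Sketch only: hand-written equivalent of the detection regex's checks. */
final class DbfHeaderCheckSketch {

    // Signature/version bytes accepted by the regex's first character class.
    private static final int[] VERSIONS = {
        0x02, 0x03, 0x30, 0x31, 0x32, 0x43, 0x63, 0x83, 0x8B, 0xCB, 0xF5, 0xE5, 0xFB
    };

    // The stricter field-type set suggested in the comment above.
    private static final String FIELD_TYPES = "BCDFGILMNOPQTVWXY@+";

    static boolean looksLikeDbf(byte[] head) {
        if (head.length < 44) {
            return false;                       // need bytes 0..43
        }
        ByteBuffer buf = ByteBuffer.wrap(head).order(ByteOrder.LITTLE_ENDIAN);

        int version = head[0] & 0xFF;
        boolean knownVersion = false;
        for (int v : VERSIONS) {
            knownVersion |= (v == version);
        }

        int month = head[2] & 0xFF;             // byte 1 (year) is not checked
        int day = head[3] & 0xFF;
        int headerLength = buf.getShort(8) & 0xFFFF;   // bytes 4-7 (record count) unchecked
        int recordLength = buf.getShort(10) & 0xFFFF;

        // The 11-byte field name (offsets 32-42) must contain a null terminator.
        boolean nameTerminated = false;
        for (int i = 32; i <= 42; i++) {
            nameTerminated |= (head[i] == 0);
        }

        char fieldType = (char) (head[43] & 0xFF);

        return knownVersion
                && month >= 1 && month <= 12
                && day >= 1 && day <= 31
                && headerLength >= 65
                && recordLength >= 2
                && nameTerminated
                && FIELD_TYPES.indexOf(fieldType) >= 0;
    }

    public static void main(String[] args) {
        byte[] head = new byte[64];             // synthetic header, zero-filled
        head[0] = 0x03; head[2] = 5; head[3] = 12;
        head[8] = 0x41; head[10] = 0x0B;
        head[32] = 'N'; head[43] = 'C';
        System.out.println(looksLikeDbf(head));
    }
}
```

        The regex form is what fits into a mime definition; the explicit form is mainly useful for seeing which byte each part of the pattern is looking at.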
        
        nicholasc Nick C added a comment -

        Tested more files using the full regex and haven't had any false positives.

        tallison@mitre.org Tim Allison added a comment -

        Great. Frankly, the initial regex looked quite good...small handful of false positives. I look forward to running this on our corpus once 1.13 is released.

        Once we get feedback from Ivan Ryndin on the parser, it'll be great to add detection and parsing in one go.

        Thank you, again.

        tallison@mitre.org Tim Allison added a comment -

        Ivan Ryndin, now that 1.13 is in the voting process, I'd like to re-engage on this issue for 1.14. Would you be willing to make the updates that Nick C recommended and push to maven central? Or, again, as a far less preferable option, would you object to us incorporating your code within Tika?

        tallison@mitre.org Tim Allison added a comment -

        Rolled our own parser. Will commit tomorrow.

        tallison@mitre.org Tim Allison added a comment -

        Nick C, do you, by chance, have any shareable examples of files that don't start with 0x03, e.g. Visual FoxPro, dBase IV, etc.? Any shareable examples of .dbt (memo) files? Thank you, again, for the mime-detection regex!

        How do we want to handle detecting the variants?

        Option 1: replicate the above regex for each variant and change the first byte? With parent mime-type "application/x-dbf"?
        Option 2: send them all to the DBFParser, and that will update the mime type.

        How do we want to represent the variants via the mime, e.g. 0x30 Visual FoxPro: "application/x-dbf; Visual FoxPro"?

        tallison@mitre.org Tim Allison added a comment -

        First version is done. Plenty of areas in the "todo" list, but this should be a good start.

        Nick C, please try it out and let me know if it works for you.

        hudson Hudson added a comment -

        SUCCESS: Integrated in tika-trunk-jdk1.7 #999 (See https://builds.apache.org/job/tika-trunk-jdk1.7/999/)
        TIKA-1513 – add mime detection and parsing for dbf files. Thanks to (tallison: rev e74f66375f20d914f8585597b6d9492586a0caa9)

        • tika-parsers/src/test/resources/test-documents/testDBF_gb18030.dbf
        • tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
        • tika-parsers/src/main/java/org/apache/tika/parser/dbf/DBFCell.java
        • tika-parsers/src/test/java/org/apache/tika/parser/dbf/DBFParserTest.java
        • tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
        • tika-parsers/src/main/java/org/apache/tika/parser/dbf/DBFFileHeader.java
        • tika-parsers/src/main/java/org/apache/tika/parser/dbf/DBFColumnHeader.java
        • tika-parsers/src/main/java/org/apache/tika/parser/dbf/DBFRow.java
        • tika-parsers/src/test/resources/test-documents/testDBF.dbf
        • tika-parsers/src/main/java/org/apache/tika/parser/dbf/DBFReader.java
        • tika-parsers/src/main/java/org/apache/tika/parser/dbf/DBFParser.java

        TIKA-1513 – add mime detection and parsing for dbf files. Thanks to (tallison: rev cb492f4b16ccdd0c0d8129f215e75a14f294cc89)

        • CHANGES.txt
        hudson Hudson added a comment -

        FAILURE: Integrated in tika-2.x-windows #6 (See https://builds.apache.org/job/tika-2.x-windows/6/)
        TIKA-1513: add mime detection and parser for DBF files. Thanks to Nick (tallison: rev 8d24e07fb1245de0e151e9ce3fd516651db1d989)

        • tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/dbf/DBFFileHeader.java
        • tika-parser-bundles/tika-parser-office-bundle/src/test/java/org/apache/tika/module/office/BundleIT.java
        • tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/dbf/DBFParserTest.java
        • tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
        • CHANGES.txt
        • tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/dbf/DBFRow.java
        • tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/dbf/DBFReader.java
        • tika-parser-modules/tika-parser-office-module/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
        • tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/dbf/DBFCell.java
        • tika-test-resources/src/test/resources/test-documents/testDBF_gb18030.dbf
        • tika-test-resources/src/test/resources/test-documents/testDBF.dbf
        • tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/dbf/DBFParser.java
        • tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/dbf/DBFColumnHeader.java
        nicholasc Nick C added a comment -

        I wasn't able to find a way to detect the dbt files. I did find some example/test dbf/dbt files here: http://www.clicketyclick.dk/databases/xbase/index.shtml.en
        There were also some non-0x03 files in common crawl (000075371.dbf, 000543045.dbf, 000606319.dbf, 001674260.dbf, 002135562.dbf) and in those example files.

        I'm not sure of the best way to handle the variants. Maybe have the DBFParser stick it in the metadata (something like Application name?).

        hudson Hudson added a comment -

        SUCCESS: Integrated in tika-2.x #100 (See https://builds.apache.org/job/tika-2.x/100/)
        TIKA-1513: add mime detection and parser for DBF files. Thanks to Nick (tallison: rev 8d24e07fb1245de0e151e9ce3fd516651db1d989)

        • tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/dbf/DBFParser.java
        • tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/dbf/DBFReader.java
        • tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/dbf/DBFFileHeader.java
        • tika-test-resources/src/test/resources/test-documents/testDBF.dbf
        • tika-test-resources/src/test/resources/test-documents/testDBF_gb18030.dbf
        • CHANGES.txt
        • tika-parser-modules/tika-parser-office-module/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
        • tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/dbf/DBFColumnHeader.java
        • tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/dbf/DBFParserTest.java
        • tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/dbf/DBFRow.java
        • tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
        • tika-parser-bundles/tika-parser-office-bundle/src/test/java/org/apache/tika/module/office/BundleIT.java
        • tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/dbf/DBFCell.java
        tallison@mitre.org Tim Allison added a comment -

        Thank you.

        For now, I've added a parameter to the mimetype: application/x-dbf; dbf_version=FoxBASE_plus_with_memo

        Happy to change that if there's consensus. Nick Burch, any recommendations?
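
        As a rough illustration of that parameter (not the committed DBFParser code), the Java sketch below maps a file's first byte to a label and appends it to the base type. The class and method names are hypothetical. The byte meanings follow commonly circulated dbf format notes, the label spellings other than FoxBASE_plus_with_memo are invented here, and pairing 0x83 with that label is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch only: derive a mime string with a dbf_version parameter from the first byte. */
final class DbfVariantSketch {

    private static final String BASE = "application/x-dbf";

    // Partial, illustrative table; labels are assumptions apart from the one quoted above.
    private static final Map<Integer, String> LABELS = new LinkedHashMap<>();
    static {
        LABELS.put(0x03, "FoxBASE_plus_without_memo");
        LABELS.put(0x83, "FoxBASE_plus_with_memo");
        LABELS.put(0x30, "Visual_FoxPro");
        LABELS.put(0x8B, "dBASE_IV_with_memo");
        LABELS.put(0xF5, "FoxPro_2x_with_memo");
    }

    /** Returns the base type alone when the version byte is not in the table. */
    static String mimeFor(byte versionByte) {
        String label = LABELS.get(versionByte & 0xFF);
        return label == null ? BASE : BASE + "; dbf_version=" + label;
    }

    public static void main(String[] args) {
        System.out.println(mimeFor((byte) 0x83));
        System.out.println(mimeFor((byte) 0x30));
    }
}
```

        Falling back to the bare type for unknown bytes keeps detection and variant naming separate, so a new variant only needs a new table entry.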

        gagravarr Nick Burch added a comment - - edited

        I haven't read much on the format, but I'd be tempted to have that more like application/x-dbf; vendor=FoxBASE; type=plus_with_memo. Or, to keep it more in line with the BDB / PE / DITA types, maybe application/x-dbf; format=FoxBASE; type=plus_with_memo or application/x-dbf; format=plus_with_memo; vendor=FoxBASE (depending on what the actual variations are).

        tallison@mitre.org Tim Allison added a comment -

        Ivan Ryndin, would you mind if we added your test files (tir_im.dbf, gds_im.dbf, texto*) to our unit tests?

        hudson Hudson added a comment -

        SUCCESS: Integrated in tika-trunk-jdk1.7 #1001 (See https://builds.apache.org/job/tika-trunk-jdk1.7/1001/)
        TIKA-1513 – update mime type according to Nick Burch's recommendation, (tallison: rev dcaeccbab69519811e0cdf349873ce2b51e6ca10)

        • tika-parsers/src/test/java/org/apache/tika/parser/dbf/DBFParserTest.java
        • tika-parsers/src/main/java/org/apache/tika/parser/dbf/DBFReader.java
        • tika-parsers/src/main/java/org/apache/tika/parser/dbf/DBFParser.java
        hudson Hudson added a comment -

        FAILURE: Integrated in tika-2.x-windows #7 (See https://builds.apache.org/job/tika-2.x-windows/7/)
        TIKA-1513 – update mime type according to Nick Burch's recommendation, (tallison: rev 15ec358c44867adc44ab0431960d565b3d8a3e2c)

        • tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/dbf/DBFReader.java
        • tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/dbf/DBFParser.java
        • tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/dbf/DBFParserTest.java
        hudson Hudson added a comment -

        UNSTABLE: Integrated in tika-2.x #103 (See https://builds.apache.org/job/tika-2.x/103/)
        TIKA-1513 – update mime type according to Nick Burch's recommendation, (tallison: rev 15ec358c44867adc44ab0431960d565b3d8a3e2c)

        • tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/dbf/DBFParser.java
        • tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/dbf/DBFReader.java
        • tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/dbf/DBFParserTest.java

          People

          • Assignee:
            tallison@mitre.org Tim Allison
            Reporter:
            tallison@mitre.org Tim Allison
          • Votes:
            0
            Watchers:
            8
