Tika / TIKA-1302

Let's run Tika against a large batch of docs nightly

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: cli, general, server
    • Labels:
      None

      Description

      Many thanks to Lewis John McGibbney for TIKA-1301! Once we get nightly builds up and running again, it might be fun to run Tika regularly against a large set of docs and report metrics.

      One excellent candidate corpus is govdocs1: http://digitalcorpora.org/corpora/files.

      Any other candidate corpora?
      William Palmer, have anything handy you'd like to contribute?
      http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite

        Activity

        tallison@mitre.org Tim Allison added a comment -

        Thank you, Julien Nioche! I'm unpacking and staging now.

        jnioche Julien Nioche added a comment -

        FYI have extracted data from the CommonCrawl dataset using Behemoth and put that on the server. See http://digitalpebble.blogspot.co.uk/2014/11/generating-test-corpus-for-apache-tika.html for a description of the steps. Roughly 220GB of compressed data, 2M documents of all mime-types, mostly non HTML.
        Tim Allison please let me know if you have any problems with the data

        chrismattmann Chris A. Mattmann added a comment -

Sure Tim, I'll help get the scientific data files for the corpus. Paul Zimdars, can you help here? We want to transfer data off our Amazon box and onto our new VM here for Tika, donated by RackSpace. Tim Allison has the details.

        anjackson Andrew Jackson added a comment -

We have two more sets of data. One is the same as the 1996-2010 stuff, but from 2010 to April 2013, and for each item a copy can generally be accessed via the Internet Archive. We are planning to extend our indexing to the entire 1996-2013 dataset soon, but in reality it's going to be a few months yet due to technical difficulties and other priorities. The second set of data runs from 2013 onwards and, due to the legal constraints on that material, cannot be made available. However, for the next year or two, most of it will still be available on the live web, so that's the fallback option. That material has been indexed (although with an older Tika version), but we're going to re-index that too shortly, so we should also be able to make that available. (n.b. 'shortly' still means weeks or months!)

        Both of these data sets are large and contain more large files. There were c. 2 billion resources in the 1996-2010 chunk, and there are 1.5-2 billion in the 2010-2013 chunk, and over 2 billion per year since then, and in contrast to the early material, we do not limit the size per resource. So that should be interesting.

However, it would be good to run against a broader range of material, given that I stop Tika from recursively processing ZIPs etc. and that web archives are rather weak on A/V files, system files, software, etc. I'm not aware of a good A/V corpus, but on the systems and software side, there are the system images also held at digitalcorpora.org and the various files used by a Red Hat dev to regression-test the 'file' command. There is also this small corpus of example files that I have been contributing to lately, the evolt browser archive and the disktype filesystem image samples.

        jnioche Julien Nioche added a comment -

Sure, will get back to you re: details of scp when I have the data ready.

        tallison@mitre.org Tim Allison added a comment - - edited

        Looks like I'll need to rm govdocs1 zips to clear some space or link another drive!

Julien Nioche, near term, would you be willing to scp some files to the VM we're building for this? Longer term, once we get the process running in a conventional environment, it'd be great to move to Hadoop.

        Chris A. Mattmann, same with you?

Is a 300GB-ish sample for each corpus reasonable?

        chrismattmann Chris A. Mattmann added a comment -

How about images and scientific data? We are actively crawling NSF and NASA Polar data sites; see:

        https://github.com/NSF-Polar-Cyberinfrastructure/datavis-hackathon/issues/1

We have some of this data in Amazon S3 buckets and would easily be able to share it. Great work, Tim.

        jnioche Julien Nioche added a comment -

Hi Tim Allison,
It would be easy to do that with Behemoth. I'm not sure CC contains many multimedia files, but it will certainly have the other types you mentioned. We could either dump the content of the URLs to an archive to process with Tika later or do the Tika parsing with Behemoth as well.

        tallison@mitre.org Tim Allison added a comment - - edited

        Andrew Jackson, I'm attaching some summary stats on the exceptions file you posted. Thank you for sharing.

        In these summary stats, I took the literal exception message, and then I also pared it down to the chunk of text before the first ":". Without the full stacktrace, this will conflate exceptions, but it still might be useful.

        I'm just getting started on the tika-eval code, but one of the things I've run into is that the literal exception message can be problematic if the task is to bin and count exception causes. What I'm currently doing is truncating the message as I did with your data and then running group by on the full stacktrace. One limitation of this, though, is that we can't easily compare exceptions across different versions of the software because line numbers are included, and if one changes, a comparison of "group by" output fails.
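
For illustration, here's a rough sketch (hypothetical helper names, not the actual tika-eval code) of the message truncation and line-number stripping described above:

import java.io.PrintWriter;
import java.io.StringWriter;

public class ExceptionBinner {

    // Keep only the text before the first ':' in the exception's toString(),
    // e.g. "org.xml.sax.SAXParseException: Open quote is expected..." -> "org.xml.sax.SAXParseException"
    static String truncateMessage(Throwable t) {
        String msg = String.valueOf(t);
        int colon = msg.indexOf(':');
        return colon > -1 ? msg.substring(0, colon) : msg;
    }

    // Render the stack trace and strip line numbers, so that traces from different
    // Tika versions can still be grouped even when a line number shifts.
    static String normalizedTrace(Throwable t) {
        StringWriter sw = new StringWriter();
        t.printStackTrace(new PrintWriter(sw));
        return sw.toString().replaceAll("\\(([^()]+\\.java):\\d+\\)", "($1)");
    }
}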

        On the SAX exceptions, the XML parser accounted for nearly a quarter of the exceptions on govdocs1 with Tika 1.7-SNAPSHOT. Apologies for the repetition...in tika-server, it looks like we've hard-coded the selection of the more forgiving html parser instead of the XML parser. Depending on your use case, that change in your Tika config might make sense.

        On another note, with govdocs1, we have very few modern pdfs, (ppt|doc|xls)[xm], rtf, msg, open office and multimedia files...Other Tikis, what other formats do we need? I might be willing to crawl for docs, but I don't have a good starting point/list of links, and the search engine APIs aren't as generous as they used to be. So, do you happen to have link data fresher than 2010, by chance? Would you be willing to share a list of links or is it publicly available? Or Julien Nioche, how easy would it be to export a few hundred thousand of those file types from CommonCrawl?

        tallison@mitre.org Tim Allison added a comment - - edited

HPC is way beyond the current status of tika-batch, which is initially aimed at conventional/single-box computing. I heartily welcome tika-batch-hadoop and any other tika-batch-HPC packages!

If you do want to join the effort on tika-batch, please do! I need plenty of help with code review, unit tests, usability and edge case (i.e., bug) discovery. I'd also love to halve the amount of code while keeping the robustness, extensibility and logging.

You can grab my dev version of tika-batch from my GitHub fork. See some background on the wiki. I finished an initial integration with tika-app, and you should be able to run tika-app with:

        java -jar tika-app.jar <srcDirectory>
        

        That will iterate through the srcDirectory and output files in a directory named "output" with a mirror of the srcDirectory's structure. This sounds underwhelming, I know, but the code is robust against OOM and permanent hangs, it is multi-threaded, and the user should be able to interrupt the process (and child process) gracefully.

There are lots of command-line arguments available. I'm going to update the usage wiki shortly, but the usual -? from the app will give you some of the options. I've updated the TikaBatchUsage wiki just now. Let me know when you have questions.

        tpalsulich Tyler Palsulich added a comment -

        I just got access to an HPC cluster at NYU. How are you running Tika against the govdocs corpus, Tim? I'm downloading it right now and would like to reproduce your results.

        anjackson Andrew Jackson added a comment - - edited

        Tim Allison I've created a download folder on our own site, and included a dump of about 1/8th of the SAX errors, here: http://www.webarchive.org.uk/datasets/ukwa.ds.2/for-tika/

        Looking through the SAX exceptions, they do seem to be from resources that are identified as XML (application/*xml) by Tika. i.e. the exceptions do not seem to be coming from malformed HTML, which is consistent with the standard Tika configuration you described above (which I can confirm is what we ran with).

        Unfortunately, I can't recover the full stack traces from that run, and it's not clear if we'll be able to do that in the future because of the way we're doing the indexing, but we'll look at it and hopefully be able to record the full error in the future. For now, you'll have to re-run the source item through Tika to reproduce the error - sorry about that.

        tallison@mitre.org Tim Allison added a comment -

Andrew Jackson, the Google Docs link is down at the moment, so I can't see the full doc. If there is any way to capture the full stacktrace so that we can compare with our govdocs1 runs, that would be fantastic. You can see our current output format comparing two versions of PDFBox over on TIKA-1442. This is ongoing work (from my perspective), and there's no need to rush. Whichever option is easier for you...thank you for sharing!

"I don't think we changed the parse configuration significantly, so it seems HTML and XHTML and XML should all have gone through the HtmlParser (I'm not 100% sure about this, and will try to check)."

        Y, if you could check, I'd be interested. I think the default behavior would be to send XML through the DcXMLParser, which is far stricter than the default HtmlParser. You can see by our choice on tika-server, though, that at least one dev prefers to have our HtmlParser handle xml.

        Thank you, again!

        chrismattmann Chris A. Mattmann added a comment -

I'd say extract the errors; we'd appreciate them. Thank you, Andrew Jackson.

        anjackson Andrew Jackson added a comment -

        Shall I go ahead and extract the XML errors? Or would you rather I waited until we've re-run with the new version that will catch the permanent hangs and regenerate all the data?

        anjackson Andrew Jackson added a comment - - edited

        Okay, so the c.300,000 exceptions are here: https://www.dropbox.com/s/ka19fguaxflp725/parse_errors.csv.gz?dl=0 - let me know if you'd like it placed elsewhere (it's 14MB of compressed CSV).

        This conversation has helped me spot a gap in our code. We currently do a Tika.detect() before we do a Tika.parse(), and only do the latter if the former succeeded. Sadly, the version of the code that I used to generate this data did not record the Tika exception for the .detect() step, only the .parse() step. This will explain why there are no hung-thread events in this result set - the interrupted .detect() was not recorded properly. We'll be re-running this scan soonish, so I'll make sure the next version records all the exceptions. IIRC, from looking at the MIME types, the permanent hangs were mostly ZIPs, Office documents, and maybe some PDFs.
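
For clarity, the detect-then-parse flow (and where both sets of exceptions need to be recorded) looks roughly like this; a sketch using the org.apache.tika.Tika facade, not our actual indexing code:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.Tika;

public class DetectThenParse {

    private final Tika tika = new Tika();

    // Returns the first exception hit in either step, or null if both succeeded.
    public Exception process(Path file) {
        String mediaType;
        try (InputStream in = Files.newInputStream(file)) {
            mediaType = tika.detect(in);             // step 1: detection
        } catch (Exception e) {
            return e;                                // record detect() failures too
        }
        try (InputStream in = Files.newInputStream(file)) {
            String text = tika.parseToString(in);    // step 2: parsing
            // ... send text + mediaType to the indexer here ...
        } catch (Exception e) {
            return e;                                // record parse() failures
        }
        return null;
    }
}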

        Note that the CSV includes the Content-Type from the .detect() step, and this should indicate which module was run on the resource (i.e. whatever the Tika 1.5 mapping was for that MIME type). I don't think we changed the parse configuration significantly, so it seems HTML and XHTML and XML should all have gone through the HtmlParser (I'm not 100% sure about this, and will try to check).

        I'm not sure it's worth giving you all the SAX exceptions, as there are a lot of repeats of the same problems. I think a random sample of about 50,000 should be plenty. Does that sound okay to you?

        EDIT: Oh, and I meant to say, I'm glad to hear about Giuseppe Totaro and Tim Allison's efforts to run this on GovDocs, and would be interested in comparing results. We already publish format profile data about web archives, and would love to have more data to refer to.

        chrismattmann Chris A. Mattmann added a comment -

        Andrew Jackson thanks for sharing. Giuseppe Totaro has been working in this area and is currently running Tika in an HPC environment against govdocs (as is Tim Allison). It would be great to coordinate here in Tika. Thanks for sharing this.

        willp-bl William Palmer added a comment -

        
        I have left the British Library (as of 20th October 2014). Please contact Maureen.Pennock@bl.uk if you need to contact someone.

        Any FOI requests should be sent to FOI-Enquiries@bl.uk.


        tallison@mitre.org Tim Allison added a comment - - edited

That would be a fantastic resource. Thank you for sharing! We could do a bit of munging to prioritize the most common exceptions in dependencies.

        Your 0.1% exception rate is smaller than the 0.7% exception rate I'm finding on the govdocs1 corpus, but in the same ballpark. Interesting.

        Do you know how many permanent hangs you had and can you identify those files easily enough? I had about 6 in the govdocs1 corpus.

        Thank you!

        P.S. On the SAXParseExceptions...did those come from the XMLParser or from the HtmlParser? I recently discovered that we hardcode an override in TikaResource within tika-server:

         parsers.put(MediaType.APPLICATION_XML, new HtmlParser());
        

        Not sure that we should hardcode that, but it does make sense to use that configuration!
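
If you want the same behaviour in your own code rather than via tika-server, one option (a sketch, not necessarily a recommended configuration) is to override the media-type-to-parser map on an AutoDetectParser:

import java.util.Map;

import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.html.HtmlParser;

public class LenientXmlParsing {
    public static AutoDetectParser buildParser() {
        AutoDetectParser parser = new AutoDetectParser();
        // Route application/xml to the more forgiving HtmlParser, as tika-server does.
        Map<MediaType, Parser> parsers = parser.getParsers();
        parsers.put(MediaType.APPLICATION_XML, new HtmlParser());
        parser.setParsers(parsers);
        return parser;
    }
}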

        anjackson Andrew Jackson added a comment -

I have 2,358,167 errors from one collection (2 billion resources), but the majority are SAXParseExceptions. It's made up of UK web archive content from 1996-2010, so there's lots of broken HTML/XML in there. If I strip out the SAXParseExceptions, there are just 317,548 miscellaneous errors, which are perhaps more interesting.

        Here's an example including the SAX exceptions:

        wayback_date,url,content_length,content_type_tika,parse_error
        20100713041445,http://www.expedia.co.uk:80/pub/agent.dll/qscr=dspv/nojs=1/htid=2737187,org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
        20091017141202,http://www.expedia.co.uk:80/pub/agent.dll/qscr=dspv/nojs=1/htid=34830/crti=4/hotel-pictures,"org.xml.sax.SAXParseException: Open quote is expected for attribute ""ID"" associated with an  element type  ""COMMENT""."
        20091017143741,http://www.madfun.co.uk:80/-10?ref=31,org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
        20061020021825,http://reservations.talkingcities.co.uk:80/nexres/hotels/map_hotels.cgi?hid=10055548&map_only=yes&type=overview,org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
        20061020022224,http://www.ravensportal.co.uk:80/forum/index.php?PHPSESSID=1688184d9bb881cfc73600b1670ecaf5&amp;type=rss;action=.xml,org.xml.sax.SAXParseException: The character reference must end with the ';' delimiter.
        20101227142905,http://www.etc-online.co.uk:80/style4.asp?pn=courses&sn=26,org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
        20060926015856,http://www.qca.org.uk/4412.html,"org.xml.sax.SAXParseException: The entity ""nbsp"" was referenced\, but not declared."
        20040827075658,http://users.ox.ac.uk:80/~sedm1731/Work/Ex%20parte%20St%20Germain.doc,java.lang.ArrayIndexOutOfBoundsException: -1
        20030124193820,http://www.mgcars.org.uk:80/cgi-bin/gen5?runprog=porter&cov=&mode=buy&o=4854130936&code=9123&cu=&,"org.xml.sax.SAXParseException: The element type ""META"" must be terminated by the matching end-tag ""</META>""."
        20100121205831,http://www.epupz.co.uk:80/clas/viewdetails.asp?view=307389,org.xml.sax.SAXParseException: The entity name must immediately follow the '&' in the entity reference.
        

        ...and for the others...

        wayback_date,url,content_length,content_type_tika,parse_error
        20100928070438,http://redtyger.co.uk/discuss/projectexternal.php,7524,application/rss+xml,java.lang.NullPointerException: null
        20040827075658,http://users.ox.ac.uk:80/~sedm1731/Work/Ex%20parte%20St%20Germain.doc,44997,application/msword,java.lang.ArrayIndexOutOfBoundsException: -1
        20060303154606,http://www.dfes.gov.uk:80/rsgateway/DB/SFR/s000286/sfr37-2001.doc,562004,application/msword,java.lang.IllegalArgumentException: Position 698368 past the end of the file
        20041225033311,http://members.lycos.co.uk:80/worldofradio/distance.pdf,57891,application/pdf,org.apache.pdfbox.exceptions.CryptographyException: Error: The supplied password does not match either the owner or user password in the document.
        20041121095540,http://scom.hud.ac.uk:80/scomzl/conference/chenhua/040528_01E/PDP2148.pdf,191115,application/pdf,"java.io.IOException: Error: Expected a long type\, actual='25#0/'"
        20041121095849,http://scom.hud.ac.uk:80/scomzl/conference/chenhua/040528_01E/SER2549.pdf,157148,application/pdf,java.util.zip.DataFormatException: oversubscribed literal/length tree
        20041121100005,http://scom.hud.ac.uk:80/scomzl/conference/chenhua/040528_01E/MSV_Foreword.pdf,12773,application/pdf,java.util.zip.DataFormatException: oversubscribed dynamic bit lengths tree
        20060925090249,http://www2.rgu.ac.uk/library_edocs/resource/exam/0405engineering/EN3581%20OFFSHORE%20ENGINEERING.pdf,1684742,application/pdf,org.apache.pdfbox.exceptions.CryptographyException: Error: The supplied password does not match either the owner or user password in the document.
        20060925091406,http://www2.rgu.ac.uk/library_edocs/resource/exam/0304engineering/EE31060304s1.pdf,149238,application/pdf,org.apache.pdfbox.exceptions.CryptographyException: Error: The supplied password does not match either the owner or user password in the document.
        20040612212128,http://www.swhst.org.uk:80/Linked%20Files/spr%20contact%20addresses.xls,23040,application/vnd.ms-excel,org.apache.poi.EncryptedDocumentException: Default password is invalid for docId/saltData/saltHash
        20051111183952,http://freeweb.co.uk:80/show_nw.php?ref=258&target=B&show=aff&PHPSESSID=a150a130c58fcea048866fb965ef7dfb,232436,text/html; charset=iso-8859-1,org.apache.tika.sax.SecureContentHandler$SecureSAXException: Suspected zip bomb: 100 levels of XML element nesting
        20071025140555,http://www.honleyhigh.kirklees.sch.uk/MFL/MFL_Links/PowerPoint%20Presentations/German/Geryear-9-future-tense.ppt,2664960,application/vnd.ms-powerpoint,"org.apache.poi.hslf.exceptions.OldPowerPointFormatException: Based on the Current User stream\, you seem to have supplied a PowerPoint95 file\, which isn't supported"
        20071207004337,http://www.jisc.org.uk/uploaded_documents/e-port-brief.ppt,155136,application/vnd.ms-powerpoint,java.lang.ArrayIndexOutOfBoundsException: 20
        

The first two columns identify the item. The next two are the size of the item in bytes and the result of using Tika to identify the format (.detect only, no parse). The last column contains the first line of the parse exception(s).

        Note that to download the original item, you can get them from the Internet archive using this template:

        http://web.archive.org/web/{wayback_date}/{url}
        

        i.e. for the last exception listed above, you can download the item at: http://web.archive.org/web/20071207004337/http://www.jisc.org.uk/uploaded_documents/e-port-brief.ppt
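
If it helps, a minimal sketch of fetching one of these items programmatically (identifiers taken from the last row above; no error handling):

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class WaybackFetch {
    public static void main(String[] args) throws Exception {
        String waybackDate = "20071207004337";
        String url = "http://www.jisc.org.uk/uploaded_documents/e-port-brief.ppt";
        // Template: http://web.archive.org/web/{wayback_date}/{url}
        URL archived = new URL("http://web.archive.org/web/" + waybackDate + "/" + url);
        try (InputStream in = archived.openStream()) {
            Files.copy(in, Paths.get("e-port-brief.ppt"), StandardCopyOption.REPLACE_EXISTING);
        }
    }
}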

It might take me a while to generate the full output for the 2.3 million, so I'll try to pull out the 300 thousand other errors first. Our Solr index is having some performance issues, so it might be a bit slow.

        kkrugler Ken Krugler added a comment -

        Andrew - that sounds amazing! Could you provide an example of such an exception, so we could see what information is currently being captured? And do you have any idea how many (of the 4B) are failing, and thus the size of the exception list? Thanks.

        anjackson Andrew Jackson added a comment -

        At the UK Web Archive we run Apache Tika over all our collections (it's been run over about 4 billion resources so far). We record the results in Apache Solr, to act as a search facet, and we also collect the Exceptions that are thrown when Tika fails. We can't make the content available to you directly, but perhaps there are datasets we can produce that would be useful to you? e.g. would a list of the exceptions that we've seen (along with the URL to the resource that caused the exception) be of interest?

        tallison@mitre.org Tim Allison added a comment -

        I just transitioned development on TIKA-1302 subtasks (TIKA-1330 and TIKA-1332) to my fork on github under the TIKA-1302 branch (https://github.com/tballison/tika/tree/TIKA-1302).

        tpalsulich Tyler Palsulich added a comment -

        Hi Lewis John McGibbney and Tim Allison. I'm definitely interested in helping out with these issues. I'll read up and/or comment on them over the next few days.

        lewismc Lewis John McGibbney added a comment -

I would love to work with Tyler Palsulich to address the issues as above. If we could address this in the near future, it would be a large step forward for Tika's public exposure and would enable further understanding of how to embed Tika in applications based on the REST API and WebService endpoint. Tyler Palsulich, please state whether these issues are of interest to you.

        tallison@mitre.org Tim Allison added a comment -

        Agreed.

        If there's a grad student with some time on his/her hands interested in helping out on this issue, there's still plenty to do. Especially on TIKA-1332.

        lewismc Lewis John McGibbney added a comment -

        Tyler Palsulich

"So, we get the nightly build running, then we add this on top?"

Not quite. The aim of the VM established in INFRA-7751 is to get a web application/service running there. This requires work on both TIKA-894 and TIKA-1269.

        tpalsulich Tyler Palsulich added a comment -

        Are there any updates with this? We have the VM we need for TIKA-1301 (INFRA-7751). So, we get the nightly build running, then we add this on top?

        hudson Hudson added a comment -

        SUCCESS: Integrated in tika-trunk-jdk1.6 #29 (See https://builds.apache.org/job/tika-trunk-jdk1.6/29/)
        fix potential null pointer exception in PDFParser; found while working on TIKA-1302 (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1600996)

        • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
        hudson Hudson added a comment -

        SUCCESS: Integrated in tika-trunk-jdk1.7 #29 (See https://builds.apache.org/job/tika-trunk-jdk1.7/29/)
        fix potential null pointer exception in PDFParser; found while working on TIKA-1302 (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1600996)

        • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
        chrismattmann Chris A. Mattmann added a comment -

        +1 this sounds good to me, Tim.

        tallison@mitre.org Tim Allison added a comment - - edited

Y, that's an important question. It all depends on the size of the corpus and what we want for processing time.

        Let's assume we start with govdocs1 or a sample of it.

        Complete back of envelope...

        On my laptop (4 cores with -Xmx1g), it takes a multithreaded indexer ~40 seconds to index 1000 files from govdocs1 (let's assume the time to index is roughly equivalent to the time it'll take to write out the diagnostic stuff we'll want to record for each file).

        That would be 10k files in 6.6 minutes, 100k files in a bit more than an hour and 1M files in 11 hours.

So, if we wanted to start small, we could start with 100k. The full govdocs1 takes up 470GB. A 100k sample would take up roughly 47GB.

        We'd want probably (ballpark) 10x input corpus size to store the output so that we can compare different versions of Tika. So, 0.5 TB. Let's double that for some growth: 1 TB.

        So, with a modest 4 cores, let's say 4 GB RAM, and 1 TB of storage, we could run Tika against 100k files in a bit more than an hour. Add another few minutes to compare output for comparison statistics.

        ***These numbers are based on a purely in-memory run. We'll probably want to run against a server (not the public one, of course) so that'll add some to the time.

        Do these numbers jibe with what others are experiencing?

        The big gotcha, of course, is that we'll want to harden the server and/or create a server daemon to restart the server(s) for OOM and infinite hangs. But I think those features are badly needed and this project will give good motivation for these improvements.

        chrismattmann Chris A. Mattmann added a comment -

Tim Allison this is a good question – the VM that Lewis set up I believe is so that anyone can try out Tika via the JAX-RS service. I would imagine if we do the large batch of docs nightly test (which I think would be awesome, btw) we'll need to figure out the specs we would need and then compare them to the VM that Lewis just had set up. How much RAM, CPU, disk, etc. do you think we'll need, Tim?

        tallison@mitre.org Tim Allison added a comment -

        Chris A. Mattmann, Nick Burch, Lewis John McGibbney and All,
Would it be OK to start trying to work on this on the VM that Lewis just had set up for TIKA-1301? I figure we can take baby steps on that, and if this kind of process turns out to be useful to the community and we need more resources, then we can set up a separate VM.

        tallison@mitre.org Tim Allison added a comment - - edited

        Ok, I think we might be talking about different things. For example, when I pull the metadata out of 002454 with Tika 1.5, I see:

        [{
"dcterms:modified":["2004-05-26T15:31:39Z"],
        "meta:creation-date":["2004-05-26T15:31:31Z"],
        "meta:save-date":["2004-05-26T15:31:39Z"],
        "dc:creator":["Slimjimbob"],
        "Last-Modified":["2004-05-26T15:31:39Z"],
        "Author":["Slimjimbob"],
        "dcterms:created":["2004-05-26T15:31:31Z"],
        date":["2004-05-26T15:31:39Z"],
        "modified":["2004-05-26T15:31:39Z"],
        "creator":["Slimjimbob"],
        "xmpTPg:NPages":["1"],
        "Creation-Date":["2004-05-26T15:31:31Z"],
        "title":["CoverMay/June04.qxd"],
        "meta:author":["Slimjimbob"],
        "created":["Wed May 26 11:31:31 EDT 2004"],
        "producer":["Acrobat Distiller 5.00 for Macintosh"],
        "Content-Type":["application/pdf"],
        "xmp:CreatorTool":["QuarkXPress. 4.04: LaserWriter 8 8.7.1"],
        "Last-Save-Date":["2004-05-26T15:31:39Z"],
        "dc:title":["CoverMay/June04.qxd"]
        }]
        

        This includes more than is available here:
002454: http://digitalcorpora.org/cgi-bin/info.cgi?docid=002454
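
(For reference, metadata like the above can be pulled with something along these lines; a standard-usage sketch rather than my exact code, and the file path is illustrative.)

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class DumpMetadata {
    public static void main(String[] args) throws Exception {
        Metadata metadata = new Metadata();
        AutoDetectParser parser = new AutoDetectParser();
        try (InputStream in = Files.newInputStream(Paths.get("002454.pdf"))) {
            // -1 disables the write limit on the extracted text
            parser.parse(in, new BodyContentHandler(-1), metadata, new ParseContext());
        }
        for (String name : metadata.names()) {
            System.out.println(name + " = " + Arrays.toString(metadata.getValues(name)));
        }
    }
}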

        Are you saying that there is no metadata truth set against which to evaluate or are we using "metadata" to mean different things?

        Thank you again, and I look forward to seeing your paper!

        gostep Giuseppe Totaro added a comment -

        Hi Tim,

        I was referring to the metadata schema of each govdocs1 file. At http://digitalcorpora.org/corpora/files, you can read:

        The following metadata is provided for each of the files:

        The URL from which the file was downloaded.
        The date and time of the download.
        The search term that was used.
        The search engine that provided the document.
        The length and SHA1 of the file.
        A Simple Dublin Core for the file.

        Of course, when our paper is published I'll explain our work and the dataset in more detail.

        tallison@mitre.org Tim Allison added a comment - - edited

        Julien Nioche, very cool corpus. My dream would be to run Tika via Hadoop against a corpus that big, diverse, and noisy whenever there's a commit (or, better yet, let developers upload their mods as a tika-server, go for a cup of coffee, and come back to see the results). For initial steps, govdocs1 seems like a great start...perhaps we could include a random sample from CC downloaded to Apache servers, if that is consistent with both the Apache and CC licenses?

        Chris A. Mattmann, thank you for pointing out Giuseppe Totaro's work!

        Giuseppe Totaro, please post a link to your work on this issue when it is published. Are there any evaluation components that you'd like to contribute? Do you think there would be a way to share your datasets? And, finally, I'm not sure what you mean by all of the metadata being the same. I am just getting started with the govdocs corpus...when I open two different PDF files, they have different PDF versions, different authors, and different producers.

        gostep Giuseppe Totaro added a comment -

        Thank you Chris.

        I'm working with Tika on large sets of data. Govdocs1 is a good choice for testing Tika's performance against a large number of heterogeneous documents. Using Tika 1.4, I noticed that about 10% of the govdocs1 files are not parsed correctly.
        Unfortunately, every file in the govdocs1 corpus has the same metadata properties (http://digitalcorpora.org/corpora/files), so this corpus does not make it possible to realistically test metadata extraction.
        Once our paper is published in the proceedings, I will be happy to describe the results in more detail and share them with the community.
        We are now also working with other corpora that we constructed ourselves from realistic (unclassified) disk images.
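
        For reference, a rough way to reproduce that kind of failure-rate count over a local copy of the corpus — a minimal sketch, assuming the files sit under an illustrative /data/govdocs1 directory and reducing "not correctly parsed" to "the parse threw", which is cruder than whatever was actually measured — is:

        import java.io.InputStream;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.nio.file.Paths;
        import java.util.concurrent.atomic.AtomicInteger;
        import java.util.stream.Stream;

        import org.apache.tika.metadata.Metadata;
        import org.apache.tika.parser.AutoDetectParser;
        import org.apache.tika.parser.ParseContext;
        import org.apache.tika.sax.BodyContentHandler;

        public class ParseFailureCount {
            public static void main(String[] args) throws Exception {
                Path corpus = Paths.get("/data/govdocs1");   // illustrative location
                AutoDetectParser parser = new AutoDetectParser();
                // AtomicIntegers because the lambda below can only touch effectively-final locals
                AtomicInteger total = new AtomicInteger();
                AtomicInteger failed = new AtomicInteger();
                try (Stream<Path> paths = Files.walk(corpus)) {
                    paths.filter(Files::isRegularFile).forEach(p -> {
                        total.incrementAndGet();
                        try (InputStream in = Files.newInputStream(p)) {
                            // Discard the extracted text; we only care whether the parse throws
                            parser.parse(in, new BodyContentHandler(-1), new Metadata(), new ParseContext());
                        } catch (Throwable t) {
                            failed.incrementAndGet();
                        }
                    });
                }
                System.out.printf("%d of %d files failed to parse%n", failed.get(), total.get());
            }
        }

        A serious run would also want per-MIME-type breakdowns and timeouts for hanging parsers, but the overall shape is the same.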

        I hope this meets your interests.

        chrismattmann Chris A. Mattmann added a comment -

        GovDocs - Giuseppe Totaro from the University of Rome (PhD student there) has been working on this dataset and even has a paper to report out on. Giuseppe Totaro can you comment on this one?

        willp-bl William Palmer added a comment -

        Ross Spencer has made the openplanets format-corpus I mentioned above more usable, see https://github.com/ross-spencer/opf-format-corpus/

        Ref: https://twitter.com/beet_keeper/status/468626971337838593

        jnioche Julien Nioche added a comment -

        How large do you want that batch to be? If we are talking millions of pages, one option would be to use the Tika module of Behemoth on the CommonCrawl dataset. See http://digitalpebble.blogspot.co.uk/2011/05/processing-enron-dataset-using-behemoth.html for comparable work we did some time ago on the Enron dataset. Behemoth already has a module for ingesting data from CommonCrawl. This, of course, means having Hadoop up and running.

        Alternatively, it would be simple to extract the documents from the CC dataset onto the server's filesystem and use the TikaServer without Hadoop. I'm not sure what the legal implications of using these documents would be, though.
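
        For reference, a minimal sketch of that second option — assuming the extracted files sit under an illustrative /data/commoncrawl-extracted directory and that a tika-server instance is already listening on its default port, 9998 — could send each file to the server's /tika endpoint for plain-text extraction:

        import java.io.InputStream;
        import java.io.OutputStream;
        import java.net.HttpURLConnection;
        import java.net.URL;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.nio.file.Paths;
        import java.util.stream.Stream;

        public class TikaServerBatchClient {
            public static void main(String[] args) throws Exception {
                Path extracted = Paths.get("/data/commoncrawl-extracted");   // illustrative location
                try (Stream<Path> paths = Files.walk(extracted)) {
                    paths.filter(Files::isRegularFile).forEach(p -> {
                        try {
                            HttpURLConnection conn =
                                    (HttpURLConnection) new URL("http://localhost:9998/tika").openConnection();
                            conn.setRequestMethod("PUT");
                            conn.setDoOutput(true);
                            conn.setRequestProperty("Accept", "text/plain");
                            // Stream the raw file bytes as the PUT body
                            byte[] buf = new byte[8192];
                            try (InputStream in = Files.newInputStream(p);
                                 OutputStream out = conn.getOutputStream()) {
                                int n;
                                while ((n = in.read(buf)) != -1) {
                                    out.write(buf, 0, n);
                                }
                            }
                            // Non-200 responses typically indicate the server could not parse the file
                            System.out.println(p + " -> HTTP " + conn.getResponseCode());
                            conn.disconnect();
                        } catch (Exception e) {
                            System.out.println(p + " -> request failed: " + e);
                        }
                    });
                }
            }
        }

        Swapping /tika for /meta would return metadata instead of extracted text.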

        The beauty of using the CommonCrawl dataset is that, apart from volume, it is a good sample of the web, with all the weird and beautiful things the web contains (broken documents, large ones, etc.).

        willp-bl William Palmer added a comment -

        This one might be worth a look - https://github.com/openplanets/format-corpus - some of the files there are (intentionally) broken, and some are there as examples of format features (e.g. password-protected PDFs, embedded fonts, etc.). If the license is not clear enough for any files then please raise an issue; I'm sure people will be glad to help.

        Unfortunately I can't share any of the web content I describe using in that blog post.


          People

          • Assignee:
            Unassigned
          • Reporter:
            tallison@mitre.org Tim Allison
          • Votes:
            0
          • Watchers:
            14
