Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0, 1.15
    • Component/s: cli, general, server
    • Labels: None

      Description

      For this issue, we can start with code to gather statistics on each run (# of exceptions per file type, most common exceptions per file type, number of metadata items, total text extracted, etc). We should also be able to compare one run against another. Going forward, there's plenty of room to improve.
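As a rough illustration of the per-run statistics described above, here is a minimal sketch that tallies exceptions per file type and reports the most common one. The class `RunStats` and its methods are hypothetical names chosen for this example, not existing Tika code, and the input is assumed to be one (file type, exception name) pair per failed document.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Tika: per-run exception tallies.
final class RunStats {

    // file type -> (exception name -> count)
    private final Map<String, Map<String, Integer>> byType = new TreeMap<>();

    /** Records one exception seen while parsing a file of the given type. */
    void record(String fileType, String exceptionName) {
        byType.computeIfAbsent(fileType, k -> new TreeMap<>())
              .merge(exceptionName, 1, Integer::sum);
    }

    /** Total number of exceptions recorded for a file type (0 if none). */
    int exceptionCount(String fileType) {
        Map<String, Integer> counts =
                byType.getOrDefault(fileType, Collections.emptyMap());
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        return total;
    }

    /** Most frequent exception for a file type, or null if none was recorded. */
    String mostCommonException(String fileType) {
        Map<String, Integer> counts =
                byType.getOrDefault(fileType, Collections.emptyMap());
        return counts.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }

    public static void main(String[] args) {
        RunStats stats = new RunStats();
        stats.record("application/pdf", "IOException");
        stats.record("application/pdf", "IOException");
        stats.record("application/pdf", "TikaException");
        System.out.println(stats.exceptionCount("application/pdf"));
        System.out.println(stats.mostCommonException("application/pdf"));
    }
}
```

Keeping the tallies in sorted maps makes the output order stable, which should help when one run is later compared against another.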

        Activity

        tallison@mitre.org Tim Allison added a comment -

        To my mind, there are three families of things that can go wrong:

        1) Parser can fail
        1a) throw an exception
        1b) hang forever

        2) Fail to extract text and/or metadata from documents
        2a) nothing is extracted
        2b) some document components or attachments are not extracted: TIKA-1317 and TIKA-1228

        3) Extract junk (mojibake, too many spaces in PDFs, failure to add spaces between runs in .docx, etc.), in which case there are two options:
        3a) We can do better.
        3b) We can't...the document is just plain broken.

        We can easily count and compare 1). By easily, I mean that I haven't fully worked it out, but it should be fairly straightforward.

        Without a truth set or a comparison parser, we cannot easily measure 2a or 2b. For 2a, if there is no text, maybe there really is no text (image only pdfs or just a docx that contains images). For 2b, we're really out of luck without other resources.

        For 3), there's lots of room for work. In short, I'd think we'd want to calculate how "languagey" the extracted text is. Some indicators that occur to me:

        a) Type/token ratio or token entropy
        b) Average word length (with an exception for non-whitespace languages)
        c) Ratio of alphanumerics to total string length
        d) Analysis of language id confidence scores...if the string is long enough, you'd expect a langid component to return a very high score for the best language and then far lower scores for the 2nd and 3rd best languages. If the langid component returns flat scores, then that might be an indicator that something didn't go well.
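To make indicators a) through d) concrete, here is a minimal sketch of how they might be computed. Everything in it is an assumption for illustration: `LanguageyStats` is a hypothetical class, tokens are simply whitespace-delimited (so it does not handle the non-whitespace languages mentioned in b), and `langIdMargin` expects confidence scores from whatever language identifier is in use.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Tika: rough "languagey-ness" indicators.
final class LanguageyStats {

    /** Lowercases and splits on whitespace; blank input yields no tokens. */
    static String[] tokenize(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    /** (a) Distinct tokens divided by total tokens; 0 for empty text. */
    static double typeTokenRatio(String text) {
        String[] tokens = tokenize(text);
        if (tokens.length == 0) {
            return 0.0;
        }
        return (double) Arrays.stream(tokens).distinct().count() / tokens.length;
    }

    /** (a) Shannon entropy of the token distribution, in bits. */
    static double tokenEntropy(String text) {
        String[] tokens = tokenize(text);
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        double entropy = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / tokens.length;
            entropy -= p * (Math.log(p) / Math.log(2));
        }
        return entropy;
    }

    /** (b) Mean token length in UTF-16 chars; 0 for empty text. */
    static double averageWordLength(String text) {
        String[] tokens = tokenize(text);
        if (tokens.length == 0) {
            return 0.0;
        }
        int total = 0;
        for (String t : tokens) {
            total += t.length();
        }
        return (double) total / tokens.length;
    }

    /** (c) Share of code points that are letters or digits. */
    static double alphanumericRatio(String text) {
        int total = text.codePointCount(0, text.length());
        if (total == 0) {
            return 0.0;
        }
        long alnum = text.codePoints().filter(Character::isLetterOrDigit).count();
        return (double) alnum / total;
    }

    /** (d) Gap between the best and second-best langid score; small means "flat". */
    static double langIdMargin(double[] scores) {
        if (scores.length == 0) {
            return 0.0;
        }
        double[] sorted = scores.clone();
        Arrays.sort(sorted); // ascending, so the best score is last
        double second = sorted.length > 1 ? sorted[sorted.length - 2] : 0.0;
        return sorted[sorted.length - 1] - second;
    }

    public static void main(String[] args) {
        String text = "the quick brown fox jumps over the lazy dog";
        System.out.println("type/token ratio: " + typeTokenRatio(text));
        System.out.println("token entropy:    " + tokenEntropy(text));
        System.out.println("avg word length:  " + averageWordLength(text));
        System.out.println("alnum ratio:      " + alphanumericRatio(text));
    }
}
```

None of these numbers means much on its own; the idea would be to compare them across runs or against typical values for a given file type and language.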

        What do you think? Are there other things that can go wrong? What else should we try to measure, and in what way: supervised (not ideal), semi-supervised (better), or unsupervised (best)?

        thaichat04 Hong-Thai Nguyen added a comment -

        What you are describing sounds like functional tests for Tika. Tools such as Cucumber or FitNesse might make the tests more readable, and we could obtain a report as output?

        mkrio Matthias Krueger added a comment -

        It might be good to distinguish between the regression testing aspect of nightly runs and the "extraction gap discovery" aspect of running Tika against a large batch of previously untested docs.

        For regression testing it would be good to generate stats on a run and compare them with the last known "good" stats. These stats could include:

        • Number/distribution of detected mime types
        • Number of exceptions thrown per type of exception
        • Frequencies of metadata key-value pairs
        • Frequencies of different word lengths extracted from content (per file type)

        This could be run unsupervised, with the delta to the last known "good" run summarized in a daily report.
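The "delta to the last known good run" idea could be sketched roughly as follows. This is only an illustration under stated assumptions: `RunDelta` is a hypothetical class, and each run is assumed to have already been reduced to a map from some key (an exception type, a mime type, a metadata key) to a count.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of Tika: compares counts from two runs.
final class RunDelta {

    /**
     * Returns current minus baseline for every key whose count changed.
     * Keys missing from one run are treated as having a count of 0.
     */
    static Map<String, Integer> delta(Map<String, Integer> baseline,
                                      Map<String, Integer> current) {
        Set<String> keys = new TreeSet<>(baseline.keySet());
        keys.addAll(current.keySet());
        Map<String, Integer> changes = new TreeMap<>();
        for (String key : keys) {
            int diff = current.getOrDefault(key, 0) - baseline.getOrDefault(key, 0);
            if (diff != 0) {
                changes.put(key, diff);
            }
        }
        return changes;
    }

    public static void main(String[] args) {
        Map<String, Integer> lastGood = new HashMap<>();
        lastGood.put("IOException", 10);
        lastGood.put("TikaException", 4);

        Map<String, Integer> tonight = new HashMap<>();
        tonight.put("IOException", 12);
        tonight.put("TikaException", 4);
        tonight.put("SAXException", 1);

        // Only the keys that changed are reported.
        delta(lastGood, tonight).forEach(
                (key, diff) -> System.out.println(key + ": " + (diff > 0 ? "+" : "") + diff));
    }
}
```

Reporting only the non-zero differences keeps a daily report short when nothing has changed between runs.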

        Deeper analysis of extracted metadata and content (as in 2 and 3 of Tim's cases) sounds more like "gap discovery" which I guess would always need some supervision.

        tallison@mitre.org Tim Allison added a comment -

        In a personal communication, I asked Sergey Beryozkin for recommendations for handling static content in the JAX-RS framework. For the UI component of the eval code (how the user interacts with the results of the eval), is there an easy equivalent in JAX-RS that allows the user to browse a directory of files and click on desired files for download as easily as one can with Jetty's ResourceHandler?

        With permission, I'm posting/summarizing Sergey Beryozkin's responses. If anyone else has a recommendation leveraging the JAX-RS framework for dynamic data and still using something so easy as Jetty's ResourceHandler for static content, please let us know.

        Option 1:
        Handcode a JAX-RS handler that mimics Jetty's ResourceHandler
        > That can be done easily enough with JAX-RS, though, if you'd like to explore
        > this path; something like this, I guess:
        >

        import java.io.File;
        import java.net.URI;
        import java.util.LinkedList;
        import java.util.List;

        import javax.ws.rs.GET;
        import javax.ws.rs.Path;
        import javax.ws.rs.PathParam;
        import javax.ws.rs.Produces;
        import javax.ws.rs.core.Context;
        import javax.ws.rs.core.Response;
        import javax.ws.rs.core.UriInfo;

        @Path("eval")
        public class TikaEvaluation {
            @Context
            private UriInfo ui;

            // getResultFiles() is not defined in this sketch; it is assumed
            // to return the result files to be listed.
            @GET
            @Path("list")
            @Produces("text/html")
            public Response getListOfResultURIs() {
                List<URI> uris = new LinkedList<URI>();
                for (File f : getResultFiles()) {
                    uris.add(ui.getAbsoluteUriBuilder().path(f.getName()).build());
                }
                // the uris list now has a list of links to individual files
                // next we need to decide how to convert that to HTML
                // one option is to return the list as is and redirect that to
                // JSP, another option is to build a basic HTML string right here in the
                // method, another option is to register a MessageBodyWriter that will
                // convert the list into HTML
                // the individual links will be managed by the getFile() method

                return Response.ok(uris).build();
            }

            @GET
            @Path("list/{name}")
            // multiple media types must be passed as an array
            @Produces({"application/json", "multipart/mixed"})
            public Response getFile(@PathParam("name") String name) {
                ...
            }
        }


        Option 2:
        Run Jetty's ResourceHandler from the same embedded Jetty server that is hosting the JAX-RS code.
        > This link would probably be the best one: link

        > Tika JAX-RS server actually runs on top of Jetty right now too, but in
        > this case we have a direct Jetty server setup.
        >
        > The server registers a CXF servlet and Jetty handlers too. The CXF servlet
        > can also redirect to default handlers, such as a default handler for serving
        > static content. This is not needed if the result files are accessible
        > over a URI that does not overlap with a CXF servlet URI pattern.
        > In fact, I wonder if a Tika JAX-RS style of registration might also do?
        > If you register a CXF endpoint at /eval and the results are accessible
        > over /results, then it should work? Unless the Jetty ContentHandler is not
        > installed by default - then the linked-to code would definitely do.

        > The only possible downside here is that, as far as consistent URI
        > space management is concerned, we'd have one part of it (the static
        > resources) controlled natively by Jetty and the rest by JAX-RS, so it
        > can be trickier to provide support for searching the results or
        > enforcing common security rules (when/if needed).
        > That said, maybe it is not a real concern; it can always be removed
        > in the future if needed.

        Other options?

        tallison@mitre.org Tim Allison added a comment -

        I got a simple Jetty ResourceHandler up and running on the VM today, but it kept failing on large archive files (~250 MB). I set idleTimeout and stopTimeout to large values, but still had no luck.

        Has anyone had luck with Jetty's ResourceHandler and large files? How about jax-rs for files of that size?

        I notice that govdocs1 is using httpd. Perhaps we'll want separate static/archive server ports vs. active jax-rs browsing?

        I plan to start publishing static results of single runs and comparisons of runs over the next few weeks.

        tallison@mitre.org Tim Allison added a comment -

        I gave up on that, and we're now using httpd.

        The eval code currently exists as commandline calls. I'm using h2 as the backend database, which appears to be compatible with ASL 2.0. As with all development cycles, I started with a flat file, moved to an unfortunately complex db structure and will probably have to move to nosql if we want this to scale...but not yet...

        As above, there are two modes.
        1) Profile a single run
        a) run tika-app on a directory of files, output with -J -t (JSON representation of List<Metadata> with text as the content)
        b) run the profiling code, which populates an h2 db
        c) run xml-configured reports against the db

        2) Compare two runs
        a) run two versions of tika-app on a directory of files
        b) run the comparison code, which populates an h2 db
        c) run xml-configured reports against the db

        I've pretty much given up on the notion of automatic testing. A human has to look at the reports and make sense of them.

        Given the feedback I received at ApacheCon (egads, a year ago), I think I'd like to transition this code into Tika for 1.14.

        When the code is ready for review, I'll let y'all know. Any and all feedback on the reports to date would be great.

        tallison@mitre.org Tim Allison added a comment -

        I'm attaching the current version of the xml that drives the reporting from the db.

        tallison@mitre.org Tim Allison added a comment -

        Some more work is required, but I think tika-eval is getting close to being ready to commit.

        If anyone has a chance to review, code is on my github fork and the beginnings of wiki documentation are now up on our wiki.

        Thank you!

        tallison@mitre.org Tim Allison added a comment -

        Are there any licensing objections to adding a dependency in the tika-eval module for the H2 database? This is dual licensed MPL2.0 and EPL 1.0. These are both "weak copyleft" and should be ok if we document them according to https://www.apache.org/legal/resolved#category-b.

        As a side note, this dependency will only exist for the tika-eval module, not for any of the other modules.

        gagravarr Nick Burch added a comment -

        Apache Ignite seems to use H2, and a google of H2 + apache.org shows quite a few other projects with connectors to it at least.

        That said, there's also Apache Derby which might cover the same use-case

        tallison@mitre.org Tim Allison added a comment -

        Thank you, Nick Burch!

        tallison@mitre.org Tim Allison added a comment -

        That only took two years. #conferencedrivendevelopment

        The tika-eval module should be viewed as experimental and subject to fairly drastic changes.

        Give it a try and let's make it better!

        Documentation is available here: https://wiki.apache.org/tika/TikaEval

        hudson Hudson added a comment -

        UNSTABLE: Integrated in Jenkins build Tika-trunk #1198 (See https://builds.apache.org/job/Tika-trunk/1198/)
        TIKA-1332 – initial commit for tika-eval module. More work remains. (tallison: rev aa7a0c353362d56cb1b8e77297f0807626b0246c)

        • (add) tika-eval/src/test/java/org/apache/tika/eval/util/MimeUtilTest.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/ContrastStatistics.java
        • (add) tika-eval/src/test/resources/single-file-profiler-crawl-input-config.xml
        • (add) tika-eval/src/test/resources/test-dirs/raw_input/file3_attachBNotA.doc
        • (add) tika-eval/src/test/resources/log4j.properties
        • (add) tika-eval/src/test/resources/test-dirs/extractsA/file2_attachANotB.doc.json
        • (add) tika-eval/pom.xml
        • (add) tika-eval/src/test/resources/test-dirs/extractsB/file11_oom.txt.json
        • (add) tika-eval/src/test/java/org/apache/tika/eval/AnalyzerManagerTest.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/TokenStatistics.java
        • (add) tika-eval/src/test/resources/test-dirs/extractsB/file3_attachBNotA.doc.json
        • (add) tika-eval/src/main/java/org/apache/tika/eval/batch/DBConsumersManager.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/db/AbstractDBBuffer.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/db/ColInfo.java
        • (add) tika-eval/src/test/resources/test-dirs/raw_input/file5_emptyA.pdf
        • (add) tika-eval/src/test/java/org/apache/tika/eval/SimpleComparerTest.java
        • (add) tika-eval/src/test/resources/test-dirs/raw_input/file8_IOEx.pdf
        • (add) tika-eval/src/test/resources/test-dirs/extractsA/file1.pdf.json
        • (add) tika-eval/src/main/java/org/apache/tika/eval/db/H2Util.java
        • (add) tika-eval/src/test/resources/test-dirs/extractsB/file13_attachANotB.doc.txt
        • (add) tika-eval/src/test/java/org/apache/tika/eval/reports/ResultsReporterTest.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/ExtractComparer.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/CJKBigramAwareLengthFilterFactory.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/TokenCountPriorityQueue.java
        • (add) tika-eval/src/test/resources/commontokens/zh-cn
        • (add) tika-eval/src/main/resources/tika-eval-comparison-config.xml
        • (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/TokenCounter.java
        • (add) tika-eval/src/test/resources/test-dirs/extractsA/file12_es.txt.json
        • (add) tika-eval/src/test/resources/test-dirs/extractsA/file10_permahang.txt.json
        • (add) tika-eval/src/main/java/org/apache/tika/eval/io/ExtractReader.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/io/XMLLogReader.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/reports/XLSXHREFFormatter.java
        • (add) tika-eval/src/main/resources/META-INF/services/org.apache.lucene.analysis.util.TokenFilterFactory
        • (add) tika-eval/src/test/resources/single-file-profiler-crawl-extract-config.xml
        • (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/CommonTokenCountManager.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/db/DBUtil.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/TokenIntPair.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/AlphaIdeographFilterFactory.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/io/XMLLogMsgHandler.java
        • (add) tika-eval/src/test/resources/test-dirs/extractsA/file7_badJson.pdf.json
        • (add) tika-eval/src/test/resources/test-dirs/raw_input/file2_attachANotB.doc
        • (add) tika-eval/src/test/java/org/apache/tika/eval/ProfilerBatchTest.java
        • (add) tika-eval/src/test/resources/commontokens/en
        • (add) tika-eval/src/test/resources/test-dirs/extractsA/file4_emptyB.pdf.json
        • (add) tika-eval/src/test/java/org/apache/tika/eval/ComparerBatchTest.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/AnalyzerDeserializer.java
        • (add) tika-eval/src/test/java/org/apache/tika/eval/tokens/TokenCounterTest.java
        • (add) tika-eval/src/test/resources/test-dirs/batch-logs/batch-process-fatal.xml
        • (add) tika-eval/src/main/resources/tika-eval-profiler-config.xml
        • (add) tika-eval/src/test/resources/test-dirs/raw_input/file1.pdf
        • (add) tika-eval/src/test/resources/test-dirs/raw_input/file11_oom.txt
        • (add) tika-eval/src/main/java/org/apache/tika/eval/db/TableInfo.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/AnalyzerManager.java
        • (add) tika-eval/src/test/resources/commontokens/zh-tw
        • (add) tika-eval/src/test/resources/commontokens/es
        • (add) tika-eval/src/test/resources/test-dirs/raw_input/file9_noextract.txt
        • (add) tika-eval/src/main/java/org/apache/tika/eval/EvalFilePaths.java
        • (add) tika-eval/src/test/java/org/apache/tika/eval/db/AbstractBufferTest.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/TokenContraster.java
        • (add) tika-eval/src/main/resources/profile-reports.xml
        • (add) tika-eval/src/main/java/org/apache/tika/eval/reports/ResultsReporter.java
        • (add) tika-eval/src/test/resources/test-dirs/extractsA/file6_accessEx.pdf.json
        • (add) tika-eval/src/test/resources/test-dirs/extractsA/file8_IOEx.pdf.json
        • (add) tika-eval/src/test/java/org/apache/tika/eval/tokens/LuceneTokenCounter.java
        • (edit) CHANGES.txt
        • (add) tika-eval/src/test/java/org/apache/tika/eval/io/ExtractReaderTest.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/util/LanguageIDWrapper.java
        • (add) tika-eval/src/test/resources/test-dirs/extractsB/file1.pdf.json
        • (add) tika-eval/src/main/java/org/apache/tika/eval/db/Cols.java
        • (add) tika-eval/src/test/resources/test-dirs/raw_input/file6_accessEx.pdf
        • (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/CommonTokenResult.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/AbstractProfiler.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/batch/EvalConsumersBuilder.java
        • (add) tika-eval/src/test/resources/test-dirs/extractsA/file11_oom.txt.json
        • (add) tika-eval/src/main/resources/lucene-char-mapping.txt
        • (add) tika-eval/src/test/resources/test-dirs/extractsA/file5_emptyA.pdf.json
        • (add) tika-eval/src/main/java/org/apache/tika/eval/XMLErrorLogUpdater.java
        • (add) tika-eval/src/test/resources/test-dirs/extractsB/file6_accessEx.pdf.json
        • (add) tika-eval/src/test/resources/test-dirs/raw_input/file7_badJson.pdf
        • (add) tika-eval/src/test/java/org/apache/tika/MockDBWriter.java
        • (edit) LICENSE.txt
        • (add) tika-eval/src/main/resources/comparison-reports.xml
        • (add) tika-eval/src/test/resources/test-dirs/extractsB/file2_attachANotB.doc.json
        • (add) tika-eval/src/main/java/org/apache/tika/eval/batch/FileComparerBuilder.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/io/DBWriter.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/db/DBBuffer.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/reports/Report.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/reports/XSLXCellFormatter.java
        • (add) tika-eval/src/test/java/org/apache/tika/eval/io/FatalExceptionReaderTest.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/io/IDBWriter.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/batch/SingleFileConsumerBuilder.java
        • (add) tika-eval/src/test/resources/test-dirs/extractsB/file12_es.txt.json
        • (add) tika-eval/src/test/resources/test-dirs/extractsA/file3_attachBNotA.doc.json
        • (add) tika-eval/src/test/resources/test-dirs/extractsB/file4_emptyB.pdf.json
        • (add) tika-eval/src/main/java/org/apache/tika/eval/TikaEvalCLI.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/batch/EvalConsumerBuilder.java
        • (add) tika-eval/src/test/resources/test-dirs/extractsB/file5_emptyA.pdf.json
        • (edit) pom.xml
        • (add) tika-eval/src/main/java/org/apache/tika/eval/reports/XLSXNumFormatter.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/ExtractProfiler.java
        • (add) tika-eval/src/test/resources/test-dirs/raw_input/file4_emptyB.pdf
        • (add) tika-eval/src/test/resources/test-dirs/extractsA/file13_attachANotB.doc.json
        • (add) tika-eval/src/main/java/org/apache/tika/eval/db/MimeBuffer.java
        • (add) tika-eval/src/test/java/org/apache/tika/eval/TikaEvalCLITest.java
        • (add) tika-eval/src/test/resources/log4j_process.properties
        • (add) tika-eval/src/main/resources/lucene-analyzers.json
        • (add) tika-eval/src/test/resources/test-dirs/extractsB/file7_badJson.pdf.json
        • (add) tika-eval/src/test/resources/test-dirs/extractsB/file8_IOEx.pdf.json
        tallison@mitre.org Tim Allison added a comment -

        Ha. So, Lucene 6.x requires Java 8.

        Any preference for a) rolling back to Lucene 5.x which requires Java 7 or b) requiring Java 8 for the tika-eval module?

        Other options?

        tallison@mitre.org Tim Allison added a comment -

        Rolled back to Lucene 5.5.3 for now.

        gagravarr Nick Burch added a comment -

        Unless we really need a Lucene 6 feature, I'd suggest rolling back to Lucene 5.x for now, to avoid surprises and confusion.

        tallison@mitre.org Tim Allison added a comment -

        Thank you for the feedback. I agree. Lucene is now downgraded to 5.x.

        Will wait for a clean build before resolving this time.

        hudson Hudson added a comment -

        FAILURE: Integrated in Jenkins build tika-2.x #215 (See https://builds.apache.org/job/tika-2.x/215/)
        TIKA-1332 initial commit of tika-eval. More work remains. (tallison: rev 5e49c33087bbf03763b05efda3bbb96d8cc20ea4)

        • (edit) pom.xml
        • (add) tika-eval/src/test/resources/test-dirs/extractsA/file8_IOEx.pdf.json
        • (add) tika-eval/src/test/resources/test-dirs/extractsB/file6_accessEx.pdf.json
        • (add) tika-eval/src/main/java/org/apache/tika/eval/db/MimeBuffer.java
        • (add) tika-eval/src/main/resources/profile-reports.xml
        • (add) tika-eval/src/main/java/org/apache/tika/eval/db/DBUtil.java
        • (add) tika-eval/src/test/resources/test-dirs/extractsA/file10_permahang.txt.json
        • (add) tika-eval/src/test/resources/test-dirs/extractsB/file5_emptyA.pdf.json
        • (add) tika-eval/src/main/java/org/apache/tika/eval/io/XMLLogReader.java
        • (add) tika-eval/src/test/resources/test-dirs/extractsA/file13_attachANotB.doc.json
        • (add) tika-eval/src/test/resources/test-dirs/extractsA/file4_emptyB.pdf.json
        • (add) tika-eval/src/test/resources/test-dirs/extractsB/file13_attachANotB.doc.txt
        • (add) tika-eval/src/test/java/org/apache/tika/eval/reports/ResultsReporterTest.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/ContrastStatistics.java
        • (add) tika-eval/src/test/java/org/apache/tika/eval/AnalyzerManagerTest.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/AnalyzerManager.java
        • (add) tika-eval/src/test/java/org/apache/tika/eval/TikaEvalCLITest.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/TokenStatistics.java
        • (add) tika-eval/src/test/resources/commontokens/zh-tw
        • (add) tika-eval/src/main/java/org/apache/tika/eval/db/Cols.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/ExtractComparer.java
        • (add) tika-eval/src/test/java/org/apache/tika/eval/io/ExtractReaderTest.java
        • (add) tika-eval/src/test/resources/test-dirs/extractsB/file1.pdf.json
        • (add) tika-eval/src/test/java/org/apache/tika/eval/tokens/TokenCounterTest.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/TokenCountPriorityQueue.java
        • (add) tika-eval/src/test/java/org/apache/tika/eval/util/MimeUtilTest.java
        • (add) tika-eval/src/test/resources/test-dirs/raw_input/file9_noextract.txt
        • (add) tika-eval/src/test/resources/test-dirs/raw_input/file6_accessEx.pdf
        • (add) tika-eval/src/main/java/org/apache/tika/eval/EvalFilePaths.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/reports/XSLXCellFormatter.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/reports/ResultsReporter.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/TokenContraster.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/CommonTokenResult.java
        • (add) tika-eval/src/test/resources/test-dirs/raw_input/file8_IOEx.pdf
        • (add) tika-eval/src/main/java/org/apache/tika/eval/db/H2Util.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/db/TableInfo.java
        • (add) tika-eval/src/test/java/org/apache/tika/eval/ProfilerBatchTest.java
        • (add) tika-eval/src/test/resources/test-dirs/raw_input/file11_oom.txt
        • (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/TokenCounter.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/db/AbstractDBBuffer.java
        • (add) tika-eval/src/test/resources/test-dirs/extractsB/file7_badJson.pdf.json
        • (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/CommonTokenCountManager.java
        • (add) tika-eval/src/test/resources/test-dirs/raw_input/file4_emptyB.pdf
        • (edit) LICENSE.txt
        • (add) tika-eval/src/main/resources/lucene-analyzers.json
        • (add) tika-eval/src/test/java/org/apache/tika/eval/ComparerBatchTest.java
        • (add) tika-eval/src/main/resources/lucene-char-mapping.txt
        • (add) tika-eval/src/main/java/org/apache/tika/eval/AbstractProfiler.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/TokenIntPair.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/util/LanguageIDWrapper.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/AlphaIdeographFilterFactory.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/db/DBBuffer.java
        • (add) tika-eval/src/test/resources/test-dirs/extractsB/file3_attachBNotA.doc.json
        • (add) tika-eval/src/main/java/org/apache/tika/eval/reports/Report.java
        • (add) tika-eval/src/test/resources/test-dirs/extractsA/file11_oom.txt.json
        • (add) tika-eval/src/test/java/org/apache/tika/eval/io/FatalExceptionReaderTest.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/io/DBWriter.java
        • (add) tika-eval/src/test/resources/test-dirs/raw_input/file5_emptyA.pdf
        • (add) tika-eval/src/main/java/org/apache/tika/eval/batch/DBConsumersManager.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/io/IDBWriter.java
        • (add) tika-eval/src/test/java/org/apache/tika/MockDBWriter.java
        • (add) tika-eval/src/test/resources/test-dirs/raw_input/file7_badJson.pdf
        • (add) tika-eval/src/test/resources/test-dirs/raw_input/file2_attachANotB.doc
        • (add) tika-eval/src/main/java/org/apache/tika/eval/TikaEvalCLI.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/reports/XLSXNumFormatter.java
        • (add) tika-eval/src/test/resources/commontokens/en
        • (add) tika-eval/src/test/resources/test-dirs/extractsA/file12_es.txt.json
        • (add) tika-eval/src/test/resources/test-dirs/batch-logs/batch-process-fatal.xml
        • (add) tika-eval/src/test/resources/single-file-profiler-crawl-input-config.xml
        • (add) tika-eval/src/test/resources/test-dirs/extractsB/file12_es.txt.json
        • (add) tika-eval/src/test/java/org/apache/tika/eval/SimpleComparerTest.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/db/ColInfo.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/ExtractProfiler.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/batch/EvalConsumerBuilder.java
        • (add) tika-eval/src/test/resources/test-dirs/extractsB/file2_attachANotB.doc.json
        • (add) tika-eval/src/main/resources/comparison-reports.xml
        • (add) tika-eval/src/test/resources/test-dirs/raw_input/file3_attachBNotA.doc
        • (add) tika-eval/src/main/java/org/apache/tika/eval/XMLErrorLogUpdater.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/reports/XLSXHREFFormatter.java
        • (add) tika-eval/src/main/java/org/apache/tika/eval/batch/FileComparerBuilder.java
        • (add) tika-eval/src/test/resources/test-dirs/extractsB/file11_oom.txt.json
        • (add) tika-eval/src/test/resources/test-dirs/raw_input/file1.pdf
        • (add) tika-eval/src/main/java/org/apache/tika/eval/io/XMLLogMsgHandler.java
        • (add) tika-eval/src/test/resources/single-file-profiler-crawl-extract-config.xml
        • (add) tika-eval/src/test/resources/commontokens/es
        • (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/CJKBigramAwareLengthFilterFactory.java
        • (add) tika-eval/src/test/java/org/apache/tika/eval/tokens/LuceneTokenCounter.java
        • (add) tika-eval/src/test/resources/test-dirs/extractsA/file3_attachBNotA.doc.json
        • (add) tika-eval/src/test/resources/test-dirs/extractsA/file7_badJson.pdf.json
        • (add) tika-eval/src/test/resources/test-dirs/extractsB/file8_IOEx.pdf.json
        • (add) tika-eval/src/main/java/org/apache/tika/eval/io/ExtractReader.java
        • (add) tika-eval/src/test/resources/commontokens/zh-cn
        • (add) tika-eval/src/test/resources/log4j.properties
        • (add) tika-eval/src/test/resources/test-dirs/extractsA/file5_emptyA.pdf.json
        • (add) tika-eval/pom.xml
        • (add) tika-eval/src/main/resources/tika-eval-profiler-config.xml
        • (add) tika-eval/src/test/resources/test-dirs/extractsA/file1.pdf.json
        • (add) tika-eval/src/main/java/org/apache/tika/eval/batch/SingleFileConsumerBuilder.java
        • (add) tika-eval/src/test/resources/test-dirs/extractsA/file2_attachANotB.doc.json
        • (add) tika-eval/src/test/resources/log4j_process.properties
        • (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/AnalyzerDeserializer.java
        • (edit) CHANGES.txt
        • (add) tika-eval/src/test/java/org/apache/tika/eval/db/AbstractBufferTest.java
        • (add) tika-eval/src/main/resources/META-INF/services/org.apache.lucene.analysis.util.TokenFilterFactory
        • (add) tika-eval/src/test/resources/test-dirs/extractsB/file4_emptyB.pdf.json
        • (add) tika-eval/src/main/java/org/apache/tika/eval/batch/EvalConsumersBuilder.java
        • (add) tika-eval/src/main/resources/tika-eval-comparison-config.xml
        • (add) tika-eval/src/test/resources/test-dirs/extractsA/file6_accessEx.pdf.json
        TIKA-1332 fix one profiler report and whitespace (tallison: rev 69dd0328b9f6d7825f2b74610c0e2abe9c2e8f33)
        • (edit) tika-eval/src/test/resources/single-file-profiler-crawl-extract-config.xml
        • (edit) tika-eval/src/test/resources/single-file-profiler-crawl-input-config.xml
        • (edit) tika-eval/src/main/resources/comparison-reports.xml
        • (edit) tika-eval/src/main/resources/lucene-analyzers.json
        • (edit) tika-eval/src/main/resources/tika-eval-comparison-config.xml
        • (edit) tika-eval/src/main/resources/profile-reports.xml
        TIKA-1332 downgrade to Lucene 5.x so that this can run w/ Java 7 (tallison: rev 0d04b499a6c305c6c0656f37abfd6f78440ea309)
        • (edit) tika-eval/src/main/java/org/apache/tika/eval/tokens/CJKBigramAwareLengthFilterFactory.java
        • (edit) tika-eval/pom.xml
        • (edit) tika-eval/src/main/java/org/apache/tika/eval/tokens/AlphaIdeographFilterFactory.java
        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Tika-trunk #1199 (See https://builds.apache.org/job/Tika-trunk/1199/)
        TIKA-1332 – fix one report for eval profiler and clean up whitespace (tallison: rev 506b572560f6c7f44270b55877f110719a7d4b1f)

        • (edit) tika-eval/src/test/resources/single-file-profiler-crawl-input-config.xml
        • (edit) tika-eval/src/test/resources/single-file-profiler-crawl-extract-config.xml
        • (edit) tika-eval/src/main/resources/comparison-reports.xml
        • (edit) tika-eval/src/main/resources/lucene-analyzers.json
        • (edit) tika-eval/src/main/resources/profile-reports.xml
        • (edit) tika-eval/src/main/resources/tika-eval-comparison-config.xml
        TIKA-1332 – downgrade Lucene to 5.x to allow for Java 7 (tallison: rev d194ba4022dffa61cad2a12ea0092f6ec00588d2)
        • (edit) tika-eval/src/main/java/org/apache/tika/eval/tokens/CJKBigramAwareLengthFilterFactory.java
        • (edit) tika-eval/pom.xml
        • (edit) tika-eval/src/main/java/org/apache/tika/eval/tokens/AlphaIdeographFilterFactory.java
        hudson Hudson added a comment -

        FAILURE: Integrated in Jenkins build tika-2.x #216 (See https://builds.apache.org/job/tika-2.x/216/)
        TIKA-1332 fix pom for 2.0 (tallison: rev 44612ae405d1342661387f74320e13c96301754b)

        • (edit) tika-eval/pom.xml
        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Tika-trunk #1200 (See https://builds.apache.org/job/Tika-trunk/1200/)
        TIKA-1332 – clean up commons-io version mgmt (tallison: rev 6c6b77b4159d4e7bbebd883cb52f2160be9cc5a6)

        • (edit) tika-eval/pom.xml
        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build tika-2.x #217 (See https://builds.apache.org/job/tika-2.x/217/)
        TIKA-1332 3rd time's the charm. Fix dependencies with IOUtils. (tallison: rev 61532258f2ff44787050f0f3a0bb8ba17d8e50b0)

        • (edit) tika-eval/src/main/java/org/apache/tika/eval/io/DBWriter.java
        • (edit) tika-eval/src/main/java/org/apache/tika/eval/io/ExtractReader.java
        • (edit) tika-eval/src/main/java/org/apache/tika/eval/tokens/TokenIntPair.java
        • (edit) tika-eval/src/main/java/org/apache/tika/eval/io/XMLLogReader.java
        • (edit) tika-eval/src/main/java/org/apache/tika/eval/XMLErrorLogUpdater.java
        • (edit) tika-eval/src/main/java/org/apache/tika/eval/db/DBUtil.java
        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Tika-trunk #1201 (See https://builds.apache.org/job/Tika-trunk/1201/)
        TIKA-1332 – fix analyzer chain for common tokens, clean up UTF-8 (tallison: rev a2d214c71602f4f5a84635adc38c43182a39a390)

        • (edit) tika-eval/src/main/java/org/apache/tika/eval/tokens/AnalyzerManager.java
        • (edit) tika-eval/src/main/java/org/apache/tika/eval/tokens/TokenIntPair.java
        • (edit) tika-eval/src/main/resources/lucene-analyzers.json
        • (edit) tika-eval/src/main/java/org/apache/tika/eval/io/ExtractReader.java
        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build tika-2.x #218 (See https://builds.apache.org/job/tika-2.x/218/)
        TIKA-1332 – add English Spanish common tokens; fix logging (tallison: rev 81150859bdb25fe7faec575f5b916c8efad963cb)

        • (edit) tika-eval/src/main/resources/tika-eval-comparison-config.xml
        • (delete) tika-eval/src/test/resources/commontokens/zh-tw
        • (add) tika-eval/src/test/resources/common_tokens/zh-cn
        • (edit) tika-eval/src/main/java/org/apache/tika/eval/ExtractProfiler.java
        • (add) tika-eval/src/test/resources/common_tokens/zh-tw
        • (add) tika-eval/src/test/resources/common_tokens/en
        • (add) tika-eval/src/test/resources/common_tokens/es
        • (add) tika-eval/src/main/resources/log4j.properties
        • (delete) tika-eval/src/test/resources/commontokens/zh-cn
        • (edit) tika-eval/src/main/java/org/apache/tika/eval/batch/SingleFileConsumerBuilder.java
        • (edit) tika-eval/src/test/java/org/apache/tika/eval/SimpleComparerTest.java
        • (edit) tika-eval/src/test/resources/single-file-profiler-crawl-extract-config.xml
        • (edit) tika-eval/src/main/java/org/apache/tika/eval/tokens/CommonTokenCountManager.java
        • (delete) tika-eval/src/test/resources/commontokens/es
        • (add) tika-eval/src/main/resources/common_tokens/es
        • (edit) tika-eval/src/main/resources/tika-eval-profiler-config.xml
        • (add) tika-eval/src/main/resources/common_tokens/en
        • (edit) tika-eval/src/main/java/org/apache/tika/eval/AbstractProfiler.java
        • (edit) tika-eval/src/test/java/org/apache/tika/eval/TikaEvalCLITest.java
        • (delete) tika-eval/src/test/resources/log4j_process.properties
        • (edit) tika-eval/src/test/resources/single-file-profiler-crawl-input-config.xml
        • (edit) tika-eval/src/main/java/org/apache/tika/eval/batch/EvalConsumersBuilder.java
        • (edit) tika-eval/src/main/java/org/apache/tika/eval/TikaEvalCLI.java
        • (delete) tika-eval/src/test/resources/commontokens/en
        • (delete) tika-eval/src/test/resources/log4j.properties
        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Tika-trunk #1202 (See https://builds.apache.org/job/Tika-trunk/1202/)
        TIKA-1332 – add English/Spanish common tokens, fix logging (tallison: rev dc2dcd4ccc7bca640bb362f72729d0b6ba22a890)

        • (edit) tika-eval/src/main/java/org/apache/tika/eval/batch/EvalConsumersBuilder.java
        • (edit) tika-eval/src/main/java/org/apache/tika/eval/AbstractProfiler.java
        • (edit) tika-eval/src/test/resources/single-file-profiler-crawl-input-config.xml
        • (edit) tika-eval/src/test/java/org/apache/tika/eval/SimpleComparerTest.java
        • (edit) tika-eval/src/main/java/org/apache/tika/eval/tokens/CommonTokenCountManager.java
        • (add) tika-eval/src/test/resources/common_tokens/zh-cn
        • (add) tika-eval/src/main/resources/common_tokens/en
        • (edit) tika-eval/src/main/java/org/apache/tika/eval/ExtractProfiler.java
        • (add) tika-eval/src/test/resources/common_tokens/zh-tw
        • (delete) tika-eval/src/test/resources/commontokens/zh-tw
        • (edit) tika-eval/src/main/resources/tika-eval-profiler-config.xml
        • (delete) tika-eval/src/test/resources/commontokens/es
        • (delete) tika-eval/src/test/resources/commontokens/zh-cn
        • (edit) tika-eval/src/main/java/org/apache/tika/eval/TikaEvalCLI.java
        • (add) tika-eval/src/test/resources/common_tokens/en
        • (edit) tika-eval/src/test/resources/single-file-profiler-crawl-extract-config.xml
        • (delete) tika-eval/src/test/resources/log4j.properties
        • (add) tika-eval/src/main/resources/common_tokens/es
        • (edit) tika-eval/src/main/java/org/apache/tika/eval/batch/SingleFileConsumerBuilder.java
        • (delete) tika-eval/src/test/resources/commontokens/en
        • (delete) tika-eval/src/test/resources/log4j_process.properties
        • (edit) tika-eval/src/main/resources/tika-eval-comparison-config.xml
        • (add) tika-eval/src/main/resources/log4j.properties
        • (add) tika-eval/src/test/resources/common_tokens/es
        • (edit) tika-eval/src/test/java/org/apache/tika/eval/TikaEvalCLITest.java
        tallison@mitre.org Tim Allison added a comment -

        More work remains, but I'll open separate issues.


          People

          • Assignee:
            tallison@mitre.org Tim Allison
            Reporter:
            tallison@mitre.org Tim Allison
          • Votes:
            0
            Watchers:
            6
