Details
Type: Task
Status: Resolved
Priority: Major
Resolution: Fixed
Description
So far, tika-eval has focused on processing "extracts", that is, the output of Tika or another text extractor. I think it would be useful to add a basic FileProfiler that works on the raw input files directly, without parsing them. This is useful as a first step when profiling a directory of files, before going through the costly process of parsing.
Without parsing, we can still collect the file length, a digest, and the detected file type.
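To make the three parse-free measurements concrete, here is a minimal sketch using only the JDK. It is not the FileProfiler proposed here: the class name `FileProfileSketch`, the `Profile` holder, and the choice of SHA-256 are assumptions made for illustration, and `Files.probeContentType` is a stand-in for Tika's own type detection, which the real profiler would presumably use.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FileProfileSketch {

    /** Hypothetical result holder; not a tika-eval class. */
    public static final class Profile {
        public final long length;
        public final String sha256;
        public final String detectedType; // may be null if the JDK cannot guess

        Profile(long length, String sha256, String detectedType) {
            this.length = length;
            this.sha256 = sha256;
            this.detectedType = detectedType;
        }
    }

    /** Collects length, digest and a type guess without parsing the file. */
    public static Profile profile(Path file) {
        try {
            long length = Files.size(file);
            String digest;
            try (InputStream in = Files.newInputStream(file)) {
                digest = sha256Hex(in);
            }
            // Stand-in for Tika detection; result is platform dependent.
            String type = Files.probeContentType(file);
            return new Profile(length, digest, type);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Streams the input through SHA-256 and returns lowercase hex. */
    static String sha256Hex(InputStream in) throws IOException {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform.
            throw new IllegalStateException(e);
        }
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            md.update(buf, 0, n);
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: FileProfileSketch <file>");
            return;
        }
        Profile p = profile(Paths.get(args[0]));
        System.out.println(p.length + "\t" + p.sha256 + "\t" + p.detectedType);
    }
}
```

The digest is streamed in fixed-size chunks, so large files are never loaded into memory in one piece. That keeps this pass cheap compared with a full parse.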