Details
Improvement
Status: Resolved
Major
Resolution: Fixed
ManifoldCF 1.8, ManifoldCF 2.0
None
Description
Boilerpipe can process content further than our current Tika extractor implementation allows. Specifically, we should allow a user to specify a Boilerpipe extractor class from within the following package (and possibly from other places as well):
If the extractor is specified, then our ContentHandler creation code in the Tika extractor changes from:
ContentHandler handler = new BodyContentHandler(w);
to:
ContentHandler handler = new BodyContentHandler(w);
boilerpipe = "de.l3s.boilerpipe.extractors." + boilerpipe;
try {
  ClassLoader loader = BoilerpipeExtractor.class.getClassLoader();
  Class extractorClass = loader.loadClass(boilerpipe);
  BoilerpipeExtractor boilerpipeExtractor = (BoilerpipeExtractor)extractorClass.newInstance();
  handler = new BoilerpipeContentHandler(handler, boilerpipeExtractor);
} catch (ClassNotFoundException e) {
  log.warn("BoilerpipeExtractor " + boilerpipe + " not found!");
} catch (InstantiationException e) {
  log.warn("Could not instantiate " + boilerpipe);
} catch (Exception e) {
  log.warn(e.toString());
}
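The load-by-short-name-with-fallback pattern in the snippet above can be tried in isolation. What follows is only a minimal sketch, not the ManifoldCF implementation: the class `ReflectiveLoader` and its method `loadOrDefault` are hypothetical names, and `java.lang.` with `CharSequence` stand in for the Boilerpipe package prefix and `BoilerpipeExtractor` so that the example runs with the JDK alone. It also checks the loaded class against the expected type before instantiating it, and uses `getConstructor().newInstance()` because `Class.newInstance()` is deprecated since Java 9.

```java
final class ReflectiveLoader {
    private ReflectiveLoader() {}

    /**
     * Hypothetical helper: resolves shortName against packagePrefix, creates an
     * instance through the public no-argument constructor, and returns fallback
     * if the name is empty or any reflective step fails.
     */
    static <T> T loadOrDefault(String packagePrefix, String shortName,
                               Class<T> type, T fallback) {
        if (shortName == null || shortName.isEmpty()) {
            return fallback; // nothing configured, keep the default behaviour
        }
        String className = packagePrefix + shortName;
        try {
            // Use the class loader of the expected type, as the original snippet does.
            Class<?> raw = Class.forName(className, true, type.getClassLoader());
            // Fails with ClassCastException before instantiation if the type is wrong.
            Class<? extends T> cls = raw.asSubclass(type);
            return cls.getConstructor().newInstance();
        } catch (ClassNotFoundException e) {
            System.err.println(className + " not found");
        } catch (ReflectiveOperationException e) {
            System.err.println("Could not instantiate " + className + ": " + e);
        } catch (ClassCastException e) {
            System.err.println(className + " is not a " + type.getName());
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Stand-in values: with Boilerpipe on the classpath one would presumably pass
        // "de.l3s.boilerpipe.extractors." and BoilerpipeExtractor.class instead.
        CharSequence loaded =
            loadOrDefault("java.lang.", "StringBuilder", CharSequence.class, "fallback");
        System.out.println(loaded.getClass().getName());

        // An unknown name logs a warning and yields the fallback value.
        CharSequence missing =
            loadOrDefault("java.lang.", "NoSuchClass", CharSequence.class, "fallback");
        System.out.println(missing);
    }
}
```

In the Tika extractor the fallback would correspond to keeping the plain `BodyContentHandler`, which is what the original snippet does when loading fails. Whether a given Boilerpipe extractor class exposes a public no-argument constructor is an assumption that would need checking per class.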