  1. ManifoldCF
  2. CONNECTORS-1088

Augment Tika extractor to allow full use of boilerpipe content extraction

Details

    Description

      Boilerpipe can process content further than our current Tika extractor implementation allows. Specifically, we should let a user specify a Boilerpipe extractor class by name, taken from the following package (and presumably from other places too):

      http://boilerpipe.googlecode.com/svn/trunk/boilerpipe-core/javadoc/1.0/de/l3s/boilerpipe/extractors/package-summary.html

      If the extractor is specified, then our ContentHandler creation code in the Tika extractor changes from:

                  ContentHandler handler = new BodyContentHandler(w);
      

      to:

                  ContentHandler handler = new BodyContentHandler(w);
                  // Resolve the configured short name against the Boilerpipe extractors package.
                  String extractorClassName = "de.l3s.boilerpipe.extractors." + boilerpipe;
                  try {
                    ClassLoader loader = BoilerpipeExtractor.class.getClassLoader();
                    Class<?> extractorClass = loader.loadClass(extractorClassName);

                    BoilerpipeExtractor boilerpipeExtractor = (BoilerpipeExtractor)extractorClass.getDeclaredConstructor().newInstance();
                    // Wrap the plain body handler so that Boilerpipe filters the content.
                    handler = new BoilerpipeContentHandler(handler, boilerpipeExtractor);

                  } catch (ClassNotFoundException e) {
                    log.warn("BoilerpipeExtractor " + extractorClassName + " not found!");
                  } catch (InstantiationException e) {
                    log.warn("Could not instantiate " + extractorClassName);
                  } catch (Exception e) {
                    log.warn(e.toString());
                  }
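
      The lookup-and-fallback pattern in the snippet above can be tried in isolation. The following is only a sketch under stated assumptions: ReflectiveLoader and its load method are hypothetical names that are not part of ManifoldCF, Tika, or Boilerpipe, and java.util.List stands in for BoilerpipeExtractor so that the example runs without Boilerpipe on the classpath. It resolves a short class name against a package prefix, checks the type, and returns an empty Optional instead of logging when anything goes wrong.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical helper for illustration only; not part of ManifoldCF, Tika, or Boilerpipe.
final class ReflectiveLoader {

    private ReflectiveLoader() {
    }

    /**
     * Resolves shortName inside packagePrefix and instantiates the class
     * through its no-argument constructor. Returns an empty Optional when the
     * name is blank, the class is missing, it is not a subtype of
     * expectedType, or it cannot be instantiated.
     */
    static <T> Optional<T> load(String packagePrefix, String shortName, Class<T> expectedType) {
        if (shortName == null || shortName.isEmpty()) {
            return Optional.empty();
        }
        String className = packagePrefix + shortName;
        try {
            // Uses the caller's class loader; the snippet above uses the
            // loader of the BoilerpipeExtractor interface instead.
            Class<?> candidate = Class.forName(className);
            if (!expectedType.isAssignableFrom(candidate)) {
                // Checking the type first avoids a ClassCastException later.
                return Optional.empty();
            }
            Object instance = candidate.getDeclaredConstructor().newInstance();
            return Optional.of(expectedType.cast(instance));
        } catch (ReflectiveOperationException e) {
            // Covers ClassNotFoundException, InstantiationException,
            // NoSuchMethodException, IllegalAccessException and
            // InvocationTargetException.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // "ArrayList" resolves to java.util.ArrayList, which implements List.
        System.out.println(load("java.util.", "ArrayList", List.class).isPresent());
        // An unknown short name falls back to an empty result.
        System.out.println(load("java.util.", "NoSuchList", List.class).isPresent());
    }
}
```

      In the Tika extractor the caller would keep the plain BodyContentHandler whenever the result is empty, which mirrors the way the catch blocks above leave handler unchanged.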
      

          People

            kwright@metacarta.com Karl Wright