Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-1704

org.apache.lucene.ant.HtmlDocument added Tidy config file passthrough availability

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Trivial
    • Resolution: Fixed
    • 2.4.1
    • None
    • modules/other
    • None
    • New

    Description

      Parsing HTML documents using the org.apache.lucene.ant.HtmlDocument.Document method resulted in many error messages such as this:

      line 152 column 725 - Error: <as-html> is not recognized!
      This document has errors that must be fixed before
      using HTML Tidy to generate a tidied up version.

      The solution is to configure Tidy to accept these abnormal tags by adding the tag name to the "new-inline-tags" option in the Tidy config file (or the command line which does not make sense in this context), like so:

      new-inline-tags: as-html

      Tidy needs to know where the configuration file is, so a new constructor and Document method can be added. Here is the code:

          /**                                                                                                                                                                                            
           *  Constructs an <code>HtmlDocument</code> from a {@link                                                                                                                                      
           *  java.io.File}.                                                                                                                                                                             
           *                                                                                                                                                                                             
           *@param  file             the <code>File</code> containing the                                                                                                                                
           *      HTML to parse                                                                                                                                                                          
           *@param  tidyConfigFile   the <code>String</code> containing                                                                                                                                  
           *      the full path to the Tidy config file                                                                                                                                                  
           *@exception  IOException  if an I/O exception occurs                                                                                                                                          
           */
          public HtmlDocument(File file, String tidyConfigFile) throws IOException {
              Tidy tidy = new Tidy();
              tidy.setConfigurationFromFile(tidyConfigFile);
              tidy.setQuiet(true);
              tidy.setShowWarnings(false);
              org.w3c.dom.Document root =
                      tidy.parseDOM(new FileInputStream(file), null);
              rawDoc = root.getDocumentElement();
          }
      
          /**                                                                                                                                                                                            
           *  Creates a Lucene <code>Document</code> from a {@link                                                                                                                                       
           *  java.io.File}.                                                                                                                                                                             
           *                                                                                                                                                                                             
           *@param  file                                                                                                                                                                                 
           *@param  tidyConfigFile the full path to the Tidy config file                                                                                                                                 
           *@exception  IOException                                                                                                                                                                      
           */
          public static org.apache.lucene.document.Document
              Document(File file, String tidyConfigFile) throws IOException {
      
              HtmlDocument htmlDoc = new HtmlDocument(file, tidyConfigFile);
      
              org.apache.lucene.document.Document luceneDoc = new org.apache.lucene.document.Document();
      
              luceneDoc.add(new Field("title", htmlDoc.getTitle(), Field.Store.YES, Field.Index.ANALYZED));
              luceneDoc.add(new Field("contents", htmlDoc.getBody(), Field.Store.YES, Field.Index.ANALYZED));
      
              String contents = null;
              BufferedReader br =
                  new BufferedReader(new FileReader(file));
              StringWriter sw = new StringWriter();
              String line = br.readLine();
              while (line != null) {
                  sw.write(line);
                  line = br.readLine();
              }
              br.close();
              contents = sw.toString();
              sw.close();
      
              luceneDoc.add(new Field("rawcontents", contents, Field.Store.YES, Field.Index.NO));
      
              return luceneDoc;
          }
      

      I am using this now and it is working fine. The configuration file is being passed to Tidy and now I am able to index thousands of HTML pages with no more Tidy tag errors.

      Attachments

        Activity

          People

            Unassigned Unassigned
            ksprochi Keith Sprochi
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated:
                Original Estimate - 0.5h
                0.5h
                Remaining:
                Remaining Estimate - 0.5h
                0.5h
                Logged:
                Time Spent - Not Specified
                Not Specified