OpenNLP / OPENNLP-253

Add text similarity / relevance / syntactic match component based on parse trees


    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Parser
    • Labels:
      None
    • Environment:
      Java

      Description

      The proposed component relies on the OpenNLP parser and gives search engineers a simple relevance verification tool based on machine learning over syntactic parse trees.

      The value for the search engineering community is that engineers do not have to be familiar with NLP to use the syntactic generalization component: parsing and chunking are done by OpenNLP, and the proposed component then applies graph-based learning for relevance assessment.

      One expected usage scenario is that a search library such as Lucene retrieves the results, and this component then accepts relevant search results and rejects irrelevant ones according to the proposed syntactic generalization measure.
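The accept / reject step in this scenario can be illustrated with a minimal sketch. It is not the code from the attached archives: the class `RelevanceFilterSketch`, the 0.3 weight for a POS-only match, and the 0.5 threshold are all assumptions made for illustration. The sketch also uses a flat, order-preserving alignment of hand-tagged tokens as a stand-in for generalizing real parse trees; in practice the tags would come from the OpenNLP parser or chunker.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for a syntactic relevance filter (hypothetical, not the attached component). */
final class RelevanceFilterSketch {

    /** One token with its part-of-speech tag and lemma. */
    static final class Token {
        final String pos;
        final String lemma;
        Token(String pos, String lemma) { this.pos = pos; this.lemma = lemma; }
    }

    /** Builds tokens from hand-tagged "lemma/POS" strings (stand-in for a real tagger). */
    static List<Token> tagged(String... items) {
        List<Token> out = new ArrayList<>();
        for (String item : items) {
            int slash = item.lastIndexOf('/');
            out.add(new Token(item.substring(slash + 1), item.substring(0, slash)));
        }
        return out;
    }

    /** 1.0 if POS and lemma agree, 0.3 (assumed weight) if only POS agrees, else 0. */
    static double matchWeight(Token a, Token b) {
        if (!a.pos.equals(b.pos)) return 0.0;
        return a.lemma.equals(b.lemma) ? 1.0 : 0.3;
    }

    /** Weighted longest-common-subsequence score, normalized by query length. */
    static double score(List<Token> query, List<Token> candidate) {
        if (query.isEmpty()) return 0.0;
        double[][] best = new double[query.size() + 1][candidate.size() + 1];
        for (int i = 1; i <= query.size(); i++) {
            for (int j = 1; j <= candidate.size(); j++) {
                double w = matchWeight(query.get(i - 1), candidate.get(j - 1));
                double skip = Math.max(best[i - 1][j], best[i][j - 1]);
                double align = w > 0 ? best[i - 1][j - 1] + w : 0.0;
                best[i][j] = Math.max(skip, align);
            }
        }
        return best[query.size()][candidate.size()] / query.size();
    }

    /** Accepts a search result when its score reaches the (assumed) threshold. */
    static boolean accept(List<Token> query, List<Token> candidate, double threshold) {
        return score(query, candidate) >= threshold;
    }

    public static void main(String[] args) {
        List<Token> query = tagged("digital/JJ", "camera/NN", "with/IN", "zoom/NN");
        List<Token> hitA = tagged("cheap/JJ", "digital/JJ", "camera/NN",
                                  "with/IN", "optical/JJ", "zoom/NN");
        List<Token> hitB = tagged("camera/NN", "crew/NN", "film/VBZ", "zoo/NN");
        System.out.println("hitA accepted: " + accept(query, hitA, 0.5));
        System.out.println("hitB accepted: " + accept(query, hitB, 0.5));
    }
}
```

With these inputs the first result shares the query's full tagged sequence and is kept, while the second shares only one noun and is rejected. A real deployment would feed the hits returned by Lucene through the filter and would tune the threshold on labeled data.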

      This code has been deployed commercially over the last 2 years at datran.com and zvents.com and serves more than 20 million users monthly.

      There are a number of publications on this project, including:

      http://portal.acm.org/citation.cfm?id=1881190

      http://www.aaai.org/ocs/index.php/FLAIRS/FLAIRS11/paper/view/2573

        Attachments

        1. text_similarity_proposal_for_opennlp.test.zip
          9 kB
          Boris Galitsky
        2. text_similarity_proposal_for_opennlp.zip
          149 kB
          Boris Galitsky


            People

            • Assignee:
              joern Jörn Kottmann
              Reporter:
              bgalitsky Boris Galitsky
            • Votes:
              1
              Watchers:
              0

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                Estimated: 672h
                Remaining: 672h
                Logged: Not Specified