Derby
DERBY-472

Full Text Indexing / Full Text Search

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.0.2.0
    • Fix Version/s: None
    • Component/s: SQL
    • Environment:
      All environments
    • Urgency:
      Low

      Description

      Efficiently support full text search of string datatyped columns. Mag Gam raised this issue on the user's mailing list on 24 July 2005; the email thread is titled 'Full Text Indexing'. Mag wants to see something akin to the functionality in tsearch2 (http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/). Dan points out that we may be able to re-use index building technology exposed by the Apache Lucene project (http://lucene.apache.org/).

      Presumably we want to build inverted indexes on all string datatyped columns: CHAR, VARCHAR, LONG VARCHAR, CLOB, and their national variants (when they are implemented). We should consider the following additional issues when specifying this feature:

      1) Do we also want to support text search on XML columns?

      2) Which human languages do we support initially? Each language has its own rules for lexing words and its own list of "noise" words which should not be indexed. Hopefully, we can plug in some existing packages of lexers and noise filters. We should encourage users to donate additional lexers/filters.

      3) The CREATE INDEX syntax (for these new inverted indexes) should let us bind a lexing human language to a string-datatyped column.

      4) How do we express the search condition? For case-sensitive searches we can get away with boolean expressions built out of standard LIKE clauses. However, in my opinion, case-sensitive searches are an edge case. The more useful situation is a case-insensitive search. Can we get away with introducing a non-standard function here or do we need to push a proposal through the standards committees? Even more useful and non-standard are fuzzy searches, which tolerate bad spellers.
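
      As a rough sketch of the LIKE-based workaround (and of its limits), the case-insensitive variant can be approximated by lower-casing both sides. The ARTICLES table and SUMMARY column below are purely illustrative, and the commented-out CONTAINS predicate is hypothetical syntax, not an existing feature:

      -- Case-insensitive keyword search approximated with boolean LIKE clauses.
      -- ARTICLES and SUMMARY are illustrative names only.
      select id, summary
      from articles
      where lower( summary ) like '%apache%'
        and lower( summary ) like '%derby%'
        and lower( summary ) not like '%hat%'

      -- Note that LIKE has no notion of word boundaries (the last clause also
      -- rejects rows containing "that"), which is part of the motivation for a
      -- real full-text predicate, e.g. (hypothetical, not implemented):
      -- where contains( summary, '+apache +derby -hat' )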


          Activity

          Daniel John Debrunner added a comment -

          Rather than invent text indexing, Derby should use the Apache text search library - Lucene.

          http://lucene.apache.org/

          Then you would have a text query language like Google, complete with language support etc.

          SELECT * FROM ARTICLES where CONTAINS('+Apache +Derby -hat') and rank > 0.8 order by rank

          or something like that.

          Rick Hillegas added a comment -

          If Lucene's query syntax satisfies Derby's standards requirements, this is great. The Lucene syntax is compact and powerful.

          Daniel John Debrunner added a comment -

          I'm hoping the work I'm doing in DERBY-571 and beyond will provide the framework for integrating Lucene into Derby.
          With the query,

          SELECT * FROM ARTICLES where CONTAINS(summary, '+Apache +Derby -hat') and rank > 0.8 order by rank

          I'm thinking that the CONTAINS function would map to a virtual table (an implementation of java.sql.ResultSet) that took the arguments and ran against the Lucene index.
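
          A minimal sketch of how such a mapping might look once rewritten into a query, assuming a hypothetical table function named luceneContains; the function and its documentID/rank columns are placeholders for whatever the virtual-table implementation would expose, not an existing API:

          -- Hypothetical rewrite of the CONTAINS predicate into a join against a
          -- table function wrapping the Lucene index. All names are illustrative.
          select a.*
          from articles a,
          table( luceneContains( 'ARTICLES', 'SUMMARY', '+Apache +Derby -hat' ) ) lc
          where a.id = lc.documentID
          and lc.rank > 0.8
          order by lc.rank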

          Rick Hillegas added a comment -

          Linking this issue to a later request for Lucene integration.

          Rick Hillegas added a comment -

          The following wiki page tracks this evolving proposal: http://wiki.apache.org/db-derby/LuceneIntegration.

          Dan Scott added a comment -

          Has anyone looked at this (old, now) patch for Lucene integration into Derby:
          http://issues.apache.org/jira/browse/LUCENE-434

          Full-text search would be really nice, and Lucene is clearly the way to go.

          Suran Jayathilaka added a comment -

          Hi,

          The postgraduate course I'm following requires us to complete a development/research project spanning approximately one year. Having had a wonderful experience working with the Derby community for my Google Summer of Code project, I am considering taking up some feature addition to Derby as my degree project.

          After discussing a bit with Kathey Marsden, I think this issue, Lucene integration for Derby for full text indexing / full text search, might be a good one to take up.

          But I am not quite clear on the workload and timeframe that might be required to see it to completion.
          Therefore, I would be grateful for some input on how this task could be broken up into viable subtasks, and a brief idea of what they might involve.

          Thanks.
          Suran

          Rick Hillegas added a comment -

          Hi Suran,

          I think that this would be a great contribution. It might be good to start out by defining what you mean by Lucene/Derby integration. Here are some possibilities. Full disclosure: it has been a long time since I looked at Lucene and I have not prototyped any of the possibilities listed below:

          1) Be able to run Lucene queries against text stored in Derby.

          I believe that this has been working for a long time. Lucene can index documents that are stored inside RDBMSes, including Derby. Documents are persisted in Derby and the customer application periodically tells Lucene to re-index those documents (say once a night). The application runs queries against the Lucene indexes and then retrieves the relevant documents from Derby when the customer wants to drill down.

          If that's all you want to do, then you may just need to write a primer, explaining how a developer would set up a combined Derby/Lucene installation.

          2) Be able to do (1) and join Lucene results against other Derby data.

          I believe that you can do this today without modifying Lucene or Derby. You need to do something like the following:

          a) Write triggers which update the Lucene indexes when the customer inserts or updates a document (a rough sketch of such a trigger appears after the example query below).

          b) Write a table function which is handed a Lucene query. The table function returns the relevant document ids. Those ids can then be joined back to Derby tables to retrieve documents.

          No doubt there are a lot of interesting issues involved with keeping Derby and Lucene in sync.

          The table function might look like this:

          create function luceneQuery
          (
          queryText varchar( 32672 ),
          documentDomain varchar( 32672 ),
          rankCutoff double
          )
          returns table
          (
          documentDomain varchar( 32672 ),
          documentID int,
          rank double
          )
          language java
          parameter style DERBY_JDBC_RESULT_SET
          contains sql
          external name 'LuceneSupport.luceneQuery'

          Then you could pose Derby queries like this:

          select reports.id, reports.document, luceneResults.rank
          from docSchema.reports,
          table( luceneQuery( '+Apache +Derby -hat', 'docSchema.reports', 0.8 ) ) luceneResults
          where reports.id = luceneResults.documentID
          order by luceneResults.rank

          If that's good enough, then your task might just be providing a template trigger and table function plus writing a whitepaper explaining how to wire the templates into an application.
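
          For the trigger side of (a), here is a rough sketch, assuming index maintenance is wrapped in a Java procedure; LuceneSupport.reindexDocument is hypothetical, mirroring the hypothetical LuceneSupport.luceneQuery above:

          -- Hypothetical procedure that (re-)indexes one document in Lucene.
          create procedure reindexDocument
          (
          documentDomain varchar( 32672 ),
          documentID int
          )
          language java
          parameter style java
          reads sql data
          external name 'LuceneSupport.reindexDocument'

          -- Row trigger that keeps the Lucene index in step with inserts;
          -- similar triggers would be needed for updates and deletes.
          create trigger reportsInsertTrigger
          after insert on docSchema.reports
          referencing new as newRow
          for each row
          call reindexDocument( 'docSchema.reports', newRow.id )

          The procedure body (not shown) would look up the new document text and hand it to Lucene; keeping that write in step with Derby's transaction is one of the synchronization issues mentioned above.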

          3) Be able to do (2) with great performance.

          It may be that (2) has some performance or scalability problems. Lucene or Derby might need to be enhanced to make (2) perform well.

          4) Be able to do (3) easily.

          A DBA might want help in setting up a combined Derby/Lucene installation. A DBA might want to backup and restore the Derby and Lucene databases together. Making this easily usable might require writing some tools and/or putting some hooks into Lucene and Derby.

          5) Something else. What's on your mind?

          Rick Hillegas added a comment -

          Geoff Hendrey describes benefits he'd like to see from a Lucene integration: http://www.nabble.com/vote-for-Lucene-integration!-td22407382.html#a22407382

          geoff hendrey added a comment -

          As background on my interest in this: I just finished a three month project implementing search with Lucene on 1,950,000 documents. It works fantastically, but requires careful locale-specific configuration.

          Based on what I learned I would recommend a loose coupling in which a server can listen on the wire for tx logs like the ones shipped to the asynch replication server. The server simply receives the tx log and indexes the information in lucene. I could code the listener if you have docs on the replication protocol.

          Knut Anders Hatlen added a comment -

          I'm not aware of any documentation of the replication protocol, but my understanding is that it just uses ObjectOutput.writeObject()/ObjectInput.readObject() to transport ReplicationMessage objects (http://db.apache.org/derby/javadoc/engine/org/apache/derby/impl/store/replication/net/ReplicationMessage.html). The interesting messages are the ones where type=TYPE_LOG. Those messages contain the transaction log records, which have the same format as the files in the log directory in the database. The replication protocol doesn't know the meaning of the transaction log records; it just forwards the raw bytes to the recovery subsystem. I think the format of the transaction logs is also based on writeObject()/readObject(), so the easiest way to learn the protocol is probably to study the writeExternal() and readExternal() methods of the different log operation classes listed at the bottom of this page: http://db.apache.org/derby/papers/recovery.html (of course, you don't need to know the exact format if you use readObject()/writeObject() yourself).

          Rick Hillegas added a comment -

          A little more information on standards which apply to this work. The ANSI/ISO SQL standard does address full-text search in part 2 of the section titled SQL/MM (SQL Multimedia and Application Packages). The approach centers on a special full-text datatype, which is an ADT. The ADT has many methods defined on it, some of which return BOOLEAN values. SQL/MM is part of the 1999 version of the standard. I asked the ANSI SQL committee if anyone on the committee had implemented the spec. No-one replied. I think that means that no major vendor has implemented this part of the SQL spec.

          You can get a copy of part 2 of SQL/MM here: http://www.wiscorp.com/SQLStandards.html

          At this time, I don't recommend building Derby text search capabilities on top of the SQL standard--for these reasons:

          1) I question the usefulness of a spec which has not been implemented by any major vendor.

          2) Derby does not currently support ADTs.

          3) Derby does not currently support a BOOLEAN datatype.


            People

            • Assignee:
              Unassigned
            • Reporter:
              Rick Hillegas
            • Votes:
              15
            • Watchers:
              9
