Details

    • Type: New Feature
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: v1.2.0, v1.4.0
    • Fix Version/s: None
    • Component/s: Sinks+Sources
    • Labels:
      None

      Description

      Some use cases need near real time full text indexing of data through Flume into Solr, where a Flume sink can write directly to a Solr search server. This is a scalable way to provide low latency querying and data acquisition. It complements (rather than replaces) use cases based on Map Reduce batch analysis of HDFS data.

      Apache Solr, which is built on Lucene, has a client API that uses REST to add documents to a Solr server. A Solr sink can extract documents from Flume events and forward them to Solr.
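One way to picture the event-to-document mapping described above is the sketch below. This is not the attached implementation: the event shape (a headers map plus a byte[] body) and the "text" field name are simplified stand-ins, and a real sink would build a SolrJ SolrInputDocument instead of a plain Map.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class EventToDocSketch {

  // Turn a simplified Flume event (headers + body) into a field/value map,
  // the shape a Solr document is ultimately built from.
  static Map<String, Object> toSolrDoc(Map<String, String> headers, byte[] body) {
    Map<String, Object> doc = new HashMap<>();
    // Copy each Flume header into a same-named document field.
    doc.putAll(headers);
    // Index the event body as the document text.
    doc.put("text", new String(body, StandardCharsets.UTF_8));
    return doc;
  }

  public static void main(String[] args) {
    Map<String, String> headers = new HashMap<>();
    headers.put("host", "node-1");
    headers.put("timestamp", "1370000000000");
    Map<String, Object> doc =
        toSolrDoc(headers, "GET /index.html 200".getBytes(StandardCharsets.UTF_8));
    System.out.println(doc.get("host") + " " + doc.get("text"));
  }
}
```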

      1. flume-new-features-1.3.1-sources.jar
        5 kB
        Israel Ekpo
      2. flume-new-feature-dependencies.zip
        1.57 MB
        Israel Ekpo
      3. flume-new-features-1.3.1.jar
        9 kB
        Israel Ekpo

        Issue Links

          Activity

          Israel Ekpo added a comment -

          I think this is a cool idea.

          This could be a great alternative to the ElasticSearchSink.

          There are some folks that have experience with Apache Solr but do not necessarily understand how to get ElasticSearch up and running.

          Having a SolrSink as an alternative could be very helpful in creating a user interface for searching through event and log data collected with Flume using Apache Solr.

          In ElasticSearch, the data sent to the sink can be partitioned by date (yyyy-MM-dd). With the SolrSink, the captured data can be partitioned by date in a similar manner via the CREATE feature of CoreAdmin:

          http://wiki.apache.org/solr/CoreAdmin#CREATE

          The only downside is that unlike ElasticSearch, where no pre-existing schema is required, with Apache Solr the new core can only be created from a pre-existing instanceDir containing solrconfig.xml and schema.xml files.
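The yyyy-MM-dd partitioning described above could be sketched like this. The "flume" prefix and the choice to key on the event timestamp are assumptions for illustration, not part of the attached patch; a real sink would pass the resulting name to a CoreAdmin CREATE call.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class CoreNameSketch {

  private static final DateTimeFormatter DAY =
      DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC);

  // Derive a per-day core name from an event timestamp (epoch millis),
  // mirroring the ElasticSearchSink's yyyy-MM-dd index partitioning.
  static String coreNameFor(String prefix, long epochMillis) {
    return prefix + "-" + DAY.format(Instant.ofEpochMilli(epochMillis));
  }

  public static void main(String[] args) {
    // 0L is the Unix epoch, so this prints "flume-1970-01-01".
    System.out.println(coreNameFor("flume", 0L));
  }
}
```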

          Israel Ekpo added a comment -

          The RESTClient can be used to send bulk requests containing a batch of events to Apache Solr.

          Israel Ekpo added a comment -

          Instructions on how to test drive this new feature are available here:

          https://cwiki.apache.org/confluence/display/FLUME/How+to+Setup+Solr+Sink+for+Flume

          Mike Percy added a comment -

          Israel, cool patch! I have some high level feedback and some nitpicky feedback.

          High level:

          • Can we abstract the SolrEventSerializer concept a bit more broadly to be a SolrIndexer? The idea is that people may want to do more than simply map one event to one document, as well as use implementations other than ConcurrentUpdateSolrServer. In order to support more complex indexing use cases in the future, one way to do it could be adding an interface like:
          public interface SolrIndexer extends Configurable {
            public void configure(Context ctx);
            public void init();
            public void load(Event event) throws IOException, SolrServerException;
            public void beginSolrTransaction() throws IOException, SolrServerException;
            public void commitSolrTransaction() throws IOException, SolrServerException;
            public void rollbackSolrTransaction() throws IOException, SolrServerException;
            public void shutdown();
          }
          

          So stuff like docs.add(eventSerializer.prepareInputDocument(event)) would be abstracted into indexer.load(event), and solrServer.add(docs) + solrServer.commit() would be abstracted into indexer.commitSolrTransaction().
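To make that abstraction concrete, here is a minimal, self-contained sketch of how a sink's transaction loop might drive the proposed interface. The Event record and the in-memory indexer are stand-ins: a real implementation would buffer SolrInputDocuments, flush them via ConcurrentUpdateSolrServer, and keep the checked exceptions declared in the interface above.

```java
import java.util.ArrayList;
import java.util.List;

public class SolrIndexerSketch {

  // Stand-in for a Flume event.
  record Event(String body) {}

  // Trimmed version of the proposed interface (checked exceptions omitted
  // to keep the sketch self-contained).
  interface SolrIndexer {
    void load(Event event);
    void beginSolrTransaction();
    void commitSolrTransaction();
    void rollbackSolrTransaction();
  }

  // In-memory indexer: buffers documents and "commits" them to a store,
  // the way a real implementation would flush a batch to Solr.
  static class InMemoryIndexer implements SolrIndexer {
    final List<String> pending = new ArrayList<>();
    final List<String> committed = new ArrayList<>();
    public void load(Event event) { pending.add(event.body()); }
    public void beginSolrTransaction() { pending.clear(); }
    public void commitSolrTransaction() { committed.addAll(pending); pending.clear(); }
    public void rollbackSolrTransaction() { pending.clear(); }
  }

  // The sink's process loop reduces to: begin, load each event, commit;
  // roll back if anything fails mid-batch.
  static void process(List<Event> batch, SolrIndexer indexer) {
    indexer.beginSolrTransaction();
    try {
      for (Event e : batch) {
        indexer.load(e);
      }
      indexer.commitSolrTransaction();
    } catch (RuntimeException e) {
      indexer.rollbackSolrTransaction();
      throw e;
    }
  }

  public static void main(String[] args) {
    InMemoryIndexer indexer = new InMemoryIndexer();
    process(List.of(new Event("log line 1"), new Event("log line 2")), indexer);
    System.out.println(indexer.committed.size()); // prints 2
  }
}
```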

          Thoughts?

          Aside from this suggestion, could you also do the following?

          1. Attach a .patch file that compiles instead of a jar
          2. Ensure indentation is consistent and kept to 2 spaces
          3. How about some unit tests?

          Regards,
          Mike

          Mike Percy added a comment -

          Oops a couple more things:

          4. Please add Apache license headers to the top of each new file
          5. Please remove @author annotations

          Mike Percy added a comment -

          Hi Israel, wondering if you got a chance to review my comments?

          Israel Ekpo added a comment -

          Thanks Mike.

          I did get a chance to review your comments. I think that is a good idea.

          I would like to add a load(List<Event> events) method that accepts multiple events at once.
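If the interface targets Java 8+, that batch variant could be a default method delegating to the single-event load, so existing implementations stay source-compatible while bulk-capable ones override it. A sketch (the Event record here is a stand-in, and checked exceptions are omitted):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchLoadSketch {

  // Stand-in for a Flume event.
  record Event(String body) {}

  interface SolrIndexer {
    void load(Event event);

    // Batch variant: the default method simply loops over the single-event
    // load; implementations backed by a bulk API can override it.
    default void load(List<Event> events) {
      for (Event e : events) {
        load(e);
      }
    }
  }

  public static void main(String[] args) {
    List<String> seen = new ArrayList<>();
    SolrIndexer indexer = e -> seen.add(e.body());
    indexer.load(List.of(new Event("a"), new Event("b")));
    System.out.println(seen); // prints [a, b]
  }
}
```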

          I was unavailable these last few weeks.

          It is probably too late now to make 1.4.

          I can submit an updated patch in a week that we can put in the next release.

          Another thing I am also working on is an HTTP client that we can use for both the Solr and ElasticSearch sinks, so that we are not tightly coupled to dependencies that can break if the server version is not the same as what the client/sink is using.

          Gopal Patwa added a comment -

          I was just curious to know how this feature compares with "Flume Morphline Solr Sink" https://issues.apache.org/jira/browse/FLUME-2070

          Should I use "Morphline Solr Sink" or "ApacheSolrSink" for generating a Solr index using Flume?

          wolfgang hoschek added a comment -

          My understanding of FLUME-1687 is that it simply forwards the flume headers as-is to Solr, i.e. it essentially expects an upstream component to send flume events that conform and are formatted exactly as required by Solr. I think it also doesn't support SolrCloud.

          In contrast, Morphline Solr Sink is well suited for use cases that stream raw data into HDFS (via the HdfsSink) and simultaneously extract, transform, and load the same data into Solr. In particular, the Morphline Solr Sink can process arbitrary heterogeneous raw data from disparate data sources and turn it into a data model that is useful to search applications. The ETL functionality is customizable using a morphline configuration file that defines a chain of pluggable transformation commands that pipe event records from one command to another. The Morphline Solr Sink also supports SolrCloud, transactional batching for more scalability, and Solr collection aliases (e.g. for transparent expiry of old index partitions).

          Morphline Solr Sink can do everything that FLUME-1687 can do, and more.

          Would be nice to merge those two efforts into one.


            People

            • Assignee:
              Israel Ekpo
              Reporter:
              wolfgang hoschek
            • Votes:
              4 Vote for this issue
              Watchers:
              9 Start watching this issue
