[CONNECTORS-1219] Lucene Output Connector - ASF JIRA

Shinichiro Abe added a comment - 30/Jun/15 03:49

I'll attached patch this week. I'm testing this connector.

Shinichiro Abe added a comment - 30/Jun/15 03:49 I'll attached patch this week. I'm testing this connector.

Karl Wright added a comment - 30/Jun/15 06:18

Lucene's memory model is not bounded at all, so I'd worry that that a connector that directly wrote to Lucene indexes might well have memory problems that would be very difficult to address.

Karl Wright added a comment - 30/Jun/15 06:18 Lucene's memory model is not bounded at all, so I'd worry that that a connector that directly wrote to Lucene indexes might well have memory problems that would be very difficult to address.

Shinichiro Abe added a comment - 02/Jul/15 16:59

strawman patch, still be improved more.
I think this connector will need to have much heap memory for working well. Where are memory problems you said? Multiple threads are writing to an index? If so, I took it into account the below.
In tika connector, on the other hand, BodyContentHandler should be replaced with WriteOutContentHandler because any connectors might treat big string object. WriteOutContentHandler has writeLimit param and have used by tika facade or jackrabbit oak's solr integration to avoid consuming more memory. Also, I have a plan to introduce mcf-search-api-service.war based on this connector, since mcf would be able to have a search engine with pull-agent, it's just an idea for me though. As to Lucene memory, multiple connections of this connector share one client instance per local path because of those, and I also have an idea to use it from search-api side.

Shinichiro Abe added a comment - 02/Jul/15 16:59 strawman patch, still be improved more. I think this connector will need to have much heap memory for working well. Where are memory problems you said? Multiple threads are writing to an index? If so, I took it into account the below. In tika connector, on the other hand, BodyContentHandler should be replaced with WriteOutContentHandler because any connectors might treat big string object. WriteOutContentHandler has writeLimit param and have used by tika facade or jackrabbit oak's solr integration to avoid consuming more memory. Also, I have a plan to introduce mcf-search-api-service.war based on this connector, since mcf would be able to have a search engine with pull-agent, it's just an idea for me though. As to Lucene memory, multiple connections of this connector share one client instance per local path because of those, and I also have an idea to use it from search-api side.

Karl Wright added a comment - 03/Jul/15 12:21

Would you like to create a branch, branches/~~CONNECTORS-1219~~, and commit your changes, so that it is easier to review the code?

I will need to ask on the Lucene list whether or not Lucene's memory usage can be constrained to a fixed limit without too much trouble. In the past this was not easy.

Karl Wright added a comment - 03/Jul/15 12:21 Would you like to create a branch, branches/ CONNECTORS-1219 , and commit your changes, so that it is easier to review the code? I will need to ask on the Lucene list whether or not Lucene's memory usage can be constrained to a fixed limit without too much trouble. In the past this was not easy.

Shinichiro Abe added a comment - 04/Jul/15 00:58

Yes. I'll create a branch and add some fixes.

Shinichiro Abe added a comment - 04/Jul/15 00:58 Yes. I'll create a branch and add some fixes.

Shinichiro Abe added a comment - 04/Jul/15 07:36

I created a branch. r1689110.
And added some fixes. r1689113.

I have known issue or limitations in the branch.

indexing big contents with parallel output connections may happen OOM. to avoid this:
- reduce throttling size
- make tika to cut content out by limit(not implemented)
- make term vector off (not implemented)
can not reflect online schema changes until being called by IConnector.disconnect() or poll() expiration.
not implemented analyzer resource path.
not implemented other field types except for string and text.

Please review the branch. LuceneClientTest.java by maven and luke index browser might be helpful for test. Thank you.

Shinichiro Abe added a comment - 04/Jul/15 07:36 I created a branch. r1689110. And added some fixes. r1689113. I have known issue or limitations in the branch. indexing big contents with parallel output connections may happen OOM. to avoid this: reduce throttling size make tika to cut content out by limit(not implemented) make term vector off (not implemented) can not reflect online schema changes until being called by IConnector.disconnect() or poll() expiration. not implemented analyzer resource path. not implemented other field types except for string and text. Please review the branch. LuceneClientTest.java by maven and luke index browser might be helpful for test. Thank you.

Shinichiro Abe added a comment - 04/Jul/15 07:38

attached patch was committed to the branch.

Shinichiro Abe added a comment - 04/Jul/15 07:38 attached patch was committed to the branch.

Karl Wright added a comment - 04/Jul/15 13:09

Thanks I will look at this on Monday.

Karl Wright added a comment - 04/Jul/15 13:09 Thanks I will look at this on Monday.

Karl Wright added a comment - 06/Jul/15 12:13

Hi Abe-san,

Looking at this code:

  private LuceneDocument buildDocument(String documentURI, RepositoryDocument document) throws Exception {
    LuceneDocument doc = new LuceneDocument();

    doc = LuceneDocument.addField(doc, client.idField(), documentURI, client.fieldsInfo());

    try
    {
      Reader r = new InputStreamReader(document.getBinaryStream(), StandardCharsets.UTF_8);
      StringBuilder sb = new StringBuilder((int)document.getBinaryLength());
      char[] buffer = new char[65536];
      while (true)
      {
        int amt = r.read(buffer,0,buffer.length);
        if (amt == -1)
          break;
        sb.append(buffer,0,amt);
      }
      doc = LuceneDocument.addField(doc, client.contentField(), sb.toString(), client.fieldsInfo());
    } catch (Exception e) {
      if (e instanceof IOException) {
        Logging.connectors.error("[Parsing Content]Content is not text plain, verify you are properly using Apache Tika Transformer " + documentURI, e);
      } else {
        throw e;
      }
    }

    Iterator<String> it = document.getFields();
    while (it.hasNext()) {
      String rdField = it.next();
      if (client.fieldsInfo().containsKey(rdField)) {
        try
        {
          String[] values = document.getFieldAsStrings(rdField);
          for (String value : values) {
            doc = LuceneDocument.addField(doc, rdField, value, client.fieldsInfo());
          }
        } catch (IOException e) {
          Logging.connectors.error("[Getting Field Values]Impossible to read value for metadata " + rdField + " " + documentURI, e);
        }
      }
    }
    return doc;
  }

As you can see, there is no limit on the amount of memory that would be required to index a single document. A 10gb document would require 10gb or more of memory. The potential amount of memory varies also by the number of worker threads – if all 30 worker threads all happen to want to index a 10gb document at the same time, the memory requirement would be 300gb. Indeed, there is no memory size that you could set which would work reliably.

We have this problem also with the Solr connector when the extracting update handler is not used – in that case, we require the user to set a maximum file length value. Even that is not a good solution, but it is the only one possible given Solr's standard update handler architecture. For a Lucene connector, we would need to have similar required constraints.

Karl Wright added a comment - 06/Jul/15 12:13 Hi Abe-san, Looking at this code: private LuceneDocument buildDocument( String documentURI, RepositoryDocument document) throws Exception { LuceneDocument doc = new LuceneDocument(); doc = LuceneDocument.addField(doc, client.idField(), documentURI, client.fieldsInfo()); try { Reader r = new InputStreamReader(document.getBinaryStream(), StandardCharsets.UTF_8); StringBuilder sb = new StringBuilder(( int )document.getBinaryLength()); char [] buffer = new char [65536]; while ( true ) { int amt = r.read(buffer,0,buffer.length); if (amt == -1) break ; sb.append(buffer,0,amt); } doc = LuceneDocument.addField(doc, client.contentField(), sb.toString(), client.fieldsInfo()); } catch (Exception e) { if (e instanceof IOException) { Logging.connectors.error( "[Parsing Content]Content is not text plain, verify you are properly using Apache Tika Transformer " + documentURI, e); } else { throw e; } } Iterator< String > it = document.getFields(); while (it.hasNext()) { String rdField = it.next(); if (client.fieldsInfo().containsKey(rdField)) { try { String [] values = document.getFieldAsStrings(rdField); for ( String value : values) { doc = LuceneDocument.addField(doc, rdField, value, client.fieldsInfo()); } } catch (IOException e) { Logging.connectors.error( "[Getting Field Values]Impossible to read value for metadata " + rdField + " " + documentURI, e); } } } return doc; } As you can see, there is no limit on the amount of memory that would be required to index a single document. A 10gb document would require 10gb or more of memory. The potential amount of memory varies also by the number of worker threads – if all 30 worker threads all happen to want to index a 10gb document at the same time, the memory requirement would be 300gb. Indeed, there is no memory size that you could set which would work reliably. We have this problem also with the Solr connector when the extracting update handler is not used – in that case, we require the user to set a maximum file length value. Even that is not a good solution, but it is the only one possible given Solr's standard update handler architecture. For a Lucene connector, we would need to have similar required constraints.

Shinichiro Abe added a comment - 06/Jul/15 19:04

Thank you for the review. Added Maximumdocumentlength params and field, r1689479 to the branch.

It seems to me that isInteger() function at editconnection.jsp doesn't strictly check for integer value IIUC, is it expected? Solr connector's max length check on the jsp could be also passed to long value.
BTW, if it was used Integer.MAX_VALUE on the field, StringBuilder init would raise OOM when adding big binary in the connection because char array exceeded max capacity.

And big binary was be able to reject to ingest by having max length, but I found another OOMs which were caused by Lucene.

Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
	at org.apache.lucene.codecs.compressing.CompressingTermVectorsWriter$FieldData.<init>(CompressingTermVectorsWriter.java:157)
	at org.apache.lucene.codecs.compressing.CompressingTermVectorsWriter$DocData.addField(CompressingTermVectorsWriter.java:106)
	at org.apache.lucene.codecs.compressing.CompressingTermVectorsWriter.startField(CompressingTermVectorsWriter.java:287)
	at org.apache.lucene.index.TermVectorsConsumerPerField.finishDocument(TermVectorsConsumerPerField.java:81)
	at org.apache.lucene.index.TermVectorsConsumer.finishDocument(TermVectorsConsumer.java:110)
	at org.apache.lucene.index.TermsHash.finishDocument(TermsHash.java:93)
	at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:316)
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232)
	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:458)
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1363)
	at org.apache.manifoldcf.agents.output.lucene.LuceneClient.addOrReplace(LuceneClient.java:321)
	at org.apache.manifoldcf.agents.output.lucene.LuceneConnector.addOrReplaceDocumentWithException(LuceneConnector.java:333)
	at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3221)

I will add term_vector true|false option on the fields.

Caused by: java.lang.OutOfMemoryError: Java heap space
	at org.apache.lucene.util.ArrayUtil.grow(ArrayUtil.java:345)
	at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.writeField(CompressingStoredFieldsWriter.java:297)
	at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:361)
	at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232)
	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:458)
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1363)
	at org.apache.manifoldcf.agents.output.lucene.LuceneClient.addOrReplace(LuceneClient.java:321)
	at org.apache.manifoldcf.agents.output.lucene.LuceneConnector.addOrReplaceDocumentWithException(LuceneConnector.java:333)
	at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester

This OOM could be resolved by tika write limit.

Shinichiro Abe added a comment - 06/Jul/15 19:04 Thank you for the review. Added Maximumdocumentlength params and field, r1689479 to the branch. It seems to me that isInteger() function at editconnection.jsp doesn't strictly check for integer value IIUC, is it expected? Solr connector's max length check on the jsp could be also passed to long value. BTW, if it was used Integer.MAX_VALUE on the field, StringBuilder init would raise OOM when adding big binary in the connection because char array exceeded max capacity. And big binary was be able to reject to ingest by having max length, but I found another OOMs which were caused by Lucene. Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded at org.apache.lucene.codecs.compressing.CompressingTermVectorsWriter$FieldData.<init>(CompressingTermVectorsWriter.java:157) at org.apache.lucene.codecs.compressing.CompressingTermVectorsWriter$DocData.addField(CompressingTermVectorsWriter.java:106) at org.apache.lucene.codecs.compressing.CompressingTermVectorsWriter.startField(CompressingTermVectorsWriter.java:287) at org.apache.lucene.index.TermVectorsConsumerPerField.finishDocument(TermVectorsConsumerPerField.java:81) at org.apache.lucene.index.TermVectorsConsumer.finishDocument(TermVectorsConsumer.java:110) at org.apache.lucene.index.TermsHash.finishDocument(TermsHash.java:93) at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:316) at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232) at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:458) at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1363) at org.apache.manifoldcf.agents.output.lucene.LuceneClient.addOrReplace(LuceneClient.java:321) at org.apache.manifoldcf.agents.output.lucene.LuceneConnector.addOrReplaceDocumentWithException(LuceneConnector.java:333) at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3221) I will add term_vector true|false option on the fields. Caused by: java.lang.OutOfMemoryError: Java heap space at org.apache.lucene.util.ArrayUtil.grow(ArrayUtil.java:345) at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.writeField(CompressingStoredFieldsWriter.java:297) at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:361) at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300) at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232) at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:458) at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1363) at org.apache.manifoldcf.agents.output.lucene.LuceneClient.addOrReplace(LuceneClient.java:321) at org.apache.manifoldcf.agents.output.lucene.LuceneConnector.addOrReplaceDocumentWithException(LuceneConnector.java:333) at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester This OOM could be resolved by tika write limit.

Karl Wright added a comment - 06/Jul/15 19:41

This OOM could be resolved by tika write limit.

I don't think so, because it occurs after the LuceneDocument structure has been built already. It occurs on the client.addOrReplace() line:

      LuceneDocument inputDoc = buildDocument(documentURI, document);
      client.addOrReplace(documentURI, inputDoc);

This is likely because Lucene needs some multiple of the maximum size of a document in order to compress field values. But as long as memory consumption overall is limited by some user-controllable means, it's still OK, and the file size limit should do that.

Karl Wright added a comment - 06/Jul/15 19:41 This OOM could be resolved by tika write limit. I don't think so, because it occurs after the LuceneDocument structure has been built already. It occurs on the client.addOrReplace() line: LuceneDocument inputDoc = buildDocument(documentURI, document); client.addOrReplace(documentURI, inputDoc); This is likely because Lucene needs some multiple of the maximum size of a document in order to compress field values. But as long as memory consumption overall is limited by some user-controllable means, it's still OK, and the file size limit should do that.

Shinichiro Abe added a comment - 06/Jul/15 20:04

r1689485.

StringBuilder(int capacity) , this capacity was approximately 700 MB. In the past I realized Solrj also have the same limitation, even though Solrj doesn't use StringBuilder, but use String.getBytes().

org.apache.lucene.lucene.util.ArrayUtil.grow is also using byte array, maybe occurs OOM when exceeding that size.

Shinichiro Abe added a comment - 06/Jul/15 20:04 r1689485. StringBuilder(int capacity) , this capacity was approximately 700 MB. In the past I realized Solrj also have the same limitation, even though Solrj doesn't use StringBuilder, but use String.getBytes(). org.apache.lucene.lucene.util.ArrayUtil.grow is also using byte array, maybe occurs OOM when exceeding that size.

Karl Wright added a comment - 09/Jul/15 15:00

Hi shinichiro abe, I talked with Mike McCandless (Lucene committer) about this yesterday. He says that there are Reader versions of all the addField() methods for a LuceneDocument. The only time they cannot be used is if the field is stored but not indexed (if I recall correctly)?

Karl Wright added a comment - 09/Jul/15 15:00 Hi shinichiro abe , I talked with Mike McCandless (Lucene committer) about this yesterday. He says that there are Reader versions of all the addField() methods for a LuceneDocument. The only time they cannot be used is if the field is stored but not indexed (if I recall correctly)?

Shinichiro Abe added a comment - 10/Jul/15 06:59

Thanks Karl, I'll try to change String argument into Reader. But Reader is used if the field is indexed but not stored. I'll add StoredField when storing value, but there is byte array capacity, approximately 2g. Anyway maxDocumentLength is required, I think.

Shinichiro Abe added a comment - 10/Jul/15 06:59 Thanks Karl, I'll try to change String argument into Reader. But Reader is used if the field is indexed but not stored . I'll add StoredField when storing value, but there is byte array capacity, approximately 2g. Anyway maxDocumentLength is required, I think.

Karl Wright added a comment - 10/Jul/15 10:09

I know that stored fields can't use Readers, but I am not clear yet why. Maybe a Lucene patch would be possible? Perhaps mikemccand knows the reason for this?

Karl Wright added a comment - 10/Jul/15 10:09 I know that stored fields can't use Readers, but I am not clear yet why. Maybe a Lucene patch would be possible? Perhaps mikemccand knows the reason for this?

Michael McCandless added a comment - 10/Jul/15 13:40

We could possibly patch Lucene to allow stored=true for Reader as well ... this is probably quite tricky, e.g. the codec APIs (StoredFieldsFormat) would need to accept Reader too.

Even if we did that, though, a very large document can still be problematic. You should test using Reader just for indexing: it could also be even this still puts too much heap pressure because IndexWriter must store all tokens for that one document in heap before it can write a new segment.

Michael McCandless added a comment - 10/Jul/15 13:40 We could possibly patch Lucene to allow stored=true for Reader as well ... this is probably quite tricky, e.g. the codec APIs (StoredFieldsFormat) would need to accept Reader too. Even if we did that, though, a very large document can still be problematic. You should test using Reader just for indexing: it could also be even this still puts too much heap pressure because IndexWriter must store all tokens for that one document in heap before it can write a new segment.

Karl Wright added a comment - 10/Jul/15 23:30

Thanks for the response.

We're interested mainly on being able to "bound" the memory usage for indexing. That implies that there is nothing that consumes memory O where n is the size of the document.

Karl Wright added a comment - 10/Jul/15 23:30 Thanks for the response. We're interested mainly on being able to "bound" the memory usage for indexing. That implies that there is nothing that consumes memory O where n is the size of the document.

Karl Wright added a comment - 14/Jul/15 12:54

Hi Abe-san,

I noticed you made some more commits. Do you think this connector now manages memory reasonably? Do you want to try it with some large crawls to be sure?

If you are ready, I would be happy to help merge it into trunk.

Karl Wright added a comment - 14/Jul/15 12:54 Hi Abe-san, I noticed you made some more commits. Do you think this connector now manages memory reasonably? Do you want to try it with some large crawls to be sure? If you are ready, I would be happy to help merge it into trunk.

Shinichiro Abe added a comment - 14/Jul/15 18:22

Thanks for the review as for my commits. I added term vector option, and replaced string args with reader, since then, I didn't see any errors about OOM. And I did big crawling in my chaos manifoldcf directory many times, it is okay for managing memory as long as setting proper max doc length.
I'm ready to merge this, but before merging, please check this patch which impls a simple search handler working on jetty, i confirmed it worked well a few hours ago(And I want to add highlighting response this week). I plan to create a search servet and its api in the future, which has a role of distributed searching for multiple mcf instance on multiple nodes. the servet will have to send requests to more than one jetty search handler. i'd like to add this, but is this a too much feature for users?

Shinichiro Abe added a comment - 14/Jul/15 18:22 Thanks for the review as for my commits. I added term vector option, and replaced string args with reader, since then, I didn't see any errors about OOM. And I did big crawling in my chaos manifoldcf directory many times, it is okay for managing memory as long as setting proper max doc length. I'm ready to merge this, but before merging, please check this patch which impls a simple search handler working on jetty, i confirmed it worked well a few hours ago(And I want to add highlighting response this week). I plan to create a search servet and its api in the future, which has a role of distributed searching for multiple mcf instance on multiple nodes. the servet will have to send requests to more than one jetty search handler. i'd like to add this, but is this a too much feature for users?

Shinichiro Abe added a comment - 14/Jul/15 18:23

lucene search handler patch for the branch.

Shinichiro Abe added a comment - 14/Jul/15 18:23 lucene search handler patch for the branch.

Karl Wright added a comment - 14/Jul/15 18:32

Hi Abe-san,

It doesn't seem like a good idea to make the mcf jetty runner know about stuff from an individual connector. I also think that the Lucene search server is really not part of ManifoldCF proper either. I could imagine it being constructed as a ManifoldCF "Lucene plugin". Would you be willing to code it in that way?

Karl Wright added a comment - 14/Jul/15 18:32 Hi Abe-san, It doesn't seem like a good idea to make the mcf jetty runner know about stuff from an individual connector. I also think that the Lucene search server is really not part of ManifoldCF proper either. I could imagine it being constructed as a ManifoldCF "Lucene plugin". Would you be willing to code it in that way?

Shinichiro Abe added a comment - 15/Jul/15 07:09

Hi Karl,

Ok, I'll take a lucene plugin way. I think I have to put the plugin on mcf jetty runner somehow because the search handler is required to work on the same jetty runner process, near real time indexsearcher takes the indexreader which is using indexwriter's memory buffer, so the search handler needs to take the indexwriter that is working on mcf crawler agent. I'll think of this on another issue later.
This week I'll merge into trunk at the moment in the branch. Thanks.

Shinichiro Abe added a comment - 15/Jul/15 07:09 Hi Karl, Ok, I'll take a lucene plugin way. I think I have to put the plugin on mcf jetty runner somehow because the search handler is required to work on the same jetty runner process, near real time indexsearcher takes the indexreader which is using indexwriter's memory buffer, so the search handler needs to take the indexwriter that is working on mcf crawler agent. I'll think of this on another issue later. This week I'll merge into trunk at the moment in the branch. Thanks.

Karl Wright added a comment - 15/Jul/15 08:21

Hi Abe-san,

From what you say, only the single-process example can possibly work with the Lucene output connector that you have proposed. None of the multi-process or distributed models will work with it properly.

Before you commit to trunk, we really have to think this through, because this would be the first connector with such a restriction. It might be better, for instance, to have a secondary process in which Lucene runs, and a socket (maybe with a REST API?) where the documents are sent and/or requests are made. It is more work but it is also more consistent with ManifoldCF operating model.

Karl Wright added a comment - 15/Jul/15 08:21 Hi Abe-san, From what you say, only the single-process example can possibly work with the Lucene output connector that you have proposed. None of the multi-process or distributed models will work with it properly. Before you commit to trunk, we really have to think this through, because this would be the first connector with such a restriction. It might be better, for instance, to have a secondary process in which Lucene runs, and a socket (maybe with a REST API?) where the documents are sent and/or requests are made. It is more work but it is also more consistent with ManifoldCF operating model.

Shinichiro Abe added a comment - 15/Jul/15 08:48 - edited

File System Output Connector doesn't work on multi-process as well. I can't create a Lucene rest server because there are already another many rest search servers. I'd like to deal with this connector as well as FS output connector.

Shinichiro Abe added a comment - 15/Jul/15 08:48 - edited File System Output Connector doesn't work on multi-process as well. I can't create a Lucene rest server because there are already another many rest search servers. I'd like to deal with this connector as well as FS output connector.

Karl Wright added a comment - 15/Jul/15 09:02

Hi Abe-san,
The File System Output connector can be used to write to distributed file systems such as Windows shares and Unix file systems like AFS. Plus, it does not require other services to run in the same process space. So it really does fit the MCF model as-is. The Lucene Output Connector cannot be used in its current form in any multiprocess model, AND we need to make special allowance for it at the framework level because of that process constraint. So that makes it unique right now, and we need to figure out how best to deal with that.

Karl Wright added a comment - 15/Jul/15 09:02 Hi Abe-san, The File System Output connector can be used to write to distributed file systems such as Windows shares and Unix file systems like AFS. Plus, it does not require other services to run in the same process space. So it really does fit the MCF model as-is. The Lucene Output Connector cannot be used in its current form in any multiprocess model, AND we need to make special allowance for it at the framework level because of that process constraint. So that makes it unique right now, and we need to figure out how best to deal with that.

Shinichiro Abe added a comment - 15/Jul/15 09:15

I understand. FS output connector can't work on multi-process model unless it uses NFS. Unfortunately, as to Lucene index, NFS doesn't recommended. I'm troubled.

Shinichiro Abe added a comment - 15/Jul/15 09:15 I understand. FS output connector can't work on multi-process model unless it uses NFS. Unfortunately, as to Lucene index, NFS doesn't recommended. I'm troubled.

Karl Wright added a comment - 15/Jul/15 09:58

This is why I think we need a different process architecture.

There's a technology we use for Documentum and FileNet that might help here, called RMI. Each of these connectors has two "sidecar" processes that are required – one is a service process, and the other is a registry process. There is only one of each process for a connector for all of the ManifoldCF processes.

If there is a Lucene sidecar process, it could also run Jetty and provide search services, so it would all work.

RMI uses Java serialization to work, so I don't know whether streams would do the right thing or not. I will have to do some research into how to do it. But if Java streams do not work there still should be a way to do it, because the underlying idea is just a socket that connects objects on either side of the process boundary.

Karl Wright added a comment - 15/Jul/15 09:58 This is why I think we need a different process architecture. There's a technology we use for Documentum and FileNet that might help here, called RMI. Each of these connectors has two "sidecar" processes that are required – one is a service process, and the other is a registry process. There is only one of each process for a connector for all of the ManifoldCF processes. If there is a Lucene sidecar process, it could also run Jetty and provide search services, so it would all work. RMI uses Java serialization to work, so I don't know whether streams would do the right thing or not. I will have to do some research into how to do it. But if Java streams do not work there still should be a way to do it, because the underlying idea is just a socket that connects objects on either side of the process boundary.

Shinichiro Abe added a comment - 15/Jul/15 10:44

If you said about Java serialization of indexwriter, I know indexwriter cannot be serialized. I tried that before. my test case .

Shinichiro Abe added a comment - 15/Jul/15 10:44 If you said about Java serialization of indexwriter, I know indexwriter cannot be serialized. I tried that before. my test case .

Karl Wright added a comment - 15/Jul/15 12:09

Hi Abe-san,

No, it is not necessary to serialize indexwriter. I think you may misunderstand the proposal. So to make it clear:

(1) ALL lucene activity would happen in one sidecar process, including the Lucene searcher and a separate Jetty instance it would run under
(2) ManifoldCF would have multiple processes
(3) Communication between the ManifoldCF processes and the Lucene process would be via a socket
(4) The socket protocol would either be Java-serialization-based RMI (which I would need to research), or some other low-level protocol. The goal would be to NOT use REST or XML or JSON or any other heavyweight, open protocol.
(5) The reason an open protocol is undesirable is because we definitely don't want to reinvent ElasticSearch, Solr, or any other Lucene wrapper. The reason, though, to have a separate process is because Lucene's memory and disk model is inconsistent with ManifoldCF's.

Does this make sense?

Karl Wright added a comment - 15/Jul/15 12:09 Hi Abe-san, No, it is not necessary to serialize indexwriter. I think you may misunderstand the proposal. So to make it clear: (1) ALL lucene activity would happen in one sidecar process, including the Lucene searcher and a separate Jetty instance it would run under (2) ManifoldCF would have multiple processes (3) Communication between the ManifoldCF processes and the Lucene process would be via a socket (4) The socket protocol would either be Java-serialization-based RMI (which I would need to research), or some other low-level protocol. The goal would be to NOT use REST or XML or JSON or any other heavyweight, open protocol. (5) The reason an open protocol is undesirable is because we definitely don't want to reinvent ElasticSearch, Solr, or any other Lucene wrapper. The reason, though, to have a separate process is because Lucene's memory and disk model is inconsistent with ManifoldCF's. Does this make sense?

Karl Wright added a comment - 15/Jul/15 12:09

Hi Abe-san,

No, it is not necessary to serialize indexwriter. I think you may misunderstand the proposal. So to make it clear:

(1) ALL lucene activity would happen in one sidecar process, including the Lucene searcher and a separate Jetty instance it would run under
(2) ManifoldCF would have multiple processes
(3) Communication between the ManifoldCF processes and the Lucene process would be via a socket
(4) The socket protocol would either be Java-serialization-based RMI (which I would need to research), or some other low-level protocol. The goal would be to NOT use REST or XML or JSON or any other heavyweight, open protocol.
(5) The reason an open protocol is undesirable is because we definitely don't want to reinvent ElasticSearch, Solr, or any other Lucene wrapper. The reason, though, to have a separate process is because Lucene's memory and disk model is inconsistent with ManifoldCF's.

Does this make sense?

Karl Wright added a comment - 15/Jul/15 12:09 Hi Abe-san, No, it is not necessary to serialize indexwriter. I think you may misunderstand the proposal. So to make it clear: (1) ALL lucene activity would happen in one sidecar process, including the Lucene searcher and a separate Jetty instance it would run under (2) ManifoldCF would have multiple processes (3) Communication between the ManifoldCF processes and the Lucene process would be via a socket (4) The socket protocol would either be Java-serialization-based RMI (which I would need to research), or some other low-level protocol. The goal would be to NOT use REST or XML or JSON or any other heavyweight, open protocol. (5) The reason an open protocol is undesirable is because we definitely don't want to reinvent ElasticSearch, Solr, or any other Lucene wrapper. The reason, though, to have a separate process is because Lucene's memory and disk model is inconsistent with ManifoldCF's. Does this make sense?

Shinichiro Abe added a comment - 16/Jul/15 02:42

Yes, it does for separate process and RMI. But there still has a serialization problem.
I'm not sure about RMI, read mcf in action yesterday though, but when mcf'connection invokes the method which will add or replace a document via RMI, the class having that method have to be implemented serializable. This class may have LuceneClient which has a indexwriter. Is this correct? If so, maybe it will not work. If correct, it works well if the method is implemented by not having LuceneClient in that class, and the method just puts to something object like queue, then LuceneClient picks up from the queue. But this case is not enough for me in indexing latency-wise.
A few month ago I was looking for lowerest indexing latency implementation as pull crawler model. At that time, I used apache spark, ignite working on distributed nodes, which require to implement serializable class. I used lucene indexes with local disk version or hdfs version, but all I did ended up with a failure because of indexwriter serialization. After that I thought mcf could become the the best lowest indexing latency application when we set up mcf single processes to each node. The each node has each index. But this thought does not meet mcf multi process model though.

Shinichiro Abe added a comment - 16/Jul/15 02:42 Yes, it does for separate process and RMI. But there still has a serialization problem. I'm not sure about RMI, read mcf in action yesterday though, but when mcf'connection invokes the method which will add or replace a document via RMI, the class having that method have to be implemented serializable. This class may have LuceneClient which has a indexwriter. Is this correct? If so, maybe it will not work. If correct, it works well if the method is implemented by not having LuceneClient in that class, and the method just puts to something object like queue, then LuceneClient picks up from the queue. But this case is not enough for me in indexing latency-wise. A few month ago I was looking for lowerest indexing latency implementation as pull crawler model. At that time, I used apache spark, ignite working on distributed nodes, which require to implement serializable class. I used lucene indexes with local disk version or hdfs version, but all I did ended up with a failure because of indexwriter serialization. After that I thought mcf could become the the best lowest indexing latency application when we set up mcf single processes to each node. The each node has each index. But this thought does not meet mcf multi process model though.

Karl Wright added a comment - 16/Jul/15 07:51

After that I thought mcf could become the the best lowest indexing latency application when we set up mcf single processes to each node. The each node has each index.

Hi Abe-san,

Thank you, this makes it more clear what you are trying to do. I will need to think about the whole problem carefully for a time to be sure there is a solution that meets your goal. But it is worth mentioning that a separate process that you communicate to over a socket is not necessarily slow. On unix systems, at least, this can be very very fast on localhost, and even when not on localhost it can be made fast too with proper network architecture.

The alternative is really to create a Lucene application that wraps MCF, rather than the other way around. I'd have to think carefully about that but I believe you'd want to create your own war, something like combined.war, which would include your lucene service as well as the crawler UI. It's not ideal because the lucene connector would not work like other connectors, but there would at least be a possibility of deployment under tomcat, and there would not be a Lucene dependency for most people who aren't doing real-time work.

So, if using a sidecar process is where you choose to go:

My original idea was to serialize the document, not the LuceneClient or IndexWriter. But with RMI that would require two things: first, document would have to be written to a temporary disk file, and second, somewhere we would need a persistent LuceneClient class created in the sidecar process. That is not typical with RMI, and writing to disk is slower too than using a stream over a socket.

The sidecar process would, though, have jetty anyway. So you could have a servlet that listened for three things: HTTP POST of a multipart document, HTTP DELETE given a document ID, and HTTP GET to get status. Streaming a multipart document using HttpClient from the Lucene connector would be straightforward and would not involve a temporary disk file. On the sidecar process side, I also believe you would be able to wrap the incoming post and its metadata in Reader objects if you were careful. The LuceneClient would be present in the sidecar Jetty process only, and could be initialized as part of servlet initialization, so no serialization would be needed. The Lucene Connector would only have to stream the document using HttpClient.

Some coding would be needed to figure out which of these possibilities works best for your purpose. But I think those are your main choices.

Thoughts?

Karl Wright added a comment - 16/Jul/15 07:51 After that I thought mcf could become the the best lowest indexing latency application when we set up mcf single processes to each node. The each node has each index. Hi Abe-san, Thank you, this makes it more clear what you are trying to do. I will need to think about the whole problem carefully for a time to be sure there is a solution that meets your goal. But it is worth mentioning that a separate process that you communicate to over a socket is not necessarily slow. On unix systems, at least, this can be very very fast on localhost, and even when not on localhost it can be made fast too with proper network architecture. The alternative is really to create a Lucene application that wraps MCF, rather than the other way around. I'd have to think carefully about that but I believe you'd want to create your own war, something like combined.war, which would include your lucene service as well as the crawler UI. It's not ideal because the lucene connector would not work like other connectors, but there would at least be a possibility of deployment under tomcat, and there would not be a Lucene dependency for most people who aren't doing real-time work. So, if using a sidecar process is where you choose to go: My original idea was to serialize the document, not the LuceneClient or IndexWriter. But with RMI that would require two things: first, document would have to be written to a temporary disk file, and second, somewhere we would need a persistent LuceneClient class created in the sidecar process. That is not typical with RMI, and writing to disk is slower too than using a stream over a socket. The sidecar process would, though, have jetty anyway. So you could have a servlet that listened for three things: HTTP POST of a multipart document, HTTP DELETE given a document ID, and HTTP GET to get status. Streaming a multipart document using HttpClient from the Lucene connector would be straightforward and would not involve a temporary disk file. On the sidecar process side, I also believe you would be able to wrap the incoming post and its metadata in Reader objects if you were careful. The LuceneClient would be present in the sidecar Jetty process only, and could be initialized as part of servlet initialization, so no serialization would be needed. The Lucene Connector would only have to stream the document using HttpClient. Some coding would be needed to figure out which of these possibilities works best for your purpose. But I think those are your main choices. Thoughts?

Aingaran Pillai added a comment - 16/Jul/15 19:35 - edited

shinichiro abe for a low latency crawl solution you may want to look at Apache Storm. Here's a pull crawler implementation based on Apache Storm: https://github.com/DigitalPebble/storm-crawler. It doesn't do permissions though.

Aingaran Pillai added a comment - 16/Jul/15 19:35 - edited shinichiro abe for a low latency crawl solution you may want to look at Apache Storm. Here's a pull crawler implementation based on Apache Storm: https://github.com/DigitalPebble/storm-crawler . It doesn't do permissions though.

Shinichiro Abe added a comment - 17/Jul/15 05:17

Thanks apillaiz, I'd like to collect not only web content but also manifold repositories content.

DaddyWri, I discovered the OakDirectory which extends Lucene Directory class. I saw the below comment, they also had multi process(cluster) problem as to Lucene index, and they put the index to Blob object that means mongodb or rdb storage. From that, I come to switching Directory impl, for instance, we use FSDirectory on mcf single process, and use HdfsDirectory on mcf multi process. The writes to Hdfs was slow when I tried to use before. But this will be expected to improve.
I don't want to use RMI because... first: to avoid complexable operation or increase 2 steps for bootstrap on single process mode, second: I don't know how to write the test code, third: around me, only one user uses multi process and everyone will hope to run mcf as OOTB as possible, fourth: jackrabbit 2 has RMI api but oak doesn't have one. I think RMI is not cool as well as CMIS rather than JCR , fifth: I want to make mcf easy to use. These are not technical reason, but HdfsDirectory will help us.

Shinichiro Abe added a comment - 17/Jul/15 05:17 Thanks apillaiz , I'd like to collect not only web content but also manifold repositories content. DaddyWri , I discovered the OakDirectory which extends Lucene Directory class. I saw the below comment, they also had multi process(cluster) problem as to Lucene index, and they put the index to Blob object that means mongodb or rdb storage. From that, I come to switching Directory impl, for instance, we use FSDirectory on mcf single process, and use HdfsDirectory on mcf multi process. The writes to Hdfs was slow when I tried to use before. But this will be expected to improve. I don't want to use RMI because... first: to avoid complexable operation or increase 2 steps for bootstrap on single process mode, second: I don't know how to write the test code, third: around me, only one user uses multi process and everyone will hope to run mcf as OOTB as possible, fourth: jackrabbit 2 has RMI api but oak doesn't have one. I think RMI is not cool as well as CMIS rather than JCR , fifth: I want to make mcf easy to use. These are not technical reason, but HdfsDirectory will help us.

Karl Wright added a comment - 17/Jul/15 05:38

Hi Abe-san,
This sounds like a workable solution to the cluster problem. Can you
also write your lucene searcher to use the same technology?

Sent from my Windows Phone
From: Shinichiro Abe (JIRA)
Sent: 7/17/2015 1:18 AM
To: dev@manifoldcf.apache.org
Subject: [jira] [Commented] (~~CONNECTORS-1219~~) Lucene Output Connector

[ https://issues.apache.org/jira/browse/CONNECTORS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630787#comment-14630787
]

Shinichiro Abe commented on ~~CONNECTORS-1219~~:
--------------------------------------------

Thanks apillaiz, I'd like to collect not only web content but also
manifold repositories content.

DaddyWri, I discovered the
OakDirectory
which extends Lucene Directory class. I saw the below comment, they
also had multi process(cluster) problem as to Lucene index, and they
put the index to Blob object that means mongodb or rdb storage. From
that, I come to switching Directory impl, for instance, we use
FSDirectory on mcf single process, and use
HdfsDirectory
on mcf multi process. The writes to Hdfs was
slow
when I tried to use before. But this will be expected to improve.
I don't want to use RMI because... first: to avoid complexable
operation or increase 2 steps for bootstrap on single process mode,
second: I don't know how to write the test code, third: around me,
only one user uses multi process and everyone will hope to run mcf as
OOTB as possible, fourth: jackrabbit 2 has RMI api but oak doesn't
have one. I think RMI is not cool as well as CMIS rather than JCR ,
fifth: I want to make mcf easy to use. These are not technical reason,
but HdfsDirectory will help us.

–
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Karl Wright added a comment - 17/Jul/15 05:38 Hi Abe-san, This sounds like a workable solution to the cluster problem. Can you also write your lucene searcher to use the same technology? Sent from my Windows Phone From: Shinichiro Abe (JIRA) Sent: 7/17/2015 1:18 AM To: dev@manifoldcf.apache.org Subject: [jira] [Commented] ( CONNECTORS-1219 ) Lucene Output Connector [ https://issues.apache.org/jira/browse/CONNECTORS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630787#comment-14630787 ] Shinichiro Abe commented on CONNECTORS-1219 : -------------------------------------------- Thanks apillaiz , I'd like to collect not only web content but also manifold repositories content. DaddyWri , I discovered the OakDirectory which extends Lucene Directory class. I saw the below comment, they also had multi process(cluster) problem as to Lucene index, and they put the index to Blob object that means mongodb or rdb storage. From that, I come to switching Directory impl, for instance, we use FSDirectory on mcf single process, and use HdfsDirectory on mcf multi process. The writes to Hdfs was slow when I tried to use before. But this will be expected to improve. I don't want to use RMI because... first: to avoid complexable operation or increase 2 steps for bootstrap on single process mode, second: I don't know how to write the test code, third: around me, only one user uses multi process and everyone will hope to run mcf as OOTB as possible, fourth: jackrabbit 2 has RMI api but oak doesn't have one. I think RMI is not cool as well as CMIS rather than JCR , fifth: I want to make mcf easy to use. These are not technical reason, but HdfsDirectory will help us. – This message was sent by Atlassian JIRA (v6.3.4#6332)

Shinichiro Abe added a comment - 18/Jul/15 02:52

it will work if we just create new indexsearcher with new indexreader which takes HdfsDirectory.

as to searcher it depends on using near realtime search or not.
(1) coexist writer and searcher
this is a approach like solr/solrcloud or elasticsearch.
indexsearcher can search the documents indexwriter has.
even if to write to hdfs is slow, indexsearcher can search in-memory uncommitted documents from indexwriter
(2) separate into writer side and searcher side.
this is a approach like solr's legacy style, master(writer)-slave(searcher) architecture, so we can't use near realtime search.
indexsearcher searches the documents from hdfs in which there are the documents committed by indexwriter.

which are fitted to mcf standard?

in solr, elasticsearch, oak and sling, documents are searchable as soon as clients post the documents. oak and sling are content repository with search index by push model(posts a document from client, then stores it to repository and index it simultaneously), these are bounded by jcr standard though. on the other hand, mcf is pull model. the search applications through output connector have a responsibility for whether documents are searchable soon. so according to mcf standard, lucene connector will have to choose (2) with the plugin but near realtime searching is lost. I intended to (1) in the v0.3 patch.

btw, alfresco, liferay and drupal are also content repository with pull model clawls, I heard it from someone, but these differs from mcf's doc version checking, these can index documents using something like transaction info about CRUD documents which is managed by repository side, so documents are indexed soon and are searchable soon. mcf is bounded by a limitation of repository side, e.g. concurrent access limit(shared drive, web, alfresco, cmis, sharpoint… almost all repository?) or heavy cpu load on repo side by multi-threading access. unfortunately, I heard mcf crawls is slow from some users sometimes so far, of course I knew and explained them that is not in mcf's taking care of, then adjusted repo side or customize existing connectors. as my first approach for those, I had an idea to index documents to local disk by using lucene without any http transport and use near realtime search with writer's buffered document, i.e. (1) approach. currently, I have no idea for repository side limitation though.

Shinichiro Abe added a comment - 18/Jul/15 02:52 it will work if we just create new indexsearcher with new indexreader which takes HdfsDirectory. as to searcher it depends on using near realtime search or not. (1) coexist writer and searcher this is a approach like solr/solrcloud or elasticsearch. indexsearcher can search the documents indexwriter has. even if to write to hdfs is slow, indexsearcher can search in-memory uncommitted documents from indexwriter (2) separate into writer side and searcher side. this is a approach like solr's legacy style, master(writer)-slave(searcher) architecture, so we can't use near realtime search. indexsearcher searches the documents from hdfs in which there are the documents committed by indexwriter. which are fitted to mcf standard? in solr, elasticsearch, oak and sling, documents are searchable as soon as clients post the documents. oak and sling are content repository with search index by push model(posts a document from client, then stores it to repository and index it simultaneously), these are bounded by jcr standard though. on the other hand, mcf is pull model. the search applications through output connector have a responsibility for whether documents are searchable soon. so according to mcf standard, lucene connector will have to choose (2) with the plugin but near realtime searching is lost. I intended to (1) in the v0.3 patch. btw, alfresco, liferay and drupal are also content repository with pull model clawls, I heard it from someone, but these differs from mcf's doc version checking, these can index documents using something like transaction info about CRUD documents which is managed by repository side, so documents are indexed soon and are searchable soon. mcf is bounded by a limitation of repository side, e.g. concurrent access limit(shared drive, web, alfresco, cmis, sharpoint… almost all repository?) or heavy cpu load on repo side by multi-threading access. unfortunately, I heard mcf crawls is slow from some users sometimes so far, of course I knew and explained them that is not in mcf's taking care of, then adjusted repo side or customize existing connectors. as my first approach for those, I had an idea to index documents to local disk by using lucene without any http transport and use near realtime search with writer's buffered document, i.e. (1) approach. currently, I have no idea for repository side limitation though.

Karl Wright added a comment - 18/Jul/15 07:59

Hi Abe-san,
Repository problem is hard to fix because it is a characteristic of the
repository. Only model_add_change_delete connectors would be expected
to work in real time. And none of our connectors have this model
because no repositories support it.

Maybe you could write a repository connector to a push technology that
other repository manufacturers might make an effort to integrate with,
but that would be for the future anyhow.

Sent from my Windows Phone
From: Shinichiro Abe (JIRA)
Sent: 7/17/2015 10:53 PM
To: dev@manifoldcf.apache.org
Subject: [jira] [Commented] (~~CONNECTORS-1219~~) Lucene Output Connector

[ https://issues.apache.org/jira/browse/CONNECTORS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632230#comment-14632230
]

Shinichiro Abe commented on ~~CONNECTORS-1219~~:
--------------------------------------------

it will work if we just create new indexsearcher with new indexreader
which takes HdfsDirectory.

as to searcher it depends on using near realtime search or not.
(1) coexist writer and searcher
this is a approach like solr/solrcloud or elasticsearch.
indexsearcher can search the documents indexwriter has.
even if to write to hdfs is slow, indexsearcher can search in-memory
uncommitted documents from indexwriter
(2) separate into writer side and searcher side.
this is a approach like solr's legacy style,
master(writer)-slave(searcher) architecture, so we can't use near
realtime search.
indexsearcher searches the documents from hdfs in which there are the
documents committed by indexwriter.

which are fitted to mcf standard?

in solr, elasticsearch, oak and sling, documents are searchable as
soon as clients post the documents. oak and sling are content
repository with search index by push model(posts a document from
client, then stores it to repository and index it simultaneously),
these are bounded by jcr standard though. on the other hand, mcf is
pull model. the search applications through output connector have a
responsibility for whether documents are searchable soon. so according
to mcf standard, lucene connector will have to choose (2) with the
plugin but near realtime searching is lost. I intended to (1) in the
v0.3 patch.

btw, alfresco, liferay and drupal are also content repository with
pull model clawls, I heard it from someone, but these differs from
mcf's doc version checking, these can index documents using something
like transaction info about CRUD documents which is managed by
repository side, so documents are indexed soon and are searchable
soon. mcf is bounded by a limitation of repository side, e.g.
concurrent access limit(shared drive, web, alfresco, cmis, sharpoint…
almost all repository?) or heavy cpu load on repo side by
multi-threading access. unfortunately, I heard mcf crawls is slow from
some users sometimes so far, of course I knew and explained them that
is not in mcf's taking care of, then adjusted repo side or customize
existing connectors. as my first approach for those, I had an idea to
index documents to local disk by using lucene without any http
transport and use near realtime search with writer's buffered
document, i.e. (1) approach. currently, I have no idea for repository
side limitation though.

–
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Karl Wright added a comment - 18/Jul/15 07:59 Hi Abe-san, Repository problem is hard to fix because it is a characteristic of the repository. Only model_add_change_delete connectors would be expected to work in real time. And none of our connectors have this model because no repositories support it. Maybe you could write a repository connector to a push technology that other repository manufacturers might make an effort to integrate with, but that would be for the future anyhow. Sent from my Windows Phone From: Shinichiro Abe (JIRA) Sent: 7/17/2015 10:53 PM To: dev@manifoldcf.apache.org Subject: [jira] [Commented] ( CONNECTORS-1219 ) Lucene Output Connector [ https://issues.apache.org/jira/browse/CONNECTORS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632230#comment-14632230 ] Shinichiro Abe commented on CONNECTORS-1219 : -------------------------------------------- it will work if we just create new indexsearcher with new indexreader which takes HdfsDirectory. as to searcher it depends on using near realtime search or not. (1) coexist writer and searcher this is a approach like solr/solrcloud or elasticsearch. indexsearcher can search the documents indexwriter has. even if to write to hdfs is slow, indexsearcher can search in-memory uncommitted documents from indexwriter (2) separate into writer side and searcher side. this is a approach like solr's legacy style, master(writer)-slave(searcher) architecture, so we can't use near realtime search. indexsearcher searches the documents from hdfs in which there are the documents committed by indexwriter. which are fitted to mcf standard? in solr, elasticsearch, oak and sling, documents are searchable as soon as clients post the documents. oak and sling are content repository with search index by push model(posts a document from client, then stores it to repository and index it simultaneously), these are bounded by jcr standard though. on the other hand, mcf is pull model. the search applications through output connector have a responsibility for whether documents are searchable soon. so according to mcf standard, lucene connector will have to choose (2) with the plugin but near realtime searching is lost. I intended to (1) in the v0.3 patch. btw, alfresco, liferay and drupal are also content repository with pull model clawls, I heard it from someone, but these differs from mcf's doc version checking, these can index documents using something like transaction info about CRUD documents which is managed by repository side, so documents are indexed soon and are searchable soon. mcf is bounded by a limitation of repository side, e.g. concurrent access limit(shared drive, web, alfresco, cmis, sharpoint… almost all repository?) or heavy cpu load on repo side by multi-threading access. unfortunately, I heard mcf crawls is slow from some users sometimes so far, of course I knew and explained them that is not in mcf's taking care of, then adjusted repo side or customize existing connectors. as my first approach for those, I had an idea to index documents to local disk by using lucene without any http transport and use near realtime search with writer's buffered document, i.e. (1) approach. currently, I have no idea for repository side limitation though. – This message was sent by Atlassian JIRA (v6.3.4#6332)

Shinichiro Abe added a comment - 02/Aug/15 09:40

r1693798 to the branch.
The multiprocess mode works with hdfs indexes. I've tested zk and file processes example.
The hdfs indexes have an index per a processId this time since an indexwriter works per a process, if I make indexwriters to index across processes, indexwriter throws LockObtainException. In this condition, removeDocument could not work properly because the connections don't know processId, know only documentURI. Please advice for me.

Shinichiro Abe added a comment - 02/Aug/15 09:40 r1693798 to the branch. The multiprocess mode works with hdfs indexes. I've tested zk and file processes example. The hdfs indexes have an index per a processId this time since an indexwriter works per a process, if I make indexwriters to index across processes, indexwriter throws LockObtainException. In this condition, removeDocument could not work properly because the connections don't know processId, know only documentURI. Please advice for me.

Karl Wright added a comment - 02/Aug/15 10:04

There is no guarantee that ManifoldCF will issue a delete for a document from the same process that indexed it. Each process assumes that it can index or remove all documents.

I have not looked at your code, but I thought the whole reason for having indexes in HDFS was to be able to access them from multiple processes? Or maybe I misunderstand something?

Karl Wright added a comment - 02/Aug/15 10:04 There is no guarantee that ManifoldCF will issue a delete for a document from the same process that indexed it. Each process assumes that it can index or remove all documents. I have not looked at your code, but I thought the whole reason for having indexes in HDFS was to be able to access them from multiple processes? Or maybe I misunderstand something?

Shinichiro Abe added a comment - 03/Aug/15 00:41

currently I took a solrcloud on hdfs way where the relationship between an indexwriter and an index directory is 1 by 1. In HdfsDirectory, I can replace HdfsLockFactory with NoLockFactory which OakDirectory applies. If I do, perhaps multiple indexwriters will happen some errors about when updating an index segment because each indexwriter has each segment info, when updating index segment, then an indexwriter realizes the difference between a segment info itself and an existing segment info which other' writer wrote, as a result some exception will throw. It is worth to try to change NoLockFactory for me, but I think this impl is risky thouth, I'll look into this next weekend. Thanks.

Shinichiro Abe added a comment - 03/Aug/15 00:41 currently I took a solrcloud on hdfs way where the relationship between an indexwriter and an index directory is 1 by 1. In HdfsDirectory, I can replace HdfsLockFactory with NoLockFactory which OakDirectory applies. If I do, perhaps multiple indexwriters will happen some errors about when updating an index segment because each indexwriter has each segment info, when updating index segment, then an indexwriter realizes the difference between a segment info itself and an existing segment info which other' writer wrote, as a result some exception will throw. It is worth to try to change NoLockFactory for me, but I think this impl is risky thouth, I'll look into this next weekend. Thanks.

Shinichiro Abe added a comment - 11/Aug/15 04:12

Progress report: multiple indexwriters to an index with NoLockFactory lead to corrupt the index.

ERROR 2015-08-04 08:17:27,565 (Worker thread '32') - Exception tossed: org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=1768776044 vs expected footer=-1071082520 (resource=_br_Lucene50_0.pos)
org.apache.manifoldcf.core.interfaces.ManifoldCFException: org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=1768776044 vs expected footer=-1071082520 (resource=_br_Lucene50_0.pos)

In Oak even if there are multiple indexwriters, in fact a single thread writes to an index in the cluster.
http://markmail.org/thread/2awr5or54vpexzx2

In MCF I think we can have three alternatives.

use LockManager.enterWriteLock() in multiprocess mode to get global lock and to guarantee single writer when writing.
(But it didn't work when I tried. Maybe it was incorrect for me to write the code. Also, multiple fast indexing is lost by single writer, so I don't want to use that.)
use RMI.
(Because there is no way except for this at this time, this will require much time to implement.)
This connector doesn't support multiprocess mode unless mcf supports removeDocument per process.
(Is this violate for mcf's multiprocess specification?)
I'm likely to give up this connector unless any help. I'll postpone this ticket for the time being.

Shinichiro Abe added a comment - 11/Aug/15 04:12 Progress report: multiple indexwriters to an index with NoLockFactory lead to corrupt the index. ERROR 2015-08-04 08:17:27,565 (Worker thread '32') - Exception tossed: org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=1768776044 vs expected footer=-1071082520 (resource=_br_Lucene50_0.pos) org.apache.manifoldcf.core.interfaces.ManifoldCFException: org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=1768776044 vs expected footer=-1071082520 (resource=_br_Lucene50_0.pos) In Oak even if there are multiple indexwriters, in fact a single thread writes to an index in the cluster. http://markmail.org/thread/2awr5or54vpexzx2 In MCF I think we can have three alternatives. use LockManager.enterWriteLock() in multiprocess mode to get global lock and to guarantee single writer when writing. (But it didn't work when I tried. Maybe it was incorrect for me to write the code. Also, multiple fast indexing is lost by single writer, so I don't want to use that.) use RMI. (Because there is no way except for this at this time, this will require much time to implement.) This connector doesn't support multiprocess mode unless mcf supports removeDocument per process. (Is this violate for mcf's multiprocess specification?) I'm likely to give up this connector unless any help. I'll postpone this ticket for the time being.

Karl Wright added a comment - 11/Aug/15 05:29

I think it is a good idea to postpone.

Meanwhile I will talk with Mike McCandless to see if there is any possibility you may have overlooked.

Karl Wright added a comment - 11/Aug/15 05:29 I think it is a good idea to postpone. Meanwhile I will talk with Mike McCandless to see if there is any possibility you may have overlooked.

Karl Wright added a comment - 11/Aug/15 10:10

I talked with Mike McCandless. He too does not see any good way forward, other than to possibly use sharding. But, as you have discovered, deletes must occur to the same shard that a document is indexed to, which breaks the model under ManifoldCF.

Karl Wright added a comment - 11/Aug/15 10:10 I talked with Mike McCandless. He too does not see any good way forward, other than to possibly use sharding. But, as you have discovered, deletes must occur to the same shard that a document is indexed to, which breaks the model under ManifoldCF.

Shinichiro Abe added a comment - 25/Feb/18 17:19

Feel free to revisit if volunteers step forward.

Shinichiro Abe added a comment - 25/Feb/18 17:19 Feel free to revisit if volunteers step forward.

ManifoldCF

Lucene Output Connector

Details

Description

Attachments

Attachments

Activity

People

Dates