Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: ManifoldCF 0.3
    • Component/s: Installers
    • Labels: None

      Description

      The current requirement that the user install and deploy a PostgreSQL server complicates the installation and deployment of LCF for the user. Installation and deployment of LCF should be as simple as Solr itself. QuickStart is great for the low end and basic evaluation, but a comparable level of simplified installation and deployment is still needed for full-blown, high-end environments that need the full performance of a PostgreSQL-class database server. So, PostgreSQL should be bundled with the packaged release of LCF so that installation and deployment of LCF will automatically install and deploy a subset of the full PostgreSQL distribution that is sufficient for the needs of LCF. Starting LCF, with or without the LCF UI, should automatically start the database server. Shutting down LCF should also shut down the database server process.

      A typical use case would be for a non-developer who is comfortable with Solr and simply wants to crawl documents from, for example, a SharePoint repository and feed them into Solr. QuickStart should work well for the low end or in the early stages of evaluation, but the user would prefer to evaluate "the real thing" with something resembling a production crawl of thousands of documents. Such a user might not be a hard-core developer or be comfortable fiddling with a lot of software components simply to do one conceptually simple operation.

      It should still be possible for the user to supply database server settings to override the defaults, but the LCF package should have all of the best-practice settings deemed appropriate for use with LCF.

      One downside is that installation and deployment will be platform-specific since there are multiple processes and PostgreSQL itself requires a platform-specific installation.

      This proposal presumes that PostgreSQL is the best option for the foreseeable future, but nothing here is intended to preclude support for other database servers in future releases.

      This proposal should not have any impact on QuickStart packaging or deployment.

      Note: This issue is part of Phase 1 of the CONNECTORS-50 umbrella issue.

        Activity

        Karl Wright added a comment -

        Hi Jack,
        This seems to me to be beyond the scope of most open-source installers. I've constructed installers involving Postgres before and the integration possibilities are very limited. Furthermore, you would need a totally different installer for Windows, Debian, Red Hat, Solaris, the Mac, etc. Many of these platforms do not work well with bundles but instead use a dependency model in any case.

        — original message —
        From: "ext Jack Krupansky (JIRA)" <jira@apache.org>
        Subject: [jira] Created: (CONNECTORS-55) Bundle database server with LCF packaged product
        Date: July 8, 2010
        Time: 4:35:20 PM

        Bundle database server with LCF packaged product
        ------------------------------------------------

        Key: CONNECTORS-55
        URL: https://issues.apache.org/jira/browse/CONNECTORS-55
        Project: Lucene Connector Framework
        Issue Type: Improvement
        Components: Framework core
        Reporter: Jack Krupansky

        The current requirement that the user install and deploy a PostgreSQL server complicates the installation and deployment of LCF for the user. Installation and deployment of LCF should be as simple as Solr itself. QuickStart is great for the low end and basic evaluation, but a comparable level of simplified installation and deployment is still needed for full-blown, high-end environments that need the full performance of a PostgreSQL-class database server. So, PostgreSQL should be bundled with the packaged release of LCF so that installation and deployment of LCF will automatically install and deploy a subset of the full PostgreSQL distribution that is sufficient for the needs of LCF. Starting LCF, with or without the LCF UI, should automatically start the database server. Shutting down LCF should also shut down the database server process.

        A typical use case would be for a non-developer who is comfortable with Solr and simply wants to crawl documents from, for example, a SharePoint repository and feed them into Solr. QuickStart should work well for the low end or in the early stages of evaluation, but the user would prefer to evaluate "the real thing" with something resembling a production crawl of thousands of documents. Such a user might not be a hard-core developer or be comfortable fiddling with a lot of software components simply to do one conceptually simple operation.

        It should still be possible for the user to supply database server settings to override the defaults, but the LCF package should have all of the best-practice settings deemed appropriate for use with LCF.

        One downside is that installation and deployment will be platform-specific since there are multiple processes and PostgreSQL itself requires a platform-specific installation.

        This proposal presumes that PostgreSQL is the best option for the foreseeable future, but nothing here is intended to preclude support for other database servers in future releases.

        This proposal should not have any impact on QuickStart packaging or deployment.

        Note: This issue is part of Phase 1 of the CONNECTORS-50 umbrella issue.


        This message is automatically generated by JIRA.
        -
        You can reply to this email to add a comment to the issue online.

        Jack Krupansky added a comment -

        I was using the term "install" loosely, not so much in the way a typical package has a GUI wizard and lots of stuff going on, but more in the sense of raw Solr, where you download and unzip, and the files are in subdirectories right where they need to be. In that sense, the theory is that a subset of PostgreSQL could live in a subdirectory.

        Some enterprising vendor, such as Lucid Imagination, might want to have a fancy GUI install, but that would be beyond the scope of what I intended here.

        Mark Miller added a comment -

        All the more reason to get LCF working completely with other Java databases.

        Karl Wright added a comment -

        Mark, it took most of a month to get Derby working, and to do it I needed to disable certain functionality in LCF. No performance tuning or analysis has yet been done on Derby, and I would not be surprised if another month was required to complete that. Point being that it is by no means ever a "plug and play" operation to switch databases - there are just way too many side effects (e.g. query A performs wonderfully on database X, but you need to use query B or you're dead on database Y). Jack, for example, was extremely surprised to learn that embedded Derby would not allow more than one process to access the database at a time - and Jack was the one advocating most strongly for Derby support!

        I therefore strongly suggest a cautious approach when considering introducing additional databases. Testing of any change also becomes much more difficult the more supported databases there are. So, in my view, one really must ask, "What unmet scenario do you see that would demand support for this database?", before just going ahead and deciding to support whatever may be out there. I realize this cautious approach is diametrically opposed to your stated goal of supporting "other Java databases". Perhaps you could clarify your request so that we could understand your true goal here.

        Robert Muir added a comment -

        Karl, just an idea.

        Given the issues you had with Derby, instead of doing performance tuning or analysis, maybe Mark's idea is a good approach first.

        I know I originally suggested HSQLDB; I don't know anything about Derby except that it's Apache, but I've done real work with HSQLDB.

        Karl Wright added a comment -

        Robert,

        I'm not opposed to implementing support for hsqldb, but let's be clear on the goals here.

        The initial goal for doing the Derby implementation was simply to be able to write unit tests, and to make Jack happy. Later, because Derby has an embedded JDBC mode of operation, it was possible to construct the LCF Quick-Start to use it. That made an unzip-and-go solution possible.

        What would the goal be of using HSQLDB? It seems to support an embedded mode, so it could certainly be used instead of Derby wherever we are currently using Derby. Since it fully supports MVCC, it is certainly much closer to PostgreSQL in actual operation than Derby is, so chances are good that we'd find fewer issues in scaling than with Derby. If this is the approach you are suggesting, I would suggest dropping support for Derby and simply replacing it with HSQLDB. We'd leave the Derby implementation class around, of course, but we'd not tune against it or test against it.

        FWIW, if HSQLDB is sufficiently performant, I could foresee also dropping support for PostgreSQL in the future, in the same way. But that is yet to be proven. And, indeed, that's what the problem is: there's no way to know, in advance of doing the work, exactly how things will pan out. So if that's the true goal, we've got a fair bit of work to do before deciding whether HSQLDB or Derby or any other Java database can actually do what we need it to.

        Robert Muir added a comment -

        Karl, I agree with the goals you stated for the Derby impl.

        I just looked at some of the problems you had (such as multi-process support) and thought that perhaps these are specific to Derby, not a general problem across Java databases.

        I've used HSQLDB in both modes that it supports: embedded and client-server. Perhaps it might be a good fit, given that the embedded mode could be exploited for JUnit tests, and client-server mode for production?

        Can you help me out and give me more ideas on what particular performance problems you are concerned about (e.g. query types or whatever)?

        Karl Wright added a comment -

        >>>>>>
        Can you help me out and give me more ideas on what particular performance problems you are concerned about (e.g. query types or whatever)?
        <<<<<<

        Hi Robert,
        There are two major determinants of performance for LCF, under PostgreSQL at any rate. The first is the performance of the queue stuffer query, and how that scales when the queue is extremely large. This is a complex query, but its basic form is:

        SELECT <rowdata> FROM <queuetable> WHERE <some conditions> AND NOT EXISTS(<other row-specific conditions in the same table>) ORDER BY <priority> ASC LIMIT <typically some hundreds of records>

        Because the queue may be very large, and this query may potentially return ALL records in the queue, the query plan MUST wind up reading directly out of the priority index, or the query simply will not work. The database simply cannot afford to read 20 million records into memory and then sort them!
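        As an illustration only (SQLite standing in for the databases discussed here, with a hypothetical schema and column names), the basic shape of the stuffer query, and the priority index that the ORDER BY ... LIMIT must be answerable from, can be sketched as:

```python
import sqlite3

# Hypothetical schema; SQLite stands in for PostgreSQL/Derby purely to
# illustrate the query shape described above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobqueue (id INTEGER, priority REAL, status TEXT)")
# The ORDER BY ... LIMIT must be served from this index; sorting a
# multi-million-row queue in memory does not scale.
conn.execute("CREATE INDEX priority_idx ON jobqueue (priority)")
conn.executemany(
    "INSERT INTO jobqueue VALUES (?, ?, ?)",
    [(i, i * 0.5, "P" if i % 2 == 0 else "A") for i in range(10000)],
)

# Basic form: row conditions, a NOT EXISTS against the same table, and an
# index-friendly ORDER BY ... LIMIT returning a few hundred candidates.
rows = conn.execute(
    """
    SELECT t0.id FROM jobqueue t0
    WHERE t0.status = 'P'
      AND NOT EXISTS (SELECT 1 FROM jobqueue t1
                      WHERE t1.id = t0.id AND t1.status = 'A')
    ORDER BY t0.priority ASC
    LIMIT 400
    """
).fetchall()
print(len(rows))  # 400
```

        With the index in place the planner can stream rows in priority order and stop at the LIMIT; without it, the sort has to materialize the entire queue first.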

        The second place performance can be severely impacted is in the degree of write parallelism. In PostgreSQL 7.4, for example, everything was single-threaded on writes. This made web crawling in particular perform poorly, because every typical web page has a significant number of links that must be entered in the queue, and single-threading that process cost some 4x to 10x relative to PostgreSQL 8.x, which allowed much more parallelism.

        Hope this helps.

        Karl Wright added a comment -

        I should add that, even for PostgreSQL, we've had to mess with the stuffer query on pretty near every point release of PostgreSQL to guarantee that it continues to meet the basic criteria. The last change we needed was to perform an ANALYZE every time before the query was run. Why? Because PostgreSQL 8.3 became somehow incredibly sensitive to small changes in statistics and would cease to do the right thing very quickly as the database changed. We looked at this and discovered that it took a specific plan optimization path when it thought a particular statistic was 100%, and a totally different one when the statistic was anything less than 100%. A bug? Well, no, just a sensitivity... I guess one could call it a design flaw that nobody thought about what might happen if the statistics were slightly out of date.
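        The mitigation described here can be sketched as follows (again an illustration only, with SQLite standing in for PostgreSQL and a hypothetical table): refresh the optimizer's statistics immediately before issuing the plan-sensitive query, so the plan is never chosen against stale numbers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobqueue (id INTEGER PRIMARY KEY, priority REAL)")
conn.execute("CREATE INDEX priority_idx ON jobqueue (priority)")
conn.executemany(
    "INSERT INTO jobqueue (priority) VALUES (?)",
    [(i * 0.1,) for i in range(5000)],
)

def stuffer_fetch(conn, limit=200):
    # Run ANALYZE right before the query, as described above, so the
    # planner sees statistics that match the table's current shape.
    conn.execute("ANALYZE")
    return conn.execute(
        "SELECT id FROM jobqueue ORDER BY priority ASC LIMIT ?", (limit,)
    ).fetchall()

rows = stuffer_fetch(conn)
print(len(rows))  # 200
```

        The cost is an extra statistics pass per query; the benefit is that a planner which flips between plans on small statistics drift always decides from fresh data.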

        Robert Muir added a comment -

        Karl, just quoting from the docs here. From my experience, I had no perf problems in the past getting HSQLDB to use my indexes, but my queries were relatively simple (just lots of data).

        Because the queue may be very large, and this query may potentially return ALL records in the queue, the query plan MUST wind up reading directly out of the priority index, or the query simply will not work. It simply cannot afford to read 20 million records into memory and then sort them!

        HyperSQL can use an index on an ORDER BY clause if all the columns in ORDER BY are in a single-column or multi-column index (in the exact order). This is important if there is a LIMIT n (or FETCH n ROWS ONLY) clause. In this situation, the use of index allows the query processor to access only the number of rows specified in the LIMIT clause, instead of building the whole result set, which can be huge. This also works for joined tables when the ORDER BY clause is on the columns of the first table in a join. Indexes are used in the same way when ORDER BY ... DESC is specified in the query. Note that unlike other RDBMS, HyperSQL does not create DESC indexes. It can use any index for ORDER BY ... DESC.

        Robert Muir added a comment -

        Why? Because Postgresql 8.3 became somehow incredibly sensitive to small changes in statistics and would cease to do the right thing very quickly as the database changed.

        Well, perhaps it might be worth looking into, as HSQLDB does fewer of these types of magic optimizations. For example, it's up to you to order your join clause in a way that makes sense.
        By the way, here is the link to the user's guide from which I pasted the text above; you can read more about it there (there is an example query that seems similar to yours, etc.):

        http://hsqldb.org/doc/2.0/guide/sqlgeneral-chapt.html#N107B1

        Mark Miller added a comment -

        Jack, for example, was extremely surprised to learn that embedded Derby would not allow more than one process to access the database at a time

        You can switch to a mode that allows multiple connections - I've started with one and moved to the other in the past. It's been too long, so I'd have to go look, but it's entirely possible to use very similar code and run in a server mode.

        so that we could understand your true goal here.

        To make trying out and installing LCF much easier - the bar on this thing, as a user and a dev, is just kind of ridiculous at the moment. I know a lot of improvements are being made, but having to install a separate Postgres db just to get going - just to try LCF out - is something I'd like to see go away. A Java db would allow us to get to a point where you download the thing and launch it - that would really get this project going. It would be helpful for attracting both users and a community of devs. There are at least half a dozen or more committers that signed up for this project, but the bar is so high to even 'do anything' that I suspect that's why most of them have yet to contribute even a comment - they don't have the time or motivation to go through that huge 'running LCF' doc - it's a bear just to skim, to say nothing of going through the steps. In general, I like to think of all that work as something the computer can do for me.

        Baby steps though - I'm just pushing in that direction - I know there would be a long, long path to get there.

        Karl Wright added a comment -

        Mark,

        If your concern is about installing LCF, read the Quick Start part of the build/deploy page. You check out, build, and run. Derby-based. Nothing else to install. Not hard really.

        Mark Miller added a comment -

        If it's now that easy, then fantastic! The last quick start guide I saw was like many thousands of words, so it's nice to see it boiled down to like a dozen.

        That is exactly what I was leaning towards - but what kind of hobbled state are you in with Derby? You said you have to run the db and ui one at a time or something? And that many SQL queries don't work with Derby - has all of that been addressed already?

        Anyhow, you asked what my goal was for pushing getting a working java db - and this is exactly it.

        Jack Krupansky added a comment -

        Karl notes that "we've had to mess with the stuffer query on pretty near every point release of Postgresql". Letting/forcing the user to pick the right/acceptable release of PostgreSQL to install is error prone and a support headache. I would argue that it is better for the LCF team to bundle the right/best release of PostgreSQL with LCF.

        Karl Wright added a comment -

        >>>>>>
        That is exactly what I was leaning towards - but what kind of hobbled state are you in with derby? You said you have to run the db and ui one at a time or something? And that many sql queries don't work with derby - that has all been addressed already?

        <<<<<<

        The "hobbling" is that you can't sort on some columns in reports that you could sort on before when just Postgresql was involved. Also, that no real large-scale perf tests have been done on Derby. Also that you need to use "LIKE" %-based syntax instead of real regular expressions whenever you specify regular expressions in your reports. The quick-start does not limit your simultaneous use of UI and crawler - it runs jetty as the app server within the same process. It does limit your ability to use other commands simultaneously - but you should not need to do that in normal circumstances.

        So "that" has indeed already been addressed.

        Karl Wright added a comment -

        >>>>>>
        forcing the user to pick the right/acceptable release of PostgreSQL to install is error prone and a support headache
        <<<<<<

        Yup. It is. Problem is that products/versions get security fixes, CVE's, end-of-life notices, etc. It is beyond the scope of LCF to try and control all that - we'd be buying a whole new level of support headache, believe me.

        Jack Krupansky added a comment -

        When Karl says "It does limit your ability to use other commands simultaneously" (referring to use of embedded Derby), he is referring to commands executed using the "executecommand" shell script, such as registering and unregistering connectors, which is something typically done once before starting the UI or once every blue moon when you want to support a new type of repository, but not done on as regular a basis as editing connections and jobs and running jobs. The java classes to execute those commands would be, by definition, outside of the LCF process.

        Karl Wright added a comment -

        The quick-start even takes care of connector registration for you, so executecommand is not needed even then. What you don't get to do is use the command-based API to control LCF; that's not going to work in the single-process model.

        By the way, hsqldb is apparently limited to a 16GB database (version 2.0). That's not very much.

        Robert Muir added a comment -

        By the way, hsqldb is apparently limited to a 16GB database (version 2.0). That's not very much.

        No, its just a default I think: http://hsqldb.org/doc/2.0/guide/deployment-chapt.html

        <set files scale statement> ::= SET FILES SCALE <scale value>

        Changes the scale factor for the .data file. The default scale is 8 and allows 16GB of data storage capacity. The scale can be increased in order to increase the maximum data storage capacity. The scale values 8, 16, 32, 64 and 128 are allowed. Scale value 128 allows a maximum capacity of 256GB.
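        In practice, per the HSQLDB 2.0 deployment guide quoted above, raising the limit would be a one-line database setting (sketch only; scale 128 is the documented maximum):

        ```sql
        -- Per the HSQLDB 2.0 guide: raise the .data file scale factor from
        -- the default 8 (16GB capacity) to 128 (256GB capacity).
        SET FILES SCALE 128;
        ```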

        Karl Wright added a comment -

        Moving this to "installers" category.

        Otis Gospodnetic added a comment -

        Joining late. I've used PG a lot at my previous startup. Liked it, but it doesn't seem like the right DB to embed in a product like LCF. I'm surprised nobody mentioned H2 (which I've never used, but thought that it's THE database to use in situations like this one). Look at the feature table on http://www.h2database.com/html/main.html

        Karl Wright added a comment -

        H2 looks pretty impressive also, featurewise.
        I actually already did a hsqldb driver for LCF but simply have not had a moment to even try it out. It should not be hard to attempt one for H2. Simple problems ought to manifest themselves rather quickly under the unit tests. Ideally, though, we need a large test to figure out what embedded database to choose.

        The hardest methods of the driver to write are the "interrogation" methods - e.g. finding the definitions for tables and indexes, finding out whether a user already exists, etc. There's not enough standardization on how you do this across databases, and the way you do it is almost always not well documented either.
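        As a small illustration of how database-specific these "interrogation" queries are (using SQLite here only because it ships with Python; it is not one of the candidate databases): SQLite exposes schema through its `sqlite_master` catalog, whereas PostgreSQL uses `pg_catalog`, Derby uses `SYS.SYSTABLES`, and HSQLDB uses `INFORMATION_SCHEMA` - all different.

        ```python
        import sqlite3

        # Each engine exposes table/index definitions through its own catalog;
        # this is SQLite's flavor, queried via its sqlite_master system table.
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")
        conn.execute("CREATE INDEX idx_status ON jobs (status)")

        # "Interrogation": discover what tables and indexes already exist.
        tables = [r[0] for r in conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table'")]
        indexes = [r[0] for r in conn.execute(
            "SELECT name FROM sqlite_master WHERE type='index' "
            "AND name NOT LIKE 'sqlite_%'")]
        print(tables)   # ['jobs']
        print(indexes)  # ['idx_status']
        ```

        An equivalent lookup against PostgreSQL or Derby needs entirely different SQL, which is Karl's point about the lack of standardization.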

        Jack Krupansky added a comment -

        I checked that H2 feature comparison table, but it did not suggest a great benefit of H2 for LCF. The footprint is a little smaller than Derby and of course a lot smaller than PostgreSQL. One area not in the table that could matter a lot is performance. Any quick thoughts on H2 performance relative to PostgreSQL and Derby?

        Karl Wright added a comment -

        MVCC is the feature that suggests greater concurrency (and, hence, greater performance).

        Karl Wright added a comment -

        The current status of this feature request is as follows:

        • There is a Quick Start, which can run either with Derby or with PostgreSQL, depending on what you put in connectors.xml.
        • We've done a lot of fixes to the Derby support, although there are two outstanding tickets that need to be resolved before it can be considered "real". See CONNECTORS-100 and CONNECTORS-110.
        • You can run the Quick Start with embedded Derby in a mode that allows external connections. "java -Dderby.drda.startNetworkServer=true -jar start.jar". If you want a truly multiprocess ManifoldCF running, you will also need to set up file-based synchronization, as described in how-to-build-and-deploy.html.
        • Instructions for how to run Quick Start with PostgreSQL are in the FAQ.
        • I coded an HSQLDB implementation, but that's stalled because HSQLDB internally deadlocks. See CONNECTORS-114.
        • Nobody has looked at H2 in any depth yet. Not clear if we still need to.
        Karl Wright added a comment -

        Correction: QuickStart with Derby cannot be run in multiprocess mode, because you cannot currently configure ManifoldCF to use the Derby client JDBC driver. See CONNECTORS-178.

        Karl Wright added a comment -

        HSQLDB support has been added, which works a lot better than Derby does, so I think we've finally hit the necessary criteria for closing out this ticket.


          People

          • Assignee:
            Karl Wright
            Reporter:
            Jack Krupansky
          • Votes:
            0
            Watchers:
            0