Skip to end of metadata
Go to start of metadata

Overview

For the purposes of this document, an embedded database is a database that can be included in a software system and and does not have to be separately installed. These databases are generally lightweight, simple, and easy to administer, troubleshoot, and maintain. In some cases, the database does not even need access to limited system resources like open ports.

Rice has always suffered from being complex with many working parts. This causes a barrier for entry when users are evaluating the software. An embedded database could help make these evaluations easier. For integration testing, an embedded database can be used so that an external database does not need to be installed and configured. This makes running integration tests more controlled as easier to run in general. For users wanting to create a rice based project using rice's provided archetypes an embedded database makes this process much easier by removing the variability of an external database configuration. Finally, embedded support in rice will add potential benefit for all Kuali projects that choose to support the embedded database through rice. Currently this is not possible because of rice's lack of support.

Requirements

  • Add support for an embedded database in rice.
  • Identify and fix all rice code (potentially anything with hand written sql) that is incompatible to selected embedded database.
  • Add support for an embedded database in impex.
  • Add support for an embedded database in ojb.
  • Modify the rice archetypes to use the embedded database by default.
  • Modify the rice integration tests to use the embedded database by default.
  • Get rice all integration tests to pass
  • Work with Dev ops to add a new ci integration test configuration using embedded database against trunk
  • Move all non trunk ci integration test configurations to use embedded database rather than mysql (at the discretion of rice team & dev ops)

Approach

My approach for this work will be the path of least resistance. That is, I will use which ever open source embedded database is the easiest to meet the stated requirements. I don't see any compelling reasons to prefer one embedded database over another. To put this another way, all embedded databases are on equal ground at the beginning of this analysis.

Code

Analysis and Development

I did some research on cross-platform embedded databases with native java support and came up with three viable options (derby, hsql (formerly hypersonic), and H2). Note that H2 was created by the person behind HSQL and is meant to be the spiritual successor.  These databases all support sequences, views, users, roles, basic datatypes, etc. All the things needed for rice. Their documentation is all pretty good. They are all active projects with compatible open source licenses.  I have excluded the McKoi database because it does not look well maintained.  It has only had one release between 2004 and late 2012 (the time of this writing).  I also looked to see if there was an embedded version of MySQL or a MySQL variant.  No such solution exists (with native Java support).  MySQL used to have something called MX/J but it was discontinued.

I looked at the Kuali impex project (torque) and found that hsql and derby were at one time supported. There were some references to hsql (hypersonic) and derby. The support was removed for some reason. Then I went to the torque website to do a bit of research. I found the following: http://db.apache.org/torque/version-specific/supported-databases.html. This was promising because I may be able to retrieve the removed support from the torque codebase. I also noticed that they are still supported in torque 4: http://db.apache.org/torque/releases/torque-4.0/documentation/other/supported-databases.html. The H2 database is not supported in torque 3 or torque 4. Since torque is not part of Kuali's long term stategy, this does not exclude H2 as a viable option.

I started by putting derby and hsql support back into Kuali impex for the offical torque codebase. Since we had forked torque so much over the years, I had to do a bit of work to get this working.

Both derby and hsql seemed to work fine with the impex tool as I was able to integrate them into the rice build and generate sql scripts. The biggest problem I had with both derby and hsql was a lack of consistent features. The rice build requires that a user & schema is deleted when doing a maven clean and created when doing a maven install. This requires that the databases support either scripting (like PL/SQL) or "drop if exists" and "create if not exists". Without these features doing a clean when the user or schema is not present would cause an error. Doing an install when the user or schema is present would cause an error. These errors could be worked around but it would be easier to look for an embedded database that supports these features.

Update: Enter H2

With H2 I have to manually create support in the impex project from the ground up. This is not too much effort because impex is a simple codebase. H2 supports "drop if exists" and "create if not exists" which is a perfect fit for the rice build. So far, I have impex generating DDL & DML scripts and executing them into a running H2 server.

Update: Next I need to add H2 support into OJB and rice.  This was a fairly simple exercise with a small amount of code and config changes.  

Update: Now to startup rice...

Update: Rice startup is "almost" running against H2 but blows up inside the quartz library.  This leads me to my first problem as a result of the The Lossy Database Process.

Update: Rice is now starting up. The integration tests can run against H2. There appears to be an ordering issue when clearing out tables. Also, SYSDATE is not being translated correctly I think at the impex layer. Perhaps SYSDATE is being set to the default value? Not sure. Gotta look into it.

Update: The SYSDATE problem was related to SYSDATE not being valid in H2. I had to make a few small modifications to impex. This is now working.

Update: The integration tests now run but have some DB errors. Now comes the tedious task of rooting out all these issues.

Update: Some rice integration tests now run successfully under H2. The biggest hurdle so far has been getting the ojb platform correct. OJB sends unsupported java types into the statement.setObject method. These unsupported types must be converted. We need to review all these conversions for correctness. We should also review the sql types impex is generating for correctness.

Update: 95% of all integration tests pass. This is good enough to move on to figuring out how to startup and load the database automatically.

Update: have the database working in embedded mode as opposed to server mode. I'm storing the db files in ${user.home}/kuali/db/h2/${impex.database} to make it easy to configure the connection in an external file. I'd rather put the db files in ${basedir}/target/db/h2/${impex.database} but basedir is only available during maven execution. Since the database file location is specified in the connection url this can always be overriden. Finally got around to committing changes to foundation code ojb, impex, impex plugin. Anyone interested, take a look.

The Lossy Database Process

The rice team uses a physical database (oracle) to store schema and database information.  They then point impex at this database and have it generate xml documents to hold the data and schema info.  These xml documents then are used by the impex tool to generate DDL and DML scripts for specific databases.  The problem is oracle does not support many features present in other databases and JDBC.  So when going from a physical database to another physical database information about a schema will be lost.  I believe it is fundamentally flawed to attempt to do this in the first place.  The rice master database should be represented as source code checked in.  Then that source code is used to generate a physical database (not the other way around).  In this way features can be expressed in the database source code and translated to the equivalent features in the physical database.  This is the way we will likely do things once on liquibase.

A real example

Oracle and MySQL do not support Boolean Types. To get around this the rice team uses VARCHAR(1) columns. You could also use numeric types (BIT, NUMBER). Derby, H2, HSQL all support boolean types. When you have a schema that has a boolean column but that schema is generated from a physical oracle database you loose that information and the column shows as VARCHAR(1). This might not be a problem except what happens when setBoolean is called on a JDBC statement? In the case of MySQL and Oracle a 1 or 0 is created. In the case of H2, Derby, or HSQL some variant of the word "TRUE" or "FALSE" is created which obviously does not fit in a VARCHAR(1). Normally, we could get around this by creating our own DAOs for a specific database support. This is not possible because it is happening inside of the Quartz library. When you look at the Quartz documentation they expect column types for some columns to be actual Boolean types where supported. So the question is: How can we create schemas that quartz wants with a lossy database process? I'm not sure yet.... Still researching.

The following shows this problem in all three embedded databases I'm evaluating. The below code only works for Oracle and MySQL

and results in

A Temporary Solution

Until we change our database process and move to something like liquibase, we need to somehow communicate to the impex tool that certain columns should be created as a different type. One option I'll be exploring is using database comments associated with a column to attach metadata about a preferred or alternate type. We'll see where this idea takes me. I actually decided on a simpler solution. I added "replacement rule" support to the impex morph step used by rice's build. The downside to this approach is it has limitations in the types of rules that can be expressed. Also, it is specific to the maven plugin and is not available to user of impex through ant. The upside to this approach is it was really easy to implement. I figured this is a good trade-off since ultimately we want to move from impex.

Database Startup and Population

Most of the work to support an embedded database had a single clear cut solution with a goal of "make it work." For this subtask there are many different possible solutions. The two problems that need to be solved are: how do we startup an embedded database? how to we populated the embedded database? It might be helpful to list several usecases for this task:

Usecase 1

  1. user downloads rice war
  2. user installs rice war into application server
  3. user starts up application server
    Outcomes
  4. on server startup, the embedded database server is launched
  5. the embedded database is populated
  6. the normal rice startup continues using embedded database

Usecase 2

  1. developer checks out rice project
  2. developer runs integration test
    Outcomes
  3. on server startup, the embedded database server is launched
  4. the embedded database is populated
  5. the normal rice startup continues using embedded database

Usecase 3

  1. developer generates new project with rice archetype
  2. developer launches jetty via maven command on new project
    Outcomes
  3. on server startup, the embedded database server is launched
  4. the embedded database is populated
  5. the normal project startup continues using embedded database

Questions

Should the database continue to run after the test or application server shuts down?

If yes, then development tools will be able to connect to it. This would be useful for development and will work in the way that developers expect. The downside to this is there will be a rouge process as a side-affect of launching rice. A process that is using up a port. What happens when launching the application server again?

Does the database need to be available to external processes? forked java processes, development tools?

The answer to this is most likely yes. The reason is impex is a multiple step process which forks java processes. Also some of our unit tests fork processes and still connect to the same database. It would be nice if we could get away from this. Maybe impex isn't a big deal b/c we wouldn't be using impex to load data on startup but tests may need to be rewritten.

Do we have to use a port? Which one?

Not necessarily. We can use Server mode and that case you use a port. We can also support embedded mode but make the database persistent. This way the database does not die on jvm shutdown but you can still connect to the database with external tools. See http://www.h2database.com/html/features.html#connection_modes

Do we bundle h2 with rice by default so external drivers are not needed? (runtime dependency).

I suggest that we do this but also have a profile that excludes it to give implementers an easy way to build w/o it. The alternative is to require it to be installed in the application server.

Do we default rice, archetype, tests to use h2 by default?

I suggest that we do default everything to h2. This will make quickstarts easy

If a database is already running on startup do we "overwrite" it?

I suggest that this is togged with a config param.

Where does the dataset come from which we populate rice with?

We should be bundling the sql scripts into a jar which the various sample apps can depend on. This can then be consumed by the database startup code.

What does the actual database startup? SpringBean, ServletFilter, TestHarness, etc.

I suggest we use Spring since this should work for the sampleapps & test cases. It should be controlled by config params.

Questions and Other things

  1. How do we support many different DBs without forcing every database driver dependency onto our users? At the same time we do want users to be able to run tests, startup app, etc. without too much fuss. I would also like to keep any explicit references to db specific apis out of rice. Currently we don't have any such code references best not to start now.
  2. How do we launch the DB server. With H2 (but not all embedded DBs) you can launch the server in process in the same JVM. I don't think we want this because the JVM is forked when doing many things. Plus this style of DB looses all data by default when the JVM shuts down. Should we use: maven, testharness, init listener, spring bean, something else?
  3. Assuming we launch H2 in server mode, how to we assign the port? Do we just pick one or do something more intelligent like detecting and dynamically assigning a free one?
  • No labels

3 Comments

  1. Question: There is a small amount of code in rice related to McKoi database support. I highly doubt anyone on the planet is using this. Can it be removed? As I've stated earlier, McKoi seems to be a dead or dying database so I see no reason to keep the code around if no one is using it. Just one more thing to support.

  2. Hey Travis,

      I am sure you have seen this, but just in case wanted to give a reference to the spring embedded database support regarding your question on running the database.

    http://static.springsource.org/spring/docs/3.0.0.M4/reference/html/ch12s08.html

    Jerry

    1. Jerry,

      Thanks for pointing this out. I'm getting close to needing a solution for abstracting the embedded db (so rice doesn't directly depend on it in code) and at the same time starting & loading things automatically. This does look promising.