
Problem Statement

The current issues with the Kuali Service Bus have to do primarily with the way in which it handles fail-over, including its behavior of marking services as "dead".

There are various jira issues which have identified this problem:

Current Algorithm for Handling Failed Service Invocations

The current algorithm for handling failed service invocations lives primarily inside BusClientFailureProxy. It works as follows:

  • The process of service invocation is performed in a "do { ... } while (true);" loop
  • At the beginning of the loop, when the service is invoked via the proxy, if an exception is thrown the code checks whether it is a "service removal exception"
  • A "service removal exception" is determined as follows:
    • It is an exception of one of the following types:
      1. org.apache.commons.httpclient.ConnectTimeoutException;
      2. org.apache.commons.httpclient.ConnectionPoolTimeoutException;
      3. org.apache.commons.httpclient.NoHttpResponseException;
    • OR, it is an HttpException and has one of the following codes:
      1. 404
      2. 503
  • If it is determined that the exception is a "service removal exception" then the code will ask the RemoteResourceServiceLocator to remove the service. Removing the service does the following:
    • removes the service endpoint from the list of available client-side proxies
    • marks the service endpoint as "dead" in the service registry (see below for what this means)
  • After the service has been "removed", the proxy will go back to the RemoteResourceServiceLocator and ask it to provide another endpoint with the same name.
    • Note that the bad endpoint has been removed from the list at this point, and therefore shouldn't be returned.
  • It will then re-assign the local service reference to the new service endpoint, assuming one exists
    • if a replacement service does not exist, then a message will be logged and the original exception which triggered the service failure will be thrown up the call stack.
  • Once the local service has been reassigned it will then loop back to the top of the "do-while" loop and try invoking the new service endpoint
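
The loop described above can be sketched roughly as follows. This is a minimal illustration and not the actual BusClientFailureProxy code: Endpoint, EndpointLocator and FailoverLoop are hypothetical stand-ins, the "service removal exception" check is passed in as a predicate instead of testing the commons-httpclient types directly, and rethrowing any other exception unchanged is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for a remote service endpoint; not a KSB type.
interface Endpoint {
    String invoke(String request) throws Exception;
}

// Hypothetical stand-in for the role played by RemoteResourceServiceLocator.
class EndpointLocator {
    private final List<Endpoint> available = new ArrayList<>();
    final List<Endpoint> markedDead = new ArrayList<>();

    EndpointLocator(List<Endpoint> endpoints) {
        available.addAll(endpoints);
    }

    // Drops the endpoint from the client-side list and records it as "dead".
    void remove(Endpoint endpoint) {
        available.remove(endpoint);
        markedDead.add(endpoint);
    }

    // Returns another endpoint for the same service, or null if none is left.
    Endpoint next() {
        return available.isEmpty() ? null : available.get(0);
    }
}

class FailoverLoop {
    // Invoke, classify the failure, remove the endpoint, re-acquire, retry.
    static String invoke(Endpoint first, EndpointLocator locator,
                         Predicate<Exception> isServiceRemoval, String request) throws Exception {
        Endpoint current = first;
        do {
            try {
                return current.invoke(request);
            } catch (Exception e) {
                if (!isServiceRemoval.test(e)) {
                    throw e; // assumption: other failures propagate unchanged
                }
                locator.remove(current);
                Endpoint replacement = locator.next();
                if (replacement == null) {
                    throw e; // no replacement: rethrow the original failure
                }
                current = replacement; // local reassignment only
            }
        } while (true);
    }
}
```

Note that `current` is a local variable here, so the reassignment is lost as soon as the call returns.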

What it means for a service to be marked "dead"

Marking a service "dead" sets the value of the SVC_ALIVE column in the KRSB_SVC_DEF_T table to '0'.

Essentially, this makes the service "invisible" such that the next time that all clients update their service registries, they won't see this service and will therefore not create a client-side proxy to it.

The only way to make the service active again is to republish it. This will happen automatically whenever the application which was hosting the service comes back online. Or, if the application was online the whole time and the service was marked dead in error (see some of the scenarios discussed in the next section), the next time that application syncs its client-side services with the bus the SVC_ALIVE column will get updated back to '1'.

Problems with the current service bus implementation

  1. The BusClientFailureProxy caches a reference to a single service endpoint for a given service name. When it performs the "failover" action, it does not reassign the target service on the proxy; it only reassigns the service for the scope of the current invocation. Because of this behavior, client applications which "cache" their reference to the service will have the following issues:
    1. They will always invoke the exact same endpoint of the service, even if more than one is available (see above)
    2. If the endpoint they are bound to has problems (triggering fail-over), then their service calls will always hit the bad endpoint first before failing over.
  2. It shouldn't be a client application's decision to mark a service as "dead".
    • Just because a single client application can't access the service, it doesn't mean that other clients can't
    • A firewall is a good example of this: it doesn't make sense to "kill" a service just because one client can't reach it
  3. It's difficult (and potentially impossible) to know for sure if an application has completely left the service bus network.
    • When an application shuts down cleanly the services are all marked as dead. This means that the KSBConfigurer needs to shut down cleanly, and its shutdown is typically triggered via the closing of the Spring context.
    • However, it's often the case that an application does not shut down cleanly. In these cases, we still have entries in the service bus table for that application, but they are inaccessible. And if the application has been shut down permanently, they will always be inaccessible.

Proposal for Improvements

Short-term Fix

The immediate need is to address KULRICE-4287 as reported by the Kuali Coeus team for Rice 1.0.3.

I think this can be addressed most easily by the following:

  1. Allow for the service endpoint to be reassigned inside of the BusClientFailureProxy. This will prevent bad endpoints from getting called over and over again.
  2. Work with the KC team to identify where they are having problems specifically and why they are having to clear out their service registry table. It could be that fixing the KIM service caching issue that they noticed previously as well as addressing the above problem will be sufficient.
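
As a rough illustration of item 1, the sketch below keeps the fail-over target in a field on the proxy so that it survives across invocations. ReassigningProxy is a hypothetical name, endpoints are modelled as plain functions, and treating every RuntimeException as a fail-over trigger is a simplifying assumption; the real proxy would apply its "service removal exception" check.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.function.Function;

// Hypothetical proxy that remembers the endpoint it failed over to.
class ReassigningProxy {
    private final Deque<Function<String, String>> candidates;
    private Function<String, String> target; // survives across invocations

    ReassigningProxy(List<Function<String, String>> endpoints) {
        if (endpoints.isEmpty()) {
            throw new IllegalArgumentException("at least one endpoint is required");
        }
        this.candidates = new ArrayDeque<>(endpoints);
        this.target = candidates.poll();
    }

    synchronized String invoke(String request) {
        while (true) {
            try {
                return target.apply(request);
            } catch (RuntimeException e) {
                // Simplification: any runtime failure triggers fail-over.
                Function<String, String> replacement = candidates.poll();
                if (replacement == null) {
                    throw e; // nothing left to try: surface the failure
                }
                target = replacement; // reassign on the proxy, not just for this call
            }
        }
    }
}
```

With this shape, a client that caches the proxy hits the bad endpoint once; later calls go straight to the replacement.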

Medium-term Improvements

Rice 1.1?

  1. Remove the ability of client applications to mark a service as dead. As mentioned previously, the current implementation is a bit dangerous and heavy-handed.
  2. Keep client-side statistics on service endpoints (this could be as simple as the endpoint url, and a failure to success ratio, or could be more complex and be time and date aware). This will help to identify services and isolate the application from services that are "really" dead but were just never cleaned up.
    • Additionally, client applications could periodically report endpoint availability and uptime to the server. This would allow for alerts coming out of the Rice Standalone Server which might help to indicate that there's a service cleanup issue. Administrators could at that point go in and clean up the bad services.
  3. Re-engineer the client-side service proxies so that instead of referencing a single endpoint of a service, fail-over and load balancing are built directly into the proxy. Currently, load balancing happens at service proxy acquisition time, which doesn't help if you are keeping a reference to the service proxy. This is because the proxy, as currently implemented, is bound to a single endpoint.
    • This will also improve our ability to implement better failover.
  4. Implement the concept of "quarantining" service endpoints for some period of time before trying them again. In some cases an endpoint may be down for seconds or it may be down for hours. In the case that it's only a temporary blip or hiccup in the network, we don't want to abandon the service so quickly.
  5. Ensure that we have a comprehensive list of fail-over exceptions and error codes implemented on the BusClientFailureProxy.
    • One big question is whether error 500s from HttpException should be considered a failure scenario or not. These kinds of errors can be caused by server-side problems which may be specific to a single server, so it does seem like they need to be considered failure scenarios. Adding support for quarantining and for detecting the "best" service endpoints will go a long way toward ensuring that an error 500 resulting from a legitimate application exception (which would happen on all endpoints) doesn't completely kill all endpoints of the service. In that case, the KSB client should notice that all endpoints are returning the same error, and are therefore on equal footing in terms of their availability.
  6. Implement the ability for the KSB Service Registry to be maintained over a remote service invocation, instead of direct database access from the client application.
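
Items 2 and 4 above (client-side statistics and quarantining) could be combined in something like the following sketch. Everything here is illustrative: EndpointQuarantine is a hypothetical name, the fixed quarantine period and the failure ratio are just one possible policy, and the clock is injected only to keep the example easy to exercise.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

// Hypothetical client-side tracker; names and policy are illustrative only.
class EndpointQuarantine {
    private static final class Stats {
        int successes;
        int failures;
        long quarantinedUntilMillis;
    }

    private final Map<String, Stats> statsByUrl = new HashMap<>();
    private final long quarantineMillis;
    private final LongSupplier clock;

    EndpointQuarantine(long quarantineMillis, LongSupplier clock) {
        this.quarantineMillis = quarantineMillis;
        this.clock = clock;
    }

    synchronized void recordSuccess(String url) {
        stats(url).successes++;
    }

    // A failure counts against the endpoint and sidelines it for a while.
    synchronized void recordFailure(String url) {
        Stats s = stats(url);
        s.failures++;
        s.quarantinedUntilMillis = clock.getAsLong() + quarantineMillis;
    }

    // An endpoint becomes usable again once its quarantine period has passed.
    synchronized boolean isUsable(String url) {
        return clock.getAsLong() >= stats(url).quarantinedUntilMillis;
    }

    synchronized double failureRatio(String url) {
        Stats s = stats(url);
        int total = s.successes + s.failures;
        return total == 0 ? 0.0 : (double) s.failures / total;
    }

    private Stats stats(String url) {
        return statsByUrl.computeIfAbsent(url, k -> new Stats());
    }
}
```

The endpoint is never removed from the registry in this model; it is only skipped by this one client until the quarantine expires.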

Long-term Improvements

As per the Rice roadmap, consider migrating to another open-source ESB implementation.

Dealing with the Shutdown and Cleanup Problem

As mentioned previously, because of the nature of a large distributed service bus, it will be difficult to ensure that we always have a consistent service registry which only has working endpoints in it. If an application instance does not report to the bus that it is being removed from the service bus, then its old endpoints will persist in the registry (and with the eventual refactoring ultimately be relegated to the "do-not-use" bin).

In order to address this, we should look at adding improved administrative capabilities to the Rice Standalone Server which will allow for us to clean out all services from a single application.

Another possibility here is to require all applications that are publishing services on the bus to periodically "check in" with the standalone server. If we don't get this periodic check-in, then the service bus can decommission the services from that application and notify all of the other service bus clients of this change. However, depending on the level of support that we would like to provide to non-Java applications publishing services on the bus (which is currently not supported at all), that may or may not be a feasible approach. We could certainly make it a requirement of these non-Java applications, though; they would just have to implement it themselves.
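
The check-in idea could be tracked on the server side with something as small as the sketch below. CheckInTracker, the application ids and the timeout are all hypothetical; notifying the other bus clients after decommissioning is left out.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical server-side bookkeeping for periodic check-ins; not an existing Rice class.
class CheckInTracker {
    private final Map<String, Long> lastCheckInMillis = new TreeMap<>();
    private final long timeoutMillis;

    CheckInTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called each time an application reports that it is still alive.
    synchronized void checkIn(String applicationId, long nowMillis) {
        lastCheckInMillis.put(applicationId, nowMillis);
    }

    // Removes and returns the applications whose last check-in is older than the timeout.
    synchronized List<String> decommissionExpired(long nowMillis) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastCheckInMillis.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() > timeoutMillis) {
                expired.add(entry.getKey());
                it.remove();
            }
        }
        return expired;
    }
}
```

A scheduled job on the standalone server could call `decommissionExpired` and then mark the returned applications' services as unavailable.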

Diagram of a proposed refactoring of the KSB

[Diagram: ServiceBusArchitecture]

Notes on this diagram:

  1. In this picture, what I'm calling "Service Bus" is basically meant to encapsulate a lot of the logic which is currently RemotedServiceRegistryImpl and RemoteResourceServiceLocatorImpl.
    • Right now those two classes are very difficult to understand and there is a large blurring of responsibility between the two
    • The idea here was to create a single client which manages the client application's connection to the service bus and is responsible for maintaining and keeping up to date its list of published services and remote service proxies
  2. A possible API for this Service Bus piece might look something like the following:
  3. While "Service Bus" is shown as a single piece in the diagram, behind the scenes it would likely be made up of a few pieces:
    • A component which manages the client-side state of the central service registry as well as managing remote proxies to services
    • A component which manages services published on the client-side and ensures that the service registry is kept up to date with the state of the client application's published services
    • A component which manages the state of services from the client perspective. It would handle quarantining services, were that necessary, as well as ensuring that client application service proxies obtained through "getService" are kept up to date with the latest available endpoints for a given service.
  4. In the diagram, Service Registry represents the API to manage and query for state about the central Service Registry. It does not know or care about how the client application acquires endpoints to or invokes those services (that's the job of the Service Bus piece in the diagram).
  5. The "Remote Resource Loader" in the picture really just becomes a thin wrapper around the "Service Bus" piece.
  6. One outstanding question is how asynchronous service proxies are acquired. Currently, it's handled through the MessageHelper interface, which is a bit confusing. Perhaps better would be to add a getAsynchronousService method to the ServiceBus interface as it was designed above.
  7. The diagram above shows connections to the registry as "remote or embedded". In order to achieve the desired level of version compatibility though, this really needs to only support remote access and not embedded access.
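
To make the notes above more concrete, here is a purely hypothetical shape for such a client-side interface. It is not the API the notes refer to: only the names getService and getAsynchronousService come from the text, while publishService, the signatures and InMemoryServiceBus are assumptions made for illustration. The in-memory implementation has no registry or remoting behind it, and its asynchronous proxy simply hands the call to an Executor and returns null, so it only suits methods that return void.

```java
import java.lang.reflect.Proxy;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;

// Hypothetical interface; signatures are assumptions, not the documented KSB API.
interface ServiceBus {
    <T> void publishService(String name, Class<T> type, T implementation);
    <T> T getService(String name, Class<T> type);
    <T> T getAsynchronousService(String name, Class<T> type);
}

// Toy implementation backed by a local map instead of a service registry.
class InMemoryServiceBus implements ServiceBus {
    private final Map<String, Object> services = new ConcurrentHashMap<>();
    private final Executor executor;

    InMemoryServiceBus(Executor executor) {
        this.executor = executor;
    }

    public <T> void publishService(String name, Class<T> type, T implementation) {
        services.put(name, implementation);
    }

    public <T> T getService(String name, Class<T> type) {
        return type.cast(services.get(name));
    }

    // Returns a fire-and-forget proxy: each call is queued on the executor.
    public <T> T getAsynchronousService(String name, Class<T> type) {
        T target = getService(name, type);
        Object proxy = Proxy.newProxyInstance(type.getClassLoader(), new Class<?>[] {type},
            (p, method, args) -> {
                executor.execute(() -> {
                    try {
                        method.invoke(target, args);
                    } catch (ReflectiveOperationException e) {
                        throw new IllegalStateException(e);
                    }
                });
                return null; // only valid for void methods
            });
        return type.cast(proxy);
    }
}
```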

Image of Whiteboard from 04-27-2011 in New Orleans
