WhatWentWrongErrorHandling
From SPA Wiki
WS8 - What Went Wrong?
Error Handling in a Distributed System
Andy Longshaw and Eoin Woods
Acknowledgements
Many thanks to TonyBarrettPowell for making excellent notes during the session, which became this output with minimal editing!
Good Error Handling Practices
- Create a logging hierarchy and classify each error within the hierarchy.
- Allow error filtering based on the logging hierarchy, and don't perform message-related processing (e.g. creating debug strings) if the message isn't included in the current filter.
- Detect and report errors at the "right" place (e.g. validating input).
- Log details where the error occurs, then convert it to a more general error or concept for propagation back to the caller.
- Create a repository of all known errors.
- Rotate developers into support for a time.
- Separate error detection from error logging from error handling.
- Report the operational context with the error information.
- Log a unique operation ID with error context (to allow pieces of error information to be strung together).
- Only intercept and handle an error if you can do something useful with it.
- Explore failure scenarios with users to find out how important different errors are to them.
- Classify each error so that errors can be treated in groups.
- When errors occur, continue with reduced functionality if possible.
Bad Error Handling Practices
- Shutting down the system on error.
- No standards (for error handling).
- Propagating errors that should be handled locally.
- Exposing technical detail to end users.
- Wrong use of unchecked exceptions.
- Non-errors reported as errors.
- Reporting "normal" business errors as exceptions.
- Attempting to retro-fit an error handling strategy.
- Vague errors that don't clearly define what happened and where.
- Inflexible timeouts causing spurious errors.
- Losing lead developers when production transition is complete.
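One way to avoid reporting "normal" business errors as exceptions is to return the business outcome as a value and reserve exceptions for genuine failures. The sketch below is a hypothetical example; the withdrawal scenario and all names in it are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WithdrawalResult:
    """Outcome of a withdrawal; a refusal is a normal result, not an exception."""
    accepted: bool
    balance: int
    reason: Optional[str] = None


class AccountStoreUnavailable(Exception):
    """A genuine failure: no business answer can be produced at all."""


def withdraw(balances, account, amount):
    if balances is None:
        # Technical failure: raise, because the caller cannot proceed.
        raise AccountStoreUnavailable("account store unreachable")
    balance = balances[account]
    if amount > balance:
        # Expected business outcome: report it in the return value.
        return WithdrawalResult(False, balance, "insufficient funds")
    balances[account] = balance - amount
    return WithdrawalResult(True, balances[account])
```

The caller checks the result for the refusal case and only needs exception handling for the case where the store itself is unavailable.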
Team 1 Framework
Context
Error handling for a large distributed system involving a large number of data sources in existing operational systems on different platforms.
Framework Structure
  Presentation ----------------> Central Log
       !                      !       ^
  Business Layer -------------+       !
       !                      !       !
  Data Access Layer ----------+       !
       !                 Log Message Interface (MQ)
  -------------------                 ^
  !     !     !     !                 !
  Sys1  Sys2  Sys3  Sys4              !    [Shared Legacy Systems]
                                      !
  ^     ^     ^     ^                 !
  LogD1 LogD2 LogD3 LogD4-------------+    [Logging Daemons]
- The system structure is on the left:
  - The Presentation layer displays the user interface and accesses the Business Layer.
  - The Business Layer provides business transaction logic and accesses the Data Access Layer.
  - The shared legacy systems are the sources of data for the business transactions, accessed via the Data Access Layer.
- The unified logging system is on the right:
  - Single central log "server".
  - Presentation, Business and Data Access layers access the log via a standard programming library.
  - A Logging Daemon is created for each legacy system; it runs on the remote machine, recognises error or status conditions of interest and sends log messages to the central log.
  - The Logging Daemons access the central log via a message-based (IBM MQ) service interface.
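The two routes into the central log described above (a direct library call from the application layers, and a daemon per legacy system sending messages) might be sketched as follows. This is an assumption-laden illustration: an in-process queue.Queue stands in for the MQ service interface, and the LoggingDaemon and CentralLog classes and their methods are hypothetical names.

```python
import queue
import re

# Stand-in for the MQ-based log message interface; a real deployment would
# use a message-queue client here instead of an in-process queue.
log_queue = queue.Queue()


class LoggingDaemon:
    """Watches one legacy system's output and forwards conditions of interest."""

    def __init__(self, system_name, pattern, out_queue):
        self.system_name = system_name
        self.pattern = re.compile(pattern)
        self.out_queue = out_queue

    def scan(self, lines):
        # Forward only the lines that match the error/status pattern.
        for line in lines:
            if self.pattern.search(line):
                self.out_queue.put({"source": self.system_name, "text": line})


class CentralLog:
    def __init__(self):
        self.entries = []

    def write(self, source, text):
        # Direct library call used by the presentation, business and
        # data access layers.
        self.entries.append({"source": source, "text": text})

    def drain(self, in_queue):
        # Pull everything the daemons have queued into the central log.
        while True:
            try:
                self.entries.append(in_queue.get_nowait())
            except queue.Empty:
                break
```

For example, a daemon created as LoggingDaemon("Sys1", r"ERROR|ABEND", log_queue) would forward only matching lines, and a later CentralLog.drain(log_queue) call would merge them with entries written directly by the application layers.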
Principles
- Centralised logs (accessed via MQ messaging interface)
- Provide tools to navigate errors
- Add entries to the log at all of the levels of the system.
- A session ID and an application ID would be recorded with each entry to identify errors.
- The legacy systems must be included in the error logging (using the MQ interface and a logging monitor on the legacy system to send errors to the Central Log).
- For further analysis the legacy systems may require a service interface to enable search and filter of the error log.
- Developers working in this environment should be given a classification of what is to be logged, for example what to log in the central log.
- Each error needs accompanying information that says what should happen when it occurs, for example what an operator should do.
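As a rough sketch of two of these principles, the example below tags each log entry with a session ID and application ID, offers a small search function as a navigation tool, and looks up operator guidance by error code. The field names, error codes and guidance text are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    session_id: str
    application_id: str
    level: str      # layer or system that wrote the entry
    code: str
    message: str


# Hypothetical catalogue mapping error codes to operator instructions.
OPERATOR_ACTIONS = {
    "DB_DOWN": "Restart the database listener, then notify the on-call DBA.",
}


def find_entries(entries, session_id=None, application_id=None):
    """Return the entries matching every identifier that was supplied."""
    return [
        e for e in entries
        if (session_id is None or e.session_id == session_id)
        and (application_id is None or e.application_id == application_id)
    ]


def operator_action(entry):
    """Look up what an operator should do for this entry's error code."""
    return OPERATOR_ACTIONS.get(
        entry.code, "No documented action; escalate to support.")
```

Searching by session ID pulls together entries written at different levels of the system for the same user session.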
Team 2 Framework
Context
There is a mix of technologies; the principle should be to make error handling easy for the developers no matter what the technology is.
Principles
- Centralised Error Definitions (codes, format, classification, required supplementary information)
- Example class structure for exception:
                Exception
                    !
          ----------------------
          !                    !
     Application             System
      Exception             Exception
          !
   ---------------
   !      !      !
  ...    ...    ...    [Types of app. exception]
- Example Error class:
  // The error has a unique id obtained at construction.
  class Error {
      public Error()
      public Error(Error e)
      toString()
      getID()
      setCode(ErrorCode code)
      setParameter(name, value)
      isRetryable()
      raise()   // throws the exception for the specific error
  }

  class ErrorReport {
      // If the error is not known or does not contain appropriate
      // information this is also an error condition.
      public ErrorReport(Error e)
      setCorrelationId(Id id)
      toLog()
      toSNMP()
  }
- Abstract errors as they pass to the next level of the system.
- Retry information can be included in the error.
- Layers can modify the retry information as required.
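The last three principles can be illustrated with a short sketch in which a business layer abstracts a data access error and adjusts its retry information. It is loosely modelled on the example Error class but is not the team's design; AppError, the error codes and the attempts_left parameter are assumptions made for the example.

```python
import uuid


class AppError(Exception):
    """Error with a unique ID, a code, named parameters and retry information."""

    def __init__(self, code, retryable=False, cause=None):
        super().__init__(code)
        self.id = uuid.uuid4().hex   # unique per error instance
        self.code = code
        self.retryable = retryable
        self.parameters = {}
        self.cause = cause

    def set_parameter(self, name, value):
        self.parameters[name] = value
        return self


def read_customer(customer_id):
    # Data access layer: a technical failure that could be retried.
    raise AppError("DB_TIMEOUT", retryable=True).set_parameter("table", "customer")


def load_profile(customer_id, attempts_left):
    # Business layer: abstract the error into this layer's vocabulary and
    # adjust the retry information using what this layer knows.
    try:
        return read_customer(customer_id)
    except AppError as low:
        high = AppError(
            "PROFILE_UNAVAILABLE",
            retryable=low.retryable and attempts_left > 0,
            cause=low,
        )
        high.set_parameter("customer_id", customer_id)
        raise high from low
```

The caller sees only the business-level code, while the original technical error stays attached as the cause for logging and diagnosis.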
Team 3 Framework
Context
Definition of a framework of principles for error handling in large distributed information systems.
Principles
- Create a set of central error definitions.
- Handle errors at system boundaries.
- Classify errors to allow them to be processed appropriately.
- Log errors at boundaries of the system's levels.
- Enrich error information as the error propagates between the different levels of the system.
- Ensure that each instance of an error has a unique identifier.
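A minimal sketch of how these principles could fit together is shown below: error codes are declared once with a classification, each error instance gets a unique identifier, and every level boundary logs the error and adds its own context. The codes, categories, and the TrackedError and boundary names are hypothetical.

```python
import uuid
from enum import Enum


class Category(Enum):
    TRANSIENT = "transient"
    PERMANENT = "permanent"


# Hypothetical central definitions: every code is declared once, with its class.
ERROR_DEFINITIONS = {
    "NET_TIMEOUT": Category.TRANSIENT,
    "BAD_RECORD": Category.PERMANENT,
}


class TrackedError(Exception):
    def __init__(self, code):
        if code not in ERROR_DEFINITIONS:
            raise ValueError(f"undefined error code: {code}")
        super().__init__(code)
        self.instance_id = uuid.uuid4().hex   # unique per error instance
        self.code = code
        self.category = ERROR_DEFINITIONS[code]
        self.context = []                     # enriched at each boundary


boundary_log = []


def boundary(level_name, func, *args):
    """Call func; at this level's boundary, log and enrich any TrackedError."""
    try:
        return func(*args)
    except TrackedError as err:
        err.context.append(level_name)
        boundary_log.append((level_name, err.instance_id, err.code))
        raise


def fetch(record_id):
    raise TrackedError("NET_TIMEOUT")


def process(record_id):
    return boundary("data-access", fetch, record_id)
```

Calling boundary("business", process, 7) would leave two boundary log entries that share one instance identifier, and the error that reaches the caller carries the context of both levels it crossed.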
Back to OtTwoThousandAndFourOutput.