EffectiveErrorHandling
This session explored the roles and forces on error handling and how to develop an error handling policy for system development. The intent of the session was to allow the delegates to explore a domain problem and to identify the players and their primary concerns, the failure points, and how these failure points should be handled and communicated to the players.
This session was created as a consequence of a BoF held at SPA2006.
The first exercise split the group into teams and asked each team member to describe a domain or particular project. Each team then decided which domain or project to base the remaining exercises on. There were 3 teams in the session, and the 3 domains were:
- An intranet/extranet information system (Team 1)
- An internal banking system (Team 2)
- An internet advertising system (Team 3)
The second exercise was to identify the players in the system and their concerns. The outputs from the teams were:
Team 1:
- End Users
- Operations: Operators
- Development
- Security
Team 2:
- End Users
- System Admin
- Bad Guys
- Tech. Support
- Testers
- Management
Team 3:
- The Punter (end user)
- Client (the supplier of the adverts)
- Support Engineer
The next step was to consider the failure modes of the system. The outputs from the teams were:
Team 1:
A table for error classification was created, but the domain specific errors were not added to the table.
Error category | Validation | Recoverable | Non Recoverable
Validation | _ | _ | _
Availability | _ | _ | _
Component Failure | _ | _ | _
Security | _ | _ | _
Data Integrity | _ | _ | _
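One way a development team might carry a classification like this into code is to tag every error with its row and column from the table. The sketch below is illustrative only: `Category`, `Severity`, `ClassifiedError` and `is_retry_candidate` are hypothetical names, not something Team 1 produced, and the retry rule is an assumption.

```python
from enum import Enum


class Category(Enum):
    # Row headings from Team 1's table.
    VALIDATION = "Validation"
    AVAILABILITY = "Availability"
    COMPONENT_FAILURE = "Component Failure"
    SECURITY = "Security"
    DATA_INTEGRITY = "Data Integrity"


class Severity(Enum):
    # Column headings from Team 1's table.
    VALIDATION = "Validation"
    RECOVERABLE = "Recoverable"
    NON_RECOVERABLE = "Non Recoverable"


class ClassifiedError(Exception):
    """Hypothetical exception that records one cell of the table."""

    def __init__(self, message, category, severity):
        super().__init__(message)
        self.category = category
        self.severity = severity


def is_retry_candidate(error):
    # Assumption: only errors marked recoverable are retried automatically.
    return error.severity is Severity.RECOVERABLE
```

A caller could then raise `ClassifiedError("database timeout", Category.AVAILABILITY, Severity.RECOVERABLE)` and let a single handler decide whether to retry or escalate.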
Team 2:
- Database Errors
- Hardware Errors
- Users' Errors
- Privileges
- App. Errors
Team 3:
- All nodes & connections are potential points of failure
- Errors, in data, service or user experience
- Error may be from:
  - Network faults
  - Hardware (i.e. Server) problem
  - Software defects
- Errors may be...
  - Transient
  - Recurrent/persistent
- Requirement constraints:
  - Keep serving, whatever
  - Timeliness
And some system specific failures
Inside Data Centre | Outside Data Centre
Lose connection to database | Loss of XML feed from clients
Web server fails | Loss of all XML feeds
DB goes down | XML corrupt
Builder (the system that builds the adverts from live data) goes down | DNS failure
Server slow | DOS attack
N/W failure | CDN failure
The next exercise was to consider the strategies for coping with the errors and the guidance given to development. The outputs from the teams were:
Team 1:
Team 1 considered the things they would include in the policy:
- Coding Policy
  - (Includes Log everything relevant)
- Implement test for error framework
- Framework Implementation
And they also considered how to communicate failures:
- Screen Message
- Log Message
- Email
- SMS
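As a rough illustration of how these channels might be tied to the players Team 1 identified, the sketch below plans who is told about a failure and how. The routing table `ROUTES` and the function `plan_notifications` are hypothetical; the team did not record which player should receive which channel, so the mapping here is purely an assumption.

```python
from enum import Enum


class Channel(Enum):
    SCREEN = "Screen Message"
    LOG = "Log Message"
    EMAIL = "Email"
    SMS = "SMS"


# Hypothetical routing table: the channel used to reach each player.
ROUTES = {
    "End Users": [Channel.SCREEN],
    "Operations": [Channel.SMS],
    "Development": [Channel.EMAIL],
    "Security": [Channel.EMAIL],
}


def plan_notifications(summary, players):
    """Return (player, channel, text) tuples describing what to send."""
    # Following "Log everything relevant", a log entry is always planned.
    plan = [(None, Channel.LOG, summary)]
    for player in players:
        # Players without a route are only covered by the log entry.
        for channel in ROUTES.get(player, []):
            plan.append((player, channel, summary))
    return plan
```

Keeping the plan separate from the code that actually sends messages makes the policy easy to test without a mail server or SMS gateway.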
Team 2:
The team ran out of time for this exercise, so suggested a humorous solution in the event of an error:
For the end user:
fprintf(webout, "<p>" "Connection to Internet lost. " "Please contact your service provider." "</p>");
And for operators of the site:
fprintf(stderr, "segmentation fault. core dumped."); exit(1);
Team 3:
Strategies for coping:
- Health Monitoring:
  - of servers
  - of connections
  - centralised?
- Link checking
- alerts (SMS)
- Redundancy
- Load balancing
- Disaster Recovery plan
- Paranoia at boundaries
- Prevention (reduction) of s/w defects through inspection and testing
Scenarios
- Data feed goes down...
  - continue serving adverts
  - until some time limit
- Builder goes down...
  - tell support (quickly!)
- Black Hat injects wrong data...
  - Monitor raises alert
- Advert server fails...
  - make it fail fast (and start fast?)
During the session Charles and Tony put together some notes to guide the group work as the session progressed. These were refined into a mini-process for forming an error handling policy:
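The "data feed goes down" scenario (keep serving adverts, but only until some time limit) can be sketched as a cache of the last good feed that refuses to serve once the data is too old. This is a minimal sketch, not Team 3's design: `FeedCache`, `max_stale_seconds` and the injectable `clock` are hypothetical, and the actual time limit was not recorded.

```python
import time


class FeedCache:
    """Hypothetical cache holding the last successfully loaded advert feed."""

    def __init__(self, max_stale_seconds, clock=time.monotonic):
        self.max_stale_seconds = max_stale_seconds
        self._clock = clock  # injectable so the time limit can be tested
        self._adverts = None
        self._fetched_at = None

    def update(self, adverts):
        # Called whenever a client feed is read successfully.
        self._adverts = list(adverts)
        self._fetched_at = self._clock()

    def adverts(self):
        # Keep serving the last good data while the feed is down...
        if self._fetched_at is None:
            raise RuntimeError("no feed has ever been loaded")
        age = self._clock() - self._fetched_at
        # ...but only until the agreed time limit has passed.
        if age > self.max_stale_seconds:
            raise RuntimeError("feed data is older than the time limit")
        return list(self._adverts)
```

Raising once the limit is exceeded gives the health monitor a clear signal to alert support, instead of stale adverts being served indefinitely.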
Error Handling
Identify
- Primary Concerns ("policies" or "forces")
- Players and needs
- Failure Points & Error Categories
Define
- Communication
  - Logs/IPC
- Error Content
  - Reported
  - Diagnostic modes
It was interesting to note how useful this mini-process was in allowing the teams to develop a good feel for the scope of the error handling needs of the systems and to propose some policies. We also noticed the influence this process would have on the architecture of the system, as some concerns and the associated failure modes required architectural solutions rather than a coding policy.