EffectiveErrorHandling

From SPA Wiki

This session explored the roles and forces that bear on error handling, and how to develop an error handling policy for a system under development. The intent was to let the delegates explore a domain problem and identify the players and their primary concerns, the failure points, and how those failures should be handled and communicated to the players.

This session was created as a consequence of a BoF held at SPA2006.


In the first exercise the group split into teams, and each team member described a domain or a particular project. Each team then decided which domain or project to use for the remaining exercises. There were three teams in the session, and the three domains were:

  • An intranet/extranet information system (Team 1)
  • An internal banking system (Team 2)
  • An internet advertising system (Team 3)

The second exercise was to identify the players in the system and their concerns. The outputs from the teams were:

Team 1:

  End Users
  Operations:
     Operators
     Development
     Security

Team 2:

  End Users
  System Admin
  Bad Guys
  Tech. Support
  Testers
  Management

Team 3:

  The Punter (end user)
  Client (the supplier of the adverts)
  Support Engineer

The next step was to consider the failure modes of the system. The outputs from the teams were:

Team 1:

A table for error classification was created, but the domain specific errors were not added to the table.

                      Validation   Recoverable   Non Recoverable
  Validation          _            _             _
  Availability        _            _             _
  Component Failure   _            _             _
  Security            _            _             _
  Data Integrity      _            _             _
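One way to picture Team 1's empty grid is as a lookup keyed by row and column. The sketch below is only an illustration: the names `Category`, `Outcome`, `ClassifiedError` and `build_grid` are hypothetical, and the example entry is invented, since the team did not fill in any domain-specific errors.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    """Rows of the grid: the kind of failure."""
    VALIDATION = "Validation"
    AVAILABILITY = "Availability"
    COMPONENT_FAILURE = "Component Failure"
    SECURITY = "Security"
    DATA_INTEGRITY = "Data Integrity"


class Outcome(Enum):
    """Columns of the grid, using the headings from the team's table."""
    VALIDATION = "Validation"
    RECOVERABLE = "Recoverable"
    NON_RECOVERABLE = "Non Recoverable"


@dataclass(frozen=True)
class ClassifiedError:
    category: Category
    outcome: Outcome
    description: str


def build_grid(errors):
    """Group errors into a {(category, outcome): [descriptions]} grid."""
    grid = {(c, o): [] for c in Category for o in Outcome}
    for err in errors:
        grid[(err.category, err.outcome)].append(err.description)
    return grid


# Hypothetical entry, not taken from the session output.
example = ClassifiedError(Category.AVAILABILITY, Outcome.RECOVERABLE,
                          "search index temporarily unreachable")
grid = build_grid([example])
```

Every cell exists even when empty, which makes gaps in the classification visible when the grid is reviewed.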

Team 2:

  Database Errors
  Hardware Errors
  Users' Errors
  Privileges
  App. Errors

Team 3:

  All nodes & connections are potential points of failure
  Errors in data, service or user experience
  Errors may be from:
     - Network faults
     - Hardware (e.g. server) problems
     - Software defects
  Errors may be...
     - Transient
     - Recurrent/persistent
     - Requirement constraints:
        - Keep serving, whatever
        - Timeliness
  And some system-specific failures:

     Inside Data Centre                                                       Outside Data Centre
     Lose connection to database                                              Loss of XML feed from clients
     Web server fails                                                         Loss of all XML feeds
     DB goes down                                                             XML corrupt
     Builder (the system that builds the adverts from live data) goes down    DNS failure
     Server slow                                                              DOS attack
     N/W failure                                                              CDN failure

The next exercise was to consider the strategies for coping with the errors and the guidance given to development. The outputs from the teams were:

Team 1:

  Team 1 considered the things they would include in the policy:
     - Coding Policy - (Includes Log everything relevant)
     - Implement test for error framework
     - Framework Implementation
  And they also considered how to communicate failures:
     - Screen Message
     - Log Message
     - Email 
     - SMS
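Team 1's four communication channels could be tied to the players identified earlier through a routing table. The following is a minimal sketch under stated assumptions: the mapping in `ROUTING`, the audience keys and the `notify` function are all hypothetical, and delivery is delegated to a caller-supplied `send` callable so that no real mail or SMS gateway is implied.

```python
from enum import Enum


class Channel(Enum):
    SCREEN = "screen message"
    LOG = "log message"
    EMAIL = "email"
    SMS = "SMS"


# Hypothetical routing table: which players hear about a failure, and how.
ROUTING = {
    "end_users": [Channel.SCREEN],
    "operators": [Channel.LOG, Channel.SMS],
    "development": [Channel.LOG, Channel.EMAIL],
    "security": [Channel.LOG, Channel.EMAIL],
}


def notify(audiences, message, send):
    """Send message once on each channel configured for the given audiences.

    `send(channel, message)` is provided by the caller. Returns the channels
    used, in the order they were first needed.
    """
    used = []
    for audience in audiences:
        for channel in ROUTING.get(audience, []):
            if channel not in used:
                send(channel, message)
                used.append(channel)
    return used
```

For example, `notify(["operators", "development"], "database unavailable", send)` would call `send` for the log, SMS and email channels, each exactly once.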

Team 2: The team ran out of time for this exercise, so they suggested a humorous solution in the event of an error:

  For the end user:
     /* Adjacent string literals are concatenated by the C compiler. */
     fprintf(webout, "<p>"
        "Connection to Internet lost. "
        "Please contact your service provider."
        "</p>");
  And for operators of the site:
     fprintf(stderr, "segmentation fault. core dumped.");
     exit(1);

Team 3:

  Strategies for coping:
     - Health Monitoring:
        - of servers
        - of connections
        - centralised?
        - Link checking
        - alerts (SMS)
     - Redundancy
     - Load balancing
     - Disaster Recovery plan
     - Paranoia at boundaries
     - Prevention (reduction) of s/w defects through inspection and testing
  Scenarios
     - Data feed goes down...
        - continue serving adverts
        - until some time limit
     - Builder goes down...
        - tell support (quickly!)
     - Black Hat injects wrong data...
        - Monitor raises alert
     - Advert server fails...
        - make it fail fast (and start fast?)
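Team 3's "data feed goes down" scenario (continue serving adverts, until some time limit) can be sketched as a cache of the last good feed. This is only an illustration: the class `FeedCache`, its parameters and the injected `fetch` and `clock` callables are hypothetical names, not part of the team's output.

```python
import time


class FeedCache:
    """Keeps the last good advert feed and serves it while the feed is down."""

    def __init__(self, fetch, max_stale_seconds, clock=time.monotonic):
        self._fetch = fetch                # callable returning fresh adverts; may raise
        self._max_stale = max_stale_seconds
        self._clock = clock                # injectable so the time limit can be tested
        self._adverts = None
        self._fetched_at = None

    def get(self):
        try:
            self._adverts = self._fetch()
            self._fetched_at = self._clock()
            return self._adverts
        except Exception:
            # Feed is down. With no earlier copy there is nothing to serve.
            if self._fetched_at is None:
                raise
            # Past the time limit, fail rather than serve very old adverts.
            if self._clock() - self._fetched_at > self._max_stale:
                raise
            return self._adverts
```

Once the limit is exceeded the original exception propagates, which fits the team's "fail fast" preference: the caller learns about the failure immediately instead of silently serving outdated data.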

During the session, Charles and Tony put together some notes to guide the group work as the session progressed. These were refined into a mini-process for forming an error handling policy:

Error Handling

  Identify
     - Primary Concerns (“policies” or “forces”)
     - Players and needs
     - Failure Points & Error Categories
  Define
     - Communication - Logs/IPC
     - Error Content 
        - Reported
        - Diagnostic modes
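The split between reported content and diagnostic modes in the "Define" step might look like the sketch below. It assumes a hypothetical `AppError` class and `report` function; the short reference code linking the screen message to the log entry is an added assumption, not something recorded in the session notes.

```python
import logging
import uuid

logger = logging.getLogger("app.errors")


class AppError(Exception):
    """Carries a user-safe message and separate diagnostic detail."""

    def __init__(self, user_message, diagnostic, context=None):
        super().__init__(user_message)
        self.user_message = user_message
        self.diagnostic = diagnostic
        self.context = dict(context or {})
        # Short reference so support can match a screen message to a log line.
        self.reference = uuid.uuid4().hex[:8]


def report(error, diagnostic_mode=False):
    """Log the full detail and return the text to show to the player."""
    logger.error("ref=%s %s context=%s",
                 error.reference, error.diagnostic, error.context)
    if diagnostic_mode:
        # Diagnostic mode exposes the internal detail, e.g. for support staff.
        return f"{error.user_message} [{error.reference}] {error.diagnostic}"
    return f"{error.user_message} (reference {error.reference})"
```

In normal mode the end user sees only the safe message and the reference, while the log always receives the full diagnostic detail.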

It was interesting to note how useful this mini-process was in allowing the teams to develop a good feel for the scope of the error handling needs of the systems and to propose some policies. We also noticed the influence this process would have on the architecture of the system, as some concerns and the associated failure modes required architectural solutions rather than a coding policy.


Here are some shots of the teams in action during the session:

[Photos: EffectiveErrorHandling1.JPG, EffectiveErrorHandling2.JPG, EffectiveErrorHandling3.JPG]