SPA Conference session: Understanding MapReduce with Hadoop

One-line description: Learning how to use the MapReduce programming model for analyzing large datasets with the Apache Hadoop framework.
 
Session format: Tutorial (75 mins)
 
Abstract: Today's applications generate data faster than we can understand it, so tools for processing, aggregating, and analyzing large volumes of data are vital if we are to reach that understanding. MapReduce and Hadoop are two new tools for this purpose. MapReduce is a parallel programming model devised at Google for efficiently processing large amounts of data, and Hadoop is an Apache open-source framework for running MapReduce programs.

In this session we will look at why processing very large datasets is difficult with current tools, and how MapReduce and Hadoop help. The focus of the session is to understand the constraints that the MapReduce programming model imposes on writing parallel programs, and how those same constraints provide a useful way to look at many data processing problems. To develop this understanding, a few basic MapReduce worked examples will be given and demonstrated on a running Hadoop system. The group will then be invited to work in pairs to write a MapReduce program that solves a data processing problem.
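
As a rough illustration of those constraints, here is a minimal in-memory sketch of word count, one of the worked examples planned for the session. It is plain Java and deliberately does not use the Hadoop API; the class name `WordCountSketch` and the `run` driver are hypothetical stand-ins for what the framework would do. The map function sees only one input record at a time, and the reduce function sees only one key together with all of its values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory sketch of the MapReduce model. This is NOT the Hadoop API;
// all names here are hypothetical and chosen for illustration only.
public class WordCountSketch {

    // Map: sees one input line in isolation and emits (word, 1) pairs.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // Reduce: sees one key with all of its values and emits a single total.
    public static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: stands in for the framework by grouping map output by key
    // (the "shuffle") and then calling reduce once per key.
    public static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints {cat=1, dog=1, sat=2, the=2}
        System.out.println(run(List.of("the cat sat", "the dog sat")));
    }
}
```

Because neither function shares state with other calls, a framework is free to run many map and reduce calls in parallel on different machines; the sketch above simply runs them one after another.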
 
Audience background: This session is aimed at programmers who are familiar with Java, including generics. No previous experience of parallel programming or concurrency is required.
 
Benefits of participating: Participants will learn a new programming model (MapReduce) and how to go about expressing problems in a form suitable for this model. They will also see how MapReduce programs can be run for real using Hadoop.
 
Materials provided:
- Slides for presentations
- Worksheets with data processing problems for participants
 
Process: The first part of the session will be a presentation-led discussion using slides and a demonstration of running code. For the second part of the session, the group will work in pairs to write a MapReduce program to solve a data processing problem. Due to time constraints, participants will use pen and paper to express their programs. Finally, two pairs will briefly describe their programs to the group, along with some of the challenges they met while working on them.
 
Detailed timetable:
00:00 - 00:15 Introduction to MapReduce and Hadoop
00:15 - 00:25 Some simple MapReduce programs: grep, sort, word count
00:25 - 00:35 A demonstration of running a MapReduce program using Hadoop
00:35 - 01:05 Paired working on a MapReduce program for a data processing problem
01:05 - 01:15 Presentation of solutions (5 minutes each)
 
Outputs: The data processing problems, and the programs written by participants to solve them, will be captured for display on a poster at the conference and posted on the wiki afterwards.
 
History: None
 
Presenters
1. Tom White
Lexeme Ltd.