SPA Conference session: Top-K (in Haskell)

One-line description:Look at and implement probabilistic data structures for real-time top-k problems.
Session format: 150min workshop [read about the different session types]
Abstract:Suppose you'd like to have real-time information on the top 10 discussions on your chat site, or the top 5 pictures on your gallery site.

One option would be to keep a count of all the comments on discussions, or views on pictures. In most situations this would work great, but if in a single day you could expect millions of different pictures to be viewed, or thousands of different discussions to be conducted, you could start getting annoyed at all the data you're storing - after all, you only want the top 5 or 10, why store all the rest?!

This is where probabilistic data structures come in. They aim to store the important data, and forget about the rest, and they result in rough rather than exact answers.

We'll be looking at some of these data structures, and then giving them a whirl by implementing them, and hooking them up to some data streams, and seeing how good their answers are, and how much memory they're actually using up.

Coding will be in Haskell.
Audience background:Basic Haskell.
Benefits of participating:Probabilistic data structures are interesting and often not included when considering different problem solutions - you never know when they could come in handy.
Materials provided:Slides. Base code to get people started (github). A data stream to play around with (probably will be a Haskell lib that generates an infinite stream on the fly).
Process:Start with a presentation about probabilistic data structures from myself. Then we'll split into groups and each implement a different one.
Detailed timetable:00:00 - 00:15 presentation
00:15 - 01:00 implementing a top-k solution
01:00 - 01:15 trying out and tweaking our solutions with different sorts of input streams (often distribution or ordering of input stream can really affect how the data structure works)
01:15 - 01:30 discussion of strengths and weaknesses of different solutions
Outputs:A program that outputs to stdout.
History:No. The idea came because this is what I'm researching at work at the moment. I'll be doing a presentation internally at work on the subject, but not a workshop.
1. Emily Green
2. 3.