SPA Conference session: Getting Started with Data Science

One-line description: Start Data Science - without lots of baggage
Session format: Short Tutorial
Abstract: Data science and machine learning are gaining more and more popularity. However, making a start can seem difficult amid all the talk of "cloud scale", big data, Hadoop, Mahout, etc. Is it really impossible math?

In this tutorial, we will move away from the large tools and petabytes of data, and get started with machine learning using simpler approaches. We will discuss regression, classification, clustering, recommendation engines, and related topics. We will then implement some of these techniques to see them in action. This should give us enough ammunition to go and explore (and understand) "Big Data" offerings.

In particular, we will code the following algorithms:

kNN Classification: Classification based on majority vote from neighbouring data points.

Decision Trees: Work out which parameters contribute, and by how much, to deciding which group an input belongs to.

k-Means Clustering: Group similar data points into k clusters to find hidden patterns in data.
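
To give a feel for how small these algorithms are, here is a minimal sketch of all three. It is an illustration only, not the session material: the session exercises use R, whereas this sketch is in plain Python, and every name in it (knn_classify, best_split, kmeans, the toy train data) is hypothetical. The decision tree part only finds the single best first split, using Gini impurity as an assumed scoring rule; a full tree would repeat that step on each side of the split.

```python
import math
from collections import Counter

# Toy data, made up for illustration: (features, label) pairs.
train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
         ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b")]

def knn_classify(train, query, k=3):
    """kNN: majority vote among the k training points nearest to query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def gini(labels):
    """Gini impurity of a list of labels (0.0 means all labels agree)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(train):
    """Decision tree root: the (feature index, threshold) pair with the
    lowest weighted Gini impurity, or None if no split separates the data."""
    best = None
    for f in range(len(train[0][0])):
        for threshold in sorted({x[f] for x, _ in train}):
            left = [y for x, y in train if x[f] <= threshold]
            right = [y for x, y in train if x[f] > threshold]
            if not left or not right:
                continue  # a split must put points on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(train)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return None if best is None else (best[1], best[2])

def kmeans(points, centroids, iterations=10):
    """k-means (Lloyd's algorithm) from caller-supplied starting centroids.
    Returns the final centroids and the last cluster assignment."""
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

print(knn_classify(train, (1.1, 1.0)))
print(best_split(train))
print(kmeans([x for x, _ in train], [(0.0, 0.0), (6.0, 6.0)])[0])
```

Note that kNN and the decision tree split use the labels, while k-means ignores them and works from the points alone; that is the difference between classification and clustering that the exercises explore.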

We will interactively act out the algorithms, and discuss other aspects of data science (for example, reproducible research). We will also look at some samples. To follow along with the code, bring a laptop with R and R Studio installed.

R can be downloaded from
R Studio can be downloaded from
Audience background: Programming experience and an interest in data science.
Benefits of participating: Gain a hands-on introduction to data science.
Materials provided: Exercise instructions and datasets.
Process: Discussions and hands-on exercises.
Detailed timetable: 00:15
Exercise: kNN Classification
Exercise: Decision Trees
Exercise: k-Means Clustering
1. Ashic Mahtab
Heartysoft Solutions Limited