Machine Learning Using Zeppelin and Scikit-Learn


Machine learning applications have been written since the late 1950s (see Perceptron, 1958). The term ‘machine learning’ is attributed to Arthur Samuel, who coined it in 1959. With the recent interest in Deep Learning (see Geoffrey Hinton et al., 2006), machine learning techniques have become valuable tools for the data scientist.

Machine learning is a field of computer science that gives computer systems the ability to progressively improve their performance on a specific task by learning from data, without being explicitly programmed. Many of the tasks performed by data scientists, such as computational statistics (which focuses on making predictions through the use of computers), rely on machine learning algorithms.

Machine learning is often associated with data mining, which data scientists typically accomplish through exploratory data analysis (EDA) and unsupervised learning. In the credit card industry, for example, machine learning is used to establish baseline behavioral profiles for various entities and then find meaningful anomalies, such as the fraudulent use of credit cards or identities.
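The baseline-and-anomaly idea described above can be illustrated with a very small sketch. It is not a method taken from this course or from Scikit-Learn: it assumes a simple z-score rule built with the Python standard library, and the function names (`build_profile`, `is_anomalous`), the threshold of 3 standard deviations, and the purchase amounts are all hypothetical values chosen for illustration.

```python
from statistics import mean, stdev


def build_profile(amounts):
    """Summarize historical purchase amounts as a baseline (mean and spread)."""
    return {"mean": mean(amounts), "stdev": stdev(amounts)}


def is_anomalous(amount, profile, threshold=3.0):
    """Flag an amount that lies more than `threshold` standard deviations
    away from the baseline mean (an assumed, simplistic rule)."""
    if profile["stdev"] == 0:
        # No variation in the history: anything different is unusual.
        return amount != profile["mean"]
    z_score = abs(amount - profile["mean"]) / profile["stdev"]
    return z_score > threshold


# Hypothetical purchase history for one cardholder (made-up values).
history = [12.5, 30.0, 22.0, 18.75, 25.0, 27.5, 15.0, 20.0]
profile = build_profile(history)

# Two ordinary charges and one that is far outside the baseline.
new_charges = [24.0, 19.5, 950.0]
flags = [is_anomalous(charge, profile) for charge in new_charges]
print(flags)  # [False, False, True]
```

A production fraud system would use far richer features and models than a single mean and standard deviation; the sketch only shows the two steps named in the text, profiling historical behavior and then comparing new activity against it.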

Machine learning is used to devise complex models and algorithms that lend themselves to prediction. In commercial use, this is known as ‘predictive analytics.’ Analytical models allow data scientists to "produce reliable, repeatable decisions and results" and uncover "hidden insights" by learning from historical relationships and trends in the data.

This course instructs the student, through lecture and labs using the Scikit-Learn library, in the key concepts and fundamental practices of machine learning that are relevant to the activities of a data scientist.

50% Lecture, 50% Hands-on Labs

Target Audience

Individuals who are new to the application of machine learning. The goal of this courseware is to provide the concepts and tools a data scientist needs to implement programs that are capable of ‘learning’ from data. Applications will be written in the Python programming language using the Apache Zeppelin environment.


Experience with the Python programming language and the Zeppelin environment, along with exposure to EDA statistics, is a prerequisite for this course. Students who are new to programming and new to Zeppelin are advised to first take the course ‘Introduction to Python using Zeppelin.’ The EDA statistics requirement can be met either through experience programming EDA statistics in Zeppelin or by completing the course ‘Statistics for Data Science using Zeppelin.’

Course Outline

  • Day 1: Fundamental Machine Learning and Classification
  • Day 2: Training Models and Support Vector Machines
  • Day 3: Decision Trees, Ensemble Learning, and Random Forests
  • Day 4: Dimensionality Reduction
