Xiangrui Meng of Databricks, a committer on Apache Spark, talks about how to make machine learning easy and scalable with Spark MLlib. Xiangrui has been actively involved in the development of Spark MLlib and the new DataFrame API. MLlib is the Apache Spark component that focuses on large-scale machine learning (ML). With 50+ organizations and 110+ individuals contributing, MLlib is one of the most active open-source ML projects. In this talk, Xiangrui shares his experience developing MLlib. The talk covers both the higher-level API, ML pipelines, that makes MLlib easy to use, and the lower-level optimizations that let MLlib scale to massive data sets.
Machine learning workflows often involve a sequence of processing and learning stages. Realistic workflows are often even more complex, incorporating cross-validation to choose parameters and combining multiple data sources. Inspired by scikit-learn, MLlib's pipelines API provides simple abstractions to help users quickly assemble and tune ML pipelines. Under the hood, the API integrates seamlessly with Spark SQL's DataFrames, utilizing its data sources, flexible column operations, rich data types, and execution-plan optimization to create efficient and scalable implementations.
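To make the pipelines idea concrete, here is a minimal sketch (not from the talk) of assembling and cross-validating a text-classification pipeline with the DataFrame-based spark.ml API; the toy data, column names, and parameter grid are illustrative assumptions:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pipeline-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Toy training data: (id, text, label) rows in a DataFrame.
    val training = Seq(
      (0L, "spark makes ml easy", 1.0), (1L, "unrelated noise here", 0.0),
      (2L, "scalable ml with spark", 1.0), (3L, "random filler text", 0.0),
      (4L, "mllib pipelines are simple", 1.0), (5L, "nothing to see", 0.0),
      (6L, "dataframes power mllib", 1.0), (7L, "plain prose example", 0.0)
    ).toDF("id", "text", "label")

    // Three stages: tokenize text, hash tokens into feature vectors,
    // then fit a logistic regression model on those features.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    // A grid of candidate parameters; CrossValidator tunes the whole pipeline at once.
    val grid = new ParamGridBuilder()
      .addGrid(hashingTF.numFeatures, Array(1 << 10, 1 << 14))
      .addGrid(lr.regParam, Array(0.01, 0.1))
      .build()

    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(grid)
      .setNumFolds(2)

    val model = cv.fit(training)
    model.transform(training).select("id", "probability", "prediction").show()
    spark.stop()
  }
}
```

Because every stage reads from and writes to DataFrame columns, the same tuning machinery works whether a stage is a feature transformer or a learning algorithm.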
There are many factors affecting a parallel implementation of an ML algorithm, e.g., the choice of optimization algorithm, platform limitations, communication patterns, data locality, numerical stability and performance, and fault tolerance. Different implementations of the same ML algorithm can perform dramatically differently. Xiangrui shares lessons learned from optimizing the alternating least squares (ALS) implementation in MLlib.
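For context on the ALS discussion, the sketch below shows the DataFrame-based ALS API in spark.ml; the toy ratings and parameter values are assumptions for illustration. The block count is the kind of knob that shapes the partitioning, and hence the communication pattern and data locality, that such implementation-level optimizations target:

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

object AlsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("als-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Toy explicit ratings: (userId, itemId, rating).
    val ratings = Seq(
      (0, 0, 4.0f), (0, 1, 2.0f), (1, 0, 5.0f),
      (1, 2, 1.0f), (2, 1, 3.0f), (2, 2, 4.0f)
    ).toDF("userId", "itemId", "rating")

    // numBlocks controls how the ratings matrix is partitioned across the
    // cluster, which determines how much factor data is shuffled between
    // the alternating user-side and item-side least-squares steps.
    val als = new ALS()
      .setUserCol("userId")
      .setItemCol("itemId")
      .setRatingCol("rating")
      .setRank(8)        // number of latent factors per user/item
      .setMaxIter(5)
      .setRegParam(0.1)
      .setNumBlocks(4)

    val model = als.fit(ratings)
    model.transform(ratings).show() // predicted ratings alongside the originals
    spark.stop()
  }
}
```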
This talk was recorded via Hakka Labs at the NYC ML Meetup, hosted at Pivotal Labs in New York. The slides for the presentation can be seen HERE.