Tuesday, September 22, 2020

Data Science Study Notes: Automatic Machine Learning (AutoML)

Automated Machine Learning (AutoML) is one of the hottest topics in data science today, but what does it mean?

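Why is random search so much faster than grid search while still being unlikely to miss the optimal point? The key formula is:
1 - (1 - 0.05)^n >= 0.95
Where does the formula come from? Assume the optimal region is the best 5% of the search space. Each random sample then misses it with probability 1 - 0.05 = 0.95, so n independent samples all miss it with probability 0.95^n, and at least one sample lands in it with probability 1 - 0.95^n.
So if we apply the typical condition, a 95% chance of landing in the optimal region, 60 random samples are enough: 1 - 0.95^60 is about 0.954.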
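The 60-trial figure can be checked numerically. The sketch below is only an illustration: it assumes the "optimal region" means the best 5% of the search space and that samples are drawn independently and uniformly. The function names are made up for this example.

```python
import math
import random


def min_trials(top_fraction=0.05, confidence=0.95):
    """Smallest n such that 1 - (1 - top_fraction)**n >= confidence."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - top_fraction))


def hit_probability(n, top_fraction=0.05):
    """Chance that at least one of n independent samples lands in the top region."""
    return 1 - (1 - top_fraction) ** n


def simulate(n, top_fraction=0.05, runs=20000, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = sum(
        any(rng.random() < top_fraction for _ in range(n))
        for _ in range(runs)
    )
    return hits / runs


if __name__ == "__main__":
    print("minimum trials:", min_trials())
    print("exact probability with 60 trials:", round(hit_probability(60), 4))
    print("simulated probability with 60 trials:", round(simulate(60), 4))
```

Under these assumptions the exact threshold comes out slightly below 60, so 60 is a convenient round number rather than a sharp bound.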
Here is a graph that shows why random search is likely to reach the optimal region in far fewer trials. Usually only a few hyperparameters really matter for the model, so only those few are worth searching carefully. Grid search repeats the same handful of values for the important hyperparameters while it steps through the unimportant ones, and those extra trials are essentially wasted. Random search tries a new value of every hyperparameter on every trial.
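The same idea can be seen in a toy experiment. This is a minimal sketch, not a benchmark: the objective function is invented for illustration, with one hyperparameter (x) that matters a lot and one (y) that barely matters. Both searches get the same budget of 16 trials.

```python
import itertools
import math
import random


def score(x, y):
    # Toy objective (an assumption for illustration):
    # x has a sharp optimum near 0.73, y contributes almost nothing.
    return math.exp(-((x - 0.73) / 0.1) ** 2) + 0.01 * y


def grid_search(points_per_axis):
    """Evaluate an evenly spaced grid on [0, 1] x [0, 1]."""
    axis = [i / (points_per_axis - 1) for i in range(points_per_axis)]
    trials = list(itertools.product(axis, axis))
    best = max(score(x, y) for x, y in trials)
    distinct_x = len({x for x, _ in trials})
    return best, distinct_x


def random_search(n_trials, seed):
    """Evaluate n_trials points drawn uniformly from [0, 1] x [0, 1]."""
    rng = random.Random(seed)
    trials = [(rng.random(), rng.random()) for _ in range(n_trials)]
    best = max(score(x, y) for x, y in trials)
    distinct_x = len({x for x, _ in trials})
    return best, distinct_x


# Same budget for both: a 4 x 4 grid versus 16 random points.
grid_best, grid_distinct_x = grid_search(4)
random_results = [random_search(16, seed) for seed in range(200)]
random_mean_best = sum(best for best, _ in random_results) / len(random_results)

if __name__ == "__main__":
    print("grid: distinct x values =", grid_distinct_x, "best score =", round(grid_best, 3))
    print("random: distinct x values = 16, mean best score =", round(random_mean_best, 3))
```

The grid only ever tests 4 distinct values of the important hyperparameter, however many trials it spends, while the 16 random trials test 16 different values. That is why random search gets closer to the optimum on average in this toy setup.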
Here is a video from Danny Leybzon. Danny has an academic background in computational statistics. He believes that good data science requires good data engineering in order to create clean, accurate, and accessible data for data scientists. In the past, he’s given presentations on distributed deep learning, productionizing machine-learning models, and the importance of big data for machine learning in the modern world.

In this workshop, Danny D. Leybzon (a seasoned data scientist and Solutions Architect at Qubole) will give a broad overview of AutoML, ranging from simple hyperparameter optimization all the way to full pipeline automation. After going over the theoretical framework and explanation of AutoML, he will dive into concrete examples of different types of AutoML. Throughout the presentation, Danny will leverage Apache Spark (a framework popular with data scientists who need to scale their machine learning workloads to Big Data) and Apache Zeppelin notebooks, as well as popular Python libraries such as Pandas, Plotly, and bayes-opt. Participants will walk away from this workshop with in-depth knowledge of hyperparameter tuning (using grid search, random search, Bayesian optimization, and genetic algorithms) and will have been exposed to new tools for automating their machine learning workflows.
