Machine learning is becoming one of the biggest trends in modern system development, with the potential to deliver strategic insights, predictions, and deep understanding for businesses. However, building and integrating a machine learning system is not always easy, especially for large and distributed systems, where the discipline of ML development has not yet matured to the level of software engineering.
In this session, we will explore how Amazon Web Services (AWS) designed and built one of the most widely adopted MLOps platforms in the world: Amazon SageMaker.
- About the speaker: My Nguyen is a Solutions Architect at AWS Vietnam, specializing in helping customers build machine learning systems.
Code versioning controls
Shared environments, IDE – Jupyter Notebook/Lab
Infrastructure as code
Self-service environment
SaaS
Most importantly: training & processing
Separation of source, environments, etc.
Security
Experiment lifecycles
Pricing
Efficiency
Reproducibility is hard
End-to-end traceability
Dashboard ->
Netflix built Metaflow
Lyft built Flyte
Kubeflow
Apache Airflow
Important factors: skill set & enforcement
Metaflow
Netflix built Metaflow
Netflix is a huge customer of AWS
In production since 2018
Made open source by Netflix & AWS in 2019
What is it?
Basic concepts of metaflow
Deploying to AWS is easy
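To make those basic concepts concrete: a flow is a Python class, steps are methods chained with `self.next()`, and attributes assigned to `self` are automatically versioned as artifacts between steps. A minimal sketch, assuming `metaflow` is installed (the flow and its values are hypothetical; run it with `python my_flow.py run`):

```python
# my_flow.py – a minimal Metaflow sketch (hypothetical example flow)
from metaflow import FlowSpec, step


class TrainFlow(FlowSpec):

    @step
    def start(self):
        # Attributes assigned to self become versioned artifacts.
        self.alphas = [0.01, 0.1]
        # Fan out: run the next step once per value in alphas.
        self.next(self.train, foreach="alphas")

    @step
    def train(self):
        # self.input holds this branch's alpha value.
        self.score = 1.0 - self.input  # placeholder for a real training routine
        self.next(self.join)

    @step
    def join(self, inputs):
        # Join the parallel branches and keep the best score.
        self.best = max(i.score for i in inputs)
        self.next(self.end)

    @step
    def end(self):
        print("best score:", self.best)


if __name__ == "__main__":
    TrainFlow()
```

Deploying the same flow to AWS is then largely a configuration change (e.g. running steps on AWS Batch via Metaflow's decorators) rather than a rewrite, which is what "deploying to AWS is easy" refers to.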
Flyte
A K8s native distributed workflow orchestrator used at Lyft for:
Data science
Pricing
Fraud detection
Locations
ETA and more
Enables highly concurrent, scalable workflows for ML and data processing
Core concepts of Flyte – task, DAG, workflows, control flow specification.
Actual task can be in any language – tasks executed as containers.
Provisions necessary resources dynamically, executes tasks as Docker containers, and de-provisions resources when tasks are complete to control costs.
Supports execution across hundreds of machines, e.g. production model training
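Those core concepts map directly onto Flyte's Python SDK: tasks are strongly typed functions, and a workflow wires them into a DAG. A minimal sketch using the `flytekit` API (the task bodies are placeholders; on a Flyte cluster each task would execute as its own container):

```python
# Sketch of Flyte's task/workflow model using flytekit (placeholder logic).
from flytekit import task, workflow


@task
def preprocess(n: int) -> int:
    # A task is a typed Python function; types drive the DAG's data passing.
    return n * 2


@task
def train(n: int) -> float:
    # Placeholder for a real training step.
    return n / 10.0


@workflow
def pipeline(n: int = 5) -> float:
    # The workflow function declares the DAG: train depends on preprocess.
    return train(n=preprocess(n=n))
```

Workflows defined this way can also be called locally as plain Python for quick iteration before registering them to a cluster.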
Kubeflow, Airflow are fairly popular
Airflow
Amazon SageMaker integrates with Apache Airflow 1.10.1. If you use Airflow, you can drive SageMaker workflows from Apache Airflow
More details from https://sagemaker.readthedocs.io/en/stable/using_workflow.html
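Concretely, the SageMaker Python SDK can generate the configuration that the Airflow contrib operator consumes. A hedged sketch (the estimator setup is assumed to exist already, and the module paths are those of Airflow 1.10.x with SageMaker SDK v1):

```python
# Sketch: triggering SageMaker training from an Airflow 1.10.x DAG.
# `estimator` is an already-configured SageMaker estimator (assumption),
# and the S3 bucket below is hypothetical.
import airflow
from airflow import DAG
from airflow.contrib.operators.sagemaker_training_operator import (
    SageMakerTrainingOperator,
)
from sagemaker.workflow.airflow import training_config

dag = DAG(
    "sagemaker_training",
    default_args={"owner": "airflow",
                  "start_date": airflow.utils.dates.days_ago(1)},
    schedule_interval=None,
)

# Convert the estimator into the config dict the operator expects.
train_config = training_config(
    estimator=estimator,
    inputs={"train": "s3://my-bucket/train"},
)

train_op = SageMakerTrainingOperator(
    task_id="train_model",
    config=train_config,
    wait_for_completion=True,
    dag=dag,
)
```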
Many customers want to use the fully managed capabilities of Amazon SageMaker for machine learning, but also want platform and infrastructure teams to continue using Kubernetes for orchestration and managing pipelines. SageMaker addresses this requirement by letting Kubernetes users train and deploy models in SageMaker using SageMaker-Kubeflow operations and pipelines. With operators and pipelines, Kubernetes users can access fully managed SageMaker ML tools and engines, natively from Kubeflow. This eliminates the need to manually manage and optimize ML infrastructure in Kubernetes while still preserving control of overall orchestration through Kubernetes. Using SageMaker operators and pipelines for Kubernetes, you can get the benefits of a fully managed service for machine learning in Kubernetes, without migrating workloads.
If you use Kubernetes, you can use SageMaker Operators for Kubernetes
You can install the SageMaker Operators for Kubernetes using the provided Helm chart
Once the operators are installed, K8s users can natively invoke SageMaker features like model training, hyperparameter tuning, and batch transform jobs
They can also set up model serving using SageMaker model hosting services
https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_operators_for_kubernetes.html#what-is-an-operator
https://eksworkshop.com/advanced/420_kubeflow/pipelines/
We see customers build serverless ML workflows using AWS Step Functions
Open source - Step Functions Data Science SDK for SageMaker
Create workflows to pre-process data, train/deploy models using SageMaker
Data pre-processing can be done using AWS Glue
SageMaker functionality like model training, HPO and end point creation is accessible
Use the SDK to create and visualize the workflows
Scale workflows without having to worry about infrastructure
https://aws.amazon.com/about-aws/whats-new/2019/11/introducing-aws-step-functions-data-science-sdk-amazon-sagemaker/
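Putting the points above together, a pipeline in the Step Functions Data Science SDK chains SageMaker steps into a state machine. A sketch under stated assumptions (`estimator` is an already-configured SageMaker estimator, and `WORKFLOW_ROLE`, the job name, and the bucket are hypothetical):

```python
# Sketch: a serverless ML pipeline with the Step Functions Data Science SDK.
from stepfunctions.steps import Chain, ModelStep, TrainingStep
from stepfunctions.workflow import Workflow

# Train a model, then register the resulting artifact as a SageMaker model.
training_step = TrainingStep(
    "Train model",
    estimator=estimator,                       # assumption: defined elsewhere
    data={"train": "s3://my-bucket/train"},    # hypothetical bucket
    job_name="my-training-job",                # hypothetical job name
)
model_step = ModelStep(
    "Create model",
    model=training_step.get_expected_model(),
)
definition = Chain([training_step, model_step])

workflow = Workflow(
    name="ml-pipeline",
    definition=definition,
    role=WORKFLOW_ROLE,  # hypothetical IAM role Step Functions can assume
)
workflow.create()    # provision the state machine
workflow.execute()   # run it; workflow.render_graph() visualizes the DAG
```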
Many good tools exist. You can run any of the tools we saw earlier on AWS.
Remember - Tools are meant to make your life easier
Don’t get fixated on the tools.
Work backwards from the problem you are trying to solve.
So think about your existing software engineering workflows and tools
Ask yourself, which tools will best augment what you already have
Ask yourself, which tools are your people most comfortable with
AWS approach is use the tools that work for you
It's easy to think of SageMaker as just a notebook.
The key thing to remember is that the notebook UI we see a lot in the demos is just a part of the SageMaker platform – and an optional part at that!
The notebook is the front-end environment in which we’ll experiment with our data and code.
Keep that instance a low-cost resource. Value of separation…
When we’re ready to try and train or deploy a model, we’ll be spinning up separate, dedicated infrastructure in the SageMaker container runtime – which means we have lots of flexibility to choose resources cost-effectively and only pay for what we need.
All managed
The orchestration that SageMaker gives us to make this happen is closely integrated to these other two services:
The images defining our containers will need to be stored in Amazon ECR (there’s not currently an integration for external registries like DockerHub – but if you have a particular technology in mind our service team would appreciate the feedback!)
…And the preferred storage platform – not just for our input data but also for model artifacts and other outputs generated in the workflow – will be Amazon S3. Why? It has everything you need for a data lake: the most integrated service, arguably the most mature, with storage tiers, security models, and high durability.
Recapping: 4 things
…So let’s look at how that end-to-end process works.
To start with I have:
The data that I want to train on (prepared and loaded to S3) – pre-processed already, in the notebook, but with the option for other services like Glue or Processing Jobs to …
The training script I’d like to run (e.g. defining neural network shape and fitting routine – on the notebook instance where I’m working) minimum code
One of the pre-prepared SageMaker framework container images somewhere in Amazon ECR – maybe TensorFlow, PyTorch, or MXNet: repeatable, controlled, reproducible
So what’s happening when we start a training job by calling “estimator.fit()” in those examples from before?
We’re gonna start seeing a lot of arrows here, so the cool thing to remember is that all of the arrows are things *SageMaker is doing for you* - not things you need to do yourself!
First, assuming you provide a custom code script (or folder of code), the SageMaker SDK is going to zip that up and upload it to a new location in S3. So you can’t forget to check your working version in to git, and you won’t lose track of that version that worked well in the middle of your experiments: The results are going to be traceable to the code that created them.
Next, SageMaker is going to spin up whatever infrastructure you asked for in the fit() request, and pull down the docker image to run on it
SageMaker will also start downloading your source data from S3 into the container – no messing about with S3 API calls in your script – your code can read it from a folder, just as if you were running locally. Env params…
As the container fires up, that framework application does a load of helpful prep but one particularly important thing: It installs any additional inline dependencies specified for your custom code, then starts it up and passes in the parameters of the training job.
Your code runs, prints status to the console, and saves the trained model to disk just like you normally would… But SageMaker takes care of zipping and uploading that final model to S3 – and also other output mechanisms like sending the logs to CloudWatch and collecting metrics. Pay only for …
So the benefit we’ve gained here is that our custom code can be quite simple: Load a CSV from file, make a random forest, save it to file, etc. We can even specify additional dependencies via a requirements.txt file… and SageMaker plus the framework container will orchestrate these overhead tasks to give us this nice lineage-traceable workflow with all of the cool features we talked about earlier – with no extra code complexity required on our part.
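To make the "your code can read it from a folder" point concrete: inside the container, SageMaker surfaces the input channels and output location through environment variables such as `SM_CHANNEL_TRAIN` and `SM_MODEL_DIR`, which map to a conventional `/opt/ml` layout. A minimal, framework-agnostic sketch of that script contract (the "model" here is just a placeholder statistic, not a real fitting routine):

```python
# train.py – minimal sketch of the SageMaker training script contract.
# SageMaker sets these env vars inside the container; the fallbacks below
# match the conventional /opt/ml directory layout.
import csv
import json
import os


def main():
    train_dir = os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train")
    model_dir = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")

    # Read every CSV that SageMaker downloaded from S3 into the channel dir.
    values = []
    for name in os.listdir(train_dir):
        if name.endswith(".csv"):
            with open(os.path.join(train_dir, name)) as f:
                for row in csv.reader(f):
                    values.append(float(row[0]))

    # "Train" a trivial model (placeholder for a real fitting routine).
    model = {"mean": sum(values) / len(values)}

    # Anything saved under model_dir is zipped and uploaded to S3 by SageMaker.
    with open(os.path.join(model_dir, "model.json"), "w") as f:
        json.dump(model, f)


if __name__ == "__main__":
    main()
```

Locally the same script runs unchanged by pointing those two environment variables at ordinary folders – exactly the "just as if you were running locally" behavior.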
When it’s time to deploy that model to an inference endpoint, we simply reference:
Our model artifact tarball from S3
An inference container (which might be the same one as for training, or might be a different image because the dependencies could be differently optimized for run-time)
And maybe some custom code again: This time just defining some helper functions that we might want to customize from the built-in inference flow, such as how to de/serialize requests and responses, or how the model file(s) need to be loaded from disk into memory if the process is different from standard. How it’s optimized
As in training, SageMaker will handle the creation of infrastructure and loading of these components for us. If we used the ‘estimator’ pattern from the high-level SageMaker SDK, all we need to call is a single estimator.deploy(…) function to make it happen.
Again here the intent is that any custom code needed can be small: Just providing a few optional functions for serialization, model loading, etc… Rather than writing and having to maintain a model server, integrations with TorchServe or TensorFlow Serving, etc.
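Those optional helper functions follow a simple naming convention in the framework containers – typically `model_fn`, `input_fn`, `predict_fn`, and `output_fn` (exact hooks vary slightly between frameworks, so treat this as a sketch). The model format and the trivial "prediction" below are hypothetical, just to show the shape of the overrides:

```python
# inference.py – sketch of the override hooks for a framework serving
# container, following the common model_fn/input_fn/predict_fn/output_fn
# pattern. The model format and logic are hypothetical placeholders.
import json
import os


def model_fn(model_dir):
    # Load the model artifact that SageMaker unpacked from the S3 tarball.
    with open(os.path.join(model_dir, "model.json")) as f:
        return json.load(f)


def input_fn(request_body, content_type):
    # Deserialize a custom JSON request into model input.
    if content_type == "application/json":
        return json.loads(request_body)["values"]
    raise ValueError("Unsupported content type: " + content_type)


def predict_fn(data, model):
    # Trivial "inference": distance of each value from the trained mean.
    return [abs(x - model["mean"]) for x in data]


def output_fn(prediction, accept):
    # Serialize predictions back to the client.
    return json.dumps({"predictions": prediction})
```

The model server, batching, health checks, and HTTP plumbing all stay inside the framework container; only these small functions are yours to maintain.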
Custom input format (JSON)…
Not today, but…
In SageMaker, batch transform jobs function pretty much identically to real time inference endpoints from a user code point of view: The batch transform engine handles reading your source data from S3, feeding it through your model, storing the results back to S3, and shutting down the resources again as soon as the job is done.
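From the SageMaker Python SDK, kicking off such a batch job is a couple of calls on a model object. A sketch with hypothetical image URI, role, and S3 paths:

```python
# Sketch: a batch transform job via the SageMaker Python SDK.
# Image URI, S3 paths, and role ARN below are hypothetical placeholders.
from sagemaker.model import Model

model = Model(
    image_uri="<inference-image-in-ecr>",
    model_data="s3://my-bucket/model/model.tar.gz",
    role="<execution-role-arn>",
)
transformer = model.transformer(
    instance_count=2,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/batch-output/",
)
transformer.transform(
    data="s3://my-bucket/batch-input/",
    content_type="text/csv",
    split_type="Line",  # feed the input file through the model line by line
)
transformer.wait()  # infrastructure is torn down when the job completes
```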
Pay only for…
Mechanism: how easiest for different personas?
Skillset dependency – learning curve
…So that’s our overview picture for framework containers:
You write pretty minimal code just as you usually would for experimenting in your notebook. But instead of running that code locally, which can make things like infrastructure optimization, experiment tracking, and inference deployment tricky… SageMaker provides some nice streamlined, high-level APIs to trigger containerized training and inference jobs (or deploy endpoints) on separate infrastructure.
At the fundamental level, the system is super flexible because you can make fully custom container images and model artifact tarballs… But the framework container images together with the SageMaker SDK library (for your notebook) enable this higher-level, container-plus-custom-code workflow.
Same as the morning session, just a different drawing
Solve problems on experimenting, tracking, etc.
Also lessons learnt & best practices
The Repeatable stage is generally focused on applying automation as the number of machine learning workloads running in production increases. In general, at this stage many of the activities in building, training, and deploying machine learning models are automated. The introduction of automation reduces manual hand-offs between teams and reduces the operational overhead of previously manual/ad-hoc tasks. The ability to orchestrate machine learning workflows into automated pipelines also depends on having a data strategy and automated data processing tasks.
Queue Management: Ability to manage, schedule, and prioritize tasks
Resource Management: Access to horizontally scalable compute that can scale based on workflow task requirements
Workflow Operators: Error handling, retry and conditional logic functions
Workflow Logs: Centralized logs and configuration parameters for execution and task level logs
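Most orchestrators provide the "Workflow Operators" capability above declaratively; as an illustration of the underlying behavior, a retry helper with conditional logic might look like this (a generic stdlib sketch, not tied to any specific tool):

```python
# Generic sketch of error handling / retry / conditional logic for a
# workflow task. Names and defaults are illustrative, not from any tool.
import time


def run_with_retries(task, max_attempts=3, backoff_seconds=0.0,
                     retry_on=(RuntimeError,)):
    """Run a workflow task, retrying only on the listed error types."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except retry_on:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the error to the workflow
            time.sleep(backoff_seconds * attempt)  # simple linear backoff
```

Errors outside `retry_on` propagate immediately – the conditional part of the logic – while transient failures are retried up to the attempt limit.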
The Reliable stage builds on the automation from the Repeatable stage but aims to ensure automation is balanced with practices aimed to increase quality, enable end-to-end traceability, increase reliability through automatic rollbacks, increase visibility into development and operational health, and ensure repeatability. In general, at this stage MLOps practices of Infrastructure-as-Code/Configuration-as-Code, Continuous Integration, Continuous Delivery/Deployment, and Continuous Monitoring are introduced.