Run:ai Releases Advanced Model Serving Functionality to Help Organizations Simplify AI Deployment 

Run:ai, a leader in compute orchestration for AI workloads, announced new features of its Atlas Platform, including two-step model deployment — which makes it easier and faster to get machine learning models into production. The company also announced a new integration with NVIDIA Triton Inference Server. These capabilities are particularly focused on supporting organizations in deploying and using AI models for inference workloads on NVIDIA-accelerated computing, so they can provide accurate, real-time responses. The features cement Run:ai Atlas as a single unified platform where AI teams, from data scientists to MLOps engineers, can build, train and manage models in production from one simple interface. 

AI models can be challenging to deploy into production; despite the time and effort spent to build and train models, most never leave the lab. Configuring a model, connecting it to data and containers, and dedicating only the required amount of compute are major barriers to making AI work in production. Deploying a model usually requires manually editing and loading tedious YAML configuration files. Run:ai’s new two-step deployment makes the process easy, enabling organizations to quickly switch between models, optimize for economical use of GPUs, and ensure that models run efficiently in production. 

“With new advanced inference capabilities, Run:ai’s Atlas Platform now offers a solution for the entire AI lifecycle, from build to train to inference, all delivered in a single platform,” said Ronen Dar, CTO and co-founder of Run:ai. “Instead of using multiple different MLOps and orchestration tools, data scientists can benefit from one unified, powerful platform to manage all their AI infrastructure needs.”

Run:ai also announced full integration with NVIDIA Triton Inference Server, which allows organizations to deploy multiple models (or multiple instances of the same model) and run them in parallel within a single container. NVIDIA Triton Inference Server is included in the NVIDIA AI Enterprise software suite, which is fully supported and optimized for AI development and deployment. Run:ai’s orchestration works on top of NVIDIA Triton and provides auto-scaling, allocation and prioritization on a per-model basis, which right-sizes Triton automatically. Using Run:ai’s Atlas with NVIDIA Triton leads to increased compute resource utilization while simplifying AI infrastructure. The Run:ai Atlas Platform is an NVIDIA AI Accelerated application, indicating it is developed on the NVIDIA AI platform for performance and reliability.

Running inference workloads in production requires fewer resources than training, which consumes large amounts of GPU compute and memory. Organizations sometimes run inference workloads on CPUs instead of GPUs, but this might mean higher latency. In many use cases for AI, the end user requires a real-time response: identification of a stop sign, facial recognition on a phone, or voice dictation, for example. CPU-based inference can be too slow for these applications. 

Using GPUs for inference workloads gives lower latency and higher accuracy, but this can be costly and wasteful when GPUs are not fully utilized. Run:ai’s model-centric approach automatically adjusts to diverse workload requirements. With Run:ai, using a full GPU for a single lightweight workload is no longer required, saving considerable cost while maintaining low latency. 
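To make the cost argument concrete, here is a minimal sketch of how fractional GPU requests could be packed onto whole GPUs. It is an illustration only: the function name `gpus_needed` and the first-fit-decreasing strategy are assumptions made for this example and do not describe how Run:ai Atlas actually schedules workloads.

```python
def gpus_needed(fractions):
    """Count whole GPUs needed to host the given fractional GPU requests.

    Uses first-fit decreasing bin packing. This is a hypothetical helper for
    illustration, not part of any Run:ai or NVIDIA API.
    """
    free = []  # remaining capacity on each GPU opened so far
    for need in sorted(fractions, reverse=True):
        if not 0 < need <= 1:
            raise ValueError(f"GPU fraction must be in (0, 1], got {need}")
        for i, capacity in enumerate(free):
            # Small tolerance guards against floating-point rounding.
            if need <= capacity + 1e-9:
                free[i] = capacity - need
                break
        else:
            # No open GPU has room, so open a new one.
            free.append(1.0 - need)
    return len(free)


# Five lightweight models, each needing half or a quarter of a GPU.
requests = [0.5, 0.25, 0.25, 0.5, 0.5]
packed = gpus_needed(requests)
dedicated = len(requests)  # one full GPU per model
```

In this example the five models fit on two GPUs when packed, compared with five GPUs when each model gets a dedicated device.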

Other new features of Run:ai Atlas for inference workloads include: 

  • Visibility and Monitoring – New inference-focused metrics and dashboards give insights into the health and performance of the AI models in production.
  • Deploy Models on Fractional GPUs – Right-sizing models and deploying them on GPU fractions avoids resource waste and ensures performance requirements are met.
  • Auto-Scaling – Allows organizations to automatically scale models up or down based on predefined thresholds using built-in and GPU-specific metrics. This ensures model Service Level Agreements (in terms of latency) are met.
  • Scale-to-Zero – Automatically scales deployments to zero resources when possible, which reduces cost and frees those resources for other workloads.
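
The auto-scaling and scale-to-zero behavior described in the list above can be sketched as a simple decision function. This is a hedged illustration, not Run:ai's implementation: `ScalingPolicy`, `desired_replicas` and all field names are hypothetical, and the proportional rule is borrowed from the general shape of the Kubernetes Horizontal Pod Autoscaler formula.

```python
import math
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    """Hypothetical policy; field names are illustrative, not Run:ai settings."""
    target_latency_ms: float  # latency threshold the deployment should stay under
    min_replicas: int = 0     # 0 permits scale-to-zero
    max_replicas: int = 8


def desired_replicas(current, observed_latency_ms, requests_per_sec, policy):
    """Return the replica count a threshold-based autoscaler might choose."""
    # Scale-to-zero: with no traffic, drop to the configured minimum.
    if requests_per_sec == 0:
        return policy.min_replicas
    floor = max(1, policy.min_replicas)  # traffic needs at least one replica
    # Waking from zero: start with the smallest serving footprint.
    if current == 0:
        return floor
    # Proportional rule: scale by the ratio of observed to target latency.
    wanted = math.ceil(current * observed_latency_ms / policy.target_latency_ms)
    return max(floor, min(policy.max_replicas, wanted))


policy = ScalingPolicy(target_latency_ms=100)
scale_up = desired_replicas(2, 200, 50, policy)    # latency is twice the target
scale_down = desired_replicas(4, 50, 50, policy)   # latency is half the target
idle = desired_replicas(3, 80, 0, policy)          # no incoming requests
```

Under these assumptions, the overloaded deployment grows from two to four replicas, the underloaded one shrinks from four to two, and the idle one is released entirely.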

“The flexibility and portability of NVIDIA Triton Inference Server, available with NVIDIA AI Enterprise support, enables fast, simple scaling and deployment of trained AI models from any framework on any GPU- or CPU-based infrastructure,” said Shankar Chandrasekaran, senior product manager at NVIDIA. “Triton Inference Server’s advanced performance and ease of use together with orchestration from Run:ai’s Atlas Platform make it the ideal foundation for AI model deployment.”
