Posted On: Nov 29, 2023
We are excited to announce new capabilities in Amazon SageMaker that help customers reduce model deployment costs by 50% on average and achieve 20% lower inference latency on average. Customers can deploy multiple models to the same instance to better utilize the underlying accelerators. SageMaker actively monitors the instances that are processing inference requests and intelligently routes requests based on which instances are available.
These features are available for SageMaker's real-time inference, which makes it easy to deploy ML models. You can now create one or more InferenceComponents and deploy them to a SageMaker endpoint. An InferenceComponent abstracts your ML model and enables you to assign CPUs, GPUs, or AWS Neuron accelerators, and scaling policies, per model. We will intelligently place each model across the instances behind the endpoint to maximize utilization and save costs. Each model can be independently scaled, all the way down to zero, which frees up hardware resources for other models on the instance. Each model also emits its own metrics and logs to help you monitor and debug any issues. We also added a new Least Outstanding Requests routing algorithm, which distributes requests more evenly across instances and reduces end-to-end latency.
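As a rough sketch of how these pieces fit together with the AWS SDK for Python (boto3): you create an endpoint config without a model, create the endpoint, and then attach each model as an inference component with its own compute reservation. All resource names, the IAM role ARN, the instance type, and the compute requirements below are illustrative placeholders, not values from this announcement.

    import boto3

    sm = boto3.client("sagemaker")

    # Endpoint config for an inference-component endpoint: no ModelName here,
    # models are attached later as InferenceComponents.
    sm.create_endpoint_config(
        EndpointConfigName="my-endpoint-config",  # placeholder name
        ExecutionRoleArn="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder
        ProductionVariants=[
            {
                "VariantName": "AllTraffic",
                "InstanceType": "ml.g5.12xlarge",  # illustrative instance type
                "InitialInstanceCount": 1,
                # Let SageMaker add/remove instances as model copies scale.
                "ManagedInstanceScaling": {
                    "Status": "ENABLED",
                    "MinInstanceCount": 1,
                    "MaxInstanceCount": 2,
                },
                # Opt in to the new Least Outstanding Requests routing.
                "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
            }
        ],
    )

    sm.create_endpoint(
        EndpointName="my-endpoint",
        EndpointConfigName="my-endpoint-config",
    )

    # Attach a model to the endpoint as an inference component, reserving a
    # slice of the instance's compute for it.
    sm.create_inference_component(
        InferenceComponentName="my-model-ic",
        EndpointName="my-endpoint",
        VariantName="AllTraffic",
        Specification={
            "ModelName": "my-model",  # an existing SageMaker Model resource
            "ComputeResourceRequirements": {
                "NumberOfAcceleratorDevicesRequired": 1,  # GPU or Neuron devices
                "NumberOfCpuCoresRequired": 2,
                "MinMemoryRequiredInMb": 4096,
            },
        },
        RuntimeConfig={"CopyCount": 1},  # initial number of model copies
    )

To invoke a specific model, pass InferenceComponentName to invoke_endpoint on the sagemaker-runtime client alongside the usual EndpointName, ContentType, and Body parameters.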
These new features are generally available in: Asia Pacific (Tokyo, Seoul, Mumbai, Singapore, Sydney, Jakarta), Canada (Central), Europe (Frankfurt, Stockholm, Ireland, London), Middle East (UAE), South America (Sao Paulo), US East (N. Virginia, Ohio), and US West (Oregon).
Learn more by visiting our documentation page and our product page.