At re:Invent 2018 Amazon announced its custom-designed chip named Inferentia due out next year and created Elastic Inference, a service that identifies parts of a neural network that can benefit from acceleration.
AWS Inferentia is a machine learning inference chip designed to deliver high performance at low cost. AWS Inferentia will support the TensorFlow, Apache MXNet, and PyTorch deep learning frameworks, as well as models that use the ONNX format.
Making predictions using a trained machine learning model–a process called inference–can drive as much as 90% of the compute costs of the application. Using Amazon Elastic Inference, developers can reduce inference costs by up to 75% by attaching GPU-powered inference acceleration to Amazon EC2 and Amazon SageMaker instances. However, some inference workloads require an entire GPU or have extremely low latency requirements. Solving this challenge at low cost requires a dedicated inference chip.
AWS Inferentia provides high throughput, low latency inference performance at an extremely low cost. Each chip provides hundreds of TOPS (tera operations per second) of inference throughput to allow complex models to make fast predictions. For even more performance, multiple AWS Inferentia chips can be used together to drive thousands of TOPS of throughput. AWS Inferentia will be available for use with Amazon SageMaker, Amazon EC2, and Amazon Elastic Inference.