Learning ONNX for trading - page 6

 

Accelerating ML Inference at Scale with ONNX, Triton and Seldon | PyData Global 2021




In the video "Accelerating ML Inference at Scale with ONNX, Triton and Seldon | PyData Global 2021," Alejandro Saucedo of Seldon Technologies discusses the challenges of scaling machine learning inference and how to use ONNX and Triton to optimize and productionize models. Using the GPT-2 TensorFlow model as a use case, the session covers pre-processing, selecting optimal tokens, and deploying the model using Tempo and the Triton inference server. Saucedo emphasizes the need to abstract infrastructure complexities and facilitate easy deployment while ensuring reproducibility and compliance. The talk concludes with collaborations with open-source projects for end-to-end training and deployment components.

  • 00:00:00 In this section Alejandro Saucedo introduces himself and his company, Seldon Technologies, which focuses on machine learning deployment and monitoring. He explains that the session will cover the challenge of accelerating machine learning inference at scale by taking a practical approach, using the GPT-2 TensorFlow model as a use case. The goal is to optimize the model using ONNX and test it locally using a tool called Tempo before productionizing it on Kubernetes. The main focus is to abstract underlying infrastructure complexities so that data science practitioners can focus on the data science side. Saucedo also explains what GPT-2 is and its applications, noting that it is a relatively complex model that requires a significant amount of compute for training and inference.

  • 00:05:00 In this section of the video, the speaker discusses the use of pre-trained models, specifically GPT-2, and how they can be leveraged for various applications, such as games and code generation. The speaker explains how to use the Hugging Face library to fetch the pre-trained GPT-2 model and discusses the components of the model, including the tokenizer and the head model, which are used for pre-processing and inference. The speaker then walks through an example of pre-processing a human-readable sentence using the tokenizer and using the generate function to predict 20 tokens. Finally, the speaker explains the underlying workflow of the generate function, which runs the inference multiple times for the model, and how to convert the machine-readable output back to human-readable format using the decoder of the tokenizer.
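
A minimal sketch of that pre-process → generate → decode loop, assuming the Hugging Face transformers package and the stock "gpt2" TensorFlow weights; the prompt and the 20-token count are illustrative:

```python
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = TFGPT2LMHeadModel.from_pretrained("gpt2")

# Pre-process a human-readable sentence into machine-readable token ids.
inputs = tokenizer("Machine learning inference at scale", return_tensors="tf")

# generate() runs inference repeatedly, appending one predicted token per step.
output_ids = model.generate(inputs["input_ids"], max_new_tokens=20)

# Convert the machine-readable output back into a human-readable string.
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```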

  • 00:10:00 In this section, the speaker explains the process of selecting the optimal token for a model and the challenges that practitioners face when scaling the model. The challenges include the requirement for specialized hardware, complex dependencies between components, and the need for reproducibility and compliance. The speaker then introduces the ONNX format, which is widely adopted in industry and allows for the transformation of Python or PyTorch models into a standardized format for optimized servers to read and modify for optimal performance. However, the conversion into ONNX is not necessary, and practitioners can still deploy their models directly. The speaker also introduces the Tempo framework, which simplifies the process of deploying and scaling models.

  • 00:15:00 In this section, the speaker discusses how to productionize the model using Tempo and the Triton inference server. The first step is to define a Tempo wrapper for the GPT-2 model and then run the optimized model in the optimized server from Nvidia. Next, a custom transformer logic must be defined to convert machine-readable tokens to human-readable strings, allowing users to interact with the model in an easy and simple way. After testing the model in Docker locally, it can be deployed into the Kubernetes stack with a simple command. The code for defining the wrapper is shown, and the speaker explains that this method allows different types of model frameworks to be used, making it a versatile tool for productionizing ML inference.

  • 00:20:00 In this section, the speaker discusses how to create a custom transformer logic using PyTorch models with ONNX and Triton. First, the transcript explains how to send tokens through this custom logic, using string conversion and the predict function. The speaker then explains how to load artifacts from the GPT2 model and define the predict function as a REST-endpoint, before iterating through the tokens and generating the response. The key takeaway is that, by passing tokens through the model and iterating through several runs, we can return a string from this complex infrastructure. Furthermore, the speaker mentions that this approach can be deployed easily in Kubernetes through the deploy remote function.
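
As a rough illustration of that iterate-and-decode loop, here is a greedy-decoding sketch in Python with ONNX Runtime; the file name "gpt2.onnx" and the input/output names are assumptions, not the exact Tempo/Triton code from the talk:

```python
import numpy as np
import onnxruntime as ort
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
session = ort.InferenceSession("gpt2.onnx")          # assumed exported GPT-2 model

token_ids = tokenizer.encode("Machine learning inference at scale")
for _ in range(20):                                   # generate 20 new tokens
    input_ids = np.array([token_ids], dtype=np.int64)
    logits = session.run(None, {"input_ids": input_ids})[0]
    next_id = int(np.argmax(logits[0, -1]))           # greedy: most likely next token
    token_ids.append(next_id)

print(tokenizer.decode(token_ids))                    # back to a human-readable string
```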

  • 00:25:00 In this section, the speaker discusses machine learning acceleration at scale, specifically focusing on optimizing a GPT-2 TensorFlow model with ONNX and running it with Triton using Docker, and then deploying it in Kubernetes while following best practices. The goal is to minimize leakage of the underlying infrastructure and ensure reliable deployment with minimal effort. The speaker also mentions their collaborations with Tempo and Metaflow teams to provide end-to-end training and deployment components in open-source projects. The talk concludes with a Q&A session.
Accelerating ML Inference at Scale with ONNX, Triton and Seldon | PyData Global 2021
  • 2022.01.19
  • www.youtube.com
Accelerating ML Inference at Scale with ONNX, Triton and SeldonSpeaker: Alejandro SaucedoSummaryIdentifying the right tools for high performant production ma...
 

AI Show Live - Episode 62 - Multiplatform Inference with the ONNX Runtime




In the "Multiplatform Inference with the ONNX Runtime" episode of the AI Show Live, hosts showcase how to deploy a super resolution model and an object detection model on multiple platforms using the ONNX Runtime framework. They discuss pre-processing and post-processing steps for both mobile and web platforms, demonstrate the benefits of using a single solution, explain the process of converting a PyTorch model to an ONNX model, and showcase how to preprocess data for inference with the ONNX Runtime. Additionally, they demonstrate the implementation of the BERT natural language processing model using Onnx Runtime in C#. The code and open-source models are available for customization for users' solutions.

In the second part of the AI Show Live, the presenters cover a variety of topics related to running inference with the ONNX Runtime. They demonstrate the process of text classification using an example from the ONNX inference examples and explore the installation of packages and tools needed to build BERT classification models in C#. They also discuss the use of IntelliCode with VS 2022 and walk through the steps of preparing for model inference, including creating tensors, configuring the ONNX Runtime inference session, and post-processing the output. Additionally, they touch on the importance of consulting model documentation and selecting the correct tokenizer for accurate results.

  • 00:00:00 In this section of the AI Show Live, host Cassie Breviu introduces special guests Victor, Kalia, and David, interns on the ONNX Runtime team, who will be showcasing a project on how to deploy a super resolution model and an object detection model on mobile and web using the ONNX Runtime framework. The project aims to improve object detection on images through super resolution output, demonstrating the capability of the ONNX Runtime for multi-platform deployment.

  • 00:05:00 In this section, the hosts introduce a project that uses React Native and Expo to build an app that can be deployed across different platforms for mobile and web. They explain that using React Native's native modules feature allows for the implementation of functions and models in other languages like C++ and Java, which can be used in JavaScript code. This feature enables them to write pre-processing functions in a different language, such as their get pixels function written in Java, to handle data better, making it easier to get the pixel data of an image for their machine learning models.

  • 00:10:00 In this section of the YouTube video, the speaker discusses the pre-processing and post-processing steps of a mobile application that uses a super resolution model. Unlike other models that work with RGB values, this model only works with the luminance (Y) component of an image. Therefore, the speaker shows how to convert an RGB image to a YCbCr image to extract the Y component. The speaker also demonstrates how to load the model into the mobile environment using the ONNX Runtime format, which provides an optimized and reduced size build for mobile and web applications. Finally, the post-processing step is done to process the output of the model.
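
The luminance-extraction step can be illustrated with a small numpy sketch (BT.601 luma weights); the mobile demo does this in Java/JavaScript rather than Python, so this is only an equivalent:

```python
import numpy as np

def rgb_to_y(img_rgb: np.ndarray) -> np.ndarray:
    """Return the luminance (Y) channel of an HxWx3 RGB image, scaled to [0, 1]."""
    r, g, b = img_rgb[..., 0], img_rgb[..., 1], img_rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b          # BT.601 luma weights
    return (y / 255.0).astype(np.float32)          # super-resolution models typically expect floats
```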

  • 00:15:00 In this section of the video, the host and guest demonstrate the post-processing function that takes in an array from their model and changes the YCbCr back to RGB. They then use a native module's function to get the image source for display. The pre-processing and post-processing in non-Python languages tend to be the hard part when operationalizing and inferring models in other languages. They show a demo where the model is deployed locally on a mobile device, and then later reuse the code to deploy the same model in a web browser. One viewer asks if the same can be done in C#, to which the guest believes it is possible.

  • 00:20:00 In this section, Kalia demonstrates the differences in pre and post-processing for running the model on web versus mobile platforms. On the web, an off-screen canvas and the Canvas API are used to get the image data, whereas on mobile, there is no need to go back and forth between APIs. Once the off-screen canvas draws the image, the pre-processing function adjusts the image data to the Y channel, which the super-resolution model uses. The post-processing function then converts the data from YCbCr to RGB format so that it can be displayed on a screen. Kalia's code for the pre and post-processing functions can be used in either Java, C#, or React.

  • 00:25:00 In this section of the video, the presenters discuss the benefits of using a single solution for multiple devices, such as with the ONNX Runtime. They demonstrate how to run a mobile model on a web platform and the advantages of on-device inferencing, including cost efficiency and privacy. The presenters also explain the process of converting a PyTorch model to an ONNX model and then to an ONNX Runtime format. Finally, they introduce the object detection aspect of the project and explain how they used ONNX Runtime for detecting objects in the images.

  • 00:30:00 In this section of the video, the presenter discusses the details of the model used in their project, which is an object detection AI model that utilizes super resolution to increase the accuracy of the overall detection. They explain the differences in pre-processing and post-processing between their model and the previously discussed model and detail the four outputs of their model, including location, classes, score, and number of detections. Additionally, they show how they utilized the Netron tool to analyze and break down their model, and how they adjusted the pre-processing to keep the RGB values consistent for the model to detect objects accurately.

  • 00:35:00 In this section, the presenter demonstrates running a model on a photo pre-super resolution and shows the results of the object detection, which accurately identifies the dog in the photo. Using the super resolution model enhances the image and leads to a more accurate and smaller detection box. This demonstrates the portability and practical use of the ONNX Runtime and shows the capability of running an optimized model on device. The code and open source models are also available for users to access and customize for their own solutions.

  • 00:40:00 In this section, the hosts demonstrate the BERT natural language processing model using ONNX Runtime in C#. The host explains that while there are many examples of using BERT in Python, they preferred to use C#. They started with the BERT base uncased model before moving on to an example from the ONNX Runtime docs for question-answering. With the Hugging Face transformers API, they were able to easily grab the pre-trained model and export it into the ONNX format. They then show how to give the model inputs and run it using ONNX Runtime in C#.

  • 00:45:00 In this section, the speaker discusses the pre-processing step for the model, where the text is tokenized. They showcase how dynamic axes are used to allow for different input lengths, and how they use the tokenizer to pre-process the input in C#. They also introduce an open-source BERT tokenizer project that allows them to tokenize input for the BERT model in C#, which is not possible with the Python-based transformers package. The encoded input is then returned as input IDs, which are the different tokens that are attached to different words in the model.

  • 00:50:00 In this section, the presenter discusses the implementation of BERT models in C# by creating a console app. They explain that using a console app is helpful when experimenting with different C# implementations of models, and it can be integrated into a production application if needed. The presenter demonstrates how to use tokenization to get the actual tokens of a sentence and how to encode input with the IDs associated with tokens. They also show the large vocabularies used and how they are turned into objects to be used in tokenization.

  • 00:55:00 In this section, the presenter is discussing how to preprocess data and prepare it for inference with the ONNX Runtime. They demonstrate how to convert data into tensors, which are required for the inference process, and how to create a list of named ONNX value objects to pass into the inference session. They also mention the importance of setting correct labels for the input data. Overall, they provide helpful tips for working with the ONNX Runtime and preparing data for machine learning inference.

  • 01:00:00 In this section, the speaker runs the inference values and obtains the start and end logits. The results are returned in the order of the index of the labels. To get the predicted answer, the maximum value and the index of the maximum value from the starting and end logits need to be obtained first. The output names are shown, and the encoded token values can be seen, which are used to compare if the tokens are correct. The speaker also demonstrates the process of converting Python inference code into C# for operationalizing models. Finally, they suggest experimenting with more models, converting Python inference code into C#, and fine-tuning models.
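
The same question-answering flow, sketched in Python rather than the C# used on the show; the model file name, tokenizer checkpoint, and the assumption that the first two outputs are the start/end logits are all illustrative:

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # assumed checkpoint
session = ort.InferenceSession("bert_qa.onnx")                   # assumed exported QA model

question, context = "What engine runs the model?", "The model is run with ONNX Runtime."
enc = tokenizer(question, context, return_tensors="np")

# Feed only the inputs that the ONNX graph actually declares.
feed = {i.name: enc[i.name] for i in session.get_inputs() if i.name in enc}
start_logits, end_logits = session.run(None, feed)[:2]

start, end = int(np.argmax(start_logits)), int(np.argmax(end_logits))
print(tokenizer.decode(enc["input_ids"][0][start:end + 1]))      # predicted answer span
```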

  • 01:05:00 In this section of the video, the host explores text classification using the ONNX runtime and an example from the ONNX inference examples, which is a good resource for finding examples of how to use ONNX. The example uses a tokenizer from Hugging Face and a smaller, distilled version of the base uncased model. The host sets the path based on the model name, then sets the inputs for the model, which has dynamic axes due to variable sentence length. The inputs for the text classification model are input IDs and the attention mask. The host mentions that there are extensions being developed for ONNX and that the new version of the runtime supports .NET 6 and MAUI. Although the pre-processing example for the model is not readily available, the host plans to use Netron to figure it out.

  • 01:10:00 In this section of the video, the speaker renames the previous model in a less informative way and adds text classification to it. They go on to create a new project using C# and .NET 6 and explain the new feature of having a simple scripted console template. The speaker copies and pastes some code from the previous project into the new one and adds the BERT input, which now only has two outputs. The speaker acknowledges that they should create a separate file for this input but opts to script it out instead.

  • 01:15:00 In this section, the speaker is discussing the installation of various packages and tools to build BERT classification models in C#. They install the required tokenizer package and the ONNX runtime packages, along with the managed package. The unneeded attribute packages are commented out, and the speaker adds an input sentence and tokenizer to obtain the tokens for encoding. The speaker also mentions the VS 2022 IntelliCode, which uses the GPT-2 model to train on the code base and runs locally.

  • 01:20:00 In this section of the video, the presenter talks about using IntelliCode with VS 2022, an AI-powered evolution of IntelliSense that can learn from an existing codebase. They then move on to working with the tokenizer and the encoded value for a sentence. They also discuss the model path and how to paste the model into a console app for experimental purposes, although there are better ways to handle this for a production application. Finally, the presenter creates an inference session for the BERT model.

  • 01:25:00 In this section of the video, the presenters go through the steps needed to prepare for running an inference using the ONNX Runtime. They start by converting the encoded inputs into tensors. They then create the model input from the input IDs and the attention mask, building up the nested list structure the model expects. After creating a named ONNX value, they run the model and discuss options for configuring the ONNX Runtime inference session, including different graph optimizations and execution providers. Finally, they retrieve the output, which in this case only has one value.

  • 01:30:00 In this section of the transcript, the speaker is going through the code for using a model with the ONNX Runtime. They explain how they named the labels to match the ONNX model and how they can run a sanity test to see if everything works. They set a breakpoint to step through the code and check if the input, attention mask, and IDs are right. Once the input is correct, they load the model, create their tensor, session, and inference. Then they explain that they need to post-process the output to turn it back into a usable result, and they go looking for reference code for that step.

  • 01:35:00 In this section of the video, the speaker discusses the process of processing two values obtained from a classification model to determine the positive and negative sentiment of a given sentence. They demonstrate the use of a tokenizer to tokenize the sentence and obtain its tokens, which they use to confirm their understanding of how to perform the process in C#. They also mention the importance of consulting model documentation and selecting the correct tokenizer to ensure accurate tokenization.

  • 01:40:00 In this section, the hosts of the AI Show Live discuss the Optimum project from Hugging Face that implements optimizations for machine learning, including accelerators for training and different hardware integrations using the ONNX runtime on the back end. The hosts also review the pre-processing steps for the tokenizer and creating the session for the text classification model. They explore the encoded version of a sentence and reuse some previously written code to create the session for their model.

  • 01:45:00 In this section, the presenter prepares for model inference by exporting the model and processing the input data. They confirm that the correct tokenizer was used to tokenize the input data by performing a sanity check of the encoded tokens. However, they find that the input mask is missing and go back to examine the model and code to pinpoint the issue. Despite uncertainty around the tokenizer used, they confirm that the encoded tokens are correct and proceed to generate the input mask.

  • 01:55:00 In this section of the video, the presenter is trying to set up the inputs and outputs for running the model. They encounter some issues with the input mask and attention mask, and ultimately realize that they can just grab the tokens and send it in without having to do any extra processing. They then switch their focus to the model input, which is a bit more complicated since it requires two inputs and needs to specify the shape for the batch. The presenter uses the ONNX Runtime to set up the inputs and outputs and tests it to see if it produces the same results as the C# model.
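
A compact Python equivalent of the text-classification inference walked through above (the stream does it in C#); the ONNX file name and the assumption of a single output containing two logits are illustrative:

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
session = ort.InferenceSession("text_classifier.onnx")           # assumed exported model

enc = tokenizer("I love ONNX Runtime", return_tensors="np")
feed = {i.name: enc[i.name] for i in session.get_inputs() if i.name in enc}

logits = session.run(None, feed)[0][0]                           # two values: negative, positive
probs = np.exp(logits) / np.exp(logits).sum()                    # softmax over the two classes
print({"negative": float(probs[0]), "positive": float(probs[1])})
```
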
AI Show Live - Episode 62 - Multiplatform Inference with the ONNX Runtime
  • 2022.07.29
  • www.youtube.com
Join Cassie Breviu as she takes us on a tour of what the ONNX Runtime can do when it comes to inference AND on multiple platforms to boot.
 

Applied Machine Learning with ONNX Runtime




Jennifer Looper, a Principal Education Cloud Advocate at Microsoft, discusses the convergence of app building, machine learning, and data science in this video. She recommends building smart apps for the web and explores various JavaScript APIs, including ml5.js, Magenta.js, PoseNet, and Brain.js, for incorporating machine learning technology into apps. Looper emphasizes the usefulness of scikit-learn for classic machine learning and recommends it as a powerful tool that avoids the heavyweight machinery of neural networks. She also discusses ONNX, which defines a common set of operators for building machine learning and deep learning models, and ONNX Runtime, which optimizes their training and inferencing, and sources data from Kaggle to explain the process of performing a basic classification task using supervised machine learning. The speaker then demonstrates how to build a recommendation engine using machine learning models and suggests visiting Microsoft's online resources for learning more about machine learning. She concludes that ONNX Runtime is suitable for beginners as part of their curriculum or for anyone who wants to learn more about machine learning.

  • 00:00:00 In this section, Jen Looper, a Principal Education Cloud Advocate at Microsoft, discusses the convergence between app building and machine learning and data science. She explains the challenges of creating mobile apps today, particularly in the face of new apps that are infused with intelligence and run machine learning algorithms in the background. Looper contends that this new demand for intelligence-first experiences has contributed to the challenges facing indie app developers.

  • 00:05:00 In this section, the speaker discusses how to go about building smart apps and which architectural decisions and technical stack should guide this process. Options include building a native app, building for the web, or building for the desktop. The speaker recommends sticking to the web for building smart apps and explains that, despite the differences in skill sets between web developers and machine learning engineers, there are ways for these fields to converge. The speaker illustrates the collaboration between developers and machine learning engineers, citing the use of devops, data sourcing and cleaning, training, and iteration, as well as the ML Ops team that ensures accurate delivery and continuous improvement of machine learning models.

  • 00:10:00 Drawing on personal anecdotes, the speaker explains how bridging divides in machine learning engineering and creating web apps can be less daunting than overcoming a fear of bridges. The speaker introduces various tools for incorporating machine learning technology into web apps, including TensorFlow.js, Brain.js, and ONNX. She highlights the benefits of each tool and encourages viewers to explore the TensorFlow website to discover the cool demos they offer. She also focuses on ONNX Runtime and its ability to bring ONNX-based models into web apps. Overall, the speaker aims to provide app developers with knowledge of the available tools for enhancing their apps with machine learning technology.

  • 00:15:00 In this section of the video, the speaker discusses various JavaScript APIs that can be used with pre-trained models to explore and create artificial intelligence in the browser. One of these is ml5.js, which is built on top of TensorFlow and provides examples on image recognition and sound analysis. Another API mentioned is Magenta.js, which uses pre-trained models to create music and art in the browser. The speaker also talks about PoseNet, which can be used to estimate single or multiple poses for the whole body, face, or just the hand. Finally, the speaker introduces Brain.js, which allows neural networks to run in JavaScript on browsers and Node.js.

  • 00:20:00 In this section, the speaker discusses the limitations of using JavaScript and the browser for training machine learning models, suggesting that it is not a great environment for this purpose and is better suited for running off-the-shelf models or retraining existing ones. They recommend that for more robust and proper machine learning models, Python is the way to go, and programmers can learn enough Python to be dangerous and work with Jupyter Notebooks for training their models. They also discuss using services such as Lobe.ai for training on images and other media.

  • 00:25:00 In this section, the speaker discusses an alternative to using TensorFlow for machine learning, scikit-learn. The speaker explains that not all machine learning problems require neural networks, and they created a curriculum on GitHub that does not use neural networks. They also show how they used scikit-learn to create a cuisine recommending application, powered by an ONNX runtime, allowing users to input ingredients and receive suggestions for a type of cuisine they can make with those ingredients. The speaker emphasizes the usefulness of scikit-learn for classic machine learning and recommends it as a powerful tool without the heavy solution of neural networks.

  • 00:30:00 In this section of the video, the speaker introduces scikit-learn, an open-source machine learning framework that provides examples and documentation for basic machine learning tasks such as classification, regression, and clustering. They explain that scikit-learn is a popular framework among data scientists and is accessible to everyone as it comes bundled with numpy, scipy, and matplotlib. The speaker then discusses ONNX Runtime, the inference engine for the Open Neural Network Exchange (ONNX) format, which optimizes training and inferencing by defining a common set of operators for building machine learning and deep learning models. ONNX Runtime supports a variety of frameworks, tools, and runtimes, and enables AI developers to use their preferred framework with their chosen inference engine. The speaker outlines a typical machine learning workflow that involves cleaning data using Python, training models using scikit-learn, and converting models for use with ONNX Runtime using the skl2onnx library.

  • 00:35:00 In this section of the video, the speaker sources data from Kaggle about different types of cuisines and explains how to clean and balance the data. The dataset contains 100 potential ingredients that are classified as Indian, Thai, Korean, Japanese, or Chinese. To build the model, the speaker explains that you need to pick an algorithm, a classifier and a solver to optimize the results. The data set is small, with only around 700 data points. The data is labeled, meaning that supervised learning can be used, and the speaker stresses the importance of understanding the data before shaping and cleaning it for use in machine learning applications.

  • 00:40:00 In this section, the presenter discusses the process of performing a basic classification task using supervised machine learning and choosing a multi-class classification algorithm. The presenter presents a cheat sheet for multi-class classification and rules out neural networks due to the nature of the data set and training locally. The two multi-class classification algorithms that remain are logistic regression and decision forests. The presenter chooses logistic regression and opts for one versus rest to handle multi-class classification. The presenter then explains the importance of picking the correct solver and picks the liblinear solver. The presenter trains the model by calling `lr.fit` and tests its accuracy using a recipe with cilantro, onions, peas, potatoes, tomatoes, and vegetable oils, reporting a 71% accuracy. The presenter also shows the Scikit-learn algorithm cheat sheet to aid in the selection of an appropriate algorithm based on the amount of data and labels available.

  • 00:45:00 In this section, the speaker discusses the importance of choosing the right classification algorithm when building a machine learning model. They demonstrate how they experimented with different classifiers and solvers to see which gave the best accuracy. After choosing the Support Vector Classifier (SVC), they rebuilt the model and tested its accuracy. Once they were satisfied with the model's accuracy, they converted it into an ONNX file and used the ONNX Runtime to build a web application. The speaker explains how they created a simple web application with a series of checkboxes to feed ingredient data to an imported model and used an asynchronous function to start the inference. They then demonstrated the web app and checked what the model suggested.
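
A hedged end-to-end sketch of the train → convert → infer flow described here, using dummy ingredient data in place of the Kaggle cuisine dataset (the web app in the talk runs the converted model with onnxruntime-web instead):

```python
import numpy as np
import onnxruntime as ort
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
from sklearn.svm import SVC

# Dummy stand-in for the cuisine data: rows of 0/1 ingredient flags plus cuisine labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(700, 100)).astype(np.float32)
y = rng.choice(["indian", "thai", "korean", "japanese", "chinese"], size=700)

clf = SVC(probability=True).fit(X, y)                 # the talk settles on an SVC

# Convert the scikit-learn model to ONNX with skl2onnx.
onx = convert_sklearn(clf, initial_types=[("input", FloatTensorType([None, X.shape[1]]))])
with open("cuisine.onnx", "wb") as f:
    f.write(onx.SerializeToString())

# Run the converted model with ONNX Runtime.
sess = ort.InferenceSession("cuisine.onnx")
print(sess.run(None, {"input": X[:1]})[0])            # predicted cuisine for one recipe
```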

  • 00:50:00 This section of the video demonstrates how to build a recommendation engine using machine learning models. By inputting a list of ingredients, the model suggests what kind of cuisine one might be able to create. In addition, the speaker invites viewers to learn more about machine learning through Microsoft's online resources which offer free content on various topics including clustering, regression, classification, reinforcement learning, natural language processing, and time series applications. The speaker encourages viewers to ask questions and engage with the content on the website mycommworld.com.

  • 00:55:00 In this section, the question is asked if Onnx Runtime is a good tool for beginners. The answer is yes, as one of the speakers is a self-taught web developer who believes that anyone can learn anything if they try hard enough. Therefore, Onnx Runtime is suitable for beginners as part of their curriculum or for anyone who wants to learn more about machine learning.
Applied Machine Learning with ONNX Runtime
  • 2021.12.02
  • www.youtube.com
This conference talk was delivered via the One Million Arab Coders initiative. It gives an overview of what applied ML is, how we need to bridge the divide b...
 

Bring the power of ONNX to Spark as it never happened before




In this video, Shivan Wang from Huawei explains how to bring the power of ONNX to Spark for inference. He discusses the challenges in deploying DL models on Spark and how the Spark community has initiated a proposal called SPIP to simplify the process. The speaker also discusses Huawei's Ascend AI processor and the Ascend AI ecosystem, which includes multiple Ascend processor models and Atlas hardware. He suggests adding CANN as a new execution provider in ONNX Runtime so that ONNX models can run on Ascend hardware directly, without the need for model translation. Finally, he mentions that the POC code for bringing the power of ONNX to Spark is almost complete and welcomes interested users to leave a message to discuss and potentially provide resources for testing purposes.

  • 00:00:00 In this section, Shivan Wang of Huawei discusses how to bring the power of ONNX to Spark for inference. He explains that the well-defined DataFrame inference interface is very friendly to data engineers, who can easily load data and complete the feature engineering. However, there are gaps between the AI frameworks and Spark's internal representations, making the deployment of DL models on Spark difficult. To simplify the process, the Spark community has initiated a discussion on a proposal called SPIP, which would provide a simple API to make Spark-based AI inference seamless. Finally, by executing the ONNX inference in the Spark executor, users can easily run ONNX inference on big data with the help of the ONNX runtime.
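
To make the idea concrete, here is a hedged PySpark sketch of running ONNX Runtime inside the executors with mapInPandas; the column names, model path, and input name are illustrative and not the SPIP API from the talk:

```python
import numpy as np
import onnxruntime as ort
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("onnx-on-spark").getOrCreate()
df = spark.createDataFrame(pd.DataFrame({"f0": [0.1, 0.5], "f1": [0.2, 0.9]}))

def run_onnx(batches):
    # One session per task; "model.onnx" must be reachable from every executor.
    sess = ort.InferenceSession("model.onnx")
    for pdf in batches:
        x = pdf[["f0", "f1"]].to_numpy(dtype=np.float32)
        pdf["prediction"] = sess.run(None, {"input": x})[0].ravel()
        yield pdf

result = df.mapInPandas(run_onnx, schema="f0 double, f1 double, prediction double")
result.show()
```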

  • 00:05:00 In this section, the speaker discusses Huawei's AI processor called Ascend and the Ascend AI ecosystem, which includes multiple Ascend processor models and Atlas hardware. The software layer of the Ascend ecosystem is called CANN, and it provides APIs to help developers quickly build AI applications and services based on the Ascend platform. To run an ONNX model on Ascend hardware today, the user must first use a model translation tool provided by CANN to translate the model from ONNX to the Ascend format. However, the speaker suggests that a better solution is to add CANN as a new execution provider in ONNX Runtime, so that users can use ONNX models on Ascend hardware directly, without the need for model translation. They plan to finish all ONNX operator support by the end of this year and ensure all models in the ONNX Model Zoo work well on Ascend, followed by further development next year.

  • 00:10:00 In this section, the speaker mentions that the POC code for bringing the power of ONNX to Spark is almost complete and basic operations such as the add operation can run correctly. They also invite interested users to leave a message to discuss and potentially provide resources for testing purposes. The section concludes by thanking viewers for watching.
Bring the power of ONNX to Spark as it never happened before
  • 2022.07.13
  • www.youtube.com
Both data processing platforms and deep learning frameworks are evolving in their own fields. Usually, Spark is used for offline data processing, and then va...
 

Builders Build #3 - From Colab to Production with ONNX




The video illustrates the process of deploying a project from Colab to production by using ONNX. The presenter covers various aspects such as pre-processing signals, modifying code for deployment, creating a handler on AWS Lambda, accepting audio input on a website, uploading a function to S3, and deploying dependencies for ONNX. Despite encountering some difficulties, the speaker successfully deploys their model with AWS and suggests loading a base64-encoded file object from the browser, or reading bytes with a sound-file library, as next steps.

Additionally, the video showcases the use of the SimCLR model for contrastive learning in audio, building a catalog of songs by feeding them into the model, and training it with PyTorch to attain zero loss and recall at k=1. The presenter discusses the challenges of using PyTorch in production and proposes ONNX as a solution. The video demonstrates how to export and load the PyTorch model in ONNX format and execute inference. It also shows how to process audio files using Torch Audio and Numpy libraries and troubleshoots issues when setting up a PyTorch model for deployment. The video offers insights on how to shift models from development in Colab notebooks to production environments.

  • 00:00:00 In this section, the speaker discusses a simple framework for contrastive learning of visual representation using the SimCLR model, which involves sampling two random transformations from a set of different transformations applied to an image, resulting in two different images (x tilde i and x tilde j), which are then passed into an encoder (ResNet-50) to give two vectors (h i and h j) passed to a projection function (MLP) to return two projections (z i and z j). The goal is to maximize the cosine similarity of the two projections using a contrastive loss to learn f and g so that the output of the model's two projections are very close together. The speaker applies this framework to audio, where the input is not an image but a signal transformed into a spectrogram, and uses transformations such as a low-pass filter and a change of playback speed.

  • 00:05:00 In this section, the presenter discusses the implementation of the NT-Xent loss that was used for the model, which is a cross-entropy loss that involves a pair of positives and the sum of negative pairs. They also talk about cheating a little bit during evaluation by using the same test data as training data and using recall at k as the metric. Finally, they mention that they found an implementation of the loss function on the PyTorch Lightning site and it worked fine when tested on dummy data.

  • 00:10:00 In this section, the speaker explains how they built a catalog of songs by feeding them into the model in order to get a set of vectors representing each song, which were then saved with their corresponding titles, audio signals, and index representations. They then computed similarities by taking the dot product of the features and catalog index, normalized the vectors, and computed recall at k to determine the best matches. They used a normal training loop in PyTorch and added the LARS optimizer, which helped the model converge to zero loss and achieve recall at k equal to one, meaning it consistently predicted the correct song. The speaker then discusses the challenge of using PyTorch in production and presents a solution using ONNX, a format that allows for seamless integration with other frameworks and deployment on different devices.
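
The catalog lookup amounts to normalized dot products plus an argmax; a small numpy sketch of that step (array shapes and titles are illustrative):

```python
import numpy as np

def best_match(query_proj: np.ndarray, catalog_projs: np.ndarray, titles: list[str]) -> str:
    """Return the catalog title whose projection has the highest cosine similarity."""
    q = query_proj / np.linalg.norm(query_proj)
    c = catalog_projs / np.linalg.norm(catalog_projs, axis=1, keepdims=True)
    sims = c @ q                      # dot product of normalized vectors = cosine similarity
    return titles[int(np.argmax(sims))]

# Example: 3 songs with 128-dim projections; the query is a noisy copy of song "b".
catalog = np.random.randn(3, 128)
print(best_match(catalog[1] + 0.01 * np.random.randn(128), catalog, ["a", "b", "c"]))
```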

  • 00:15:00 In this section, the speaker discusses the use of ONNX, a lightweight format and inference runtime that allows models exported from frameworks like PyTorch and TensorFlow to be hosted in the cloud. The speaker intends to build a python handler function that will leverage ONNX to run inference on audio data, get the spectrogram from the audio, run the model on the spectrogram and return a JSON object with predictions. The speaker notes that to use ONNX, the model needs to be saved with a .onnx file extension and explains the process for exporting the model from PyTorch to ONNX.

  • 00:20:00 In this section, the speaker explains how to export a model from PyTorch to ONNX format using the torch.onnx.export function. The dummy input allows the ONNX format to understand the expected shape of the input file, and output and input names are specified using a dictionary or JSON object. The speaker provides an example of exporting a model named simclr with the current timestamp, uses export_params to store the trained parameter weights inside the model file and shows how to add a lambda function to retrieve the timestamp of the model.
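
A hedged sketch of that export call; the stand-in model, dummy input shape, and tensor names are illustrative, not the exact SimCLR module from the video:

```python
import time
import torch

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(64 * 64, 128))  # stand-in model
dummy_input = torch.randn(1, 1, 64, 64)        # tells ONNX the expected input shape

torch.onnx.export(
    model,
    dummy_input,
    f"simclr_{int(time.time())}.onnx",         # timestamped file name, as in the video
    export_params=True,                        # store the trained weights inside the file
    input_names=["input"],
    output_names=["projection"],
    dynamic_axes={"input": {0: "batch"}},      # accept a variable batch size
)
```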

  • 00:25:00 In this section, the video creator explores how to load and run inference on a model using ONNX. They discuss creating an inference session and loading the model from a model path. The creator plans to incorporate the model into an API but is unsure of how to get the audio in the handler. They consider using base64 encoding and create a new file to test it out. They then move on to discussing how to run inference without loading a model and decide to focus on that instead.

  • 00:30:00 In this section, the speaker discusses modifying the code to use torch audio instead of using a numpy array. The speaker discovers that they can use torch audio and installs it to move forward. They then discuss modifying the input and output names and call the output "projection" to do cosine similarity against their library. The library is set up as a JSON object with an array of titles, an array of waveforms, and an array of projections which the speaker intends to use in their cosine similarity calculations.

  • 00:35:00 In this section, the speaker is seen writing code and explaining the process aloud. They write a script to sort a list of songs in a music library, discussing various techniques like matrix multiplication, similarities, and sorting algorithms along the way. The speaker also loads a JSON library and utilizes it in the script. The video is part of a series on taking software from Colab to production using the ONNX framework.

  • 00:40:00 In this section, the presenter demonstrates how to use ONNX to produce and save a catalog in JSON format. The catalog is created from arrays, and the presenter checks the shapes before dumping the catalog as a JSON file using the `dump()` function. The ensuing error is resolved by changing the `catalog` to a copy of `library`. The presenter then converts the arrays into lists using the `tolist()` method and downloads the JSON file. Finally, the presenter shows how to load the saved JSON file using the `load()` function.

  • 00:45:00 In this section, the speaker discusses their approach to processing audio files from file paths using Torch Audio and Numpy libraries in Python. They explain that they have downloaded Torch Audio and will use its "preprocess signal" method to resample and preprocess the audio file. They then use Numpy's mean function to calculate the mean of the processed signal while keeping the dimensions, followed by padding the signal using np.pad. Overall, the speaker provides a clear and concise explanation of their processing method.
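
A minimal numpy sketch of the mono-mixdown and padding steps described here; the target length is an assumption, not the value used in the video:

```python
import numpy as np

def preprocess_signal(signal: np.ndarray, target_len: int = 22050 * 5) -> np.ndarray:
    """Mix (channels, samples) audio down to mono, keeping dims, then pad to target_len."""
    mono = np.mean(signal, axis=0, keepdims=True)        # (channels, n) -> (1, n)
    pad = max(0, target_len - mono.shape[1])
    return np.pad(mono, ((0, 0), (0, pad))).astype(np.float32)
```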

  • 00:50:00 In this section of the video, the speaker is trying to get a spectrogram of a waveform using a function. However, there seem to be some issues with the function not recognizing the target sample rate and resulting in an error message. The speaker tries using a test clip to troubleshoot and prints the shape of the spectrogram, which returns an unexpected output. It is unclear what exactly is causing the issues, whether it's an input or model problem.

  • 00:55:00 In this section of the video, the speaker evaluates some of the error messages that appeared while trying to set up a PyTorch model for deployment, and identifies issues with the Torch package's size and incompatible dependencies. They note that a file loaded with Torch Audio specifies itself as needing over 1GB of memory, potentially leading to issues when running models with large file inputs. To resolve this, they suggest loading files with librosa instead, replacing Torch Audio for loading audio and frequency-related conversions such as resampling. Overall, this section focuses on highlighting some of the issues that can arise when transitioning models from development in Colab notebooks to deployment in production environments.
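
The librosa replacement boils down to a single call; a sketch assuming a 22,050 Hz target rate (the actual rate used in the video may differ):

```python
import librosa

# librosa loads, mixes to mono, and resamples in one step, avoiding the heavy torchaudio dependency.
signal, sample_rate = librosa.load("clip.wav", sr=22050, mono=True)
print(signal.shape, sample_rate)
```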

  • 01:00:00 In this section, the speaker installs Torch and Torch audio to continue building the project, which still has a small size of less than 500 megabytes. They use Librosa and resample to ensure that the signal returns with the original sampling rate and a target sampling rate. They encounter some errors when running the script and realize that they need to mix different channels to ensure that the result is not weird. They continue troubleshooting to ensure that the project works as expected.

  • 01:05:00 In this section, the speaker is working with signal pre-processing and loading different channels. They encounter an issue where the signal only has one channel and they need to expand it more than once. The speaker uses numpy's squeeze and expand_dims to solve the issue, ultimately committing the changes.

  • 01:10:00 In this section, the speaker is modifying the code to allow for the deployment of the SpeakFluent REST API, which is a function that would handle a request and then run inference. They modify the handler to get a request that has the audio file and file name, and then save the file locally. They run inference using the ONNX runtime without Torch, and then return the best match.

  • 01:15:00 In this section, the speaker discusses the source python handler for the application, which currently returns an array of song titles and their corresponding best match titles. The speaker also mentions future additions to the handler that will include returning the URI for an S3 bucket and the match percentage for the best match. The speaker plans to take a break and will come back to building an actual handler that can upload and deploy to an AWS Lambda function.

  • 01:25:00 In this section of the video, the presenter explains how to create a handler on AWS Lambda using a script and an event. The Lambda handler is designed to take an event and not a request, and it will receive the audio part of the form from the event. The presenter explains that the process involves creating a form on the front end and getting the ID from the audio input.

  • 01:30:00 In this section of the video, a developer discusses the process of accepting audio input on a website using Javascript and uploading the file to a server for processing. The developer explains that they will add a recording button to the website and use Javascript to create a file and automatically upload it to the server. They then discuss calling inference on the uploaded file and returning a 200 status code with a body using JSON.dumps. The developer also considers using base64 encoding of the audio instead of a file for increased efficiency. They explore the process of running ONNX inference with Python in AWS using Lambda runtime.
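
A hedged sketch of the kind of Lambda handler being built here; the event shape, model path, input name, and response fields are all illustrative:

```python
import base64
import json

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("simclr.onnx")        # loaded once per warm Lambda container

def lambda_handler(event, context):
    # Assumed event shape: {"audio": "<base64-encoded float32 waveform>"}
    audio = base64.b64decode(event["audio"])
    signal = np.frombuffer(audio, dtype=np.float32).reshape(1, 1, -1)
    projection = session.run(None, {"input": signal})[0]
    # ... compare the projection against the song catalog and pick the best match ...
    return {"statusCode": 200, "body": json.dumps({"best_match": "<title>"})}
```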

  • 01:35:00 In this section, the speaker discusses the process of uploading a function to S3 and creating a new function from scratch for the project named Diva. The speaker expresses annoyance at the process but proceeds to create a new function using Python 3.9, weighing the ARM versus x86 architecture choice and also considering uploading files directly to the function.

  • 01:40:00 In this section, the speaker explores the Lambda functions dashboard and demonstrates how to upload a zip file, change event formats to JSON and create a new event. They also discuss the need for additional configurations for memory and environment variables. The speaker then attempts to add a recording with base64 audio but encounters an issue with available options.

  • 01:45:00 In this section, the speaker appears to be copying and pasting a big JSON file and saving it as an AWS Lambda deployment package. They mention wanting to use a different method, but ultimately remove it and decide to use the Lambda handler instead. However, they need to install some things and try to figure out a way to upload the package to AWS. They also discuss exporting functions.

  • 01:50:00 In this section, the speaker discusses the steps to compile and deploy the necessary dependencies for ONNX, the open-source model interchange format and its runtime. They explain that it is important to compile in a similar environment to Lambda to avoid compatibility issues and suggest finding the ONNX runtime folder and copying it to the deployment package. While this process can be considered "nasty," the speaker explains that it is necessary to ensure the code will work properly. They then list the necessary dependencies, including librosa, scipy, and numpy, and discuss the size of the deployment package, which can be up to 200 megabytes.

  • 01:55:00 In this section, the speaker discusses the steps they need to take to deploy their model with AWS. They need to create a lightweight Lambda handler, figure out how to deploy it within AWS Lambda's size limits, and decide whether to store it in S3 since it is larger than 50 megabytes. The next step is to update the processing function to take a base64 audio object instead of a file, and the speaker considers loading a base64-encoded file object from the browser or reading bytes with a sound-file library to achieve this. They conclude by stating that they think they can just do that.
Builders Build #3 - From Colab to Production with ONNX
  • 2022.03.21
  • www.youtube.com
Last week, we built a Shazam clone using Self-supervised learning (SimCLR).Let's get the model out of Colab and run inference in production with ONNX!I have ...
 

Combining the power of Optimum, OpenVINO™, ONNX Runtime, and Azure




The video showcases the combination of Optimum, OpenVINO, ONNX Runtime, and Azure to simplify the developer's workflow and improve the accuracy and speed of their models. The speakers demonstrate the use of helper functions, ONNX Runtime, and the OpenVINO Execution Provider to optimize deep learning models. They also show how to optimize Hugging Face models using quantization in the Neural Network Compression Framework and illustrate the training and inference process using Azure ML, Optimum, ONNX Runtime, and OpenVINO. The demonstration highlights the power of these tools in improving the performance of models while minimizing the loss of accuracy.

  • 00:00:00 In this section of the video, Cassie talks to representatives from Intel about the OpenVINO toolkit and ONNX Runtime. The OpenVINO toolkit uses advanced optimization techniques specifically designed for Intel hardware to boost the performance of deep learning models. With the ONNX Runtime library and a simple modification to the inference session line of code, developers can use Intel's OpenVINO Execution Provider to accelerate inferencing of ONNX models. The demo shown in the video showcases the accelerated performance of YOLOv7, a popular deep learning model, on an Intel CPU.

  • 00:05:00 In this section of the video, the speaker discusses the various helper functions used in the demo to perform pre-processing, assign color values to specific labels, and read and reprocess images. The demo utilizes ONNX Runtime to create an inference session and run the inference task, and the OpenVINO Execution Provider to speed up the deep learning models on an Intel CPU. The process is simplified by making a simple modification to the code line, which involves installing ONNX and OpenVINO libraries, importing the ONNX Runtime library, and setting the OpenVINO provider. The speaker also gives a brief architecture overview of how the ONNX model gets converted into an in-memory graph representation and goes into the graph partitioner, which queries the backend for supported operators.
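
The "simple modification" is essentially one argument on the session constructor; a sketch assuming the onnxruntime-openvino package is installed and a YOLOv7 ONNX file is at hand:

```python
import onnxruntime as ort

session = ort.InferenceSession(
    "yolov7.onnx",
    providers=["OpenVINOExecutionProvider", "CPUExecutionProvider"],  # fall back to the CPU EP
)
print(session.get_providers())   # confirms which execution providers are active
```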

  • 00:10:00 In this section, the speakers discuss how to optimize Hugging Face models using quantization in the Neural Network Compression Framework. They walk through a code example showing how to enable quantization during training using Optimum Intel and the OVConfig. They also showcase an AI workflow that integrates data preparation, model training, inference, deployment, and automation to help developers and customers perform complex activities more efficiently. The speakers demonstrate how to use Azure ML to support these workflows for better performance.

  • 00:15:00 In this section of the video, the speaker discusses the process of training and inference using Azure ML, Optimum, ONNX Runtime, and OpenVINO. They begin by discussing the files and scripts used for the training pipeline and how to submit the job to Azure ML. They then move on to discuss the inference script and how it utilizes ONNX Runtime and the OpenVINO execution provider. The speaker provides details on the F1 score results for the quantization and training of the model, demonstrating that there was only a minor loss of accuracy during this process. Overall, the section provides a detailed overview of the process of training and inference using these technologies.

  • 00:20:00 In this section, the speaker demonstrates how the quantization process works by showing the FP32 original model and the INT8 optimized model, which has been visualized through Netron. They also discuss how Azure ML and OpenVINO can be leveraged to improve accuracy and performance during the training and inference process. They mention using ONNX Runtime to further optimize and improve performance, and encourage viewers to check out the code and blog post for more information. Overall, the demonstration showcases the power of combining multiple tools to simplify the developer's workflow and improve the accuracy and speed of their models.
Combining the power of Optimum, OpenVINO™, ONNX Runtime, and Azure
  • 2023.01.27
  • www.youtube.com
Devang Aggarwal, and Akhila Vidiyala from Intel join Cassie Breviu to talk about Intel OpenVINO + ONNX Runtime. We'll look at how you can optimize large BERT...
 

Faster Inference of ONNX Models | Edge Innovation Series for Developers | Intel Software




The OpenVINO Execution Provider for ONNX Runtime is discussed in this video. It is a cross-platform machine learning model accelerator that allows for the deployment of deep learning models on a range of Intel compute devices. By using the OpenVINO toolkit, which is optimized for Intel hardware, and setting the provider as the OpenVINO Execution Provider in the code, developers can accelerate inference of ONNX models with advanced optimization techniques. The video emphasizes the simplicity of the modification required to utilize the tools discussed.

Faster Inference of ONNX Models | Edge Innovation Series for Developers | Intel Software
  • 2022.11.30
  • www.youtube.com
Join Ragesh in his interview with Devang Aggarwal, a product manager at Intel with Intel’s OpenVINO™ AI framework engineering team doing work around deep lea...
 

Faster and Lighter Model Inference with ONNX Runtime from Cloud to Client




In this video, Emma from Microsoft Cloud and AI group explains the Open Neural Network Exchange (ONNX) and ONNX Runtime, which is a high-performance engine for inferencing ONNX models on different hardware. Emma discusses the significant performance gain and reduction in model size that ONNX Runtime INT8 quantization can provide, as well as the importance of accuracy. She demonstrates the end-to-end workflow of ONNX Runtime INT8 quantization and presents the results of a baseline model using PyTorch quantization. Additionally, Emma discusses ONNX Runtime's ability to optimize model inference from cloud to client and how it can achieve a size of less than 300 kilobytes on both Android and iOS platforms by default.

  • 00:00:00 In this section, Emma, a Senior Program Manager in the AI Framework Team at Microsoft Cloud and AI group, explains ONNX and ONNX Runtime's role in the AI software stack. ONNX, which stands for Open Neural Network Exchange, is a standard format for representing both traditional machine learning models and deep learning neural networks. ONNX Runtime is a high-performance engine for inferencing ONNX models on different hardware. ONNX converters and ONNX Runtime are the major pieces in the workflow to operationalize an ONNX model, which can be produced from any framework using ONNX converter tools. There are many popular frameworks that support ONNX, including PyTorch, TensorFlow, and Caffe.

  • 00:05:00 In this section, the benefits and features of ONNX Runtime are discussed. ONNX Runtime is a high-performance inference engine for ONNX models that offers APIs for a variety of languages and hardware accelerations for CPUs, GPUs, and VPUs. ONNX Runtime is also open and extensible, allowing for easy optimization and acceleration of machine learning inference. It has already been integrated into multiple internal and external platforms and has been powering many flagship products. One of the newest and most exciting features of ONNX Runtime is INT8 quantization for CPU, which approximates floating-point numbers with lower-bit values, reducing model size and memory use and improving performance. Benchmark results on various models and hardware show significant speedups using ONNX Runtime.

  • 00:10:00 In this section, the speaker discusses the significant performance gain of ONNX Runtime INT8 quantization, which can accelerate inference performance by up to three times on a big machine and around 60 percent on smaller machines, as well as reduce the model size by almost four times. The speaker also emphasizes the importance of accuracy and provides an example of how ONNX Runtime quantized models can maintain similar accuracy as FP32 models on a common NLP task. The speaker then demonstrates the end-to-end workflow of ONNX Runtime INT8 quantization, which involves converting models to ONNX format, using the quantization tool to obtain an INT8 model, and then performing inference in ONNX Runtime. Finally, the speaker presents the results of a baseline model using PyTorch quantization and evaluates performance using the tokenizer and evaluation functions from Hugging Face.
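
The quantization step itself is a short call to ONNX Runtime's quantization API; a sketch assuming an already-exported FP32 ONNX file (file names are illustrative):

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="model_fp32.onnx",      # FP32 model converted from PyTorch/TensorFlow
    model_output="model_int8.onnx",     # roughly 4x smaller INT8 model
    weight_type=QuantType.QInt8,
)
```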

  • 00:15:00 In this section, the speakers discuss the ONNX Runtime quantization process for optimizing model performance and size. The process involves an optimization step before quantization, which is only needed for transformer models (this step is included in the sketch below). Once optimized, the model can be quantized into an 8-bit format using ONNX Runtime's quantization API, resulting in a much smaller model. Performance and accuracy results show that ONNX Runtime quantization outperforms PyTorch quantization in terms of F1 score. Another notable feature of ONNX Runtime is the ability to minimize the runtime size for on-device inference on smartphones and edge devices.

  • 00:20:00 In this section, ONNX Runtime's ability to optimize model inference from cloud to client is discussed. Two major techniques enable ONNX Runtime Mobile: a new optimized format (the ORT format) and building ONNX Runtime with only the operators needed by the predefined models, which reduces the size of the runtime. Getting rid of unused operators shrinks the runtime significantly, making on-device inference feasible within tight memory requirements. The core ONNX Runtime Mobile package can be reduced to less than 300 kilobytes on both Android and iOS platforms by default. ONNX Runtime is an open-source project with tutorials and examples available in its GitHub repo.
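
A minimal sketch of the INT8 workflow summarized above, assuming an FP32 transformer model has already been exported to model.onnx (for example with torch.onnx.export); the file names, input names, and head/hidden sizes below are illustrative assumptions, not values from the talk:

```python
# Rough sketch of the ONNX Runtime INT8 quantization workflow described in the video.
import numpy as np
import onnxruntime as ort
from onnxruntime.quantization import quantize_dynamic, QuantType
from onnxruntime.transformers import optimizer

# 1. Optional graph-optimization pass, only needed for transformer models.
#    model_type / num_heads / hidden_size are illustrative BERT-base-like values.
opt = optimizer.optimize_model("model.onnx", model_type="bert",
                               num_heads=12, hidden_size=768)
opt.save_model_to_file("model-opt.onnx")

# 2. Dynamic INT8 quantization: weights are stored as 8-bit integers.
quantize_dynamic("model-opt.onnx", "model-int8.onnx", weight_type=QuantType.QInt8)

# 3. Inference with the quantized model in ONNX Runtime.
session = ort.InferenceSession("model-int8.onnx", providers=["CPUExecutionProvider"])
# Input names are assumed here; list session.get_inputs() to see the real ones.
input_ids = np.random.randint(0, 30000, size=(1, 128), dtype=np.int64)
attention_mask = np.ones((1, 128), dtype=np.int64)
outputs = session.run(None, {"input_ids": input_ids,
                             "attention_mask": attention_mask})
print(outputs[0].shape)
```

Dynamic quantization converts the weights to 8-bit integers ahead of time and quantizes activations on the fly, which is how it achieves the roughly 4x size reduction mentioned above without needing a calibration dataset.
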
Faster and Lighter Model Inference with ONNX Runtime from Cloud to Client
  • 2020.10.16
  • www.youtube.com
ONNX Runtime is a high-performance inferencing and training engine for machine learning models. This show focuses on ONNX Runtime for model inference. ONNX R...
 

Fast T5 transformer model CPU inference with ONNX conversion and quantization



Fast T5 transformer model CPU inference with ONNX conversion and quantization

By converting the T5 transformer model to ONNX and applying quantization, it is possible to shrink the model to roughly a third of its size and speed up inference by up to 5x. This is particularly useful for deploying a question-generation model such as T5 on a CPU with sub-second latency. Additionally, a Gradio app provides a simple visual interface for the model. The T5 model from Hugging Face is used, and the FastT5 library handles the ONNX conversion and quantization (a minimal usage sketch follows the chapter list below). These optimizations can yield significant cost savings for production deployments of such systems.

  • 00:00 Introduction and Agenda

  • 01:07 Install the transformers library from hugging face

  • 02:18 Download Hugging face model

  • 02:40 Sample of Generating a question

  • 04:00 Gradio app deployment in GUI

  • 08:11 Convert T5 Pytorch to ONNX & Quantize with FastT5

  • 17:22 Store the model in the drive

  • 18:30 Run the Gradio App with New model

  • 21:55 Future episode & Conclusion
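
A minimal sketch of the FastT5 flow covered at 08:11, assuming the fastT5 package's export_and_get_onnx_model helper; the model name and prompt are illustrative placeholders rather than the exact question-generation checkpoint used in the episode:

```python
# Rough sketch: convert T5 to ONNX, quantize it, and run CPU generation with fastT5.
from fastT5 import export_and_get_onnx_model
from transformers import AutoTokenizer

model_name = "t5-small"  # placeholder checkpoint for illustration
model = export_and_get_onnx_model(model_name)  # exports to ONNX and applies quantization
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "translate English to German: The forecast says it will rain tomorrow."
inputs = tokenizer(prompt, return_tensors="pt")
tokens = model.generate(input_ids=inputs["input_ids"],
                        attention_mask=inputs["attention_mask"],
                        num_beams=2)
print(tokenizer.decode(tokens.squeeze(), skip_special_tokens=True))
```
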
Fast T5 transformer model CPU inference with ONNX conversion and quantization
  • 2021.04.26
  • www.youtube.com
Question Generation using NLP course link: https://bit.ly/2PunWiW The Colab notebook shown in the video is available in the course. With the conversion of T5 t...
 

Azure AI and ONNX Runtime



Azure AI and ONNX Runtime

The talk covers various aspects of machine learning and its deployment: the evolution of data science, the challenges of framework compatibility, the use of Azure AI and ONNX Runtime for model deployment, the creation of ML environments, and the limitations of ONNX Runtime. The speaker emphasizes ONNX's standardization and its support for multiple frameworks, which makes it easier to optimize for different hardware. The video also mentions the absence of a benchmark for hardware preferences and the need to combine ONNX with other tools to work around its limitations.

  • 00:00:00 In this section of the transcript, the speaker discusses the evolution of data science and how it has transformed from a science of laboratory work to a world of interconnectedness. The speaker shares his experience working with IoT systems and how they have evolved from being created by hand to using cloud services. The speaker also emphasizes the importance of being able to work in environments where the use of cloud services is not allowed and how specialized companies are needed in these situations. Finally, the speaker addresses the challenges of having to change frameworks or cloud providers and explains why customers often switch providers instead of changing frameworks.

  • 00:05:00 In this section, the speaker talks about the issue of compatibility between different AI frameworks and how it affects businesses. Using the example of a bank, he explains that if a company built its system on one AI framework and a new customer then wanted to use a different framework, the company would have to rebuild the system from scratch, costing both time and money. He then discusses ONNX Runtime, which lets businesses convert their existing frameworks and models to a compatible format without a complete rebuild. The speaker also mentions tools available to analyze and optimize these converted models.

  • 00:10:00 In this section, the speaker explains how Azure AI and ONNX Runtime can be used to deploy machine learning models across different platforms easily. By selecting the appropriate options for their platform and language, businesses can load their serialized neural network and use the platform and language of their choice to export the system for easy deployment. The session also covers how ONNX can be used to optimize training throughout the development process, leading to faster and more accurate outcomes. The speaker also introduces an automatic optimization system from Intel for GPUs and CPUs, which streamlines model development further.

  • 00:15:00 In this section, the speaker discusses the creation of their ML environment and the development of a classifier for plant species based on the length and width of sepals and petals. The speaker notes that creating an ML environment used to mean buying clusters of servers and configuring everything manually, whereas now they can create their own ML environment and launch their own studio without any hardware. Their ML environment includes a standard virtual machine and ONNX, which is used to save the TensorFlow models. The speaker then demonstrates building a simple neural network to classify plant species from the given parameters (a sketch of this export-to-ONNX step appears after this list).

  • 00:20:00 In this section, the speaker shows how to load a saved ONNX model and run a prediction on it. She imports the TensorFlow framework, loads the ONNX model, and looks up the unique input name for the input values. She then builds an input feed of random values to run the ONNX session, executes the call, and reads off the predicted output category. The ONNX model created earlier is a single self-describing file that needs no extra documentation, which makes it easier for developers to use (see the inference portion of the sketch after this list).

  • 00:25:00 In this section, the speaker explains how Azure AI and ONNX Runtime make it easy to integrate machine learning models into various applications. With Azure AI, customers only need to create their model in their preferred language; they can then use Azure AI to load the model and create the input data needed to invoke it. ONNX Runtime can then be used to embed the model in different applications, such as Xamarin, Android, or Mac, regardless of the underlying platform, making it possible to run machine learning models on a wide range of devices. The speaker also notes that ONNX Runtime optimizes for various processors, including those in mobile devices.

  • 00:30:00 In this section, the speaker explains that ONNX has become the de facto standard for machine learning models, as it supports all major frameworks and is supported by many companies. This standardization allows for easier optimization for different hardware, without the need for manual optimization as in the past. Additionally, ONNX is not limited to neural networks and can also be used for other algorithms. The speaker also notes that ONNX supports the same mathematical functions and operations across different platforms, allowing for seamless deployment across different hardware and platforms as long as the supported opsets match.

  • 00:35:00 In this section, the speaker discusses the performance limitations of ONNX Runtime. While ONNX is a great tool that works well for general use cases, it doesn't exploit the hardware to its full potential, and other tools such as PyTorch may be more beneficial there. However, if the focus is on exporting the model, ONNX can replace other tools entirely. The speaker also explains that ONNX doesn't handle machine-to-machine communication, such as multi-machine rendering, and suggests using other tools alongside ONNX to overcome these limitations.

  • 00:40:00 In this section, the speaker discusses the absence of a benchmark for hardware preferences and notes that most benchmarks can be found on the hardware manufacturers' websites. They also note that many companies now write for the hardware rather than the other way around. The speaker then mentions the most commonly used platforms for acceleration, including Core ML, ONNX Runtime, and RT, and explains that the training phase is accelerated, which makes the result easier to sell to clients once complete. The speaker suggests that few changes are forthcoming in the next few years and that Xeon and similar platforms will remain prominent.
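
A minimal sketch of the demo summarized in the 00:15:00 and 00:20:00 sections above: build a small classifier on iris-style measurements, save it as a single ONNX file, then load it with ONNX Runtime and predict on random input. The layer sizes, file names, and the choice of tf2onnx as the converter are illustrative assumptions rather than details confirmed by the talk:

```python
# Rough sketch: tiny Keras classifier -> ONNX export -> ONNX Runtime prediction.
import numpy as np
import tensorflow as tf
import tf2onnx
import onnxruntime as ort

# Tiny Keras classifier: 4 measurements in (sepal/petal length and width), 3 species out.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Placeholder training data standing in for the iris measurements.
X = np.random.rand(150, 4).astype(np.float32)
y = np.random.randint(0, 3, size=150)
model.fit(X, y, epochs=5, verbose=0)

# Export the trained Keras model to a single ONNX file.
spec = (tf.TensorSpec((None, 4), tf.float32, name="input"),)
tf2onnx.convert.from_keras(model, input_signature=spec, output_path="iris.onnx")

# Load the ONNX file with ONNX Runtime and run a prediction on random input.
session = ort.InferenceSession("iris.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name
sample = np.random.rand(1, 4).astype(np.float32)
probs = session.run(None, {input_name: sample})[0]
print("predicted class:", int(np.argmax(probs, axis=1)[0]))
```

The exported .onnx file carries its own graph and input/output metadata, which is what makes the "single file that needs no extra documentation" point above possible.
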
Azure AI and ONNX Runtime
  • 2023.03.09
  • www.youtube.com
Language: Italian. Abstract: We will see how to create an AI model with Azure, turn it into ONNX, and use it in our .NET services natively with ONNX Runtime. A...