Learning ONNX for trading - page 5

 

ONNX Community Day! Streamed live on Jun 24, 2022

This event is being hosted in-person at the brand-new Microsoft Silicon Valley Campus on Friday, June 24th.

The event will cover ONNX Community updates, partner and user stories, and plenty of community networking.



ONNX Community Day!

Brief summary:

  • 00:00:00 - 01:00:00 The YouTube video "ONNX Community Day!" discusses updates and improvements to the ONNX community's work on interoperability and flexibility for developers working with machine learning models. The ONNX community works under open governance, and the three categories of tools, creation, running, and visualization, support the community's engagement and usage of ONNX. The video provides progress reports on different aspects, such as updates to the ONNX specifications, new operators, and improvements in converters. The speaker also highlights the benefits of ONNX, including the wider range of customers for hardware vendors and access to multiple frameworks and hardware accelerators for users. The future of ONNX includes the notion of ONNX functions to provide an executable specification.

  • 01:00:00 - 02:00:00 The ONNX Community Day event discusses multiple topics related to ONNX, including ONNX Model Zoo and ONNX Tutorials, which provide pre-trained machine learning models and demos to use with ONNX models. The video highlights the work of the ONNX Preprocessing Working Group, which aims to standardize data preprocessing operations to improve model deployment. The speakers also discuss the basics of neural network quantization and how TensorRT supports quantized networks through various fusions, including post-training quantization and quantization aware training. They also delve into the limitations of ONNX in representing low precision quantization and suggest a strategy to extend its representational power using clipping to induce precision between the quantized and dequantized nodes. Finally, the video delves into a case study on the accuracy of a quantized and fine-tuned TensorFlow saved model.

  • 02:00:00 - 03:00:00 The ONNX Community Day showcased numerous speakers discussing the importance of metadata in machine learning models and Java Virtual Machine (JVM) support in ONNX. The speakers emphasized using hardware-based technologies to protect data and highlighted the compatibility of ONNX with various machine learning libraries, including DeepJ and Deep Java Library. They demonstrated the use of byte buffers for better efficiency and discussed the importance of standardizing metadata for responsible and explainable AI. The presentations also featured success stories, including a Chinese bank whose OCR runtime was improved using ONNX and ONNX Runtime. The ONNX community is working on a model-hub workflow covering metadata creation, querying, and filtering, with support for machine-readable metadata in ONNX. Overall, the presentations highlighted the strengths of the ONNX platform and the community's commitment to its development.

  • 03:00:00 - 04:00:00 The "ONNX Community Day!" video covers various updates and features related to the ONNX ecosystem. This includes discussion on simulating quantization to reduce the accuracy drop between quantized and pre-trained models, deploying TensorFlow models trained with NVIDIA's toolkit onto an ONNX graph using TensorRT, and improvements made to ONNX Runtime, such as optimized tensor shape, execution providers, and support for mobile platforms. Additionally, updates to ONNX itself were discussed, including support for the XNNPACK library and the creation of ONNX Runtime extensions for pre/post-processing tasks. The video also introduces the library Optimum, which focuses on accelerating transformer models from training to inference.

  • 04:00:00 - 05:00:00 The ONNX Community Day included discussions on various topics related to the ONNX runtime and its use cases. Speakers described the features of the ONNX runtime package, the PyTorch ONNX converter, and custom ops in PyTorch. They also discussed use cases such as process monitoring and digitization of commerce, as well as challenges associated with model deployment and compatibility testing. Throughout the event, it was emphasized that the ONNX runtime can help improve performance and reduce deployment size, but compatibility and selection are essential to ensure consistent quality and speed.

  • 05:00:00 - 06:00:00 The ONNX Community Day featured several speakers discussing various tools and techniques used to optimize and deploy machine learning models using the ONNX framework. NVIDIA discussed their approach to improving image quality and model compatibility using block splitting for inference, as well as their ONNX-specific tools for debugging and modifying models. Qualcomm explained how they have integrated AI accelerators into their designs using ONNX as an interchange format, and introduced their software stack that includes the ONNX runtime and various tools for optimization and deployment. Additionally, several speakers discussed optimizing ONNX models using techniques such as model optimization, checkpointing, and shape inference. The event highlighted the versatility and scalability of ONNX for various device use cases, and encouraged contributions to continue growing the ONNX project. The last part is focused on simplifying the process of deploying machine learning models to Spark using the SPIP proposal, which aims to hide the complexities of data processing, conversion, and model initialization for developers. The Ascend AI ecosystem and its processors were introduced, including the software layer "CANN" that provides APIs for building AI applications and services. The CANN software stack was discussed, and the roadmap for adding it as a new execution provider for ONNX runtime was presented. The event ended with roundtable discussions on topics such as ONNX for Mobile and Edge, model deployment, training and operations, conversions, and operators, followed by a happy hour and survey feedback.

The detailed timeline summary:
  • 00:15:00 Create the model in the framework you prefer and then export it to a common format that can be used by other frameworks and tools. This enables greater interoperability and flexibility for developers working with machine learning models. Additionally, ONNX is beneficial for hardware vendors who want to create optimized runtimes for machine learning models, as they can now focus on supporting the common set of operators defined by ONNX rather than having to support multiple different frameworks.
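
As a rough illustration of the "train in your framework, export to ONNX" step described above (not taken from the talk), here is a minimal PyTorch export sketch; the model, input shape, and file name are made-up placeholders.

```python
# Minimal sketch of exporting a framework-native model to the common ONNX format.
# The model, input shape, and file name are illustrative assumptions.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
model.eval()

dummy_input = torch.randn(1, 4)          # example input used to trace the graph
torch.onnx.export(
    model,
    dummy_input,
    "tiny_classifier.onnx",              # file any ONNX-capable runtime can load
    input_names=["features"],
    output_names=["logits"],
    dynamic_axes={"features": {0: "batch"}, "logits": {0: "batch"}},
    opset_version=16,
)
```

Any ONNX-capable runtime or vendor toolchain can then consume tiny_classifier.onnx without knowing which framework produced it.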

  • 00:20:00 In this section, the speaker discusses the benefits of using ONNX, which enables access to multiple frameworks and hardware accelerators for users, as well as a wider range of customers for hardware vendors. The ONNX development is done by the community under open governance, meaning there's no single company controlling it. The speaker also highlights the working groups, which include architecture and infra, operators, converters, the model zoo and tutorials, and a new pre-processing working group. The speaker goes on to outline the three categories of tools, creation, running, and visualization of ONNX models, and provides some statistics from the last six months, such as an increase in the number of PRs, contributors, stars, and monthly downloads, reinforcing the community's engagement and usage of ONNX.

  • 00:25:00 In this section, the speaker discusses the releases and updates that have happened in the ONNX community since the last community update. ONNX 1.11 was released earlier this year, introducing new and updated operators along with the ONNX Model Hub, which allows users to pull pre-trained models from different model zoos. Additionally, utilities like the compose utility and the function builder were introduced, along with bug fixes and infrastructure improvements. ONNX 1.12 was recently introduced with more new operators, shape inference enhancements, and support for Python 3.10. The speaker also discusses the ONNX roadmap process and 12 roadmap requests that were selected for further progress and assigned to the working groups. These requests include the implementation of new operators for data pre-processing, a C API for ONNX, and more.
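
To make the Model Hub and compose utility mentioned above concrete, here is a hedged sketch using the onnx Python package; the model name must match the hub manifest, and the commented merge uses a hypothetical preprocessing model.

```python
# Hedged sketch of the ONNX Model Hub and the compose utility.
# The model name is illustrative and must exist in the hub manifest.
import onnx
from onnx import hub, compose

resnet = hub.load("resnet50")            # pulls a pre-trained model from the ONNX Model Zoo
print(len(resnet.graph.node), "nodes in the downloaded graph")

# onnx.compose can stitch two models together by wiring outputs to inputs,
# e.g. a preprocessing graph feeding the classifier (preprocess_model and the
# tensor names below are hypothetical):
# merged = compose.merge_models(
#     preprocess_model, resnet,
#     io_map=[("preprocessed_image", "data")],
# )
```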

  • 00:30:00 In this section of the video, the speaker discusses the progress made in addressing the need for more structured quantized information to flow through the model and the tensors. The proposal for the end-to-end pipeline with ONNX operators is undergoing further refinement, as it has been identified as a longer-term effort. There has been some progress with the converters so far, with higher-level function ops gaining support. The speaker also touches on different areas, such as the need for more volunteers, since this is a community project, and a request for companies to have more people join the effort. The speaker lists different resources such as the website, GitHub, Slack channels, and the ONNX calendar.

  • 00:35:00 In this section, the speaker discusses recent updates and improvements to ONNX. There have been two recent releases with various updates, such as improved shape inference for operators and more stable handling of invalid and optional inputs. Furthermore, the speaker highlights two important additions: the model composer and the function builder utilities. The model converter has also become more stable, and the team plans to provide better support for mixed releases in the future. Overall, it is impossible to list all the improvements made by contributors, but they continue to work towards improving ONNX.

  • 00:40:00 Rama from the Operator SIG gave a summary of recent changes and updates to the ONNX specification. The focus of the Operator SIG is to evolve the set of operators that make up the ONNX specification, adding new operators and clarifying their specs. In the last two releases, new operators such as grid sample and layer normalization were introduced. Existing operators such as the scatter op were updated to support duplicate indices, while some ops were extended to support types such as bfloat16 and optional types. Rama also mentioned plans to promote some of the new operators to become functions soon.

  • 00:45:00 In this section, the speaker discusses the plans for the future of ONNX and the trade-off between having a compact specification and supporting new kinds of models that require more ops in the spec. The solution to this challenge is the notion of ONNX functions, which provide an executable specification for an op and allow for a balance between conflicting requirements. The speaker mentions plans to reduce the set of primitive operators by promoting them into functions and enabling authoring in Python using a subset called ONNX Script. Examples of functions, such as the GELU activation function and the dropout op, are given to illustrate how the use of control flow makes it easier to naturally and compactly specify their semantics.
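
To make the idea of an executable function specification concrete, here is a hedged sketch that builds a GELU-like function with onnx.helper. This is not the official spec body, and the domain and names are assumptions; ONNX Script offers a friendlier way to author the same thing in Python.

```python
# Hedged sketch: an ONNX function is a named subgraph that serves as an
# executable specification for an op. Domain, name, and formula details are
# illustrative, not the official GELU definition.
import math
import onnx
from onnx import helper, TensorProto

nodes = [
    helper.make_node("Constant", [], ["inv_sqrt2"],
                     value=helper.make_tensor("c0", TensorProto.FLOAT, [], [1.0 / math.sqrt(2.0)])),
    helper.make_node("Constant", [], ["one"],
                     value=helper.make_tensor("c1", TensorProto.FLOAT, [], [1.0])),
    helper.make_node("Constant", [], ["half"],
                     value=helper.make_tensor("c2", TensorProto.FLOAT, [], [0.5])),
    helper.make_node("Mul", ["X", "inv_sqrt2"], ["scaled"]),
    helper.make_node("Erf", ["scaled"], ["erf"]),
    helper.make_node("Add", ["erf", "one"], ["erf_plus_one"]),
    helper.make_node("Mul", ["X", "erf_plus_one"], ["prod"]),
    helper.make_node("Mul", ["prod", "half"], ["Y"]),
]

# Positional arguments: domain, function name, inputs, outputs, nodes, opset imports.
gelu_fn = helper.make_function(
    "custom.example", "GeluApprox", ["X"], ["Y"], nodes,
    [helper.make_opsetid("", 16)],
)
print(gelu_fn.name, "defined with", len(gelu_fn.node), "primitive nodes")
```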

  • 00:50:00 In this section, Kevin Chen gives an update on the Converter SIG and the work done since the last meeting. He discusses the front-end converter updates, including the PyTorch, TensorFlow, and sk-learn to ONNX converters. For the PyTorch converter, the latest release supports ONNX exports up to ONNX opset 16, and there have been new features added, such as the ability to export neural network modules specifically as ONNX local functions. Chen also goes over the back-end converter updates, such as the ONNX-TensorRT and ONNX-TensorFlow converters. Finally, Chen presents the roadmap for the Converter SIG and encourages people to get involved.

  • 00:55:00 In this section, the speaker discusses updates and improvements for the sklearn to ONNX converter, the ONNX-TensorRT converter, and the ONNX-TensorFlow converter. Users are recommended to update to the latest versions for improved user experience when converting models. The roadmap for the Converter SIG includes goals such as improving community-driven tooling, standardizing utility functions, and improving operator and opset support. Users are encouraged to join the ONNX converters channel on Slack or subscribe to the ONNX Converter SIG mailing list to get involved with the community and provide feedback.

  • 01:00:00 Jackie from Microsoft introduces ONNX Model Zoo and ONNX Tutorials. ONNX Model Zoo is a collection of pre-trained, state-of-the-art machine learning models, mostly contributed by the ONNX community. There are currently 168 models in the zoo, including 40 ONNX models and 35 vision-based ONNX models for image classification and object detection. ONNX Tutorials provide documents and notebooks demonstrating ONNX in practice for different scenarios and platforms. The Model Zoo has seen several improvements since the last workshop, including new quantized models from Intel, increased test coverage with routine CI testing, fixed broken test datasets, and a collaboration with the Hugging Face team to create a web interface for demoing models.

  • 01:05:00 The speakers discuss the availability of tutorials and demos for using ONNX, including a website that allows for easy processing of images and ONNX models by writing just a few lines of Python code. They also discuss the future roadmap for the ONNX Model Zoo, with plans to enable more models to be run by ORT and to incorporate more contributions, including quantized models and training example models. Additionally, they highlight the work of the ONNX Preprocessing Working Group, which is focused on making it easier to preprocess data for use with ONNX models.

  • 01:10:00 In this section of the video, the speaker discusses the lack of standardization in data pre-processing pipelines, highlighting differences between popular libraries like Pillow and OpenCV in image pre-processing. These disparities can lead to accuracy issues when deploying models on different platforms. The speaker introduces the ONNX group's goal of standardizing data pre-processing operations to avoid ambiguity and improve model deployment. The group has been working to develop infrastructure to include data pre-processing in models, such as the development of composition utilities and the sequence map operator for batch processing. The ONNX group is also investigating ways to tag the pre-processing part of a model for identification by back-ends. Additionally, the group is proposing extensions to the resize operator, including an optional anti-aliasing filter and a keep aspect ratio policy.
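
The proposed Resize extensions described above later shipped as attributes on the Resize operator (opset 18). A hedged sketch of constructing such a node with onnx.helper, with tensor names as assumptions:

```python
# Sketch of the extended Resize operator: an anti-aliasing filter plus a
# keep-aspect-ratio policy, available as attributes from opset 18 onward.
# Tensor names are assumptions.
from onnx import helper

resize = helper.make_node(
    "Resize",
    inputs=["image", "", "", "target_sizes"],   # roi and scales left empty, explicit sizes given
    outputs=["resized"],
    mode="linear",
    antialias=1,                                # apply an anti-aliasing filter when downscaling
    keep_aspect_ratio_policy="not_larger",      # fit within target_sizes, preserving aspect ratio
)
print(resize)
```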

  • 01:15:00 The speakers discuss the implementation of a proposed center-crop-or-pad operation, which offers a higher level of abstraction and relies on the existing pad and slice operators. They encourage viewers to join their Slack channel and monthly meetings to share ideas. The following presentation is given by Joaquin Anton, who recaps the goal of the ONNX preprocessing working group and shares the recent work they've been doing. Marvin from Italy also introduces himself and his work as a developer and data scientist in the field of natural language processing.

  • 01:20:00 The speaker discusses the importance of checking ONNX documentation before beginning to work on a project. They explain that not all models can be easily converted or optimized for ONNX, and it's important to ensure that the operation functions required for the project are implemented in the framework being used. Additionally, the speaker advises against the assumption that the best performance options will always optimize a model, as sometimes these options can actually reduce accuracy. Overall, it's important to carefully consider the architecture of the project and check ONNX documentation and tools like ONNX optimizer to avoid errors and ensure successful deployment on the cloud or devices.

  • 01:25:00 In this section, Jiraj Perry from Nvidia discusses the basics of neural network quantization and how TensorRT supports quantized networks through various fusions. He explains that quantization is the process of converting continuous values into a discrete set of values using linear or non-linear scaling techniques, which can offer faster inference and lower memory footprint. However, there might be trade-offs with accuracy. Jiraj also mentions the different quantization schemes and the importance of quantization parameters or q params. He then introduces post-training quantization (PTQ) and quantization aware training (QAT) and how they can determine q params.
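
As a plain NumPy illustration of the linear quantization and q-params (scale and zero point) described above, not tied to any particular toolkit:

```python
# Uniform (linear) quantization: continuous values are mapped to a discrete
# integer grid via a scale and zero point; dequantization maps them back.
import numpy as np

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    return np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([-1.0, -0.1, 0.0, 0.25, 1.3], dtype=np.float32)
scale = np.abs(x).max() / 127.0            # symmetric range derived from observed statistics
q = quantize(x, scale, zero_point=0)
print(q, dequantize(q, scale, 0))          # the rounding error is the accuracy trade-off
```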

  • 01:30:00 In this section, the video discusses post training quantization (PTQ) and quantization aware training (QAT). PTQ involves running a pre-trained model on a calibration data set and collecting layer-wise statistics to determine the dynamic range for each layer for computing quantization parameters. QAT introduces qdq nodes at desired layers and fine-tunes the graph for a small number of epochs to learn model or quantization parameters. PTQ is generally faster and has less control over final accuracy, while QAT is slower but provides more accuracy control. The video also highlights differences in approaches between Google's TF Mod toolkit and NVIDIA's TF2 quantization toolkit built on top of TF Mod.

  • 01:35:00 In this section, the speaker discusses the differences between Nvidia's quantization toolkit and tf mod in terms of where qdq nodes are placed. Nvidia's toolkit places qdq nodes at the inputs and weights of a layer in the network, while tf mod recommends placing them at the weights and the outputs of a layer. The speaker also describes how TensorRT optimizes models through layer fusions, such as pointwise fusions and convolution-pooling fusions, along with dedicated fusions for qdq nodes. Additionally, TensorRT's graph optimizer performs qdq propagation to move q and dq nodes to ensure that the maximum portion of the graph runs in INT8. The fusion examples presented include average pool quantization and element-wise addition fusion. Finally, the speaker examines the quantization fusion in a residual block.

  • 01:40:00 In this section, the speaker explains the importance of adding qdq nodes at the identity branch in order to get the best performance from TensorRT on residual operations. The resulting fusions look like the dq nodes of the weights and inputs being propagated beyond the add and fused with q nodes after the add layer. The speaker stresses the need for qdq nodes in the original graph to get the best performance for your model and warns that not using them properly can lead to poor model performance. The speaker concludes by inviting a discussion on how to insert qdq nodes in the TensorFlow toolkit.
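
For reference, a hedged sketch of what an explicit Q/DQ pair looks like when constructed with onnx.helper; in a full graph the scale and zero-point tensors would be added as initializers, and the names and values here are assumptions.

```python
# Hedged sketch of an explicit QuantizeLinear/DequantizeLinear pair as it
# appears in a QAT ONNX graph; TensorRT pattern-matches and fuses such pairs.
from onnx import helper, TensorProto

scale = helper.make_tensor("residual_scale", TensorProto.FLOAT, [], [0.02])
zero_point = helper.make_tensor("residual_zp", TensorProto.INT8, [], [0])

q_node = helper.make_node(
    "QuantizeLinear", ["residual_in", "residual_scale", "residual_zp"], ["residual_q"])
dq_node = helper.make_node(
    "DequantizeLinear", ["residual_q", "residual_scale", "residual_zp"], ["residual_dq"])

# Placing a pair like this on the identity (skip) branch as well is what lets
# the whole residual block run in INT8 after fusion.
```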

  • 01:45:00 In this section, the speaker acknowledges a technical difficulty with navigating the slides and assures the audience that it will be fixed shortly. They then move on to discussing a case study on accuracy after obtaining a quantized and fine-tuned TensorFlow saved model. The audience is invited to ask any questions before taking a short break.

  • 01:50:00 In this section, the speaker discusses the concept of quantization and how it can be used to represent low precision in quantizing neural networks. Quantization is a combination of two functions: the quantize function, which maps floating-point values to integer values, and the dequantize function, which maps integer values back to a floating-point representation. The combination of these two functions is referred to as fake quantization. This process allows for the mapping of representations to an integer-only representation. The use of uniform quantization, particularly with reduced precision, allowed for 1.7 billion samples per second with less than three microseconds of latency on the RFSoC platform.

  • 01:55:00 In this section, the speaker discusses the limitations of ONNX in representing low precision quantization, especially below eight bits, and suggests a strategy to extend the representational power of ONNX by leveraging clipping to induce precision between the quantized and dequantized nodes. This strategy adds an extra clipping function that is supported over integer boundaries and does not affect retrocompatibility with existing libraries and tools. However, this strategy extends only as far as quantized linear allows to go as an operator, and it has some limitations regarding different types of roundings. The speaker also mentions their efforts on a quantized ONNX dialect (QONNX), which represents fake quantization in just one node while extending to a wider set of scenarios with input broadcasts, binary quantization options, and more. This format is leveraged as part of their deployment efforts on FPGAs, and their tools, such as QONNX and QCDQ, integrate with existing quantization libraries and have gained adoption in the FPGA community.
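
A NumPy sketch of the clipping strategy described above (an illustration, not the speaker's code): a clip between the quantize and dequantize steps restricts values to a narrower bit-width while staying expressible with standard 8-bit Q/DQ operators.

```python
# Fake quantization at sub-8-bit precision via an extra clip between the
# quantize and dequantize steps; the bit-width and scale are illustrative.
import numpy as np

def fake_quantize_low_precision(x, scale, bits):
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1   # e.g. [-8, 7] for 4 bits
    q = np.round(x / scale)
    q = np.clip(q, qmin, qmax)       # the clip is what induces the lower precision
    return q * scale                 # dequantize back to float ("fake" quantization)

x = np.linspace(-1.0, 1.0, 9).astype(np.float32)
print(fake_quantize_low_precision(x, scale=0.125, bits=4))
```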

  • 02:00:00 In this section, Daniel from Mithril discusses how they have developed a solution called "Blind AI" which allows for the deployment of ONNX models within secure enclaves that leverage hardware to protect data. By using hardware-based technologies, this solution provides isolation and encryption of the memory content of the enclave which prevents any dumping attempt from the outside. Data is decrypted inside the enclave, and any malicious insider can't access the data, which is a huge advantage for the data owner when it comes to privacy and security. Blind AI is an open-source solution that is easy to onboard, and it is effortless for the AI provider to maintain and sell this solution.

  • 02:05:00 In this section, the speaker discusses the ability to deploy AI models with privacy guarantees, using the Python SDK to securely upload the models and send data for analysis without giving third parties access to the data. ONNX's expressivity is also highlighted, which enables it to cover various use cases, including baggage screening, analyzing medical documents, and facial recognition. The speaker also presents different models used in practice and their speed inside and outside the enclave, with reasonable added latency due to the protection it provides. Additionally, ONNX requires a minimal code base, making it better for security reasons and able to reinforce each operator, ensuring secure enclave usage. The presentation concludes with information on their GitHub, how they can cover various scenarios, and the opportunity to delve into the technical security details.

  • 02:10:00 In this section, the speaker discusses the proposal for enabling machine-readable AI metadata in ONNX, which involves tracking provenance and other relevant characteristics of a model to identify how it moves over time and evolves given a specific use case. The proposal was initially presented to the ONNX steering committee in October of 2020, and now the team wants to extend the proposal further to include the creation, querying, and visualization of metadata as part of an end-to-end design for model hubs and zoos. The speaker emphasizes the importance of metadata as a key enabler of responsible and explainable AI and highlights its usefulness in narrowing down failure modes and identifying pain points for AI systems.
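
ONNX already carries free-form key/value metadata on the model protobuf; a hedged sketch of attaching provenance information that way (keys, values, and file names are illustrative):

```python
# Sketch of attaching provenance metadata to an ONNX model via the existing
# metadata_props field; keys, values, and the file name are illustrative.
import onnx

model = onnx.load("tiny_classifier.onnx")
for key, value in {
    "training_dataset": "internal-reviews-2022-03",
    "license": "Apache-2.0",
    "parent_model": "resnet50-v1",
}.items():
    prop = model.metadata_props.add()
    prop.key, prop.value = key, value
onnx.save(model, "tiny_classifier_with_metadata.onnx")
```

The working group's proposal goes further than these flat key/value pairs by making the metadata machine-readable and queryable.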

  • 02:15:00 In this section of the ONNX Community Day presentation, the speakers discuss the importance of metadata in models and the potential for using RDF for a more machine-readable and standardized approach to representing metadata. They explain how this approach can help establish relationships among entities, maintain transparency, and track provenance, answering questions about what caused a model to have lower accuracy than expected. The speakers also discuss the power of querying metadata using SPARQL, and explain how models with RDF-formatted metadata can provide information beyond what a simple model card can offer.
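
A hedged sketch of the SPARQL-over-RDF querying idea using rdflib; the vocabulary, URIs, and properties are assumptions, not the working group's actual schema:

```python
# Querying RDF-formatted model metadata with SPARQL via rdflib.
# The vocabulary and property names are invented for illustration.
from rdflib import Graph

ttl = """
@prefix ex: <http://example.org/modelmeta#> .
ex:model1 ex:trainedOn ex:datasetA ; ex:accuracy 0.91 .
ex:model2 ex:trainedOn ex:datasetB ; ex:accuracy 0.84 .
"""
g = Graph()
g.parse(data=ttl, format="turtle")

# Which models fell below an expected accuracy, and what were they trained on?
results = g.query("""
    PREFIX ex: <http://example.org/modelmeta#>
    SELECT ?model ?dataset WHERE {
        ?model ex:trainedOn ?dataset ; ex:accuracy ?acc .
        FILTER (?acc < 0.9)
    }
""")
for row in results:
    print(row.model, row.dataset)
```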

  • 02:20:00 In this section, the speaker discusses the FAIR guiding principles, a controlled vocabulary for making data and digital assets findable, accessible, interoperable, and reusable. The principles were identified by the semantic web community and include fairness, trustworthiness, and sustainability. RDF-encoded metadata can be queried against this vocabulary to discover models suitable for NLP tasks, identify the creators and size of models, and track the carbon footprint of models to classify and sort them. The extensibility of RDF and its query language, SPARQL, allows for extension well beyond the initially selected vocabulary. This can enable tracking of responsible AI properties and mixed-precision models.

  • 02:25:00 In this section, the presenters discuss the querying and filtering capabilities of ONNX Community Day. They showcase how the metadata author can identify models trained with datasets containing private or personal information by using metadata tags. The presenters also demonstrate how extended filtering capabilities enable users to query models with mixed precision. They highlight the visualization of model profiles and explainable AI techniques for displaying metadata efficiently. The presenters call for action to consider a concrete design around model creation and consumption that covers entire metadata creation, querying, and filtering from hubs workflow with support for machine-readable metadata in ONNX. They are currently preparing a strawman implementation and exploring technologies for metadata creation.

  • 02:30:00 In this section, Adam Pocock from Oracle Labs discusses the importance of supporting the Java Virtual Machine (JVM) with ONNX. While most ML applications are written in languages other than Python, such as Java, bringing machine learning to these languages is crucial. The ONNX runtime Java API was developed by Oracle Labs to incorporate machine learning in Java and other languages, with features such as minimal performance impact and ease of deployment. Adam also provides a code example to demonstrate the similarities between the Java API and other APIs.

  • 02:35:00 In this section, the speaker discusses how to feed data and run a model by using ONNX tensor, a byte buffer representation for a stream of bytes. Although it's possible to use regular arrays in Java, the speaker recommends using byte buffers due to their zero copy path, which allows for better efficiency in handling data. The speaker also notes that Java's multi-dimensional arrays are not optimal for machine learning because they are not flat and involve a lot of pointer chasing. The speaker further discusses plans for upgrading to a newer version of Java, adding new features, and building out to match the ONNX runtime tree. Additionally, the speaker introduces an open-source library that writes ONNX models from Java, which is available anywhere on the JVM.

  • 02:40:00 In this section, the speakers discuss the ONNX toolset's compatibility with new machine learning libraries such as DeepJ and how it integrates with ONNX runtime to provide top-notch performance. DeepJ creates an abstraction layer over various deep learning libraries, abstracting all the necessary libraries and providing several operator backends for the machine learning engines to use, such as Apache MXNet, TensorFlow, PyTorch, ONNX, PaddlePaddle, and more. They are also exploring ways to standardize the metadata in this toolset to emit standard metadata formats while continuing to expand the operator enumeration.

  • 02:45:00 In this section, the speaker discusses the benefits of the Deep Java Library, which includes a set of pre-trained models that cover tasks like image classification, object detection, sentiment analysis, and action recognition. The library is service-ready and has gone through rigorous testing to perform with the best possible speed and memory control, as demonstrated by its successful use with DHL for over half a year without any errors. Additionally, the speaker shares several use cases in which ONNX and ONNX runtime were used to achieve significant performance gains and reduce latency. One success story features a Chinese bank that was able to bring its OCR models' runtime down from one second to less than 400 milliseconds on a single image. Furthermore, the speaker introduces the Hybrid Engine concept, which allows for loading two engines at the same time and provides a smooth transition between them.

  • 02:50:00 In this section, the speaker explains a method using a direct buffer to send pointers from Java directly to ONNX runtime, which avoids data copying and provides a performance boost. They also introduce the ND Manager, a tree-like architecture implemented in the DJL library to provide more cost-effective memory collection. The speaker discusses how customers can transition from using PyTorch to ONNX runtime without changing a single line of code. Later, the speaker from Hype Factors talks about their media intelligence company and how they chose to base their infrastructure around the JVM for its developer experience, ecosystem of reusable components, and high scalability.

  • 02:55:00 In this section, the speaker discusses the technical aspects of their media analytics website, including the use of the JVM for powering most of the website's functionality and the migration to a system that enriches all incoming data. With a few billion GPU inferences per day, the product's features depend heavily on machine learning, and model management has become an essential, business-critical part of keeping everything up and running. The data encompasses all sorts of formats, including HTML and PDFs, and it is enriched on the fly at runtime with named entity recognition, salience, sentiment, and more. There have been many challenges along the way, including conversion errors and a rare memory leak in DJL, which took a while to solve.

  • 03:00:00 In this section, a speaker discusses their experience with the ONNX ecosystem and the challenges they faced while using it. They talk about the need to match up drivers and set up monitoring to ensure the system keeps running smoothly. They also mention their future plans to increase GPU efficiency, add more models, and improve the overall robustness of the system through GPU-accelerated testing. The speaker invites any questions and discussion on this use case and takes the opportunity to thank everyone. The video will be resumed after lunch with a replay of the NVIDIA talk.

  • 03:30:00 In this section of the video, the speakers discuss how to simulate quantization and store the final q parameters in order to reduce accuracy drop between the quantized model and the pre-trained model. One way to perform quantization-aware training is to use the TensorFlow model optimization toolkit or the toolkit built by Nvidia, which offers features such as quantizing layers using layer name and class attributes and pattern-based quantization. The speakers note that Nvidia's toolkit uses a symmetric quantization variant that offers the best performance for a QAT model on a GPU using one extension.

  • 03:35:00 In this section, we learn about the process of deploying a model trained using NVIDIA's TF2 Quantization Toolkit onto an ONNX graph using TensorRT. The workflow involves quantizing a pre-trained TensorFlow 2.0 model with NVIDIA's toolkit, fine-tuning it for a small number of epochs, and converting it into an ONNX graph using the TF2ONNX converter. Then TensorRT's APIs are used to generate a TensorRT engine from the ONNX graph. We see that quantization-aware training provides an alternative to deploy deep neural networks in lower precision, and QAT models might be less prone to accuracy drop during inference compared to PTQ models, since the model parameters are fine-tuned. Finally, the experiments with ResNet models show that INT8 accuracy is on par with FP32 baseline accuracy, and latency is more than 10x lower compared to the FP32 counterparts.

  • 03:40:00 In this section, Ryan Hill, a software engineer working on ONNX runtime since its creation, talks about the features and usage of ONNX runtime. ONNX runtime is a runtime for ONNX models, fully cross-platform and with language bindings for many programming languages. Microsoft uses it across all of their major product groups such as Windows, Office and Azure, while there are over 160 models in production with ONNX runtime. Hill goes through notable new features in recent releases, including the ability to use operational kernels as a math library and the ability to feed external initializers entirely in memory. Performance improvements include the addition of a transpose optimizer, optimizing away heap allocations, and reducing the need for layout transformations.
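
A minimal, hedged example of the ONNX Runtime Python usage being described; the model path, input name, and provider list are assumptions, and unavailable providers simply fall back to the CPU implementation.

```python
# Minimal ONNX Runtime inference; model path, input name, and providers are
# illustrative. ONNX Runtime falls back to CPU if a provider is unavailable.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "tiny_classifier.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
features = np.random.rand(1, 4).astype(np.float32)
outputs = session.run(None, {"features": features})   # None = return all outputs
print(outputs[0])
```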

  • 03:45:00 In this section, the speakers discuss the improvements made to ONNX Runtime, including optimized tensor shape and inline vector classes, resulting in a reduction in heap allocations and improved performance. They also explain the benefits of execution providers, which enable ONNX Runtime to perform optimally on various hardware possibilities, including a complete CPU implementation as a fallback option. Additionally, they highlight updates made to support mobile platforms and improved usability for mobile developers, including the use of NHWC conversion at runtime and the addition of Android and iOS packages with the full ONNX Runtime builds. Finally, they introduce ONNX Runtime Web, backed by the same core codebase as ONNX Runtime and with a smaller binary, and discuss the introduction of a JavaScript library called ONNX Runtime Common.

  • 03:50:00 In this section, the speakers discuss updates to ONNX Runtime, including support for the XNNPACK library and upcoming OpenGL support in the 1.12 release. They also address the challenges with data pre- and post-processing and the creation of the ONNX Runtime extensions, which provide a library of shareable custom ops focused on model pre/post-processing work. These extensions include functions like converting text to uppercase or lowercase and separating positive and negative values into separate tensors. The current library is mainly focused on the natural language processing, vision, and text domains, but it is anticipated that this will evolve as new needs are identified. They also introduce Jeff from Hugging Face, who discusses the integration of ONNX with the Optimum library for accelerating Transformers models.
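
A hedged sketch of how the ONNX Runtime extensions' custom ops are made available to a session; the model path is an assumption.

```python
# Registering the ONNX Runtime Extensions custom-op library so a model that
# uses its pre/post-processing ops can be loaded; the model path is assumed.
import onnxruntime as ort
from onnxruntime_extensions import get_library_path

so = ort.SessionOptions()
so.register_custom_ops_library(get_library_path())   # exposes the shared custom ops
session = ort.InferenceSession("model_with_text_preprocessing.onnx", so)
```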

  • 03:55:00 In this section, the speaker discusses the power of transformer models and how they are being used by major companies like Tesla, Gmail, Facebook, and Bing to make billions of predictions every day. They explain that Hugging Face's goal is to make these models accessible to every company in the world through readily accessible pre-trained models and tools to make use of them. They also discuss their focus on building a community that is sharing and improving upon what is possible, with over 1300 open-source contributors to their libraries and access to over 50,000 fine-tuned models for every machine learning task and language. The speaker then introduces their library Optimum, which is focused on accelerating transformers models from training to inference, addressing the challenges of compute, memory, and bandwidth resources that come with increasing model parameters.


  • 04:00:00 In this section, the speaker discusses the ONNX Runtime package within the Optimum toolkit and its ability to accelerate training and inference of transformer models. They introduce the new trainer class called ORT Trainer, which allows users to get native integration of DeepSpeed and achieve up to 40% acceleration in training throughput. For inference, there are three main classes: ORT Optimizer, ORT Quantizer, and ORT Model for Task. With these classes, users can simplify the graph from their model, optimize weights, and benefit from all the hardware acceleration offered by ONNX Runtime. The speaker also mentions collaboration efforts towards enabling sequence-to-sequence model optimization through these Optimum accelerated inference pipeline classes.
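
A hedged sketch of the Optimum inference classes in use; the model id is illustrative, and the export keyword has varied across Optimum versions (older releases used from_transformers=True instead of export=True).

```python
# Accelerated inference with Optimum's ONNX Runtime model classes; the model
# id is illustrative, and the export kwarg name differs across versions.
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

classify = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classify("ONNX Runtime made this noticeably faster."))
```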

  • 04:05:00 In this section, two presenters discuss the ONNX community, focusing on the optimization and conversion process for ONNX models. The first presenter introduces the Optimum library, which allows users to optimize and quantize their models, increasing throughput and decreasing latency while conserving their models' accuracy. The second presenter discusses the architecture and flow of the PyTorch ONNX converter, explaining its steps of converting PyTorch models to the torch intermediate representation, applying graph optimizations, and converting to ONNX IR. They also highlight some interesting features, such as the support for exporting a quantized model in QDQ format and capturing Python control flow loops and ifs in ONNX models as ONNX Loop and ONNX If nodes.

  • 04:10:00 In this section, the speaker discusses various ways to export custom ops in PyTorch, including writing a custom torch.autograd function and defining the forward and backward methods. The speaker explains how to utilize the API to register a custom symbolic function to tell the exporter how to export it as either standard ONNX ops or any custom ops in a custom domain. They then introduce the ONNX Local Function feature, which allows users to specify a certain Torch module class or node type as a function so a back-end can still run the model without having a specified kernel. Lastly, the speaker mentions that the team will continue to focus on support for more models and improving the experience of diagnosing failures.
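
A hedged sketch of the symbolic-registration API being described; the namespace, op name, and opset version are assumptions.

```python
# Telling the PyTorch ONNX exporter how to export a custom op by registering
# a symbolic function; namespace, op name, and opset version are assumptions.
import torch
from torch.onnx import register_custom_op_symbolic

def my_relu_symbolic(g, input):
    # Emit either a standard ONNX op (as here) or a custom-domain op,
    # e.g. g.op("custom.domain::MyRelu", input).
    return g.op("Relu", input)

register_custom_op_symbolic("mynamespace::my_relu", my_relu_symbolic, opset_version=16)
```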

  • 04:15:00 In this section, a use case is discussed where an existing camera system is reused to detect employees entering dangerous areas near moving machinery. Using an open-source ONNX model for detecting people and SAS's Event Stream Processing tool for real-time analysis and event processing, a solution was developed that could process millions of events per second and be scaled to larger systems. The solution was also made available through a graphical studio and Jupyter notebook for data scientists to develop models, and the ONNX runtime was integrated into SAS Event Stream Processing. To ensure resilience, a modular solution was suggested that split the image processing into several steps with Kafka as a buffer queue.

  • 04:20:00 In this section, the speaker describes a computer vision processing model that has been deployed on the edge using Kubernetes as a deployment mechanism. The model includes an ingest process with a pod for each camera, a Kafka bus for video data, and a processing pod that uses a computer vision model to create results. The results are then sent to a third pod that processes additional sensor data from the customer to understand if the equipment being recorded is active or not. Additionally, the speaker explains that this architecture is currently in production at one of their customer's facilities and that the ONNX runtime integration ensures an optimal time to value thanks to publicly available pre-trained models and the reuse of customer assets. The architecture's resiliency is another key benefit and is ensured thanks to Kubernetes and Kafka.

  • 04:25:00 In this section, Matthew at Bazaar Voice discusses the digitization of commerce and how brands and retailers have shifted to the infinite shelf space across the internet. With the scale of data that e-commerce companies possess, creating impactful insights using AI can be a game-changer. Matthew illustrates this by using Bazaar Voice as an example, which manages and processes data for over a billion shoppers a month and provides over 8 billion total reviews for brands and retailers. By focusing on sharing product reviews across catalogs, the concept of product matching plays a pivotal role. Matthew explains how a machine learning model is built to perform product matching by comparing unique product identifiers, but any leftovers are done manually. To implement a solution that generates real business value, the ideal approach is a lightweight, cost-efficient solution that maintains performance.

  • 04:30:00 In this section, the speaker discusses different options for deploying machine learning models, including virtual servers, cloud ML platforms, and serverless functions such as Azure Cloud Functions or AWS Lambdas. After evaluating the pros and cons of each option, the speaker and their team decided to develop a scikit-learn model exported to ONNX format, build it with Python, deploy it to a serverless function, and use a node environment to run inference on ONNX runtime. The speaker also mentions that ONNX helps reduce the deployment size of models but highlights the challenges of working within the timeout and deployment size limits of serverless functions, as well as the costs and size limitations of Python packages.
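
A hedged sketch of the export step in the approach described above: a scikit-learn model converted to ONNX so a lightweight runtime can serve it from a serverless function. The dataset, model choice, and file names are illustrative.

```python
# Exporting a scikit-learn model to ONNX with skl2onnx for a lightweight
# serverless deployment; data, model choice, and file name are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from skl2onnx import to_onnx

X = np.random.rand(200, 8).astype(np.float32)
y = (X.sum(axis=1) > 4).astype(np.int64)
clf = RandomForestClassifier(n_estimators=20).fit(X, y)

onx = to_onnx(clf, X[:1])                    # a sample input fixes the input type and shape
with open("product_matcher.onnx", "wb") as f:
    f.write(onx.SerializeToString())
```

The resulting file can then be served by the ONNX Runtime Node.js binding inside the serverless function, avoiding heavyweight Python dependencies at inference time.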

  • 04:35:00 In this section, Senior Software Engineer at Adobe Nikhil Calro discusses the unique challenges involved in high-performance machine learning for video and audio workflows. These challenges include resource limitations, data-intensity, and compute-heavy requirements. To address these issues, Adobe uses a combination of technologies and pipelines to accelerate workflows, including the ONNX runtime to power machine learning workflows in Windows and Direct ML execution provider for GPU acceleration on Windows platforms. Calro also notes that Adobe's machine learning workflows aim to enable creators to spend more time on the creative process and less time on redundant and repetitive tasks.

  • 04:40:00 In this section, the speaker talks about how Adobe's Creative Cloud apps target the entire Windows ecosystem as a single platform and must provide feature and functional parity across all major IHVs that support Windows such as Nvidia, Intel, and AMD. They chose the DirectML execution provider to enable the use of vendor-specific hardware like tensor cores on Nvidia GPUs as it leaves hardware free for other async compute workflows. They also made additional performance optimizations by building a framework on top of the ONNX runtime and trying to assemble the inference requests into batched workflows to reduce resource contention with the GPU and minimize driver overhead. They give an example of the Scene Edit Detection workflow, which is an extremely resource-intensive workflow, but they are able to run the entire pipeline end-to-end from decode to inference in about 10 seconds or six times real-time.

  • 04:45:00 In this section, the speaker discusses how the performance enablements provided by ORT and the DirectML execution provider made it possible to use modern high-end GPUs to enable machine learning-based workflows during GPU render. They plan to transition their pipeline to look more like the pipeline on the right, minimizing transfers to the CPU and keeping as much work as possible on the GPU or GPU-addressable hardware. This will become even easier as more of their GPU compute transitions to DX12, removing the interop overhead associated with OpenCL and CUDA to DX12 in their app.

  • 04:50:00 In this section, Alexander Zang, a software developer at Topaz Labs, discusses the challenges of deploying image models on desktops and laptops. He explains that the critical part of this deployment is fitting into an existing workflow, getting the expected performance without manual configuration, and delivering high-quality image models. Alexander explains that unlike server deployment, desktop deployment lacks control over the system, particularly with different GPUs from different vendors with varying levels of memory and responsiveness constraints. His solution to this is to rely on different inference libraries for each hardware vendor, which ONNX provides. This approach allows Topaz Labs to create a model architecture that can be used by different inference libraries while saving on manual work.

  • 04:55:00 In this section, the speaker discusses the challenges associated with model conversion and the need to test for compatibility issues before training a model. The issue of ambiguity in model specifications is highlighted, as well as the need to test different libraries for performance and consistency. The speaker also explains the reasons for performing multiple conversions, stating that using a more generic interface can lead to additional loading steps and conversion costs that can impact the performance of their apps. Finally, the process of selecting the appropriate configuration and handling the runtime inference pipeline is explained, highlighting the need for compatibility and selection while ensuring consistent quality and speed from desktops.



  • 05:00:00 In this section, a speaker from NVIDIA talks about their approach to handling ONNX model compatibility and improving image quality across desktop systems by splitting images into blocks and running them through inference, maximizing throughput and potentially running on multiple devices and libraries in parallel. The speaker also addresses the difficulty in ensuring new model architectures can be added and behave well across all libraries, which can take a lot of work and time. They then move on to discuss two tools, ONNX GraphSurgeon, a Python library that allows you to create and modify ONNX models, and Polygraphy, a toolkit for debugging deep learning models. The speaker explains how these tools work and how they can be used to construct models as simply as constructing TF graphs.
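
A hedged sketch of loading and editing a model with ONNX GraphSurgeon; the file names are assumptions and the edit is only illustrative.

```python
# Loading, inspecting, and lightly editing an ONNX model with ONNX GraphSurgeon.
# File names are assumptions; the edit (pinning a symbolic batch dim) is illustrative.
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("model.onnx"))
print("ops in graph:", sorted({node.op for node in graph.nodes}))

for tensor in graph.tensors().values():
    if isinstance(tensor, gs.Variable) and tensor.shape and isinstance(tensor.shape[0], str):
        tensor.shape = [1] + list(tensor.shape[1:])   # pin a symbolic batch dimension to 1

graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "model_static_batch.onnx")
```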

  • 05:05:00 In this section, the speaker introduces the Polygraphy tooling, which includes a Python API and various command-line tools that offer a lot of functionality for manipulating ONNX models. The speaker focuses on the ONNX-specific tools, such as inspect model, which shows a text representation of an ONNX model, and the surgeon sanitize subtool, which simplifies the model and folds constants. The surgeon extract subtool allows users to extract subgraphs from a model to debug it, and the debug reduce subtool works like git bisect but for ONNX models, allowing one to find the smallest failing model to diagnose errors in the system.

  • 05:10:00 In this section, the presenter discusses using the Polygraphy debug reduce tool to debug models that may have runtime issues. By reducing the model size and testing each intermediate model, developers can identify problematic areas in the code and make the debugging process easier. The presenter also explains how Qualcomm collaborates with the community by using ONNX as an interchange format that can be used across a variety of devices, from earbuds to laptops to automotive systems. By targeting models using ONNX, developers can create models that are compatible with all of Qualcomm's supported devices.

  • 05:15:00 In this section, the speaker talks about the challenges of dealing with various architectures of devices that require different models, implementations, and timing requirements. He gives an example of how the same algorithm and technology built for depth sensors and cameras in mobile phones are now used for secure smart doorbells and inside and outside surveillance cameras in automobiles. He then emphasizes the importance of scalability and compares the differences between processing machine algorithms on CPUs, GPUs, and AI accelerators using the example of running the Inception V3 model, where running it on the AI accelerators can provide up to a thousand inferences per second, freeing up the CPU for other useful tasks.

  • 05:20:00 In this section, a representative from Qualcomm explains how they have integrated artificial intelligence (AI) accelerators into their hardware to improve performance and scalability. By using a purpose-built AI accelerator, they can handle AI workloads without the extra energy consumption or slower speeds that often result from using a CPU or GPU. Additionally, their ONNX interchange format allows them to compile and execute machine learning models across different devices and verticals, saving time for the company and allowing for more portability. They have also created a unified software stack and library that supports a variety of operating systems and hides hardware details to make it easier for customers to use their hardware.

  • 05:25:00 In this section, the speaker introduces the software stack that Qualcomm has developed surrounding ONNX. They have built a complete solution which includes the ONNX runtime, as well as a delegation system that takes care of routing models to either the CPU or GPU depending on the device use case. The speaker discusses the many tools they have developed, including compilers, profilers, analyzers, and enablement for network architecture search tools. The speaker emphasizes the scalability and versatility of ONNX and how it can be used for various device use cases, including camera algorithms, smart speakers, and XR devices.

  • 05:30:00 In this section, the speaker explains the development process of the ONNX-MLIR compiler, which is used to provide a reference dialect in MLIR for easy optimization across different architectures and to deploy the models in various environments. The speaker also highlights the framework they have introduced to support custom accelerators, which enables them to choose which operators to offload and to turn the accelerator on or off easily. The speaker also provides an overview of how optimization is deployed in ONNX-MLIR, where the ONNX model is gradually lowered into an intermediate representation.

  • 05:35:00 In this section, the speaker talks about the ONNX compiler and how multiple dialects are used for CPU and accelerator optimization. The high-level optimization includes graph-level optimization, and at lower levels, optimization is applied to CPU and accelerator operations. The speaker presents an example of how an accelerator framework in the compiler works, where using the accelerator is 11 times faster than running it on a CPU. They also mention how they are focusing on deep learning operator optimization and will support online machine learning operators as well as other accelerators such as the CPU. Finally, they express their gratitude to the contributors and invite more contributions to continue growing the ONNX project. The next speaker, from Preferred Networks, introduces PFVM, their neural network compiler that uses ONNX's intermediate representation.

  • 05:40:00 In this section, a member of the compiler team at Preferred Networks discusses a use case for ONNX in model optimization, made possible by its stable and well-documented intermediate representation. The team uses ONNX to optimize models along a variety of optimization paths, extending ONNX with custom operators, such as adding device information to optimize device transfers and memory. The speaker also discusses the importance of shape inference in model optimization and introduces three optimization cases. The first case involves reducing kernel launch overhead in computational graphs executed on CUDA by fusing multiple element-wise operators into a single fusion-group operator.

  • 05:45:00 In this section, the speaker explains that the backward pass would contain many unnecessary operators if the model was generated by a program such as neural architecture search. This is where optimizations like shape inference and graph simplification come in handy. Shape inference is crucial for determining whether adjacent element-wise operators can be fused together, whereas graph simplification can remove unnecessary operators from the backward pass. Both of these optimizations can significantly reduce the number of computations needed and improve the overall efficiency of the model.
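    As an illustration of this kind of graph simplification (the talk does not name a specific tool), the open-source onnx-simplifier package folds constants and removes dead nodes from an exported graph; a minimal sketch, assuming a model.onnx file exists:

        import onnx
        from onnxsim import simplify  # pip install onnx-simplifier

        model = onnx.load("model.onnx")           # e.g. a graph produced by a NAS tool
        simplified, ok = simplify(model)          # constant folding, dead-node removal
        assert ok, "simplified model failed the ONNX checker"
        onnx.save(simplified, "model.simplified.onnx")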

  • 05:50:00 In this section, the speaker discusses the technique of checkpointing, which reduces memory usage when executing a model. By modifying the computational graph, memory usage can be reduced even further, but at the expense of increased latency. The speaker emphasizes the importance of knowing tensor sizes to estimate memory usage and states the limitations of automatic checkpointing when dealing with unknown dimensions. Additionally, the speaker discusses the impact of unknown dimensions on optimization opportunities and outlines the improvements that have been made to ONNX shape inference. Specifically, the introduction of symbolic inference in ONNX 1.10 has greatly improved shape inference through data propagation.

  • 05:55:00 In this section, the speaker discusses shape inference in ONNX, explaining that shape information can be propagated globally from the top of the graph in static cases, but more support is needed for dynamic cases. The speaker shows an example of an ONNX graph where the output shape of a Reshape needs to be estimated but is currently unknown. They suggest implementing more shape-inference functions for dynamic cases, and ask whether support for cases like the concatenation of two tensors of different sizes is necessary. The speaker also briefly mentions optimizations for their in-house supercomputer, which only accepts models with static shapes and no dynamic branching, and which uses static information for scheduling and optimization. Next, a representative from Huawei shares a pre-recorded talk about bringing the power of ONNX to Spark for inference.
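    A minimal sketch of the shape-inference step discussed above, using the onnx Python package (the data_prop flag, available in recent onnx releases, enables the data-propagation behaviour described for ONNX 1.10; the file name is a placeholder):

        import onnx
        from onnx import shape_inference

        model = onnx.load("model.onnx")
        inferred = shape_inference.infer_shapes(model, data_prop=True)

        # Print the shape resolved for each intermediate tensor; symbolic
        # dimensions show up as names rather than integers.
        for vi in inferred.graph.value_info:
            dims = [d.dim_param or d.dim_value for d in vi.type.tensor_type.shape.dim]
            print(vi.name, dims)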

  • 06:00:00 In this section, the focus is on the Spark improvement proposal, commonly known as SPIP, which aims to simplify the process of deploying machine learning models to Spark by integrating it with third-party DL frameworks. The proposal is aimed at data engineers or developers who need to deploy DL models on Spark. The end goal is to hide the complexities of data preprocessing, conversion, and model initialization inside a model UDF, enabling users to run ONNX inference easily on big data. Huawei's own AI processor, called Ascend, is introduced, and it is explained that to complete the Spark-plus-ONNX pipeline on the Ascend platform, Ascend support must first be introduced into ONNX Runtime.
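    To illustrate the SPIP idea (the column, file, and provider names below are placeholders, not Huawei's actual implementation), ONNX inference can be wrapped in a pandas UDF so that a Spark column of feature vectors is scored with onnxruntime:

        import numpy as np
        import pandas as pd
        import onnxruntime as ort
        from pyspark.sql.functions import pandas_udf

        @pandas_udf("float")
        def onnx_score(features: pd.Series) -> pd.Series:
            # A real implementation would cache the session instead of
            # re-creating it for every batch.
            sess = ort.InferenceSession("model.onnx",
                                        providers=["CPUExecutionProvider"])
            x = np.stack(features.to_numpy()).astype(np.float32)
            (scores,) = sess.run(None, {sess.get_inputs()[0].name: x})  # single-output model
            return pd.Series(scores.reshape(-1))

        # df = df.withColumn("score", onnx_score("features"))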

  • 06:05:00 In this section, the speaker discusses the Ascend AI ecosystem and the different processors it supports. The Ascend 310 only supports AI inference, while the Ascend 710 and 910 support both training and inference. Additionally, the ecosystem provides a software layer called CANN that provides APIs for developers to easily build AI applications and services. The speaker then focuses on CANN, the current software stack in the Ascend ecosystem, and its newest version, CANN 5.0. They explain the different layers of CANN and how it provides operator libraries, optimization engines, and a framework adapter for developers. The speaker then discusses the roadmap for adding CANN as a new execution provider for ONNX Runtime, allowing users to run ONNX models directly on Ascend hardware.

  • 06:10:00 In this section of the video, the ONNX Community Day is wrapping up with the roundtable discussions for the in-person attendees. The roundtables consisted of six different topics, with big notepads available for attendees to write on. The topics were selected based on submissions from attendees and include ONNX for Mobile and Edge, model deployment and machine learning quantization, training and operations, conversions, and operators. The ONNX Runtime team was also available to join in on the conversations. After the roundtables, attendees enjoyed a happy hour with food and drinks, and were encouraged to fill out the survey to provide feedback.
ONNX Community Day!
  • 2022.06.24
  • www.youtube.com
 

ONNX: Past, Present, and Future - Jim Spohrer, IBM & Prasanth Pulavarthi, Microsoft



ONNX: Past, Present, and Future - Jim Spohrer, IBM & Prasanth Pulavarthi, Microsoft

The "ONNX: Past, Present, and Future" video features IBM's Jim Spohrer and Microsoft's Prasanth Pulavarthi discussing the growth and future of the open-source AI framework ONNX. They highlight the importance of standardizing AI models' deployment through the interchanging format provided by ONNX, enabling seamless optimization across different deep learning frameworks. Additionally, they discuss the recent developments in ONNX runtime's ability to work with various hardware accelerators and offer tips and resources for getting started with ONNX. The speakers answer audience questions regarding ONNX's capabilities, commercial deployment, and upcoming certification plans while urging viewers to get involved in the ONNX community.

  • 00:00:00 In this section, Jim Spohrer from IBM and Prasanth Pulavarthi from Microsoft introduce themselves and provide an overview of the past, present, and future of ONNX, an open-source AI framework. ONNX serves as a standardized interchange format, allowing different tools to interoperate and optimize inference across various deep learning frameworks. The speakers urge viewers to get involved with the ONNX community by checking out the news and getting started information on the ONNX website, as well as joining the community on GitHub and Gitter. They also highlight recent virtual community meetings, where ONNX partners discussed their projects and how they are using ONNX in innovative ways.

  • 00:05:00 In this section, the speakers discuss the growth of the ONNX community and its importance as an interchange format amidst numerous open source projects in the field of artificial intelligence and machine learning. They highlight the progress of the ONNX community in terms of pull requests, contributors, stars, forks, published papers, and the model zoo, and encourage more organizations and individuals to get involved. The speakers also introduce ONNX at Microsoft and its use in various products, emphasizing the need for a standardized format like ONNX in the diverse landscape of AI and ML solutions. They offer tips on how to use ONNX and welcome questions from the audience.

  • 00:10:00 In this section, the speakers discuss common problems that developers face when trying to deploy ML models into production, such as high inference latency, running models on edge and IoT devices, and the need to run the same model on different hardware and operating systems. To solve these issues, the speakers introduce the ONNX format and ONNX runtime, which allows developers to represent models from various frameworks in a common format and run them efficiently on different platforms and accelerators. Microsoft's Speech Service is given as an example of how ONNX has improved agility and performance in production.

  • 00:15:00 In this section, the speakers discuss how using ONNX Runtime can lead to benefits in terms of agility, performance, and accuracy. They mention examples of Microsoft's cognitive services, such as speech-to-text and computer vision, as well as Azure Kinect, a device with body-tracking capabilities. The portability aspect of ONNX is also highlighted, as it allows the same model and application code to be used across different platforms and hardware accelerators, saving time and customization effort. Additionally, the speakers touch on Windows ML, which builds on ONNX Runtime and uses ONNX as a common format for models, making it easy to do machine learning inferencing in the Windows operating system.

  • 00:20:00 In this section, Jim Spohrer and Prasanth Pulavarthi discuss some of the recent developments and achievements of ONNX Runtime. One of its most significant features is its ability to work with different types of hardware accelerators, such as GPUs or VPUs. It also offers cross-language support, allowing users to train models in Python and consume them from C#. One example of a company using ONNX Runtime is an ISV that trains its financial models in Python using scikit-learn but runs them in production from C#. In addition, ONNX Runtime has recently optimized the inferencing and training of transformer models such as BERT and GPT-2, resulting in significant speedups and cost savings for users.
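    A hedged sketch of the scikit-learn-to-ONNX flow described here, using the skl2onnx converter (the model and feature count are placeholders); the resulting file can then be served from C# with the Microsoft.ML.OnnxRuntime package:

        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from skl2onnx import convert_sklearn
        from skl2onnx.common.data_types import FloatTensorType

        X = np.random.rand(100, 4).astype(np.float32)
        y = np.random.randint(0, 2, size=100)
        clf = LogisticRegression().fit(X, y)

        onnx_model = convert_sklearn(
            clf, initial_types=[("input", FloatTensorType([None, 4]))])
        with open("model.onnx", "wb") as f:
            f.write(onnx_model.SerializeToString())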

  • 00:25:00 In this section, the speakers provide information on how to get started with ONNX and ONNX runtime. The Model Zoo, which is accessible via a URL, offers a variety of pre-trained models to download and start using with ONNX runtime for vision, language, and upcoming speech models. Additionally, the speakers encourage the contribution of models to the Model Zoo. Existing models from other frameworks can also be converted or exported to the ONNX format. Frameworks such as PyTorch, Keras, TensorFlow, and Scikit-Learn have ONNX export functionalities, and ONNX runtime can be installed on Windows, Linux, and Mac with support for multiple programming languages.

  • 00:30:00 In this section, the speakers discuss hardware acceleration and how different hardware accelerators can be integrated through the API called execution providers. The ONNX runtime has a highly optimized CPU implementation as well as a CUDA implementation, and hardware vendors such as Nvidia and Intel have partnered with ONNX to integrate their optimizations with ONNX runtime. This ensures that any ONNX model can run with full support for the entire ONNX spec, even if a particular operation is not supported by a specific accelerator. The speakers encourage viewers to try out ONNX and share resources available in the ONNX community, including open governance, SIGs, and working groups.
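    In the Python API, the execution-provider mechanism described above looks roughly like this (assuming the GPU package, onnxruntime-gpu, is installed); operators the first provider cannot handle fall back to the next one in the list:

        import onnxruntime as ort

        print(ort.get_available_providers())
        # e.g. ['CUDAExecutionProvider', 'CPUExecutionProvider']

        sess = ort.InferenceSession(
            "model.onnx",
            providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
        )
        print(sess.get_providers())   # providers actually used by this session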

  • 00:35:00 In this section, the speakers discuss the ONNX project's open governance structure, which consists of different special interest groups (SIGs) and working groups. They explain that SIGs and working groups meet periodically, that everything is open, and that the meetings are all published on the LF AI & Data public calendar. Furthermore, the open governance structure describes how contributors and approvers are selected, who respectively get voting rights on decisions or merge permissions. Ultimately, the speakers encourage people to get involved in the ONNX project by joining the different channels and groups, signing up for the mailing list, and participating in discussions.

  • 00:40:00 In this section, the speakers answer various audience questions regarding topics such as potential book publications on the subject of ONNX, the logging capabilities of the ONNX runtime, and the methods used to decrease the time of machine learning training. They also mention some commercial deployments of ONNX-based models in various scenarios such as Azure Cognitive Services, office models, and the Bing search engine.

  • 00:45:00 In this section, the speakers discussed the commercial deployment of ONNX models outside of Microsoft and mentioned that there are a number of production-grade models being used by financial companies and other organizations on Azure. They also answered audience questions about ONNX, including whether it supports CNTK (yes, it has an ONNX export) and whether you need knowledge about hardware acceleration tools (no, as ONNX Runtime provides an abstraction layer). They also touched on the benefits of using ONNX during training versus converting to ONNX after training, explaining that ONNX Runtime can speed up the training process, resulting in faster training of transformer models. Finally, they expressed their willingness to hear about potential certification plans and the different ways people use ONNX.

  • 00:50:00 In this section, the speakers answer a question about ONNX's ability to support all pipeline types. While it is not 100% due to some existing gaps, common model types are typically supported, and users can refer to the ONNX Converter GitHub for a list of supported pipelines or try it themselves to see if their pipeline can be fully converted. The speakers then wrap up the session by thanking the attendees and encouraging them to join the ONNX community.
ONNX: Past, Present, and Future - Jim Spohrer, IBM & Prasanth Pulavarthi, Microsoft
  • 2020.09.11
  • www.youtube.com
 

Onnx-mlir: an MLIR-based Compiler for ONNX Models - The Latest Status



Onnx-mlir: an MLIR-based Compiler for ONNX Models - The Latest Status

Onnx-mlir is a compiler for ONNX models that uses MLIR and LLVM for optimization and code generation, supporting CPUs and custom accelerators. Dong Lin from IBM Research emphasizes the importance of thorough testing and highlights the framework's use in online scoring services and model serving frameworks. Onnx-mlir has multiple dialects for CPU and accelerator, with optimizations at various levels, and has been shown to speed up a credit card fraud detection model by 11 times using an IBM accelerator. The project welcomes community contributions to optimize important operators and support niche ML operators and other accelerators such as GPUs.

  • 00:00:00 In this section, Dong Lin from IBM Research discusses ONNX-MLIR, a compiler for ONNX models that uses MLIR and LLVM for high-level optimization and low-level code generation. The compiler aims to provide a reference for ONNX dialect in MLIR and make optimization convenient for not only CPUs but also custom accelerators. It's easy to integrate with other MLIR-based compilers, and it supports different programming languages such as Python, C++, and Java. Dong Lin also highlights the importance of carefully testing the compiler, and he mentions that it's been used for online scoring services and model serving frameworks, with newly introduced support for custom accelerators.

  • 00:05:00 In this section, the speaker discusses the ONNX-MLIR compiler, which can optimize and support new accelerators. The compiler has multiple dialects for CPU and accelerator, with optimizations at various levels. The speaker demonstrated the framework’s ability to speed up a credit card fraud detection model by 11 times using an IBM accelerator but couldn’t disclose any further details. They emphasized their interest in contributions from the community to grow the open-source project, as they aim to optimize important operators, support niche machine learning operators, and other accelerators such as GPUs.
Onnx-mlir: an MLIR-based Compiler for ONNX Models - The Latest Status
  • 2022.07.13
  • www.youtube.com
 

PFVM - A Neural Network Compiler that uses ONNX as its intermediate representation



PFVM - A Neural Network Compiler that uses ONNX as its intermediate representation

In this video, Zijian Xu from Preferred Networks introduces PFVM, a neural network compiler that uses ONNX as its intermediate representation for model optimization. He discusses how PFVM takes exported ONNX as input, optimizes it, and executes the model on specified backends using third-party APIs. He describes the importance of optimization, including extending ONNX with custom operators, shape inference, and graph simplification. He also addresses the limitations of current ONNX compilers, including the need for more support for the dynamic case, and suggests implementing more inference functions. He emphasizes the importance of reducing kernel launch overhead and memory usage for faster computation and suggests utilizing the static information available on their machines for scheduling and shape inference.

  • 00:00:00 In this section, Zijian Xu from Preferred Networks discusses PFVM, a neural network compiler that uses ONNX as its intermediate representation. He introduces the company and explains how they use deep learning to solve real-world problems. He then focuses on ONNX for model optimization rather than model deployment. He explains that PFVM works as a compiler and runtime, taking exported ONNX as input, optimizing it, and executing the model on specified backends using third-party APIs. He describes the optimization process, discussing how they extend ONNX with custom operators for device and memory optimization. He also discusses the importance of shape inference in model optimization and introduces three optimization cases. The first case is element-wise fusion.

  • 00:05:00 In this section of the video, the speaker discusses the importance of reducing kernel launch overhead for faster execution of computation graphs on CUDA. They propose fusing element-wise operators into a single fusion-group operator to reduce the number of kernel launches, but caution that not all operators can be fused together: it is necessary to check whether the operators' shapes are broadcast-compatible before adding them to a fusion group. The speaker also emphasizes the importance of shape inference and graph simplification for optimizing neural network models. Finally, they address the question of whether models contain unnecessary operators, and respond that such operators do appear and that these optimizations are necessary for faster computation.
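    The broadcast check mentioned above can be illustrated with a few lines of Python (a toy version, not PFVM's actual implementation): two element-wise operators are fusion candidates only if their shapes are broadcast-compatible under NumPy-style rules, which ONNX also follows.

        def broadcastable(shape_a, shape_b):
            # Compare trailing dimensions; each pair must match or contain a 1.
            for a, b in zip(reversed(shape_a), reversed(shape_b)):
                if a != b and a != 1 and b != 1:
                    return False
            return True

        print(broadcastable((8, 1, 16), (8, 4, 16)))  # True  -> can share a fused kernel
        print(broadcastable((8, 3, 16), (8, 4, 16)))  # False -> must stay separate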

  • 00:10:00 In this section, the speaker discusses how models generated automatically, for example by neural architecture search, may contain unnecessary operators. They demonstrate the importance of optimization using an example where the computational graph on the left of the slide uses a lot of memory to compute node five. By modifying the computational graph, the same output can be achieved with reduced memory usage. PFVM can perform automatic checkpointing to reduce memory usage, but it requires knowledge of tensor sizes to estimate memory usage accurately. The speaker emphasizes the importance of shape inference and how unknown dimensions limit optimization opportunities.

  • 00:15:00 In this section, the speaker discusses the current limitations, including cases where optimizations like element-wise fusion and automatic checkpointing cannot be applied, as well as the need for more support for the dynamic case. The speaker suggests implementing more shape-inference functions for the dynamic case and solicits feedback from users on whether to support cases like the concatenation of two tensors of unknown size. The speaker also discusses the benefits of utilizing the static information available on machines such as their MN-Core accelerator for scheduling and shape inference.
PFVM - A Neural Network Compiler that uses ONNX as its intermediate representation
  • 2022.07.13
  • www.youtube.com
 

YVR18-332 TVM compiler stack and ONNX support



YVR18-332 TVM compiler stack and ONNX support

The YVR18-332 video discusses the TVM compiler stack, a community-led deep learning compiler that supports a range of hardware and front-ends, including ONNX. The speaker discusses how TVM can optimize models at the schedule level, allowing developers to explore the search space and find the best configuration. They also discuss the automatic optimizations TVM offers, including loop transformations and GPU acceleration. The speaker covers the TVM roadmap, which includes enabling 8-bit support and automated tuning at the graph level. Additionally, they discuss the ONNX interface for framework integration (ONNXIFI) and the need to unify a standard interface across the ecosystem. Finally, the session pauses for lunch.

  • 00:00:00 In this section, the speaker introduces the TVM compiler stack and how it supports ONNX through NNVM. TVM is a deep learning compiler stack spanning from the graph level down to the kernel level, and is a community project led by researchers from the University of Washington with contributions from several companies and organizations. The TVM compiler stack supports a variety of hardware including CPU, GPU, and FPGA, and has plans to enable ASIC support, with a simulator for hardware design verification. The stack also supports various front-ends including MXNet, ONNX, and TensorFlow, and has a computation-graph IR implementation called NNVM with a range of optimization options.

  • 00:05:00 In this section, the speaker discusses the TVM compiler stack and its ONNX support. They explain that the TVM compiler stack can do a lot of optimization at the schedule level, such as loop transformations and GPU acceleration, and that ONNX support has recently been added alongside AutoTVM. The speaker also explains the remote deployment mechanism provided by TVM (RPC), which allows users to compile a model on their host machine and deploy it remotely to the target device. Additionally, they discuss the automatic optimizations introduced by TVM, which can reduce tedious work for developers and are designed to explore the search space and find the best configuration.

  • 00:10:00 In this section, the speaker discusses the TVM compiler stack and support for ONNX. They mention that the TVM compiler stack incorporates advanced tuning algorithms, including XGBoost-based cost models, to provide better performance. They also highlight that the open-source project allows automated optimization and can leverage previous work when exploring the search space. The speaker then talks about how TVM can be used with the VTA open accelerator and its three major parts. They explain how TVM can be used for schedule optimization and deployed remotely using TVM's RPC mechanism. Finally, they provide a roadmap for TVM, which includes enabling 8-bit support and automated tuning at the graph level, while also planning to enable use on Xilinx FPGA boards and Amazon's F1 instances.

  • 00:15:00 In this section, the speaker discusses the TVM compiler stack and the plan to upgrade to NNVM v2, dubbed Relay. Relay adds proper control flow and an improved type system, making the compiler infrastructure better. The speaker explains how Relay fits into the TVM compiler stack when supporting ONNX. ONNX defines three major parts: the computation graph model, built-in operators, and standard data types. To support ONNX and its ML extension, a front-end was implemented in TVM; however, the conversion of ONNX to the NNVM symbolic graph may cause some mismatch issues. Moreover, the community is discussing whether to use ONNX or Relay as the graph IR in the TVM community, and the way to progress is to work together as Relay is merged and used for future model conversion.
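    For reference, in current TVM releases the ONNX import path described here goes through the Relay front-end; a minimal sketch (input name, shape, and target are placeholders):

        import onnx
        import tvm
        from tvm import relay

        onnx_model = onnx.load("model.onnx")
        shape_dict = {"input": (1, 3, 224, 224)}
        mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

        with tvm.transform.PassContext(opt_level=3):   # graph-level optimizations
            lib = relay.build(mod, target="llvm", params=params)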

  • 00:20:00 In this section, the speaker discusses the ONNX interface for framework integration (ONNXIFI), a standard interface for neural network inference on different accelerators. The interface includes backend runtime discovery, selection of execution backends, and querying which ONNX operators a backend supports. The speaker suggests that the community should discuss how to unify this standard interface across ecosystems. Additionally, the speaker talks about the TVM compiler stack and how it could incorporate hand-coded implementations as part of its search space; however, there is still no settled mechanism for this, so the speaker welcomes ideas and discussion.

  • 00:25:00 In this section, the topic of discussion is the TVM compiler stack and its support for ONNX. It is clarified that ONNX has both a description format and a runtime API, and the ONNX ecosystem is expanding beyond just the open exchange format. The goal is to unify the API so that higher-level applications can call a single standard API for runtime inference, making it easier for developers in this area. There are no further questions, and the session pauses for lunch.
 

.NET MAUI Community Standup - ONNX Runtime with Mike Parker



.NET MAUI Community Standup - ONNX Runtime with Mike Parker

In this video the guest speaker Mike Parker introduces the ONNX runtime, an open-source and cross-platform tool that enables machine learning optimization and acceleration across multiple hardware platforms. Parker explains the importance of using the ONNX runtime and showcases how it can be used in .NET MAUI projects to classify images using the MobileNet object classification model. The hosts and Parker discuss the benefits of running machine learning models on a device and the ability to avoid backend infrastructure costs. Additionally, the team shares helpful resources, including Parker's blog on this subject and their partnership with Al Blount for .NET MAUI and Xamarin support.

  • 00:00:00 In this section of the community stand-up for .NET MAUI, the team introduces Mike Parker, a member of the modern client app customer advisory team, who shares his knowledge of ONNX Runtime, a machine learning tool for optimizing and accelerating models across multiple hardware platforms. He explains how ONNX Runtime is open-source and cross-platform, allowing developers to use various frameworks and hardware for machine learning applications. He also showcases how the .NET community can take advantage of ONNX Runtime in their projects.

  • 00:05:00 In this section, the hosts introduce themselves and their guest, Mike, who joins to discuss the ONNX Runtime with them. The hosts mention that they will first look at some blogs from Microsoft and the community before moving on to discussing some PRs from the .NET MAUI and adjacent repositories, which they are excited about. Lastly, they will turn the discussion over to Mike Parker to talk about the ONNX Runtime libraries, how he used them in Xamarin, and his writing and podcasts about the subject. The hosts also mention that it's the 20th anniversary of .NET and that .NET MAUI Preview 12 has shipped. They also caution users about a breaking change and mention that Dave has been working with community library maintainers on it.

  • 00:10:00 In this section, the speakers discuss the unification effort of .NET and the need for recompilation and updates for iOS libraries and dependencies as Xamarin transitions into MAUI. The team is currently working on a way to adopt these changes for any binding projects to native libraries and NuGet packages, and they assure users that guidance will be provided. Additionally, the team discusses the lack of MAUI support in Visual Studio for Mac 2022 Preview 5 and explains that they are working on it, but had to prioritize rebuilding all the Xamarin work on the .NET 6 runtime with a new UI stack first. Lastly, the team announces the update of the Facebook SDK binding and mentions the effort to update and maintain other third-party libraries such as the Google libraries.

  • 00:15:00 In this section, the Xamarin team talks about the components it maintains, which used to be a major focus but are now being narrowed down to what is most critical for support. The team encourages users to reach out if they have dependencies on these components during the transition into .NET 6. They also discuss a Maui Hello World tutorial and a .NET Maui source code dissecting blog series. Additionally, Andreas' blog on Z Index and UI customization is highlighted, showcasing the stacking of elements on top of each other using Z Index.

  • 00:20:00 In this section, the presenter showcases a few blog posts and designs that people have recreated using .NET MAUI. The blogs include recreating a boarding pass design in Xamarin Forms, discussing state machine development with Xamarin Forms, organizing your .NET MAUI startup file, a deep dive into the architecture of handlers, and a post on Xamarin Forms to JavaScript two-way communication using a WebView. The posts show how much focus there is on design these days, how Xamarin Forms/MAUI is becoming more extensible and useful, and how to use both JavaScript and bindings more effectively.

  • 00:25:00 In this section of the transcript, the hosts discuss the latest community contributions to .NET MAUI, including a Xamarin.Forms 5.0 service release and new documentation. They encourage contributors to provide feedback on repo participation and mention the availability of a chip control in the community toolkit, although it is not directly in the box. The hosts also mention the recent addition of shadows, which is a new feature of MAUI, and suggest jazzing up their bot by modding it.

  • 00:30:00 In this section, Mike Parker provides an update on the status of .NET MAUI preview releases, highlighting the progress made with preview 13. There is lots of green and lots of new features, including label formatted text, spans, and span gestures, which are all filling out the gaps in the platform. The community has also shared a new attached property called "Galadriel," which enables the simple addition of badges to tabs and menu items in Xamarin Forms Shell. Additionally, the .NET MAUI team has been working to improve the platform's startup performance, and the results are promising with the app starting up in 576 milliseconds on a Pixel 5 with profiled AOT.

  • 00:35:00 In this section, the .NET MAUI Community Standup discusses the availability of C# markup extensions for UI building on both Xamarin Forms and .NET MAUI, which provide a more fluent syntax for UI development. They also talk about the ONNX Runtime, a portable inference runtime that can run models across different platforms using a single API set, with examples like facial recognition and photo tagging. The ONNX Runtime is available on GitHub and can be used in Xamarin and mobile apps. The process for using the ONNX Runtime involves loading the model, preparing the input, running the inference, and processing the output into a usable format.

  • 00:40:00 In this section of the video, Mike Parker explains how they used the ONNX runtime in a Xamarin Forms app to classify images using the MobileNet object classification model. He highlights the importance of following the model documentation and normalizing the RGB values. Parker also mentions a useful app called Netron that allows visualizing input and output sizes, shapes, and names of the inputs and outputs. The app is just a single button that loads and runs the model, and displays the top label in an alert. Parker notes that it's cool that all of this is happening on the device without involving the cloud.

  • 00:45:00 In this section, the speakers discuss the benefits of running machine learning models on device, including the ability to function without connectivity and avoid backend infrastructure costs. They also touch on their experiences using the cloud-based Microsoft Azure Vision APIs and how they were able to achieve faster processing times using the ONNX runtime. Furthermore, they explain how they simplified a team onboarding app experience by replacing platform-specific and model-specific code with a single ONNX model. Finally, they discuss the process of preparing a model using the Azure Custom Vision Service and creating a NuGet package that allows ONNX Runtime to work with Xamarin.

  • 00:50:00 In this section of the video, Mike Parker discusses their work with ONNX runtime and adapting their native interoperability code to support platforms with AOT, such as iOS. He also goes on to describe the real-world scenarios where this technology can be used, including streamlining workflows and improving accessibility in apps. However, he notes that working with pre-built models can be overwhelming for those without a traditional data science background and suggests being selective about the models to incorporate. Finally, some helpful resources, including Mike's blog on this topic, are shared.

  • 00:55:00 In this section, the hosts talk about Mike Parker's availability for hire and introduce Al Blount, who can provide support for companies needing help with Xamarin Forms and .NET Maui. They also briefly discuss Mike's team's current work on upgrading to Maui, but can't share any details yet. The hosts end the video thanking viewers for joining and announcing the upcoming .NET 20th anniversary birthday party.
.NET MAUI Community Standup - ONNX Runtime with Mike Parker
  • 2022.02.03
  • www.youtube.com
Join Maddy Montaquila, David Ortinau, and special guest Mike Parker to learn about using the ONNX Runtime in your Xamarin app for machine learning.
 

[Virtual meetup] Interoperable AI: ONNX e ONNXRuntime in C++ (M. Arena, M. Verasani)



[Virtual meetup] Interoperable AI: ONNX e ONNXRuntime in C++ (M. Arena, M. Verasani)

The video discusses the challenges of using different frameworks to train machine learning algorithms, leading to a lack of interoperability, and introduces ONNX and ONNXRuntime that aim to create a universal format for deep learning models. ONNX converts neural networks into static computational graphs, allowing for optimized performance during inference. ONNXRuntime allows for the conversion of any framework into ONNX format and provides acceleration libraries that can be used to target any hardware platform. The video showcases examples of using ONNX and ONNXRuntime, as well as discussing their use in C++ and providing advice for better understanding the project and its documentation.

Marco Arena and Mattia Verasani also discuss the benefits of using ONNX and ONNXRuntime in C++ for machine learning models, highlighting the flexibility of the framework and its ability to easily convert models from different frameworks without sacrificing performance. They provide examples of converting models to ONNX format and demonstrate the use of ONNXRuntime in inference mode, showcasing performance improvements over a classic Python model. Additionally, they discuss their work with embedded systems and the potential benefits of benchmarking ONNXRuntime on GPUs. The speakers also mention future virtual meetups and express hope for incorporating more networking opportunities for attendees.

  • 00:00:00 In this section of the video, the speakers discuss the problems that arise when using different frameworks to train machine learning algorithms for various use cases, leading to a lack of interoperability. This can be a challenge when working in a team where members may have varying levels of expertise with different frameworks. To solve this problem, the speakers introduce ONNX and ONNXRuntime, which allow for interoperability between frameworks by converting networks into a common format. ONNXRuntime then allows the converted models to be deployed on any target hardware, including CPUs, GPUs, and FPGAs.

  • 00:05:00 In this section, the speakers discuss the ONNX (Open Neural Network Exchange) project, which aims to be a universal format for deep learning models, allowing for interoperability between different frameworks. The project is community-driven and supported by numerous companies, with a focus on converting different model types and frameworks into a single format for production. ONNX represents neural networks as static computational graphs, which differ from dynamic graphs in that they are fixed ahead of execution rather than built on the fly. While static graphs are more computationally efficient, dynamic graphs offer greater flexibility for varying input sizes.

  • 00:10:00 In this section, the speakers discuss how ONNX provides a static computational graph, which is very useful in the inference process. While frameworks like PyTorch build dynamic computational graphs, ONNX provides a static graph that has already been fixed during the training and development phases, allowing for more optimized performance. Furthermore, the ONNX Runtime tool by Microsoft accepts models converted from any framework into ONNX format and provides acceleration libraries that can be used to target any hardware platform, making it a helpful and versatile tool for inference and production.

  • 00:15:00 In this section of the video, the speakers talk about their experience using ONNX and ONNX Runtime for AI interoperability. They explain how they create a PyTorch model in Python, convert it to ONNX format, and use ONNX Runtime for deployment, allowing them to write their pipelines and target different platforms, such as GPUs or Android devices. They also demonstrate the performance improvements of using ONNX Runtime compared to other inference engines, achieving up to 4 times faster results. They highlight the flexibility of ONNX, allowing them to convert models created in other frameworks such as Matlab to use them with ONNX Runtime without having to re-write the deployment pipeline.

  • 00:20:00 In this section, the speakers discuss the process of using ONNX and ONNXRuntime in C++. They explain that models must first be converted to ONNX format before they can be run on ONNXRuntime. While TensorFlow serialization is not native to ONNXRuntime, there are open-source libraries available for conversion. They also respond to questions regarding the possibility of scripting the conversion process and the level of improvement seen with ONNX compared to C++. They note that further benchmarking and analysis is required. The ONNXRuntime repository is open-source and supported by Microsoft, offering a range of information, guides, and examples for users.

  • 00:25:00 In this section, the video discusses the features of ONNXRuntime on a scale of complexity from simple to more sophisticated. The green column contains basic features that are sufficient for simpler machine learning tasks, while the magenta column includes slightly more sophisticated features such as execution providers and profiling support. The red column represents advanced features for more complex tasks, such as the ability to add ONNX custom operators or use training support. The presenter also provides links to two demo repositories for ONNXRuntime in C++ and Python.

  • 00:30:00 In this section, the speaker introduces ONNX and ONNXRuntime in C++. They explain that an environment should be created within the program to manage the thread pool, and then the session, which represents the model under consideration. The session's characteristics can be customized, or the default settings can be used. Further, ONNXRuntime will optimize the model and perform the required preparation before executing the session. The tool can also perform inspection tasks, such as querying the number of inputs and outputs, their data types, and their names. Finally, users create their inputs and tensors in the required format.

  • 00:35:00 In this section of the video, the speaker discusses how to create an object that allocates tensors on the CPU and transfer them to the execution provider. The speaker creates input and output tensors, passing in the buffer with input values and the shape of the tensor as arguments. The tensor for the output is then created by passing the output values to the OnnxRuntime library. The speaker explains the importance of using names to launch inference, as it allows for flexibility in changing the ordering of inputs. The demo showcases a simple example of how the output values are printed on the screen.
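    The demo itself is in C++, but the same name-based flow can be sketched with the onnxruntime Python API (model file and dummy shapes are placeholders): inputs are fed and outputs requested by name, so argument order does not matter.

        import numpy as np
        import onnxruntime as ort

        sess = ort.InferenceSession("model.onnx")

        # Build a dummy feed for every declared input, replacing symbolic dims with 1.
        feeds = {
            i.name: np.zeros([d if isinstance(d, int) else 1 for d in i.shape],
                             dtype=np.float32)
            for i in sess.get_inputs()
        }
        wanted = [o.name for o in sess.get_outputs()][:1]   # request only the first output
        results = sess.run(wanted, feeds)
        print(dict(zip(wanted, (r.shape for r in results))))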

  • 00:40:00 In this section, the speaker provides advice for those who want to better understand the ONNX Runtime project and its documentation. They recommend looking at the C API, which is the better-documented part, and they describe their experience inspecting data in ONNX. The speaker also explains that although ONNX Runtime libraries are available through package managers, they recommend recompiling the libraries since architectures and providers vary, and shows how they compiled the same code to create two ONNX Runtime packages that target either a CPU or a GPU.

  • 00:45:00 In this section, the speaker discusses the ONNX Runtime with examples of using the DLL in C++. The session options added to the code vary depending on the client's preferences, and the runtime tries to use the added execution providers in the specified order. The speaker shows how the sample application occupies RAM and uses the GPU. A ResNet image classifier, pre-trained on the 1,000-class ImageNet dataset, is used as the example; the network has its own pre-processing and post-processing requirements.

  • 00:50:00 In this section of the video, the speaker discusses the pre-processing of images using Python and the use of extension sources to simplify the process. The images are resized to the expected input size and converted to float before being normalized; normalization consists of subtracting the mean and dividing by the standard deviation. The post-processing of the neural network output is a simple softmax normalization. The speaker also demonstrates how to open, process, and output images using ONNX Runtime with minimal code. The class used to read the labels from a text file is simple, and a few utilities are used to avoid unnecessary boilerplate code.
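    A sketch of this ImageNet-style pre/post-processing in Python (the mean/std values below are the common ImageNet ones; the model's own documentation is authoritative, and NCHW layout is assumed):

        import numpy as np
        from PIL import Image

        mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
        std = np.array([0.229, 0.224, 0.225], dtype=np.float32)

        img = Image.open("photo.jpg").convert("RGB").resize((224, 224))
        x = np.asarray(img, dtype=np.float32) / 255.0      # HWC in [0, 1]
        x = (x - mean) / std                               # subtract mean, divide by std
        x = x.transpose(2, 0, 1)[np.newaxis, ...]          # to NCHW with a batch dim

        def softmax(logits):
            e = np.exp(logits - logits.max())
            return e / e.sum()
        # probs = softmax(session.run(None, {"input": x})[0][0])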

  • 00:55:00 In this section, the speaker discusses the process of measuring inference time for a neural network using ONNXRuntime in C++. He explains that he has added the inference portion to the previously discussed network and measured the inference time using the system clock. He then goes on to demonstrate how to use a custom logger in ONNXRuntime for profiling purposes. The speaker also briefly discusses a separate MobileNet-based model that was developed in collaboration with the University of Modena and the complexity involved in its pre-processing and post-processing.

  • 01:00:00 In this section, the speaker discusses the output of a detector and shows how it draws bounding boxes on objects such as bears and signals. They also mention the use of ONNXRuntime and how it allows for session options and tuning, including enabling profiling for performance optimization. The resulting tracing file can be inspected in detail to see how long it takes to initialize the model and run it on images, including which operators were used and which provider was chosen. They also mention ONNX's ability to optimize a graph before running it, which can improve performance and shorten the time it takes to load the model.

  • 01:05:00 In this section, the presenter talks about optimizing a model by enabling or disabling optimizations, which may impact the model's portability across different targets. They explore different levels of optimization and how each affects the model's performance. The presenter shows that enabling parallel execution mode may allow the model to utilize multiple threads, but it may not have a significant impact on performance in certain cases. They also mention the possibility of parallelizing the processing of multiple images using a utility. Finally, the presenter notes that optimizations can have a noticeable impact on the model's performance, as demonstrated by the reduced load time of optimized models.
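    The options discussed in the last few paragraphs map onto ONNX Runtime's SessionOptions; a Python sketch of the same knobs (the C++ API exposes equivalents):

        import onnxruntime as ort

        so = ort.SessionOptions()
        so.enable_profiling = True                                   # emit a JSON trace
        so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
        so.execution_mode = ort.ExecutionMode.ORT_PARALLEL           # parallel execution mode
        so.inter_op_num_threads = 4

        sess = ort.InferenceSession("model.onnx", sess_options=so)
        # ... run some inferences ...
        print(sess.end_profiling())   # path of the tracing-compatible JSON file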

  • 01:10:00 In this section, Marco Arena and Mattia Verasani discuss the benefits of using ONNX and ONNX Runtime in C++. One major advantage is the ability to have a single layer of inference that accepts ONNX as a format, allowing for portability and flexibility in using different frameworks for creating models. This feature is particularly useful in situations where different teams may be using various frameworks, and a standard inference pipeline is needed for production. Additionally, the use of ONNX Runtime in C++ can lead to faster and more optimized runtime performance for deep learning models. Overall, the ONNX ecosystem provides many options and opportunities for fine-tuning and optimizing the performance of deep learning models.

  • 01:15:00 In this section, the speakers discuss the benefits of using ONNX and ONNXRuntime in C++ for machine learning models, as it allows for flexibility in frameworks and easy conversion between them without sacrificing performance. They also mention that ONNXRuntime is supported on Linux and demonstrate how to use Python and Jupyter notebooks for prototyping and exporting models to ONNX format. They use a small tutorial as an example to show how to convert models from other frameworks to ONNX and highlight the usefulness of the Netron tool for visualizing the computational graph of the models. The speakers encourage viewers to ask questions and share knowledge about the tool.

  • 01:20:00 In this section, the speakers discuss the process of converting a model to ONNX format and running it in inference mode using ONNXRuntime in C++. They demonstrate how to create a computational graph and define the input and output dimensionalities, as well as how to use timeit to benchmark the performance of the model on CPU and GPU. They also showcase the use of the popular natural language processing model BERT, which is built on the transformer architecture and is implemented in the Hugging Face library. The speakers emphasize the importance of installing the correct package for using ONNXRuntime with CPU or GPU.

  • 01:25:00 In this section of the video, the presenter demonstrates how to convert a BERT model to ONNX format in Python. The process involves defining the inputs of the model and converting it using the "torch.onnx.export" function. The presenter explains that each ONNX opset version adds operators, which emphasizes the need to target an ONNX version whose opset contains the operators the specific model needs. Dynamic axes are also highlighted as an important feature to allow for dynamic input/output shapes, such as variable sequence lengths in natural language processing. Finally, the presenter shows a comparison between plain Python and ONNX in terms of inference performance.
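    A hedged sketch of that export with dynamic axes (model name and opset are placeholders; transformers and torch are assumed to be installed):

        import torch
        from transformers import AutoModelForSequenceClassification, AutoTokenizer

        name = "bert-base-uncased"
        model = AutoModelForSequenceClassification.from_pretrained(name)
        model.config.return_dict = False    # export a plain tuple of outputs
        model.eval()
        enc = AutoTokenizer.from_pretrained(name)("an example sentence",
                                                  return_tensors="pt")

        torch.onnx.export(
            model,
            (enc["input_ids"], enc["attention_mask"]),
            "bert.onnx",
            input_names=["input_ids", "attention_mask"],
            output_names=["logits"],
            dynamic_axes={                      # let batch and sequence length vary
                "input_ids": {0: "batch", 1: "sequence"},
                "attention_mask": {0: "batch", 1: "sequence"},
                "logits": {0: "batch"},
            },
            opset_version=14,
        )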

  • 01:30:00 In this section of the virtual meetup, the speakers share a demonstration of their work with ONNX and ONNXRuntime in C++. They showcase the performance improvement seen when a classic Python model is run through ONNXRuntime for inference. They also demonstrate the Netron viewer, which allows users to view the static computational graph of a model and see the operations being performed, including the expected input and output types. The tool also shows the version of ONNX used for conversion and the opset used during the conversion process. The speakers request feedback from viewers and provide a link for attendees to leave comments.

  • 01:35:00 In this section, Marco Arena and Mattia Verasani talk about their work studying platforms for embedded systems, including GPUs, FPGAs, and CPUs. They ran four neural networks for object detection on these embedded systems and analyzed the results in terms of power consumption and inference speed. They also discuss the importance of using one-stage detectors for embedded systems and provide links to repositories for ONNX and ONNXRuntime. They mention the potential benefits of benchmarking ONNXRuntime on GPUs and express interest in inviting Microsoft's ONNXRuntime team to participate in future events. Lastly, they invite viewers to attend their upcoming online event and future meetups.

  • 01:40:00 This section of the video talks about finding reasons not to return to in-person meetups, and how the organizers were lucky to have the attendees' indulgence for the online format. They also discuss upcoming plans for their virtual meetup series, which include dropping the Python section and focusing on the initial parts of the demonstration, based on the material shown by Arena and Verasani and on the ONNX and ONNX Runtime repositories. Links and official sources are provided for those looking for more information on the topic. Ultimately, they express hope to incorporate more networking time and keep the chat open for those interested in staying after dinner, while admitting the limitations of online meetups.
[Virtual meetup] Interoperable AI: ONNX e ONNXRuntime in C++ (M. Arena, M. Verasani)
  • 2020.10.22
  • www.youtube.com
Speakers: Marco Arena, Mattia Verasani. Slides: https://www.italiancpp.org/interoperable-ai-arena-verasani/ Demo: https://github.com/ilpropheta/onnxruntim...
 

[CppDay20] Interoperable AI: ONNX & ONNXRuntime in C++ (M. Arena, M.Verasani)



[CppDay20] Interoperable AI: ONNX & ONNXRuntime in C++ (M. Arena, M.Verasani)

The use of machine learning and deep learning algorithms is increasing, and there is a need for tools that can deploy these algorithms on different platforms. The ONNX tool provides interoperability between different frameworks and platforms, allowing developers to convert their algorithms from one framework to another and deploy them on different devices, even if they are not familiar with the specific framework or platform. ONNX Runtime is an inference engine that can leverage custom accelerators to accelerate models during the inference stage and can target a variety of hardware platforms. The speakers demonstrate the use of ONNX and ONNX Runtime in C++ programming, with examples of linear regression and neural network models. They also discuss the benefits of using ONNX and ONNX Runtime in fine-tuning a network's execution, optimizing loading time, and executing sequential images.

  • 00:00:00 In this section of the video, the speakers discuss the increasing use of machine learning and deep learning algorithms for various applications and the need for tools that can deploy these algorithms on different platforms. They introduce a tool called ONNX which provides interoperability between different frameworks and platforms. They explain how developers can use ONNX to convert their algorithms from one framework to another and deploy them on different devices, even if they are not familiar with the specific framework or platform. The speakers use the example of converting a Python algorithm to ONNX format and then to the Core ML framework to deploy on an Apple device. They emphasize the usefulness of ONNX in making deep learning and machine learning algorithms more accessible and deployable on a wide range of platforms.

  • 00:05:00 In this section, the speaker discusses ONNX and ONNX Runtime, which are tools that allow for interoperable AI. ONNX enables the transfer of models between different deep learning frameworks, such as PyTorch and Tensorflow, without requiring knowledge of each framework. ONNX Runtime, which is provided by Microsoft, is an inference engine that can leverage custom accelerators to accelerate models during the inference stage. It is able to target a variety of hardware platforms and does not require the user to create their own inference engine in C++.

  • 00:10:00 In this section, the speakers discuss the benefits of using the ONNX format for machine learning models and the interoperability it provides for different training frameworks. They explain the pipeline for developing deep learning algorithms, converting them into ONNX format, and using the ONNX runtime inference engine to run the model on different platforms and programming languages. The speakers also present performance graphs that show a significant improvement in the performances of the algorithms when using ONNX runtime, as compared to other frameworks such as PyTorch and scikit-learn. Finally, Marco takes over and talks about using the ONNX runtime engine in C++ programming.

  • 00:15:00 In this section, the speaker speaks about their experience with interoperability between machine learning frameworks and introduces the ONNX project as an important effort towards achieving this goal. They mention that they did not experience many conversion issues when converting models between frameworks, but the main issue arises when an operator is not supported in ONNX format or ONNX Runtime format. The speaker also answers a question about conversion issues and explains that operators not supported by ONNX can cause issues in conversion.

  • 00:20:00 In this section, the speakers discuss their experience converting TensorFlow to ONNX and mention that they have not seen many conversion issues. They also discuss debugging and troubleshooting when manipulating tensors in C++, and mention the use of other libraries, such as xtensor, or Python to do so. They introduce the entry point for ONNX, onnx.ai, which allows users to select their desired architecture and programming language, and demonstrate the use of ONNXRuntime in C++. They mention that the code is the same for the GPU, the only difference being the library linked.

  • 00:25:00 In this section, the presenter shows a demo of using ONNXRuntime to load, inspect and run inference on a model. He starts by creating an environment for the underlying API, with optional features such as customizing log or threading. He then creates a session that represents the inference to run on a particular model, which can be loaded either from a path or a byte stream. He demonstrates how to use an allocator to inspect the model's information, such as the number and names of inputs and outputs. He notes that this demo showcases the raw library and that in real-life situations, a wrapper would be used to avoid managing strings and other complexities.

  • 00:30:00 In this section, the speaker discusses a simple linear regression model and how to pass an input to the network without copying data, using an API called CreateTensor. The speaker emphasizes the value of dropping to the C API underneath the C++ API when the documentation is unclear. Additionally, they discuss the various options available when running the inference session, including partial output retrieval and customizing output names. Finally, they note that the output values are stored in a vector and are the same tensors allocated previously.

  • 00:35:00 In this section, the speaker discusses accessing data in C++ using the GetTensorMutableData function and the need to specify the element type being used due to type erasure. The example provided shows how to print values to standard output using this method. The speaker also mentions the need to be careful with the allocation of tensors and output buffers and how to use pre-allocated output buffers. The discussion then moves to running a linear model with the GPU execution provider by using the Microsoft.ML.OnnxRuntime.Gpu package instead of the default CPU package. Finally, the speaker briefly introduces two demo projects for vision networks: a ResNet classifier and a MobileNet detector. The demo code is similar to the previous example, and the speaker highlights the pre-processing and post-processing involved in these networks.

  • 00:40:00 In this section, the presenter demonstrates how to use ONNXRuntime to profile a neural network's execution using an external timer. By adding a profiling option during the session creation, ONNXRuntime can produce a JSON file that contains the execution time spent on each phase and the explosion of all the operators in the graph that had been executed. This tool can provide additional details like whether the model is running on CPU or GPU, or whether it is being executed sequentially or in parallel. Profiling can help in fine-tuning a network's execution and checking whether it is running on another accelerator.

  • 00:45:00 In this section, the speaker demonstrates the impact of graph optimization on the loading time and execution time of a model using ONNX and ONNXRuntime in C++. Disabling optimization results in a significantly longer execution time, while enabling it leads to a longer loading time. However, it is possible to save an optimized version of the model, which balances optimization against loading time. The speaker shows the audience how to optimize the model using the different available optimization levels and how to save the optimized model. Additionally, the speaker briefly touches on parallel execution and demonstrates how it can significantly reduce the processing time for a batch of images.
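
Those options map onto session settings roughly as in the following sketch (the level names are onnxruntime's enum values; the output path is made up):

```cpp
#include <onnxruntime_cxx_api.h>

Ort::SessionOptions make_options() {
    Ort::SessionOptions options;

    // graph optimization level: ORT_DISABLE_ALL, ORT_ENABLE_BASIC,
    // ORT_ENABLE_EXTENDED or ORT_ENABLE_ALL; higher levels cost more at
    // load time but usually speed up inference
    options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);

    // write the optimized graph back to disk so later runs can load it
    // directly and skip most of the optimization cost
    options.SetOptimizedModelFilePath("model_optimized.onnx");

    // let independent parts of the graph run concurrently
    options.SetExecutionMode(ExecutionMode::ORT_PARALLEL);
    options.SetInterOpNumThreads(4);

    return options;
}
```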

  • 00:50:00 In this section, the speaker discusses running images sequentially and the contention in the global thread pool, which increases the execution time of each image. They also mention using the profiling tool to refine the time measurements for single inputs and to obtain a per-image breakdown of all the operators executed. The speaker explains the use of the xtensor library for tensor manipulation, similar to numpy in Python, for the image preprocessing in the simpler ResNet classifier. The talk also distinguishes basic, intermediate, and advanced ONNX Runtime usage, with advanced features such as custom operators, memory arenas, and allocators. Training support and Python examples are also discussed, with links to the demo and slides provided.
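
As an illustration of that kind of xtensor-based preprocessing (not the talk's code), here is a sketch assuming a 224x224 RGB image already decoded into a flat HWC float buffer and the usual ImageNet normalization constants:

```cpp
#include <vector>
#include <xtensor/xarray.hpp>
#include <xtensor/xadapt.hpp>
#include <xtensor/xmanipulation.hpp>

// returns a CHW float buffer ready to be wrapped by Ort::Value::CreateTensor
std::vector<float> preprocess_resnet(const std::vector<float>& hwc_pixels)
{
    // view the flat HWC buffer as a 224x224x3 tensor without copying
    auto img = xt::adapt(hwc_pixels, std::vector<std::size_t>{224, 224, 3});

    // ImageNet-style normalization, broadcast over the channel axis
    xt::xarray<float> mean{0.485f, 0.456f, 0.406f};
    xt::xarray<float> stddev{0.229f, 0.224f, 0.225f};
    xt::xarray<float> normalized = (img / 255.0f - mean) / stddev;

    // reorder HWC -> CHW, the layout most ImageNet models expect
    xt::xarray<float> chw = xt::transpose(normalized, {2, 0, 1});

    // copy out in row-major (C, H, W) order
    return std::vector<float>(chw.begin(), chw.end());
}
```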

  • 00:55:00 In this section, the presenters discuss a benchmark they conducted on object detection algorithms, focusing on one-stage detectors, which are useful for embedded devices. They benchmarked across FPGAs, GPUs, and CPUs and found that NVIDIA devices, Intel Core i7 CPUs, and FPGAs were the best platforms for certain kinds of operations. They also mention that there is some support for training models with ONNX Runtime, although only from Python. When asked whether they would consider using ONNXRuntime in production, they state that they are already using it in testing and are transitioning it to production. They note that Microsoft also uses it in many projects, including Windows ML, and that it has been in development for three years.
[CppDay20] Interoperable AI: ONNX & ONNXRuntime in C++ (M. Arena, M.Verasani)
  • 2020.12.02
  • www.youtube.com
Event page: https://italiancpp.org/cppday20/
Slides: https://github.com/italiancpp/cppday20
ONNX is an open source format built to represent machine learnin...
 

Accelerating Machine Learning with ONNX Runtime and Hugging Face



Accelerating Machine Learning with ONNX Runtime and Hugging Face

The video "Accelerating Machine Learning with ONNX Runtime and Hugging Face" discusses the creation of Hugging Face's Optimum library, which focuses on accelerating transformer models from training to inference by easily applying ONNX runtime. The library simplifies the bridge between the transformer library and hardware acceleration, creating an easy-to-use toolkit for production performance. By applying the optimizations provided by ONNX Runtime, users can benefit from all hardware acceleration, resulting in faster inference pipelines. A collaboration within the Hugging Face community is enabling sequence-to-sequence model optimization using these accelerated inference pipeline classes, and an end-to-end example showed that using the Optimum Library can result in a 44% throughput increase or latency decrease while conserving 99.6% of the original model accuracy.

  • 00:00:00 In this section, Jeff from Hugging Face discusses the company's goal of making the power of transformer models accessible to every company in the world through readily available pre-trained models and tools. He explains that transfer learning and the "Attention Is All You Need" paper changed the field of machine learning, achieving breakthrough performance in natural language processing tasks and producing state-of-the-art results in every modality of machine learning. Jeff introduces the Optimum library, designed to accelerate transformer models by easily applying ONNX Runtime, making it easier for engineers and software developers to use these models in production.

  • 00:05:00 In this section, the speaker discusses the creation of the Hugging Face Optimum library, which is focused on accelerating transformer models from training to inference. The library offers a reference toolkit for hardware acceleration with high-level APIs dedicated to production performance. The ONNX Runtime package within Optimum provides native integration of DeepSpeed, a way to accelerate training. Optimum also offers the ORTOptimizer to simplify model graphs and the ORTQuantizer to quantize weights, and it targets specific execution providers to take advantage of hardware-specific optimizations. Overall, Optimum simplifies the bridge between the Transformers library and hardware acceleration, creating an easy-to-use toolkit for production performance.

  • 00:10:00 In this section, the speaker talks about optimizing machine learning models using ONNX Runtime and Hugging Face's Optimum library. By switching from the Transformers AutoModel class for a task to the corresponding Optimum ORTModel class, users can easily apply the optimizations provided by ONNX Runtime and benefit from hardware acceleration, resulting in faster inference pipelines. The Hugging Face community is also collaborating to enable sequence-to-sequence model optimization using these accelerated inference pipeline classes. The end-to-end example outlined in the accompanying blog post shows that using the Optimum library can result in a 44% throughput increase or latency decrease while conserving 99.6% of the original model accuracy.
Accelerating Machine Learning with ONNX Runtime and Hugging Face
  • 2022.07.13
  • www.youtube.com
Hugging Face has democratized state of the art machine learning with Transformers and the Hugging Face Hub, but deploying these large and complex models into...
Reason: