This past summer, we took part in many exciting events. One in particular stood out because two of our engineers delivered talks: PyCon Colombia 2022, an annual conference that promotes the Python programming language. The event featured expert Python programmers from LATAM and around the world who shared their expertise, knowledge, and experiences with the Python community.
We sat down with our PyCon speakers to learn more about why their techniques are an important part of machine learning today. We wanted to know the details of real-world use cases for these techniques and the main takeaways of their presentations. Our engineers take pride in sharing their ideas with the Python community and in having the opportunity to represent LATAM in this important discussion. In case you missed it, here's a recap of the talks our Factored engineers delivered, along with some additional questions we asked them. We hope you find them useful for your work.
Model size reduction techniques by Andrés V.
Presentation Overview:
In his presentation, Andrés described why it is necessary to use a light version of a model in production when deploying large deep learning models on edge devices. He focused specifically on model distillation and teacher-student approximation, and on how this technique can be used to create lighter versions of models that still perform well. At the end of his talk, he presented a model distillation use case for computer vision.
Why is model distillation an important part of ML today?
At the moment, bigger and bigger models are being built to harness the power of self-supervised techniques. Since 2018, we have gone from 100M-parameter models such as ELMo to models as large as BLOOM, which has around 176B parameters. Models with more than 300M parameters are notorious for being not only large but also slow and computationally expensive. This puts increased focus on techniques that allow these models to be used in constrained environments such as mobile devices.
Source: Towards Data Science
Why did you choose to give your talk on this topic?
I recently worked on a computer vision project that involved deploying a model on a mobile device. To meet this objective, we explored multiple model size reduction techniques, distillation being one of them. So, the topic was very timely.
Can you share real-world use cases for the techniques you described in your talk?
The applications of model distillation are wide. One well-known example is BERT, a model with 345 million parameters that has inspired a generation of language models and revolutionized the field of natural language understanding (NLU). Right now, many applications that involve NLP on edge devices use DistilBERT, the distilled version of BERT. Furthermore, Amazon is currently working on DQ-BART, a BART variant that combines model distillation with quantization.
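As a small illustration of how drop-in a distilled model can be, here is a minimal sketch that loads DistilBERT through the Hugging Face transformers library; the library choice and the masked sentence are our own example, not something covered in the talk.

```python
from transformers import pipeline

# DistilBERT is used just like BERT, only with a smaller checkpoint.
fill_mask = pipeline("fill-mask", model="distilbert-base-uncased")

# Illustrative query: predict the masked token.
print(fill_mask("Model distillation makes models [MASK]."))
```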
How has this technique helped you in your ML work so far?
Working with this technique showed me firsthand that using reduced-size models on edge devices makes inference faster and less computationally expensive.
What would you say is the main takeaway from your talk?
Model size reduction techniques allow you to take a model and reduce the number of parameters needed for inference. There are multiple techniques to accomplish this; distillation is one example. In its implementation, distillation takes a cumbersome teacher model and transfers its knowledge to a smaller student model by using a special loss function made up of two components (a mean squared error term and a cross-entropy term), as sketched below.
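To make that concrete, here is a minimal sketch of such a distillation loss in PyTorch, combining a mean squared error term against the teacher's logits with a cross-entropy term against the ground-truth labels. The weighting factor alpha and the usage comments are our own assumptions for illustration, not code from the talk.

```python
import torch.nn as nn
import torch.nn.functional as F


class DistillationLoss(nn.Module):
    """Loss with two components: MSE against the teacher's logits
    and cross entropy against the hard labels."""

    def __init__(self, alpha: float = 0.5):
        super().__init__()
        self.alpha = alpha  # assumed weighting between the two terms

    def forward(self, student_logits, teacher_logits, labels):
        distillation_term = F.mse_loss(student_logits, teacher_logits)
        hard_label_term = F.cross_entropy(student_logits, labels)
        return self.alpha * distillation_term + (1 - self.alpha) * hard_label_term


# Usage sketch: the teacher (cumbersome model) is frozen, only the student is trained.
# teacher_logits = teacher(batch).detach()
# loss = DistillationLoss(alpha=0.5)(student(batch), teacher_logits, labels)
```

The MSE term pushes the student to mimic the teacher's outputs, while the cross-entropy term keeps it anchored to the original task.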
Training PyTorch models on TPUs by Mateo Restrepo
Presentation Overview:
Mateo showed us why it is relevant to learn how to train neural networks using tensor processing units (TPUs) by showing examples of how computations work on CPUs, GPUs, and TPUs. He then described in detail how to use PyTorch to train neural networks on TPUs, specifically covering the modifications to the training workflow needed to make it compatible with TPUs.
Why is this an important part of ML today?
Training deep neural networks is a very computation-intensive task that can be parallelized easily on hardware such as graphics processing units (GPUs), with significant performance gains compared to central processing units (CPUs). However, GPUs are not specialized hardware for training deep learning models; they are simply convenient general-purpose processors that can execute very efficiently the parallel computations performed during the training stage of a deep learning model. In recent years, technology companies such as Google, Intel, AWS, and Tesla have been focusing on developing and building highly specialized hardware architectures with the purpose of increasing performance on machine learning tasks. Tensor processing units (TPUs) are one of these specialized chips, developed by Google. Furthermore, TPUs achieve much better energy efficiency than conventional chips, which means less power consumption in training and inference. This matters because the enormous deep learning models that power most hallmark products in big tech are extremely power hungry and consume incredible amounts of energy.
Source: OpenAI
Why did you choose to give your talk on this topic?
I was working on an NLP project with big models like BERT and RoBERTa, and training times on GPUs were slower than I'd hoped for. While researching methods to accelerate training, I came across TPUs and how to modify the training code to run on this hardware. Moreover, I found that with some minor changes to the PyTorch code, I could significantly accelerate training times, which for ML practitioners means faster iterations during ML experimentation and development.
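To give a feel for what those "minor changes" look like, here is a minimal sketch of a PyTorch training loop adapted to a single TPU core with the torch_xla package; the toy model, dummy data, and hyperparameters are placeholders of our own, not code from the talk.

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm  # PyTorch/XLA bridge for TPUs

# Change 1: ask XLA for the TPU device instead of torch.device("cuda").
device = xm.xla_device()

model = nn.Linear(128, 10).to(device)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):  # placeholder loop with dummy data
    x = torch.randn(32, 128, device=device)
    y = torch.randint(0, 10, (32,), device=device)

    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # Change 2: xm.optimizer_step replaces optimizer.step();
    # barrier=True flushes the lazily built XLA graph to the TPU.
    xm.optimizer_step(optimizer, barrier=True)
```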
Can you describe real-world use cases for the techniques you described in your talk?
Most of the large-scale ML models that serve Google's products, such as voice recognition, voice search, recommendations, translation, and computer vision, are trained and served for inference on TPUs. A hardware engineer from Google said the following about TPUs:
“The need for TPUs really emerged about six years ago, when we started using computationally expensive deep learning models in more and more places throughout our products. The computational expense of using these models had us worried. If we considered a scenario where people use Google voice search for just three minutes a day and we ran deep neural nets for our speech recognition system on the processing units we were using, we would have had to double the number of Google data centers!”
How has this technique helped you in your ML work so far?
Running training and fine-tuning workflows with large models is now faster, as I can simply adapt the code to run on TPUs and get the advantages of highly specialized hardware with very little overhead. This allows faster and more efficient ML experimentation, which in turn makes machine learning algorithm development and productionization easier.
What is the main takeaway from your talk?
With minor changes to the training code and access to TPUs (which are readily available on Google Cloud, Google Colab, Kaggle, etc.), you can significantly speed up the training times of large neural networks, roughly by 4x to 20x.
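Beyond the single-core setup, a Cloud TPU v2 or v3 board exposes eight cores, and torch_xla can spawn one training process per core to use them all. The sketch below shows only the scaffolding; the body of _mp_fn is a placeholder for a training loop like the one above, not code from the talk.

```python
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index):
    # Each spawned process is bound to its own TPU core.
    device = xm.xla_device()
    xm.master_print(f"process {index} training on {device}")
    # ... build the model, wrap the data loader, and run the
    # training loop here (one replica per core) ...


if __name__ == "__main__":
    # By default, spawn one process per available TPU core.
    xmp.spawn(_mp_fn, args=())
```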