# KerasCV and KerasNLP: Multi-framework Models

Journal of Machine Learning Research 25 (2024) 1-10. Submitted 3/24; Revised 9/24; Published 11/24.

Lead Authors: Matthew Watson, Divyashree Shivakumar Sreepathihalli, François Chollet, Martin Görner, Kiranbir Sodhia, Ramesh Sampath, Tirth Patel, Haifeng Jin, Neel Kovelamudi, Gabriel Rasskin, Samaneh Saadat, Luke Wood, Chen Qian, Jonathan Bischof, Ian Stenbit

{mattdangerw, divyasreepat, fchollet, mgorner, ksodhia}@google.com, {rameshsampath, tirthp, haifengj, nkovela, grasskin, ssaadat}@google.com, {lukewoodcs, qianchen94era, jbischof1}@gmail.com, ian@stenbit.com

Keras Team, Google, USA

Community Contributors: Abheesht Sharma, Anshuman Mishra

Editor: Joaquin Vanschoren

## Abstract

We present the Keras domain packages KerasCV and KerasNLP, extensions of the Keras API for Computer Vision and Natural Language Processing workflows, capable of running on JAX, TensorFlow, or PyTorch. These domain packages are designed to enable fast experimentation, with a focus on ease of use and performance. We adopt a modular, layered design: at the library's lowest level of abstraction, we provide building blocks for creating models and data preprocessing pipelines, and at the library's highest level of abstraction, we provide pretrained task models for popular architectures such as Stable Diffusion, YOLOv8, GPT2, BERT, Mistral, CLIP, Gemma, and T5. Task models have built-in preprocessing and pretrained weights, and can be fine-tuned on raw inputs. To enable efficient training, we support XLA compilation for all models, and run all preprocessing via a compiled graph of TensorFlow operations using the tf.data API. The libraries are fully open-source (Apache 2.0 license) and available on GitHub.

Keywords: KerasCV, KerasNLP, Keras multi-backend, Deep learning, Generative AI

## 1. Introduction

Keras (Chollet et al., 2015) is among the most widely used tools for machine learning today.¹
The Keras library acts as a high-level abstraction for machine learning models and layers, and seeks to be accessible to a broad group of machine learning researchers and practitioners by focusing on rapid experimentation and progressive disclosure of complexity.

¹ https://survey.stackoverflow.co/2022/

© 2024 Matthew Watson, Divyashree Shivakumar Sreepathihalli, François Chollet, Martin Görner, Kiranbir Sodhia, Ramesh Sampath, Tirth Patel, Haifeng Jin, Neel Kovelamudi, Gabriel Rasskin, Samaneh Saadat, Luke Wood, Chen Qian, Jonathan Bischof, Ian Stenbit, Abheesht Sharma, Anshuman Mishra. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/24-0404.html.

Recent developments in Computer Vision (CV) and Natural Language Processing (NLP) have created new challenges for practitioners. The most obvious is the shift towards ever larger models trained on self-supervised tasks. Pretraining a state-of-the-art model is now cost-prohibitive for many researchers and practitioners, particularly in NLP. Access to open-source model architectures with pretrained weights is therefore imperative for a large share of CV and NLP work.

Additionally, pairing efficient preprocessing and metrics computation for modern models has become more difficult, with a proliferation of disparate techniques, backends, and licenses. An ML researcher or practitioner today must select among a range of automatic differentiation frameworks such as JAX, TensorFlow, and PyTorch, and even within each framework, they are often forced to stay within a specific modeling library for cross-compatibility of components.

Further, improving the train-time performance of models on NLP problems presents additional hurdles. The XLA compiler (Sabne, 2020) offers dramatic speedups for many model architectures, but adds complex restrictions on the shape and flow of tensor operations.
The TensorFlow-based tf.data (Murray et al., 2021) and tf.text APIs provide a scalable, dynamic, and multi-process approach for preprocessing, but many common text operations do not easily compile to a TensorFlow graph.

Aiming to reduce these framework barriers for both practitioners and researchers, we present KerasCV and KerasNLP, extensions of the Keras API for CV and NLP workflows. These packages expand upon the modular approach of Keras, adding pretrained backbone models, easy-to-use domain-specific losses and metrics, and out-of-the-box support for XLA (Sabne, 2020) compilation and data and model parallelism. Because these domain packages are written on top of Keras 3, all of their modeling components natively support JAX (Bradbury et al., 2018), TensorFlow (Abadi et al., 2015), and PyTorch (Paszke et al., 2019), and can be freely used in framework-native workflows that do not otherwise involve any Keras components.

## 2. The Keras Domain Packages API

We adopt a layered approach to API design. Our library has three levels of abstraction:

- Foundational Components: A collection of composable modules for building and training preprocessing pipelines, models, and evaluation logic. These are pure Keras 3 components which can be used outside of the Keras Domain Packages ecosystem.
- Pretrained Backbones: We extend the common CV concept of a backbone, and use it as a general term for a pretrained model without a task-specific head. We provide a collection of pretrained model backbones for fine-tuning. For NLP models, matching tokenizers can be created alongside backbones.
- Task Models: A collection of end-to-end models specialized for a specific task, e.g. text generation in NLP or object detection in CV. These task models combine the preprocessing and modeling modules from the lower API levels to create a unified training and inference interface that can operate directly on plain text or image input.
Task models aim to allow fine-tuning with zero configuration for common use cases.

Each additional API layer is built on top of the previous one. Modules from each level can be mixed and matched; for example, a pretrained backbone can be extended with foundational preprocessing modules to pack input sequences or perform data augmentation.

Any KerasCV or KerasNLP model can be instantiated as a PyTorch torch.nn.Module, a TensorFlow tf.Module, or a stateless JAX function. This means that the models can be used with PyTorch ecosystem packages, with the full range of TensorFlow deployment and production tools (such as TF-Serving, TF.js and TFLite), and with JAX large-scale TPU training infrastructure.

## 3. Training, Serving, and Deployment

KerasCV and KerasNLP offer large vision and language models, and state-of-the-art models are expected to keep growing in size. To accommodate this growth, KerasCV and KerasNLP are compatible with the Keras Unified Distribution API (Qianli and others, 2023). This API enables both model parallelism and data parallelism across all Keras backends. The API maintains a clear separation between the model definition, training logic, and sharding configuration. As a result, models within KerasCV and KerasNLP can be written as if they were intended to run on a single device. Specific sharding configurations can then be added to these models when it's time to train them.

## 4. Pretrained models on Kaggle Models

All pretrained models of KerasCV and KerasNLP are published on Kaggle Models (https://www.kaggle.com/organizations/keras/models). Importantly, these models are also available in Kaggle competition notebooks in internet-off mode.

| | Segment Anything (train) | Segment Anything (predict) | Gemma (train) | Gemma (predict) | BERT (train) | BERT (predict) | Mistral (train) | Mistral (predict) |
|---|---|---|---|---|---|---|---|---|
| Batch size | 1 | 7 | 8 | 32 | 54 | 531 | 8 | 32 |
| Keras 2 (TF) | 386.93 | 3,187.09 | NA | NA | 841.84 | 965.21 | NA | NA |
| Keras 3 (TF) | 355.25 | 762.67 | 232.52 | 1,134.91 | 404.17 | 962.11 | 185.92 | 966.06 |
| Keras 3 (JAX) | 361.69 | 660.16 | 273.67 | 1,128.21 | 414.26 | 865.29 | 213.22 | 957.25 |
| Keras 3 (PT) | 1,388.87 | 2,973.64 | 525.15 | 7,952.67 | 1,320.441 | 3,869.72 | 452.12 | 10,932.59 |
| Keras 3 (best) | 355.25 | 660.16 | 232.52 | 1,128.21 | 404.17 | 865.29 | 185.92 | 957.25 |

Table 1: Average time taken (in ms/step) per training or inference step across different models, namely Segment Anything (Kirillov et al., 2023), Gemma (Team et al., 2024), BERT (Devlin et al., 2019) and Mistral (Jiang et al., 2023).

\* LLM inference with the PyTorch backend is abnormally slow at this time because KerasNLP uses static sequence padding. This will be addressed soon.

## 5. Performance

Framework performance depends on the specific model. Keras 3 offers flexibility by letting users select the fastest framework for their task. Picking the fastest backend for a given model consistently outperforms Keras 2, as seen in Table 1. All benchmarks were run on a single NVIDIA A100 GPU with 40GB of GPU memory, on a Google Cloud Compute Engine instance of machine type a2-highgpu-1g with 12 vCPUs and 85GB of host memory.

For a fair comparison, we use the same batch size across frameworks for the same model and task (fit or predict). Across different models and tasks, which differ in size and architecture, we use different batch sizes to avoid either running out of memory (batch too large) or under-utilizing the GPU (batch too small). We also use the same batch size for Gemma and Mistral, since they are the same model type with a similar number of parameters (see Table 1).

XLA-compiled Keras models in JAX and TensorFlow exhibit no overhead compared to equivalent code written without Keras. The resulting XLA graphs are virtually identical, ensuring identical performance.
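The comparison against Keras 2 in Table 1 can be made concrete with a little arithmetic. The sketch below is illustrative only: it transcribes the Table 1 step times for the four model/task pairs that have a Keras 2 measurement, and the helper names `best_backend` and `speedup_over_keras_2` are hypothetical, not part of any Keras API.

```python
# Step times in ms/step, transcribed from Table 1 for the four
# model/task pairs that also have a Keras 2 (TF) measurement.
KERAS_2_TF = {
    ("Segment Anything", "train"): 386.93,
    ("Segment Anything", "predict"): 3187.09,
    ("BERT", "train"): 841.84,
    ("BERT", "predict"): 965.21,
}

KERAS_3 = {
    ("Segment Anything", "train"): {"TF": 355.25, "JAX": 361.69, "PT": 1388.87},
    ("Segment Anything", "predict"): {"TF": 762.67, "JAX": 660.16, "PT": 2973.64},
    # The PT value for BERT training is kept exactly as printed in Table 1.
    ("BERT", "train"): {"TF": 404.17, "JAX": 414.26, "PT": 1320.441},
    ("BERT", "predict"): {"TF": 962.11, "JAX": 865.29, "PT": 3869.72},
}


def best_backend(timings):
    """Return (backend, ms_per_step) for the fastest Keras 3 backend."""
    backend = min(timings, key=timings.get)
    return backend, timings[backend]


def speedup_over_keras_2(key):
    """Ratio of the Keras 2 step time to the best Keras 3 step time."""
    _, best_ms = best_backend(KERAS_3[key])
    return KERAS_2_TF[key] / best_ms


if __name__ == "__main__":
    for (model, task), keras_2_ms in KERAS_2_TF.items():
        backend, best_ms = best_backend(KERAS_3[(model, task)])
        ratio = speedup_over_keras_2((model, task))
        print(f"{model} ({task}): best={backend} at {best_ms:.2f} ms/step, "
              f"{ratio:.2f}x faster than Keras 2 ({keras_2_ms:.2f} ms/step)")
```

Under these numbers, the best Keras 3 backend is between roughly 1.1x and 4.8x faster per step than Keras 2, and the fastest backend is not the same for every model and task.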
However, Keras 3 with PyTorch shows lower performance, because writing performant PyTorch requires heavy manual optimization on the part of the end user. Note that Keras models running on top of the JAX or TensorFlow backends are nearly always significantly faster than the same models written in native PyTorch. The benchmarks will continue to be updated at https://keras.io/getting_started/benchmarks/.

## 6. Related Work

A library with clear parallels to KerasNLP and KerasCV is the Hugging Face Transformers library (Wolf et al., 2020). Both libraries offer access to pretrained model checkpoints for a number of widely used transformer architectures. The Transformers library is built with a "repeat yourself" approach. KerasNLP, in contrast, is built with a layered approach, with an explicit goal of allowing the re-implementation of any large language model in a relatively small amount of code. We believe there are strengths and weaknesses to both of these approaches.

## 7. Future Work

Future efforts are directed towards consolidating KerasNLP and KerasCV into a unified repository, KerasHub (Watson et al., 2024), to simplify the development and maintenance of multimodal models. This initiative is already underway, and KerasHub has now been officially released. Moving forward, we will expand the repository by integrating additional multimodal models and enhancing fine-tuning capabilities.

## 8. Conclusions

KerasCV and KerasNLP are new toolboxes offering both modular components for rapid prototyping of new models and standard pretrained backbones and task models for many computer vision and natural language processing workflows. They can be leveraged by users of JAX (Bradbury et al., 2018), TensorFlow (Abadi et al., 2015), or PyTorch (Paszke et al., 2019). Thanks to backend optionality and XLA (Sabne, 2020) compilation, KerasCV and KerasNLP deliver state-of-the-art training and inference performance.
KerasCV and KerasNLP offer extensive user guides, available at keras.io.

## Acknowledgments

We thank all contributors to Keras (Chollet et al., 2015), KerasCV, KerasNLP, TensorFlow (Abadi et al., 2015), TensorFlow Text, TensorFlow Data, and the XLA (Sabne, 2020) compiler, all of which are crucial to the functionality provided in KerasCV and KerasNLP.

## Appendix A. Preprocessing Layers

KerasNLP offers a comprehensive suite of preprocessing layers that enable users to build state-of-the-art, industry-grade data augmentation pipelines for tasks such as text classification, text generation, language translation, and text feature extraction. These include tokenizers, samplers, and other data preprocessing layers. Below is an example demonstrating how to use KerasNLP's preprocessing layers.

    import keras_nlp

    # Apply the RandomSwap preprocessing layer to input data.
    # `input_data` is assumed to be defined by the caller.
    augmenter = keras_nlp.layers.RandomSwap(rate=0.4, seed=42)
    augmented_data = augmenter(input_data)

    # Example to demonstrate how to use a tokenizer
    vocab = ["[UNK]", "the", "qu", "##ick", "br", "##own", "fox", "."]
    inputs = ["The quick brown fox."]
    tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
        vocabulary=vocab,
        sequence_length=10,
        lowercase=True,
    )
    tokenized_outputs = tokenizer(inputs)

KerasCV provides a comprehensive suite of preprocessing layers that empower users to construct state-of-the-art, industry-grade data augmentation pipelines for image classification, object detection, image segmentation and image generation tasks. These layers implement a wide range of commonly used data augmentation techniques, helping users improve the robustness and generalizability of their models. By using preprocessing layers, users can ensure that their models are trained on data that is representative of the data that they will encounter at inference time. KerasCV offers 38 data augmentation layers.
They let users manipulate image data in a variety of ways and handle all types of labels out of the box (e.g. class labels, box labels, mask labels).

tf.data is a TensorFlow API for building input pipelines. Input pipelines are responsible for loading data from disk, preprocessing it, and batching it. tf.data provides a number of features that make it a powerful tool for preprocessing data for machine learning, such as:

- Dataset APIs: for loading data from a variety of sources, such as CSV files, TFRecords, and images.
- Preprocessing functions: for performing common preprocessing tasks, such as decoding, resizing, and normalizing images.
- Batching functions: for grouping data into batches.
- Prefetching and caching: for improving the performance of input pipelines.

The example below applies a KerasCV preprocessing layer to a batch of images.

    import keras
    import keras_cv

    # Applies grayscale preprocessing to input images.
    (images, labels), _ = keras.datasets.cifar10.load_data()
    to_grayscale = keras_cv.layers.preprocessing.Grayscale()
    augmented_images = to_grayscale(images)

## Appendix B. Preset API

The presets API provides a convenient way to create state-of-the-art CV and NLP models. Presets are pre-configured models that have been trained on a specific dataset and can be used for a specific task. To use the presets API, one simply needs to import the keras_cv.models or keras_nlp.models module and then call the from_preset() method on the desired model class.

The presets API provides a number of advantages over creating models from scratch. First, presets are pretrained on a large dataset, which means that they can achieve high accuracy on a variety of tasks. Second, presets are pre-configured, which means that users do not need to worry about setting hyperparameters. Third, presets are easy to use, which means that users can get started with them quickly.

    import keras_cv

    # Load architecture and weights from preset
    model = keras_cv.models.RetinaNet.from_preset(
        "resnet50_imagenet",
    )

    # Load randomly initialized model from the preset architecture
    model = keras_cv.models.RetinaNet.from_preset(
        "resnet50_imagenet",
        load_weights=False,
    )

## Appendix C. Backbone API

Both KerasCV and KerasNLP offer a Backbone API. Backbones can be thought of as the central architecture of a model, without the final output layer. This allows users to leverage powerful pretrained backbones (often trained on vast datasets) as the starting point for their own customized models. The pretrained backbones within KerasCV and KerasNLP are more than a fixed starting point: they can also be fine-tuned. Several examples of how to do this can be found on the keras.io website (Chollet et al., 2015).

    import keras_cv

    # Load backbone and weights from preset
    model = keras_cv.models.ResNetBackbone.from_preset(
        "resnet50_imagenet",
    )

    # Randomly initialized backbone with a custom config
    model = keras_cv.models.ResNetBackbone(
        stackwise_filters=[64, 128, 256, 512],
        stackwise_blocks=[2, 2, 2, 2],
        stackwise_strides=[1, 2, 2, 2],
        include_rescaling=False,
    )

## Appendix D. Task Models

KerasCV and KerasNLP provide a number of task models that are designed for specific tasks. These task models are built on top of the KerasCV and KerasNLP modeling layers and provide a high level of performance. These models are ready for use in applications, but can be further fine-tuned if desired. Some examples of available task models include image classification, object detection, semantic segmentation, image generation, text generation, text classification, and question answering.

Pretrained Task Models. Pretrained task models can be instantiated from presets trained on different datasets.
This allows users to quickly and easily get started with deep learning without having to train a model from scratch. For example, KerasCV provides a number of presets for image classification models that have been trained on different datasets, such as ImageNet, COCO, and Pascal VOC. These presets can be used to create models that can achieve state-of-the-art results on a variety of image classification tasks. To use a pretrained task model with a preset, one simply needs to import the keras_cv.models or keras_nlp.models module and then call the from_preset() method on the desired model class.

Specifying a Backbone in a Task Model. It is possible to specify a backbone for task models. This is done by passing the backbone argument to the task class constructor. By specifying a different backbone, users can change the features that are extracted. This can be useful if one wants to improve the performance of the model on a specific task. Users can also specify their own custom backbones. To do this, one simply needs to create a subclass of the models.Backbone class.

Fine-Tuning a Task Model. Fine-tuning a task model is the process of adapting a pretrained model to a specific task. This is done by training the model on a dataset of labeled data for the specific task.

    import keras_nlp

    # Example of a BERT classifier task model
    features = ["The quick brown fox jumped.", "I forgot my homework."]
    labels = [0, 3]

    # Pretrained classifier task model.
    classifier = keras_nlp.models.BertClassifier.from_preset(
        "bert_base_en",
        num_classes=4,
    )
    classifier.fit(x=features, y=labels, batch_size=2)
    classifier.predict(x=features, batch_size=2)

The task model can also be customized completely, from the tokenizer up:

    import keras_nlp

    # Completely customize the task model
    features = ["The quick brown fox jumped.", "I forgot my homework."]
    labels = [0, 3]
    vocab = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
    vocab += ["The", "quick", "brown", "fox", "jumped", "."]

    # Custom tokenizer
    tokenizer = keras_nlp.models.BertTokenizer(
        vocabulary=vocab,
    )

    # Custom preprocessor
    preprocessor = keras_nlp.models.BertClassifierPreprocessor(
        tokenizer=tokenizer,
        sequence_length=128,
    )

    # Custom backbone
    backbone = keras_nlp.models.BertBackbone(
        vocabulary_size=30552,
        num_layers=4,
        num_heads=4,
        hidden_dim=256,
        intermediate_dim=512,
        max_sequence_length=128,
    )

    # Custom task model
    classifier = keras_nlp.models.BertClassifier(
        backbone=backbone,
        preprocessor=preprocessor,
        num_classes=4,
    )
    classifier.fit(x=features, y=labels, batch_size=2)

An extensive list of models offered by KerasCV and KerasNLP can be found at https://keras.io/api/keras_cv/models/ and https://keras.io/api/keras_nlp/models/. Pretrained weights are available on Kaggle at https://www.kaggle.com/organizations/keras/models.

## Appendix E. Resources

To help users explore the libraries through hands-on experimentation, we provide a comprehensive collection of guides and illustrative examples. KerasCV guides are accessible at https://keras.io/guides/keras_cv/, while KerasNLP guides can be found at https://keras.io/guides/keras_nlp/. For practical demonstrations, KerasCV example guides are located at https://keras.io/examples/vision/, and KerasNLP example guides can be found at https://keras.io/examples/nlp/.

Furthermore, our pretrained models are hosted on Kaggle at https://www.kaggle.com/organizations/keras/models. Each model is accompanied by a detailed model card that describes the model and gives illustrative examples of how to use it.

## References

Martín Abadi, Ashish Agarwal, Paul Barham, and others.
TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.

James Bradbury, Roy Frostig, and others. JAX: Composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

François Chollet et al. Keras, 2015. URL https://keras.io.

Jacob Devlin, Ming-Wei Chang, and others. BERT: Pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the NAACL-HLT 2019, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.

Albert Q. Jiang, Alexandre Sablayrolles, and others. Mistral 7B, 2023.

Alexander Kirillov, Eric Mintun, and others. Segment anything. In Proceedings of the IEEE/CVF ICCV, pages 4015-4026, October 2023.

Derek Gordon Murray, Jiri Simsa, Ana Klimovic, and Ihor Indyk. tf.data: A machine learning data processing framework. CoRR, abs/2101.12127, 2021. URL https://arxiv.org/abs/2101.12127.

Adam Paszke, Sam Gross, and others. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS 2019, pages 8024-8035. Curran Associates, Inc., 2019. URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.

Zhu Qianli and others. Keras distribution, 2023. URL https://keras.io/api/distribution/.

Amit Sabne. XLA: Compiling machine learning for peak performance, 2020.

Gemma Team, Thomas Mesnard, and others. Gemma: Open models based on Gemini research and technology, 2024.

Matthew Watson, François Chollet, Divyashree Sreepathihalli, Samaneh Saadat, Ramesh Sampath, Gabriel Rasskin, Scott Zhu, Varun Singh, Luke Wood, Zhenyu Tan, Ian Stenbit, Chen Qian, Jonathan Bischof, et al. KerasHub. https://github.com/keras-team/keras-hub, 2024.

Thomas Wolf, Lysandre Debut, and others. Transformers: State-of-the-art natural language processing. In EMNLP 2020, pages 38-45, 2020.