Getting Started Using Phi-3-mini-4k-instruct-onnx for Text Generation with NLP Techniques (2024)

The Phi-3 mini models are lightweight, instruction-tuned language models from Microsoft. The short context version, Phi-3-mini-4k-instruct-onnx, supports a context length of 4K tokens, while the long context (128K token) version can accept much longer prompts and produce longer output text.

In this tutorial, we will use the short context version of the Phi-3 ONNX models (Phi-3-mini-4k-instruct-onnx), using the model published on Hugging Face.

Before we begin, you need to install the Git Large File Storage (LFS) extension and the Hugging Face CLI; both are required for downloading the ONNX models. This tutorial focuses on running the model on the CPU. If you have a GPU, you can use the DirectML or NVIDIA CUDA setup for better performance, depending on your operating system.

Setting up your Python Environment

Navigate to your project directory using the cd command.
For example:

cd path/to/your/project

Create a new virtual environment by running the following command:

python -m venv .venv

This will create a .venv directory in your project folder, containing an isolated Python environment.

Activate the virtual environment

On Windows:

.venv\Scripts\activate

On macOS/Linux:

source .venv/bin/activate

You’ll see the virtual environment name in your command prompt (e.g., (.venv)). Now you can install Python packages specific to your project without affecting the global Python installation.
If you prefer a different name for the virtual environment, replace .venv in the commands above with your chosen name.
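
To double-check that the virtual environment is active, you can ask Python which interpreter prefix it is using (the path it prints is illustrative and will differ on your machine):

python -c "import sys; print(sys.prefix)"

The printed path should end with .venv (or whatever name you chose for the environment).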

Run the Phi-3-mini-4k-instruct-onnx model on CPU, or on GPU with DirectML or NVIDIA CUDA

Prerequisites: Install Git Large File Storage (LFS)

For Windows
First, install some prerequisites.
Git LFS can be installed with winget; see "Use the winget tool to install and manage applications" on Microsoft Learn for details.

After App Installer is installed, you can run winget by typing 'winget' from a Command Prompt.

winget install -e --id GitHub.GitLFS

For macOS

brew install git-lfs

For Linux

sudo apt-get install git-lfs

We now need to initialize Git LFS:

git lfs install
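
To confirm Git LFS is set up correctly, you can check its version (the exact version number will differ on your system):

git lfs version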

Deploying the Phi-3 model from Hugging Face

Install the Hugging Face CLI

pip install huggingface-hub[cli]

Now we are going to download the Phi-3 model and run it on the device's CPU.

Downloading Phi-3 from Hugging Face

Download the Phi-3-mini-4k-instruct-onnx model. Below is a batch script that downloads the version of the Phi-3 model you choose. You can save this script with a .bat extension (e.g., download_phi3_model.bat) and run it:

@echo off
setlocal

REM Select which model to download
echo.
echo Choose an option:
echo 1. Download the Phi-3 Model for CPU
echo 2. Download the Phi-3 Model for Nvidia Cuda
echo 3. Download the Phi-3 Model for DirectML
set /p option=Enter the option number: 

if "%option%"=="1" (
    huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* --local-dir .
) else if "%option%"=="2" (
    huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include cuda/cuda-int4-rtn-block-32/* --local-dir .
) else if "%option%"=="3" (
    huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include directml/* --local-dir .
) else (
    echo Invalid option. Please choose 1, 2, or 3.
)

endlocal

The CPU option downloads the model into a folder called cpu_and_mobile (the CUDA and DirectML options download into cuda and directml folders, respectively).
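
The download script above is Windows-only. On macOS or Linux you can run the equivalent Hugging Face CLI command directly in your shell; the example below fetches the CPU variant used in the rest of this tutorial (the --include pattern is quoted so the shell does not expand it):

huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include "cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/*" --local-dir .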


Below is a batch script that lets you select which ONNX Runtime package to install. Save this script with a .bat extension (e.g., install_onnx_runtime.bat) and run it:

@echo off
setlocal

REM Install the numpy library
pip install numpy

REM Pick which ONNX runtime to install
echo.
echo Choose an option:
echo 1. For CPU (onnxruntime-genai)
echo 2. For GPU (onnxruntime-genai-cuda)
echo 3. For DirectML (onnxruntime-genai-directml)
set /p option=Enter the option number: 

if "%option%"=="1" (
    pip install --pre onnxruntime-genai
) else if "%option%"=="2" (
    pip install --pre onnxruntime-genai-cuda --index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-genai/pypi/simple/
) else if "%option%"=="3" (
    pip install --pre onnxruntime-genai-directml
) else (
    echo Invalid option. Please choose 1, 2, or 3.
)

endlocal
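
Whichever option you choose, you can verify that the package installed correctly by importing it from Python; if the import fails, re-run the script and pick the option that matches your hardware:

python -c "import onnxruntime_genai; print('onnxruntime-genai imported successfully')"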

Run the model using a Python script, with a command-line switch for model selection

import onnxruntime_genai as og
import argparse
import time

def main(args):
    # If verbose mode is on, print loading model message
    if args.verbose:
        print("Loading model...")

    # If timings mode is on, initialize timing variables
    if args.timings:
        started_timestamp = 0
        first_token_timestamp = 0

    # Load the model
    model = og.Model(f'{args.model}')
    if args.verbose:
        print("Model loaded")

    # Initialize the tokenizer with the model
    tokenizer = og.Tokenizer(model)
    tokenizer_stream = tokenizer.create_stream()
    if args.verbose:
        print("Tokenizer created")

    # Print a newline for readability if verbose mode is on
    if args.verbose:
        print()

    # Create a dictionary of search options from the command line arguments
    search_options = {name: getattr(args, name) for name in
                      ['do_sample', 'max_length', 'min_length', 'top_p', 'top_k',
                       'temperature', 'repetition_penalty'] if name in args}

    # Set a default max length if one is not provided
    if 'max_length' not in search_options:
        search_options['max_length'] = 2048

    # Define a template for the chat input
    chat_template = '<|user|>\n{input} <|end|>\n<|assistant|>'

    # Main loop: ask for input and generate responses
    while True:
        # Get user input
        text = input("Input: ")

        # If the input is empty, print an error message and continue to the next iteration
        if not text:
            print("Error, input cannot be empty")
            continue

        # If timings mode is on, record the start time
        if args.timings:
            started_timestamp = time.time()

        # Format the input with the chat template
        prompt = f'{chat_template.format(input=text)}'

        # Tokenize the input
        input_tokens = tokenizer.encode(prompt)

        # Set up the generator parameters
        params = og.GeneratorParams(model)
        params.try_use_cuda_graph_with_max_batch_size(1)
        params.set_search_options(**search_options)
        params.input_ids = input_tokens

        # Create the generator
        generator = og.Generator(model, params)
        if args.verbose:
            print("Generator created")

        # Print a message if verbose mode is on
        if args.verbose:
            print("Running generation loop ...")

        # If timings mode is on, initialize variables for the generation loop
        if args.timings:
            first = True
            new_tokens = []

        # Print the output prompt
        print()
        print("Output: ", end='', flush=True)

        # The token generation loop and the command-line argument parsing
        # are shown in the complete, runnable script later in this article.

If you install the requirements for DirectML, CUDA, or CPU support, you can run the Python file above with the following switches.

For CPU

python filename.py -m cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4

For DirectML

python filename.py -m directml\directml-int4-awq-block-128

For CUDA

python filename.py -m cuda/cuda-int4-rtn-block-32 

Running the model with a simple batch file
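
If you do not want to type the switches by hand, you can wrap the three commands above in a selection script, in the same style as the download and install scripts earlier. The sketch below assumes the Python script in the next section has been saved as phi3_chat.py; that file name is only an example, so adjust it to whatever name you use:

@echo off
setlocal

REM Pick which model folder to pass to the Python script
echo.
echo Choose an option:
echo 1. Run on CPU
echo 2. Run on Nvidia Cuda
echo 3. Run on DirectML
set /p option=Enter the option number: 

REM phi3_chat.py is an example name for the script shown below
if "%option%"=="1" (
    python phi3_chat.py -m cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4
) else if "%option%"=="2" (
    python phi3_chat.py -m cuda/cuda-int4-rtn-block-32
) else if "%option%"=="3" (
    python phi3_chat.py -m directml\directml-int4-awq-block-128
) else (
    echo Invalid option. Please choose 1, 2, or 3.
)

endlocal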

Below is the runnable Python script based on the code above. You can save this script to a .py file and execute it. Make sure to pass --model (or -m) with the actual path to your ONNX model folder. You can run this script using python your_script_name.py.

import onnxruntime_genai as og
import argparse
import time

def main(args):
    # If verbose mode is on, print loading model message
    if args.verbose:
        print("Loading model...")

    # If timings mode is on, initialize timing variables
    if args.timings:
        started_timestamp = 0
        first_token_timestamp = 0

    # Load the model
    model = og.Model(f'{args.model}')
    if args.verbose:
        print("Model loaded")

    # Initialize the tokenizer with the model
    tokenizer = og.Tokenizer(model)
    tokenizer_stream = tokenizer.create_stream()
    if args.verbose:
        print("Tokenizer created")

    # Print a newline for readability if verbose mode is on
    if args.verbose:
        print()

    # Create a dictionary of search options from the command line arguments
    search_options = {name: getattr(args, name) for name in
                      ['do_sample', 'max_length', 'min_length', 'top_p', 'top_k',
                       'temperature', 'repetition_penalty'] if name in args}

    # Set a default max length if one is not provided
    if 'max_length' not in search_options:
        search_options['max_length'] = 2048

    # Define a template for the chat input
    chat_template = '<|user|>\n{input} <|end|>\n<|assistant|>'

    # Main loop: ask for input and generate responses
    while True:
        # Get user input
        text = input("Input: ")

        # If the input is empty, print an error message and continue to the next iteration
        if not text:
            print("Error, input cannot be empty")
            continue

        # If timings mode is on, record the start time
        if args.timings:
            started_timestamp = time.time()

        # Format the input with the chat template
        prompt = f'{chat_template.format(input=text)}'

        # Tokenize the input
        input_tokens = tokenizer.encode(prompt)

        # Set up the generator parameters
        params = og.GeneratorParams(model)
        params.try_use_cuda_graph_with_max_batch_size(1)
        params.set_search_options(**search_options)
        params.input_ids = input_tokens

        # Create the generator
        generator = og.Generator(model, params)
        if args.verbose:
            print("Generator created")

        # Print a message if verbose mode is on
        if args.verbose:
            print("Running generation loop ...")

        # If timings mode is on, initialize variables for the generation loop
        if args.timings:
            first = True
            new_tokens = []

        # Print the output prompt
        print()
        print("Output: ", end='', flush=True)

        # Generation loop: stream tokens until the model signals it is done
        try:
            while not generator.is_done():
                generator.compute_logits()
                generator.generate_next_token()

                # Record the time to first token if timings mode is on
                if args.timings and first:
                    first_token_timestamp = time.time()
                    first = False

                # Decode and print the newly generated token
                new_token = generator.get_next_tokens()[0]
                print(tokenizer_stream.decode(new_token), end='', flush=True)
                if args.timings:
                    new_tokens.append(new_token)
        except KeyboardInterrupt:
            print("  --control+c pressed, aborting generation--")

        print()
        print()

        # Free the generator before the next prompt
        del generator

        # If timings mode is on, report prompt and generation throughput
        if args.timings:
            prompt_time = first_token_timestamp - started_timestamp
            run_time = time.time() - first_token_timestamp
            print(f"Prompt length: {len(input_tokens)}, New tokens: {len(new_tokens)}, "
                  f"Time to first: {prompt_time:.2f}s, "
                  f"Prompt tokens per second: {len(input_tokens)/prompt_time:.2f} tps, "
                  f"New tokens per second: {len(new_tokens)/run_time:.2f} tps")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run the chatbot script")
    parser.add_argument("-m", "--model", type=str, required=True, help="Path to the ONNX model folder")
    parser.add_argument("--verbose", action="store_true", help="Enable verbose mode")
    parser.add_argument("--timings", action="store_true", help="Enable timings mode")
    args = parser.parse_args()
    main(args)
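
Assuming the script is saved as phi3_chat.py (again, just an example name) and the CPU model has been downloaded, a session could look like the following; the generated text will vary from run to run:

python phi3_chat.py -m cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4 --verbose --timings
Input: Write a haiku about ONNX.
Output: ...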

In conclusion, the Phi-3 mini models are capable small language models for text generation. They can be run on a variety of devices, including CPUs and GPUs. By following the instructions in this tutorial, you can download and run these models on your own computer.
