Managed Inference - Concepts
Allowed IPs
Allowed IPs are single IPs or IP blocks that have the required permissions to remotely access a deployment. They let you define which hosts and networks can connect to your Managed Inference endpoints. You can add, edit, or delete allowed IPs. In the absence of allowed IPs, all IP addresses are allowed by default.
Access control is handled directly at the network level by Load Balancers, making filtering more efficient and universal and relieving the Managed Inference server of this task.
Context size
The context size is the maximum amount of input text, measured in tokens, that a Large Language Model (LLM) can take into account when generating predictions or responses. It is crucial in determining the model's understanding of the given prompt or query: input beyond the context size cannot influence the output.
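As a minimal sketch, you can estimate a prompt's token count before sending it. This assumes the tiktoken library is installed and that the cl100k_base encoding approximates the deployed model's tokenizer; the 8192-token limit is an illustrative value, not a Managed Inference guarantee.

```python
# Check a prompt against an assumed context size before sending it.
import tiktoken

CONTEXT_SIZE = 8192  # illustrative context window, in tokens

enc = tiktoken.get_encoding("cl100k_base")
prompt = "Summarize the following report: ..."
tokens = enc.encode(prompt)

if len(tokens) > CONTEXT_SIZE:
    # Truncate (or chunk) the input so it fits the context window.
    tokens = tokens[:CONTEXT_SIZE]
    prompt = enc.decode(tokens)

print(f"Prompt uses {len(tokens)} of {CONTEXT_SIZE} tokens")
```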
Deployment
A deployment makes a trained language model available for real-world applications. It encompasses tasks such as integrating the model into existing systems, optimizing its performance, and ensuring scalability and reliability.
Embedding models
Embedding models convert textual data into numerical vectors through representation learning. These vectors capture semantic information about the text and are often used as input to downstream machine-learning models or algorithms.
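The sketch below shows how embeddings are typically produced and compared, assuming an OpenAI-compatible endpoint; the base_url, api_key, and model name are placeholders, not actual Managed Inference values.

```python
import numpy as np
from openai import OpenAI

client = OpenAI(base_url="https://<your-deployment>/v1", api_key="<key>")

# Embed two texts in a single request.
resp = client.embeddings.create(
    model="<embedding-model>",
    input=["How do I reset my password?", "Password reset instructions"],
)
a, b = (np.array(item.embedding) for item in resp.data)

# Cosine similarity: texts with similar meaning score close to 1.0.
similarity = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"cosine similarity: {similarity:.3f}")
```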
Endpoint
In the context of LLMs, an endpoint refers to a network-accessible URL or interface through which clients can interact with the model for inference tasks. It exposes methods for sending input data and receiving model predictions or responses.
Fine-tuning
Fine-tuning involves further training a pre-trained language model on domain-specific or task-specific data to improve performance on a particular task. This process often includes updating the model’s parameters using a smaller, task-specific dataset.
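As a rough illustration of the process, here is a minimal causal-LM fine-tuning sketch using the Hugging Face transformers and datasets libraries; the distilgpt2 model and the train.txt corpus are illustrative placeholders, unrelated to any Managed Inference API.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# Task-specific corpus: one training example per line of text.
dataset = load_dataset("text", data_files={"train": "train.txt"})["train"]
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
    # mlm=False makes the collator build next-token-prediction labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # updates the pre-trained weights on the new data
```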
Few-shot prompting
Few-shot prompting guides a language model with minimal input, relying on just a handful of examples included directly in the prompt. It demonstrates the model's ability to generalize from those few examples to produce coherent and contextually relevant outputs.
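A minimal sketch over a chat completions API follows, assuming an OpenAI-compatible endpoint; base_url, api_key, and the model name are placeholders. The two worked example pairs steer the model toward the desired output format.

```python
from openai import OpenAI

client = OpenAI(base_url="https://<your-deployment>/v1", api_key="<key>")

messages = [
    {"role": "system",
     "content": "Classify the sentiment as positive or negative."},
    # Few-shot examples: demonstrations the model generalizes from.
    {"role": "user", "content": "The service was fantastic."},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "I waited two hours and left."},
    {"role": "assistant", "content": "negative"},
    # The actual query.
    {"role": "user", "content": "The product exceeded my expectations."},
]

resp = client.chat.completions.create(model="<model>", messages=messages)
print(resp.choices[0].message.content)  # expected: "positive"
```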
Function calling
Function calling allows a large language model (LLM) to interact with external tools or APIs, executing specific tasks based on user requests. The LLM identifies the appropriate function, extracts the required parameters, and returns the results as structured data, typically in JSON format.
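The sketch below illustrates the flow, assuming an OpenAI-compatible endpoint that supports tools; the endpoint details and the get_weather tool are hypothetical placeholders.

```python
import json
from openai import OpenAI

client = OpenAI(base_url="https://<your-deployment>/v1", api_key="<key>")

# Describe the external function the model is allowed to call.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="<model>",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

# The model returns structured arguments instead of free text.
call = resp.choices[0].message.tool_calls[0]
print(call.function.name)                   # "get_weather"
print(json.loads(call.function.arguments))  # {"city": "Paris"}
```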
Hallucinations
Hallucinations in LLMs refer to instances where generative AI models generate responses that, while grammatically coherent, contain inaccuracies or nonsensical information. These inaccuracies are termed “hallucinations” because the models create false or misleading content. Hallucinations can occur because of constraints in the training data, biases embedded within the models, or the complex nature of language itself.
Inference
Inference is the process of deriving logical conclusions or predictions from available data. This concept involves using statistical methods, machine learning algorithms, and reasoning techniques to make decisions or draw insights based on observed patterns or evidence. Inference is fundamental in various AI applications, including natural language processing, image recognition, and autonomous systems.
JSON mode
JSON mode allows you to guide the language model in outputting well-structured JSON data.
To activate JSON mode, provide the `response_format` parameter with `{"type": "json_object"}`.
JSON mode is useful for applications like chatbots or APIs, where a machine-readable format is essential for easy processing.
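A minimal sketch of JSON mode, assuming an OpenAI-compatible endpoint; base_url, api_key, and the model name are placeholders.

```python
import json
from openai import OpenAI

client = OpenAI(base_url="https://<your-deployment>/v1", api_key="<key>")

resp = client.chat.completions.create(
    model="<model>",
    # Mention JSON in the prompt and set response_format as described above.
    messages=[{"role": "user",
               "content": "List three EU capitals as a JSON array "
                          "under the key 'capitals'."}],
    response_format={"type": "json_object"},
)

data = json.loads(resp.choices[0].message.content)  # parseable JSON
print(data["capitals"])
```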
Large Language Model Applications
LLM Applications are applications or software tools that leverage the capabilities of LLMs for various tasks, such as text generation, summarization, or translation. These apps provide user-friendly interfaces for interacting with the models and accessing their functionalities.
Large Language Models
LLMs are advanced artificial intelligence systems capable of understanding and generating human-like text on various topics. These models, such as Llama-3, are trained on vast amounts of data to learn the patterns and structures of language, enabling them to generate coherent and contextually relevant responses to queries or prompts. LLMs have applications in natural language processing, text generation, translation, and other tasks requiring sophisticated language understanding and production.
Prompt
In the context of generative AI models, a prompt refers to the input provided to the model to generate a desired response. It typically consists of a sentence, paragraph, or series of keywords or instructions that guide the model in producing text relevant to the given context or task. The quality and specificity of the prompt greatly influence the generated output, as the model uses it to understand the user’s intent and create responses accordingly.
Quantization
Quantization is a technique used to reduce the precision of numerical values in a model’s parameters or activations to improve efficiency and reduce memory footprint during inference. It involves representing floating-point values with fewer bits while minimizing the loss of accuracy.
AI models provided for deployment are named with suffixes that denote their quantization levels, such as `:int8`, `:fp8`, and `:fp16`.
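A toy sketch of the arithmetic behind symmetric int8 quantization, for illustration only; real inference engines apply this per-tensor or per-channel with calibrated scales.

```python
import numpy as np

weights = np.array([0.82, -1.37, 0.05, 2.41], dtype=np.float32)

scale = np.abs(weights).max() / 127          # map the largest value to +/-127
q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
dequantized = q.astype(np.float32) * scale   # approximate reconstruction

print(q)                      # int8 values, 4x smaller than float32
print(dequantized - weights)  # small rounding error per weight
```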
Retrieval Augmented Generation (RAG)
RAG is an architecture combining information retrieval elements with language generation to enhance the capabilities of LLMs. It involves retrieving relevant context or knowledge from external sources and incorporating it into the generation process to produce more informative and contextually grounded outputs.
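A minimal RAG sketch: embed a query, retrieve the closest document, and ground the generation on it. This assumes an OpenAI-compatible endpoint; base_url, api_key, the model names, and the two-document store are placeholders.

```python
import numpy as np
from openai import OpenAI

client = OpenAI(base_url="https://<your-deployment>/v1", api_key="<key>")

documents = [
    "Invoices are emailed on the first business day of each month.",
    "Support is available 24/7 via the web console.",
]

def embed(texts):
    resp = client.embeddings.create(model="<embedding-model>", input=texts)
    return np.array([item.embedding for item in resp.data])

doc_vectors = embed(documents)
query = "When do I receive my invoice?"
q_vec = embed([query])[0]

# Retrieve: rank documents by cosine similarity to the query.
scores = doc_vectors @ q_vec / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_vec))
context = documents[int(scores.argmax())]

# Generate: the retrieved context grounds the model's answer.
resp = client.chat.completions.create(
    model="<model>",
    messages=[{"role": "user",
               "content": f"Context: {context}\n\nQuestion: {query}"}],
)
print(resp.choices[0].message.content)
```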
Structured outputs
Structured outputs enable you to format the model's responses to suit specific use cases. To activate structured outputs, provide the `response_format` parameter with `"type": "json_schema"` and define its `"json_schema": {}`.
By customizing the structure, such as using lists, tables, or key-value pairs, you ensure that the data returned is in a form that is easy to extract and process.
By specifying the expected response format through the API, you can make the model consistently deliver the output your system requires.
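A minimal structured-outputs sketch, assuming an OpenAI-compatible endpoint; base_url, api_key, the model name, and the event schema are placeholders.

```python
import json
from openai import OpenAI

client = OpenAI(base_url="https://<your-deployment>/v1", api_key="<key>")

resp = client.chat.completions.create(
    model="<model>",
    messages=[{"role": "user",
               "content": "Extract the event from: "
                          "'Team sync on Friday at 10am.'"}],
    # Constrain the response to a caller-defined JSON schema.
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "event",
            "schema": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "day": {"type": "string"},
                    "time": {"type": "string"},
                },
                "required": ["title", "day", "time"],
            },
        },
    },
)

event = json.loads(resp.choices[0].message.content)
print(event)  # e.g. {"title": "Team sync", "day": "Friday", "time": "10am"}
```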