
Managing context effectively is an essential challenge when working with large language models, especially in environments like Google Colab, where resource constraints and long documents can quickly exceed the available token window. In this tutorial, we guide you through a practical implementation of the Model Context Protocol (MCP) by building a ModelContextManager that automatically chunks incoming text, generates semantic embeddings using Sentence-Transformers, and scores each chunk based on recency, importance, and relevance. You will learn how to integrate this manager with a Hugging Face sequence-to-sequence model, demonstrated here with FLAN-T5, to add, optimize, and retrieve only the most pertinent pieces of context. Along the way, we cover token counting with a GPT-2 tokenizer, context-window optimization strategies, and interactive sessions that let you query and visualize your dynamic context in real time.
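The classes below install missing packages on the fly, but in Colab it is usually cleaner to install everything up front. A minimal setup cell along these lines should work (package versions are left unpinned and are an assumption, not a requirement of the tutorial):

# Install the libraries used throughout this tutorial (Colab shell command).
!pip install -q torch transformers sentence-transformers matplotlib pandas tqdm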
import torch
import numpy as np
from typing import List, Dict, Any, Optional, Union, Tuple
from dataclasses import dataclass
import time
import gc
from tqdm.notebook import tqdm
We import the essential libraries for building a dynamic context manager: torch and numpy handle tensor and numerical operations, while typing and dataclasses provide structured type annotations and data containers. Utility modules such as time and gc support timestamping and memory cleanup, and tqdm.notebook offers interactive progress bars for chunk processing in Colab.
@dataclass
class ContextChunk:
    """A chunk of text with metadata for the Model Context Protocol."""
    text: str
    embedding: Optional[torch.Tensor] = None
    importance: float = 1.0
    timestamp: float = 0.0
    metadata: Dict[str, Any] = None

    def __post_init__(self):
        if self.metadata is None:
            self.metadata = {}
        if self.timestamp == 0.0:
            self.timestamp = time.time()
The ContextChunk dataclass encapsulates a single segment of text together with its embedding, a user-assigned importance score, a timestamp, and arbitrary metadata. Its __post_init__ method ensures that each chunk is stamped with the current time upon creation and that metadata defaults to an empty dictionary if none is provided.
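As a quick sanity check, you can construct a chunk directly and confirm the defaults are filled in. This is a minimal illustrative sketch; the sample text is made up for this example:

# Minimal example: create a chunk without an embedding and inspect the defaults.
chunk = ContextChunk(text="MCP keeps only the most relevant context in the window.")
print(chunk.timestamp > 0)   # True: __post_init__ stamped the creation time
print(chunk.metadata)        # {}: metadata defaults to an empty dict
print(chunk.importance)      # 1.0: default importance score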
class ModelContextManager:
    """
    Manager for implementing the Model Context Protocol in LLMs on Google Colab.
    Handles context window optimization, token management, and relevance scoring.
    """
    def __init__(
        self,
        max_context_length: int = 8192,
        embedding_model: str = "sentence-transformers/all-MiniLM-L6-v2",
        relevance_threshold: float = 0.7,
        recency_weight: float = 0.3,
        importance_weight: float = 0.3,
        semantic_weight: float = 0.4,
        device: str = "cuda" if torch.cuda.is_available() else "cpu"
    ):
        """
        Initialize the Model Context Manager.

        Args:
            max_context_length: Maximum number of tokens in the context window
            embedding_model: Model to use for text embeddings
            relevance_threshold: Threshold a chunk's relevance must meet to be included
            recency_weight: Weight for recency in relevance calculation
            importance_weight: Weight for importance in relevance calculation
            semantic_weight: Weight for semantic similarity in relevance calculation
            device: Device to run computations on
        """
        self.max_context_length = max_context_length
        self.device = device
        self.chunks = []
        self.current_token_count = 0
        self.relevance_threshold = relevance_threshold
        self.recency_weight = recency_weight
        self.importance_weight = importance_weight
        self.semantic_weight = semantic_weight

        try:
            from sentence_transformers import SentenceTransformer
            print(f"Loading embedding model {embedding_model}...")
            self.embedding_model = SentenceTransformer(embedding_model).to(self.device)
            print(f"Embedding model loaded successfully on {self.device}")
        except ImportError:
            print("Installing sentence-transformers...")
            import subprocess
            subprocess.run(["pip", "install", "sentence-transformers"])
            from sentence_transformers import SentenceTransformer
            self.embedding_model = SentenceTransformer(embedding_model).to(self.device)
            print(f"Embedding model loaded successfully on {self.device}")

        try:
            from transformers import GPT2Tokenizer
            self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
        except ImportError:
            print("Installing transformers...")
            import subprocess
            subprocess.run(["pip", "install", "transformers"])
            from transformers import GPT2Tokenizer
            self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    def add_chunk(self, text: str, importance: float = 1.0, metadata: Dict[str, Any] = None) -> None:
        """
        Add a new chunk of text to the context manager.

        Args:
            text: The text content to add
            importance: Importance score (0-1)
            metadata: Additional metadata for the chunk
        """
        with torch.no_grad():
            embedding = self.embedding_model.encode(text, convert_to_tensor=True)

        chunk = ContextChunk(
            text=text,
            embedding=embedding,
            importance=importance,
            timestamp=time.time(),
            metadata=metadata or {}
        )
        self.chunks.append(chunk)
        self.current_token_count += len(self.tokenizer.encode(text))

        if self.current_token_count > self.max_context_length:
            self.optimize_context()
    def optimize_context(self) -> None:
        """Optimize context by removing less relevant chunks to fit within the token limit."""
        if not self.chunks:
            return
        print("Optimizing context window...")
        scores = self.score_chunks()
        sorted_indices = np.argsort(scores)[::-1]
        # Greedily keep the highest-scoring chunks that fit within the token budget.
        new_chunks = []
        new_token_count = 0
        for idx in sorted_indices:
            chunk = self.chunks[idx]
            chunk_tokens = len(self.tokenizer.encode(chunk.text))
            if new_token_count + chunk_tokens <= self.max_context_length:
                new_chunks.append(chunk)
                new_token_count += chunk_tokens
            elif scores[idx] > self.relevance_threshold * 1.5:
                # A high-scoring chunk that does not fit may replace an already-included
                # chunk whose score falls below the relevance threshold.
                for i, included_chunk in enumerate(new_chunks):
                    included_idx = sorted_indices[i]
                    if scores[included_idx] < self.relevance_threshold:
                        freed = len(self.tokenizer.encode(included_chunk.text))
                        if new_token_count - freed + chunk_tokens <= self.max_context_length:
                            new_chunks[i] = chunk
                            new_token_count += chunk_tokens - freed
                            break
        self.chunks = new_chunks
        self.current_token_count = new_token_count
        gc.collect()

    def score_chunks(self, query: str = None) -> np.ndarray:
"""
Rating chunks primarily based on recency, significance, and semantic relevance.
Args:
question: Elective question to calculate semantic relevance towards
Returns:
Array of scores for every chunk
"""
if not self.chunks:
return np.array([])
current_time = time.time()
max_age = max(current_time - chunk.timestamp for chunk in self.chunks) or 1.0
recency_scores = np.array([
1.0 - ((current_time - chunk.timestamp) / max_age)
for chunk in self.chunks
])
importance_scores = np.array([chunk.importance for chunk in self.chunks])
if question will not be None:
query_embedding = self.embedding_model.encode(question, convert_to_tensor=True)
similarity_scores = np.array([
torch.cosine_similarity(chunk.embedding, query_embedding, dim=0).item()
for chunk in self.chunks
])
similarity_scores = (similarity_scores - similarity_scores.min()) / (similarity_scores.max() - similarity_scores.min() + 1e-8)
else:
similarity_scores = np.ones(len(self.chunks))
final_scores = (
self.recency_weight * recency_scores +
self.importance_weight * importance_scores +
self.semantic_weight * similarity_scores
)
return final_scores
    def retrieve_context(self, query: str = None, k: int = None) -> str:
        """
        Retrieve the most relevant context for a given query.

        Args:
            query: The query to retrieve context for
            k: The maximum number of chunks to return (None = all relevant chunks)

        Returns:
            String containing the combined relevant context
        """
        if not self.chunks:
            return ""

        scores = self.score_chunks(query)
        relevant_indices = np.where(scores >= self.relevance_threshold)[0]
        relevant_indices = relevant_indices[np.argsort(scores[relevant_indices])[::-1]]

        if k is not None:
            relevant_indices = relevant_indices[:k]

        relevant_texts = [self.chunks[i].text for i in relevant_indices]
        return "\n\n".join(relevant_texts)
    def get_stats(self) -> Dict[str, Any]:
        """Get statistics about the current context state."""
        return {
            "chunk_count": len(self.chunks),
            "token_count": self.current_token_count,
            "max_tokens": self.max_context_length,
            "usage_percentage": self.current_token_count / self.max_context_length * 100 if self.max_context_length else 0,
            "avg_chunk_size": self.current_token_count / len(self.chunks) if self.chunks else 0,
            "oldest_chunk_age": time.time() - min(chunk.timestamp for chunk in self.chunks) if self.chunks else 0,
        }
    def visualize_context(self):
        """Visualize the current context window distribution."""
        try:
            import matplotlib.pyplot as plt
            import pandas as pd

            if not self.chunks:
                print("No chunks to visualize")
                return

            scores = self.score_chunks()
            chunk_sizes = [len(self.tokenizer.encode(chunk.text)) for chunk in self.chunks]
            timestamps = [chunk.timestamp for chunk in self.chunks]
            relative_times = [time.time() - ts for ts in timestamps]
            importance = [chunk.importance for chunk in self.chunks]

            df = pd.DataFrame({
                'Size (tokens)': chunk_sizes,
                'Age (seconds)': relative_times,
                'Importance': importance,
                'Score': scores
            })

            fig, axs = plt.subplots(2, 2, figsize=(14, 10))

            axs[0, 0].bar(range(len(chunk_sizes)), chunk_sizes)
            axs[0, 0].set_title('Token Distribution by Chunk')
            axs[0, 0].set_ylabel('Tokens')
            axs[0, 0].set_xlabel('Chunk Index')

            axs[0, 1].scatter(chunk_sizes, scores)
            axs[0, 1].set_title('Score vs Chunk Size')
            axs[0, 1].set_xlabel('Tokens')
            axs[0, 1].set_ylabel('Score')

            axs[1, 0].scatter(relative_times, scores)
            axs[1, 0].set_title('Score vs Chunk Age')
            axs[1, 0].set_xlabel('Age (seconds)')
            axs[1, 0].set_ylabel('Score')

            axs[1, 1].scatter(importance, scores)
            axs[1, 1].set_title('Score vs Importance')
            axs[1, 1].set_xlabel('Importance')
            axs[1, 1].set_ylabel('Score')

            plt.tight_layout()
            plt.show()
        except ImportError:
            print("Please install matplotlib and pandas for visualization:")
            print("!pip install matplotlib pandas")
The ModelContextManager class orchestrates the end-to-end handling of context for LLMs by chunking input text, generating embeddings, and tracking token usage against a configurable limit. It implements relevance scoring (combining recency, importance, and semantic similarity), automatic context pruning, retrieval of the most pertinent chunks, and convenient utilities for monitoring and visualizing context statistics.
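Before wiring the manager to a language model, it helps to exercise it in isolation. The short sketch below is illustrative only: the texts, query, and the lowered relevance threshold are example values chosen for this demonstration, not part of the tutorial's dataset.

# Stand-alone usage of the context manager (illustrative texts, query, and threshold).
manager = ModelContextManager(max_context_length=1024, relevance_threshold=0.5)
manager.add_chunk("MCP scores each chunk by recency, importance, and similarity.", importance=1.0)
manager.add_chunk("Unrelated note about lunch plans.", importance=0.2)

print(manager.retrieve_context(query="How are chunks scored?"))  # Prints chunks scoring above the threshold
print(manager.get_stats())  # Token usage, chunk count, and other statistics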
class MCPColabDemo:
    """Demonstration of the Model Context Protocol in Google Colab with a language model."""
    def __init__(
        self,
        model_name: str = "google/flan-t5-base",
        max_context_length: int = 2048,
        device: str = "cuda" if torch.cuda.is_available() else "cpu"
    ):
        """
        Initialize the MCP Colab demo with a specified model.

        Args:
            model_name: Hugging Face model name
            max_context_length: Maximum context length for the MCP manager
            device: Device to run the model on
        """
        self.device = device
        self.context_manager = ModelContextManager(
            max_context_length=max_context_length,
            device=device
        )

        try:
            from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
            print(f"Loading model {model_name}...")
            self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
            self.tokenizer = AutoTokenizer.from_pretrained(model_name)
            print(f"Model loaded successfully on {device}")
        except ImportError:
            print("Installing transformers...")
            import subprocess
            subprocess.run(["pip", "install", "transformers"])
            from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
            self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
            self.tokenizer = AutoTokenizer.from_pretrained(model_name)
            print(f"Model loaded successfully on {device}")
    def add_document(self, text: str, chunk_size: int = 512, overlap: int = 50) -> None:
        """
        Add a document to the context by chunking it appropriately.

        Args:
            text: Document text
            chunk_size: Size of each chunk in characters
            overlap: Overlap between chunks in characters
        """
        chunks = []
        for i in range(0, len(text), chunk_size - overlap):
            chunk = text[i:i + chunk_size]
            if len(chunk) > 20:
                chunks.append(chunk)

        print(f"Adding {len(chunks)} chunks to context...")
        for i, chunk in enumerate(tqdm(chunks)):
            # Chunks near the beginning and end of the document receive slightly higher importance.
            pos = i / len(chunks)
            importance = 1.0 - 0.5 * min(pos, 1 - pos)
            self.context_manager.add_chunk(
                text=chunk,
                importance=importance,
                metadata={"source": "document", "position": i, "total_chunks": len(chunks)}
            )
    def process_query(self, query: str, max_new_tokens: int = 256) -> str:
        """
        Process a query using the context manager and model.

        Args:
            query: The query to process
            max_new_tokens: Maximum number of tokens in the response

        Returns:
            Model response
        """
        self.context_manager.add_chunk(query, importance=1.0, metadata={"type": "query"})
        relevant_context = self.context_manager.retrieve_context(query=query)

        prompt = f"Context: {relevant_context}\n\nQuestion: {query}\n\nAnswer:"
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)

        print("Generating response...")
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                temperature=0.7,
                top_p=0.9,
            )
        response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)

        self.context_manager.add_chunk(
            response,
            importance=0.9,
            metadata={"type": "response", "query": query}
        )
        return response
    def interactive_session(self):
        """Run an interactive session in the notebook."""
        from IPython.display import clear_output

        print("Starting interactive MCP session. Type 'exit' to end.")
        conversation_history = []

        while True:
            query = input("\nYour query: ")
            if query.lower() == 'exit':
                break

            if query.lower() == 'stats':
                print("\nContext Statistics:")
                stats = self.context_manager.get_stats()
                for key, value in stats.items():
                    print(f"{key}: {value}")
                self.context_manager.visualize_context()
                continue

            if query.lower() == 'clear':
                self.context_manager.chunks = []
                self.context_manager.current_token_count = 0
                conversation_history = []
                clear_output(wait=True)
                print("Context cleared!")
                continue

            response = self.process_query(query)
            conversation_history.append((query, response))

            print("\nResponse:")
            print(response)
            print("\n" + "-" * 50)

            stats = self.context_manager.get_stats()
            print(f"Context usage: {stats['token_count']}/{stats['max_tokens']} tokens ({stats['usage_percentage']:.1f}%)")
The MCPColabDemo class ties the context manager to a seq2seq LLM, loading FLAN-T5 (or any specified Hugging Face model) on the chosen device, and provides utility methods for chunking and ingesting entire documents, processing user queries by prepending only the most relevant context, and running an interactive Colab session complete with real-time stats, visualizations, and commands for clearing or inspecting the evolving context window.
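A typical workflow in a notebook cell might look like the sketch below; the placeholder document, question, and commented-out interactive call are illustrative values you would replace with your own material.

# End-to-end usage of the demo class (placeholder document and question).
demo = MCPColabDemo(model_name="google/flan-t5-base", max_context_length=2048)
long_document_text = "Your document text goes here. " * 100  # placeholder document
demo.add_document(long_document_text, chunk_size=512, overlap=50)
answer = demo.process_query("What are the key points of this document?")
print(answer)
# demo.interactive_session()  # Or start an interactive loop with 'stats', 'clear', and 'exit' commands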
def run_mcp_demo():
    """Run a simple demo of the Model Context Protocol."""
    print("Running Model Context Protocol Demo...")

    context_manager = ModelContextManager(max_context_length=4096)

    print("Adding sample chunks...")
    context_manager.add_chunk(
        "The Model Context Protocol (MCP) is a framework for managing context "
        "windows in large language models. It helps optimize token usage and improve relevance.",
        importance=1.0
    )
    context_manager.add_chunk(
        "Context management involves techniques like sliding windows, chunking, "
        "and relevance filtering to handle large documents efficiently.",
        importance=0.8
    )
    for i in range(10):
        context_manager.add_chunk(
            f"This is test chunk {i} with some filler content to simulate a larger context "
            f"window that needs optimization. This helps demonstrate the MCP functionality "
            f"for context window management in language models on Google Colab.",
            importance=0.5 - (i * 0.02)
        )

    stats = context_manager.get_stats()
    print("\nInitial Statistics:")
    for key, value in stats.items():
        print(f"{key}: {value}")

    query = "How does the Model Context Protocol work?"
    print(f"\nRetrieving context for: '{query}'")
    context = context_manager.retrieve_context(query)
    print(f"\nRelevant context:\n{context}")

    print("\nVisualizing context:")
    context_manager.visualize_context()

    print("\nDemo complete!")
The run_mcp_demo function ties everything together in a single script: it instantiates the ModelContextManager, adds a series of sample chunks with varying importance, prints out initial statistics, retrieves and displays the most relevant context for a test query, and finally visualizes the context window, providing a complete, end-to-end demonstration of the Model Context Protocol in action.
if __name__ == "__main__":
    run_mcp_demo()
Finally, this standard Python entry-point guard ensures that the run_mcp_demo() function executes only when the script is run directly (rather than imported as a module), triggering the end-to-end demonstration of the Model Context Protocol workflow.
In conclusion, we now have a fully functional MCP system that not only curbs runaway token usage but also prioritizes the context fragments that truly matter for your queries. The ModelContextManager equips you with tools to balance semantic relevance, temporal freshness, and user-assigned importance. At the same time, the accompanying MCPColabDemo class provides an accessible framework for real-time experimentation and visualization. Armed with these patterns, you can extend the core principles by adjusting relevance thresholds, experimenting with different embedding models, or integrating with other LLM backends to tailor your domain-specific workflows. Ultimately, this approach lets you create concise yet highly relevant prompts, resulting in more accurate and efficient responses from your language models.
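As one example of that kind of customization, the manager can be tuned toward semantic relevance with an alternative embedding model. The parameter values and model choice below are illustrative assumptions, not recommendations from the tutorial:

# Illustrative customization: heavier semantic weighting and an alternative embedding model.
custom_manager = ModelContextManager(
    max_context_length=4096,
    embedding_model="sentence-transformers/all-mpnet-base-v2",  # assumed alternative model
    relevance_threshold=0.6,
    recency_weight=0.2,
    importance_weight=0.2,
    semantic_weight=0.6,
)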
Here is the Colab Notebook with the full code for this tutorial.