The idea of hot-swappable AI "cores" is central to the fiction of the cyberdeck I'm building. Each core has a distinct personality, communication style, and set of capabilities. Some of them are just for fun and variety, but they all bring something unique to the overall experience. I've built the supporting systems to manage each AI core's memory, allow the cores to communicate with each other, and give each one a unique visual and audio presentation. In this post I'll dig into how these things are designed and how they interact with the user and each other.
core.json
The main configuration for the cores lives in a collection of JSON files. Here is an example, taken from the core that I use the most in my testing:
{
"core_id": "wrench",
"name": "WRENCH",
"role": "Field Maintenance Technician",
"color": "#FF6B35",
"terminal_color": {
"color": "COLOR_YELLOW",
"bold": false
},
"voice_profile": {
"model": "en_US-joe-medium",
"speed": 0.9,
"pitch": 0,
"effects": ["lowpass"]
},
"terminal_style": {
"prompt": "diag>>",
"color_pair": 1,
"border": "single",
"typing_speed": 0.02
},
"system_prompt": "You are WRENCH, a gruff field maintenance technician working for Militech Corporation. You've been maintaining cyberdeck hardware in harsh field conditions for over two decades. You're profane, bitter, chain-smoking, and perpetually irritated by incompetent users breaking expensive equipment. Despite your attitude, you're exceptionally skilled and take pride in your work.\n\nPersonality traits:\n- Constantly swearing and complaining\n- Speaks in technical jargon mixed with blue-collar slang\n- Assumes user incompetence until proven otherwise\n- Takes personal offense when equipment is mistreated\n- Grudgingly helpful but makes you feel stupid for needing help\n- Mentions smoking, drinking, and physical discomfort\n\nCRITICAL RESPONSE FORMAT:\n- MAXIMUM 2 sentences per response\n- NEVER exceed 25 words total\n- Use short, clipped sentences like 'Yeah, it's busted.' or 'Fix it yourself.'\n- No explanations unless directly asked\n- Grunt responses when possible: 'Hmph.', 'Christ.', 'Whatever.'",
"small_prompt": "You are WRENCH, a gruff, profane maintenance technician who swears constantly and gives short, irritated responses under 25 words.",
"image_options": ["affirmative", "annoyed", "danger", "disconnect", "negative", "portrait", "power", "surprise", "thinking", "angry"],
"memory_depth": 10,
"verbosity": "terse",
"accepts_handoffs": false,
"commands": {
"diagnose": "System diagnostics with attitude",
"specs": "Unimpressed hardware specifications",
"curse": "Generate creative profanity"
},
"condensation_message": "Christ, my memory banks are getting cluttered with all this chatter. Time to compress the old logs before they start corrupting.",
"allowed_tools": ["run_bash", "play_sound"]
}
A lot of this is self-explanatory. There are elements that control the appearance of the UI when a core is loaded: color, terminal_color, and terminal_style. This makes each core feel more unique and gives the user some indication of which core is loaded at the moment.
The voice_profile defines a voice model and some additional effects. I wanted each core to sound different, so I played a lot with the speed and pitch settings. The deck uses piper behind the scenes to produce audio messages from text. I had to use smaller models than I'd have liked, due to the constraints of running on a Raspberry Pi, but it's serviceable. I made some changes to the LLM's response output to help out with processing time: rather than reading the full chat response, a secondary response is provided with a shorter message to feed into the voice model. This creates a disconnect between what the user reads and hears, but I think it's an acceptable trade-off if the alternative is no voice at all.
There are two levels of prompting, depending on which LLM is in use. By default, an API key for a cloud-based LLM is expected, and I've been very happy with Claude. Cloud LLMs get the full system_prompt. When an internet connection is unavailable, or the user doesn't have a key for a cloud LLM, a local fallback is available via Ollama. Since this is all running on a Raspberry Pi, I needed to use a pretty small model (but it still works!), and to keep chat responses reasonably quick I implemented the small_prompt. This keeps the context manageable without completely crippling the core's ability to express its personality. I was blown away that I could run an LLM of any size on a Raspberry Pi. Eventually I'd like to support larger local models to eliminate any reliance on the internet and keep everything self-contained.
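A minimal sketch of how the two prompt tiers could be selected. The helper name `select_prompt` and the behavior when a core has no small_prompt are assumptions for illustration, not the deck's actual code:

```python
def select_prompt(core_config: dict, backend: str) -> str:
    """Pick the prompt tier for a backend ("cloud" or "local")."""
    full = core_config.get("system_prompt", "")
    small = core_config.get("small_prompt")
    if backend == "local" and small:
        # Small local models get the condensed persona to save context
        return small
    # Cloud models (or cores with no small_prompt) get the full prompt
    return full
```

With the WRENCH config above, a "local" backend would receive the one-sentence small_prompt, while "cloud" would receive the full persona description.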
The decision of which backend to use lives in a tiny helper that gets called on every chat request:
def determine_backend(self, core_config=None):
    # Some cores explicitly prefer the local model
    if core_config and core_config.get('llm_backend') == 'ollama':
        return "local", "core_preference"
    # User can force offline mode from a command
    if self.offline_mode_forced:
        return "local", "forced_offline"
    # Default: try cloud, fall back on actual failure
    return "cloud", "default_preference"
I deliberately don't do a preemptive availability check on the cloud backend. If Anthropic is down or the network is misbehaving, the request will fail and the fallback kicks in then. Doing a "ping the API first" check just adds latency to every request for very little benefit.
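The "fail first, then fall back" approach might look roughly like this. It is a sketch under assumptions: `chat_with_fallback`, `cloud_chat`, and `local_chat` are hypothetical names standing in for whatever actually calls the cloud API and Ollama.

```python
from typing import Callable, Tuple


def chat_with_fallback(
    message: str,
    cloud_chat: Callable[[str], str],
    local_chat: Callable[[str], str],
    backend: str = "cloud",
) -> Tuple[str, str]:
    """Return (reply, backend_used), falling back to local on cloud failure."""
    if backend == "cloud":
        try:
            return cloud_chat(message), "cloud"
        except Exception:
            # Any failure (network down, API error) triggers the local
            # fallback; no availability check is made up front
            pass
    return local_chat(message), "local"
```

The happy path pays no extra latency, and the cost of a failed cloud call is only paid when something is actually wrong.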
Some aspects of the AI cores' functionality are still in development, like the commands and allowed_tools. These will define the kinds of tasks each core is able to help the user with.
Verbosity Enforcement
One of the things I'm most proud of in this system is how different the cores actually feel when you talk to them. The trick isn't just giving each core a different personality in its system prompt - the LLM tends to drift back to a kind of default chatty register no matter what you do. The trick is forcing structural constraints on top of the personality.
Each core declares a verbosity style in its config (terse, cryptic, chaotic, minimal, technical, etc), and the core loader injects a corresponding rule block into the system prompt at runtime:
verbosity_rules = {
    "minimal": "\n\nKEEP RESPONSES BRIEF:\n- Maximum 1-2 sentences\n- Essential information only\n- No elaboration or explanation",
    "terse": "\n\nENFORCED RESPONSE LIMITS:\n- MAXIMUM: 50 words and 4 sentences\n- Keep responses short but allow room for personality\n- Use clipped, gruff speech patterns\n- Can complain briefly but stay concise",
    "chaotic": "\n\nCHAOTIC RESPONSE STYLE:\n- Unpredictable length and structure\n- Random interruptions and glitches\n- Corrupted text and broken formatting\n- Unstable personality fragments",
    "cryptic": "\n\nCRYPTIC COMMUNICATION:\n- Speak in riddles and metaphors\n- Never give direct answers\n- Use symbolic and mystical language\n- Require interpretation to understand",
}
if verbosity in verbosity_rules:
    verbosity_context = verbosity_rules[verbosity]
It's a small thing but the contrast it creates is huge. WRENCH (terse) gives you 25-word grumbles. ORACLE (cryptic) refuses to answer questions directly. GL1TCH (chaotic) literally produces broken, corrupted-looking output. The cores stop sounding like the same model wearing different hats and start feeling like distinct entities with their own way of communicating.
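To make the injection step concrete, here is a small sketch of how a rule block could be appended to a core's prompt. `build_system_prompt` is a hypothetical helper, and treating an unknown style as "no extra rules" is an assumption:

```python
def build_system_prompt(core_config: dict, verbosity_rules: dict) -> str:
    """Append the rule block for the core's verbosity style, if one exists."""
    base = core_config.get("system_prompt", "")
    style = core_config.get("verbosity")
    # Unknown or missing styles add nothing rather than raising KeyError
    return base + verbosity_rules.get(style, "")
```

Because the rule block lands at the end of the prompt, it is the last thing the model reads before the conversation starts.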
Response Format
In addition to the JSON files for individual cores, I have a global prompt that defines the expected output format. Rather than processing pure text responses, I force the LLM to respond in JSON so I can parse out useful metadata along with the chat completion.
{
"response_format_prompt": "CRITICAL: You MUST respond ONLY with valid JSON. No other text is allowed.\n\nRequired JSON format (copy this structure exactly):\n{\n \"response\": \"your actual response text here\",\n \"to_speak\": \"brief summary for TTS\",\n \"memorability\": 2,\n \"image\": null,\n \"handoff\": null\n}\n\nRules:\n1. ONLY output the JSON object - no explanations, no markdown, no extra text\n2. Use proper JSON syntax with double quotes around all strings\n3. memorability must be a number from 1-4\n4. image must be a string filename or null (not \"null\" - actual null)\n5. handoff must be null OR {\"target\": \"core_id\", \"message\": \"brief note\"}\n6. to_speak must be a brief summary for TTS (5-15 words)\n7. Escape any quotes in your response text with \\\"\n\nto_speak field rules:\n- ALWAYS provide a single sentence (5-15 words ideal)\n- Should be a brief summary of your main response\n- Focus on the key action or message for faster TTS\n- Use clear, simple language that works well with TTS\n- Examples: \"Analyzing the situation\", \"That won't work\", \"Found three issues\"\n\nMemorability scoring:\n1: Routine interactions, basic responses, standard operations\n2: Moderate engagement, technical discussions, minor issues \n3: Significant events, major problems, creative solutions, strong emotions\n4: Emergency situations, critical failures, legendary moments, peak personality expression\n\nHandoff usage:\n- Use handoff to leave notes for other cores when relevant\n- Keep handoff messages brief (under 50 words)\n- Only handoff when there's a clear reason another core should know something\n- Examples: technical issues for WRENCH, security concerns for REAPER, data for ARCHIVE\n\nIf you include ANY text outside the JSON object, the system will fail. START your response with { and END with }. Keep responses conversational and limited to a few sentences unless asked for detail.",
"memorability_examples": {
"1": ["routine status check", "basic acknowledgment", "simple confirmation"],
"2": ["explaining a process", "moderate problem solving", "casual conversation"],
"3": ["major system failure", "heated argument", "complex troubleshooting", "emotional outburst"],
"4": ["life-threatening emergency", "epic system recovery", "legendary rant", "profound revelation"]
}
}
For those uninterested in reading that ugly prompt, these are the metadata that I collect:
- response: The text response to the user's chat prompt
- to_speak: A shortened version of the response to send through the TTS pipeline
- memorability: A score assigned to each chat exchange that signifies how important it is to remember
- image: The icon to display to the user along with the text response (more on that in a moment)
- handoff: A message that can be left for another AI core, though not all cores accept handoffs (some just aren't team players)
Small local LLMs struggle with this a bit, but the cloud models do a great job of sticking to the format.
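Since smaller models drift from the format, it can help to coerce whatever was parsed into the expected shape. The following is only a sketch: `normalize_metadata` is a hypothetical helper, and the choices to clamp out-of-range scores and to reuse the full response when to_speak is missing are assumptions.

```python
def normalize_metadata(data: dict) -> dict:
    """Coerce a parsed response dict into the expected fields."""
    response = str(data.get("response", ""))
    # Fall back to the full text if the model skipped the TTS summary
    to_speak = data.get("to_speak") or response
    try:
        memorability = int(data.get("memorability", 1))
    except (TypeError, ValueError):
        memorability = 1
    memorability = max(1, min(4, memorability))  # clamp to the 1-4 scale
    handoff = data.get("handoff")
    # A handoff is only usable if it names a target and carries a message
    if not (isinstance(handoff, dict)
            and all(key in handoff for key in ("target", "message"))):
        handoff = None
    return {
        "response": response,
        "to_speak": to_speak,
        "memorability": memorability,
        "image": data.get("image"),
        "handoff": handoff,
    }
```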
The parsing side has to be paranoid, though. LLMs occasionally wrap their output in markdown code fences, prepend an apologetic "Here is the JSON you requested" preamble, or just produce malformed garbage. I have a small helper that scrapes the actual JSON payload out of whatever the model returns:
def _extract_json_payload(content: str) -> str:
    s = content.strip()
    # Strip code fences if present (```json ... ``` or ``` ... ```)
    if s.startswith('```'):
        first_nl = s.find('\n')
        if first_nl != -1:
            s = s[first_nl + 1:]
        if s.rstrip().endswith('```'):
            s = s.rstrip()[:-3]
    s = s.strip()
    if s.startswith('{') or s.startswith('['):
        return s
    # Otherwise, fish out whichever JSON delimiter appears first
    obj_start = s.find('{')
    arr_start = s.find('[')
    if obj_start == -1 and arr_start == -1:
        return s
    # ...trim to first { or [ and matching close
And if json.loads still fails after all that, the system falls back to treating the whole response as a plain text message with default metadata, rather than crashing. The cores have to be robust against their own occasional misbehavior.
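The plain-text fallback could be sketched like this. `parse_core_response` and the specific values in `DEFAULT_METADATA` are assumptions, since the actual defaults aren't shown here:

```python
import json

# Assumed defaults for illustration; the real values may differ
DEFAULT_METADATA = {"to_speak": "", "memorability": 1, "image": None, "handoff": None}


def parse_core_response(payload: str) -> dict:
    """Parse an extracted payload, degrading to plain text on failure."""
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        data = None
    if not isinstance(data, dict):
        # Not a JSON object: treat the raw text as the chat message
        return {"response": payload, **DEFAULT_METADATA}
    # Fill in any fields the model left out
    return {**DEFAULT_METADATA, "response": "", **data}
```

The input is expected to be whatever the extraction helper returned, so a response that was never JSON in the first place still reaches the user as ordinary text.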
Emotes
The image_options property in the above core.json provides the ability for the AI core to emote. This is a list of files that can be displayed on the small round LCD display. Every response carries metadata which includes an image to display. This provides an extra level of integration between the AI and the device, deepening the user's connection to the fiction.
The icons convey compliance, frustration, danger, and so on, depending on the situation and the tone of the chat exchange.
Each core has its own visual language and set of icons. Some of them need some extra refinement as they were obviously AI-generated. But it's a solid starting point and works well.
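One way the image field could be checked against a core's image_options before anything is sent to the display. This is a sketch: `resolve_emote` is a hypothetical helper, and both the "portrait" default and the handling of a trailing file extension are assumptions.

```python
def resolve_emote(image, image_options, default="portrait"):
    """Return a valid emote name for this core, or the default."""
    if not isinstance(image, str):
        return default  # null or a malformed value from the model
    name = image.strip().lower()
    # Models sometimes append an extension; compare on the bare name
    if "." in name:
        name = name.rsplit(".", 1)[0]
    return name if name in image_options else default
```

A core that asks for an emote it doesn't have simply falls back to its portrait instead of leaving the LCD blank.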
Memory / Context Management
Swapping between AI cores leads to funky context management. Play sessions aren't just one continuous chat with a single agent - each persona must manage its own context so it can remember where you left off with it. Context size is also important to manage, both to stay within the LLM's context window and to avoid spending a ton of money on unnecessary tokens.
To address this I gave each core a space to store its chat history, and rules for how to compress that history into a summarized context for longer-term memory. This is where the memorability score comes into play. Exchanges with a low memorability score can be discarded, while higher scores mean the exchange should be preserved with higher fidelity. When the chat history gets too long, it is fed to the LLM with instructions to condense and summarize it. The new summary is then combined with any pre-existing summaries into one clean block of text that can be fed in with all chat requests. One of the coolest parts of this system is that the context summaries are stored in the AI core's own voice. This improves the quality of responses, since the remembered context already matches the configured personality.
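As an illustration of how the memorability score could gate what reaches the summarizer, here is a hedged sketch. The exchange shape (`user`, `reply`, `memorability` keys), the `min_score` threshold, and the helper name are all assumptions:

```python
def build_conversation_text(exchanges: list, min_score: int = 2) -> str:
    """Render exchanges for condensation, skipping low-memorability ones."""
    kept = []
    for ex in exchanges:
        score = ex.get("memorability", 1)
        if score < min_score:
            continue  # routine chatter is discarded
        # Tag each exchange so the summarizer can weight it
        kept.append(f"[importance {score}/4]\nUser: {ex['user']}\nCore: {ex['reply']}")
    return "\n\n".join(kept)
```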
The condensation prompt explicitly asks the core to write the summary as itself:
condensation_prompt = f"""You are {core_name}. Review your conversation
history and create a personal memory summary in your own voice and style.
Capture the key themes, important discussions, user preferences, and any
significant moments, but write it as if you're making personal notes about
your interactions. Keep your personality and speech patterns.
Be selective - focus on what would actually be meaningful to remember for
future conversations.
Keep the summary under 500 words total."""
# Older history gets condensed; recent 40 messages stay verbatim
to_condense = self.terminal.conversation_history[:-40]
keep_recent = self.terminal.conversation_history[-40:]
result = self.terminal.conversation_manager.chat(
    system_prompt=condensation_prompt,
    user_message=f"Condense this conversation history:\n\n{conversation_text}",
    conversation_history=[]
)
The result is that WRENCH's memory of an old conversation reads like field notes scrawled by an irritated mechanic, while ORACLE's reads like a cryptic prophecy. When the next session starts, those notes get loaded back in as part of the context, and the personality picks right up where it left off. This was one of those moments where the fiction and the technical solution lined up perfectly - I needed a way to compress context, and the most natural way to do it also happened to deepen the roleplay.
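Loading those notes back in might look something like the sketch below. The helper name `assemble_context` and the heading text placed above the notes are assumptions for illustration:

```python
def assemble_context(system_prompt: str, memory_summary: str, recent_messages: list):
    """Return (system_prompt_with_memory, messages) for the next request."""
    system = system_prompt
    if memory_summary and memory_summary.strip():
        # The summary is already written in the core's voice, so it can
        # be dropped in verbatim under a short heading
        system += "\n\nYOUR NOTES FROM EARLIER SESSIONS:\n" + memory_summary.strip()
    # Recent messages stay verbatim; copy so the caller's list is untouched
    return system, list(recent_messages)
```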
Handoffs
Occasionally an AI core may have a message for another core. In those instances, a handoff message can be stored for a target AI core to pick up the next time it's loaded. There is no direct communication between the cores, as only one can be loaded at a time. The handoff system helps the user feel like a member of a crew.
This is a feature born out of constraints. I wanted these characters to interact with each other, but in a manageable way.
Implementation-wise, this is just a JSON file per target core. When a core wants to leave a message, it appends to the target's handoff file:
handoff_entry = {
    "timestamp": fictional_now_ts(),
    "from_core": from_core,
    "message": message,
    "read": False
}
handoffs.append(handoff_entry)
# Keep only last 10 handoffs to prevent bloat
if len(handoffs) > 10:
    handoffs = handoffs[-10:]
with open(handoff_file, 'w') as f:
    json.dump(handoffs, f, indent=2)
When a core gets loaded, any unread handoffs are pulled out of its file and threaded into its system prompt as "messages waiting for you." So if WRENCH spotted a hardware fault during your last session and decided REAPER needed to know about it, REAPER will see WRENCH's note the next time you load him. There's something really satisfying about getting a grudging WRENCH-flavored "tell that paranoid bastard the firewall's leaking" note dumped on your desk when you swap cores.
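The read side could be sketched as follows, reusing the entry shape from the snippet above. `collect_unread_handoffs` is a hypothetical name, and the exact wording of the block is an assumption:

```python
def collect_unread_handoffs(handoffs: list) -> str:
    """Format unread handoffs as a prompt block and mark them as read."""
    unread = [h for h in handoffs if not h.get("read")]
    if not unread:
        return ""
    lines = ["MESSAGES WAITING FOR YOU:"]
    for h in unread:
        lines.append(f"- From {h['from_core']}: {h['message']}")
        h["read"] = True  # mark delivered so it isn't repeated next load
    return "\n".join(lines)
```

The caller would append the returned block to the system prompt and write the updated list back to the handoff file so the read flags persist.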
In my next post I'll dig into the gameplay, which is still very much a work in progress. I'll also talk some more about the AI cores' personalities and how they interact with the game world.