CHAPTER 2: TECHNICAL ARCHITECTURE AND INFRASTRUCTURE
1. SYSTEM OVERVIEW
The Fat Cat technical system is designed around a core principle: real-time virtual character performance composited into physical environments for live broadcast. This requires low-latency signal flow from motion capture through rendering to stream output, with robust failover capabilities to support multi-hour daily broadcasts.
1.1 Core Pipeline Architecture
The production pipeline consists of five primary stages: (1) Motion Capture Input; (2) Character Animation Processing; (3) Real-Time Rendering; (4) Compositing; and (5) Stream Output. Each stage operates on independent hardware where possible to maximize reliability and minimize single points of failure.
Target latency from physical movement to screen output is sub-200ms, with acceptable degradation to 500ms under high load. At these levels the delay is not noticeable to viewers and still allows genuine real-time interaction between the performer and chat/guests.
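For planning purposes, the sub-200ms target can be decomposed into per-stage budgets across the five pipeline stages. The sketch below is illustrative only: the individual figures are assumptions used to reason about where latency accrues, not measured values.

```python
# Illustrative end-to-end latency budget for the five-stage pipeline.
# All per-stage figures are planning assumptions, not measured values.
LATENCY_BUDGET_MS = {
    "motion_capture_input": 30,   # sensor sampling + transport to workstation
    "character_animation": 25,    # retargeting, blend shape mapping, animation blueprint
    "real_time_rendering": 35,    # one render frame at 30-60 fps plus queueing
    "compositing": 40,            # keying / layering the character into the physical plate
    "stream_output": 60,          # encode + local buffering before platform ingest
}

def total_latency_ms(budget: dict[str, int]) -> int:
    return sum(budget.values())

if __name__ == "__main__":
    total = total_latency_ms(LATENCY_BUDGET_MS)
    assert total <= 200, f"budget exceeds 200 ms target: {total} ms"
    print(f"planned end-to-end latency: {total} ms")
```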
2. TECHNICAL DEVELOPMENT ROADMAP
The Fat Cat motion capture infrastructure is being deployed in three distinct phases, each representing a significant capability upgrade.
2.1 Stage 1: The Zurk Configuration (Current)
Fig. 1 — The Zurk Desk Configuration
The initial deployment utilizes a recursive VRChat-based setup where a stylized avatar is displayed on a physical TV screen—designated 'The Zurk'—which is then filmed with an iPhone camera in a real physical environment. This lo-fi approach creates a unique aesthetic that blends virtual and physical space while requiring minimal technical infrastructure.
Stage 1 Technical Specifications:
Character Rendering: VRChat avatar displayed on physical TV monitor ('The Zurk')
Capture Method: iPhone camera filming the TV screen in a real physical environment
Motion Input: Basic VRChat avatar controls, audio-driven mouth movement
Physical Set: Cluttered desk environment with props, magazines, neon lighting
Output: Direct iPhone stream to ████████ platform
Timeline: January 2026 – Ongoing
Budget: ████████
2.1.1 The Zurk: IRL Street Deployment Mode
A key operational variant of Stage 1 involves deploying the Zurk as a mobile IRL unit for street-level content capture. The TV monitor displaying Fat Cat is physically transported into public spaces, enabling direct interaction between the character and unsuspecting pedestrians.
This deployment mode leverages the character's narrative premise: Fat Cat is trapped inside the Zurk portal, visible but unable to escape into the physical world. When members of the public approach the screen, they can converse with Fat Cat in real-time, creating spontaneous interactions that blur the line between digital character and street performer.
IRL Zurk Operational Parameters:
Hardware: Portable TV monitor, mobile power supply (battery or generator), wireless connectivity for VRChat
Capture: iPhone filming both the Zurk screen and real-world environment/interactions
Audio: External microphone to capture street ambience and pedestrian dialogue
Performer: Remote operator controlling Fat Cat via laptop/phone, able to respond in real-time
Crew: Minimum 2-person team (camera operator + Zurk handler), recommended 3+ for complex stunts
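As a minimal sketch of how these operational parameters could be tracked in tooling, the following pre-deployment checklist uses illustrative field names and thresholds; it does not describe any existing system.

```python
from dataclasses import dataclass, field

@dataclass
class IRLZurkDeployment:
    """Pre-deployment checklist for a mobile Zurk outing (illustrative fields only)."""
    location: str
    battery_runtime_hours: float       # mobile power supply capacity on site
    wireless_link_ok: bool             # VRChat connectivity verified at the location
    external_mic_ok: bool              # street ambience + pedestrian dialogue capture
    remote_operator_online: bool       # performer controlling Fat Cat remotely
    crew: list[str] = field(default_factory=list)

    def ready(self, complex_stunt: bool = False) -> bool:
        # Minimum crew: camera operator + Zurk handler, plus one for complex stunts.
        min_crew = 3 if complex_stunt else 2
        return (
            len(self.crew) >= min_crew
            and self.battery_runtime_hours >= 2.0
            and self.wireless_link_ok
            and self.external_mic_ok
            and self.remote_operator_online
        )
```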
IRL Stunt Categories:
Street Interviews: Fat Cat engages passersby in conversation, asks questions, reacts to answers
Public Spectacles: Zurk placed in high-traffic areas for maximum confusion and viral potential
Confrontational Encounters: Fat Cat argues with, mocks, or challenges real people
Location-Based Lore: Zurk 'appears' at significant locations relevant to the Faceboop narrative
Collaborative Stunts: Coordination with other streamers, artists, or events for crossover content
The IRL Zurk configuration is designed for maximum clip generation. Each outing targets 5-10 standalone clips suitable for short-form platform distribution, with at least one potential 'spectacle-grade' moment per deployment.
2.2 Stage 2: Sony Mocopi IMU Configuration (Q1 2026)
Stage 2 introduces the Sony Mocopi 6-point IMU motion capture system, enabling full body tracking with real-time character animation in Unreal Engine. This configuration adds iPhone ARKit facial capture via Live Link Face for 52-blendshape facial animation synchronized with body movement.
Stage 2 Technical Specifications:
Motion Capture: Sony Mocopi 6-point IMU system (head, hips, wrists, ankles)
Facial Animation: iPhone ARKit via Live Link Face (52 blend shapes)
Hand Tracking: Basic pose presets with manual triggering
Rendering: Unreal Engine 5 with MetaHuman-compatible rig
Character Position: Standing/walking capability with desk and roaming modes
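Before the renderer applies a pose, the two Stage 2 input streams (Mocopi body solve and ARKit facial weights) have to be merged into a single per-frame update. The sketch below shows one way to structure that merge in Python; the data layout and field names are assumptions for illustration, not Mocopi or Live Link wire formats.

```python
import time
from dataclasses import dataclass

@dataclass
class AnimationFrame:
    """One combined animation update sent to the renderer (illustrative structure)."""
    timestamp: float
    body_pose: dict[str, tuple]      # joint name -> (position, rotation) from the 6-point IMU solve
    face_weights: dict[str, float]   # ARKit blend shape name -> weight in [0, 1]
    hand_preset: str                 # manually triggered pose preset (Stage 2 has no finger tracking)

def build_frame(body_pose: dict, face_weights: dict, hand_preset: str = "relaxed") -> AnimationFrame:
    # Clamp facial weights defensively; dropped or late packets can otherwise
    # push values outside the expected range and cause visible pops.
    clamped = {name: max(0.0, min(1.0, w)) for name, w in face_weights.items()}
    return AnimationFrame(time.monotonic(), body_pose, clamped, hand_preset)
```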
2.3 Stage 3: OptiTrack Optical Configuration (Target)
Stage 3 represents the target production configuration: a professional optical motion capture system using OptiTrack PrimeX 13 cameras. This system enables cinema-quality character animation with sub-millimeter precision across the entire capture volume.
OptiTrack PrimeX 13 Camera Specifications:
Resolution: 1.3 MP (1280 × 1024)
Native Frame Rate: 240 FPS with global shutter
Latency: 4.2 ms from capture to data output
3D Accuracy: ±0.20 mm positional, ±0.5° rotational (30 ft × 30 ft volume)
Marker Range: 55 ft for passive markers, 85 ft for active markers
Illumination: 850nm infrared with Ultra High Power LEDs
Stage 3 Technical Specifications:
Facial Animation: Dedicated head-mounted camera (HMC) or Faceware system
Hand Tracking: Full finger tracking via optical markers
Rendering: Unreal Engine 5 with real-time ray tracing, LED volume integration capability
Multi-Character: Support for simultaneous capture of 2+ performers
Synchronization: eSync 2 for Genlock/Timecode integration with cinema cameras
Continuous Calibration: Automatic recalibration without wand wave after initial setup
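Optical systems of this class output labelled 3D marker positions, and a common downstream step is recovering a rigid-body transform from a set of markers. The following sketch implements the standard Kabsch/SVD solution in NumPy; it is vendor-independent and not tied to any OptiTrack SDK.

```python
import numpy as np

def rigid_transform(reference: np.ndarray, observed: np.ndarray):
    """Estimate rotation R and translation t mapping reference markers onto
    observed markers (both Nx3 arrays), using the Kabsch / SVD method."""
    ref_c = reference - reference.mean(axis=0)
    obs_c = observed - observed.mean(axis=0)
    H = ref_c.T @ obs_c                      # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = observed.mean(axis=0) - R @ reference.mean(axis=0)
    return R, t
```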
3. STAGE 4: EVENT-BASED CAMERA ENHANCEMENT SYSTEM (Speculative)
Stage 4 represents a speculative future enhancement utilizing Dynamic Vision Sensor (DVS) technology—also known as neuromorphic or event-based cameras—to achieve superior eyelid and finger detection fidelity beyond conventional camera systems.
3.1 Event Camera Technology Overview
Unlike traditional cameras that capture complete frames at fixed intervals (typically 24-120 fps), event cameras asynchronously record only pixel-level brightness changes that exceed a threshold. This bio-inspired approach, modeled on biological retinas, produces a sparse spatiotemporal stream of events with microsecond temporal resolution.
Key Advantages for Motion Capture:
Temporal Resolution: Up to 1 MHz equivalent sampling (vs. 120 Hz for conventional cameras)
Latency: ~15 microseconds from photon to data output
Dynamic Range: 120+ dB (vs. ~60 dB for conventional sensors)
Power Consumption: Significantly lower due to sparse event generation
Motion Blur: Eliminated due to asynchronous per-pixel sampling
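To feed an event stream into a conventional vision pipeline, events are typically accumulated over short time windows. The sketch below builds simple signed event-count frames and assumes a generic (x, y, timestamp, polarity) tuple format rather than any specific camera SDK.

```python
import numpy as np

def accumulate_events(events, width: int, height: int, window_us: int = 1000):
    """Accumulate (x, y, t_us, polarity) events into signed count frames.

    Each frame covers `window_us` microseconds; ON events add +1, OFF events -1.
    Pixels with no brightness change stay at zero, which is what keeps the
    representation sparse compared to full frames."""
    frame = np.zeros((height, width), dtype=np.int16)
    window_end = None
    for x, y, t_us, polarity in events:
        if window_end is None:
            window_end = t_us + window_us
        if t_us >= window_end:
            yield frame
            frame = np.zeros((height, width), dtype=np.int16)
            window_end = t_us + window_us
        frame[y, x] += 1 if polarity > 0 else -1
    if window_end is not None:
        yield frame
```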
3.2 Eyelid Detection Enhancement
Eyelid movement is critical for character believability but challenging to capture with conventional systems due to the speed of blinks (100-400ms) and subtle lid position changes. Event cameras excel at this task, with recent event-based eye-tracking benchmarks reporting 97%+ P10 accuracy (predicted pupil center within 10 pixels of ground truth).
Implementation Approach:
Near-eye DVS placement for dedicated eyelid tracking
Parabolic curve fitting for upper and lower eyelid contour extraction
Ellipse fitting for pupil tracking with sub-pixel accuracy
20+ Hz tracking output synchronized to animation pipeline
IR illumination for consistent lighting regardless of environment
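A minimal sketch of the parabolic-fit step follows, assuming 2D contour points for each lid have already been extracted from the accumulated events; the openness metric at the end is illustrative.

```python
import numpy as np

def fit_lid_parabola(points: np.ndarray) -> np.ndarray:
    """Fit y = a*x^2 + b*x + c to eyelid contour points (Nx2), returning [a, b, c]."""
    x, y = points[:, 0], points[:, 1]
    return np.polyfit(x, y, deg=2)

def eye_openness(upper: np.ndarray, lower: np.ndarray, pupil_x: float) -> float:
    """Vertical gap between the fitted lower and upper lid curves at the pupil's x position.
    Assumes image coordinates with y increasing downward, so the lower lid has larger y."""
    upper_y = np.polyval(fit_lid_parabola(upper), pupil_x)
    lower_y = np.polyval(fit_lid_parabola(lower), pupil_x)
    return float(max(0.0, lower_y - upper_y))   # 0 when the eye is closed
```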
3.3 Finger Detection Enhancement
Fine finger articulation presents similar challenges: rapid movements, self-occlusion, and the need for high precision across multiple joints.
Implementation Approach:
Stereo DVS configuration for 3D finger position triangulation
Point cloud processing for joint position estimation
Integration with hand skeleton model (21 joints per hand)
Fusion with optical mocap data for robust tracking
Target hardware: iniVation DVXplorer or Prophesee EVK4 HD
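The core geometric step in the stereo configuration is triangulating a fingertip observed in both DVS views. The sketch below uses the standard linear (DLT) method in NumPy and assumes 3x4 projection matrices obtained from a prior stereo calibration.

```python
import numpy as np

def triangulate_point(P1: np.ndarray, P2: np.ndarray, uv1, uv2) -> np.ndarray:
    """Triangulate one 3D point from pixel observations in two calibrated views.

    P1, P2: 3x4 camera projection matrices from stereo calibration.
    uv1, uv2: (u, v) pixel coordinates of the same fingertip in each view.
    Solves the standard linear DLT system via SVD."""
    u1, v1 = uv1
    u2, v2 = uv2
    A = np.stack([
        u1 * P1[2] - P1[0],
        v1 * P1[2] - P1[1],
        u2 * P2[2] - P2[0],
        v2 * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]   # homogeneous -> Euclidean coordinates
```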
Reference Systems: EventEgo3D (CVPR 2024) demonstrated 3D human motion capture from egocentric event streams. MoveEnet achieved high-frequency human pose estimation using event cameras. The AIS 2024 Challenge on Event-Based Eye Tracking validated DVS approaches for precision gaze and lid tracking.
4. REAL-TIME MOTION CAPTURE PIPELINE
4.1 Facial Motion Capture (Primary)
Primary facial capture utilizes iPhone ARKit via the Live Link Face application. This solution provides 52 blend shapes derived from Apple's TrueDepth camera system, enabling high-fidelity lip sync, eye tracking, and facial expression capture at minimal cost.
Hardware Requirements:
iPhone 12 Pro or newer (TrueDepth camera required)
Stable WiFi connection to production network (5GHz preferred)
Phone mount positioned at face level, 12-18 inches from performer
Continuous power connection (streaming drains battery rapidly)
Software Stack:
Live Link Face (iOS application, free)
Unreal Engine 5.4+ with Live Link plugin enabled
Character rigged with ARKit-compatible blend shapes
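Inside the engine, incoming ARKit blend shape values are remapped onto the character's own morph targets. The sketch below expresses that mapping in plain Python: the ARKit coefficient names are real, while the character-side target names and gain values are assumptions specific to this rig.

```python
# Partial mapping from ARKit blend shape names (real ARKit coefficients) to
# character morph targets (hypothetical names for the Fat Cat rig).
ARKIT_TO_CHARACTER = {
    "jawOpen":         ("FatCat_MouthOpen", 1.0),
    "eyeBlinkLeft":    ("FatCat_BlinkL", 1.0),
    "eyeBlinkRight":   ("FatCat_BlinkR", 1.0),
    "mouthSmileLeft":  ("FatCat_SmileL", 1.2),   # slight gain for a more readable grin
    "mouthSmileRight": ("FatCat_SmileR", 1.2),
    "browInnerUp":     ("FatCat_BrowRaise", 1.0),
}

def remap_blendshapes(arkit_weights: dict[str, float]) -> dict[str, float]:
    """Convert raw ARKit weights (0-1) into clamped character morph target weights."""
    out = {}
    for name, weight in arkit_weights.items():
        if name in ARKIT_TO_CHARACTER:
            target, gain = ARKIT_TO_CHARACTER[name]
            out[target] = max(0.0, min(1.0, weight * gain))
    return out
```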
4.2 Body Motion Capture
Body capture utilizes a tiered approach based on production phase:
Phase 1 (Launch): Upper body only via iPhone-based markerless tracking. Character positioned at desk/portal to minimize need for lower body animation.
Phase 2 (Expansion): Full-body markerless capture via Move.ai or similar cloud-processed solution. Enables standing performances and broader physical comedy.
Phase 3 (Premium): Optical motion capture via OptiTrack Prime series cameras. Sub-millimeter accuracy, real-time processing, industry-standard quality.
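Because the body source changes between phases and can drop out mid-broadcast, a simple selection-with-fallback step keeps the character animated. The sketch below is illustrative; the source names and priority order are assumptions, not product APIs.

```python
# Body pose sources in priority order; names are illustrative.
BODY_SOURCE_PRIORITY = ["optical_mocap", "markerless_fullbody", "iphone_upper_body"]

def select_body_source(available: dict[str, bool]) -> str:
    """Pick the highest-priority body source currently delivering data,
    falling back to a canned idle animation if none are live."""
    for source in BODY_SOURCE_PRIORITY:
        if available.get(source, False):
            return source
    return "idle_animation"   # keeps the character moving during a full mocap outage

# Example: optical rig offline, cloud markerless solve still streaming.
print(select_body_source({"optical_mocap": False, "markerless_fullbody": True}))
```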
5. CHARACTER ANIMATION AND RIGGING SPECIFICATIONS
5.1 Model Specifications
Polygon count: 50,000-150,000 triangles (optimized for real-time)
Skeleton: Standard humanoid rig compatible with Unreal Engine retargeting
Blend shapes: Full ARKit 52 blend shape set for facial animation
Additional blend shapes for ears, tail, and character-specific expressions
5.2 Animation Pipeline
Live performance data flows through the following pipeline:
Live Link Face app captures facial performance data (52 blend shapes)
Data transmitted via WiFi to Unreal Engine Live Link plugin
Live Link plugin maps incoming data to character blend shapes
Body mocap data (if available) retargeted to character skeleton
Animation Blueprint blends face + body + procedural overlays
Final pose rendered in real-time with dynamic lighting
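A condensed sketch of the blending step is shown below in plain Python (in production this logic lives in an Unreal Animation Blueprint); the procedural breathing overlay, bone name, and amplitude are illustrative assumptions.

```python
import math
import time

def blend_pose(face_weights: dict[str, float], body_pose: dict[str, tuple], t=None) -> dict:
    """Combine facial blend shape weights, retargeted body pose, and a procedural
    idle-breathing overlay into the final per-frame pose description."""
    t = time.monotonic() if t is None else t
    breathing = 0.02 * math.sin(2 * math.pi * 0.25 * t)   # ~4 s breath cycle, small amplitude
    pose = dict(body_pose)
    # Additive overlay on the chest joint if the body solve provides one.
    if "spine_02" in pose:
        position, rotation = pose["spine_02"]
        pose["spine_02"] = ((position[0], position[1] + breathing, position[2]), rotation)
    return {"blend_shapes": face_weights, "skeleton": pose}
```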
6. AI INTEGRATION LAYER
The AI integration layer serves an enhancement role rather than providing core functionality. The primary "AI" of Fat Cat is the human performer—the character presents as an AI entity, but the execution is human performance. AI systems augment rather than replace human judgment.
6.1 Current AI Applications
Text-to-Speech for specific character voice segments (ElevenLabs integration)
Chat moderation and highlight detection
Suggested response generation for operator reference (not automated)
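One lightweight way to implement highlight detection is to flag spikes in chat message rate against a rolling baseline. The sketch below illustrates that approach; the window length and threshold are assumptions, not tuned values.

```python
from collections import deque
from statistics import mean, stdev

class ChatSpikeDetector:
    """Flags potential highlight moments when chat message rate spikes
    well above its recent baseline (window and threshold are illustrative)."""

    def __init__(self, window: int = 30, threshold_sigma: float = 3.0):
        self.rates = deque(maxlen=window)   # messages-per-minute samples
        self.threshold_sigma = threshold_sigma

    def update(self, messages_per_minute: float) -> bool:
        is_spike = False
        if len(self.rates) >= 10:   # wait for a usable baseline
            baseline, spread = mean(self.rates), stdev(self.rates)
            is_spike = messages_per_minute > baseline + self.threshold_sigma * max(spread, 1.0)
        self.rates.append(messages_per_minute)
        return is_spike
```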
6.2 Planned AI Enhancements (Phase 2+)
Computer vision analysis of featured streams (identifies interesting moments)
Algorithmic expression amplification (subtle enhancement of performer expressions)
Autonomous idle behaviors between high-engagement moments
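As a sketch of what autonomous idle behaviors could look like, the snippet below picks a weighted-random low-intensity action after a period of operator inactivity; the behavior names, weights, and timeout are all assumptions.

```python
import random
import time

# Candidate idle behaviors with selection weights (names and weights are illustrative).
IDLE_BEHAVIORS = {"slow_blink": 5, "ear_twitch": 3, "tail_flick": 3, "look_at_chat": 2, "stretch": 1}
IDLE_TIMEOUT_S = 20.0   # trigger an idle behavior after this much operator inactivity

def pick_idle_behavior(last_input_time, now=None):
    """Return an idle behavior to play if the performer has been inactive, else None."""
    now = time.monotonic() if now is None else now
    if now - last_input_time < IDLE_TIMEOUT_S:
        return None
    names, weights = zip(*IDLE_BEHAVIORS.items())
    return random.choices(names, weights=weights, k=1)[0]
```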