Autonomy Labs · Training Walk Protocol v0.1

Autonomy Labs · Rover Program

Training Walk Protocol · Shared Semantic Map

Tablet-paired walk
Gesture · Voice · Confirmation
Current thinking · v0.1

Doc: TWP-01 · Status: Public · v0.1 draft · Author: R. Kumar · Updated: 2026-05-23 · Peoria, IL

The thesis

Anyone who can describe their environment in plain English can program an autonomous robot.

The training walk is the protocol that makes that sentence true. A homeowner walks their property with a paired tablet, points at things, draws lines with a finger, and says what matters and why. The rover proposes back what it understood; the homeowner accepts or refines. Shared reality gets built exchange by exchange — and the exchange itself, not just the final state, is what gets stored.

01 · Shared visual attention. Both the rover and the homeowner see the same camera feed. The handheld is paired 1:1 over local WiFi. Latency target < 100 ms.

02 · Gesture + voice. Finger draws the geometry. Voice carries naming, categories, and causal explanation. Bound together, the way humans actually teach humans.

03 · Confirmation loop. The rover proposes its interpretation back in plain English before it becomes operational. The exchange is the audit trail.

01 · Core insight — the causal explanation

When a homeowner teaches the rover "that's a ravine, don't go within five feet, you'd suffer damage," they are giving it three things layered on top of each other:

A spatial fact — this specific area, here in this yard.
A semantic category — it's a ravine, a kind of thing.
A causal explanation — depth + irrecoverability = damage.

The third layer is what makes the rule portable. A 2D no-go zone teaches the rover to avoid that one ravine. The causal explanation teaches it to avoid ravines as a category. A year later, when the rover logs an unfamiliar depth discontinuity on the far side of the yard, the overnight reflection cycle can reason "depth + irrecoverable → ravine pattern → apply five-foot buffer" without the rover ever having been walked past that specific spot.

The verbal explanation isn't a label. It is a generalization seed.

This is the actual differentiation. Imitation-learning approaches teach what to do. Pure geometric annotation teaches where the rules apply. Causal explanation teaches why — and why is what generalizes across yards the rover has never seen.
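The three layers can be sketched as a small data structure. This is a minimal illustration, not the protocol's storage format (which section 06 leaves open): the class name `CausalRule`, its fields, and the feature strings such as `depth_discontinuity` are all hypothetical, and matching is reduced to a plain subset test on assumed perception features.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CausalRule:
    taught_at: tuple[float, float]  # spatial fact: where in the yard it was taught
    category: str                   # semantic category, e.g. "ravine"
    causes: frozenset[str]          # causal explanation: features that produce the harm
    consequence: str                # what goes wrong
    buffer_ft: float                # standoff distance the homeowner asked for


def matching_rule(observed: set[str], rules: list[CausalRule]) -> Optional[CausalRule]:
    """Return the first rule whose causal features are all present in the observation."""
    for rule in rules:
        if rule.causes <= observed:
            return rule
    return None


ravine = CausalRule(
    taught_at=(12.0, 40.0),
    category="ravine",
    causes=frozenset({"depth_discontinuity", "irrecoverable"}),
    consequence="damage",
    buffer_ft=5.0,
)

# A spot the rover was never walked past, described by hypothetical features.
novel_spot = {"depth_discontinuity", "irrecoverable", "wet_grass"}
hit = matching_rule(novel_spot, [ravine])
print(hit.category, hit.buffer_ft)  # ravine 5.0
```

The match runs on `causes`, not on `taught_at`. A rule that stored only the location could never fire anywhere else; a rule that stores the cause can.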

02 · Architecture — floor and ceiling

SLAM and occupancy grids do not go away. They are the floor: fast, dumb, reflexive, sensor-speed. When a depth discontinuity appears at 1.5 meters, the rover has 200 ms to brake, not three seconds to consult the vision-language model (VLM) about whether this is a ravine or a sun-shadow.
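A rough stopping-distance calculation shows why the decision has to live at the floor. The 1.5 m range, the 200 ms budget, and the three-second VLM round trip come from the paragraph above; the rover speed and braking deceleration are assumed values chosen only for illustration, and the 200 ms is treated here as decision latency before braking starts.

```python
def stopping_distance_m(speed_mps: float, latency_s: float, decel_mps2: float) -> float:
    """Distance covered during the decision latency plus the braking distance."""
    return speed_mps * latency_s + speed_mps ** 2 / (2 * decel_mps2)


HAZARD_RANGE_M = 1.5  # from the text
SPEED_MPS = 1.0       # assumed rover speed
DECEL_MPS2 = 2.0      # assumed braking deceleration

reflex = stopping_distance_m(SPEED_MPS, 0.2, DECEL_MPS2)  # 200 ms reflex path
vlm = stopping_distance_m(SPEED_MPS, 3.0, DECEL_MPS2)     # 3 s VLM consultation

print(f"reflex: {reflex:.2f} m, stops in time: {reflex < HAZARD_RANGE_M}")
print(f"vlm:    {vlm:.2f} m, stops in time: {vlm < HAZARD_RANGE_M}")
```

Under these assumptions the reflex path stops well short of the edge, while the rover that waits on the VLM has already travelled past it before braking begins.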

The semantic map is the ceiling: meaning, names, categories, causal rules, learned from the training walk and the overnight reflection cycle.

The ceiling teaches the floor. A ravine annotation lives at the ceiling (semantic: depth + irrecoverable = damage) and gets projected down to the floor as a hardened "do not cross" polygon that the reflex layer enforces at sensor speed. The cognitive layer does the meaning-making once; the reflex layer enforces the result fast, forever.
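The projection from ceiling to floor might look like the sketch below. It assumes yard coordinates in meters and a polygon outline already captured during the walk; `SemanticRule`, `HardConstraint`, `compile_down`, and `violates` are hypothetical names, and the reflex check is a plain point-in-polygon test plus a distance-to-edge test, not a claim about the actual reflex implementation.

```python
import math
from dataclasses import dataclass

Point = tuple[float, float]
FT_TO_M = 0.3048


@dataclass(frozen=True)
class SemanticRule:  # lives at the ceiling
    name: str
    outline: tuple[Point, ...]  # polygon in yard coordinates (meters, assumed)
    buffer_ft: float
    reason: str                 # the causal chain stays here, not in the floor


@dataclass(frozen=True)
class HardConstraint:  # lives at the floor: geometry only, no meaning
    outline: tuple[Point, ...]
    buffer_m: float


def compile_down(rule: SemanticRule) -> HardConstraint:
    """Project a semantic rule to the geometry the reflex layer enforces."""
    return HardConstraint(rule.outline, rule.buffer_ft * FT_TO_M)


def _dist_to_segment(p: Point, a: Point, b: Point) -> float:
    dx, dy = b[0] - a[0], b[1] - a[1]
    len_sq = dx * dx + dy * dy
    t = 0.0 if len_sq == 0 else ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / len_sq
    t = max(0.0, min(1.0, t))  # clamp to the segment
    return math.hypot(p[0] - (a[0] + t * dx), p[1] - (a[1] + t * dy))


def _inside(p: Point, poly: tuple[Point, ...]) -> bool:
    """Ray-casting point-in-polygon test."""
    x, y = p
    inside = False
    for i in range(len(poly)):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % len(poly)]
        if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside


def violates(p: Point, c: HardConstraint) -> bool:
    """Reflex-speed check: inside the polygon, or closer to it than the buffer."""
    if _inside(p, c.outline):
        return True
    n = len(c.outline)
    return any(
        _dist_to_segment(p, c.outline[i], c.outline[(i + 1) % n]) < c.buffer_m
        for i in range(n)
    )


ravine = SemanticRule(
    name="ravine",
    outline=((0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)),
    buffer_ft=5.0,
    reason="depth + irrecoverable = damage",
)
floor = compile_down(ravine)
print(violates((5.0, 1.0), floor))  # True: 1.0 m from the edge, inside the 5 ft buffer
print(violates((6.0, 1.0), floor))  # False: 2.0 m from the edge
```

`HardConstraint` carries no `reason` field. The floor only needs geometry it can test quickly; the explanation stays at the ceiling where the reflection cycle can re-read it.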

Ceiling: Semantic Map (L4 · meaning · category · causal rule)

Named regions, categorical hazards, behavioral rules in plain English, the conversational memory of the walk. Source of truth. Re-read during overnight reflection; generalizes from one ravine to "ravines."

Floor: Geometric Map (L1 · polygons · occupancy · reflex-speed)

Occupancy grid, named-region polygons, hardened constraints. Fast-access cache. Compiled down from the ceiling at training time and re-compiled overnight as understanding deepens.

Ceiling projects constraints down ▼ Floor enforces at sensor speed

Mapped onto the four-tier architecture already published in ARCH-01: the ceiling lives at L4 (semantic reasoning) and gets re-read by the overnight reflection cycle. The projection happens at L3 (planner) at training time. The floor is enforced at L1 (reflex). The training walk is the moment all four layers get populated at once, top-down, by the homeowner.

03 · The walk experience

Pairing · tablet on rover, phone or tablet in hand

The rover carries a tablet showing what its cameras see. The homeowner holds a paired handheld (phone or tablet) connected over local WiFi, with WebRTC for low latency. The handheld shows the same feed, plus an overlay for gestures. The design is local-first: if the protocol depends on the cloud, the day Comcast goes down is the day the training walk breaks.
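One way to hold the pairing to its latency target is a rolling monitor on the handheld. This is a sketch only: it assumes each frame carries a capture timestamp and that the rover and handheld clocks are synchronized well enough to subtract them, neither of which the protocol specifies. `LatencyMonitor` and its parameters are hypothetical; the 100 ms target is the one stated earlier in this document.

```python
from collections import deque
from statistics import median


class LatencyMonitor:
    """Tracks recent camera-to-screen delays against a target."""

    def __init__(self, target_ms: float = 100.0, window: int = 30):
        self.target_ms = target_ms
        self.samples: deque[float] = deque(maxlen=window)  # most recent delays, ms

    def record(self, captured_ms: float, displayed_ms: float) -> None:
        self.samples.append(displayed_ms - captured_ms)

    def within_target(self) -> bool:
        """True when the median of the recent window is under the target."""
        return bool(self.samples) and median(self.samples) < self.target_ms


monitor = LatencyMonitor()
for captured, displayed in [(0, 40), (33, 95), (66, 150)]:
    monitor.record(captured, displayed)
print(monitor.within_target())  # True: delays are 40, 62, 84 ms
```

The median is used rather than the maximum so that a single dropped frame does not flag the whole session; whether that is the right statistic is a design choice, not something the protocol settles.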

Gestures · small, learnable, unambiguous vocabulary

Six gestures cover the first 80% of training-walk intent. Add more only on evidence they're needed.

Single tap on a feature · "Look at this. I'm about to say something about it."

Draw a freehand line · A boundary or path edge. Interpretation proposed back.

Tap a sequence of points · A boundary drawn point-by-point, for when freehand is awkward.

Draw a closed loop / circle · A region. Operating area, no-go zone, named place.

Tap-and-hold + voice · Categorical annotation with causal explanation.

Double-tap · Salience marker. "Pay attention to this."
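The six-gesture vocabulary is small enough to classify with a few thresholds. The sketch below assumes an upstream step has already grouped finger contacts into a burst; the `Touch` type, the pixel and time thresholds, and the `classify` function are hypothetical and would need tuning on real walks.

```python
import math
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Gesture(Enum):
    SINGLE_TAP = "attend"
    FREEHAND_LINE = "boundary"
    POINT_SEQUENCE = "boundary_by_points"
    CLOSED_LOOP = "region"
    TAP_AND_HOLD = "categorical_annotation"
    DOUBLE_TAP = "salience"


@dataclass
class Touch:
    points: list[tuple[float, float]]  # screen path of one finger-down to finger-up
    duration_s: float


MOVE_PX = 12.0   # assumed: below this path length, the finger did not move
CLOSE_PX = 40.0  # assumed: start and end this close means the stroke is closed
HOLD_S = 0.6     # assumed: a stationary contact this long is a hold


def _path_len(pts: list[tuple[float, float]]) -> float:
    return sum(math.dist(a, b) for a, b in zip(pts, pts[1:]))


def classify(burst: list[Touch]) -> Optional[Gesture]:
    """Map one burst of finger contacts to a gesture, or None if ambiguous."""
    if not burst:
        return None
    taps = [t for t in burst if _path_len(t.points) < MOVE_PX]
    if len(burst) == 1:
        t = burst[0]
        if taps:  # stationary contact
            return Gesture.TAP_AND_HOLD if t.duration_s >= HOLD_S else Gesture.SINGLE_TAP
        closed = (math.dist(t.points[0], t.points[-1]) < CLOSE_PX
                  and _path_len(t.points) > 3 * CLOSE_PX)
        return Gesture.CLOSED_LOOP if closed else Gesture.FREEHAND_LINE
    if len(taps) != len(burst):
        return None  # mixed taps and strokes: ask the homeowner instead of guessing
    if len(burst) == 2 and math.dist(burst[0].points[0], burst[1].points[0]) < MOVE_PX:
        return Gesture.DOUBLE_TAP
    return Gesture.POINT_SEQUENCE


loop = Touch([(0, 0), (100, 0), (100, 100), (0, 100), (5, 5)], 1.2)
print(classify([loop]))  # Gesture.CLOSED_LOOP
```

Returning `None` for a mixed burst keeps the vocabulary unambiguous: an input the classifier cannot place goes back to the homeowner through the confirmation loop.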

[Figure: a homeowner in a backyard holds a tablet showing the yard from the rover's perspective and draws a line along a ravine's edge with an index finger. The overlay labels the line 'BOUNDARY' and asks 'Reading this as the eastern edge, confirm?'. The rover sits on the grass to the right.] Training walk in progress · finger draws the boundary, voice carries the reason, rover proposes back its interpretation.

Voice as first-class · not a fallback

The handheld is also a microphone. Voice carries naming ("call this the back yard"), causal explanation ("this is the road — never cross it, you'd be in traffic"), preferences ("mow this only in spring"), and corrections ("no, the boundary is further left"). Gesture without language is just geometry; language without gesture is hand-wavy. Bound together, they carry the full semantic payload.
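Binding gesture and voice comes down to pairing events that happen close together in time. The sketch below is one possible approach under stated assumptions: both streams are already timestamped on a shared clock, and a gesture takes the nearest unclaimed utterance within a fixed window. The types, the `bind` function, and the four-second window are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GestureEvent:
    kind: str
    t_start: float
    t_end: float


@dataclass
class Utterance:
    text: str
    t_start: float
    t_end: float


@dataclass
class Annotation:
    gesture: GestureEvent
    utterance: Optional[Utterance]  # None means geometry only, no meaning yet


BIND_WINDOW_S = 4.0  # assumed maximum gap between a gesture and its explanation


def gap_s(g: GestureEvent, u: Utterance) -> float:
    """Seconds between the two intervals; 0.0 when they overlap."""
    return max(0.0, u.t_start - g.t_end, g.t_start - u.t_end)


def bind(gestures: list[GestureEvent], utterances: list[Utterance]) -> list[Annotation]:
    remaining = list(utterances)
    bound = []
    for g in sorted(gestures, key=lambda g: g.t_start):
        best = min(remaining, key=lambda u: gap_s(g, u), default=None)
        if best is not None and gap_s(g, best) <= BIND_WINDOW_S:
            remaining.remove(best)  # each utterance explains at most one gesture
        else:
            best = None
        bound.append(Annotation(g, best))
    return bound


walk = bind(
    [GestureEvent("closed_loop", 10.0, 12.0), GestureEvent("single_tap", 30.0, 30.1)],
    [Utterance("call this the back yard", 11.0, 13.0)],
)
print([(a.gesture.kind, a.utterance.text if a.utterance else None) for a in walk])
# [('closed_loop', 'call this the back yard'), ('single_tap', None)]
```

An unbound gesture is kept rather than dropped, so the rover can ask about it in the confirmation step.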

The confirmation loop · the secret ingredient

After each annotation, the rover proposes back its interpretation in plain English:

"I'm reading this as the eastern boundary of the back yard. Beyond this line is the neighbor's property — I won't cross it. Yes, no, or refine?"

The homeowner accepts, corrects, or refines. The exchange — not just the final answer — is what gets stored. Shared reality is built exchange by exchange, and the audit trail is the conversation itself. This is also the safety mechanism: the rover's interpretation is legible to the homeowner before it becomes operational. No silent miscategorizations.
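The loop can be sketched as a short state machine that logs every turn. Assumptions are flagged here and in the comments: `run_confirmation`, `Reply`, and `Exchange` are hypothetical names, the homeowner's answers arrive already classified as yes, no, or refine, and `reinterpret` stands in for whatever model call produces a revised proposal.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Iterable, Optional


class Reply(Enum):
    YES = "yes"
    NO = "no"
    REFINE = "refine"


@dataclass
class Exchange:
    turns: list[tuple[str, str]] = field(default_factory=list)  # (speaker, text)
    accepted: Optional[str] = None  # set only once the homeowner says yes


def run_confirmation(
    proposal: str,
    answers: Iterable[tuple[Reply, str]],
    reinterpret: Callable[[str, str], str],  # hypothetical stand-in for the model
    max_rounds: int = 5,
) -> Exchange:
    exchange = Exchange()
    answers = iter(answers)
    for _ in range(max_rounds):
        exchange.turns.append(("rover", proposal))
        answer = next(answers, None)
        if answer is None:
            break  # homeowner walked away: nothing becomes operational
        reply, text = answer
        exchange.turns.append(("homeowner", text))
        if reply is Reply.YES:
            exchange.accepted = proposal
            break
        if reply is Reply.NO:
            break
        proposal = reinterpret(proposal, text)  # REFINE: propose again
    return exchange


result = run_confirmation(
    "I'm reading this as the eastern boundary of the back yard.",
    [(Reply.REFINE, "no, the boundary is further left"), (Reply.YES, "yes")],
    reinterpret=lambda old, correction: f"{old} Adjusted: {correction}.",
)
print(len(result.turns), result.accepted is not None)  # 4 True
```

`accepted` stays `None` on every path except an explicit yes, which is the "no silent miscategorizations" property in code form. The `turns` list is the audit trail: it keeps the rejected proposal and the correction, not just the final wording.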

04 · What gets stored

Two persistence layers, both updated during the walk and reconciled overnight:

The geometric map (floor) · Occupancy grid plus named-region polygons. Hardened constraints for the reflex layer: hard boundaries, no-go zones, distance buffers. Compiled down from the semantic layer at the end of the walk, and re-compiled during overnight reflection as understanding deepens.

The semantic memory (ceiling) · A structured store of named regions, categorical hazards, and causal rules in language. Each rule is stored with its full causal chain — not just the consequence. "Don't cross the road — traffic, would be hit, fatal" is stored as the whole chain, not just "don't cross road." The VLM consults this during planning. The overnight reflection cycle re-reads it, generalizes patterns ("multiple things called 'ravine' share depth + irrecoverable"), and writes new rules the next day's planner uses.
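The overnight generalization step ("multiple things called 'ravine' share depth + irrecoverable") can be illustrated with a set intersection. This is a deliberately simple stand-in for whatever the reflection cycle actually does: it assumes each stored instance has already been reduced to a category name and a set of causal features, and the function name and feature strings are hypothetical.

```python
from collections import defaultdict


def generalize(
    instances: list[tuple[str, set[str]]], min_instances: int = 2
) -> dict[str, frozenset[str]]:
    """For each category seen at least min_instances times, keep the causal
    features shared by every instance of that category."""
    by_category: dict[str, list[set[str]]] = defaultdict(list)
    for category, features in instances:
        by_category[category].append(features)

    patterns = {}
    for category, feature_sets in by_category.items():
        if len(feature_sets) < min_instances:
            continue  # one example is not yet a pattern
        shared = set.intersection(*feature_sets)
        if shared:
            patterns[category] = frozenset(shared)
    return patterns


stored = [
    ("ravine", {"depth", "irrecoverable", "rocky"}),
    ("ravine", {"depth", "irrecoverable", "muddy"}),
    ("road", {"traffic", "fatal"}),
]
print(generalize(stored))
```

Here the two ravines agree on `depth` and `irrecoverable`, so those become the category pattern, while `rocky` and `muddy` drop out as incidental. The single road instance is left alone until a second one is taught.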

The semantic memory is the source of truth. The geometric map is a fast-access cache. When they disagree, the semantic memory wins and the cache regenerates.
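A minimal sketch of the source-of-truth relationship, assuming rules are JSON-serializable dictionaries: the cache remembers a digest of the semantic memory it was compiled from and regenerates whenever that digest no longer matches. `MapStore` and its field names are hypothetical, and the "compile" step is reduced to copying the geometry fields.

```python
import hashlib
import json


class MapStore:
    def __init__(self) -> None:
        self.semantic: dict[str, dict] = {}  # source of truth (ceiling)
        self._cache: dict[str, dict] = {}    # compiled constraints (floor)
        self._compiled_from: str = ""        # digest the cache was built from

    def _digest(self) -> str:
        payload = json.dumps(self.semantic, sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

    def put_rule(self, name: str, rule: dict) -> None:
        self.semantic[name] = rule  # the cache is not touched here

    def constraints(self) -> dict[str, dict]:
        """Return the floor constraints, regenerating them if the semantic
        memory has changed since the last compile."""
        digest = self._digest()
        if digest != self._compiled_from:
            self._cache = {
                name: {"polygon": rule["polygon"], "buffer_ft": rule["buffer_ft"]}
                for name, rule in self.semantic.items()
            }
            self._compiled_from = digest
        return self._cache


store = MapStore()
store.put_rule("ravine", {
    "polygon": [[0, 0], [4, 0], [4, 2], [0, 2]],
    "buffer_ft": 5,
    "chain": ["depth", "irrecoverable", "damage"],
})
print(store.constraints()["ravine"]["buffer_ft"])  # 5

store.put_rule("ravine", {**store.semantic["ravine"], "buffer_ft": 8})
print(store.constraints()["ravine"]["buffer_ft"])  # 8: the cache regenerated
```

There is no method that writes to the cache directly. The only path to the floor runs through the semantic memory, which is what makes "the semantic memory wins" true by construction rather than by policy. The digest detects changes to the semantic memory only; detecting a corrupted cache would need a separate check.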

05 · Why this is distinctive

Most approaches to teaching a robot fall into one of three categories, and each has a known limitation:

Imitation learning — do what I do. Brittle to novel situations the demonstration didn't cover.

Pure geometric annotation — don't go where the polygon is. Doesn't generalize beyond the polygons drawn.

Pure language interface — tell me in English what to do. Hand-wavy without a shared visual reference.

This protocol is none of those. It is shared visual attention + gestural annotation + causal explanation + a confirmation loop. The homeowner and the rover are looking at the same pixels. The homeowner is drawing on those pixels with their finger. The homeowner is also explaining why the drawing matters. The rover proposes back what it understood. The exchange becomes the memory.

The cognitive layer does the meaning-making once. The reflex layer enforces the result forever. Geometric maps are not abandoned — they are demoted from "the whole story" to "the fast-access cache the semantic layer keeps coherent."

06 · Open questions

Pre-implementation · to be resolved before code; surfaced publicly while still in flight

Storage shape for causal rules. Free-form natural language (what the homeowner says)? Structured triples (what the planner reasons over)? Both, with one as canonical and the other derived?

Compilation order during the walk. Semantic-first then project down to geometry? Geometric-first then attach meaning? Alternating? Each ordering has different latency and UX consequences.

Handling partial disagreement. When the rover's proposed interpretation is partly right, is "refine" a free-text correction, a re-gesture, another voice utterance, or all three? Probably all three; the UI design needs to make this fluid.

Voice latency vs. gesture latency. Transcription is slower than touch detection. Does the protocol wait for both before proposing back, or propose on gesture alone and incorporate voice into the refinement?

Multi-session continuity. Day-two walk: how does the rover surface what it learned on day one for confirmation, without making the homeowner re-walk the whole property?

The overnight reflection interface. When the rover generalizes a new rule overnight ("I notice depth discontinuities on the south fence — applying the ravine rule"), does the homeowner get a one-tap review the next morning, or does it just take effect?

Categorical taxonomy bootstrap. Does the system ship with a starter taxonomy of categories ("ravine," "road," "garden bed," "compost pile"), or does every category get invented during the walk? Starter is faster but biases the user; freeform is slower but truer to their language.