Computer use

When the agent interacts with the GUI, it does so through the accessibility tree, not screenshots. This is one of the biggest differences between Kiki and screenshot-driven "computer use" agents.

The accessibility tree, not pixels

The compositor keeps the accessibility tree in memory and publishes it over a typed channel (FlatBuffers on a Unix socket) every time it changes. agentd subscribes and keeps a local copy to feed into the model when needed.
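As a rough illustration of the subscribe-and-mirror pattern described above, here is a minimal sketch. It makes several assumptions that are not taken from Kiki itself: each update is a full snapshot sent as a length-prefixed frame, JSON stands in for the FlatBuffers encoding so the example has no dependencies, and a socketpair plays the role of the compositor's Unix socket. The names send_frame, recv_frame, and TreeMirror are hypothetical.

```python
import json
import socket
import struct


def send_frame(sock: socket.socket, payload: bytes) -> None:
    # Assumed framing: 4-byte little-endian length, then the payload.
    sock.sendall(struct.pack("<I", len(payload)) + payload)


def recv_exact(sock: socket.socket, n: int) -> bytes:
    # Stream sockets may return short reads, so loop until n bytes arrive.
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("publisher closed the socket")
        buf.extend(chunk)
    return bytes(buf)


def recv_frame(sock: socket.socket) -> bytes:
    (length,) = struct.unpack("<I", recv_exact(sock, 4))
    return recv_exact(sock, length)


class TreeMirror:
    """Hypothetical subscriber-side copy of the last published tree."""

    def __init__(self) -> None:
        self.tree = None
        self.version = 0

    def apply(self, frame: bytes) -> None:
        # JSON is a stand-in here; the real channel is described as FlatBuffers.
        self.tree = json.loads(frame)
        self.version += 1


def demo() -> TreeMirror:
    # A connected socket pair simulates the compositor and agent endpoints.
    compositor, agent = socket.socketpair()
    mirror = TreeMirror()
    with compositor, agent:
        for pressed in (False, True):
            snapshot = {"role": "button", "name": "Save", "pressed": pressed}
            send_frame(compositor, json.dumps(snapshot).encode())
            mirror.apply(recv_frame(agent))
    return mirror
```

Calling demo() publishes two snapshots and leaves the mirror holding the latest one, so the subscriber can answer questions about the UI without asking the compositor again.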

So the agent:

  • Knows the exact state of every element — which button is pressed, a slider's value, an input's text.
  • Acts precisely — it can click, type, and navigate without inferring from pixels.
  • Is fast — no screen capture, compression, or vision in the loop.
  • Is predictable — it knows what an action will do before doing it.
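The points above can be sketched with a small in-memory tree. This is an illustrative model only: the Node fields, the find helper, and the shape of the action dictionary are assumptions made for the example, not Kiki's actual schema or API.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Node:
    # Hypothetical node shape: an id, a role, a label, and a current value.
    node_id: int
    role: str
    name: str = ""
    value: Optional[object] = None
    children: list["Node"] = field(default_factory=list)

    def walk(self) -> Iterator["Node"]:
        # Depth-first traversal over this node and all descendants.
        yield self
        for child in self.children:
            yield from child.walk()


def find(root: Node, role: str, name: str) -> Optional[Node]:
    # Locate an element by role and label instead of by screen position.
    return next(
        (n for n in root.walk() if n.role == role and n.name == name), None
    )


def click_action(node: Node) -> dict:
    # The action targets a node id, so no pixel coordinates are involved.
    return {"action": "click", "target": node.node_id}


window = Node(1, "window", "Settings", children=[
    Node(2, "slider", "Volume", value=0.4),
    Node(3, "textbox", "Device name", value="kiki-laptop"),
    Node(4, "button", "Apply", value=False),
])

volume = find(window, "slider", "Volume")
apply_button = find(window, "button", "Apply")
action = click_action(apply_button)
```

Here the slider's value is read directly from the tree, and the click is addressed to the button's id rather than to a guessed location on screen.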

Graceful fallback

Not every app is Kiki-native, so the agent degrades gracefully through a series of fallback tiers.

The agent always prefers the highest tier available. Screenshots are a genuine last resort, only for apps that expose no contract.
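The "highest tier available" rule can be sketched as a simple ordered choice. The tier list itself is not reproduced in this section, so the names below are placeholders: only the native contract at the top and screenshots at the bottom come from the text, and the intermediate tier is purely hypothetical.

```python
from enum import IntEnum


class Tier(IntEnum):
    # Higher value means higher fidelity. Names are placeholders.
    SCREENSHOT = 0    # last resort: the app exposes no contract
    INTERMEDIATE = 1  # hypothetical stand-in for any middle tier
    NATIVE = 2        # Kiki-native app built with the SDK


def pick_tier(available: set) -> Tier:
    # Screenshots are assumed to be always possible, so they form the floor.
    return max(available | {Tier.SCREENSHOT})


native_app = pick_tier({Tier.NATIVE, Tier.INTERMEDIATE})
opaque_app = pick_tier(set())
```

With this ordering, an app that offers nothing still resolves to the screenshot tier, while any richer interface wins automatically.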

Why it matters for apps

If you build a Kiki app with the SDK, your app's state and UI are exposed natively through this contract — so the agent operates your app with full fidelity and zero guesswork. You get reliable agent control for free, without designing around a screenshot model. See Your first app.

Kiki OS, Desktop & SDK are open source. See Licensing.