I Put OpenClaw Inside My Ray-Ban Meta Glasses. Here Is How.

How VisionClaw connects Meta Ray-Ban smart glasses to Google Gemini and OpenClaw for real-time voice, vision, and agentic actions.

By: Suganthan Mohanadasan · 8 min read

If you read my previous post on OpenClaw for SEO, you know I’ve been deep in the weeds with this tool. But last week I took it somewhere I didn’t expect.

I put OpenClaw inside my Ray-Ban Meta glasses.

Voice and vision, powered by Gemini. Agentic actions, routed through OpenClaw. All running in real time from a pair of sunglasses.

I recorded a short demo showing it in action:

This is made possible by an open-source project called VisionClaw, created by Xiaoan (Sean Liu, @_seanliu on X). Huge credit to him for open-sourcing the code and making this kind of experimentation accessible to everyone.

The repo is here: github.com/sseanliu/VisionClaw


What Is VisionClaw?

VisionClaw is a real-time AI assistant for Meta Ray-Ban smart glasses. It connects three systems:

  1. Meta Ray-Ban glasses (the camera and microphone)
  2. Google Gemini Live API (the AI brain, processing voice and vision over WebSocket)
  3. OpenClaw (the hands, executing real actions across 56+ connected apps)

The result? You speak to your glasses, Gemini sees what you see, understands what you’re asking, and if the task requires action, it routes the request through OpenClaw to get things done.

Ask it to search the web for the price of something you’re looking at. Tell it to text someone. Ask it to add something to a list. It sees, it thinks, it acts.

There’s also a bonus fourth layer: WebRTC live streaming. Anyone with a room code can watch your glasses’ point-of-view feed in a browser. More on that later.

How It Actually Works (The Architecture)

This is the part that excited me the most. The architecture is clean and each layer has a specific job.

Layer 1: The Glasses (Hardware Input)

The Meta Ray-Ban glasses stream video at roughly 1 frame per second as JPEG images via Meta’s DAT (Device Access Toolkit) SDK. Audio is bidirectional: your voice goes in at 16kHz, and Gemini’s responses come back at 24kHz through the glasses’ speakers.

The app subscribes to the glasses’ video frame publisher and routes every frame to two destinations simultaneously: Gemini (for AI processing) and WebRTC (for live streaming).
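That fan-out can be sketched in a few lines. This is an illustrative Python sketch, not the app’s actual Swift code, and the class and variable names are hypothetical:

```python
from typing import Callable, List

class FramePublisher:
    """Pushes every incoming frame to all registered sinks."""
    def __init__(self) -> None:
        self._sinks: List[Callable[[bytes], None]] = []

    def subscribe(self, sink: Callable[[bytes], None]) -> None:
        self._sinks.append(sink)

    def publish(self, jpeg_frame: bytes) -> None:
        # Every frame goes to every subscriber, so one camera stream
        # can feed both the AI path and the streaming path at once.
        for sink in self._sinks:
            sink(jpeg_frame)

# Wire one frame stream to two destinations, mirroring the app.
gemini_frames: list = []   # AI-processing path
webrtc_frames: list = []   # live-streaming path

publisher = FramePublisher()
publisher.subscribe(gemini_frames.append)
publisher.subscribe(webrtc_frames.append)
publisher.publish(b"\xff\xd8 jpeg bytes")
```

The point of the pattern is that neither consumer knows about the other, so the streaming layer can be dropped without touching the AI path.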

One nice touch: if you don’t have the glasses hardware, the app supports an iPhone camera fallback mode so you can still test everything.

Layer 2: Gemini Live API (The Brain)

The core AI connection runs over a WebSocket to Google’s Gemini Live API. The model used is gemini-2.5-flash-native-audio-preview, Gemini’s native audio model: it processes speech directly, without a separate speech-to-text step, and handles both audio and vision in a single multimodal stream.

The connection flow:

  1. Opens a WebSocket to wss://generativelanguage.googleapis.com/...
  2. Streams audio and video frames continuously
  3. Gemini processes both modalities in real time
  4. Responds with spoken audio back through the glasses

The audio pipeline is worth mentioning. On the capture side, the app taps the audio input node, converts from Float32 to Int16 PCM, resamples to 16kHz mono, and accumulates roughly 100ms chunks before transmission. On the playback side, it accepts Gemini’s Int16 PCM, converts back to Float32, normalises, and schedules it on the audio player. Echo cancellation adapts automatically: aggressive AEC in iPhone mode (where the mic and speaker sit close together), milder in glasses mode (where they’re physically separated).
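The capture-side conversion is easy to picture in code. Here is a minimal sketch, assuming 16kHz mono and ~100ms chunks (1,600 samples) as described above; it is an illustration in Python, not the app’s Swift implementation:

```python
import struct

SAMPLE_RATE = 16_000
CHUNK_SAMPLES = SAMPLE_RATE // 10  # ~100 ms of audio

def float32_to_int16(samples: list) -> list:
    """Clamp each Float32 sample to [-1, 1] and scale to the Int16 range."""
    out = []
    for s in samples:
        s = max(-1.0, min(1.0, s))
        out.append(int(s * 32767))
    return out

class ChunkAccumulator:
    """Buffers Int16 samples and emits packed ~100 ms chunks."""
    def __init__(self) -> None:
        self._buf: list = []

    def feed(self, samples: list) -> list:
        self._buf.extend(samples)
        chunks = []
        while len(self._buf) >= CHUNK_SAMPLES:
            chunk = self._buf[:CHUNK_SAMPLES]
            self._buf = self._buf[CHUNK_SAMPLES:]
            # Little-endian Int16 PCM, ready to transmit.
            chunks.append(struct.pack(f"<{CHUNK_SAMPLES}h", *chunk))
        return chunks
```

Playback is the mirror image: unpack Int16, divide by 32767 to get Float32, and hand the buffer to the audio player.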

Layer 3: OpenClaw Gateway (The Hands)

This is where it gets really interesting. Gemini itself cannot call external APIs. It can only emit structured function calls with a description of what it wants done.

The system prompt tells Gemini it has a single tool called execute. When you ask Gemini to do something actionable, like “text John that I’ll be late” or “search for the price of this perfume on Amazon,” here is what happens:

  1. Gemini recognises the intent and issues an execute function call with a task description
  2. The iOS app intercepts this function call via the ToolCallRouter
  3. The router passes it to the OpenClawBridge, which sends a POST request to OpenClaw’s /v1/chat/completions endpoint
  4. OpenClaw determines which tools to invoke, executes the task, and returns the result
  5. The result flows back through the router to Gemini
  6. Gemini speaks the confirmation to you through the glasses

The bridge maintains a rolling window of 20 messages (10 conversation turns) to give OpenClaw context without blowing up token limits. Session continuity is handled via an x-openclaw-session-key header.
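The bridge behaviour can be sketched as follows. The endpoint path, the 20-message window, and the x-openclaw-session-key header come from the description above; the message shape, function names, and host/port parameters are assumptions for illustration:

```python
import json

WINDOW = 20  # 10 conversation turns = 20 messages

def trim_history(messages: list) -> list:
    """Keep only the most recent WINDOW messages for context."""
    return messages[-WINDOW:]

def build_openclaw_request(history: list, task: str, session_key: str,
                           host: str, port: int):
    """Return (url, headers, body) for the bridge's POST to OpenClaw."""
    messages = trim_history(history + [{"role": "user", "content": task}])
    url = f"http://{host}:{port}/v1/chat/completions"
    headers = {
        "Content-Type": "application/json",
        "x-openclaw-session-key": session_key,  # session continuity
    }
    return url, headers, json.dumps({"messages": messages})
```

Trimming before the POST is what keeps long glasses sessions from steadily inflating OpenClaw’s token usage.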

There’s also proper cancellation support. If you interrupt a tool call mid-execution (say you changed your mind), the app can cancel specific in-flight calls by ID.
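Tracking calls by ID is the piece that makes targeted cancellation possible. A minimal sketch, with hypothetical names:

```python
class InFlightCalls:
    """Registry of running tool calls, keyed by call ID."""
    def __init__(self) -> None:
        self._active: dict = {}

    def start(self, call_id: str) -> None:
        self._active[call_id] = True

    def cancel(self, call_id: str) -> bool:
        """Cancel one specific call; False if it already finished."""
        return self._active.pop(call_id, None) is not None
```

Because cancellation is per-ID rather than all-or-nothing, interrupting one request doesn’t tear down other work the agent has in flight.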

Layer 4: WebRTC Live Streaming (Bonus)

VisionClaw includes a live POV streaming feature using WebRTC. When you start a session, it generates a 6-character room code. Anyone with that code can watch your glasses’ point-of-view feed in a browser.
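Generating such a room code is straightforward. A sketch, where the exact alphabet is my assumption (uppercase letters and digits), not necessarily what the project uses:

```python
import secrets
import string

ALPHABET = string.ascii_uppercase + string.digits

def make_room_code(length: int = 6) -> str:
    """Build a short, hard-to-guess code for viewers to type in."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```

Using a cryptographic source like `secrets` matters here: the code is the only thing gating access to your live camera feed.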

The video comes from a custom capturer that’s fed directly by the glasses’ camera frames (not the phone’s native camera). Video bitrate is capped at 2.5 Mbps and frame rate at 24 fps. Google’s STUN servers handle the peer connection, with TURN fallback for restrictive networks.

The signaling server is a lightweight Node.js app (single dependency: the ws WebSocket library) that can be deployed on Fly.io. It’s room based: the creator generates a code, viewers join by entering it.

One thoughtful detail: there’s a 60 second grace period when you disconnect. This means you can switch apps briefly (say, to copy the room code and share it) without dropping the stream entirely.
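The grace-period logic amounts to remembering when the streamer dropped and only closing the room if they stay away too long. A sketch under that assumption, not the signaling server’s actual code:

```python
GRACE_SECONDS = 60.0

class Room:
    """A streaming room that tolerates brief streamer disconnects."""
    def __init__(self) -> None:
        self.disconnected_at = None  # wall-clock time of last drop

    def on_disconnect(self, now: float) -> None:
        self.disconnected_at = now

    def on_reconnect(self) -> None:
        # Coming back within the window fully revives the room.
        self.disconnected_at = None

    def should_close(self, now: float) -> bool:
        return (self.disconnected_at is not None
                and now - self.disconnected_at > GRACE_SECONDS)
```

A periodic sweep on the server would call `should_close` for each room and evict the expired ones.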

How to Set It Up

What You Need

Hardware:

  • Meta Ray-Ban smart glasses (or just your iPhone camera for testing)
  • iPhone running iOS 17.0 or later

Software:

  • Xcode with Swift
  • Meta AI app installed with Developer Mode enabled
  • A Google Gemini API key (free from Google AI Studio)
  • OpenClaw installed locally or on a server (optional, only needed for agentic actions)

There’s also an Android version (Android 14+, SDK 31+) if you’re not on iOS, though the repo’s primary platform is iOS.

Step by Step

  1. Clone the repo: git clone https://github.com/sseanliu/VisionClaw.git
  2. Open the Xcode project: Navigate to samples/CameraAccess/CameraAccess.xcodeproj and open it in Xcode.
  3. Enable Developer Mode on your phone: Open the Meta AI app and enable Developer Mode. This allows the app to register with your glasses.
  4. Launch the app and connect: Press Connect in the app to register with your glasses via the DAT SDK.
  5. Configure your keys in Settings: The app has a settings panel where you enter your Gemini API Key, OpenClaw Host and Port, Gateway Token, and Signaling URL for WebRTC.
  6. Start a session: Once connected, the glasses start streaming video and audio. Speak naturally. Gemini sees what you see and responds through the glasses speakers.

If you just want the voice and vision features without agentic actions, you can skip the OpenClaw setup entirely and it still works as a multimodal assistant.

What Can You Actually Do With This?

Here’s where I want to be honest. This is early stage, experimental, and genuinely cool, but it’s not a polished consumer product. It’s a developer toolkit.

That said, the possibilities are real:

Visual search in the real world. Look at a product in a store and ask “how much is this on Amazon?” Gemini sees the product through your glasses, identifies it, and OpenClaw searches for the price.

Hands free messaging. “Text Anja that I’m running 10 minutes late.” Gemini processes the voice command, OpenClaw sends the message.

Live research while walking. “What’s the history of this building?” while looking at a landmark. Gemini processes the visual context, OpenClaw searches the web, and you get the answer spoken back.

Meeting prep on the go. Walking into a meeting? “Pull up the key points about the client I’m about to meet.” OpenClaw fetches the info and Gemini narrates it to you.

Live POV sharing. Share what you’re seeing in real time with a colleague via the WebRTC stream. Useful for walkthroughs, site visits, or remote collaboration.

The combination of persistent vision (the glasses are always looking where you look) with agentic execution (OpenClaw can actually do things) opens up workflows that feel genuinely different from pulling out your phone.

Limitations and What to Know Before You Start

Frame rate is low. The glasses stream at roughly 1 fps. This is fine for looking at objects, reading signs, or identifying products. It’s not smooth video. Don’t expect it to track fast movement.

Battery life matters. Running a continuous WebSocket stream, processing audio, and transmitting video frames will drain the glasses faster than normal use.

OpenClaw security applies here too. Everything I wrote about security in my OpenClaw article applies doubly here. Your glasses camera is streaming visual data to Gemini’s API. Be thoughtful about what you point them at and what actions you allow. Set up your SOUL.md with strict human approval gates.

It’s an iOS app (primarily). The Android version exists but the iOS app is the more complete implementation. You need Xcode and a Mac to build it.

Meta’s DAT SDK is still evolving. The SDK has gone through four versions since October 2025. APIs change between versions. Expect some friction if you’re building on top of this.

Wrapping Up

VisionClaw is one of those projects that makes the future feel tangible. The idea of wearing glasses that see what you see, understand what you say, and can take action on your behalf sounds like science fiction. But it’s running on my face right now.

Massive credit again to Xiaoan (Sean Liu) for building and open-sourcing this. The code is well structured, the architecture is clean, and the documentation is solid. If you’re the type of person who wants to tinker with what’s next rather than wait for a polished product, this is worth your weekend.

And if you’re coming from the SEO world and already have OpenClaw running, the agentic layer is where this really shines. Combine persistent vision with your existing automation workflows and you start to see use cases that weren’t possible before.

The repo is at github.com/sseanliu/VisionClaw. Go build something interesting.