NanoClaw can understand images sent as message attachments using Claude’s multimodal capabilities. The agent sees the image content and can describe, analyze, or act on it.
Image vision is currently WhatsApp-only. This skill lives on the nanoclaw-whatsapp fork.

How it works

  1. A WhatsApp image attachment arrives
  2. The WhatsApp channel auto-downloads the image
  3. The image is resized using sharp (to fit within Claude’s input limits)
  4. The image is base64-encoded and passed to the agent as a multimodal content block
  5. Claude sees the image alongside the text message and can reason about it
The agent doesn’t need special instructions — it sees the image natively as part of the conversation.
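The pipeline above can be sketched in TypeScript. This is an illustrative sketch, not the skill's actual code: the function name `toImageBlock` is hypothetical, and the sharp resize call is shown only as a comment (the 1568px figure comes from Anthropic's documented guidance on image dimensions; treat it as an assumption here).

```typescript
// Hypothetical sketch: turn a downloaded WhatsApp image buffer into a
// Claude multimodal content block. Names here are illustrative, not
// the skill's real identifiers.
function toImageBlock(imageBuffer: Buffer, mediaType: string) {
  // In the real skill, sharp resizes the image first, roughly:
  //   sharp(imageBuffer)
  //     .resize({ width: 1568, height: 1568, fit: "inside",
  //               withoutEnlargement: true })
  //     .toBuffer()
  // (assumption: ~1568px per side keeps the image within Claude's limits)
  return {
    type: "image" as const,
    source: {
      type: "base64" as const,
      media_type: mediaType, // e.g. "image/jpeg"
      data: imageBuffer.toString("base64"),
    },
  };
}

// The image block is sent alongside the text in a single user message,
// so Claude sees both together:
const message = {
  role: "user" as const,
  content: [
    toImageBlock(Buffer.from([0xff, 0xd8, 0xff]), "image/jpeg"),
    { type: "text" as const, text: "@Andy what's in this image?" },
  ],
};
```

Because the image arrives as an ordinary content block in the conversation, no extra prompt engineering is needed on the agent side.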

Prerequisites

  • WhatsApp channel installed (/add-whatsapp)
  • The sharp library (installed automatically by the skill)

Installation

# On your nanoclaw-whatsapp fork
git fetch whatsapp skill/image-vision
git merge whatsapp/skill/image-vision
Or via Claude Code:
/add-image-vision
After merging, rebuild:
npm run build

Usage examples

Send an image to a WhatsApp group where the agent is active, then ask:
@Andy what's in this image?
@Andy extract the text from this screenshot
@Andy describe this chart
Last modified on March 19, 2026