AI Image Caption Generator
Upload any photo and let on-device AI write a descriptive caption — free, no account required, and nothing leaves your device. Powered by ViT-GPT2 running entirely in your browser.
Drop an image here, or browse files
JPG, PNG, WebP, GIF, BMP — up to 10 MB
How It Works
Upload your image
Drag and drop or click to select a photo. JPG, PNG, WebP, GIF, or BMP up to 10 MB. Nothing is sent anywhere.
Model downloads once
On first use, the ViT-GPT2 model (~250 MB) is downloaded and saved to your browser for offline reuse.
AI describes locally
The ViT encoder + GPT-2 decoder run in a Web Worker right here in your browser — never on a server.
Copy the caption
Once inference finishes, the descriptive caption appears in the result area. Copy it to your clipboard with one click.
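The flow above can be sketched as a message handler. The message shapes and function names below are illustrative assumptions, not the tool's actual internal API; in the browser, this handler would live inside the Web Worker and real inference would replace the stub.

```javascript
// Hypothetical sketch of the main-thread <-> worker protocol implied by
// the steps above. In the real tool, "load-model" triggers the one-time
// ~250 MB download and "caption" runs ViT-GPT2 inference in the worker.
function handleWorkerMessage(msg, state) {
  switch (msg.type) {
    case "load-model":
      // First use: the model would be downloaded and cached here.
      state.modelReady = true;
      return { type: "model-ready" };
    case "caption":
      if (!state.modelReady) {
        return { type: "error", reason: "model not loaded" };
      }
      // Real inference (ViT encoder + GPT-2 decoder) would run here;
      // state.stubCaption stands in for the generated text.
      return { type: "caption-result", caption: state.stubCaption };
    default:
      return { type: "error", reason: `unknown message: ${msg.type}` };
  }
}
```

The main thread posts `load-model` once, then a `caption` message per image, and updates the progress indicator while it waits for the reply.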
How to Use the AI Image Caption Generator
- Drag and drop or click to upload an image — JPG, PNG, WebP, GIF, or BMP up to 10 MB.
- Click Generate Caption. On first use, a one-time model download (~250 MB) is required — your browser saves it locally for instant reuse and offline use.
- The ViT-GPT2 model analyses the image inside a Web Worker in your browser. A progress indicator shows when inference is running.
- A descriptive caption appears in the result area below.
- Click Copy to copy the caption to your clipboard, or Reset to start over with a new image.
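The upload checks in step 1 can be sketched as a small client-side validator. This is a hypothetical function, not the tool's real code; it simply mirrors the stated limits (JPG, PNG, WebP, GIF, or BMP, up to 10 MB):

```javascript
// Illustrative client-side validation of an uploaded File object.
const SUPPORTED_TYPES = new Set([
  "image/jpeg", "image/png", "image/webp", "image/gif", "image/bmp",
]);
const MAX_BYTES = 10 * 1024 * 1024; // 10 MB upload limit

function validateUpload(file) {
  if (!SUPPORTED_TYPES.has(file.type)) {
    return { ok: false, reason: `Unsupported format: ${file.type}` };
  }
  if (file.size > MAX_BYTES) {
    return { ok: false, reason: "File exceeds the 10 MB limit" };
  }
  return { ok: true };
}
```

Because validation happens before anything else, an oversized or unsupported file is rejected without ever being read into the model pipeline.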
Features
- Your image stays private: The AI model runs entirely in your browser — nothing is uploaded, stored, or shared.
- No account required: Open the page and start captioning images straight away.
- Descriptive captions: A natural-language sentence describing the contents of the image, generated locally.
- Works offline: After the one-time model download, the tool runs with no internet connection.
- Powered by ViT-GPT2 (Apache-2.0): A vision-encoder-decoder model that pairs a Vision Transformer image encoder with a GPT-2 text decoder, served as a quantised ONNX model and run via Transformers.js in a Web Worker using WebAssembly.
- Broad format support: JPG, PNG, WebP, GIF, and BMP images up to 10 MB.
When to Use AI Image Captioning
- Alt-text drafts for accessibility — generate a first-pass alt attribute for an image, then refine the wording manually before publishing.
- Stock photo metadata — produce starter descriptions for a batch of images you are tagging for a catalogue or DAM.
- Caption ideas for social posts — use the generated description as a creative jumping-off point for an Instagram or LinkedIn caption.
- Personal photo organisation — describe the contents of holiday or family photos to make them easier to find later.
- Private analysis — describe sensitive or confidential images (medical scans, internal screenshots) without uploading them to a cloud service.
How the AI Generates a Caption
Under the hood the tool runs the ViT-GPT2 vision-encoder-decoder model entirely in your browser, in two cooperating stages:
- Vision encoding. A Vision Transformer (ViT) image encoder reads the image, resizes it to 224×224, normalises the pixel values using the ImageNet mean / standard deviation, and produces a sequence of image embeddings — one per image patch.
- Caption decoding. A GPT-2 autoregressive text decoder cross-attends over those image embeddings and generates the caption one token at a time, until the model produces an end-of-sequence token.
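The per-pixel normalisation in the vision-encoding stage can be sketched as follows. The constants are the standard ImageNet per-channel mean and standard deviation the text refers to; the function name is illustrative, and the 224×224 resize is omitted (in the browser it would use a canvas):

```javascript
// Standard ImageNet per-channel statistics (RGB order).
const IMAGENET_MEAN = [0.485, 0.456, 0.406];
const IMAGENET_STD  = [0.229, 0.224, 0.225];

// Normalise one RGB pixel: scale 0-255 channel values to [0, 1],
// subtract the channel mean, divide by the channel std.
function normalisePixel(r, g, b) {
  const scaled = [r / 255, g / 255, b / 255];
  return scaled.map((v, c) => (v - IMAGENET_MEAN[c]) / IMAGENET_STD[c]);
}
```

Applying this to every pixel of the resized 224×224 image yields the tensor the ViT encoder splits into patches and turns into one embedding per patch.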
Because the entire pipeline runs in a Web Worker on your device, the image never leaves your browser — there is no upload, no API call, and no server-side processing.
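The token-by-token generation in the caption-decoding stage can be illustrated with a toy greedy loop. Here `nextToken` is a stand-in for the GPT-2 decoder cross-attending over the image embeddings; the real model scores a full vocabulary at each step, and the names below are not the tool's actual code:

```javascript
const EOS = "</s>"; // end-of-sequence marker

// Greedy autoregressive decoding: repeatedly ask the decoder for the
// next token given everything generated so far, stopping at EOS or a
// maximum length.
function generateCaption(nextToken, maxTokens = 20) {
  const tokens = [];
  for (let i = 0; i < maxTokens; i++) {
    const tok = nextToken(tokens); // greedy: take the single best token
    if (tok === EOS) break;        // model signals the caption is complete
    tokens.push(tok);
  }
  return tokens.join(" ");
}

// Example stub that "decodes" a fixed caption one token at a time:
const stubDecoder = (prev) => ["a", "dog", "on", "a", "beach", EOS][prev.length];
```

With the stub, `generateCaption(stubDecoder)` returns `"a dog on a beach"`; in the real pipeline, each `nextToken` call is one decoder forward pass.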
Frequently Asked Questions
Is my image uploaded to your servers?
No. Your image never leaves your device. The entire captioning process runs inside your browser using a Web Worker. Nothing is sent to or stored on any server.
Is this free to use?
Yes, completely free. No account required, no usage limits, and no hidden charges.
Why is the first caption slow?
The first time you use the tool, your browser downloads the ViT-GPT2 AI model (~250 MB — vision encoder + text decoder ONNX files). After that one-time download, the model loads from your browser’s cache and the tool works offline too.
Which image formats are supported?
JPG, PNG, WebP, GIF, and BMP images up to 10 MB.
Does it work offline?
Yes. After the model is downloaded on your first use, it is cached by your browser. You can then caption images with no internet connection.
Which AI model does this use?
nlpconnect/vit-gpt2-image-captioning (Apache-2.0), a vision-encoder-decoder model pairing a Vision Transformer image encoder with a GPT-2 text decoder. Served as a quantised ONNX model via Transformers.js. It runs in a Web Worker in your browser using WebAssembly. The one-time download is ~250 MB.
Why does the caption generator not work in Safari?
Safari’s JavaScript engine does not eagerly free WebAssembly memory, and the ViT-GPT2 model exhausts the tab’s memory budget during autoregressive caption generation. Use Chrome, Brave, Firefox, or Edge.
Privacy & Security