Run Gemma on the edge with the Coral Board

This is the Coral Board, a small, low-power dev board with Google’s Coral NPU machine learning accelerator inside, built for developers to experiment with on-device AI. It runs Gemma, and everything happens on the board. For I/O it ships as a kit with a screen, a camera, microphones, and LEDs, so you can see what’s possible on the edge.

What’s covered: live translation, with speech in and translated speech out, running entirely on the board; natural language controlling physical hardware; and vision and sound working together in a lightweight version of the pre-I/O show that generates music from an aquarium of jellyfish. All of it runs on a single board.

The Coral Board is available this summer. Every demo in the video is open source and on GitHub to get you started. Follow the links below to learn more.

What will you build on the edge with Gemma? Drop it in the comments.

Speaker: Ian Ballantyne

Products Mentioned: Google AI, Gemini

Running Gemma on-device with the Coral Board allows for high-performance, private, and offline AI inference. Everything happens directly on the board, eliminating the need for cloud connectivity for these tasks (0:14–0:17).

How Gemma runs on-device:

Hardware Acceleration: The Coral Board utilizes Google’s Coral NPU (Neural Processing Unit), an ultra-low-power, RISC-V based accelerator specifically designed to handle machine learning workloads efficiently at the edge (0:04–0:06).
Software Integration: Developers use the MediaPipe LLM Inference API, which provides specialized wrappers to manage on-device memory and handle the model’s operations directly on the hardware accelerator.
Quantization: Because edge devices have limited memory and compute, Gemma models are typically converted to 4-bit or 8-bit quantized formats. This significantly reduces the model’s memory footprint and improves inference speed, usually at a small cost in output quality, making it suitable for hardware like the Coral Board.
Workflow: The process generally involves taking Gemma model weights, quantizing them for compatibility with LiteRT (formerly TensorFlow Lite), and deploying them via the MediaPipe framework to execute on the NPU.
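
To make the quantization step above concrete, here is a minimal sketch in plain Python. It is not the LiteRT or MediaPipe conversion pipeline; it only illustrates the arithmetic behind the memory savings and a simple per-tensor symmetric 8-bit scheme. The 270M parameter count and the helper names (weight_footprint_mb, quantize_int8, dequantize) are assumptions chosen for illustration.

```python
from __future__ import annotations

# Illustrative only: per-tensor symmetric quantization, not the real converter.


def weight_footprint_mb(num_params: int, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in megabytes (10**6 bytes)."""
    return num_params * bits_per_weight / 8 / 1e6


def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float values from the stored integers."""
    return [v * scale for v in quantized]


if __name__ == "__main__":
    params = 270_000_000  # assumed model size, for illustration only
    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit weights: {weight_footprint_mb(params, bits):.0f} MB")

    weights = [0.5, -1.27, 0.003, 0.9]
    q, scale = quantize_int8(weights)
    print("quantized:", q)
    print("restored: ", [round(r, 3) for r in dequantize(q, scale)])
```

Under these assumptions, halving the bits per weight halves the weight storage. The round trip also shows where the quality cost comes from: values much smaller than the scale (0.003 here) collapse to zero. Production converters generally use finer-grained schemes, such as per-channel or per-block scales, to limit that error.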

This setup enables developers to build sophisticated edge applications—such as live speech translation, natural language hardware control, and real-time creative audio-visual generation—all running locally (0:30–0:55).
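
As a rough idea of how natural language hardware control might be wired up, the sketch below validates a model's text output before anything touches the hardware. Everything here is hypothetical: the JSON command schema, the ALLOWED table, and the parse_command helper are assumptions for illustration, and the open-source demos may work quite differently.

```python
from __future__ import annotations

import json
from typing import Optional

# Hypothetical schema: the model is prompted to answer with JSON such as
# {"device": "led", "action": "on"}. Only listed commands are accepted.
ALLOWED = {"led": {"on", "off"}}


def parse_command(model_output: str) -> Optional[tuple[str, str]]:
    """Return (device, action) if the output is a valid command, else None."""
    try:
        data = json.loads(model_output)
    except json.JSONDecodeError:
        return None  # free-form text, not a command
    if not isinstance(data, dict):
        return None
    device, action = data.get("device"), data.get("action")
    if not isinstance(device, str) or not isinstance(action, str):
        return None
    if device in ALLOWED and action in ALLOWED[device]:
        return device, action
    return None  # unknown device or action: ignore rather than guess


if __name__ == "__main__":
    print(parse_command('{"device": "led", "action": "on"}'))
    print(parse_command("Sure, turning the light on!"))
```

The design choice is to treat the model's output as untrusted input: a small allowlist sits between the generated text and the hardware, so a malformed or unexpected response is dropped instead of being executed.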

Running Gemma on the edge is practical with Google’s hardware and software stacks designed for low-power, local AI inference. The Gemma model family includes small models intended for edge and mobile deployment. [1, 2]

Hardware Options

To run Gemma locally on specialized edge hardware, you have two main options, depending on which board you are using:

The Coral Dev Board (Gemma-optimized): Google offers a dedicated Coral development board designed to run lightweight Gemma models (such as Gemma 3 270M). It is a full-stack, open hardware platform capable of on-device AI tasks like real-time voice translation and natural language hardware control.
Classic Coral Dev Board (Edge TPU): If you are using older Coral Dev Boards (which have ~1 to 4 TOPS ML accelerators), running a full large language model can be resource-heavy. However, these boards excel at highly optimized encoder-based, vision, and audio tasks at the edge. [4, 5]

Deployment Tools for Edge & Mobile

To execute Gemma or its optimized variants (including the Gemma 4 small sizes), Google provides specific cross-platform runtimes:

LiteRT (formerly TFLite) for LLMs: The fully open-source framework specifically built to run LLMs directly on user devices, offering fine-grained control and direct NPU/GPU acceleration.
MediaPipe: Provides the LLM Inference API, a high-level way to integrate Gemma into cross-platform edge applications. [6]
