Browser Extension · 2026

YouTube Real-Time Bilingual Subtitles

A Chrome Extension paired with a Go WebSocket backend that intercepts video audio and overlays real-time bilingual subtitles, using Deepgram for speech-to-text and a Google Translate/DeepL failover mechanism for translation.

Project Overview

This project is a decoupled real-time subtitle translation system designed to break down language barriers in YouTube videos and livestreams. A Chrome Extension captures the video's audio with the Web Audio API and streams chunks over a WebSocket to a Go backend, which performs low-latency Speech-to-Text (STT) with Deepgram and spreads translation requests across Google Translate and DeepL with load balancing and failover. The resulting bilingual subtitles are rendered dynamically over the player.

Technical Challenges & Solutions

Frontend Dynamic Audio Interception and Buffered WebSocket Streaming

To achieve real-time performance, the system must intercept the audio track from YouTube's HTML5 <video> element without disrupting playback and continuously transmit audio chunks to the backend, a pipeline that is prone to memory leaks and stream stutter if buffering is mishandled.

Solution:
Used the Web Audio API to create an AudioContext and route the <video> element's output through a pass-through node chain, so the viewer's playback is untouched. A ScriptProcessor (legacy) or AudioWorklet node captures the Float32 sample frames, downsamples them, and converts them to 16-bit PCM before streaming them over the WebSocket to the Go backend.
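
The conversion at the heart of this step is a simple clamp-and-scale. The extension performs it in JavaScript inside the capture node; the Go sketch below only mirrors that logic to document the wire format the backend receives (little-endian 16-bit PCM, which Deepgram consumes as linear16). The function name is illustrative, not taken from the project.

```go
package audio

import (
	"encoding/binary"
	"math"
)

// float32ToPCM16 converts normalized Float32 samples (range [-1, 1])
// into little-endian 16-bit PCM bytes. Samples are clamped before
// scaling so an out-of-range value clips instead of wrapping around.
func float32ToPCM16(samples []float32) []byte {
	out := make([]byte, len(samples)*2)
	for i, s := range samples {
		v := math.Max(-1, math.Min(1, float64(s)))
		binary.LittleEndian.PutUint16(out[i*2:], uint16(int16(v*32767)))
	}
	return out
}
```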

Multi-Translation API Load Balancing and Failover

Free tiers of machine translation services (such as Google Translate and DeepL) hit their rate limits quickly under a flood of high-frequency short-sentence requests, causing immediate translation failures.

Solution:
Implemented an interface-driven translation engine in the Go backend with a round-robin load-balancing strategy. The system rotates through multiple API keys and providers automatically; if one engine is rate-limited, the next takes over immediately so subtitles continue without interruption.
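
A minimal sketch of what such an interface-driven pool can look like in Go. The Engine interface, Pool type, and method names are illustrative assumptions, not the project's actual code; concrete Google Translate and DeepL clients would implement Engine.

```go
package translate

import (
	"context"
	"errors"
	"fmt"
	"sync/atomic"
)

// Engine abstracts a translation provider, so Google Translate and DeepL
// clients (or the same provider under different API keys) are interchangeable.
type Engine interface {
	Name() string
	Translate(ctx context.Context, text, targetLang string) (string, error)
}

// Pool rotates across engines round-robin and fails over on error.
type Pool struct {
	engines []Engine
	next    atomic.Uint64
}

func NewPool(engines ...Engine) *Pool {
	return &Pool{engines: engines}
}

// Translate starts at the next engine in rotation; if it fails (for
// example on a 429 rate limit), it falls through to the remaining
// engines before giving up.
func (p *Pool) Translate(ctx context.Context, text, targetLang string) (string, error) {
	n := uint64(len(p.engines))
	if n == 0 {
		return "", errors.New("no translation engines configured")
	}
	start := p.next.Add(1)
	var lastErr error
	for i := uint64(0); i < n; i++ {
		eng := p.engines[(start+i)%n]
		out, err := eng.Translate(ctx, text, targetLang)
		if err == nil {
			return out, nil
		}
		lastErr = fmt.Errorf("%s: %w", eng.Name(), err)
	}
	return "", fmt.Errorf("all engines failed, last: %w", lastErr)
}
```

Because the counter advances atomically, concurrent requests naturally spread across providers, and a throttled engine becomes one failed hop in the rotation rather than a dropped subtitle.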

Dynamic Flicker-Free Subtitle Rendering

Speech recognition results arrive as continuous mid-sentence updates (interim results), so naively repainting the DOM on every update causes severe subtitle flickering and ruins the viewing experience.

Solution:
Defined a structured message protocol between the frontend and backend that distinguishes final from interim results. The frontend maintains an active-subtitle state and updates only the affected text nodes, which keeps repaints cheap; positioning is handled with CSS transforms, so the overlay can be dragged freely to keep it from occluding the action.
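
A sketch of what one such protocol message could look like as a Go struct on the backend. The field names are assumptions chosen to illustrate the final/interim distinction, not the project's actual schema.

```go
package protocol

// SubtitleMessage is one JSON frame pushed from the backend to the
// extension for each STT update. For interim results the frontend
// overwrites the current line's text node in place; only when IsFinal
// is true does it commit the line, which is what keeps the overlay
// from flickering. Translation is filled only for final results, since
// interim text is too unstable to be worth a translation API call.
type SubtitleMessage struct {
	IsFinal     bool   `json:"isFinal"`
	Transcript  string `json:"transcript"`            // recognized source-language text
	Translation string `json:"translation,omitempty"` // target-language text, final results only
}
```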

Architecture

The frontend (a Manifest V3 extension) handles audio interception, WebSocket communication, and draggable, flicker-free subtitle rendering, and persists the translation history locally. The Go backend accepts a low-latency WebSocket connection, feeds the audio stream to Deepgram, and wraps multiple translation APIs behind a round-robin pool so translation continues even when a provider fails.
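
The glue between the two halves is a single WebSocket endpoint. Below is a minimal sketch of that endpoint using the gorilla/websocket package; the route, port, and the comment-level Deepgram step are assumptions for illustration, not the project's actual wiring.

```go
package main

import (
	"log"
	"net/http"

	"github.com/gorilla/websocket"
)

var upgrader = websocket.Upgrader{
	// The extension connects from youtube.com pages, so the default
	// same-origin check has to be relaxed (restrict this in production).
	CheckOrigin: func(r *http.Request) bool { return true },
}

func handleStream(w http.ResponseWriter, r *http.Request) {
	conn, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		log.Println("upgrade failed:", err)
		return
	}
	defer conn.Close()

	for {
		msgType, data, err := conn.ReadMessage()
		if err != nil {
			return // client disconnected
		}
		if msgType != websocket.BinaryMessage {
			continue // audio chunks arrive as binary PCM frames
		}
		// A real handler would forward `data` to Deepgram's streaming STT,
		// run final transcripts through the translation pool, and push
		// SubtitleMessage frames back with conn.WriteJSON(...).
		_ = data
	}
}

func main() {
	http.HandleFunc("/ws", handleStream)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```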

Learnings

Developing this system deepened my understanding of streaming media processing and the architectural difficulties of real-time applications: solving sample-rate conversion when intercepting YouTube's <video> audio through the Web Audio API, tuning the stability of the WebSocket link, and designing an auto-failover mechanism so the system never depends on a single translation provider. Building a tool that adds lag-free bilingual subtitles to live streams was immensely rewarding.

Tech Stack

Extension

JavaScript · Chrome Extension API · Web Audio API

Backend

Go · WebSocket

AI Service

Deepgram STT API · Google Translate · DeepL API