Integrating the Gemini Realtime Model
This article targets Java developers who want to quickly integrate Gemini Live (realtime voice Q&A plus speech synthesis and transcription) into the tio-boot framework. It covers: dependencies and preparation, overall architecture, key backend implementation notes, a frontend example, audio formats and protocol, deployment and debugging tips, and common issues and extension points. The content is based on a working implementation, with added explanations and best practices, so it can be copied into a project directly.
1 Overview
Goal: integrate Google's Gemini Live realtime model into tio-boot (a tio-based WebSocket server) so that:
- the browser captures microphone audio (16 kHz PCM) and streams it to the model in real time;
- the model streams back text transcriptions, synthesized audio (24 kHz PCM), and events;
- the backend handles session management, forwarding, and WebSocket framing, while the frontend handles audio capture, playback, and UI control.
This article includes a complete example: backend Java (GeminiLiveBridge, VoiceSocketHandler, etc.) and a static frontend (index.html + app.js + mic-worklet.js). It can be reused directly or adapted as needed.
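The 16 kHz input / 24 kHz output formats above imply simple sizing math that is useful when batching WebSocket frames or sizing buffers. A minimal sketch (the helper class and method names are my own, not part of the sample code):

```java
public class PcmMath {
  // bytes per second for 16-bit mono PCM at the given sample rate
  public static int bytesPerSecond(int sampleRate) {
    return sampleRate * 2; // 2 bytes per 16-bit sample, 1 channel
  }

  // bytes in a frame of the given duration, e.g. the ~40 ms chunks the mic worklet batches
  public static int bytesPerFrame(int sampleRate, int frameMs) {
    return sampleRate * frameMs / 1000 * 2;
  }
}
```

For example, a 40 ms chunk of 16 kHz mono PCM is 1280 bytes, and the 24 kHz model output streams at 48000 bytes per second.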
2 Prerequisites
- A Java development environment (JDK 11+ recommended).
- The tio/tio-boot framework (an existing tio-based WebSocket service).
- Maven (or another build tool).
- An API key obtained from the Gemini platform (or the corresponding SDK provider), stored in an environment variable or secure configuration (the sample code reads it via GeminiClient.GEMINI_API_KEY — switch this to your project's secure configuration mechanism).
- Browser support for getUserMedia and AudioWorklet (the sample includes a ScriptProcessor fallback for browsers without AudioWorklet).
- The SDK dependency added (the samples use com.google.genai:google-genai:1.41.0):
Maven example:
<dependency>
<groupId>com.google.genai</groupId>
<artifactId>google-genai</artifactId>
<version>1.41.0</version>
</dependency>
- Apply for a Google Gemini API key:
GEMINI_API_KEY=
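The samples read the key through GeminiClient.GEMINI_API_KEY. One common resolution pattern (a sketch only, not the actual GeminiClient implementation) is to prefer the environment variable and fall back to a JVM system property, so local runs can use -DGEMINI_API_KEY=...:

```java
public class ApiKeyResolver {
  // Reads GEMINI_API_KEY from the environment, falling back to a -DGEMINI_API_KEY
  // system property; returns null when neither is set.
  public static String resolve() {
    String key = System.getenv("GEMINI_API_KEY");
    if (key == null || key.isEmpty()) {
      key = System.getProperty("GEMINI_API_KEY");
    }
    return key;
  }
}
```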
3 Complete Backend Code
Add the dependency:
<dependency>
<groupId>com.google.genai</groupId>
<artifactId>google-genai</artifactId>
<version>1.41.0</version>
</dependency>
RealtimeBridgeCallback
package com.litongjava.voice.agent.bridge;
public interface RealtimeBridgeCallback {
void sendText(String json);
void sendBinary(byte[] bytes);
void close(String reason);
void session(String sessionId);
void turnComplete(String role, String text);
void start(RealtimeSetup setup);
}
RealtimeSetup
package com.litongjava.voice.agent.bridge;
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import lombok.experimental.Accessors;
@Data
@NoArgsConstructor
@AllArgsConstructor
@Accessors(chain=true)
public class RealtimeSetup {
private String system_prompt; // system prompt
private String user_prompt; // user prompt
private String job_description;
private String resume;
private String questions;
private String greeting;
}
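With Lombok's @Data and @Accessors(chain = true), the class above gets getters plus chainable setters generated at compile time. Without Lombok, the equivalent hand-written code (a sketch covering just two of the fields, with an illustrative class name) looks like this:

```java
public class RealtimeSetupPlain {
  private String system_prompt;
  private String greeting;

  public String getSystem_prompt() { return system_prompt; }
  // chained setter: returns this, matching @Accessors(chain = true)
  public RealtimeSetupPlain setSystem_prompt(String v) { this.system_prompt = v; return this; }
  public String getGreeting() { return greeting; }
  public RealtimeSetupPlain setGreeting(String v) { this.greeting = v; return this; }
}
```

The chained style lets callers build the setup fluently, e.g. `new RealtimeSetupPlain().setSystem_prompt("...").setGreeting("...")`.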
GeminiLiveBridge
package com.litongjava.voice.agent.bridge;
import java.time.Duration;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import com.google.genai.AsyncSession;
import com.google.genai.Client;
import com.google.genai.types.ActivityHandling;
import com.google.genai.types.AudioTranscriptionConfig;
import com.google.genai.types.AutomaticActivityDetection;
import com.google.genai.types.Blob;
import com.google.genai.types.ClientOptions;
import com.google.genai.types.Content;
import com.google.genai.types.EndSensitivity;
import com.google.genai.types.LiveConnectConfig;
import com.google.genai.types.LiveSendClientContentParameters;
import com.google.genai.types.LiveSendRealtimeInputParameters;
import com.google.genai.types.LiveServerContent;
import com.google.genai.types.LiveServerMessage;
import com.google.genai.types.Modality;
import com.google.genai.types.Part;
import com.google.genai.types.PrebuiltVoiceConfig;
import com.google.genai.types.RealtimeInputConfig;
import com.google.genai.types.SpeechConfig;
import com.google.genai.types.StartSensitivity;
import com.google.genai.types.ThinkingConfig;
import com.google.genai.types.TurnCoverage;
import com.google.genai.types.VoiceConfig;
import com.litongjava.voice.agent.model.WsVoiceAgentResponseMessage;
import com.litongjava.gemini.GeminiClient;
import com.litongjava.tio.utils.hutool.StrUtil;
import com.litongjava.tio.utils.json.JsonUtils;
import lombok.extern.slf4j.Slf4j;
@Slf4j
public class GeminiLiveBridge {
private static final String MODEL = "models/gemini-2.5-flash-native-audio-preview-12-2025";
// input/output audio MIME types
private static final String INPUT_MIME = "audio/pcm;rate=16000";
private static final String OUTPUT_MIME_PREFIX = "audio/pcm"; // output is audio/pcm (24 kHz)
private final Object transcriptLock = new Object();
private final StringBuilder turnUserTranscript = new StringBuilder();
private final StringBuilder turnAssistantTranscript = new StringBuilder();
private final Client client;
private volatile AsyncSession session;
private final RealtimeBridgeCallback callback;
public GeminiLiveBridge(RealtimeBridgeCallback sender) {
this.callback = sender;
Client.Builder b = Client.builder().apiKey(GeminiClient.GEMINI_API_KEY);
ClientOptions clientOptions = ClientOptions.builder().build();
b.clientOptions(clientOptions);
this.client = b.build();
}
public CompletableFuture<Void> connect(RealtimeSetup realtimeSetup) {
LiveConnectConfig config = buildLiveConfig();
// AsyncLive.connect(model, config) -> CompletableFuture<AsyncSession>
return client.async.live.connect(MODEL, config).thenCompose(sess -> {
this.session = sess;
String sessionId = sess.sessionId();
callback.session(sessionId);
send(new WsVoiceAgentResponseMessage("gemini_connected", sessionId));
// if prompts already exist when the connection is established (the frontend may send setup first), forward them to the model immediately
try {
sendPromptsIfAny(sess, realtimeSetup);
} catch (Exception ex) {
log.error("send setup prompts error (connect)", ex);
send(new WsVoiceAgentResponseMessage("error", safe(ex.getMessage())));
}
// register the receive callback (only once)
CompletableFuture<Void> receiveFuture = sess.receive(this::onGeminiMessage);
receiveFuture.whenComplete((v, ex) -> {
  if (ex != null) {
    log.error("gemini receive loop terminated", ex);
  }
});
return receiveFuture;
}).exceptionally(ex -> {
log.error("Gemini live connect failed", ex);
String safe = safe(ex.getMessage());
send(new WsVoiceAgentResponseMessage("error", safe));
callback.close("gemini connect failed");
return null;
});
}
public CompletableFuture<Void> close() {
try {
AsyncSession s = this.session;
if (s != null) {
return s.close().exceptionally(ex -> null);
}
} finally {
try {
client.close();
callback.close("close");
} catch (Exception ignore) {
}
}
return CompletableFuture.completedFuture(null);
}
/** Raw 16 kHz PCM pushed from the frontend */
public CompletableFuture<Void> sendPcm16k(byte[] pcm16k) {
AsyncSession s = this.session;
if (s == null)
return CompletableFuture.completedFuture(null);
Blob audioBlob = Blob.builder().mimeType(INPUT_MIME).data(pcm16k).build();
LiveSendRealtimeInputParameters params = LiveSendRealtimeInputParameters.builder().audio(audioBlob).build();
return s.sendRealtimeInput(params).exceptionally(ex -> {
String message = ex.getMessage();
log.error(message);
send(new WsVoiceAgentResponseMessage("error", safe(message)));
if ("org.java_websocket.exceptions.WebsocketNotConnectedException".equals(message)) {
close();
}
return null;
});
}
/** The frontend signals the end of the audio stream (optional) */
public CompletableFuture<Void> sendAudioStreamEnd() {
AsyncSession s = this.session;
if (s == null) {
return CompletableFuture.completedFuture(null);
}
LiveSendRealtimeInputParameters params = LiveSendRealtimeInputParameters.builder().audioStreamEnd(true).build();
return s.sendRealtimeInput(params).exceptionally(ex -> {
String message = ex.getMessage();
log.error(message);
send(new WsVoiceAgentResponseMessage("error", safe(message)));
if ("org.java_websocket.exceptions.WebsocketNotConnectedException".equals(message)) {
close();
}
return null;
});
}
/** Text input from the frontend (optional) */
public CompletableFuture<Void> sendText(String text) {
AsyncSession s = this.session;
if (s == null) {
return CompletableFuture.completedFuture(null);
}
Content userMessage = Content.fromParts(Part.fromText(text));
LiveSendClientContentParameters cc = LiveSendClientContentParameters.builder().turns(List.of(userMessage))
.turnComplete(true).build();
return s.sendClientContent(cc).exceptionally(ex -> {
log.error(ex.getMessage());
send(new WsVoiceAgentResponseMessage("error", safe(ex.getMessage())));
return null;
});
}
private void sendPromptsIfAny(AsyncSession s, RealtimeSetup realtimeSetup) {
if (realtimeSetup == null) {
return;
}
String systemPrompt = realtimeSetup.getSystem_prompt();
String job_description = realtimeSetup.getJob_description();
String resume = realtimeSetup.getResume();
String questions = realtimeSetup.getQuestions();
String greeting = realtimeSetup.getGreeting();
List<Content> initialTurns = new ArrayList<>();
if (StrUtil.notBlank(systemPrompt)) {
initialTurns.add(Content.fromParts(Part.fromText(systemPrompt)));
}
if (StrUtil.notBlank(job_description)) {
initialTurns.add(Content.fromParts(Part.fromText(job_description)));
}
if (StrUtil.notBlank(resume)) {
initialTurns.add(Content.fromParts(Part.fromText(resume)));
}
if (StrUtil.notBlank(questions) || StrUtil.notBlank(greeting)) {
  // avoid concatenating "null" when only one of the two is present
  String g = StrUtil.notBlank(greeting) ? greeting : "";
  String q = StrUtil.notBlank(questions) ? questions : "";
  String joined = (g.isEmpty() || q.isEmpty()) ? g + q : g + "\n\n" + q;
  initialTurns.add(Content.fromParts(Part.fromText(joined)));
}
if (!initialTurns.isEmpty()) {
// initialization instructions usually serve as context rather than a completed user turn; adjust turnComplete as needed
LiveSendClientContentParameters cc = LiveSendClientContentParameters.builder().turns(initialTurns)
.turnComplete(true).build();
s.sendClientContent(cc).exceptionally(ex -> {
log.error(ex.getMessage());
send(new WsVoiceAgentResponseMessage("error", safe(ex.getMessage())));
return null;
});
send(new WsVoiceAgentResponseMessage("setup_sent_to_model"));
}
}
private LiveConnectConfig buildLiveConfig() {
// automatic VAD configuration: AutomaticActivityDetection / RealtimeInputConfig
AutomaticActivityDetection vad = AutomaticActivityDetection.builder().disabled(false)
.startOfSpeechSensitivity(StartSensitivity.Known.START_SENSITIVITY_HIGH)
.endOfSpeechSensitivity(EndSensitivity.Known.END_SENSITIVITY_LOW).prefixPaddingMs(100).silenceDurationMs(500)
.build();
RealtimeInputConfig realtimeInput = RealtimeInputConfig.builder().automaticActivityDetection(vad)
.activityHandling(ActivityHandling.Known.START_OF_ACTIVITY_INTERRUPTS)
.turnCoverage(TurnCoverage.Known.TURN_INCLUDES_ONLY_ACTIVITY).build();
PrebuiltVoiceConfig prebuiltVoiceConfig = PrebuiltVoiceConfig.builder().voiceName("Puck").build();
VoiceConfig voiceConfig = VoiceConfig.builder().prebuiltVoiceConfig(prebuiltVoiceConfig).build();
SpeechConfig speech = SpeechConfig.builder().voiceConfig(voiceConfig).build();
ThinkingConfig thinkingConfig = ThinkingConfig.builder().thinkingBudget(0).build();
AudioTranscriptionConfig inputAudioTranscription = AudioTranscriptionConfig.builder().build();
LiveConnectConfig liveConnectConfig = LiveConnectConfig.builder()
.responseModalities(List.of(new Modality(Modality.Known.AUDIO)))
//
.speechConfig(speech).thinkingConfig(thinkingConfig).realtimeInputConfig(realtimeInput)
// optional: transcription
.inputAudioTranscription(inputAudioTranscription).outputAudioTranscription(inputAudioTranscription)
//
.build();
return liveConnectConfig;
}
private void onGeminiMessage(LiveServerMessage msg) {
try {
if (msg.setupComplete().isPresent()) {
send(new WsVoiceAgentResponseMessage("setup_complete"));
}
Optional<LiveServerContent> serverContentOpt = msg.serverContent();
if (serverContentOpt.isPresent()) {
LiveServerContent sc = serverContentOpt.get();
// input transcription
sc.inputTranscription().ifPresent(t -> {
Optional<String> optional = t.text();
String text = optional.orElse(null);
send(new WsVoiceAgentResponseMessage("transcript_in", text));
appendTurnTranscript("user", text);
});
// output transcription
sc.outputTranscription().ifPresent(t -> {
Optional<String> optional = t.text();
String text = optional.orElse(null);
send(new WsVoiceAgentResponseMessage("transcript_out", text));
appendTurnTranscript("model", text);
});
// model output (audio/text parts)
sc.modelTurn().ifPresent(content -> {
content.parts().ifPresent(parts -> {
for (Part p : parts) {
// text
p.text().ifPresent(txt -> send(new WsVoiceAgentResponseMessage("text", txt)));
// audio (inlineData)
p.inlineData().ifPresent(blob -> {
String mt = blob.mimeType().orElse("");
byte[] data = blob.data().orElse(null);
if (data != null && mt.startsWith(OUTPUT_MIME_PREFIX)) {
// send binary directly (the frontend should play it as 24 kHz PCM)
callback.sendBinary(data);
} else if (data != null) {
// fallback: non-audio inlineData is forwarded as base64 text
String b64 = Base64.getEncoder().encodeToString(data);
WsVoiceAgentResponseMessage m = new WsVoiceAgentResponseMessage("inline_data");
m.setText(b64);
send(m);
}
});
// functionCall: extend later as needed
p.functionCall().ifPresent(fc -> {
WsVoiceAgentResponseMessage m = new WsVoiceAgentResponseMessage("function_call");
m.setText(fc.name().orElse(""));
send(m);
});
}
});
});
// turnComplete
if (sc.turnComplete().orElse(false)) {
send(new WsVoiceAgentResponseMessage("turn_complete"));
flushTurnTranscriptOnComplete();
}
}
// goAway (the server signals an imminent disconnect)
msg.goAway().ifPresent(g -> {
String timeLeft = g.timeLeft().map(Duration::toString) // e.g. PT30S / PT1M
.orElse("");
WsVoiceAgentResponseMessage m = new WsVoiceAgentResponseMessage("go_away");
m.setText(timeLeft);
send(m);
});
// usage
msg.usageMetadata().ifPresent(u -> {
WsVoiceAgentResponseMessage m = new WsVoiceAgentResponseMessage("usage");
// usage counts are carried on dedicated DTO fields
Optional<Integer> promptTokenCount = u.promptTokenCount();
Optional<Integer> responseTokenCount = u.responseTokenCount();
Optional<Integer> totalTokenCount = u.totalTokenCount();
m.setPromptTokenCount(promptTokenCount);
m.setResponseTokenCount(responseTokenCount);
m.setTotalTokenCount(totalTokenCount);
send(m);
});
} catch (Exception e) {
log.error("onGeminiMessage error", e);
send(new WsVoiceAgentResponseMessage("error", safe(e.getMessage())));
}
}
private void send(WsVoiceAgentResponseMessage msg) {
try {
String json = JsonUtils.toSkipNullJson(msg);
callback.sendText(json);
} catch (Exception e) {
log.error("serialize message error", e);
// fallback: a minimal hand-written error message
callback.sendText("{\"type\":\"error\",\"message\":\"serialize error\"}");
}
}
private static String safe(String s) {
if (s == null) {
return "";
}
return s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n").replace("\r", "");
}
private void appendTurnTranscript(String role, String text) {
if (text == null || text.isEmpty())
return;
synchronized (transcriptLock) {
StringBuilder sb = "user".equals(role) ? turnUserTranscript : turnAssistantTranscript;
if (sb.length() > 0)
sb.append(' ');
sb.append(text);
}
}
private void flushTurnTranscriptOnComplete() {
String userText;
String assistantText;
synchronized (transcriptLock) {
userText = turnUserTranscript.toString().trim();
assistantText = turnAssistantTranscript.toString().trim();
turnUserTranscript.setLength(0);
turnAssistantTranscript.setLength(0);
}
// one turn complete: invoke the callback once per role (skipped when empty)
try {
if (!userText.isEmpty()) {
callback.turnComplete("user", userText);
}
if (!assistantText.isEmpty()) {
callback.turnComplete("assistant", assistantText);
}
} catch (Exception e) {
log.error("turnComplete callback error", e);
}
}
}
WsRealtimeBridgeCallback
package com.litongjava.voice.agent.callback;
import com.litongjava.voice.agent.bridge.RealtimeBridgeCallback;
import com.litongjava.voice.agent.bridge.RealtimeSetup;
import com.litongjava.voice.agent.consts.InterviewConst;
import com.litongjava.voice.agent.utils.ChannelContextUtils;
import com.litongjava.tio.core.ChannelContext;
import com.litongjava.tio.core.Tio;
import com.litongjava.tio.websocket.common.WebSocketResponse;
import lombok.extern.slf4j.Slf4j;
@Slf4j
public class WsRealtimeBridgeCallback implements RealtimeBridgeCallback {
private ChannelContext channelContext;
private String sessionId;
public WsRealtimeBridgeCallback(ChannelContext channelContext) {
this.channelContext = channelContext;
this.sessionId = ChannelContextUtils.key(channelContext);
}
@Override
public void sendText(String json) {
WebSocketResponse wsResp = WebSocketResponse.fromText(json, InterviewConst.CHARSET);
Tio.send(channelContext, wsResp);
}
@Override
public void sendBinary(byte[] bytes) {
WebSocketResponse wsResp = WebSocketResponse.fromBytes(bytes);
Tio.send(channelContext, wsResp);
}
@Override
public void close(String reason) {
Tio.remove(channelContext, reason);
}
@Override
public void session(String sessionId) {
}
@Override
public void start(RealtimeSetup setup) {
  // no-op here; override to persist the setup if needed
}
@Override
public void turnComplete(String role, String text) {
  // log.info("role:{},text:{}", role, text);
}
}
ChannelContextUtils
package com.litongjava.voice.agent.utils;
import com.litongjava.tio.core.ChannelContext;
public class ChannelContextUtils {
public static String key(ChannelContext ctx) {
// use tio's own unique identifier
return ctx.getId();
}
}
InterviewConst
package com.litongjava.voice.agent.consts;
public interface InterviewConst {
String CHARSET = "utf-8";
}
WsVoiceAgentRequestMessage
package com.litongjava.voice.agent.model;
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
/**
 * Message structure sent from the WebSocket frontend
 */
@Data
@NoArgsConstructor
@AllArgsConstructor
public class WsVoiceAgentRequestMessage {
private String type; // message type: setup | text | audio_end | close | ...
private String text; // text when type == "text"
private String system_prompt; // system prompt when type == "setup"
private String user_prompt; // user prompt when type == "setup"
private String job_description;
private String resume;
private String questions;
private String greeting;
}
WsVoiceAgentResponseMessage
package com.litongjava.voice.agent.model;
import java.util.Optional;
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
/**
 * Unified backend -> frontend message body
 */
@Data
@NoArgsConstructor
@AllArgsConstructor
public class WsVoiceAgentResponseMessage {
private String type;
// common fields
private String text;
private String message;
private String where;
// setup / session
private String sessionId;
// goAway
private String timeLeft;
// usage
private Integer promptTokenCount;
private Integer responseTokenCount;
private Integer totalTokenCount;
// function call
private String name;
public WsVoiceAgentResponseMessage(String type) {
this.type = type;
}
public WsVoiceAgentResponseMessage(String type, String text) {
this.type = type;
this.text = text;
}
public WsVoiceAgentResponseMessage(String type, Optional<String> optional) {
this.type = type;
this.text = optional.orElse(null);
}
public void setPromptTokenCount(Optional<Integer> promptTokenCount) {
this.promptTokenCount = promptTokenCount != null ? promptTokenCount.orElse(null) : null;
}
public void setResponseTokenCount(Optional<Integer> responseTokenCount) {
this.responseTokenCount = responseTokenCount != null ? responseTokenCount.orElse(null) : null;
}
public void setTotalTokenCount(Optional<Integer> totalTokenCount) {
this.totalTokenCount = totalTokenCount != null ? totalTokenCount.orElse(null) : null;
}
}
WsVoiceAgentType
package com.litongjava.voice.agent.model;
/**
 * WebSocket message type enum
 */
public enum WsVoiceAgentType {
  // client -> server
  SETUP, TEXT, AUDIO_END, CLOSE,
  // server -> client response types (the same enum is reused)
  ERROR, SETUP_RECEIVED, SETUP_SENT_TO_MODEL, GEMINI_CONNECTED, SETUP_COMPLETE,
  TRANSCRIPT_IN, TRANSCRIPT_OUT, TURN_COMPLETE, GO_AWAY, USAGE,
  INLINE_DATA, FUNCTION_CALL, IGNORED
}
VoiceSocketHandler
package com.litongjava.voice.agent.handler;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import com.litongjava.voice.agent.audio.SessionAudioRecorder;
import com.litongjava.voice.agent.bridge.GeminiLiveBridge;
import com.litongjava.voice.agent.bridge.RealtimeBridgeCallback;
import com.litongjava.voice.agent.bridge.RealtimeSetup;
import com.litongjava.voice.agent.callback.WsRealtimeBridgeCallback;
import com.litongjava.voice.agent.consts.InterviewConst;
import com.litongjava.voice.agent.model.WsVoiceAgentRequestMessage;
import com.litongjava.voice.agent.model.WsVoiceAgentResponseMessage;
import com.litongjava.voice.agent.model.WsVoiceAgentType;
import com.litongjava.voice.agent.utils.ChannelContextUtils;
import com.litongjava.tio.core.ChannelContext;
import com.litongjava.tio.core.Tio;
import com.litongjava.tio.http.common.HttpRequest;
import com.litongjava.tio.http.common.HttpResponse;
import com.litongjava.tio.utils.json.JsonUtils;
import com.litongjava.tio.websocket.common.WebSocketRequest;
import com.litongjava.tio.websocket.common.WebSocketResponse;
import com.litongjava.tio.websocket.common.WebSocketSessionContext;
import com.litongjava.tio.websocket.server.handler.IWebSocketHandler;
import lombok.extern.slf4j.Slf4j;
@Slf4j
public class VoiceSocketHandler implements IWebSocketHandler {
// one bridge per frontend connection
private static final Map<String, GeminiLiveBridge> BRIDGES = new ConcurrentHashMap<>();
@Override
public HttpResponse handshake(HttpRequest httpRequest, HttpResponse response, ChannelContext channelContext)
throws Exception {
log.info("handshake request: {}", httpRequest);
return response;
}
@Override
public void onAfterHandshaked(HttpRequest httpRequest, HttpResponse httpResponse, ChannelContext channelContext)
throws Exception {
log.info("handshake complete: {}", httpRequest);
}
@Override
public Object onClose(WebSocketRequest wsRequest, byte[] bytes, ChannelContext channelContext) throws Exception {
String k = ChannelContextUtils.key(channelContext);
GeminiLiveBridge bridge = BRIDGES.remove(k);
if (bridge != null) {
bridge.close();
}
Tio.remove(channelContext, "client closed the connection");
return null;
}
@Override
public Object onBytes(WebSocketRequest wsRequest, byte[] bytes, ChannelContext channelContext) throws Exception {
String k = ChannelContextUtils.key(channelContext);
// the frontend pushes raw 16k PCM mono; record the user's uplink audio
try {
SessionAudioRecorder.appendUserPcm(k, bytes);
} catch (Exception ex) {
log.warn("appendUserPcm failed: {}", ex.getMessage());
}
GeminiLiveBridge bridge = BRIDGES.get(k);
if (bridge != null) {
bridge.sendPcm16k(bytes);
}
return null;
}
@Override
public Object onText(WebSocketRequest wsRequest, String text, ChannelContext channelContext) throws Exception {
WebSocketSessionContext wsSessionContext = (WebSocketSessionContext) channelContext.get();
String path = wsSessionContext.getHandshakeRequest().getRequestLine().path;
log.info("path: {}, received: {}", path, text);
String t = text == null ? "" : text.trim();
// first try to parse the text as JSON -> WsVoiceAgentRequestMessage
WsVoiceAgentRequestMessage msg = null;
try {
msg = JsonUtils.parse(t, WsVoiceAgentRequestMessage.class);
} catch (Exception je) {
  // parse failed: fall back to plain-text handling
  log.debug("received non-JSON text or failed to parse as WsVoiceAgentRequestMessage: {}", je.getMessage());
  return null;
} catch (Throwable e) {
  log.error("unexpected error while parsing the message", e);
  return null;
}
GeminiLiveBridge bridge = BRIDGES.get(ChannelContextUtils.key(channelContext));
if (bridge == null && msg != null && msg.getType() != null) {
String typeStr = msg.getType().trim().toUpperCase();
WsVoiceAgentType typeEnum = null;
try {
typeEnum = WsVoiceAgentType.valueOf(typeStr);
} catch (Exception ex) {
  // unrecognized type; fall through
  log.debug("unknown type: {}", typeStr);
}
if (typeEnum == null) {
  return null;
}
switch (typeEnum) {
case SETUP:
String systemPrompt = msg.getSystem_prompt();
String user_prompt = msg.getUser_prompt();
String job_description = msg.getJob_description();
String resume = msg.getResume();
String questions = msg.getQuestions();
String greeting = msg.getGreeting();
RealtimeSetup realtimeSetup = new RealtimeSetup(systemPrompt, user_prompt, job_description, resume, questions,
greeting);
connectLLM(channelContext, realtimeSetup);
// echo a confirmation
String json = toJson(new WsVoiceAgentResponseMessage(WsVoiceAgentType.SETUP_RECEIVED.name()));
Tio.send(channelContext, WebSocketResponse.fromText(json, InterviewConst.CHARSET));
break;
default:
break;
}
return null;
}
if (bridge == null) {
String respJson = toJson(new WsVoiceAgentResponseMessage(WsVoiceAgentType.ERROR.name(), "no bridge"));
Tio.send(channelContext, WebSocketResponse.fromText(respJson, InterviewConst.CHARSET));
return null;
}
try {
if (msg != null && msg.getType() != null) {
String typeStr = msg.getType().trim().toUpperCase();
WsVoiceAgentType typeEnum = null;
try {
typeEnum = WsVoiceAgentType.valueOf(typeStr);
} catch (Exception ex) {
  // unrecognized type; fall through
  log.debug("unknown type: {}", typeStr);
}
if (typeEnum != null) {
switch (typeEnum) {
case AUDIO_END:
bridge.sendAudioStreamEnd();
break;
case TEXT:
String userText = msg.getText() == null ? "" : msg.getText();
bridge.sendText(userText);
break;
case CLOSE:
bridge.close();
Tio.remove(channelContext, "client requested close");
break;
default:
  // other types: echo the original JSON
Tio.send(channelContext, WebSocketResponse.fromText(
toJson(new WsVoiceAgentResponseMessage(WsVoiceAgentType.IGNORED.name(), t)), InterviewConst.CHARSET));
break;
}
}
}
} catch (Exception e) {
log.error(e.getMessage(), e);
}
return null;
}
private String toJson(WsVoiceAgentResponseMessage wsVoiceAgentResponseMessage) {
return JsonUtils.toSkipNullJson(wsVoiceAgentResponseMessage);
}
private void connectLLM(ChannelContext channelContext, RealtimeSetup setup) {
String k = ChannelContextUtils.key(channelContext);
RealtimeBridgeCallback callback = new WsRealtimeBridgeCallback(channelContext);
callback.start(setup);
// start the recorder (user audio is 16k; model output defaults to 24k)
try {
SessionAudioRecorder.start(k, 16000, 24000);
} catch (Exception e) {
log.warn("start recorder failed: {}", e.getMessage());
}
GeminiLiveBridge bridge = new GeminiLiveBridge(callback);
BRIDGES.put(k, bridge);
// connect to Gemini Live (asynchronous)
bridge.connect(setup);
}
}
4 Complete Frontend Code
index.html
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width,initial-scale=1" />
<title>Voice Agent (WS)</title>
<style>
body { font-family: system-ui, -apple-system, Segoe UI, Roboto, Arial; margin: 24px; }
.row { display: flex; gap: 12px; flex-wrap: wrap; align-items: center; }
button { padding: 10px 14px; cursor: pointer; }
input, select, textarea { padding: 10px; min-width: 280px; }
.card { border: 1px solid #ddd; border-radius: 10px; padding: 14px; margin-top: 14px; }
.muted { color: #666; font-size: 12px; }
pre { white-space: pre-wrap; word-break: break-word; background: #f7f7f7; padding: 12px; border-radius: 8px; }
.grid { display: grid; grid-template-columns: 1fr; gap: 12px; }
@media (min-width: 900px) { .grid { grid-template-columns: 1fr 1fr; } }
.col { display:flex; flex-direction:column; gap:8px; }
label.small { font-size: 13px; color:#333; }
</style>
</head>
<body>
<h2>Voice Agent Frontend (static)</h2>
<div class="card">
<div class="row">
<label>WS URL:</label>
<input id="wsUrl" />
<button id="btnConnect">Connect</button>
<button id="btnDisconnect" disabled>Disconnect</button>
</div>
<div class="muted">
By default derived from the current site: ws(s)://{host}/api/v1/voice/agent
</div>
<!-- System prompt and user prompt -->
<div style="margin-top:12px; display:flex; gap:12px; flex-wrap:wrap;">
<div class="col" style="flex:1; min-width:240px;">
<label class="small">System prompt</label>
<textarea id="systemPrompt" rows="3" placeholder="Enter a system prompt, e.g.: You are a friendly voice assistant; always answer concisely."></textarea>
</div>
<div class="col" style="flex:1; min-width:240px;">
<label class="small">User prompt</label>
<textarea id="userPrompt" rows="3" placeholder="Enter a user prompt for the model (optional)"></textarea>
</div>
</div>
</div>
<div class="card">
<div class="row">
<button id="btnStartMic" disabled>Start speaking</button>
<button id="btnStopMic" disabled>Stop speaking</button>
<button id="btnAudioEnd" disabled>Send audio_end</button>
<label class="muted">Input: 16k PCM / Output: 24k PCM</label>
</div>
<div class="row" style="margin-top: 10px;">
<input id="textInput" placeholder="Type text to send to the model (optional)" />
<button id="btnSendText" disabled>Send text</button>
</div>
</div>
<div class="grid">
<div class="card">
<div class="row">
<strong>Transcripts / Events</strong>
<button id="btnClearLog">Clear</button>
</div>
<pre id="log"></pre>
</div>
<div class="card">
<strong>Playback state</strong>
<pre id="playState"></pre>
</div>
</div>
</div>
<script src="./app.js"></script>
</body>
</html>
mic-worklet.js
class MicProcessor extends AudioWorkletProcessor {
constructor() {
super();
this._srcRate = sampleRate; // sample rate of the AudioContext
this._dstRate = 16000;
this._ratio = this._srcRate / this._dstRate;
this._buffer = new Float32Array(0);
this._enabled = false;
this.port.onmessage = (e) => {
const msg = e.data || {};
if (msg.type === "enable") this._enabled = true;
if (msg.type === "disable") this._enabled = false;
};
}
_concat(a, b) {
const out = new Float32Array(a.length + b.length);
out.set(a, 0);
out.set(b, a.length);
return out;
}
// simple linear resampling to 16k
_resampleTo16k(input) {
const inLen = input.length;
if (inLen === 0) return new Float32Array(0);
// target length
const outLen = Math.floor(inLen / this._ratio);
const out = new Float32Array(outLen);
for (let i = 0; i < outLen; i++) {
const t = i * this._ratio;
const i0 = Math.floor(t);
const i1 = Math.min(i0 + 1, inLen - 1);
const frac = t - i0;
out[i] = input[i0] * (1 - frac) + input[i1] * frac;
}
return out;
}
process(inputs) {
if (!this._enabled) return true;
const input = inputs[0];
if (!input || !input[0]) return true;
// take only the mono channel
const chan0 = input[0];
// append to the internal buffer so frames are not too short
this._buffer = this._concat(this._buffer, chan0);
// batch at least ~40 ms before posting (16k * 0.04 = 640 samples)
// minSrc is the equivalent length at the source sample rate
const minSrc = Math.floor(this._srcRate * 0.04);
if (this._buffer.length < minSrc) return true;
const chunk = this._buffer;
this._buffer = new Float32Array(0);
const resampled = this._resampleTo16k(chunk);
this.port.postMessage({ type: "pcm_f32_16k", data: resampled }, [resampled.buffer]);
return true;
}
}
registerProcessor("mic-processor", MicProcessor);
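The worklet's `_resampleTo16k` is plain linear interpolation. The same algorithm in Java (a sketch mirroring the JS above, handy for server-side tests of recorded audio; the class name is my own):

```java
public class LinearResampler {
  // Linear-interpolation resampling, mirroring the mic worklet's _resampleTo16k:
  // each output sample is interpolated between its two nearest input samples.
  public static float[] resample(float[] input, int inRate, int outRate) {
    if (inRate == outRate || input.length == 0) {
      return input;
    }
    double ratio = (double) inRate / outRate;
    int outLen = (int) Math.floor(input.length / ratio);
    float[] out = new float[outLen];
    for (int i = 0; i < outLen; i++) {
      double t = i * ratio;
      int i0 = (int) Math.floor(t);
      int i1 = Math.min(i0 + 1, input.length - 1);
      double frac = t - i0;
      out[i] = (float) (input[i0] * (1 - frac) + input[i1] * frac);
    }
    return out;
  }
}
```

Linear interpolation is adequate for speech at these rates; a production pipeline that needs higher fidelity could swap in a windowed-sinc resampler without changing the call site.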
app.js
const el = (id) => document.getElementById(id);
const wsUrlInput = el("wsUrl");
const btnConnect = el("btnConnect");
const btnDisconnect = el("btnDisconnect");
const btnStartMic = el("btnStartMic");
const btnStopMic = el("btnStopMic");
const btnAudioEnd = el("btnAudioEnd");
const textInput = el("textInput");
const btnSendText = el("btnSendText");
const logEl = el("log");
const playStateEl = el("playState");
const btnClearLog = el("btnClearLog");
// the two prompt elements
const systemPromptEl = el("systemPrompt");
const userPromptEl = el("userPrompt");
function logLine(s) {
logEl.textContent += s + "\n";
logEl.scrollTop = logEl.scrollHeight;
}
function setPlayState(obj) {
playStateEl.textContent = JSON.stringify(obj, null, 2);
}
function defaultWsUrl() {
const loc = window.location;
const proto = loc.protocol === "https:" ? "wss:" : "ws:";
return `${proto}//${loc.host}/api/v1/voice/agent`;
}
wsUrlInput.value = defaultWsUrl();
/** ---------- WebSocket ---------- */
let ws = null;
/** ---------- Audio (Mic) ---------- */
let micStream = null;
let micCtx = null;
let micNode = null; // AudioWorkletNode 或 ScriptProcessorNode
let micEnabled = false;
/** ---------- Audio (Playback) ---------- */
let playCtx = null;
let nextPlayTime = 0;
let playedChunks = 0;
let droppedChunks = 0;
const INPUT_RATE = 16000;
const OUTPUT_RATE = 24000;
function pcm16ToFloat32(int16) {
const f32 = new Float32Array(int16.length);
for (let i = 0; i < int16.length; i++) f32[i] = int16[i] / 32768;
return f32;
}
function float32ToInt16PCM(f32) {
const out = new Int16Array(f32.length);
for (let i = 0; i < f32.length; i++) {
let s = Math.max(-1, Math.min(1, f32[i]));
out[i] = s < 0 ? s * 32768 : s * 32767;
}
return out;
}
// resample 24k Float32 to playCtx.sampleRate (typically 48k)
function resampleLinear(input, inRate, outRate) {
if (inRate === outRate) return input;
const ratio = inRate / outRate;
const outLen = Math.floor(input.length / ratio);
const out = new Float32Array(outLen);
for (let i = 0; i < outLen; i++) {
const t = i * ratio;
const i0 = Math.floor(t);
const i1 = Math.min(i0 + 1, input.length - 1);
const frac = t - i0;
out[i] = input[i0] * (1 - frac) + input[i1] * frac;
}
return out;
}
async function ensurePlaybackContext() {
if (playCtx) return;
playCtx = new (window.AudioContext || window.webkitAudioContext)();
nextPlayTime = playCtx.currentTime + 0.05;
setPlayState({
playSampleRate: playCtx.sampleRate,
nextPlayTime,
playedChunks,
droppedChunks
});
}
function schedulePcmPlayback(pcmInt16_24k) {
if (!playCtx) return;
const f32_24k = pcm16ToFloat32(pcmInt16_24k);
const f32 = resampleLinear(f32_24k, OUTPUT_RATE, playCtx.sampleRate);
const buffer = playCtx.createBuffer(1, f32.length, playCtx.sampleRate);
buffer.copyToChannel(f32, 0);
const src = playCtx.createBufferSource();
src.buffer = buffer;
src.connect(playCtx.destination);
// simple queue: keep playback continuous
const now = playCtx.currentTime;
if (nextPlayTime < now) {
// playback fell behind; jump ahead (dropping the gap)
nextPlayTime = now + 0.01;
droppedChunks++;
}
src.start(nextPlayTime);
nextPlayTime += buffer.duration;
playedChunks++;
setPlayState({
playSampleRate: playCtx.sampleRate,
nextPlayTime,
playedChunks,
droppedChunks
});
}
/** ---------- WS Handlers ---------- */
function setUiConnected(connected) {
btnConnect.disabled = connected;
btnDisconnect.disabled = !connected;
btnStartMic.disabled = !connected;
btnStopMic.disabled = !connected;
btnAudioEnd.disabled = !connected;
btnSendText.disabled = !connected;
}
function connectWs() {
const url = wsUrlInput.value.trim();
ws = new WebSocket(url);
ws.binaryType = "arraybuffer";
ws.onopen = async () => {
logLine(`[ws] open: ${url}`);
setUiConnected(true);
await ensurePlaybackContext();
// 连接成功后,读取两个 prompt 的值并发送到后端
try {
const systemPrompt = systemPromptEl.value?.trim() || "";
const userPrompt = userPromptEl.value?.trim() || "";
const setupMsg = {
type: "setup",
system_prompt: systemPrompt,
user_prompt: userPrompt
};
ws.send(JSON.stringify(setupMsg));
logLine(`[send] setup: ${JSON.stringify({ system_prompt: systemPrompt, user_prompt: userPrompt })}`);
} catch (e) {
logLine("[send] setup error: " + (e?.message || e));
}
};
ws.onclose = (e) => {
logLine(`[ws] close: code=${e.code} reason=${e.reason || ""}`);
setUiConnected(false);
stopMic().catch(() => {});
};
ws.onerror = (e) => {
logLine("[ws] error");
};
ws.onmessage = async (evt) => {
if (typeof evt.data === "string") {
// JSON 文本消息
try {
const obj = JSON.parse(evt.data);
if (obj.type === "transcript_in") logLine(`[in ] ${obj.text || ""}`);
else if (obj.type === "transcript_out") logLine(`[out] ${obj.text || ""}`);
else if (obj.type === "text") logLine(`[txt] ${obj.text || ""}`);
else if (obj.type === "turn_complete") logLine("[turn] complete");
else if (obj.type === "setup_complete") logLine("[setup] complete");
else if (obj.type === "usage") logLine(`[usage] prompt=${obj.promptTokenCount} response=${obj.responseTokenCount} total=${obj.totalTokenCount}`);
else if (obj.type === "go_away") logLine(`[goAway] timeLeft=${obj.timeLeft}`);
else if (obj.type === "error") logLine(`[err] ${obj.where || ""}: ${obj.message || ""}`);
else logLine(`[evt] ${evt.data}`);
} catch {
logLine(`[text] ${evt.data}`);
}
return;
}
// 二进制:24k 16-bit PCM mono
if (evt.data instanceof ArrayBuffer) {
const bytes = new Uint8Array(evt.data);
// Int16 little-endian
const i16 = new Int16Array(bytes.buffer, bytes.byteOffset, Math.floor(bytes.byteLength / 2));
// 浏览器端播放
if (playCtx && playCtx.state === "suspended") await playCtx.resume();
schedulePcmPlayback(i16);
}
};
}
/** ---------- Mic Capture ---------- */
async function startMic() {
if (!ws || ws.readyState !== WebSocket.OPEN) {
logLine("WS 未连接");
return;
}
if (micEnabled) return;
// 需要用户手势触发
await ensurePlaybackContext();
if (playCtx.state === "suspended") await playCtx.resume();
micStream = await navigator.mediaDevices.getUserMedia({
audio: {
channelCount: 1,
echoCancellation: true,
noiseSuppression: true,
autoGainControl: true,
}
});
// 单独的采集 ctx(让 worklet 的 sampleRate 可控,但浏览器未必按指定)
micCtx = new (window.AudioContext || window.webkitAudioContext)();
const source = micCtx.createMediaStreamSource(micStream);
// 优先 AudioWorklet
try {
await micCtx.audioWorklet.addModule("./mic-worklet.js");
micNode = new AudioWorkletNode(micCtx, "mic-processor");
micNode.port.onmessage = (e) => {
const msg = e.data || {};
if (msg.type === "pcm_f32_16k") {
// Float32(16k) -> Int16 PCM -> binary WS
const f32 = new Float32Array(msg.data);
const i16 = float32ToInt16PCM(f32);
if (ws && ws.readyState === WebSocket.OPEN) {
ws.send(i16.buffer);
}
}
};
source.connect(micNode);
// 不需要接到 destination(避免回放啸叫);但有些浏览器要求链路存在
micNode.connect(micCtx.destination);
micNode.port.postMessage({ type: "enable" });
micEnabled = true;
logLine("[mic] started (AudioWorklet)");
} catch (err) {
// fallback:ScriptProcessor(兼容老浏览器)
logLine("[mic] AudioWorklet 不可用,回退到 ScriptProcessor");
const bufferSize = 4096;
const sp = micCtx.createScriptProcessor(bufferSize, 1, 1);
micNode = sp;
sp.onaudioprocess = (e) => {
const input = e.inputBuffer.getChannelData(0);
// 把 micCtx.sampleRate 下采样到 16k
const resampled = resampleLinear(input, micCtx.sampleRate, INPUT_RATE);
const i16 = float32ToInt16PCM(resampled);
if (ws && ws.readyState === WebSocket.OPEN) ws.send(i16.buffer);
};
source.connect(sp);
sp.connect(micCtx.destination);
micEnabled = true;
logLine("[mic] started (ScriptProcessor)");
}
btnStartMic.disabled = true;
btnStopMic.disabled = false;
}
async function stopMic() {
micEnabled = false;
try {
if (micNode && micNode.port) {
micNode.port.postMessage({ type: "disable" });
}
} catch {}
if (micStream) {
micStream.getTracks().forEach(t => t.stop());
micStream = null;
}
if (micCtx) {
try { await micCtx.close(); } catch {}
micCtx = null;
}
micNode = null;
btnStartMic.disabled = !ws || ws.readyState !== WebSocket.OPEN;
btnStopMic.disabled = true;
logLine("[mic] stopped");
}
/** ---------- UI actions ---------- */
btnConnect.onclick = () => {
if (ws && ws.readyState === WebSocket.OPEN) return;
connectWs();
};
btnDisconnect.onclick = () => {
if (ws) ws.close(1000, "client close");
};
btnStartMic.onclick = () => startMic().catch(e => logLine("[mic] start error: " + (e?.message || e)));
btnStopMic.onclick = () => stopMic().catch(() => {});
btnAudioEnd.onclick = () => {
if (!ws || ws.readyState !== WebSocket.OPEN) return;
ws.send(JSON.stringify({ type: "audio_end" }));
logLine("[send] audio_end");
};
btnSendText.onclick = () => {
if (!ws || ws.readyState !== WebSocket.OPEN) return;
const t = textInput.value.trim();
if (!t) return;
ws.send(JSON.stringify({ type: "text", text: t }));
logLine("[send] text: " + t);
textInput.value = "";
};
btnClearLog.onclick = () => {
logEl.textContent = "";
};
setUiConnected(false);
btnStopMic.disabled = true;
5.代码讲解
1 整体架构与数据流
前端浏览器 ↔ WebSocket (tio-boot) ↔ 后端 GeminiLiveBridge ↔ Gemini Live 服务
关键点:
前端通过 WebSocket 发两类消息到后端:
- 控制消息(JSON 文本),例如 `setup`(初始 prompt)、`text`(发送文本)、`audio_end`、`close`;
- 二进制消息:16 kHz、16-bit PCM 的裸音频数据(ArrayBuffer / Int16Array)。
后端 `VoiceSocketHandler` 负责接收前端消息,按连接创建/管理 `GeminiLiveBridge` 实例(每个前端连接一个 bridge)。`GeminiLiveBridge` 使用 SDK 的异步 Realtime API(示例中是 `client.async.live.connect(model, config)`)连接 Gemini Live,会话内:
- 接收模型发回的 `LiveServerMessage`(含转写、合成音频、turnComplete、usage、goAway 等)并转发给前端;
- 接收前端发送的实时音频(`sendPcm16k`)并推到模型;
- 支持向模型发送文本内容(`sendText`)与指示音频流结束(`sendAudioStreamEnd`)。
前端负责:音频采集 + 重采样到 16k、发送到 WebSocket;播放模型返回的 24k PCM 音频(客户端重采样到播放设备采样率)并展示事件/转写日志。
2 后端实现详解(关键类)
下面按功能逐个说明后端关键类与核心逻辑(以给出的实现为蓝本)。
2.1 GeminiLiveBridge(核心桥接类)
职责概述:
- 管理与 Gemini Live 的异步会话(`AsyncSession`);
- 封装 SDK 的连接配置(VAD、RealtimeInput、SpeechConfig、ThinkingConfig、转写配置等);
- 将来自前端的音频/文本发送到模型,并处理模型返回的消息(文本/音频/转写/usage/goAway/turnComplete 等),转发给前端或通过 `RealtimeBridgeCallback` 回调给调用方。
核心实现要点(摘要):
模型常量:示例中为 `private static final String MODEL = "models/gemini-2.5-flash-native-audio-preview-12-2025";`,请根据平台提供的最新模型 ID 替换。
连接代码示例(主要流程)
- `buildLiveConfig()`:构造 `LiveConnectConfig`,其中设置:
  - `realtimeInputConfig`:自动活动检测(VAD)`AutomaticActivityDetection`,含 start/end 敏感度、prefix padding、silenceDuration 等;
  - `speechConfig`:合成语音配置(`PrebuiltVoiceConfig` / 自定义 voice);
  - `responseModalities`:通常设置为 AUDIO(或 AUDIO+TEXT,取决于需求);
  - `inputAudioTranscription` / `outputAudioTranscription`:是否启用转写。
- `client.async.live.connect(MODEL, config)` 返回 `AsyncSession`。拿到 session 后:
  - 保存 `session` 实例;
  - 调用 `session.receive(this::onGeminiMessage)` 注册消息回调;
  - 若提前收到了 prompts(system/user),则调用 `sendPromptsIfAny(sess)`。
- 关闭时调用 `session.close()` 并关闭 SDK client。
发送音频:
- 前端发来 16k PCM 的 Int16 数据;后端将 bytes 封装为 `Blob`(指定 `mimeType: "audio/pcm;rate=16000"`)并调用 `s.sendRealtimeInput(params)`;
- 支持发送 `audioStreamEnd`(模型接收端识别为流结束)。
发送文本:
- 使用 `Content.fromParts(Part.fromText(text))` 并用 `s.sendClientContent(...)`,可选择 `turnComplete` 标志。
onGeminiMessage 处理:
- 关注 `serverContent()` 中的 `inputTranscription`(用户语音转写)、`outputTranscription`(模型输出转写)、`modelTurn()`(包含 parts:文本/inlineData/函数调用等)、`turnComplete()`。
- 对音频 inlineData:如果 `mimeType` 以 `audio/pcm` 开头且 data 存在,则直接通过 `RealtimeBridgeCallback.sendBinary(byte[])` 发送原始音频二进制给前端(前端按 24k PCM 播放);否则可 base64 编码并以文本消息返回。
- 处理 `goAway`(服务端提示即将断开)与 `usageMetadata`(token 使用统计)并转发。
会话内转写/turn 管理:
- 示例中维护了两个 StringBuilder(`turnUserTranscript`、`turnAssistantTranscript`)用于积累一个 turn 中的转写文本,并在 `turnComplete()` 时通过 `callback.turnComplete(role, text)` 一次性上报给上层业务。
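上面的 turn 级聚合逻辑可以用一段独立的 Java 草图示意(与 SDK 无关的假设性示例,`onTranscript` / `onTurnComplete` 为示意方法名,并非真实类):

```java
// Turn 级转写聚合草图:不断追加片段,turnComplete 时一次性上报并清空(假设性示例)
import java.util.ArrayList;
import java.util.List;

public class TurnTranscriptAggregator {
  private final StringBuilder turnUserTranscript = new StringBuilder();
  private final StringBuilder turnAssistantTranscript = new StringBuilder();
  // 每个元素为 [role, text],代表上报给上层业务的一次完整 turn
  public final List<String[]> completedTurns = new ArrayList<>();

  // 模型回调中不断追加转写片段
  public void onTranscript(String role, String fragment) {
    ("user".equals(role) ? turnUserTranscript : turnAssistantTranscript).append(fragment);
  }

  // turnComplete() 时把积累的文本一次性上报并清空,准备下一个 turn
  public void onTurnComplete() {
    if (turnUserTranscript.length() > 0) {
      completedTurns.add(new String[] { "user", turnUserTranscript.toString() });
      turnUserTranscript.setLength(0);
    }
    if (turnAssistantTranscript.length() > 0) {
      completedTurns.add(new String[] { "assistant", turnAssistantTranscript.toString() });
      turnAssistantTranscript.setLength(0);
    }
  }
}
```

这样上层业务只在 turn 边界收到一条完整文本,而不是大量碎片事件。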
2.2 RealtimeBridgeCallback 接口与 WsRealtimeBridgeCallback 实现
`RealtimeBridgeCallback` 定义了后端与外部(如 WebSocket handler)之间的回调方法:
- `sendText(String json)`:发送文本(JSON)事件给前端;
- `sendBinary(byte[] bytes)`:发送二进制音频数据;
- `close(String reason)`:关闭连接;
- `session(String sessionId)`:会话 ID 回调(可选);
- `turnComplete(String role, String text)`:turn 完成回调;
- `start(RealtimeSetup setup)`:收到 setup 后的启动回调。

`WsRealtimeBridgeCallback` 实现把 JSON 文本封装为 `WebSocketResponse.fromText(...)` 并通过 `Tio.send(...)` 发送,二进制使用 `WebSocketResponse.fromBytes(bytes)`。
注意:回调的实现要保证尽可能“轻量”,避免阻塞 SDK 的回调线程。必要时在回调内部做非阻塞队列/异步发送。
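"轻量回调"的常见做法是:回调线程只做入队,真正的发送由独立线程完成。下面是一个与 SDK 无关的假设性草图(`sink` 代表实际的 `Tio.send` 动作;队列满时直接丢弃以保证实时性):

```java
// 非阻塞回调草图:SDK 回调线程仅 offer 入队,发送由 worker 线程完成(假设性示例)
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class AsyncSender implements AutoCloseable {
  private final BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>(1024);
  private final Thread worker;
  private volatile boolean running = true;

  public AsyncSender(Consumer<byte[]> sink) {
    worker = new Thread(() -> {
      try {
        while (running || !queue.isEmpty()) {
          byte[] b = queue.poll(100, TimeUnit.MILLISECONDS);
          if (b != null) sink.accept(b); // 真正的网络发送在 worker 线程发生
        }
      } catch (InterruptedException ignored) {}
    });
    worker.setDaemon(true);
    worker.start();
  }

  // SDK 回调线程调用:offer 永不阻塞,队列满时返回 false(实时音频可接受丢包)
  public boolean sendBinary(byte[] bytes) {
    return queue.offer(bytes);
  }

  @Override
  public void close() throws InterruptedException {
    running = false;
    worker.join(1000);
  }
}
```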
2.3 VoiceSocketHandler(tio 的 WebSocket handler)
核心职责:
管理每个前端 WebSocket 连接对应的 `GeminiLiveBridge` 实例(使用一个 ConcurrentHashMap)。

处理三类消息:
- WebSocket 握手与关闭(`handshake` / `onClose`)——清理 bridge;
- 文本消息(JSON)——解析为 `WsVoiceAgentRequestMessage`,根据 `type` 字段分派(`SETUP` → 创建 bridge 并发送 prompts,`TEXT` → 送到 bridge,`AUDIO_END` → 通知 bridge,`CLOSE` → 关闭);
- 二进制消息(`onBytes`)——直接视为 16k PCM 推送给 `bridge.sendPcm16k(bytes)`。
重要逻辑示例:
- 收到 `setup`:创建 `GeminiLiveBridge`,调用 `bridge.setPrompts(systemPrompt, userPrompt)`,然后 `bridge.connect()`(异步)。
- 若连接尚未建立但收到 setup:把 prompts 保存并在连接建立后发送(桥接类实现了这一逻辑)。
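按 `type` 分派的骨架可以抽象为一个极简示意(假设性草图,省去了真实的 JSON 反序列化与 bridge 管理;返回值仅代表应执行的动作):

```java
// 文本消息按 type 分派的示意(假设性草图;真实实现中 type 来自 JSON 反序列化)
public class MessageDispatcher {
  public String dispatch(String type) {
    switch (type.toUpperCase()) {
      case "SETUP":
        return "create_bridge_and_send_prompts"; // 创建 bridge 并发送 prompts
      case "TEXT":
        return "forward_text_to_bridge";         // 文本送到 bridge
      case "AUDIO_END":
        return "notify_audio_stream_end";        // 通知 bridge 音频流结束
      case "CLOSE":
        return "close_bridge";                   // 关闭并清理
      default:
        return "ignore";                         // 未知类型直接忽略
    }
  }
}
```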
3 前端实现详解(关键文件与要点)
前端采用单页面静态文件(index.html、mic-worklet.js、app.js),示例已包含完整逻辑,核心设计如下。
3.1 音频采集(mic-worklet.js 或 ScriptProcessor 回退)
优先采用 `AudioWorklet`(更低延迟、更稳定),工作原理:
- 在 Worklet 内收集浮点帧,做线性重采样到 16 kHz(示例内实现 `_resampleTo16k`);
- 将 `Float32Array`(16k)通过 `postMessage` 传给主线程(`AudioWorkletNode.port`),主线程把 float 转为 Int16 PCM 并 `WebSocket.send(i16.buffer)`。

若浏览器不支持 `AudioWorklet`,回退到 `ScriptProcessor`,主线程在 `onaudioprocess` 中做重采样与发送。
建议:
- 前端在每次发送二进制音频时,直接发送 `ArrayBuffer`(Int16),后端以此作为 16 kHz PCM 解读;
- 每次发送的包长度建议在 20–200 ms 范围,示例每 ~40 ms 打包一次。
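包长与毫秒的换算关系:字节数 = 采样率 × 声道数 × 每样本字节数 × 毫秒 / 1000。下面的小工具演示这一计算(独立示例):

```java
// PCM 包长换算:bytes = sampleRate * channels * bytesPerSample * ms / 1000
public class PcmChunkMath {
  public static int bytesForMillis(int sampleRate, int channels, int bytesPerSample, int ms) {
    return sampleRate * channels * bytesPerSample * ms / 1000;
  }

  public static void main(String[] args) {
    // 示例中的 ~40 ms @ 16k / mono / 16-bit
    System.out.println(bytesForMillis(16000, 1, 2, 40)); // 1280 字节
  }
}
```

即示例的 40 ms 包约为 1280 字节;控制包长也就控制了发送频率与端到端延迟。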
3.2 音频播放(模型返回的 24 kHz PCM)
- 后端把模型返回的 inline audio(24 kHz PCM,16-bit)以二进制方式直接推给前端;
- 前端接到 ArrayBuffer,将其视作 Int16Array,转换为 Float32(`pcm16ToFloat32`),再把 24k 重采样到 `AudioContext.sampleRate`(通常 48k)并用 BufferSource 播放;
- 为保证平滑播放,示例使用 `nextPlayTime` 队列化播放时间,若播放落后则丢弃间隙以追上实时。
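前端 `resampleLinear` 的线性插值逻辑,如需在服务端复用(例如离线重采样或测试),可以等价地写成 Java(独立示例):

```java
// 线性插值重采样:与前端 resampleLinear 等价的 Java 版本
public class LinearResampler {
  public static float[] resample(float[] input, int inRate, int outRate) {
    if (inRate == outRate) return input;
    double ratio = (double) inRate / outRate;
    int outLen = (int) Math.floor(input.length / ratio);
    float[] out = new float[outLen];
    for (int i = 0; i < outLen; i++) {
      double t = i * ratio;              // 输出样本 i 对应的输入位置
      int i0 = (int) Math.floor(t);
      int i1 = Math.min(i0 + 1, input.length - 1);
      double frac = t - i0;
      out[i] = (float) (input[i0] * (1 - frac) + input[i1] * frac);
    }
    return out;
  }
}
```

线性插值实现简单、延迟低,但高频会有混叠;对语音场景通常够用,如需更高保真可换用多相滤波等重采样算法。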
3.3 WebSocket 消息协议(前端 → 后端)
示例 JSON 消息结构(在 app.js 与后端 DTO 对应):
- `setup`:发送初始 prompts(在连接成功后发送)

```json
{
  "type": "setup",
  "system_prompt": "...",
  "user_prompt": "..."
}
```

- `text`:发送文本到模型

```json
{ "type": "text", "text": "好" }
```

- `audio_end`:提示音频流结束

```json
{ "type": "audio_end" }
```
- 二进制:直接发送 Int16Array.buffer(16k PCM)
服务器 → 客户端 常见类型(示例):
- `transcript_in` / `transcript_out`:转写事件;
- `text`:模型输出文本片段;
- `inline_data`:非标准 inline 数据(以 Base64 返回);
- `turn_complete`、`setup_complete`、`usage`、`go_away`、`error` 等。
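服务端事件都是形如 `{"type": ..., ...}` 的扁平 JSON。下面用一个最小的构造函数示意其线上格式(假设性草图,生产环境请使用 JSON 库处理完整转义):

```java
// 服务端事件 JSON 的最小构造示例(仅演示字段结构;此处只转义了双引号)
public class EventJson {
  public static String textEvent(String type, String text) {
    return "{\"type\":\"" + type + "\",\"text\":\"" + text.replace("\"", "\\\"") + "\"}";
  }
}
```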
4 音频格式与编码注意事项
- 前端采集并发送:16 kHz、mono、16-bit PCM、Little-endian(示例 `INPUT_MIME = "audio/pcm;rate=16000"`)。
- 模型合成输出:示例中使用 24 kHz PCM(`OUTPUT_MIME_PREFIX = "audio/pcm"`,并约定后端发送的为 24k PCM);请以实际模型返回的 mime 为准(可能是 24000、22050、48000 等)。
- 转写:如果启用了 input/output transcription,模型会返回文本片段,请在前端显示。
- 播放:浏览器 `AudioContext` 采样率通常为 48 kHz 或 44.1 kHz,需要对模型返回的 PCM 做重采样(示例用线性插值简易实现)。
- 延迟管理:保持每次音频包大小适中(30–80 ms),VAD 出现断句时可减少发送量(示例通过 SDK 的 `AutomaticActivityDetection` 实现更高效的实时转写/分段)。
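字节序是这里最容易出错的点:线上传输的都是 little-endian 的 16-bit 样本,与前端 `Int16Array` 的内存布局一致。下面的独立示例演示样本与字节之间的往返转换:

```java
// PCM16 little-endian 编解码:short[] <-> byte[](与前端 Int16Array 的内存布局一致)
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Pcm16Codec {
  public static byte[] toBytes(short[] samples) {
    ByteBuffer buf = ByteBuffer.allocate(samples.length * 2).order(ByteOrder.LITTLE_ENDIAN);
    for (short s : samples) buf.putShort(s);
    return buf.array();
  }

  public static short[] toSamples(byte[] bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
    short[] out = new short[bytes.length / 2];
    for (int i = 0; i < out.length; i++) out[i] = buf.getShort();
    return out;
  }
}
```

若忘记设置 `ByteOrder.LITTLE_ENDIAN`(Java 的 ByteBuffer 默认为 big-endian),播放出来的将是刺耳的噪声。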
6. Live模型参数设置
模型名称:`gemini-2.5-flash-native-audio-preview`(示例代码中使用的是带日期后缀的 `models/gemini-2.5-flash-native-audio-preview-12-2025`,请以平台最新模型 ID 为准)。
一、连接级参数(LiveConnectConfig)
建立实时音频会话时一次性配置
1)输出模态
控制模型输出什么类型内容
| 参数 | 类型 | 作用 |
|---|---|---|
| `responseModalities` | `List<Modality>` | 模型输出形式:AUDIO / TEXT |
| `inputAudioTranscription` | `AudioTranscriptionConfig` | 用户语音转写 |
| `outputAudioTranscription` | `AudioTranscriptionConfig` | 模型语音转写 |
可选值:
- `Modality.Known.AUDIO`
- `Modality.Known.TEXT`
典型组合:
| 场景 | 配置 |
|---|---|
| 纯语音助手 | AUDIO |
| 语音 + 字幕 | AUDIO + transcription |
| ASR only | TEXT |
2)语音合成参数(SpeechConfig / VoiceConfig)
控制 TTS 声音
| 参数 | 类型 | 作用 |
|---|---|---|
| `voiceConfig` | `VoiceConfig` | 声音配置 |
| `prebuiltVoiceConfig.voiceName` | `String` | 选择内置音色 |
当前 preview 支持示例 voice:
- Puck
- Charon
- Kore
- Fenrir
- Aoede
- Leda
- Orpheus
3)思考预算(ThinkingConfig)
控制 latency 与成本(native audio 极关键)
| 参数 | 类型 | 含义 |
|---|---|---|
| `thinkingBudget` | int | 推理深度预算(0 = 极低延迟) |
效果:
| 值 | 行为 |
|---|---|
| 0 | 类似流式反应(对话最快) |
| 1~8 | 平衡 |
| 16+ | 更智能但明显变慢 |
语音助手场景通常应设为 0 或 1。
4)实时语音检测(RealtimeInputConfig)
决定什么时候算“用户说话结束”
A. 自动语音活动检测(VAD)
AutomaticActivityDetection
| 参数 | 含义 |
|---|---|
| `disabled` | 是否关闭 VAD |
| `startOfSpeechSensitivity` | 起始检测灵敏度 |
| `endOfSpeechSensitivity` | 结束检测灵敏度 |
| `prefixPaddingMs` | 保留前导音频时长(ms) |
| `silenceDurationMs` | 静音多久算一句话结束(ms) |
敏感度枚举:
- `START_SENSITIVITY_LOW` / `START_SENSITIVITY_MEDIUM` / `START_SENSITIVITY_HIGH`
- `END_SENSITIVITY_LOW` / `END_SENSITIVITY_MEDIUM` / `END_SENSITIVITY_HIGH`
B. 会话打断策略
activityHandling
| 值 | 行为 |
|---|---|
| `NO_INTERRUPTION` | 用户说话不打断模型 |
| `START_OF_ACTIVITY_INTERRUPTS` | 用户一开口立刻打断模型(语音助手必备) |
C. Turn 覆盖范围
turnCoverage
| 值 | 行为 |
|---|---|
| `TURN_INCLUDES_ONLY_ACTIVITY` | 仅说话部分算 turn |
| `TURN_INCLUDES_ALL_AUDIO` | 包含静音 |
二、实时输入参数(sendRealtimeInput)
发送 PCM 音频帧
| 参数 | 类型 | 说明 |
|---|---|---|
| `audio` | `Blob` | PCM chunk |
| `audioStreamEnd` | boolean | 告诉模型音频流结束 |
音频要求(native-audio preview):
| 项 | 要求 |
|---|---|
| 编码 | PCM16 LE |
| 采样率 | 16000 Hz |
| 单声道 | 是 |
| MIME | audio/pcm;rate=16000 |
三、文本输入参数(sendClientContent)
| 参数 | 含义 |
|---|---|
| `turns` | 文本消息 |
| `turnComplete` | 是否立即触发模型回答 |
重要:语音助手场景中,`turnComplete=false` 表示仅把文本作为上下文追加(不触发回答);`turnComplete=true` 表示用户已说完,立即触发模型回答。
四、模型输出(LiveServerMessage)
模型回传结构(可处理的字段)
| 字段 | 含义 |
|---|---|
| `inputTranscription` | 用户识别文本 |
| `outputTranscription` | 模型语音字幕 |
| `modelTurn.parts.text` | 模型文本 |
| `modelTurn.parts.inlineData` | 音频 PCM |
| `turnComplete` | 回答结束 |
| `goAway` | 服务端将断开 |
| `usageMetadata` | token 统计 |
五、Native-Audio 特有行为参数(重要)
这些不是字段,而是由上面组合形成的能力
| 能力 | 依赖配置 |
|---|---|
| barge-in 打断 | START_OF_ACTIVITY_INTERRUPTS |
| 真正半双工 | VAD + stream audio |
| 无延迟响应 | thinkingBudget=0 |
| 一边说一边播 | responseModalities=AUDIO |
| 实时字幕 | transcription 开启 |
| 语音对话连续性 | turnCoverage=ONLY_ACTIVITY |
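把上述能力落到连接配置上,大致是下面这样的组合(根据本文前面提到的字段拼出的示意草图;builder 方法名与嵌套结构以所用 google-genai SDK 版本的实际 API 为准,并非可直接编译的权威实现):

```java
// 连接级配置示意:组合模态、音色、思考预算与 VAD(假设性草图,方法名以实际 SDK 为准)
LiveConnectConfig config = LiveConnectConfig.builder()
    .responseModalities(Modality.Known.AUDIO)                             // 纯语音输出
    .inputAudioTranscription(AudioTranscriptionConfig.builder().build())  // 用户语音字幕
    .outputAudioTranscription(AudioTranscriptionConfig.builder().build()) // 模型语音字幕
    .speechConfig(SpeechConfig.builder()
        .voiceConfig(VoiceConfig.builder()
            .prebuiltVoiceConfig(PrebuiltVoiceConfig.builder().voiceName("Puck").build())
            .build())
        .build())
    .thinkingConfig(ThinkingConfig.builder().thinkingBudget(0).build())   // 0 = 极低延迟
    .realtimeInputConfig(RealtimeInputConfig.builder()
        .automaticActivityDetection(AutomaticActivityDetection.builder()
            .prefixPaddingMs(300)      // 保留 300ms 前导音频
            .silenceDurationMs(800)    // 静音 800ms 视为一句话结束
            .build())
        .build())
    .build();
```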
六、总结(最核心控制面)
Native-audio 实际上只由 6 个维度决定行为:
1. 思考深度 → latency / intelligence
2. VAD → 什么时候回答
3. interrupt → 是否能打断
4. voice → 声音风格
5. modality → 输出类型
6. turnComplete → 对话节奏
