# ASR Service
Speech-to-text service powered by [faster-whisper](https://github.com/SYSTRAN/faster-whisper) (CTranslate2 backend). Uses the `large-v3-turbo` model for fast, high-quality transcription with word-level timestamps.
## Architecture
```
Client --> Redis Queue ("asr") --> ASRTasks (LongTasks worker)
                                                |
                                                v
                                      faster-whisper (GPU)
                                                |
                                                v
                                          Result (JSON)
```
- **ahserver**: Web framework serving HTTP on port 9925
- **longtasks**: Redis-backed async task queue with worker management
- **Redis**: Task queue broker (queue name: `asr`)
- **faster-whisper**: ASR engine running on GPU (CUDA, float16)

The service follows the same ahserver+longtasks pattern as wan22-service and realesrgan-service.
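
The worker step in the diagram (dequeue a task, run faster-whisper on the GPU, return JSON) can be sketched roughly as follows. This is not the code in `workers/transcribe.py`: `handle_task` is a hypothetical name, and the field mapping is inferred from the Task Payload Format and Output Format sections. It relies on faster-whisper's `WhisperModel.transcribe()`, which returns a lazy segment generator plus an info object; the model is passed in so the sketch can be exercised without a GPU.

```python
import json
import time
from pathlib import Path
from typing import Any, Dict


def handle_task(payload: Dict[str, Any], model: Any) -> Dict[str, Any]:
    """Hypothetical handler: run one "transcribe" task and build an Output Format-style dict.

    `model` is expected to behave like faster_whisper.WhisperModel.
    """
    if payload.get("task_type") != "transcribe":
        raise ValueError(f"unsupported task_type: {payload.get('task_type')!r}")

    started = time.monotonic()
    segments, info = model.transcribe(
        payload["audio_path"],
        language=payload.get("language", "zh"),
        word_timestamps=payload.get("word_timestamps", True),
        vad_filter=payload.get("vad_filter", True),
    )
    # `segments` is a lazy generator: decoding actually happens while it is consumed here.
    collected = list(segments)

    out_segments = [
        {
            "text": seg.text.strip(),
            "start": round(seg.start, 3),
            "end": round(seg.end, 3),
            "words": [
                {
                    "word": w.word,
                    "start": round(w.start, 3),
                    "end": round(w.end, 3),
                    "probability": round(w.probability, 4),
                }
                for w in (seg.words or [])  # seg.words is None when word_timestamps=False
            ],
        }
        for seg in collected
    ]

    result = {
        "status": "ok",
        # Raw segment texts already carry leading spaces for languages that use them.
        "text": "".join(seg.text for seg in collected).strip(),
        "language": info.language,
        "language_probability": round(info.language_probability, 4),
        "duration": round(info.duration, 3),
        "segments": out_segments,
        "processing_time": round(time.monotonic() - started, 2),
        "audio_path": payload["audio_path"],
    }

    # Optionally persist the result, creating the output directory if needed.
    output_path = payload.get("output_path")
    if output_path:
        Path(output_path).parent.mkdir(parents=True, exist_ok=True)
        Path(output_path).write_text(json.dumps(result, ensure_ascii=False, indent=2), encoding="utf-8")
    return result
```

Collecting the generator into a list before building the result keeps the full-text join and the per-segment output consistent; for very long audio a streaming variant would avoid holding every segment in memory.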
## Model
- **Model**: faster-whisper-large-v3-turbo-ct2
- **Path**: `/data/ymq/models/deepdml/faster-whisper-large-v3-turbo-ct2`
- **Device**: CUDA (float16)
- **GPU**: Isolated via `CUDA_VISIBLE_DEVICES` (default GPU 5)

The model is lazy-loaded on the first transcription request and stays in GPU memory for subsequent requests.
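
Lazy loading like this is commonly implemented as a process-wide singleton guarded by a lock, so two concurrent first requests do not load the model twice. A minimal sketch, assuming one shared model per worker process; `get_model` and `_load_default_model` are hypothetical names, and the constructor arguments simply mirror the path, device, and precision listed above.

```python
import threading
from typing import Any, Callable, Optional

MODEL_PATH = "/data/ymq/models/deepdml/faster-whisper-large-v3-turbo-ct2"

_model: Optional[Any] = None
_model_lock = threading.Lock()


def _load_default_model() -> Any:
    # Imported here so importing this module does not require faster-whisper or CUDA.
    from faster_whisper import WhisperModel

    return WhisperModel(MODEL_PATH, device="cuda", compute_type="float16")


def get_model(loader: Callable[[], Any] = _load_default_model) -> Any:
    """Return the shared model, loading it on the first call only (hypothetical helper)."""
    global _model
    if _model is None:
        with _model_lock:
            # Re-check inside the lock: another thread may have finished loading meanwhile.
            if _model is None:
                _model = loader()
    return _model
```

The `loader` parameter exists only to make the sketch testable; a real worker would call `get_model()` with no arguments.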
## Model Download (Offline Deployment)
faster-whisper-large-v3-turbo-ct2 is a HuggingFace model and must be downloaded before the service is deployed.
### Option 1: huggingface-cli (recommended)
```bash
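# Install the huggingface_hub package, which provides huggingface-cli
pip install huggingface_hub
# Download the model into the target directory
# (recent huggingface_hub releases deprecate --local-dir-use-symlinks and ignore it)
huggingface-cli download deepdml/faster-whisper-large-v3-turbo-ct2 \
  --local-dir /data/ymq/models/deepdml/faster-whisper-large-v3-turbo-ct2 \
  --local-dir-use-symlinks False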
```
- **Download size**: ~1.6 GB
- **Download time**: roughly 3-10 minutes, depending on network speed
### Option 2: git-lfs
```bash
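# Enable Git LFS for the current user (the git-lfs package itself must already be installed)
git lfs install
# Clone the model repository
mkdir -p /data/ymq/models/deepdml
cd /data/ymq/models/deepdml
git clone https://huggingface.co/deepdml/faster-whisper-large-v3-turbo-ct2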
```
### Option 3: wget/curl (individual files)
If you only need the core files, you can download them directly:
```bash
mkdir -p /data/ymq/models/deepdml/faster-whisper-large-v3-turbo-ct2
cd /data/ymq/models/deepdml/faster-whisper-large-v3-turbo-ct2
# Download the model files
wget https://huggingface.co/deepdml/faster-whisper-large-v3-turbo-ct2/resolve/main/model.bin
wget https://huggingface.co/deepdml/faster-whisper-large-v3-turbo-ct2/resolve/main/tokenizer.json
wget https://huggingface.co/deepdml/faster-whisper-large-v3-turbo-ct2/resolve/main/vocabulary.json
wget https://huggingface.co/deepdml/faster-whisper-large-v3-turbo-ct2/resolve/main/config.json
wget https://huggingface.co/deepdml/faster-whisper-large-v3-turbo-ct2/resolve/main/preprocessor_config.json
```
### Verify the Download
```bash
ls -lh /data/ymq/models/deepdml/faster-whisper-large-v3-turbo-ct2/
# Expect model.bin (~1.6 GB) plus tokenizer.json, vocabulary.json, and config.json
```
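
The same check can be scripted, which also catches a common git-lfs failure: if LFS was not enabled before cloning, `model.bin` is a tiny text pointer instead of the real weights. A minimal sketch; the required-file list comes from the comment above, `check_model_dir` is a hypothetical helper, and the 1 GB threshold is an arbitrary heuristic rather than a documented value.

```python
from pathlib import Path
from typing import List

MODEL_DIR = Path("/data/ymq/models/deepdml/faster-whisper-large-v3-turbo-ct2")
REQUIRED_FILES = ("model.bin", "config.json", "tokenizer.json", "vocabulary.json")
# Heuristic: the real model.bin is ~1.6 GB, while an un-fetched git-lfs pointer is a few hundred bytes.
MIN_MODEL_BYTES = 1_000_000_000


def check_model_dir(model_dir: Path = MODEL_DIR, min_model_bytes: int = MIN_MODEL_BYTES) -> List[str]:
    """Return a list of problems; an empty list means the directory looks complete."""
    problems = [f"missing: {name}" for name in REQUIRED_FILES if not (model_dir / name).is_file()]
    model_bin = model_dir / "model.bin"
    if model_bin.is_file() and model_bin.stat().st_size < min_model_bytes:
        problems.append(f"model.bin is only {model_bin.stat().st_size} bytes (git-lfs pointer?)")
    return problems
```

For example, `print("\n".join(check_model_dir()) or "OK")` prints either the problems found or `OK`.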
### Model Source
- **HuggingFace**: https://huggingface.co/deepdml/faster-whisper-large-v3-turbo-ct2
- **Base Model**: openai/whisper-large-v3-turbo (converted to CTranslate2 format)
- **License**: MIT
- **Optimization**: CTranslate2 format; up to 4x faster than the original Whisper implementation, with lower memory usage
## Deployment
### Prerequisites
- Python venv with faster-whisper 1.2.1: `/data/ymq/demucs_venv`
- Redis server running on 127.0.0.1:6379
- CUDA-capable GPU
### Start
```bash
cd /data/ymq/asr-service
bash start.sh
```
### Stop
```bash
cd /data/ymq/asr-service
bash stop.sh
```
### Health Check
```bash
curl http://localhost:9925/health
```
Returns:
```json
{
  "status": "ok",
  "service": "asr-service",
  "model": "faster-whisper-large-v3-turbo-ct2"
}
```
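
For scripted checks (for example, in a deploy script or monitor), the same request can be made from Python using only the standard library. A minimal sketch; `check_health` and `is_healthy` are hypothetical helpers, and treating `"status": "ok"` as the only healthy value is an assumption based on the response shown above.

```python
import json
import urllib.request

HEALTH_URL = "http://localhost:9925/health"


def is_healthy(body: bytes) -> bool:
    """True if the body is a JSON object whose "status" field is "ok"."""
    try:
        data = json.loads(body)
    except ValueError:  # covers json.JSONDecodeError and invalid UTF-8
        return False
    return isinstance(data, dict) and data.get("status") == "ok"


def check_health(url: str = HEALTH_URL, timeout: float = 5.0) -> bool:
    """Fetch the health endpoint; connection errors, HTTP errors and timeouts count as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200 and is_healthy(resp.read())
    except OSError:  # URLError, HTTPError and socket timeouts are all OSError subclasses
        return False
```

A shell-friendly exit code can then be produced with `sys.exit(0 if check_health() else 1)`.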
## API Usage
Tasks are submitted through Redis, following the same pattern as wan22-service.
### Submit a Transcription Task
```python
import json
import uuid

import redis

r = redis.Redis(host='127.0.0.1', port=6379)

task_id = str(uuid.uuid4())
payload = {
    "task_id": task_id,
    "task_type": "transcribe",
    "audio_path": "/path/to/audio.wav",
    "language": "zh",
    "word_timestamps": True,
    "vad_filter": True,
    "output_path": "/tmp/asr-outputs/result.json",
}

# Push the task onto the Redis queue
r.lpush('asr:queue', json.dumps(payload))
print(f"Task submitted: {task_id}")
```
### Check Task Status
```python
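# longtasks stores task state in Redis; redis-py returns bytes, or None until the key is set
status = r.get(f'asr:status:{task_id}')
result = r.get(f'asr:result:{task_id}')
if result is not None:
    result = json.loads(result)  # assumes the result is stored as JSON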
```
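
Transcription of long audio can take a while, so clients typically poll until the result key appears. A minimal polling sketch; `wait_for_result` is a hypothetical helper, and it assumes the value under `asr:result:{task_id}` is the JSON document described in Output Format. Passing `r.get` in as a function keeps the sketch testable without a Redis server.

```python
import json
import time
from typing import Any, Callable, Optional


def wait_for_result(
    get: Callable[[str], Optional[bytes]],
    task_id: str,
    timeout: float = 600.0,
    interval: float = 1.0,
) -> Any:
    """Poll `get` (e.g. redis.Redis.get) until the result key exists, then decode it as JSON."""
    key = f"asr:result:{task_id}"
    deadline = time.monotonic() + timeout
    while True:
        raw = get(key)
        if raw is not None:
            return json.loads(raw)
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no result for task {task_id} after {timeout} s")
        time.sleep(interval)
```

Usage: `result = wait_for_result(r.get, task_id, timeout=300)`.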
## Task Payload Format
| Field | Type | Required | Default | Description |
|------------------|--------|----------|---------|--------------------------------------|
| task_type | string | Yes | - | Must be `"transcribe"` |
| audio_path | string | Yes | - | Path to input audio file |
| language | string | No | `"zh"` | Language code (zh, en, ja, etc.) |
| word_timestamps | bool | No | `True` | Enable word-level timestamps |
| vad_filter | bool | No | `True` | Enable voice activity detection |
| output_path | string | No | - | If set, save result JSON to this path|
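
Building payloads through a small helper keeps clients consistent with the table above and catches misspelled field names before they reach the queue. A minimal client-side sketch; `build_payload` is a hypothetical helper, and whether the worker itself rejects unknown fields is not documented here.

```python
import uuid
from typing import Any, Dict, Optional

# Defaults taken from the Task Payload Format table.
DEFAULTS: Dict[str, Any] = {"language": "zh", "word_timestamps": True, "vad_filter": True}


def build_payload(
    audio_path: str,
    output_path: Optional[str] = None,
    task_id: Optional[str] = None,
    **options: Any,
) -> Dict[str, Any]:
    """Assemble a "transcribe" payload, filling documented defaults and rejecting unknown fields."""
    if not audio_path:
        raise ValueError("audio_path is required")
    unknown = set(options) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")

    payload: Dict[str, Any] = {
        "task_id": task_id or str(uuid.uuid4()),
        "task_type": "transcribe",
        "audio_path": audio_path,
        **DEFAULTS,
        **options,
    }
    for flag in ("word_timestamps", "vad_filter"):
        if not isinstance(payload[flag], bool):
            raise TypeError(f"{flag} must be a bool")
    if output_path is not None:
        payload["output_path"] = output_path
    return payload
```

Usage: `r.lpush('asr:queue', json.dumps(build_payload("/path/to/audio.wav", language="en")))`.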
## Output Format
```json
{
  "status": "ok",
  "text": "Full transcription text...",
  "language": "zh",
  "language_probability": 0.9876,
  "duration": 125.340,
  "segments": [
    {
      "text": "Segment text",
      "start": 0.000,
      "end": 5.120,
      "words": [
        {
          "word": "你好",
          "start": 0.000,
          "end": 0.800,
          "probability": 0.9523
        }
      ]
    }
  ],
  "processing_time": 3.45,
  "audio_path": "/path/to/audio.wav"
}
```
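
Because each segment carries `start`/`end` times in seconds, the result converts directly into subtitle formats. A minimal sketch that renders SRT from the result dictionary; `result_to_srt` and `srt_timestamp` are hypothetical helpers, not part of the service.

```python
from typing import Any, Dict


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def result_to_srt(result: Dict[str, Any]) -> str:
    """Turn each segment of an ASR result into one numbered SRT cue."""
    cues = []
    for index, seg in enumerate(result.get("segments", []), start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # SRT separates cues with a blank line.
    return "\n".join(cues)
```

With `word_timestamps` enabled, the same approach could split long segments into shorter cues using the per-word `start`/`end` values.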
## Configuration
Config file: `conf/config.json`
| Setting | Value | Description |
|-----------------------|------------------------------|--------------------------------|
| website.port | 9925 | HTTP listen port |
| website.host | 0.0.0.0 | Bind address |
| session_redis | 127.0.0.1:6379 db=1 | Session storage |
| password_key | ASRService2026Key | Auth key |
| filesroot | /tmp/asr-outputs | Output files directory |
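
The dotted setting names above (for example `website.port`) read like paths into nested JSON objects. Assuming that layout, which this README does not show explicitly, a helper script could read settings as follows; `load_config` and `get_setting` are hypothetical names.

```python
import json
from typing import Any, Dict


def load_config(path: str = "conf/config.json") -> Dict[str, Any]:
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def get_setting(config: Dict[str, Any], dotted_key: str, default: Any = None) -> Any:
    """Walk nested dicts along a dotted key such as "website.port"; return `default` if any step is missing."""
    node: Any = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node
```

Usage: `port = get_setting(load_config(), "website.port", 9925)`.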
### Environment Variables
| Variable | Default | Description |
|----------------------|---------|---------------------------------------|
| ASR_GPU_ID | 5 | GPU device ID (for logging) |
| CUDA_VISIBLE_DEVICES | 5 | CUDA device isolation |
| PYTHONPATH | . | Python module search path |
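
One consequence of `CUDA_VISIBLE_DEVICES` isolation is worth spelling out: CUDA renumbers the visible devices, so with `CUDA_VISIBLE_DEVICES=5` the process sees a single GPU at index 0, and the model should be created with `device_index=0` (faster-whisper's default). The variable must also be set before CUDA is initialized, i.e. before the model is loaded. A small sketch that applies the documented defaults and flags a mismatch between the logging ID and the isolated device; `resolve_gpu` is a hypothetical helper.

```python
import os
from typing import Any, Dict, Mapping, Optional

DEFAULT_GPU = "5"  # default from the Environment Variables table


def resolve_gpu(environ: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Summarize GPU isolation settings, applying the documented defaults."""
    env = os.environ if environ is None else environ
    visible = [d.strip() for d in env.get("CUDA_VISIBLE_DEVICES", DEFAULT_GPU).split(",") if d.strip()]
    gpu_id = env.get("ASR_GPU_ID", DEFAULT_GPU)
    return {
        "asr_gpu_id": gpu_id,  # informational only (used for logging)
        "visible_devices": visible,  # physical GPU IDs exposed to this process
        "device_index": 0 if visible else None,  # first visible GPU as renumbered by CUDA; None if no GPU is visible
        "consistent": gpu_id in visible,  # False means the logged ID is not the isolated GPU
    }
```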
## File Structure
```
asr-service/
├── ah.py # Main entry point
├── start.sh # Start script
├── stop.sh # Stop script
├── conf/
│ └── config.json # Service configuration
├── app/
│ └── health.dspy # Health check endpoint
├── workers/
│ ├── __init__.py
│ └── transcribe.py # Transcription worker
└── README.md
```