Kokoro Text-to-Speech (TTS) MCP server that converts input text to MP3, optionally uploads the files to S3, and exposes the service over MCP.
https://github.com/mberg/kokoro-tts-mcp
Stop paying per-character for TTS API calls. This MCP server gives you high-quality text-to-speech generation that stays local, works offline, and integrates seamlessly into your existing MCP workflow.
No API rate limits or usage caps - Generate as much audio as you need without worrying about quotas or throttling. Perfect for batch processing documentation, creating training materials, or building voice-heavy applications.
Offline-first architecture - Synthesis runs locally, so your text never leaves your machine. Once the model files are downloaded, generation needs no internet connection; only the optional S3 upload sends audio off the machine.
Built-in storage management - Handles MP3 generation, optional S3 uploads, and age-based cleanup. Set a retention period once instead of pruning files by hand.
Documentation Audio Generation
# Convert your entire README to audio in seconds
python mcp_client.py --file README.md --voice af_heart --speed 1.2
Podcast Production Pipeline
Upload scripts, generate audio with consistent voice settings, and automatically sync to S3 for your editing workflow. The retention policies keep your local storage clean while maintaining backups.
Accessibility Enhancement
Add audio versions of web content, emails, or reports. The multiple voice options let you match tone to content type - use different voices for different document sections or speakers.
Training Material Creation
Batch convert training documents to audio. The S3 integration means your generated audio files are immediately available across your infrastructure.
Add this to your MCP config and you're generating audio through any MCP-compatible tool:
"kokoro-tts-mcp": {
  "command": "uv",
  "args": ["--directory", "/path/to/kokoro-tts-mcp", "run", "mcp-tts.py"],
  "env": {
    "TTS_VOICE": "af_heart",
    "TTS_SPEED": "1.0",
    "S3_ENABLED": "true",
    "MP3_RETENTION_DAYS": "30"
  }
}
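To make the role of the `env` block concrete, here is a minimal sketch of how a server process might read those variables. Only the variable names come from the config above; the `TtsSettings` class, the `load_settings` function, and the default values are hypothetical and are not taken from the project's code.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class TtsSettings:
    """Hypothetical container for the values in the "env" block."""
    voice: str
    speed: float
    s3_enabled: bool
    retention_days: int


def load_settings(env=None) -> TtsSettings:
    """Read settings from a mapping (os.environ by default).

    The fallback values are assumptions for illustration only.
    """
    env = os.environ if env is None else env
    return TtsSettings(
        voice=env.get("TTS_VOICE", "af_heart"),
        speed=float(env.get("TTS_SPEED", "1.0")),
        # Treat only the literal string "true" (any case) as enabled.
        s3_enabled=env.get("S3_ENABLED", "false").strip().lower() == "true",
        retention_days=int(env.get("MP3_RETENTION_DAYS", "30")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the parsing easy to check without touching the real environment.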
The server handles the complexity - model loading, audio processing, file management, and S3 uploads happen automatically. You just send text and get back audio.
Automatic cleanup prevents storage bloat with configurable retention policies. Set MP3_RETENTION_DAYS=30 and files older than 30 days get automatically removed.
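A retention policy of this kind can be sketched in a few lines. The version below is an illustration, not the server's actual cleanup routine: it assumes file age is judged by modification time and that only `*.mp3` files in a single directory are considered. The function name `remove_expired_mp3s` is hypothetical.

```python
import time
from pathlib import Path


def remove_expired_mp3s(directory, retention_days, now=None):
    """Delete *.mp3 files last modified more than retention_days ago.

    Returns the list of paths that were removed.
    """
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400  # 86400 seconds per day
    removed = []
    for path in sorted(Path(directory).glob("*.mp3")):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

The optional `now` argument exists only so the cutoff can be pinned in tests.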
Flexible storage options - Keep files local, sync to S3, or delete locally after S3 upload. The DELETE_LOCAL_AFTER_S3_UPLOAD option is perfect for high-volume generation where you only need cloud storage.
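The three storage modes can be read as a small decision flow. The sketch below is an assumption about how such logic could look, not the project's implementation: `store_audio` is a hypothetical name, the flags are assumed to be the strings "true"/"false", and `upload` is a caller-supplied callable that is expected to raise on failure.

```python
import os
from pathlib import Path


def store_audio(mp3_path, upload, env=None):
    """Apply the storage flags to one generated file.

    Returns "local", "local+s3", or "s3" to say where the file now lives.
    """
    env = os.environ if env is None else env

    def flag(name):
        return env.get(name, "false").strip().lower() == "true"

    path = Path(mp3_path)
    if not flag("S3_ENABLED"):
        return "local"
    # upload() must raise on failure so an un-uploaded file is never deleted.
    upload(path)
    if flag("DELETE_LOCAL_AFTER_S3_UPLOAD"):
        path.unlink()
        return "s3"
    return "local+s3"
```

In a real deployment `upload` might wrap an S3 client call such as boto3's `upload_file`; it is left as a parameter here so the flow can be exercised without cloud credentials.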
Batch processing ready - Process multiple files or large documents without manual intervention. The client can read from files directly and handle the entire pipeline.
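One way to batch a folder of documents is to call the client once per file. The sketch below reuses the `--file`, `--voice`, and `--speed` flags from the example command earlier; it assumes `mcp_client.py` sits in the current working directory, and the helper names `build_commands` and `run_batch` are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(folder, voice="af_heart", speed=1.0, pattern="*.md"):
    """Return one client command per matching file, in sorted order."""
    return [
        [sys.executable, "mcp_client.py",
         "--file", str(path), "--voice", voice, "--speed", str(speed)]
        for path in sorted(Path(folder).glob(pattern))
    ]


def run_batch(folder, **options):
    """Run the client for each file, stopping at the first failure."""
    for command in build_commands(folder, **options):
        subprocess.run(command, check=True)
```

Separating command construction from execution lets you print and inspect the commands before running a long batch.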
The initial setup requires downloading the Kokoro ONNX model files (about 1.5GB total), but this one-time cost eliminates ongoing API fees and network round-trips to a remote service. The models run on CPU or GPU, depending on your hardware.
Most cloud TTS services charge $4-15 per million characters. If you're generating more than occasional audio, the model download pays for itself quickly while giving you better control over quality and timing.
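The arithmetic behind that comparison is simple to check. The snippet below applies the quoted $4-15 price range to an assumed workload of 5 million characters per month; the volume is an illustrative figure, not a number from the project.

```python
def cloud_tts_cost(characters, price_per_million):
    """Cost in dollars for a given character count at a per-million price."""
    return characters / 1_000_000 * price_per_million


monthly_characters = 5_000_000  # assumed workload, for illustration only
low = cloud_tts_cost(monthly_characters, 4.0)
high = cloud_tts_cost(monthly_characters, 15.0)
print(f"${low:.2f} to ${high:.2f} per month")  # prints "$20.00 to $75.00 per month"
```

Local generation trades that recurring charge for the one-time model download plus your own compute.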
The Kokoro models deliver natural-sounding speech that's noticeably better than many cloud alternatives. Multiple voice options let you match tone to content, and speed control (0.5x to 2.0x) handles different use cases from careful documentation reading to quick content review.
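If you set the speed from user input, it can help to keep it inside the stated 0.5x to 2.0x range before sending a request. The helper below is a hypothetical client-side guard; whether the server itself clamps or rejects out-of-range values is not stated here.

```python
def clamp_speed(value, low=0.5, high=2.0):
    """Coerce a speed setting to float and keep it within [low, high]."""
    return max(low, min(high, float(value)))


print(clamp_speed("1.2"))  # prints 1.2
print(clamp_speed(3))      # prints 2.0
```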
This isn't a toy project - it's production-ready TTS that scales with your needs and integrates cleanly into existing MCP workflows. If you're currently paying for TTS APIs or dealing with the complexity of cloud service integration, this server belongs in your toolkit.