Exclusive: Lemony says its dynamic prompt routing tool cuts AI costs by up to 85%

Lemony.ai, the operating name of Uptime Industries Inc., today is releasing an open-source tool that it says can cut artificial intelligence application development costs by dynamically routing prompts to the most cost-effective language model available.

Cascadeflow aims to help developers reduce application programming interface spending without compromising quality or performance. Most developers hardcode large language models for every query, according to Sascha Buehrle, Lemony’s co-founder and chief executive. “Cascadeflow lets developers run smarter, not bigger, by dynamically choosing the right model for every task,” he said.

The software routes each prompt through a cascading pipeline. It starts with a small, inexpensive model and then evaluates the result against configurable quality metrics such as completeness and correctness. If the output falls short, the software escalates the prompt to a larger model. This approach, known as speculative execution, aims to avoid the cost of using flagship models for every prompt.

The software tracks token usage and costs across models and providers, offering configurable budget controls and per-query spending caps. Developers define their own pricing in a local cost file to account for differences in provider contracts.

Buehrle said initial benchmarks indicate that up to 85% of prompts can be processed using smaller or domain-specific models. “You don’t need a flagship model to answer ‘what’s 2 plus 2,’” he said.

Lemony’s principal business is providing on-premises edge devices running scaled-down language models (pictured).

Broad model support

Cascadeflow initially supports commercial models and processors from OpenAI LLC, Anthropic PBC, Hugging Face Inc., Groq Inc., Together Computer Inc. and the open-source vLLM and Ollama. It also integrates with the Python-based LiteLLM, enabling access to approximately 100 additional language models.
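
The cascading pipeline and per-query spending cap described above can be illustrated with a minimal Python sketch. It does not use cascadeflow's actual API: the names `Tier`, `run_cascade` and `count_tokens`, the prices, the whitespace token count and the stand-in "models" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Tier:
    name: str
    cost_per_1k_tokens: float        # hypothetical price, e.g. read from a local cost file
    generate: Callable[[str], str]   # stand-in for a real model call


@dataclass
class Result:
    answer: str
    model: str
    cost: float


def count_tokens(text: str) -> int:
    # Crude whitespace proxy; a real tokenizer would give different counts.
    return len(text.split())


def run_cascade(prompt: str, tiers: List[Tier],
                is_good_enough: Callable[[str, str], bool],
                max_cost: float) -> Result:
    """Try tiers from cheapest to most expensive, escalating only on poor output."""
    if not tiers:
        raise ValueError("at least one tier is required")
    spent = 0.0
    result = None
    for tier in tiers:
        # Lower bound on the next call's cost: the prompt tokens alone.
        estimate = count_tokens(prompt) / 1000 * tier.cost_per_1k_tokens
        # The first tier always runs; later tiers run only if the cap allows.
        if result is not None and spent + estimate > max_cost:
            break
        answer = tier.generate(prompt)
        tokens = count_tokens(prompt) + count_tokens(answer)
        spent += tokens / 1000 * tier.cost_per_1k_tokens
        result = Result(answer, tier.name, spent)
        if is_good_enough(prompt, answer):
            break  # quality check passed, no need to escalate
    return result


# Toy stand-ins: the small model only "knows" one easy question.
small = Tier("small", 0.1, lambda p: "4" if "2 plus 2" in p else "")
large = Tier("large", 10.0, lambda p: "a detailed answer")


def non_empty(prompt: str, answer: str) -> bool:
    # Placeholder quality metric: accept any non-blank answer.
    return bool(answer.strip())


easy = run_cascade("what's 2 plus 2", [small, large], non_empty, max_cost=1.0)
hard = run_cascade("explain quantum tunnelling", [small, large], non_empty, max_cost=1.0)
capped = run_cascade("explain quantum tunnelling", [small, large], non_empty, max_cost=0.01)
```

In this toy setup the easy prompt is answered by the small tier, the hard prompt escalates to the large tier, and the capped run stays on the small tier because escalating would exceed the per-query limit. A production router would replace the placeholder quality check with real completeness and correctness metrics.
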

The software can be used in cloud environments, on local machines or edge devices. “You can run it wherever your AI application runs,” Buehrle said. “It adds only two milliseconds of latency to your AI stack.”

The software can be deployed with agent frameworks, is compatible with the Model Context Protocol and supports batch processing, streaming and caching optimizations for various providers. It integrates with n8n, a low-code automation platform widely used to build agent workflows.

Buehrle said the company chose to release cascadeflow as open source to build community engagement and transparency into the cost control process. “It’s important to push the core of Lemony out as open source,” he said. “It’s important to build a community and to get [feedback] from the companies using it.”

Cascadeflow is available beginning today on GitHub.

Photo: Lemony.ai
