
# TTD-RAG: Deep Research with Test-Time Diffusion

This repository contains our submission for the MMU-RAG Competition: a deep research agent named TTD-RAG. Our system is a faithful implementation of the framework proposed in the paper "Deep Researcher with Test-Time Diffusion (TTD-DR)", which conceptualizes report generation as an iterative "denoising" process: the agent starts from a preliminary draft and progressively refines it through cycles of targeted search, synthesis, and revision. This approach is designed to excel at complex, multi-hop reasoning tasks that require coherent, long-form answers. (This README was generated with Gemini 2.5.)

## Key Features

- **Test-Time Diffusion Framework:** Models research report generation as an iterative process of refining a "noisy" draft with external information, ensuring coherence and reducing information loss.
- **Report-Level Denoising with Retrieval:** Uses an evolving draft to dynamically guide the search process, so each retrieval step targets a specific knowledge gap.
- **Component-wise Self-Evolution:** Enhances the quality of each step in the workflow (planning, synthesis) by generating diverse variants, critiquing them, and merging them into a superior output.
- **High-Performance Serving:** Utilizes vLLM to serve both the generative model (Qwen/Qwen3-4B-Instruct-2507) and the reranker (tomaarsen/Qwen3-Reranker-0.6B-seq-cls) for high throughput and low latency.
- **Competition Compliant:** Fully supports both the dynamic (streaming) and static evaluation endpoints required by the competition rules, validated with the provided `local_test.py` script.

## How It Works

The agent operates in a structured, multi-stage process orchestrated by `src/pipeline.py` (a conceptual code sketch of this loop appears below, after the setup instructions):

### Stage 1: Planning & Initial Drafting
- An initial **Research Plan** is generated to outline the key areas of investigation.
- A preliminary **Noisy Draft** is created from the LLM's internal knowledge, serving as the starting point for the diffusion process.

### Stage 2: Iterative Search & Denoising
The system enters a loop. In each iteration:
1. A new search query is generated, informed by the current draft's deficiencies and the overall plan.
2. Documents are retrieved from the FineWeb Search API.
3. The retrieved documents are chunked and reranked with a specialized model to surface the most relevant information.
4. The top-ranked chunks are synthesized into a concise answer to the search query.
5. The draft is revised ("denoised") by integrating this new information.

### Stage 3: Final Report Generation
After the iterations complete, the agent synthesizes the final refined draft, the initial plan, and the full history of questions and answers into a single, comprehensive report.

## Tech Stack

- **Backend Framework:** FastAPI
- **LLM Serving:** vLLM
- **Generative LLM:** Qwen/Qwen3-4B-Instruct-2507
- **Reranker Model:** tomaarsen/Qwen3-Reranker-0.6B-seq-cls
- **Retrieval Source:** FineWeb Search API
- **Containerization:** Docker

## Prerequisites

- Docker and Docker Compose
- An NVIDIA GPU with 24 GB+ VRAM
- NVIDIA Container Toolkit

## Setup

First, create a local environment file from the example template; this file stores your API keys. Then open `.env` and add your API keys for:

- `FINEWEB_API_KEY`
- `OPENROUTER_API_KEY` (used as a fallback for the generator)

## Running with Docker Compose

We recommend using Docker Compose, which handles building the image and running the services as defined in `compose.yml` (typically via `docker compose up --build`). This will:

1. Build the Docker image from the `Dockerfile`.
2. Start the container.
3. Execute the `start.sh` script, which first launches the vLLM OpenAI-compatible server in the background to serve the Qwen models and then, after a brief pause to allow the models to load, starts the FastAPI application on port 5053.
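## Conceptual Sketch of the Denoising Loop

To make the three stages above concrete, the sketch below shows the overall shape of the denoising loop. It is a minimal, illustrative rendition, not the actual `src/pipeline.py` code: the function names, prompts, and `ResearchState` fields are assumptions, and the real pipeline additionally performs chunking, component-wise self-evolution, and streaming progress updates.

```python
# Illustrative sketch of the TTD-DR denoising loop (not the real src/pipeline.py API).
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ResearchState:
    question: str
    plan: str = ""
    draft: str = ""                                   # the evolving "noisy" report
    qa_history: list[tuple[str, str]] = field(default_factory=list)

def run_research(
    question: str,
    generate: Callable[[str], str],                   # LLM call: prompt -> text
    search: Callable[[str], list[str]],               # retrieval: query -> documents
    rerank: Callable[[str, list[str]], list[str]],    # query + chunks -> top chunks
    num_iterations: int = 3,
) -> str:
    state = ResearchState(question=question)

    # Stage 1: research plan, then an initial draft from parametric knowledge only.
    state.plan = generate(f"Write a research plan for: {question}")
    state.draft = generate(f"Draft a preliminary report for: {question}")

    # Stage 2: iterative search-and-denoise loop.
    for _ in range(num_iterations):
        query = generate(
            "Propose one search query that fills the biggest remaining gap.\n"
            f"Plan:\n{state.plan}\n\nDraft:\n{state.draft}"
        )
        chunks = rerank(query, search(query))
        answer = generate(f"Answer '{query}' using only these sources:\n" + "\n".join(chunks))
        state.qa_history.append((query, answer))
        state.draft = generate(
            f"Revise the draft to integrate this new finding.\n"
            f"Draft:\n{state.draft}\n\nFinding:\n{answer}"
        )

    # Stage 3: final synthesis from plan, refined draft, and the Q&A history.
    notes = "\n".join(f"Q: {q}\nA: {a}" for q, a in state.qa_history)
    return generate(
        f"Write the final report for '{question}'.\n"
        f"Plan:\n{state.plan}\n\nDraft:\n{state.draft}\n\nNotes:\n{notes}"
    )
```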
## Verifying the Service

Your API is now running and accessible at http://localhost:5053. You can verify that your service is compliant with the competition requirements using the provided `local_test.py` script. A successful run confirms that both endpoints are functioning correctly and that the `result.jsonl` file is generated as expected for the static evaluation.

## API Endpoints

- **Health Check:** `GET /health`
  - A simple endpoint to confirm the service is running. Returns `{"status": "ok"}`.
- **Dynamic Evaluation:** `POST /run`
  - Input: `{"question": "string"}`
  - Output: a Server-Sent Events (SSE) stream that provides real-time updates on the agent's progress, including intermediate steps, citations, and the final report.
- **Static Evaluation:** `POST /evaluate`
  - Input: `{"query": "string", "iid": "string"}`
  - Output: a single JSON response `{"query_id": "string", "generated_response": "string"}`.

Example requests for these endpoints appear at the end of this README.

## Pushing to AWS ECR

The following AWS CLI commands are provided for pushing your final Docker image to the competition's ECR repository.

1. Sign in to AWS ECR:

   ```bash
   aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin .dkr.ecr.us-east-1.amazonaws.com
   ```

2. Build the image (if not already built), making sure to build for the correct platform:

   ```bash
   docker build --platform linux/amd64 -t ttt-dr:latest .
   ```

3. Tag the image for ECR:

   ```bash
   docker tag ttt-dr:latest .dkr.ecr.us-east-1.amazonaws.com/neurips2025text/ttt-dr:latest
   ```

4. Push the image to ECR:

   ```bash
   docker push .dkr.ecr.us-east-1.amazonaws.com/neurips2025text/ttt-dr:latest
   ```
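## Example Requests

For quick manual checks against a running container, the endpoints can also be exercised with a small Python client. This is a sketch under the assumptions that the service is reachable at http://localhost:5053 (as above) and that the `requests` package is installed; the example question and `iid` values are placeholders.

```python
# Minimal client sketch for manual checks; adjust the base URL if needed.
import requests

BASE = "http://localhost:5053"

# Health check: expects {"status": "ok"}.
print(requests.get(f"{BASE}/health", timeout=10).json())

# Static evaluation: single JSON response with the generated report.
resp = requests.post(
    f"{BASE}/evaluate",
    json={"query": "What causes auroras?", "iid": "demo-001"},  # placeholder values
    timeout=600,
)
print(resp.json()["generated_response"][:200])

# Dynamic evaluation: consume the SSE stream line by line.
with requests.post(
    f"{BASE}/run",
    json={"question": "What causes auroras?"},  # placeholder question
    stream=True,
    timeout=600,
) as stream:
    for line in stream.iter_lines(decode_unicode=True):
        if line:  # SSE events arrive as "data: ..." lines
            print(line)
```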