Online RL for Code Completion Agent#

Use this recipe to connect Continue to RLinf, collect accept/reject feedback on code completions, and update a Qwen coder model online.

Overview#

At a glance, the recipe uses the components listed below; the table after them shows how a completion request flows through the online loop.

Item	Value
Model	Qwen2.5-Coder-1.5B
Algorithm	PPO (online RL) or GRPO (offline validation)
Feedback	Continue accept/reject events or LLM-as-judge labels
Services	Inference on port 8081, feedback ingestion on port 8082

Step	Component	Outcome
Real-time interaction	Continue extension	Sends code-completion requests to RLinf
Model inference	RLinf inference service	Returns code-completion suggestions
User feedback	Continue tracking callback	Records accepted or rejected completions
Online learning	RLinf training service	Updates the policy from feedback

Installation#

Install RLinf first, then add the lightweight HTTP client dependencies used by this recipe:

# Install additional dependencies
# (asyncio ships with the Python standard library, so only httpx needs installing)
pip install httpx

If using the offline validation example, download the dataset:

modelscope download --dataset "paxionfruit/code-fim-v2-python-filtered" --local_dir code-fim-v2-python-filtered

Run It#

Configure Continue Integration#

Install Continue Extension

The upstream Continue extension does not currently support uploading user preference feedback on code completions, so this recipe uses a modified build of Continue that adds it. You can download the compiled plugin from here or build it yourself.

After downloading the compiled Continue plugin, install it in VS Code.

Method 1: Run code --install-extension /path/to/continue-1.3.9.vsix

Method 2: In VS Code, press Cmd+Shift+P (Ctrl+Shift+P on Windows and Linux), type ‘Extensions: Install from VSIX’, and select the file above

Configure Continue Settings

The Continue configuration file path is:

~/.continue/config.yaml

Add the following settings to your Continue configuration file:

# Replace the xxx host in the URLs below with the address of your RLinf online code completion service

# Add a model for code completion
models:
  - name: my-autocomplete
    provider: openai
    model: Qwen2.5-Coder-1.5B
    apiBase: http://xxx:8081/v1
    apiKey: xxx
    roles:
      - autocomplete

# Send accept/reject feedback on code completions to the training service
tabAutocompleteOptions:
  enableCompletionTracking: true
  completionTrackingUrl: http://xxx:8082/api/training/submit
  completionTrackingHeaders:
    Authorization: Bearer test-token
    X-Project-ID: test-project
  maxPromptTokens: 1024
  debounceDelay: 350
  multilineCompletions: auto

After saving the file, open the Continue extension from the left panel, click the “Settings” gear button in the top-right corner, and make sure “Autocomplete Model” is set to my-autocomplete on the “Models” page.
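To check the autocomplete endpoint outside the editor, a request like the one Continue sends can be assembled by hand. The sketch below is illustrative only: it assumes the inference service exposes an OpenAI-style `/completions` route under `apiBase` and accepts a Qwen2.5-Coder fill-in-the-middle prompt. The helper names (`build_fim_prompt`, `build_completion_request`, `send`) and the `localhost` address are hypothetical, not part of RLinf or Continue.

```python
import json

# Fill-in-the-middle special tokens used by Qwen2.5-Coder.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Wrap the code before and after the cursor in FIM tokens."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


def build_completion_request(api_base: str, api_key: str, prefix: str,
                             suffix: str, max_tokens: int = 64) -> dict:
    """Describe a POST to {api_base}/completions (route is an assumption)."""
    return {
        "url": api_base.rstrip("/") + "/completions",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "json": {
            "model": "Qwen2.5-Coder-1.5B",
            "prompt": build_fim_prompt(prefix, suffix),
            "max_tokens": max_tokens,
        },
    }


def send(request: dict, timeout: float = 10.0) -> dict:
    """Send the request with httpx; requires a running inference service."""
    import httpx  # imported lazily so the builders work without httpx

    response = httpx.post(request["url"], headers=request["headers"],
                          json=request["json"], timeout=timeout)
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    # Hypothetical local address; use the host from your Continue config.
    req = build_completion_request(
        "http://localhost:8081/v1", "xxx", "def add(a, b):\n    return ", "\n"
    )
    print(json.dumps(req["json"], indent=2))
```

Keeping request construction separate from `send` makes it possible to inspect the payload without a running service; call `send(req)` once the inference service is up.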

Start Training Service#

Prepare Model and Configuration

For common path, runner, rollout, and cluster fields, see Training configuration.

For online RL, edit and use examples/agent/coding_online_rl/config/qwen2.5-1.5b-ppo.yaml:

runner:
  output_dir: /path/to/your/logs

rollout:
  model:
    model_path: /path/to/your/model

For offline validation, edit and use examples/agent/coding_online_rl/config/qwen2.5-1.5b-grpo-llm_judge.yaml:

runner:
  output_dir: /path/to/your/logs

rollout:
  model:
    model_path: /path/to/your/model

data:
  train_data_paths: ["/path/to/your/dataset/code-fim-v2-python-filtered_formatted_train_3k.jsonl"]
  val_data_paths: ["/path/to/your/dataset/code-fim-v2-python-filtered_formatted_test_1k.jsonl"]

Also set the API endpoint and key for the LLM-as-judge used to simulate feedback:

export LLMASJUDGE_API_URL=your_api_url
export LLMASJUDGE_API_KEY=your_api_key
export LLMASJUDGE_MODEL=your_model  # optional and not recommended; if you override the judge model, make sure the judge prompt suits it
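Before launching offline validation, it can help to confirm that the judge variables are set and that the dataset files parse as JSONL. The following is a minimal preflight sketch under those assumptions; `missing_judge_vars` and `count_jsonl_records` are hypothetical helpers, not part of RLinf, and the check does not inspect the record fields because their schema is not described here.

```python
import json
import os
import sys

# Variables the judge needs; LLMASJUDGE_MODEL is treated as optional.
REQUIRED_VARS = ("LLMASJUDGE_API_URL", "LLMASJUDGE_API_KEY")


def missing_judge_vars(env=None) -> list:
    """Return the required judge variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def count_jsonl_records(lines) -> int:
    """Count non-empty lines, raising ValueError on the first invalid one."""
    count = 0
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number}: invalid JSON ({exc.msg})") from exc
        count += 1
    return count


if __name__ == "__main__":
    missing = missing_judge_vars()
    if missing:
        sys.exit("Missing environment variables: " + ", ".join(missing))
    # Pass the train/val JSONL paths from the config as arguments.
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as handle:
            print(path, count_jsonl_records(handle), "records")
```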

Start RLinf Training Service

For online RL:
```
# Navigate to project directory
cd /path/to/RLinf

# Start training service
bash examples/agent/coding_online_rl/run_main_coding_online_rl.sh
```
This starts the following services:

- Inference Service: provides the code completion API on port 8081
- Training Service: receives user feedback data on port 8082
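Model loading can take a while, so the two ports may not accept connections immediately after the script starts. A small polling helper like the sketch below can confirm that both services are listening before you point Continue at them. The function names and the default timeouts are illustrative assumptions, and a successful TCP connection only shows that something is listening, not that the service is healthy.

```python
import socket
import time


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_services(host: str, ports, deadline_s: float = 60.0,
                      poll_s: float = 1.0) -> dict:
    """Poll until every port accepts connections or the deadline passes."""
    ports = list(ports)
    end = time.monotonic() + deadline_s
    status = {port: False for port in ports}
    while True:
        for port in ports:
            # Once a port has answered, keep it marked as up.
            status[port] = status[port] or is_listening(host, port)
        if all(status.values()) or time.monotonic() >= end:
            return status
        time.sleep(poll_s)


if __name__ == "__main__":
    # 8081 = inference service, 8082 = training service (from this recipe).
    print(wait_for_services("127.0.0.1", [8081, 8082], deadline_s=5.0))
```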

For offline validation:

# Navigate to project directory
cd /path/to/RLinf

# Start training service
bash examples/agent/coding_online_rl/run_main_coding_rl_llm_judge.sh

Use Continue#

Start Continue

Launch the Continue extension in VS Code and confirm that it connects to the correct API endpoints.

Begin Programming

Start writing code in VS Code with Continue enabled. The system will:

- Automatically send code completion requests to the inference service
- Receive model-generated code suggestions
- Collect your accept/reject feedback on suggestions

Real-time Learning

The system processes your feedback in real time:

- Accepted suggestions are marked as positive feedback
- Rejected suggestions are marked as negative feedback
- Model parameters are updated online based on that feedback
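One way to picture the feedback-to-training step is as a mapping from accept/reject events to scalar rewards that are then grouped into batches. The sketch below illustrates that idea only; the `FeedbackEvent` fields, the +1.0/-1.0 reward values, and the batching rule are assumptions made for illustration, not RLinf's actual data model or reward scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedbackEvent:
    """One accept/reject decision (field names are hypothetical)."""
    completion_id: str
    prompt: str
    completion: str
    accepted: bool


def to_reward(event: FeedbackEvent, accept_reward: float = 1.0,
              reject_reward: float = -1.0) -> float:
    """Map positive feedback to accept_reward and negative to reject_reward."""
    return accept_reward if event.accepted else reject_reward


def build_batches(events, batch_size: int) -> list:
    """Group (prompt, completion, reward) samples into full batches.

    A trailing partial batch is dropped here; an online trainer would
    more likely hold it back until enough feedback has arrived.
    """
    samples = [(e.prompt, e.completion, to_reward(e)) for e in events]
    full = len(samples) - len(samples) % batch_size
    return [samples[i:i + batch_size] for i in range(0, full, batch_size)]


if __name__ == "__main__":
    events = [
        FeedbackEvent("c1", "def add(a, b):", " return a + b", True),
        FeedbackEvent("c2", "def sub(a, b):", " return a + b", False),
        FeedbackEvent("c3", "def mul(a, b):", " return a * b", True),
    ]
    for batch in build_batches(events, batch_size=2):
        print(batch)
```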

Visualization and Results#

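Monitor training through the logs, TensorBoard, and saved checkpoints. The commands below use results/ppo-1.5b and results/grpo-1.5b as example output directories; substitute the directory of the run you started. For common metric meanings, see Training metrics.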

View Log Output

# View training logs
tail -f results/ppo-1.5b/train.log

Use TensorBoard

# Start TensorBoard
tensorboard --logdir results/grpo-1.5b

Check Model Checkpoints

Model checkpoints are periodically saved to the results/grpo-1.5b/checkpoints/ directory during training.

Verify the Client#

Use the provided test client to verify system functionality:

# Run test client
python examples/agent/coding_online_rl/simple_online_coding_client.py

The test client simulates Continue behavior by sending code completion requests and submitting feedback data.
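For orientation, the feedback half of such a client can be sketched as a POST to the tracking URL with the headers from the Continue configuration. This is not the code of simple_online_coding_client.py: the payload fields (`completion_id`, `accepted`) are a hypothetical schema, and only the `/api/training/submit` path and the two headers come from the configuration shown earlier.

```python
import json


def build_feedback_request(base_url: str, token: str, project_id: str,
                           completion_id: str, accepted: bool) -> dict:
    """Describe a POST to the feedback endpoint used in the Continue config."""
    return {
        "url": base_url.rstrip("/") + "/api/training/submit",
        "headers": {
            "Authorization": f"Bearer {token}",
            "X-Project-ID": project_id,
        },
        # Hypothetical payload schema, for illustration only.
        "json": {"completion_id": completion_id, "accepted": bool(accepted)},
    }


def submit(request: dict, timeout: float = 10.0) -> int:
    """Send the feedback with httpx and return the HTTP status code."""
    import httpx  # imported lazily so the builder works without httpx

    response = httpx.post(request["url"], headers=request["headers"],
                          json=request["json"], timeout=timeout)
    response.raise_for_status()
    return response.status_code


if __name__ == "__main__":
    # Hypothetical local address; token and project ID match the sample config.
    req = build_feedback_request(
        "http://localhost:8082", "test-token", "test-project", "c1", True
    )
    print(json.dumps(req, indent=2))
```

Printing the request first and calling `submit(req)` only once the training service is running keeps the check usable on a machine without the service.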

Troubleshooting#

Common issues and solutions:

Port Conflicts

If port 8081 or 8082 is already in use, change the port settings in the configuration file, and update apiBase and completionTrackingUrl in the Continue configuration to match.

Model Loading Failure

Check that the model path is correct and that the model files exist and are readable.

Continue Connection Failure

Check that the API endpoint addresses in the Continue configuration are correct and that the machine running VS Code can reach them. You can also run simple_online_coding_client.py to test whether the training service receives feedback data.

Use this setup as the online loop: Continue sends requests, RLinf collects feedback, and the training service updates the policy.