Live Dashboard of PR Collection and SWE Task Generation
Live state of GitHub PR collection and verifiable SWE-Bench task generation across 8 languages. Detailed per-language analysis follows in the sections below.
Language Progress
| Language | PRs collected | PRs last 1h | PRs last 24h | Valid SWE | Valid SWE last 1h | Valid SWE last 24h | Processed | Success rate |
|---|---|---|---|---|---|---|---|---|
| C | 32,357 | 0 | +2,510 | 9,709 | 0 | +1 | 30,871 | |
| C++ | 48,172 | 0 | +2,088 | 4,039 | 0 | +3 | 20,983 | |
| Go | 132,103 | 0 | +4,698 | 8,097 | 0 | +26 | 90,113 | |
| Java | 89,029 | 0 | +2,001 | 4,019 | +1 | +2 | 83,284 | |
| JavaScript | 39,682 | 0 | +864 | 7,139 | +1 | +20 | 37,560 | |
| Python | 105,769 | 0 | +5,196 | 5,058 | 0 | +10 | 101,447 | |
| Rust | 71,901 | 0 | +2,839 | 5,505 | 0 | +9 | 70,449 | |
| TypeScript | 70,490 | 0 | +3,422 | 6,411 | +1 | +24 | 58,231 | |
Run Parameters
| Language | Eval model (OPENAI) | Completion model (ANTHROPIC) | Concurrency | min_source_files | max_source_files |
|---|---|---|---|---|---|
| C | gpt-5.4 | claude-sonnet-4-6 | 12 | 2 | 15 |
| C++ | Qwen3.6-35B-A3B | Qwen3.6-35B-A3B | 8 | 2 | 15 |
| Go | Qwen3.6-35B-A3B | Qwen3.6-35B-A3B | 12 | 2 | 10 |
| Java | claude-haiku-4-5-20251001 | claude-sonnet-4-6 | 8 | 2 | 10 |
| JavaScript | Qwen3.6-35B-A3B | Qwen3.6-35B-A3B | 12 | 2 | 10 |
| Python | glm-5 | claude-sonnet-4-6 | 12 | 3 | 15 |
| Rust | Qwen3.6-35B-A3B | Qwen3.6-35B-A3B | 8 | 2 | 10 |
| TypeScript | Qwen3.6-35B-A3B | Qwen3.6-35B-A3B | 12 | 2 | 10 |
Failure Reason Breakdown
| Language | Processed | Valid SWE | Failed | trivial_pr | validation | infra_error | timeout | workflow_error | Other |
|---|---|---|---|---|---|---|---|---|---|
| C | 30,871 | 9,709 | 21,162 | 15,225 | 1,430 | 4,717 | 37 | 41 | 1 |
| C++ | 20,983 | 4,039 | 16,944 | 2,684 | 666 | 15,087 | 159 | 326 | 266 |
| Go | 90,113 | 8,097 | 82,016 | 22,865 | 8,658 | 47,664 | 1,527 | 815 | 471 |
| Java | 83,284 | 4,019 | 79,265 | 20,117 | 6,645 | 44,935 | 1,298 | 5,065 | 1,602 |
| JavaScript | 37,560 | 7,139 | 30,421 | 15,594 | 1,403 | 14,097 | 529 | 180 | 0 |
| Python | 101,447 | 5,058 | 96,389 | 29,812 | 7,762 | 59,053 | 985 | 375 | 120 |
| Rust | 70,449 | 5,505 | 64,944 | 20,305 | 5,584 | 36,324 | 1,241 | 914 | 1,234 |
| TypeScript | 58,231 | 6,411 | 51,820 | 14,674 | 4,532 | 30,386 | 1,760 | 784 | 11 |
trivial_pr: the PR was judged by the LLM as too trivial (e.g. only config, docs, or dependency-version changes) and unsuitable as a SWE task.
validation: validation failed after task generation (the NOP agent did not return reward=0, or the ORACLE agent did not return reward=1).
infra_error: infrastructure error (Docker build failure, network timeout, insufficient disk space, etc.).
timeout: processing timed out (per-PR total timeout or Claude Code session timeout).
workflow_error: workflow error (PR metadata fetch failure, worktree creation failure, patch generation failure, etc.).
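The validation rule described above can be written as a small predicate. This is a minimal sketch rather than swegen's actual code: the function name `is_valid_task` is hypothetical, and it assumes each agent run yields a single numeric reward, with the NOP agent leaving the repository unchanged and the ORACLE agent applying the reference fix.

```python
def is_valid_task(nop_reward: float, oracle_reward: float) -> bool:
    """Return True when a generated task passes validation.

    Assumed semantics: the NOP run must fail the tests (reward == 0)
    and the ORACLE run must pass them (reward == 1). Any other
    combination would be counted under the `validation` failure reason.
    """
    return nop_reward == 0 and oracle_reward == 1
```

Under these assumptions, a task whose tests already pass without the fix (NOP reward of 1) is rejected, because the tests would not distinguish a correct solution from no change at all.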
fix.patch Complexity
| Language | Valid SWE Count | Avg fix.patch lines | Avg fix.patch hunks | Avg fix.patch files |
|---|---|---|---|---|
| C | 9,709 | 336.57 | 17.94 | 5.85 |
| C++ | 4,039 | 286.89 | 13.76 | 5.10 |
| Go | 8,097 | 213.23 | 12.66 | 4.36 |
| Java | 4,019 | 163.34 | 10.50 | 4.23 |
| JavaScript | 7,139 | 77.09 | 6.28 | 2.79 |
| Python | 5,058 | 151.42 | 10.98 | 3.83 |
| Rust | 5,505 | 225.88 | 13.17 | 4.10 |
| TypeScript | 6,411 | 158.36 | 9.61 | 4.13 |
Metric Definitions
Difficulty score (difficulty_score)
The score is computed statically, with zero API calls, by src/swegen/scoring.py, which reads each valid task directory's solution/fix.patch, tests/, and instruction.md.
The current formula uses log-scale continuous scoring so that mid-sized patches are not rated hard prematurely. Weights: patch_scope 38%, logic_complexity 32%, context_breadth 15%, test_complexity 10%, instruction_complexity 5%.
Label thresholds: easy <= 4.0, medium <= 7.0, hard > 7.0.
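The weighting and labeling described above can be sketched as follows. Only the weights and the label thresholds come from the text; the 0 to 10 component scale, the `log_scale` curve, and its `saturation` parameter are assumptions made for illustration and may differ from what src/swegen/scoring.py actually does.

```python
import math

# Weights as listed in the text; they sum to 1.0.
WEIGHTS = {
    "patch_scope": 0.38,
    "logic_complexity": 0.32,
    "context_breadth": 0.15,
    "test_complexity": 0.10,
    "instruction_complexity": 0.05,
}


def log_scale(value: float, saturation: float) -> float:
    """Map a raw count onto 0..10 with a logarithmic curve.

    Hypothetical helper: `saturation` is the raw value that maps to 10.
    """
    if value <= 0:
        return 0.0
    return min(10.0, 10.0 * math.log1p(value) / math.log1p(saturation))


def difficulty_score(components: dict[str, float]) -> float:
    """Weighted sum of component scores, each assumed to lie in 0..10."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


def difficulty_label(score: float) -> str:
    """Apply the thresholds from the text: easy <= 4.0, medium <= 7.0."""
    if score <= 4.0:
        return "easy"
    if score <= 7.0:
        return "medium"
    return "hard"
```

A logarithmic curve grows quickly for small inputs and flattens for large ones, which is one way to keep a 200-line patch from scoring close to a 2,000-line patch.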
Tag generation and display
Tags are not computed live by the dashboard; they are generated by the LLM from PR information when swegen builds the task, and written to [metadata].tags in task.toml.
The prompt asks for tags in four parts: programming language, project layer/domain, framework/library or specific topic, and a domain-independent bug class (e.g. missing-fallback, incomplete-validation). The dashboard reads existing task.toml files, counts each language's tag occurrences and share, and treats the 4th tag as the bug class for the Bug-Class panels below.
fix.patch statistics
Patch stats come from each valid task's solution/fix.patch, filtering code files by language extension, consistent with the code-only stats in upload_march_swe_to_hf.py.
Avg fix.patch lines counts added/removed lines in code-file diffs; Avg fix.patch hunks counts @@ hunks; Avg fix.patch files counts the code files involved.
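The three patch metrics can be approximated with a small unified-diff parser. This is an illustrative sketch, not the code used by the dashboard or by upload_march_swe_to_hf.py: the function name, the returned keys, and the extension set passed by the caller are all assumptions, and it expects git-style `diff --git a/<path> b/<path>` headers.

```python
from pathlib import PurePosixPath


def patch_stats(patch_text: str, extensions: set[str]) -> dict[str, int]:
    """Count changed lines, @@ hunks, and files, restricted to code files."""
    changed_lines = hunks = 0
    files: set[str] = set()
    in_code_file = in_hunk = False
    for line in patch_text.splitlines():
        if line.startswith("diff --git "):
            # New file section: take the path after the last " b/".
            path = line.rsplit(" b/", 1)[-1]
            in_code_file = PurePosixPath(path).suffix in extensions
            in_hunk = False
            if in_code_file:
                files.add(path)
        elif not in_code_file:
            continue  # skip docs, configs, and other non-code files
        elif line.startswith("@@"):
            hunks += 1
            in_hunk = True
        elif in_hunk and line.startswith(("+", "-")):
            # Inside a hunk, "+" and "-" mark added and removed lines.
            # The "---" / "+++" file headers precede the first hunk,
            # so they are never counted here.
            changed_lines += 1
    return {"lines": changed_lines, "hunks": hunks, "files": len(files)}
```

Calling `patch_stats(text, {".py"})` on a Python task's fix.patch would ignore changes to README or config files, which matches the code-only filtering described above.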
difficulty_label Distribution
| Language | easy | medium | hard |
|---|---|---|---|
| C | 869 | 6,414 | 2,418 |
| C++ | 430 | 2,507 | 1,097 |
| Go | 630 | 5,857 | 1,604 |
| Java | 443 | 2,604 | 968 |
| JavaScript | 1,077 | 5,291 | 770 |
| Python | 273 | 3,248 | 1,513 |
| Rust | 382 | 3,260 | 1,861 |
| TypeScript | 573 | 4,743 | 1,095 |
difficulty_score Overview
| Language | count | min | p25 | median | mean | p75 | max |
|---|---|---|---|---|---|---|---|
| C | 9,701 | 2.4 | 4.9 | 6.0 | 5.97 | 7.0 | 9.2 |
| C++ | 4,034 | 2.5 | 4.9 | 6.0 | 5.99 | 7.2 | 9.1 |
| Go | 8,091 | 2.6 | 4.9 | 5.8 | 5.87 | 6.8 | 9.1 |
| Java | 4,015 | 2.8 | 4.8 | 5.9 | 5.90 | 7.0 | 9.2 |
| JavaScript | 7,138 | 2.6 | 4.4 | 5.2 | 5.36 | 6.2 | 9.2 |
| Python | 5,034 | 2.6 | 5.2 | 6.2 | 6.22 | 7.3 | 9.1 |
| Rust | 5,503 | 2.7 | 5.2 | 6.3 | 6.26 | 7.4 | 9.0 |
| TypeScript | 6,411 | 2.7 | 4.7 | 5.6 | 5.72 | 6.6 | 9.1 |
Global Top Tags
Per-Language Tag Distribution
Global Top Bug Classes
Bug class is the 4th tag in task.toml -> [metadata].tags: a domain-independent label describing the defect mechanism (e.g. missing-fallback, incomplete-validation, off-by-one-error). It is generated by the LLM during swegen create and backfilled into legacy 3-tag tasks via swegen backfill-tags.
Per-Language Bug-Class Distribution
Top bug classes per mainstream language. Counts are over tasks whose task.toml already carries a 4-tag entry; tasks still on the legacy 3-tag schema do not contribute until the backfill catches up.