Live Dashboard of PR Collection and SWE Task Generation
Live state of GitHub PR collection and verifiable SWE-Bench task generation across 8 languages, followed by detailed per-language analysis.
Language Progress
| Language | PRs collected | PRs last 1h | PRs last 24h | Valid SWE | Valid SWE last 1h | Valid SWE last 24h | Processed | Success rate |
|---|---|---|---|---|---|---|---|---|
| C | 32,981 | 0 | 0 | 9,769 | 0 | +53 | 32,981 | |
| C++ | 49,163 | 0 | 0 | 4,062 | 0 | 0 | 48,170 | |
| Go | 133,024 | 0 | +2 | 8,386 | 0 | +17 | 123,518 | |
| Java | 90,836 | 0 | +1 | 4,229 | +17 | +77 | 90,730 | |
| JavaScript | 40,065 | 0 | 0 | 7,257 | 0 | 0 | 40,065 | |
| Python | 108,345 | 0 | 0 | 5,232 | +25 | +113 | 108,256 | |
| Rust | 72,577 | 0 | 0 | 5,595 | 0 | 0 | 72,539 | |
| TypeScript | 71,783 | 0 | +1 | 6,636 | +4 | +24 | 71,716 | |
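A success rate can be derived from the other columns of this table. The sketch below assumes the rate is Valid SWE divided by Processed; the dashboard does not state the definition, and `success_rate` is a hypothetical helper rather than part of swegen.

```python
def success_rate(valid_swe: int, processed: int) -> float:
    """Share of processed PRs that became valid SWE tasks (assumed definition)."""
    if processed == 0:
        return 0.0  # nothing processed yet, so report 0 instead of dividing by zero
    return valid_swe / processed


# Figures taken from the C row of the Language Progress table.
print(f"C: {success_rate(9_769, 32_981):.1%}")
```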
Run Parameters
| Language | Eval model (OPENAI) | Completion model (ANTHROPIC) | Concurrency | min_source_files | max_source_files |
|---|---|---|---|---|---|
| C | glm-5 | claude-opus-4-7 | 12 | 2 | 15 |
| C++ | glm-5 | claude-opus-4-7 | 8 | 2 | 15 |
| Go | glm-5 | claude-opus-4-7 | 12 | 2 | 10 |
| Java | glm-5 | claude-opus-4-7 | 8 | 2 | 10 |
| JavaScript | glm-5 | claude-opus-4-7 | 12 | 2 | 10 |
| Python | glm-5 | claude-opus-4-7 | 12 | 3 | 15 |
| Rust | glm-5 | claude-opus-4-7 | 8 | 2 | 10 |
| TypeScript | glm-5 | claude-opus-4-7 | 12 | 2 | 10 |
Failure Reason Breakdown
| Language | Processed | Valid SWE | Failed | trivial_pr | validation | infra_error | timeout | workflow_error | Other |
|---|---|---|---|---|---|---|---|---|---|
| C | 32,981 | 9,769 | 23,212 | 19,357 | 1,131 | 2,951 | 5 | 26 | 2 |
| C++ | 48,170 | 4,062 | 44,108 | 8,760 | 6,809 | 29,698 | 159 | 662 | 266 |
| Go | 123,518 | 8,386 | 115,132 | 38,741 | 24,048 | 49,742 | 1,461 | 1,033 | 108 |
| Java | 90,730 | 4,229 | 86,501 | 26,921 | 10,989 | 41,347 | 1,064 | 5,011 | 1,599 |
| JavaScript | 40,065 | 7,257 | 32,808 | 20,652 | 93 | 13,421 | 1 | 15 | 0 |
| Python | 108,256 | 5,232 | 103,024 | 44,388 | 15,511 | 43,581 | 817 | 339 | 120 |
| Rust | 72,539 | 5,595 | 66,944 | 30,149 | 10,205 | 24,100 | 1,077 | 846 | 1,233 |
| TypeScript | 71,716 | 6,636 | 65,080 | 22,080 | 13,937 | 26,868 | 1,765 | 722 | 8 |
trivial_pr: the PR was judged by the LLM as too trivial (e.g. only config, docs, or dependency-version changes) and unsuitable as a SWE task.
validation: validation failed after task generation (the NOP agent did not return reward=0, or the ORACLE agent did not return reward=1).
infra_error: infrastructure error (Docker build failure, network timeout, insufficient disk space, etc.).
timeout: processing timed out (per-PR total timeout or Claude Code session timeout).
workflow_error: workflow error (PR metadata fetch failure, worktree creation failure, patch generation failure, etc.).
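The validation rule above can be expressed as a single predicate. This is a minimal sketch, assuming each agent run yields one numeric reward; `is_valid_task` and its parameter names are hypothetical.

```python
def is_valid_task(nop_reward: float, oracle_reward: float) -> bool:
    """Return True only if the NOP run fails (reward 0) and the ORACLE run passes (reward 1).

    A task whose tests pass without any change, or fail even with the
    reference fix applied, cannot be verified and is rejected.
    """
    return nop_reward == 0 and oracle_reward == 1
```

A task that fails this check would be counted under `validation` in the table above.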
fix.patch Complexity
| Language | Valid SWE Count | Avg fix.patch lines | Avg fix.patch hunks | Avg fix.patch files |
|---|---|---|---|---|
| C | 9,769 | 335.71 | 17.92 | 5.84 |
| C++ | 4,062 | 285.93 | 13.74 | 5.09 |
| Go | 8,386 | 212.01 | 12.56 | 4.34 |
| Java | 4,229 | 166.56 | 10.67 | 4.26 |
| JavaScript | 7,257 | 77.11 | 6.31 | 2.80 |
| Python | 5,232 | 153.73 | 11.13 | 3.88 |
| Rust | 5,595 | 226.35 | 13.19 | 4.11 |
| TypeScript | 6,636 | 155.69 | 9.53 | 4.11 |
Metric Definitions
Difficulty score (difficulty_score)
src/swegen/scoring.py reads each valid task directory's solution/fix.patch, tests/, and instruction.md and scores the task statically, with zero API calls.
The current formula uses log-scale continuous scoring so that mid-sized patches are not labeled hard too early. Weights: patch_scope 38%, logic_complexity 32%, context_breadth 15%, test_complexity 10%, instruction_complexity 5%.
Label thresholds: easy <= 4.0, medium <= 7.0, hard > 7.0.
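The weights and thresholds above combine as in the sketch below. It is not the code in src/swegen/scoring.py: it assumes every component is already scored on a 0-10 scale, and `log_scaled` with its `saturation` parameter is a hypothetical illustration of log-scale scoring.

```python
import math

# Weights as listed above; they sum to 1.0.
WEIGHTS = {
    "patch_scope": 0.38,
    "logic_complexity": 0.32,
    "context_breadth": 0.15,
    "test_complexity": 0.10,
    "instruction_complexity": 0.05,
}


def log_scaled(value: float, saturation: float) -> float:
    """Map a raw count onto 0-10 on a log scale (hypothetical helper).

    `saturation` is the assumed raw value at which the component reaches 10.
    """
    return min(10.0, 10.0 * math.log1p(value) / math.log1p(saturation))


def difficulty_score(components: dict[str, float]) -> float:
    """Weighted sum of component scores, each assumed to lie in 0-10."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


def difficulty_label(score: float) -> str:
    """Apply the label thresholds: easy <= 4.0, medium <= 7.0, hard > 7.0."""
    if score <= 4.0:
        return "easy"
    if score <= 7.0:
        return "medium"
    return "hard"
```

On a log scale, growing a patch from 100 to 200 lines raises the component far less than growing it from 10 to 20 lines would on a linear scale, which keeps mid-sized patches in the medium band.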
Tag generation and display
Tags are not computed live by the dashboard. The LLM generates them from PR information when swegen builds the task, and they are written to [metadata].tags in task.toml.
The prompt asks for tags in four parts: programming language, project layer/domain, framework/library or specific topic, and a domain-independent bug class (e.g. missing-fallback, incomplete-validation). The dashboard reads existing task.toml files, counts each language's tag occurrences and share, and treats the 4th tag as the bug class for the Bug-Class panels below.
fix.patch statistics
Patch stats come from each valid task's solution/fix.patch, filtering code files by language extension, consistent with the code-only stats in upload_march_swe_to_hf.py.
Avg fix.patch lines counts added/removed lines in code-file diffs; Avg fix.patch hunks counts @@ hunks; Avg fix.patch files counts the code files involved.
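The three per-patch counts can be computed from a unified diff roughly as below. This is a sketch under assumptions, not the code in upload_march_swe_to_hf.py: it expects git-style `diff --git` headers, and `patch_stats` and `code_extensions` are hypothetical names.

```python
def patch_stats(patch_text: str, code_extensions: tuple[str, ...]) -> dict[str, int]:
    """Count changed lines, @@ hunks, and files, restricted to code files."""
    lines = hunks = files = 0
    in_code_file = in_hunk = False
    for line in patch_text.splitlines():
        if line.startswith("diff --git "):
            # New file section: decide from the b/ path whether it is a code file.
            path = line.rsplit(" b/", 1)[-1]
            in_code_file = path.endswith(code_extensions)
            in_hunk = False
            files += int(in_code_file)
        elif not in_code_file:
            continue  # skip docs, config, and other non-code files
        elif line.startswith("@@"):
            hunks += 1
            in_hunk = True
        elif in_hunk and line.startswith(("+", "-")):
            # Only lines inside a hunk count, so ---/+++ file headers are excluded.
            lines += 1
    return {"lines": lines, "hunks": hunks, "files": files}


sample_patch = """\
diff --git a/src/app.py b/src/app.py
--- a/src/app.py
+++ b/src/app.py
@@ -1,2 +1,3 @@
 import os
-x = 1
+x = 2
+y = 3
diff --git a/README.md b/README.md
--- a/README.md
+++ b/README.md
@@ -1 +1 @@
-old
+new
"""

print(patch_stats(sample_patch, (".py",)))
```

Averaging these counts over all valid tasks of a language would give the columns of the fix.patch Complexity table.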
difficulty_label Distribution
| Language | easy / medium / hard | easy | medium | hard |
|---|---|---|---|---|
| C | 871 / 6455 / 2435 | 871 | 6,455 | 2,435 |
| C++ | 433 / 2524 / 1100 | 433 | 2,524 | 1,100 |
| Go | 689 / 6041 / 1650 | 689 | 6,041 | 1,650 |
| Java | 468 / 2732 / 1025 | 468 | 2,732 | 1,025 |
| JavaScript | 1096 / 5376 / 784 | 1,096 | 5,376 | 784 |
| Python | 275 / 3348 / 1585 | 275 | 3,348 | 1,585 |
| Rust | 391 / 3312 / 1890 | 391 | 3,312 | 1,890 |
| TypeScript | 612 / 4894 / 1130 | 612 | 4,894 | 1,130 |
difficulty_score Overview
| Language | count | min | p25 | median | mean | p75 | max |
|---|---|---|---|---|---|---|---|
| C | 9,761 | 2.4 | 4.9 | 6.0 | 5.98 | 7.0 | 9.2 |
| C++ | 4,057 | 2.5 | 4.9 | 6.0 | 5.99 | 7.2 | 9.1 |
| Go | 8,380 | 2.6 | 4.9 | 5.8 | 5.85 | 6.8 | 9.1 |
| Java | 4,225 | 2.8 | 4.8 | 5.9 | 5.90 | 7.0 | 9.2 |
| JavaScript | 7,256 | 2.6 | 4.4 | 5.2 | 5.36 | 6.2 | 9.2 |
| Python | 5,208 | 2.6 | 5.2 | 6.2 | 6.23 | 7.3 | 9.1 |
| Rust | 5,593 | 2.7 | 5.2 | 6.3 | 6.26 | 7.4 | 9.0 |
| TypeScript | 6,636 | 2.7 | 4.7 | 5.6 | 5.72 | 6.6 | 9.2 |
Global Top Tags
Per-Language Tag Distribution
Global Top Bug Classes
Bug class is the 4th tag in task.toml -> [metadata].tags: a domain-independent label describing the defect mechanism (e.g. missing-fallback, incomplete-validation, off-by-one-error). It is generated by the LLM during swegen create and backfilled into legacy 3-tag tasks via swegen backfill-tags.
Per-Language Bug-Class Distribution
Top bug classes per mainstream language. Counts are over tasks whose task.toml already carries a 4-tag entry; tasks still on the legacy 3-tag schema do not contribute until the backfill catches up.
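The counting rule above, where legacy 3-tag tasks are skipped, can be sketched with a `Counter`. The function name and sample tag lists are hypothetical; the input is assumed to be one tag list per task for a single language.

```python
from collections import Counter


def top_bug_classes(tag_lists: list[list[str]], n: int = 10) -> list[tuple[str, int, float]]:
    """Return (bug_class, count, share) for the n most common bug classes.

    Only tasks with at least four tags contribute; the share is taken over
    those tasks, so legacy 3-tag tasks do not affect the denominator.
    """
    counts = Counter(tags[3] for tags in tag_lists if len(tags) >= 4)
    total = sum(counts.values())
    return [(name, count, count / total) for name, count in counts.most_common(n)]


# Illustrative data: the second task is still on the legacy 3-tag schema.
tasks = [
    ["go", "networking", "grpc", "missing-fallback"],
    ["go", "cli", "cobra"],
    ["go", "storage", "bolt", "missing-fallback"],
    ["go", "parsing", "yaml", "off-by-one-error"],
]

for name, count, share in top_bug_classes(tasks):
    print(f"{name}: {count} ({share:.0%})")
```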