SWE Task and Trajectory Progress Dashboard

Last updated: 2026-05-14 03:44:59 BJT | Next refresh: 2026-05-14 04:45:00 BJT | Refresh interval: 3600 seconds

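Total PRs collected: 429,909 (past 1h: 0 / past 24h: +5,976)
Total valid SWE tasks: 29,024 (past 1h: +31 / past 24h: +373)
Overall processing success rate: 17.1% (valid SWE / processed, with 170,099 processed)
Mean difficulty_score: 5.77 (median 5.7, count 28,977)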

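Progress by language

| Language | PRs collected | PRs (past 1h) | PRs (past 24h) | Valid SWE | Valid SWE (past 1h) | Valid SWE (past 24h) | Processed | Success rate |
|---|---|---|---|---|---|---|---|---|
| C (c) | 28,232 | 0 | +30 | 7,734 | 0 | 0 | 26,979 | 28.7% |
| C++ (cpp) | 43,116 | 0 | +50 | 2,017 | +20 | +144 | 19,637 | 10.3% |
| Go (go) | 88,516 | 0 | +908 | 4,483 | 0 | +65 | 26,714 | 16.8% |
| Java (java) | 59,936 | 0 | +47 | 2,652 | 0 | +47 | 13,757 | 19.3% |
| JavaScript (js) | 30,941 | 0 | +155 | 3,674 | +6 | +8 | 14,021 | 26.2% |
| Python (py) | 68,318 | 0 | +235 | 2,518 | 0 | 0 | 19,567 | 12.9% |
| Rust (rust) | 54,661 | 0 | +4,225 | 2,523 | +3 | +51 | 16,486 | 15.3% |
| TypeScript (ts) | 56,189 | 0 | +326 | 3,423 | +2 | +58 | 32,938 | 10.4% |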
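
The success rate column is consistent with valid SWE tasks divided by processed PRs, which is how the summary card describes it. A minimal sketch of that arithmetic, assuming one-decimal rounding (the function name is illustrative and not taken from the dashboard code):

```python
def success_rate(valid_swe: int, processed: int) -> float:
    """Return valid_swe / processed as a percentage rounded to one decimal."""
    if processed <= 0:
        return 0.0
    return round(100.0 * valid_swe / processed, 1)


# Figures taken from the C row and the overall summary.
print(success_rate(7_734, 26_979))    # C
print(success_rate(29_024, 170_099))  # all languages
```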

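Run parameters

| Language | Evaluation model (OPENAI) | Fill model (ANTHROPIC) | Concurrency | min_source_files | max_source_files |
|---|---|---|---|---|---|
| C | gpt-5.4 | claude-sonnet-4-6 | 16 | 2 | 15 |
| C++ | gpt-5.4 | claude-sonnet-4-6 | 16 | 2 | 15 |
| Go | gpt-5.4 | claude-sonnet-4-6 | 16 | 2 | 10 |
| Java | claude-haiku-4-5-20251001 | claude-sonnet-4-6 | 16 | 2 | 10 |
| JavaScript | glmmoedsa | glmmoedsa | 16 | 2 | 10 |
| Python | MiniMax-M2.7 | MiniMax-M2.7 | 16 | 3 | 15 |
| Rust | gpt-5.4 | claude-opus-4-6 | 16 | 2 | 10 |
| TypeScript | gpt-5.4 | claude-sonnet-4-6 | 16 | 2 | 10 |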

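Failure reason statistics

| Language | Processed | Valid SWE | Failed | trivial_pr | validation | infra_error | timeout | workflow_error | Other |
|---|---|---|---|---|---|---|---|---|---|
| C | 26,979 | 7,734 | 19,245 | 12,238 | 2,237 | 2,815 | 297 | 1,655 | 4 |
| C++ | 19,637 | 2,017 | 17,620 | 3,503 | 396 | 14,009 | 227 | 385 | 267 |
| Go | 26,714 | 4,483 | 22,231 | 9,777 | 2,182 | 6,525 | 1,544 | 944 | 1,196 |
| Java | 13,757 | 2,652 | 11,105 | 4,450 | 1,564 | 3,201 | 283 | 254 | 1,601 |
| JavaScript | 14,021 | 3,674 | 10,347 | 6,574 | 1,749 | 521 | 246 | 432 | 826 |
| Python | 19,567 | 2,518 | 17,049 | 7,515 | 2,256 | 8,117 | 229 | 290 | 123 |
| Rust | 16,486 | 2,523 | 13,963 | 4,209 | 1,114 | 5,504 | 613 | 1,504 | 1,234 |
| TypeScript | 32,938 | 3,423 | 29,515 | 5,683 | 922 | 21,391 | 142 | 1,349 | 28 |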

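trivial_pr: the LLM judged the PR too simple to serve as a SWE task (for example, it only changes configuration, documentation, or dependency versions).

validation: verification failed after the task was generated (the NOP agent did not return reward=0, or the ORACLE agent did not return reward=1).

infra_error: infrastructure errors (Docker build failure, network timeout, insufficient disk space, and so on).

timeout: processing timed out (either the overall per-PR timeout or the Claude Code session timeout).

workflow_error: workflow errors (failure to fetch PR metadata, create the worktree, generate the patch, and so on).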
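
The validation rule above can be sketched as a two-sided check: a task is kept only when the do-nothing run fails and the reference solution passes. This is a minimal illustration, assuming rewards arrive as numbers or None when an agent returned nothing; the function name is hypothetical.

```python
from typing import Optional


def passes_validation(nop_reward: Optional[float],
                      oracle_reward: Optional[float]) -> bool:
    """Keep a task only if NOP scores 0 and ORACLE scores 1.

    A missing reward (None) counts as a validation failure.
    """
    return nop_reward == 0 and oracle_reward == 1


print(passes_validation(0, 1))  # tests fail without the fix and pass with it
print(passes_validation(1, 1))  # tests already pass without the fix
```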

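fix.patch complexity

| Language | Valid SWE count | Avg fix.patch lines | Avg fix.patch hunks | Avg fix.patch files |
|---|---|---|---|---|
| C | 7,734 | 281.22 | 15.40 | 4.93 |
| C++ | 2,017 | 323.95 | 11.58 | 4.59 |
| Go | 4,483 | 269.81 | 14.86 | 4.95 |
| Java | 2,652 | 176.40 | 11.26 | 4.55 |
| JavaScript | 3,674 | 73.35 | 6.22 | 2.76 |
| Python | 2,518 | 135.78 | 9.99 | 3.44 |
| Rust | 2,523 | 256.21 | 12.84 | 4.06 |
| TypeScript | 3,423 | 158.87 | 9.07 | 4.08 |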

Methodology

Difficulty scoring (difficulty_score)

For each valid task directory, solution/fix.patch, tests/, and instruction.md are read, and src/swegen/scoring.py computes a static score that makes zero API calls.

The current formula uses continuous log-scale scoring so that medium-sized patches are not classified as hard too early. The weights are: patch_scope 38%, logic_complexity 32%, context_breadth 15%, test_complexity 10%, instruction_complexity 5%.

Label thresholds: easy if score <= 4.0, medium if 4.0 < score <= 7.0, hard if score > 7.0.
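
The thresholds translate directly into a small mapping function. The function name is illustrative; the cut-offs are the ones stated above.

```python
def difficulty_label(score: float) -> str:
    """Map a difficulty_score to its label using the dashboard thresholds."""
    if score <= 4.0:
        return "easy"
    if score <= 7.0:
        return "medium"
    return "hard"


for s in (3.9, 4.0, 5.77, 7.0, 7.4):
    print(s, difficulty_label(s))
```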

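Tag generation and display

Tags are not computed by the dashboard at render time. An LLM generates them from the PR information when swegen builds the task and writes them to [metadata].tags in task.toml.

The prompt asks for tags in three parts: programming language, project layer/domain, and framework/library name or specific topic. The dashboard only reads the existing task.toml files and counts each tag's occurrences and share per language.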

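fix.patch statistics

Patch statistics come from each valid task's solution/fix.patch, with code files filtered by language file extension, consistent with the code-only statistics in upload_march_swe_to_hf.py.

Avg fix.patch lines counts added and deleted lines in code-file diffs; Avg fix.patch hunks counts @@ hunks; Avg fix.patch files counts the code files touched.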
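
The three per-patch counts can be sketched with a small unified-diff scanner. This assumes git-style patches with `diff --git` headers; the extension map is a placeholder, and the real filter in upload_march_swe_to_hf.py may differ.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Assumed extension map, for illustration only.
CODE_EXTENSIONS = {"python": {".py"}, "go": {".go"}, "rust": {".rs"}}


@dataclass
class PatchStats:
    lines: int = 0  # added + deleted lines in code files
    hunks: int = 0  # @@ hunks in code files
    files: int = 0  # code files touched


def patch_stats(patch_text: str, language: str) -> PatchStats:
    """Count changed lines, hunks, and files, restricted to code files."""
    extensions = CODE_EXTENSIONS[language]
    stats = PatchStats()
    counting = False  # current file passes the extension filter
    in_hunk = False   # past the first @@ of the current file
    for line in patch_text.splitlines():
        if line.startswith("diff --git "):
            # Header looks like: diff --git a/path b/path
            path = line.rsplit(" b/", 1)[-1]
            counting = PurePosixPath(path).suffix in extensions
            in_hunk = False
            if counting:
                stats.files += 1
        elif line.startswith("@@"):
            in_hunk = True
            if counting:
                stats.hunks += 1
        elif in_hunk and counting and line[:1] in ("+", "-"):
            # The ---/+++ file headers precede the first @@, so they
            # are never counted as changed lines.
            stats.lines += 1
    return stats
```

Averaging `lines`, `hunks`, and `files` over all valid tasks of a language would then give the three columns of the complexity table.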

difficulty_label distribution

| Language | easy | medium | hard |
|---|---|---|---|
| C | 735 | 5,189 | 1,802 |
| C++ | 329 | 1,249 | 435 |
| Go | 372 | 3,305 | 800 |
| Java | 351 | 1,686 | 612 |
| JavaScript | 593 | 2,730 | 350 |
| Python | 211 | 1,724 | 560 |
| Rust | 260 | 1,477 | 784 |
| TypeScript | 347 | 2,596 | 480 |

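difficulty_score overview

| Language | count | min | p25 | median | mean | p75 | max |
|---|---|---|---|---|---|---|---|
| C | 7,726 | 2.4 | 4.9 | 5.9 | 5.91 | 7.0 | 9.2 |
| C++ | 2,013 | 2.5 | 4.4 | 5.6 | 5.67 | 6.9 | 9.0 |
| Go | 4,477 | 2.6 | 4.9 | 5.8 | 5.81 | 6.7 | 9.1 |
| Java | 2,649 | 2.8 | 4.7 | 5.9 | 5.81 | 6.9 | 9.2 |
| JavaScript | 3,673 | 2.6 | 4.4 | 5.2 | 5.29 | 6.1 | 9.2 |
| Python | 2,495 | 2.6 | 4.9 | 5.8 | 5.90 | 6.9 | 8.9 |
| Rust | 2,521 | 2.7 | 5.0 | 6.2 | 6.11 | 7.4 | 9.0 |
| TypeScript | 3,423 | 2.7 | 4.6 | 5.5 | 5.59 | 6.5 | 8.9 |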

Global top tags

| Tag | Count | Share |
|---|---|---|
| library | 13,092 | 45.2% |
| backend | 8,944 | 30.9% |
| cli | 4,036 | 13.9% |
| frontend | 1,713 | 5.9% |
| testing | 1,269 | 4.4% |
| http | 923 | 3.2% |
| react | 920 | 3.2% |
| framework | 813 | 2.8% |
| embedded | 567 | 2.0% |
| cpp | 396 | 1.4% |
| networking | 362 | 1.2% |
| async | 340 | 1.2% |
| kubernetes | 284 | 1.0% |
| graphql | 232 | 0.8% |
| postgresql | 226 | 0.8% |
| eslint | 217 | 0.7% |
| parsing | 209 | 0.7% |
| aws | 184 | 0.6% |
| angular | 174 | 0.6% |
| kernel | 173 | 0.6% |
| compiler | 172 | 0.6% |
| firmware | 170 | 0.6% |
| quic | 166 | 0.6% |
| git | 165 | 0.6% |
| json | 162 | 0.6% |
| redis | 154 | 0.5% |
| aem | 147 | 0.5% |
| security | 145 | 0.5% |
| rust | 142 | 0.5% |
| tls | 138 | 0.5% |

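Tag distribution by language

C (c)

| Tag | Count | Share |
|---|---|---|
| library | 4,080 | 52.8% |
| backend | 1,884 | 24.4% |
| cli | 938 | 12.1% |
| embedded | 562 | 7.3% |
| cpp | 395 | 5.1% |
| networking | 287 | 3.7% |
| testing | 213 | 2.8% |
| postgresql | 186 | 2.4% |
| kernel | 173 | 2.2% |
| framework | 172 | 2.2% |
| firmware | 170 | 2.2% |
| quic | 161 | 2.1% |
| http | 146 | 1.9% |
| rust | 131 | 1.7% |
| bluetooth | 121 | 1.6% |
| tls | 119 | 1.5% |
| ruby | 113 | 1.5% |
| scheduler | 110 | 1.4% |
| cryptography | 102 | 1.3% |
| python | 88 | 1.1% |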

C++ (cpp)

| Tag | Count | Share |
|---|---|---|
| library | 1,471 | 73.1% |
| backend | 319 | 15.8% |
| testing | 266 | 13.2% |
| cli | 157 | 7.8% |
| framework | 102 | 5.1% |
| boost | 92 | 4.6% |
| http | 78 | 3.9% |
| async | 72 | 3.6% |
| compiler | 41 | 2.0% |
| parsing | 40 | 2.0% |
| serialization | 37 | 1.8% |
| templates | 30 | 1.5% |
| actor-framework | 28 | 1.4% |
| arrayfire | 28 | 1.4% |
| formatting | 27 | 1.3% |
| logging | 23 | 1.1% |
| redis | 22 | 1.1% |
| audio | 20 | 1.0% |
| sparql | 20 | 1.0% |
| iceberg | 19 | 0.9% |

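Go (go)

| Tag | Count | Share |
|---|---|---|
| backend | 2,387 | 53.3% |
| cli | 1,200 | 26.8% |
| library | 994 | 22.2% |
| http | 337 | 7.5% |
| kubernetes | 252 | 5.6% |
| testing | 138 | 3.1% |
| docker | 98 | 2.2% |
| aws | 80 | 1.8% |
| framework | 71 | 1.6% |
| grpc | 70 | 1.6% |
| aws-sdk | 56 | 1.3% |
| dns | 45 | 1.0% |
| git | 45 | 1.0% |
| database | 44 | 1.0% |
| aws-lambda | 38 | 0.8% |
| security | 37 | 0.8% |
| blockchain | 36 | 0.8% |
| terraform | 36 | 0.8% |
| prometheus | 35 | 0.8% |
| templ | 34 | 0.8% |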

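Java (java)

| Tag | Count | Share |
|---|---|---|
| backend | 1,338 | 50.5% |
| library | 1,173 | 44.3% |
| aem | 147 | 5.5% |
| testing | 133 | 5.0% |
| spring | 106 | 4.0% |
| android | 77 | 2.9% |
| framework | 77 | 2.9% |
| http | 77 | 2.9% |
| json | 63 | 2.4% |
| sling | 57 | 2.2% |
| flink | 47 | 1.8% |
| cli | 40 | 1.5% |
| concurrency | 40 | 1.5% |
| grpc | 37 | 1.4% |
| nacos | 34 | 1.3% |
| mybatis | 33 | 1.2% |
| maven | 29 | 1.1% |
| parsing | 27 | 1.0% |
| security | 27 | 1.0% |
| sql-parser | 27 | 1.0% |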

JavaScript (js)

| Tag | Count | Share |
|---|---|---|
| library | 1,799 | 49.0% |
| backend | 762 | 20.7% |
| frontend | 490 | 13.3% |
| cli | 436 | 11.9% |
| testing | 204 | 5.6% |
| react | 180 | 4.9% |
| eslint | 178 | 4.8% |
| framework | 177 | 4.8% |
| fastify | 118 | 3.2% |
| http | 105 | 2.9% |
| mongoose | 97 | 2.6% |
| webpack | 72 | 2.0% |
| typescript | 69 | 1.9% |
| express | 67 | 1.8% |
| lighthouse | 61 | 1.7% |
| svelte | 61 | 1.7% |
| aframe | 54 | 1.5% |
| async | 50 | 1.4% |
| apostrophecms | 48 | 1.3% |
| nodejs | 47 | 1.3% |

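Python (py)

| Tag | Count | Share |
|---|---|---|
| backend | 1,095 | 43.9% |
| library | 916 | 36.7% |
| cli | 418 | 16.7% |
| ansible | 98 | 3.9% |
| fastapi | 78 | 3.1% |
| testing | 77 | 3.1% |
| framework | 66 | 2.6% |
| aiohttp | 62 | 2.5% |
| aws | 56 | 2.2% |
| django | 54 | 2.2% |
| http | 52 | 2.1% |
| click | 46 | 1.8% |
| dbt | 41 | 1.6% |
| black | 40 | 1.6% |
| async | 39 | 1.6% |
| openai | 39 | 1.6% |
| jinja2 | 35 | 1.4% |
| pipx | 34 | 1.4% |
| litellm | 31 | 1.2% |
| pytorch | 31 | 1.2% |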

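Rust (rust)

| Tag | Count | Share |
|---|---|---|
| library | 1,410 | 55.9% |
| cli | 580 | 23.0% |
| backend | 472 | 18.7% |
| testing | 169 | 6.7% |
| http | 99 | 3.9% |
| git | 88 | 3.5% |
| async | 79 | 3.1% |
| compiler | 59 | 2.3% |
| graphql | 58 | 2.3% |
| parsing | 55 | 2.2% |
| datafusion | 53 | 2.1% |
| substrate | 48 | 1.9% |
| sql | 47 | 1.9% |
| actix-web | 45 | 1.8% |
| macros | 45 | 1.8% |
| framework | 42 | 1.7% |
| parquet | 39 | 1.5% |
| blockchain | 31 | 1.2% |
| clap | 28 | 1.1% |
| sqlparser | 28 | 1.1% |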

TypeScript (ts)

| Tag | Count | Share |
|---|---|---|
| library | 1,249 | 36.5% |
| frontend | 1,086 | 31.7% |
| react | 736 | 21.5% |
| backend | 687 | 20.1% |
| cli | 267 | 7.8% |
| angular | 173 | 5.1% |
| graphql | 133 | 3.9% |
| framework | 106 | 3.1% |
| electron | 94 | 2.7% |
| fullstack | 86 | 2.5% |
| testing | 69 | 2.0% |
| github-actions | 59 | 1.7% |
| vue | 55 | 1.6% |
| express | 48 | 1.4% |
| javascript | 45 | 1.3% |
| nextjs | 44 | 1.3% |
| eslint | 39 | 1.1% |
| xstate | 37 | 1.1% |
| react-native | 36 | 1.1% |
| zod | 36 | 1.1% |

Trajectory files: 24
Total trajectories: 65,965
Average message turns: 61.5
Average composite_score: 0.7449 (based on 29,596 scored trajectories)

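Trajectory file overview

| Dataset | Scaffold | Model | Owner | Trajectories | File size | Avg turns | Avg tokens | Avg tool calls | Avg score |
|---|---|---|---|---|---|---|---|---|---|
| swegen | Claude Code | glm5 | chaofan | 1,096 | 446 MB | 72.7 | 39,648 | 37.9 | 0.7218 |
| swegen | OpenCode | glm5 | chaofan | 860 | 197 MB | 64.8 | 30,577 | 34.0 | 0.7337 |
| swegen | OpenHands SDK | glm5 | chaofan | 1,140 | 280 MB | 119.6 | 49,811 | 59.8 | 0.7533 |
| swegen | Terminus-2 | glm5 | chaofan | 1,331 | 142 MB | 53.2 | 26,968 | 41.2 | 0.7048 |
| swerebench_oraclesolved | Claude Code | glm5 | chaofan | 4,053 | 601 MB | 44.1 | 22,117 | 22.7 | — |
| swerebench_oraclesolved | OpenCode | glm5 | chaofan | 3,431 | 405 MB | 41.3 | 19,704 | 21.5 | — |
| swerebench_oraclesolved | OpenHands-AI | glm5 | chaofan | 2,808 | 638 MB | 135.1 | 36,434 | 67.2 | — |
| swerebench_oraclesolved | Terminus-2 | glm5 | chaofan | 2,920 | 291 MB | 55.7 | 25,473 | 36.9 | — |
| swerebench_oraclesolved | Claude Code | glm5 | jierun | 5,013 | 779 MB | 44.7 | 24,123 | 23.3 | — |
| swerebench_oraclesolved | OpenCode | glm5 | jierun | 3,521 | 464 MB | 51.8 | 22,447 | 27.0 | — |
| swerebench_oraclesolved | OpenHands SDK | glm5 | jierun | 2,733 | 507 MB | 95.8 | 35,567 | 47.5 | — |
| swerebench_oraclesolved | Terminus-2 | glm5 | jierun | 2,998 | 296 MB | 55.5 | 25,257 | 37.0 | — |
| swerebench_others | Claude Code | glm5 | jierun | 2,885 | 457 MB | 43.8 | 25,194 | 23.1 | 0.7178 |
| swerebench_others | OpenCode | glm5 | jierun | 2,637 | 360 MB | 49.1 | 23,637 | 25.8 | 0.7210 |
| swerebench_others | OpenHands SDK | glm5 | jierun | 1,843 | 340 MB | 90.2 | 35,602 | 44.8 | 0.7868 |
| swerebench_others | Terminus-2 | glm5 | jierun | 2,270 | 225 MB | 50.6 | 24,990 | 35.1 | 0.7844 |
| swerebenchv2_python_oraclesolved | Claude Code | glm5 | jierun | 2,728 | 445 MB | 47.5 | 26,064 | 25.1 | — |
| swerebenchv2_python_oraclesolved | OpenCode | glm5 | jierun | 1,654 | 240 MB | 60.7 | 23,898 | 31.3 | — |
| swerebenchv2_python_oraclesolved | OpenHands SDK | glm5 | jierun | 1,521 | 313 MB | 103.0 | 40,232 | 51.0 | — |
| swerebenchv2_python_oraclesolved | Terminus-2 | glm5 | jierun | 1,633 | 174 MB | 55.7 | 27,013 | 37.6 | — |
| v2nopy_full | Claude Code | glm5 | jierun | 5,128 | 818 MB | 42.7 | 26,941 | 22.8 | 0.7281 |
| v2nopy_full | OpenCode | glm5 | jierun | 3,465 | 455 MB | 47.0 | 23,172 | 24.8 | 0.7353 |
| v2nopy_full | OpenHands SDK | glm5 | jierun | 3,266 | 620 MB | 92.7 | 38,252 | 46.0 | 0.7580 |
| v2nopy_full | Terminus-2 | glm5 | jierun | 3,675 | 411 MB | 64.1 | 28,365 | 45.6 | 0.7799 |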

Quality score statistics

| Dataset | Scaffold | composite | efficiency | style | tool_mastery | completion | precision |
|---|---|---|---|---|---|---|---|
| swegen | Claude Code | 0.7218 | 0.9125 | 0.3414 | 0.8678 | 0.6213 | 0.7722 |
| swegen | OpenCode | 0.7337 | 0.9322 | 0.3272 | 0.8886 | 0.6288 | 0.7923 |
| swegen | OpenHands SDK | 0.7533 | 0.8975 | 0.3500 | 0.8537 | 0.7737 | 0.7632 |
| swegen | Terminus-2 | 0.7048 | 0.9241 | 0.4228 | 0.8983 | 0.3960 | 0.8866 |
| swerebench_others | Claude Code | 0.7178 | 0.8864 | 0.3197 | 0.7942 | 0.6405 | 0.8924 |
| swerebench_others | OpenCode | 0.7210 | 0.8952 | 0.3171 | 0.8093 | 0.6509 | 0.8621 |
| swerebench_others | OpenHands SDK | 0.7868 | 0.8886 | 0.3693 | 0.7698 | 0.9333 | 0.8525 |
| swerebench_others | Terminus-2 | 0.7844 | 0.8539 | 0.4292 | 0.8172 | 0.8203 | 0.9323 |
| v2nopy_full | Claude Code | 0.7281 | 0.9186 | 0.3213 | 0.8886 | 0.5565 | 0.8996 |
| v2nopy_full | OpenCode | 0.7353 | 0.9278 | 0.3254 | 0.9008 | 0.5766 | 0.8772 |
| v2nopy_full | OpenHands SDK | 0.7580 | 0.9002 | 0.3555 | 0.8440 | 0.7523 | 0.8370 |
| v2nopy_full | Terminus-2 | 0.7799 | 0.9222 | 0.4310 | 0.9016 | 0.6930 | 0.8808 |

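Comparison by dataset

| Dataset | Total trajectories | Avg turns | Avg tokens | Avg tool calls | Avg score |
|---|---|---|---|---|---|
| swegen | 4,427 | 77.4 | 36,691 | 43.8 | 0.7271 |
| swerebench_oraclesolved | 27,477 | 61.8 | 25,724 | 33.3 | — |
| swerebench_others | 9,635 | 55.7 | 26,711 | 30.8 | 0.7475 |
| swerebenchv2_python_oraclesolved | 7,536 | 63.4 | 28,654 | 34.4 | — |
| v2nopy_full | 15,534 | 59.2 | 28,815 | 33.5 | 0.7483 |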

Source data paths

| Dataset | Scaffold | Owner | File path | Size | Entries |
|---|---|---|---|---|---|
| swegen | Claude Code | chaofan | /home/ywxzml3j/ywxzml3juser57/LLaMA-Factory/data/chaofan_jierun_traj/chaofan_glm5_swegen_cc_2135.jsonl | 446 MB | 2,135 |
| swegen | OpenCode | chaofan | /home/ywxzml3j/ywxzml3juser57/LLaMA-Factory/data/chaofan_jierun_traj/chaofan_glm5_swegen_oc_1177.jsonl | 197 MB | 1,177 |
| swegen | OpenHands SDK | chaofan | /home/ywxzml3j/ywxzml3juser57/LLaMA-Factory/data/chaofan_jierun_traj/chaofan_glm5_swegen_ohsdk_1140.jsonl | 280 MB | 1,140 |
| swegen | Terminus-2 | chaofan | /home/ywxzml3j/ywxzml3juser57/LLaMA-Factory/data/chaofan_jierun_traj/chaofan_glm5_swegen_t2_1331.jsonl | 142 MB | 1,331 |
| swerebench_oraclesolved | Claude Code | chaofan | /home/ywxzml3j/ywxzml3juser57/LLaMA-Factory/data/chaofan_jierun_traj/chaofan_glm5_swerebench_oraclesolved_cc_4053.jsonl | 601 MB | 4,053 |
| swerebench_oraclesolved | Claude Code | jierun | /home/ywxzml3j/ywxzml3juser57/LLaMA-Factory/data/chaofan_jierun_traj/jierun_glm5_swerebench_oraclesolved_cc_5013.jsonl | 779 MB | 5,013 |
| swerebench_oraclesolved | OpenCode | chaofan | /home/ywxzml3j/ywxzml3juser57/LLaMA-Factory/data/chaofan_jierun_traj/chaofan_glm5_swerebench_oraclesolved_oc_3431.jsonl | 405 MB | 3,431 |
| swerebench_oraclesolved | OpenCode | jierun | /home/ywxzml3j/ywxzml3juser57/LLaMA-Factory/data/chaofan_jierun_traj/jierun_glm5_swerebench_oraclesolved_oc_3521.jsonl | 464 MB | 3,521 |
| swerebench_oraclesolved | OpenHands-AI | chaofan | /home/ywxzml3j/ywxzml3juser57/LLaMA-Factory/data/chaofan_jierun_traj/chaofan_glm5_swerebench_oraclesolved_oh_2808.jsonl | 638 MB | 2,808 |
| swerebench_oraclesolved | OpenHands SDK | jierun | /home/ywxzml3j/ywxzml3juser57/LLaMA-Factory/data/chaofan_jierun_traj/jierun_glm5_swerebench_oraclesolved_oh_sdk_2733.jsonl | 507 MB | 2,733 |
| swerebench_oraclesolved | Terminus-2 | chaofan | /home/ywxzml3j/ywxzml3juser57/LLaMA-Factory/data/chaofan_jierun_traj/chaofan_glm5_swerebench_oraclesolved_t2_2920.jsonl | 291 MB | 2,920 |
| swerebench_oraclesolved | Terminus-2 | jierun | /home/ywxzml3j/ywxzml3juser57/LLaMA-Factory/data/chaofan_jierun_traj/jierun_glm5_swerebench_oraclesolved_t2_2998.jsonl | 296 MB | 2,998 |
| swerebench_others | Claude Code | jierun | /home/ywxzml3j/ywxzml3juser57/LLaMA-Factory/data/chaofan_jierun_traj/jierun_glm5_swerebench_others_cc_2885.jsonl | 457 MB | 2,885 |
| swerebench_others | OpenCode | jierun | /home/ywxzml3j/ywxzml3juser57/LLaMA-Factory/data/chaofan_jierun_traj/jierun_glm5_swerebench_others_oc_2637.jsonl | 360 MB | 2,637 |
| swerebench_others | OpenHands SDK | jierun | /home/ywxzml3j/ywxzml3juser57/LLaMA-Factory/data/chaofan_jierun_traj/jierun_glm5_swerebench_others_oh_sdk_1843.jsonl | 340 MB | 1,843 |
| swerebench_others | Terminus-2 | jierun | /home/ywxzml3j/ywxzml3juser57/LLaMA-Factory/data/chaofan_jierun_traj/jierun_glm5_swerebench_others_t2_2270.jsonl | 225 MB | 2,270 |
| swerebenchv2_python_oraclesolved | Claude Code | jierun | /home/ywxzml3j/ywxzml3juser57/LLaMA-Factory/data/chaofan_jierun_traj/jierun_glm5_swerebenchv2_python_oraclesolved_cc_2728.jsonl | 445 MB | 2,728 |
| swerebenchv2_python_oraclesolved | OpenCode | jierun | /home/ywxzml3j/ywxzml3juser57/LLaMA-Factory/data/chaofan_jierun_traj/jierun_glm5_swerebenchv2_python_oraclesolved_oc_1654.jsonl | 240 MB | 1,654 |
| swerebenchv2_python_oraclesolved | OpenHands SDK | jierun | /home/ywxzml3j/ywxzml3juser57/LLaMA-Factory/data/chaofan_jierun_traj/jierun_glm5_swerebenchv2_python_oraclesolved_oh_sdk_1521.jsonl | 313 MB | 1,521 |
| swerebenchv2_python_oraclesolved | Terminus-2 | jierun | /home/ywxzml3j/ywxzml3juser57/LLaMA-Factory/data/chaofan_jierun_traj/jierun_glm5_swerebenchv2_python_oraclesolved_t2_1633.jsonl | 174 MB | 1,633 |
| v2nopy_full | Claude Code | jierun | /home/ywxzml3j/ywxzml3juser57/LLaMA-Factory/data/chaofan_jierun_traj/jierun_glm5_v2nopy_full_cc_5128.jsonl | 818 MB | 5,128 |
| v2nopy_full | OpenCode | jierun | /home/ywxzml3j/ywxzml3juser57/LLaMA-Factory/data/chaofan_jierun_traj/jierun_glm5_v2nopy_full_oc_3465.jsonl | 455 MB | 3,465 |
| v2nopy_full | OpenHands SDK | jierun | /home/ywxzml3j/ywxzml3juser57/LLaMA-Factory/data/chaofan_jierun_traj/jierun_glm5_v2nopy_full_oh_sdk_3266.jsonl | 620 MB | 3,266 |
| v2nopy_full | Terminus-2 | jierun | /home/ywxzml3j/ywxzml3juser57/LLaMA-Factory/data/chaofan_jierun_traj/jierun_glm5_v2nopy_full_t2_3675.jsonl | 411 MB | 3,675 |

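Methodology

Average turns / tokens / tool calls

Average turns: the mean length of each trajectory's messages array.

Average tokens: the mean of exact token counts obtained by encoding the content + reasoning_content of every message with the tiktoken cl100k_base tokenizer.

Average tool calls: the mean, over trajectories, of the summed lengths of the tool_calls arrays in assistant messages. For the Terminus-2 scaffold, the length of the commands array in each assistant message's JSON content is counted instead.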
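
The turn and tool-call counts described above can be sketched as follows. This assumes each trajectory is a dict with a `messages` list whose items carry a `role` key (an assumption, since only `messages`, `tool_calls`, `content`, and `commands` are named above); the function names are hypothetical. Token counting is left out because it depends on the external tiktoken package.

```python
import json


def turn_count(trajectory: dict) -> int:
    """Number of messages in one trajectory."""
    return len(trajectory.get("messages", []))


def tool_call_count(trajectory: dict, terminus: bool = False) -> int:
    """Sum tool calls over assistant messages.

    With terminus=True, the assistant content is parsed as JSON and the
    length of its "commands" array is counted instead of tool_calls.
    """
    total = 0
    for message in trajectory.get("messages", []):
        if message.get("role") != "assistant":
            continue
        if terminus:
            try:
                payload = json.loads(message.get("content") or "")
            except (json.JSONDecodeError, TypeError):
                continue  # content is not a JSON string, so nothing to count
            if isinstance(payload, dict):
                total += len(payload.get("commands") or [])
        else:
            total += len(message.get("tool_calls") or [])
    return total


def mean(values: list[int]) -> float:
    """Arithmetic mean, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0
```

Applying `mean` to the per-trajectory results of `turn_count` and `tool_call_count` for one file would give that file's two averages.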

Quality scores

composite_score (0-1) is a weighted combination of five dimensions: efficiency, style, tool_mastery, completion, and precision.

Only some files contain the _score field; files without scores show "—".