Charles Issue Humor

image text translation

“GPT-5, several works
Equivalent to humans in karma “. Bird
Long benchmark release
Reporter Park Chan
Update 2025.09.27 06:47
Prick
N or RAHA
1 Stand for Assembly Line
APE for Last Mile Delivery
Editor: Create High-Energy
Customer Service: EMAL
Recoonse
Goncorpaigonewoahloyrgion “
Audio
UESTING
Order Clerk: Audit Priclr
iconsistencies
Real Estate Agent: Design Sales Brochure
Recreation
Optimize table layou
fornew dcnronertv
For Spring Vendor
ExarnPle gdpval tasks
(Photo = coming)
The right poetry is a human expert -level artificial intelligence (A
I) No matter how much the model can be performed,
It was released. As a result, ‘GPT-5’ rice paddies
Of course, the untropic stitching model also reaches the human level.
I evaluated it as it:
The right time is 25 days (local time)
44 directs in nine industries that contribute to the biggest contribution to
Human experts and tasks
A new Ben compared with the result of A
The skirt ‘GDPVAL’ was released.
Evaluation targets are the fractional taign and spreadsheet tablets
Book briefing, CAD design audio video hornence, etc.
Various results are included: These results are corresponding.
Experts in the field comparisons in the Victory way (Pair
Wise Comparison) scores.
The right time is the first version of the ‘GDPVAL-VO’
Write a report, write journaly articles, and nursing plans
There are a total of 1320 tasks, including lip. Each project is average
Experts who have 14 years of experience are designed and Choi
More than 5 times of verification process
It is secured.
GDPVAL VO: Pairwise Expert Preferences
WINS
Ties
WINS ONLY
6096
Parity with
0o6
Industrial
Expert
5096
40.696
409
34,896
all
28.896
30%6
23.496
24.196
2096
13.79
1096
apt-40
grok
gemini
04-MNINI-High
03-HIGH GPT-5-HIGH
claude
Mode
Comparison of the results of major AL models and human experts (photo-right)
As a result, the level of ‘GPT-5-High’ non-field expert
The ratio is considered to be the same or better as 40.
It was 6%.
Entropic’s ‘Glod Operus 4.7’ records 49%framework.
Relatively higher screens. But the right A
Ipan “Killed is a document pack or slide design, etc.
The strength of the visual expression is high and the score is high
There is a noodle. ”
GPT-5 achievements were 15 months ago, GPT-4O (13.
7%) 3 times the level. “Performance
The improvement is tough. ”
But GDPVAL is the same as the report
Since it is evaluated as a focus on this,
Complex interactions or multi -level work courses
It is said that it cannot be reflected:
In the future, the interactive Week Flow Context axis
Early more than actual work, such as writing enemy repetitive draft drafts
Introduced the benchmark for the benchmark, adding
all:
Aaron Chatter, who led this research frame,
Nomist said, “The model is getting closer to professional levels.
While; Actual workers leave some work in the city
“We will be able to focus on more valuable activities.”
I did:
“AI model is reality
The speed applied to work is getting faster, ”he said.
The development tax will be accelerated. ”
all

In 44 occupations, they are already equal or superior to humans

Fear of AI development.jpg

Leave a Comment Cancel reply