OpenCompass
Large Model Evaluation System
An open-source, efficient, and comprehensive large model evaluation system and open platform from Shanghai AI Laboratory
CompassKit
Evaluation Toolkit
Provides a rich set of evaluation benchmarks and model templates as an open-source toolkit, and supports flexible extension to meet a wide range of practical model evaluation needs
View on GitHub →
CompassHub
Benchmark Community
Publish and share evaluation benchmarks and leaderboards within the community; high-quality benchmarks are selected for inclusion in the official evaluation leaderboards
Go to the Community →
CompassRank
Evaluation Leaderboards
Provides comprehensive, objective, and neutral scores and rankings for top-tier large language models and multimodal models. Submit your model API or repository address here and see how it compares with the leading models
View Leaderboards →
Supports both the Large Language Model Leaderboard and the Multi-modal Leaderboard
One-stop Evaluation
NLP
Large Language Model Evaluation
OpenCompass performs an in-depth and holistic assessment of large language models, leveraging an extensive array of rigorously curated benchmarks across eight fundamental dimensions: language comprehension, knowledge precision, logical deduction, creative ideation, mathematical problem-solving, programming proficiency, extended text analysis, and intelligent agent engagement.
How do we evaluate?
The OpenCompass platform supports more than 40 HuggingFace and API models and incorporates a diverse range of 100+ benchmarks with approximately 400,000 questions to evaluate models across eight dimensions. Its efficient distributed evaluation system allows for quick and thorough assessment of billion-scale models. The platform accommodates various evaluation methods, including zero-shot, few-shot, and chain-of-thought evaluations, and features a highly extendable modular design for easy addition of new models, benchmarks, or customized task strategies. Additionally, OpenCompass includes robust experiment management and reporting tools for detailed tracking and real-time result presentation.
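As an illustration of this workflow, here is a minimal sketch of the kind of Python config that drives an OpenCompass evaluation: benchmark and model definitions are imported as config modules and collected into `datasets` and `models` lists. The specific module paths below (GSM8K and an InternLM2 chat model) are assumptions based on the public examples shipped with the toolkit and may differ between versions.

```python
# Minimal sketch of an OpenCompass-style evaluation config (e.g. configs/eval_demo.py).
# The dataset/model module paths are assumptions; check the configs/ directory
# of your OpenCompass checkout for the exact names available in your version.
from mmengine.config import read_base

with read_base():
    # A benchmark definition (GSM8K, covering mathematical problem-solving)
    from .datasets.gsm8k.gsm8k_gen import gsm8k_datasets
    # A model definition (a HuggingFace chat model)
    from .models.hf_internlm.hf_internlm2_chat_7b import models

datasets = [*gsm8k_datasets]
```

With such a config in place, an evaluation is typically launched with `python run.py configs/eval_demo.py -w outputs/demo`; the distributed runner partitions the inference and evaluation tasks and aggregates the scores into a summary report.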
6 Key Features for Professional Evaluation
Open Source and Reproducible
A complete open source and reproducible large language model evaluation system
Rich Model Support
One-stop evaluation support for various types of models (HF models, API models, custom open-source models); a configuration sketch follows this feature list
Distributed and Efficient Evaluation
Provides distributed and efficient evaluation of dozens of benchmarks, supporting rapid evaluation of models with hundreds of billions of parameters
Comprehensive Capability Dimensions
Comprehensive division of capability dimensions and corresponding benchmark support
Flexible Expansion
Supports flexible and convenient addition of evaluation benchmarks and models
Diverse Evaluation Methods
Supports zero-shot, few-shot, and chain-of-thought evaluation
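To make the "Rich Model Support" feature concrete, the fragment below sketches how a locally hosted HuggingFace model and an API-served model can be declared side by side in a single `models` list and evaluated in one run. The class names come from `opencompass.models`; the model paths, abbreviations, and field values are illustrative assumptions rather than a fixed recipe.

```python
# Sketch: mixing a HuggingFace model and an API model in one evaluation run.
# Paths, abbreviations, and batch sizes below are illustrative assumptions.
from opencompass.models import HuggingFaceCausalLM, OpenAI

models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr='llama-3-8b-instruct-hf',            # name shown in result tables
        path='meta-llama/Meta-Llama-3-8B-Instruct',
        max_out_len=256,
        batch_size=8,
        run_cfg=dict(num_gpus=1),
    ),
    dict(
        type=OpenAI,
        abbr='gpt-4o-api',
        path='gpt-4o',                            # API-side model name
        key='ENV',                                # read the key from OPENAI_API_KEY
        max_out_len=256,
        batch_size=8,
    ),
]
```

Custom open-source models follow the same pattern: wrap them in a model class exposing the common generation interface and add the resulting dict to the list.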
Partners
If you are interested in cooperation, please contact us via email:
opencompass@pjlab.org.cn
Aliyun
East Money
Duxiaoman
Fudan University
Nanjing University
Nanyang Technological University
OPPO
Tencent
TigerResearch
The Chinese University of Hong Kong
Zhejiang University
Zhipu·AI
Partner Program
We have launched the OpenCompass Partner Program and welcome more outstanding models to join our leaderboards. We also sincerely invite you to contribute more representative and reliable benchmarks
Evaluation Leaderboards
Only open-source models or models served via external APIs are supported (internal, in-development models are not currently supported)
Benchmark Community
Make your benchmark shine: co-create and share evaluation benchmarks with the community, establish unique leaderboards, and together shape new standards for large model evaluation