Welcome GPT OSS, the new open-source model family from OpenAI! | reach-vb

dimohy · August 6, 2025, 10:52 PM

GPT OSS - A summary of OpenAI's new open-source model family

Overview

GPT OSS is a family of open-weights models that OpenAI announced on August 5, 2025, designed for strong reasoning, agentic tasks, and a wide range of developer use cases.

Core model specifications

gpt-oss-120b: the large model, with 117B parameters
gpt-oss-20b: the small model, with 21B parameters
Architecture: both are Mixture-of-Experts (MoE) models
Quantization: a 4-bit quantization scheme (MXFP4)
License: Apache 2.0

Memory requirements

Large model (120B): runs on a single H100 GPU
Small model (20B): runs within 16GB of memory, which makes it suitable for consumer hardware and on-device applications
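
As a rough sanity check on these figures, the sketch below estimates the size of the weights alone. It assumes every parameter is stored in MXFP4, where 4-bit values are grouped in blocks of 32 that share one 8-bit scale (about 4.25 bits per parameter). That is a simplification: any tensors kept in higher precision, the KV cache, and activations all add to the real footprint, so treat the result as a lower bound.

```python
# Rough weight-memory estimate. ASSUMPTION: all parameters are stored in MXFP4.
# MXFP4 uses 4-bit values in blocks of 32 sharing one 8-bit scale,
# which works out to about 4.25 bits per parameter.
BITS_PER_PARAM = 4 + 8 / 32


def weight_gb(n_params: float, bits_per_param: float = BITS_PER_PARAM) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


estimates = {
    "gpt-oss-120b": weight_gb(117e9),  # parameter counts taken from the spec list
    "gpt-oss-20b": weight_gb(21e9),
}

for name, gb in estimates.items():
    print(f"{name}: ~{gb:.1f} GB of weights")
```

Under this assumption the large model needs a little over 60 GB for weights, which leaves headroom on an 80 GB H100, and the small model needs a little over 11 GB, which is consistent with the 16GB figure.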

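Technical features

Architecture details

Token-choice MoE with SwiGLU activations
Active parameters per token:
- 120B model: 5.1B active parameters
- 20B model: 3.6B active parameters
Attention mechanism:
- RoPE with a 128K context
- Attention layers alternate between full context and a 128-token sliding window
- One learned attention sink per head
Tokenizer: the same tokenizer as GPT-4o and other OpenAI API models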
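
To make the alternating attention pattern concrete, here is a minimal sketch of the two kinds of causal mask. It is not the model's implementation. Which layer type comes first, and whether the window counts the current token, are assumptions made for illustration, and attention sinks are not modeled.

```python
def attention_mask(seq_len, window=None):
    """Return mask[q][k], True when query position q may attend to key position k.

    window=None gives plain causal (full-context) attention.
    window=w restricts each query to itself and the w - 1 tokens before it.
    """
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            visible = k <= q  # causal: never look at future tokens
            if window is not None:
                visible = visible and (q - k) < window
            row.append(visible)
        mask.append(row)
    return mask


def layer_masks(num_layers, seq_len, window=128):
    # ASSUMPTION: even-indexed layers use the sliding window, odd-indexed use full context.
    return [
        attention_mask(seq_len, window if i % 2 == 0 else None)
        for i in range(num_layers)
    ]


# Toy sizes so the pattern is easy to read: 2 layers, 6 tokens, window of 3.
toy = layer_masks(num_layers=2, seq_len=6, window=3)
last = 5
print("sliding window:", [int(v) for v in toy[0][last]])
print("full context:  ", [int(v) for v in toy[1][last]])
```

With the real 128-token window, a query deep into a long prompt sees only 128 keys in the sliding-window layers, while the full-context layers still see every earlier token.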

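API access

Access through Inference Providers

Hugging Face's Inference Providers service lets you send requests to a variety of providers with the same code. In the examples here, the provider is chosen by the suffix after the colon in the model name.

Python example (using the Cerebras provider)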

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.environ["HF_TOKEN"],
)

completion = client.chat.completions.create(
    model="openai/gpt-oss-120b:cerebras",
    messages=[
        {
            "role": "user",
            "content": "How many rs are in the word 'strawberry'?",
        }
    ],
)

print(completion.choices[0].message)

Responses API example (using the Fireworks AI provider)

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.getenv("HF_TOKEN"),
)

response = client.responses.create(
    model="openai/gpt-oss-20b:fireworks-ai",
    input="How many rs are in the word 'strawberry'?",
)

print(response)

Local inference setup

Using Transformers

Basic installation requirements

pip install --upgrade accelerate transformers kernels

Advanced installation (PyTorch 2.8 + Triton 3.4)

# Install PyTorch 2.8 (optional)
pip install torch==2.8.0 --index-url https://download.pytorch.org/whl/test/cu128

# Install the triton kernels for mxfp4 support
pip install git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels

Basic inference code (20B model)

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))

Optimization options

Flash Attention 3 (for Hopper GPUs)

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
    # Flash Attention with Sinks
    attn_implementation="kernels-community/vllm-flash-attn3",
)

MegaBlocks MoE kernels (for other GPUs)

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
    # Optimize MoE layers with downloadable MegaBlocksMoeMLP
    use_kernels=True,
)

GPU compatibility and optimization summary

GPU type	mxfp4	Flash Attention 3	MegaBlocks MoE
Hopper (H100, H200)
Blackwell (GB200, 50xx)
Other CUDA GPUs
AMD Instinct (MI3XX)

Multi-GPU setup

Example of running the 120B model on 4 GPUs:

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.distributed import DistributedConfig
import torch

model_path = "openai/gpt-oss-120b"
tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side="left")

device_map = {
    "tp_plan": "auto",    # Enable Tensor Parallelism
}

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    attn_implementation="kernels-community/vllm-flash-attn3",
    **device_map,
)

messages = [
    {"role": "user", "content": "Explain how expert parallelism works in large language models."}
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1000)

# Decode and print
response = tokenizer.decode(outputs[0])
print("Model response:", response.split("<|channel|>final<|message|>")[-1].strip())

Save the script as generate.py, then run it from a terminal: torchrun --nproc_per_node=4 generate.py

Other inference frameworks

llama.cpp

Native support for MXFP4 and Flash Attention
Supports the Metal, CUDA, and Vulkan backends

Installation and usage

# macOS
brew install llama.cpp

# Windows
winget install llama.cpp

# Start the server
llama-server -hf ggml-org/gpt-oss-120b-GGUF -c 0 -fa --jinja --reasoning-format none

vLLM

Provides optimized Flash Attention 3 kernels
Supports the Chat Completions and Responses APIs

# Start the server (assuming 2 H100 GPUs)
vllm serve openai/gpt-oss-120b --tensor-parallel-size 2

Using it directly from Python:

from vllm import LLM
llm = LLM("openai/gpt-oss-120b", tensor_parallel_size=2)
output = llm.generate("San Francisco is a")

transformers serve

transformers serve

Responses API request

curl -X POST http://localhost:8000/v1/responses \
-H "Content-Type: application/json" \
-d '{"input": [{"role": "system", "content": "hello"}], "temperature": 1.0, "stream": true, "model": "openai/gpt-oss-120b"}'

Chat Completions API request

curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "system", "content": "hello"}], "temperature": 1.0, "max_tokens": 1000, "stream": true, "model": "openai/gpt-oss-120b"}'

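Fine-tuning

GPT OSS is fully integrated with TRL, and the following examples are available:

LoRA example: the OpenAI cookbook shows how to fine-tune for multilingual reasoning
Basic fine-tuning script: a script you can adapt to your own needs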

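Cloud deployment options

Azure AI Model Catalog

Offers both the GPT OSS 20B and 120B models
Real-time inference through managed online endpoints
Builds on Azure's enterprise-grade infrastructure, autoscaling, and monitoring

Dell Enterprise Hub

A secure online portal for on-premises deployment
Provides containers optimized for Dell platforms
Native support for Dell hardware and enterprise-grade security features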

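Model evaluation

Important considerations

Reasoning model behavior: requires a very large generation size (maximum number of new tokens)
Reasoning phase: a generation contains the reasoning trace first, followed by the actual answer
Evaluation caveat: if the generation size is too small, the output may be cut off in the middle of the reasoning

Evaluation example using lighteval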

git clone https://github.com/huggingface/lighteval
cd lighteval
pip install -e .[dev]

lighteval accelerate \
    "model_name=openai/gpt-oss-20b,max_length=16384,skip_special_tokens=False,generation_parameters={temperature:1,top_p:1,top_k:40,min_p:0,max_new_tokens:16384}" \
    "extended|ifeval|0|0,lighteval|aime25|0|0" \
    --save-details --output-dir "openai_scores" \
    --remove-reasoning-tags --reasoning-tags="[('<|channel|>analysis<|message|>','<|end|><|start|>assistant<|channel|>final<|message|>')]"

Expected performance of the 20B model

IFEval (strict prompt): 69.5 (+/-1.9)
AIME25 (pass@1): 63.3 (+/-8.9)
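
The difference between the two error margins is largely a matter of benchmark size. The sketch below assumes the ± values are standard errors of the mean over per-item 0/1 scores, and uses 30 problems for AIME25 and 541 prompts for IFEval. Both the interpretation and the helper name `binary_stderr` are assumptions for illustration, not something the post states.

```python
import math


def binary_stderr(accuracy: float, n_items: int) -> float:
    """Sample standard error of the mean for n_items scores that are each 0 or 1."""
    # Sample variance of 0/1 scores with mean p is p * (1 - p) * n / (n - 1).
    variance = accuracy * (1 - accuracy) * n_items / (n_items - 1)
    return math.sqrt(variance / n_items)


# ASSUMPTION: benchmark sizes of 30 (AIME25) and 541 (IFEval).
aime_se = binary_stderr(0.633, 30)
ifeval_se = binary_stderr(0.695, 541)

print(f"AIME25 standard error: {aime_se * 100:.1f} points")
print(f"IFEval standard error: {ifeval_se * 100:.1f} points")
```

Under these assumptions both estimates land close to the reported margins, which suggests that the wide AIME25 interval mostly reflects its small number of problems rather than unstable model behavior.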

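Chat and chat templates

Understanding the channel structure

GPT OSS uses the concept of “channels” in its outputs:

“analysis” channel: chain of thought, reasoning traces, and other content not meant to be shown to the end user
“final” channel: the message that is actually shown to the user

Example output structure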

<|start|>assistant<|channel|>analysis<|message|>CHAIN_OF_THOUGHT<|end|><|start|>assistant<|channel|>final<|message|>ACTUAL_MESSAGE
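
A minimal sketch of how raw decoded text in this shape could be separated into its two channels using plain string operations. The helper name `split_channels` is hypothetical, and the sketch only handles the markers shown in the structure above.

```python
ANALYSIS_TAG = "<|channel|>analysis<|message|>"
FINAL_TAG = "<|channel|>final<|message|>"
END_TAG = "<|end|>"


def split_channels(raw: str) -> dict:
    """Split raw decoded output into its analysis and final parts.

    "final" is None when the final channel never started, which usually
    means generation stopped while the model was still reasoning.
    """
    analysis, final = "", None
    if ANALYSIS_TAG in raw:
        after_tag = raw.split(ANALYSIS_TAG, 1)[1]
        analysis = after_tag.split(END_TAG, 1)[0]
    if FINAL_TAG in raw:
        final = raw.split(FINAL_TAG, 1)[1].split(END_TAG, 1)[0]
    return {
        "analysis": analysis.strip(),
        "final": None if final is None else final.strip(),
    }


raw = (
    "<|start|>assistant<|channel|>analysis<|message|>CHAIN_OF_THOUGHT<|end|>"
    "<|start|>assistant<|channel|>final<|message|>ACTUAL_MESSAGE"
)
parsed = split_channels(raw)
print(parsed["final"])

# A generation cut off mid-reasoning has no final channel at all.
truncated = split_channels("<|start|>assistant<|channel|>analysis<|message|>Still thinking")
print(truncated["final"])
```

Returning None for a missing final channel gives an evaluation loop a cheap way to flag samples whose generation budget was too small.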

Chat format for training

chat = [
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "Can you think about this one?"},
    {"role": "assistant", "thinking": "Thinking real hard...", "content": "Okay!"}
]

inputs = tokenizer.apply_chat_template(chat, add_generation_prompt=False)

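System and developer messages

GPT OSS distinguishes between “system” messages and “developer” messages:

system message: strictly formatted, containing the current date, the model identity, the reasoning effort level, and so on
developer message: more free-form

In the chat template, a message with the “system” role is rendered as a developer message, while system message fields are set through arguments such as model_identity and reasoning_effort: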

chat = [
    {"role": "system", "content": "This will actually become a developer message!"}
]

tokenizer.apply_chat_template(
    chat, 
    model_identity="You are OpenAI GPT OSS.",
    reasoning_effort="high"  # one of "low", "medium" (default), or "high"
)

Tool use

Supported tool types

Built-in tools: browser, python
Custom tools: JSON schemas or Python functions with type hints

Tool use example

def get_current_weather(location: str):
    """
    Returns the current weather status at a given location as a string.

    Args:
        location: The location to get the weather for.
    """
    return "Terrestrial."

chat = [
    {"role": "user", "content": "What's the weather in Paris right now?"}
]

inputs = tokenizer.apply_chat_template(
    chat, 
    tools=[get_current_weather], 
    builtin_tools=["browser", "python"],
    add_generation_prompt=True,
    return_tensors="pt"
)

Handling tool calls

tool_call_message = {
    "role": "assistant",
    "tool_calls": [
        {
            "type": "function",
            "function": {
                # Must match the name of the function passed in tools=[...]
                "name": "get_current_weather",
                "arguments": {"location": "Paris, France"}
            }
        }
    ]
}
chat.append(tool_call_message)

tool_output = get_current_weather("Paris, France")

tool_result_message = {
    "role": "tool",
    "content": tool_output
}
chat.append(tool_result_message)
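
The manual steps above can be generalized with a small dispatcher that looks up each requested function by name and turns its return value into a tool message. This is a sketch under stated assumptions: `TOOLS` and `run_tool_calls` are hypothetical names, and the message shape simply mirrors the dictionaries shown above.

```python
import json


def get_current_weather(location: str) -> str:
    """Stand-in tool with the same behavior as the example function."""
    return "Terrestrial."


# Hypothetical registry mapping tool names to Python callables.
TOOLS = {"get_current_weather": get_current_weather}


def run_tool_calls(assistant_message: dict, tools: dict) -> list:
    """Execute each requested function call and return the tool messages."""
    results = []
    for call in assistant_message.get("tool_calls", []):
        function = call["function"]
        name, arguments = function["name"], function["arguments"]
        if isinstance(arguments, str):
            # Some stacks serialize the arguments as a JSON string.
            arguments = json.loads(arguments)
        if name not in tools:
            content = f"Error: unknown tool '{name}'"
        else:
            content = str(tools[name](**arguments))
        results.append({"role": "tool", "content": content})
    return results


message = {
    "role": "assistant",
    "tool_calls": [
        {
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "arguments": {"location": "Paris, France"},
            },
        }
    ],
}
tool_messages = run_tool_calls(message, TOOLS)
print(tool_messages)
```

Reporting an unknown tool name back as a tool message, rather than raising, lets the model see the failure and correct itself on the next turn.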

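Practical tips and caveats

Memory optimization

Hopper GPUs (H100, H200): mxfp4 combined with Flash Attention 3 is recommended
Other GPUs: MegaBlocks MoE kernels are recommended
AMD hardware: MegaBlocks acceleration is available through ROCm platform support

Evaluation caveats

Reasoning tag filtering: skip_special_tokens=False is required so that the channel tags remain in the decoded text
Sufficient generation length: set max_new_tokens high enough that the reasoning is not cut off
Before computing metrics: strip the reasoning trace from the model's answer

Multi-turn conversation training

Chain-of-thought (CoT) handling: keep only the CoT of the final turn
Label masking: unmask only the final assistant turn
Sample splitting: split each full multi-turn conversation into one sample per assistant turn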
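
The splitting and CoT rules for multi-turn training can be sketched as follows. The function name `split_by_assistant_turn` is hypothetical, and the message format assumes the "thinking" key used in the training chat example. Label masking is not implemented here; in each sample, the loss would be computed only on the last message.

```python
import copy


def split_by_assistant_turn(chat: list) -> list:
    """Build one training sample per assistant turn.

    Each sample is the conversation up to and including that turn, with
    "thinking" kept only on the last (target) assistant message.
    """
    samples = []
    for i, message in enumerate(chat):
        if message["role"] != "assistant":
            continue
        prefix = copy.deepcopy(chat[: i + 1])  # never mutate the caller's data
        for earlier in prefix[:-1]:
            earlier.pop("thinking", None)  # drop CoT from earlier turns
        samples.append(prefix)
    return samples


chat = [
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "thinking": "A greeting.", "content": "Hello!"},
    {"role": "user", "content": "Can you think about this one?"},
    {"role": "assistant", "thinking": "Thinking real hard...", "content": "Okay!"},
]

samples = split_by_assistant_turn(chat)
for sample in samples:
    target = sample[-1]
    print(len(sample), "messages, target:", target["content"])
```

Each resulting sample can then be passed through the chat template on its own, so the model only ever sees the chain of thought for the turn it is being trained to produce.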

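Contributors and acknowledgments

This release was a collaboration across many teams and companies:

Main contributing teams

Open-source team: Cyril, Lysandre, Arthur, Marc, Mohammed, Nouamane, Harry, Benjamin, Matt
TRL team: Ed, Lewis, Quentin
Evaluation team: Clémentine
Kernels team: David, Daniel
Commercial partnerships: Simon, Alvaro, Jeff, Akos, Ivar
Hub and product teams: Simon, Célina, Pierric, Lucain, Xuan-Son, Chunte, Julien
Legal team: Magda, Anna

Partners

vLLM: for advancing the field
Inference providers: for offering simpler ways to build
OpenAI: for deciding to release the models to the community

This post is a usage guide for the GPT OSS models, covering everything from installation to advanced usage.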