LLMs Technology
Full pipeline: LLM design, training, fine-tuning, compression, deployment, hosting, and inference. Optimization points: inference acceleration (operator fusion, KV cache, multi-GPU parallelism, quantization/compression) and checkpoint resumption.
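On the checkpoint-resumption point, the idea is simply to persist model, optimizer, and progress state periodically and reload it before continuing training. A minimal PyTorch sketch (the checkpoint path, model, and optimizer are placeholders, not from any particular framework):

```python
import os
import torch

CKPT_PATH = "checkpoint.pt"  # hypothetical path

def save_checkpoint(model, optimizer, step):
    # Persist everything needed to resume: weights, optimizer state, progress.
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step": step,
    }, CKPT_PATH)

def load_checkpoint(model, optimizer):
    # Resume from the last checkpoint if one exists, otherwise start at step 0.
    if not os.path.exists(CKPT_PATH):
        return 0
    ckpt = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]
```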
API
PyTorch:
Cloud-Native Elastic AI Training Series, Part 2: Design and Implementation of Elastic Distributed Training in PyTorch 1.9.0
Model Download
Hugging Face:
1. git lfs: download into the current directory
git lfs install
git clone https://huggingface.co/Qwen/Qwen2-0.5B
2. Loading a model that is not present locally triggers an automatic download, which goes to ~/.cache/huggingface/ by default; use cache_dir to specify a different download path.
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B", cache_dir = ".")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B", cache_dir = ".")
3. Download with huggingface_hub; the default location is ~/.cache/huggingface/, and cache_dir or local_dir specifies the download path (see the docs for the difference between the two).
Download a single file:
from huggingface_hub import hf_hub_download
hf_hub_download(repo_id="Qwen/Qwen2-0.5B", filename="model.safetensors", local_dir = ".")
Download an entire repository:
from huggingface_hub import snapshot_download
snapshot_download(repo_id="Qwen/Qwen2-0.5B", local_dir = ".")
4. Download from the command line with huggingface-cli (see the Guides).
huggingface-cli download gpt2 config.json model.safetensors --local-dir .
(Optional) Log in with huggingface-cli:
pip install -U "huggingface_hub[cli]"
huggingface-cli login --token $HUGGINGFACE_TOKEN --add-to-git-credential
ModelScope:
1. git lfs: download into the current directory
git lfs install
git clone https://www.modelscope.cn/qwen/Qwen-7B-Chat.git
2. Download multiple files from the command line; the default location is ~/.cache/modelscope, and cache_dir specifies a different path.
modelscope download --model 'AI-ModelScope/gpt2' config.json model.safetensors --cache_dir '.'
3. Download via the SDK:
from modelscope import snapshot_download
model_dir = snapshot_download('qwen/Qwen-7B-Chat', local_dir = ".")
Embedding
[OpenAI - tiktoken | fast BPE tokenizer](https://blog.csdn.net/lovechris00/article/details/129889317)
Vector databases: Chroma, Milvus, FAISS
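As a minimal illustration of the vector-database workflow, an in-memory FAISS similarity search (random vectors stand in for real sentence embeddings; the dimension and corpus size are arbitrary assumptions):

```python
import numpy as np
import faiss

dim = 384                                               # embedding dimension (assumed)
corpus = np.random.rand(1000, dim).astype("float32")    # stand-in for document embeddings
query = np.random.rand(1, dim).astype("float32")        # stand-in for a query embedding

index = faiss.IndexFlatL2(dim)       # exact L2 search; IVF/HNSW indexes trade accuracy for speed
index.add(corpus)
distances, ids = index.search(query, 5)  # top-5 nearest neighbours
print(ids[0], distances[0])
```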
Compiler
Fine Tuning
Towards a Unified View of Parameter-Efficient Transfer Learning
LLM fine-tuning techniques: SFT supervised fine-tuning, LoRA, P-tuning v2, and Freeze (parameter-freezing) supervised fine-tuning
LLM fine-tuning techniques: SFT, LoRA, and Freeze supervised fine-tuning
QLoRA: 4-bit quantization + LoRA, building a personal knowledge base for a 33B LLM on DB-GPT with a single RTX 3090
LoRA or full-parameter fine-tuning for large language models? An in-depth analysis based on LLaMA 2
How well does Qwen-7B perform? Fine-tuning practice with Firefly, with excellent results
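For reference, attaching LoRA adapters to a causal LM with the peft library looks roughly like this (a minimal sketch; the base model, target modules, and hyperparameters are illustrative assumptions, not recommendations):

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Base model to adapt (assumed example).
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B", torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                   # rank of the low-rank update
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # which projections get adapters (model-dependent)
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA matrices are trainable
```

Training then proceeds with an ordinary optimizer/Trainer loop; only the adapter weights are updated and saved.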
Quantization
FP8 KV Cache
Making LLMs more accessible with bitsandbytes, 4-bit quantization, and QLoRA
A gentle introduction to 8-bit matrix multiplication for large-scale Transformers, using Hugging Face Transformers, Accelerate, and bitsandbytes
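Loading a model in 4-bit with bitsandbytes via transformers is mostly a configuration change (a minimal sketch; the model name and NF4 settings are illustrative assumptions):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, as used by QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantized matmuls run in bf16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model_id = "Qwen/Qwen2-0.5B"  # assumed example model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```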
Training / Inference
DeepSpeed: Extreme-scale model training for everyone
Getting Started with DeepSpeed for Inferencing Transformer based Models
torchrun
How to convert a PyTorch model to ONNX and run inference with ONNX Runtime (see the sketch below)
A guide to LLM inference and performance
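The ONNX route in outline: export the PyTorch module with torch.onnx.export, then run it with onnxruntime. A minimal sketch using a toy model (names and shapes are placeholders):

```python
import numpy as np
import torch
import onnxruntime as ort

# Toy model standing in for a real network.
model = torch.nn.Linear(16, 4).eval()
dummy = torch.randn(1, 16)

# Export to ONNX; dynamic_axes lets the batch dimension vary at inference time.
torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["input"], output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)

# Run with ONNX Runtime.
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
out = sess.run(None, {"input": np.random.randn(2, 16).astype("float32")})
print(out[0].shape)  # (2, 4)
```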
Freeing PyTorch GPU memory: disable gradient computation during inference, and explicitly release unused cached GPU memory after inference to avoid memory accumulation, e.g.:
import time

import numpy as np
import torch
from transformers import AutoModel

# EmbeddingBackend, LOCAL_EMBED_PATH and debug_logger come from the surrounding project.
class EmbeddingTorchBackend(EmbeddingBackend):
    def __init__(self, use_cpu: bool):
        super().__init__(use_cpu)  # the base class is assumed to set up self._tokenizer
        self.return_tensors = "pt"
        self._model = AutoModel.from_pretrained(LOCAL_EMBED_PATH, return_dict=False)
        if use_cpu:
            self.device = torch.device('cpu')
        else:
            self.device = torch.device('cuda')
        self._model = self._model.to(self.device)
        print("embedding device:", self.device)

    def get_embedding(self, sentences, max_length):
        inputs_pt = self._tokenizer(sentences, padding=True, truncation=True, max_length=max_length,
                                    return_tensors=self.return_tensors)
        inputs_pt = {k: v.to(self.device) for k, v in inputs_pt.items()}
        start_time = time.time()
        with torch.no_grad():  # disable gradient computation for inference
            outputs_pt = self._model(**inputs_pt)
        torch.cuda.empty_cache()  # release unused cached GPU memory
        debug_logger.info(f"torch embedding infer time: {time.time() - start_time}")
        embedding = outputs_pt[0][:, 0].cpu().detach().numpy()  # [CLS] token embeddings
        debug_logger.info(f'embedding shape: {embedding.shape}')
        norm_arr = np.linalg.norm(embedding, axis=1, keepdims=True)
        embeddings_normalized = embedding / norm_arr  # L2-normalize each embedding
        return embeddings_normalized.tolist()
LLM serving engines (performance benchmark):
Toolchain
LangChain: a framework for developing applications powered by LLMs
LangChain for Java: Supercharge your Java application with the power of LLMs
LangChain and LangSmith: a combined guide to building and fine-tuning LLM-powered intelligent applications
LlamaIndex: data framework for your LLM application
llmware: The Ultimate Toolkit for Building LLM Apps
Ollama: Get up and running with large language models locally (a minimal call sketch follows this list)
lmstudio.js: LM Studio TypeScript SDK
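Of these, Ollama is the quickest to try locally: pull a model from the CLI (e.g. `ollama pull llama3`; model name is an assumption) and, with the Ollama server running, query its local REST API directly. A minimal sketch:

```python
import requests

# Assumes an Ollama server is running on the default port 11434
# and that a model named "llama3" has already been pulled.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "hello", "stream": False},
    timeout=120,
)
print(resp.json()["response"])
```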
Toolkit
API Calls
OpenAI API
from openai import OpenAI
# import httpx

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
    # # to solve 307 redirection problem
    # http_client=httpx.Client(
    #     base_url="http://localhost:8000/v1",
    #     follow_redirects=True,
    # ),
)

models = client.models.list()
model = models.data[0].id
print(model)

completion = client.chat.completions.create(
    model=model,
    messages=[
        # {"role": "system", "content": ""},
        {"role": "user", "content": "hello"},
    ],
    temperature=0.7,
    max_tokens=4096,
    top_p=0.95,
    frequency_penalty=0,
    presence_penalty=0,
    stop=None,
)
print(completion.choices[0].message.content)
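The same client also supports token-by-token streaming by passing stream=True; a minimal sketch reusing the client and model from the block above:

```python
stream = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "hello"}],
    temperature=0.7,
    stream=True,  # server sends incremental chunks instead of one final message
)
for chunk in stream:
    delta = chunk.choices[0].delta.content if chunk.choices else None
    if delta:
        print(delta, end="", flush=True)
print()
```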
AzureOpenAI API
General usage:
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-02-01",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT")
)

chat_completion = client.chat.completions.create(
    model="DEPLOYMENT_NAME",
    messages=[
        {
            "role": "user",
            "content": "x=3, what is x+y?",
        },
        {
            "role": "assistant",
            "content": "Please tell me the value of y.",
        },
        {
            "role": "user",
            "content": "y is 100, so what is the result?",
        },
    ]
)
print(chat_completion.choices[0].message.content)
For gpt-35-turbo-instruct only (uses the completions endpoint rather than chat completions):
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-02-01",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT")
)

# This will correspond to the custom name you chose for your deployment
# when you deployed a model. Use a gpt-35-turbo-instruct deployment.
deployment_name = 'REPLACE_WITH_YOUR_DEPLOYMENT_NAME'

# Send a completion call to generate an answer
print('Sending a test completion job')
start_phrase = 'Write a tagline for an ice cream shop. '
response = client.completions.create(model=deployment_name, prompt=start_phrase, max_tokens=10)
print(start_phrase + response.choices[0].text)
vLLM API Calls Template
Pitfalls of vLLM online inference: deploying streaming inference with vLLM, calling the OpenAI-compatible API, calling via requests, with GLM-4-9B-Chat
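Depending on the vLLM version, the OpenAI-compatible server behind the client code above is typically started with `python -m vllm.entrypoints.openai.api_server --model <model> --port 8000` (newer releases also provide `vllm serve <model>`). For offline batch inference without a server, a minimal sketch (the model name and sampling parameters are illustrative assumptions):

```python
from vllm import LLM, SamplingParams

# Load a model directly into the vLLM engine (assumed example model).
llm = LLM(model="Qwen/Qwen2-0.5B")

sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=64)
outputs = llm.generate(["hello", "Write a haiku about GPUs."], sampling)

for out in outputs:
    # Each RequestOutput holds the prompt and one or more generated completions.
    print(out.prompt, "->", out.outputs[0].text)
```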
WebUI
Could not create share link. Missing file: /data/jiangyy/miniconda3/envs/glm/lib/python3.10/site-packages/gradio/frpc_linux_amd64_v0.2.
Please check your internet connection. This can happen if your antivirus software blocks the download of this file. You can install manually by following these steps:
1. Download this file: https://cdn-media.huggingface.co/frpc-gradio-0.2/frpc_linux_amd64
2. Rename the downloaded file to: frpc_linux_amd64_v0.2
3. Move the file to this location: /data/jiangyy/miniconda3/envs/glm/lib/python3.10/site-packages/gradio
After downloading the file it still failed: Could not create share link. Please check your internet connection or our status page: https://status.gradio.app
Worth trying next: https://github.com/gradio-app/gradio/pull/6091
Final fix: chmod +x frpc_linux_amd64_v0.2
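For context, the share link in question is the one produced by launching a Gradio app with share=True. A minimal sketch (the echo handler is just a placeholder); once the frpc binary is in place and executable, launching prints output like the log below:

```python
import gradio as gr

def echo(message: str) -> str:
    # Placeholder handler; a real app would call a model here.
    return message

demo = gr.Interface(fn=echo, inputs="text", outputs="text")
# share=True asks Gradio to tunnel a public *.gradio.live URL via frpc.
demo.launch(server_port=8001, share=True)
```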
Running on local URL: http://127.0.0.1:8001
Running on public URL: https://c3f57738b574623792.gradio.live
This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
Others
Hyperparams
Analyzing Transformer parameter counts, compute (FLOPs), intermediate activations, and KV cache
batch size
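As a back-of-the-envelope illustration of that kind of analysis: with hidden size h, L layers, and 2-byte (fp16/bf16) values, the KV cache stores 2 (K and V) x h values per token per layer, i.e. roughly 4 * b * L * h * (s + n) bytes for batch size b, prompt length s, and n generated tokens. A small sketch with an assumed 7B-class configuration:

```python
# Assumed example configuration, roughly a 7B-class model.
num_layers = 32          # L
hidden_size = 4096       # h
batch_size = 8           # b
seq_len = 2048           # prompt tokens (s) + generated tokens (n)
bytes_per_value = 2      # fp16 / bf16

# K and V each store hidden_size values per token per layer.
kv_cache_bytes = 2 * batch_size * num_layers * hidden_size * seq_len * bytes_per_value
print(f"KV cache: {kv_cache_bytes / 1024**3:.1f} GiB")  # ~8 GiB for these numbers
```

Batch size therefore scales the KV cache (and activation memory) linearly, which is why it is a central serving hyperparameter.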
RLHF
Implementing RLHF: Learning to Summarize with trlX
RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
offline RL, online RL, on-policy RL, off-policy RL
AGI
Weak-to-Strong Generalization: Eliciting Strong Capabilities with Weak Supervision
Insights
BERT vs. LLMs:
BERT advantages: (1) bidirectional encoding (attends to both left and right context); (2) smaller data and compute requirements. Disadvantage: limited to understanding tasks, it cannot generate text.
LLM advantages: (1) both understanding and generation; (2) handles diverse tasks (translation, summarization, dialogue) without per-task fine-tuning. Disadvantage: large data and compute consumption.