
LLMs Technology

Full pipeline: LLM design, training, fine-tuning, compression, deployment, hosting, and inference. Optimization points: inference acceleration (operator fusion, KV cache, multi-GPU parallelism, quantization/compression) and checkpoint resumption.

API

PyTorch:

Cloud-Native Elastic AI Training, Part 2: The Design and Implementation of Elastic Distributed Training in PyTorch 1.9.0

Model Download

Hugging Face:

1. git lfs: download into the current directory

git lfs install
git clone https://huggingface.co/Qwen/Qwen2-0.5B

2. Loading a model that is not present locally downloads it automatically, to ~/.cache/huggingface/ by default; cache_dir specifies the download path

from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B", cache_dir = ".")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B", cache_dir = ".")

3. huggingface_hub download, to ~/.cache/huggingface/ by default; cache_dir or local_dir specifies the download path (the two behave differently)

Download a single file:

from huggingface_hub import hf_hub_download
hf_hub_download(repo_id="Qwen/Qwen2-0.5B", filename="model.safetensors", local_dir = ".")

Download an entire repository:

from huggingface_hub import snapshot_download
snapshot_download(repo_id="Qwen/Qwen2-0.5B", local_dir = ".")

4. huggingface-cli command-line download (see the official CLI Guides)

huggingface-cli download gpt2 config.json model.safetensors --local-dir .

(Optional) huggingface-cli login

pip install -U "huggingface_hub[cli]"
huggingface-cli login --token $HUGGINGFACE_TOKEN --add-to-git-credential

ModelScope

1. git lfs: download into the current directory

git lfs install
git clone https://www.modelscope.cn/qwen/Qwen-7B-Chat.git

2. Command-line download of multiple files, to ~/.cache/modelscope by default; cache_dir specifies the path

modelscope download --model 'AI-ModelScope/gpt2' config.json model.safetensors --cache_dir '.'

3. SDK download

from modelscope import snapshot_download
model_dir = snapshot_download('qwen/Qwen-7B-Chat', local_dir = ".")

Embedding

FlagEmbedding

Embedding Projector

OpenAI API

Principles and practice of embeddings

[OpenAI tiktoken, a fast BPE tokenizer](https://blog.csdn.net/lovechris00/article/details/129889317)

Vector Database: chroma, milvus, faiss
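
As a minimal illustration of how such a vector database is used, here is a faiss sketch; the random vectors stand in for real embeddings and the dimension 768 is just an assumption. Inner product over L2-normalized vectors equals cosine similarity.

import numpy as np
import faiss

dim = 768                                                 # embedding dimension (assumption)
doc_vecs = np.random.rand(1000, dim).astype("float32")    # stand-ins for document embeddings
query_vecs = np.random.rand(5, dim).astype("float32")     # stand-ins for query embeddings
faiss.normalize_L2(doc_vecs)
faiss.normalize_L2(query_vecs)

index = faiss.IndexFlatIP(dim)                             # exact inner-product (cosine) search
index.add(doc_vecs)
scores, ids = index.search(query_vecs, 4)                  # top-4 neighbors for each query
print(ids.shape, scores.shape)                             # (5, 4) (5, 4)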

Compiler

tvm

Fine Tuning

LoRA

QLoRA

LLaMA Factory

PEFT

FastChat

unsloth

Firefly

SWIFT

Towards a Unified View of Parameter-Efficient Transfer Learning

LLM fine-tuning techniques: SFT supervised fine-tuning, LoRA, P-tuning v2, and Freeze supervised fine-tuning

LLM fine-tuning techniques: SFT, LoRA, and Freeze supervised fine-tuning

LLM fine-tuning techniques: LoRA and QLoRA

QLoRA: 4-bit quantization plus LoRA, building a personal knowledge base on DB-GPT with a 33B LLM on a single RTX 3090

Fine-tuning LLMs with LoRA and QLoRA: insights from hundreds of experiments

Practical tips for fine-tuning LLMs with LoRA

LoRA or full-parameter fine-tuning for LLMs? An in-depth analysis based on LLaMA 2

How well does Qwen-7B perform? Fine-tuning practice with Firefly, with excellent results
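
To make the LoRA idea concrete, a minimal PEFT sketch is shown below, reusing the Qwen2-0.5B checkpoint from the download section; the target_modules listed are a common choice for attention projections and are an assumption, not taken from any of the articles above.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")
lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                        # rank of the low-rank update
    lora_alpha=16,              # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the small LoRA adapters are trainable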

Quantization

GPTQ

AWQ

SqueezeLLM

FP8 KV Cache

Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA

A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes
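
For reference, a minimal bitsandbytes 4-bit (NF4) loading sketch in the style used by QLoRA; it assumes a CUDA GPU and the bitsandbytes package, and reuses the Qwen2-0.5B checkpoint from above.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 data type from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # do matmuls in bf16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-0.5B", quantization_config=bnb_cfg, device_map="auto"
)
print(model.get_memory_footprint())         # rough weight memory in bytes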

Training / Inference

llama.cpp

llama.cpp, a deployment tool for large models

Transformers

DeepSpeed

The DeepSpeed ZeRO series: pushing GPU memory optimization to the limit

DeepSpeed: Extreme-scale model training for everyone

Getting Started with DeepSpeed for Inferencing Transformer based Models

torchrun

Accelerate

onnxruntime

How to convert a PyTorch model to ONNX and run inference with ONNX Runtime
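
The basic export-and-run flow looks roughly like this; the toy linear model is just a placeholder, since exporting a full LLM involves more steps.

import numpy as np
import torch
import onnxruntime as ort

model = torch.nn.Linear(4, 2).eval()        # toy model standing in for a real network
dummy = torch.randn(1, 4)
torch.onnx.export(
    model, dummy, "toy.onnx",
    input_names=["x"], output_names=["y"],
    dynamic_axes={"x": {0: "batch"}, "y": {0: "batch"}},  # allow variable batch size
)

sess = ort.InferenceSession("toy.onnx", providers=["CPUExecutionProvider"])
(out,) = sess.run(["y"], {"x": np.random.randn(3, 4).astype(np.float32)})
print(out.shape)                            # (3, 2)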

openvino

mindspore

PaddlePaddle

An overview of LLM inference frameworks

A guide to LLM inference and performance

Releasing PyTorch GPU memory: disable gradient computation during inference, and explicitly free unused GPU memory afterwards to avoid memory build-up. The excerpt below is from an embedding backend; EmbeddingBackend, LOCAL_EMBED_PATH, and debug_logger are defined elsewhere in that project, and the base class is assumed to provide self._tokenizer.

import time

import numpy as np
import torch
from transformers import AutoModel

class EmbeddingTorchBackend(EmbeddingBackend):
    def __init__(self, use_cpu: bool):
        super().__init__(use_cpu)
        self.return_tensors = "pt"
        self._model = AutoModel.from_pretrained(LOCAL_EMBED_PATH, return_dict=False)
        if use_cpu:
            self.device = torch.device('cpu')
        else:
            self.device = torch.device('cuda')
        self._model = self._model.to(self.device)
        print("embedding device:", self.device)

    def get_embedding(self, sentences, max_length):
        inputs_pt = self._tokenizer(sentences, padding=True, truncation=True, max_length=max_length,
                                    return_tensors=self.return_tensors)
        inputs_pt = {k: v.to(self.device) for k, v in inputs_pt.items()}
        start_time = time.time()
        with torch.no_grad():  # disable gradient computation for inference
            outputs_pt = self._model(**inputs_pt)
        torch.cuda.empty_cache()  # release unused cached GPU memory
        debug_logger.info(f"torch embedding infer time: {time.time() - start_time}")
        embedding = outputs_pt[0][:, 0].cpu().detach().numpy()
        debug_logger.info(f'embedding shape: {embedding.shape}')
        norm_arr = np.linalg.norm(embedding, axis=1, keepdims=True)
        embeddings_normalized = embedding / norm_arr
        return embeddings_normalized.tolist()

LLM serving engines (performance benchmark):

vllm

TensorRT-LLM

text-generation-inference

lmdeploy
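
As a quick taste of one of these engines, a minimal vLLM offline-inference sketch (assuming the vllm package is installed and a GPU is available; the Qwen2-0.5B checkpoint is reused from the download section):

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2-0.5B")                 # downloads from the Hub if not cached
params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)
outputs = llm.generate(["Hello, my name is"], params)
for out in outputs:
    print(out.outputs[0].text)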

Toolchain

LangChain: a framework for developing applications powered by LLMs

LangChain for Java: Supercharge your Java application with the power of LLMs

LangChain and LangSmith: a dual guide to building and tuning LLM-powered applications

LlamaIndex: data framework for your LLM application

llmware: The Ultimate Toolkit for Building LLM Apps

Ollama: Get up and running with large language models locally

lmstudio.js: LM Studio TypeScript SDK

OpenXLab (浦源)

Toolkit

Weights & Biases

tensorboard

dvc

Kubeflow

mlflow

API Calls

FastAPI

Sanic

OpenAI API

from openai import OpenAI
# import httpx

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
    # # to solve 307 redirection problem
    # http_client=httpx.Client(
    #     base_url="http://localhost:8000/v1",
    #     follow_redirects=True,
    # ),
)

models = client.models.list()
model = models.data[0].id

print(model)

completion = client.chat.completions.create(
    model=model,
    messages=[
        # {"role": "system", "content": ""},
        {"role": "user", "content": "hello"}
    ],
    temperature=0.7,
    max_tokens=4096,
    top_p=0.95,
    frequency_penalty=0,
    presence_penalty=0,
    stop=None
)

print(completion.choices[0].message.content)
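
For token-by-token output, the same request can be made with stream=True; a minimal sketch reusing the client and model above:

stream = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "hello"}],
    temperature=0.7,
    stream=True,
)
for chunk in stream:
    # each chunk carries an incremental delta; the first may contain only the role
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()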

AzureOpenAI API

General usage:

import os
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-02-01",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT")
)

chat_completion = client.chat.completions.create(
    model="DEPLOYMENT_NAME",
    messages=[
        {"role": "user", "content": "x=3, what is x+y?"},
        {"role": "assistant", "content": "Please tell me the value of y"},
        {"role": "user", "content": "y is 100, so what is the result?"}
    ]
)

print(chat_completion.choices[0].message.content)

Only applicable to gpt-35-turbo-instruct:

import os
from openai import AzureOpenAI
    
client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-02-01",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT")
)

deployment_name = 'REPLACE_WITH_YOUR_DEPLOYMENT_NAME'  # the custom name you chose when deploying the model; use a gpt-35-turbo-instruct deployment
    
# Send a completion call to generate an answer
print('Sending a test completion job')
start_phrase = 'Write a tagline for an ice cream shop. '
response = client.completions.create(model=deployment_name, prompt=start_phrase, max_tokens=10)
print(start_phrase+response.choices[0].text)

vllm API Calls Template

Notes on vLLM online inference pitfalls: deploying streaming inference, calling the OpenAI-compatible API, and calling GLM-4-9B-Chat with requests

WebUI

Streamlit

gradio

gradio share=True: expose a public share link
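
A minimal launch sketch (the echo function is just a placeholder):

import gradio as gr

def echo(text):
    return text

demo = gr.Interface(fn=echo, inputs="text", outputs="text")
demo.launch(server_port=8001, share=True)  # share=True requests a public *.gradio.live URL

On a machine where gradio cannot download its frp tunnel client, the launch fails with the error below: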

Could not create share link. Missing file: /data/jiangyy/miniconda3/envs/glm/lib/python3.10/site-packages/gradio/frpc_linux_amd64_v0.2. 

Please check your internet connection. This can happen if your antivirus software blocks the download of this file. You can install manually by following these steps: 

1. Download this file: https://cdn-media.huggingface.co/frpc-gradio-0.2/frpc_linux_amd64
2. Rename the downloaded file to: frpc_linux_amd64_v0.2
3. Move the file to this location: /data/jiangyy/miniconda3/envs/glm/lib/python3.10/site-packages/gradio

Even after downloading the file, it still failed: Could not create share link. Please check your internet connection or our status page: https://status.gradio.app

A further thing to try: https://github.com/gradio-app/gradio/pull/6091

Final fix: chmod +x frpc_linux_amd64_v0.2

Running on local URL:  http://127.0.0.1:8001
Running on public URL: https://c3f57738b574623792.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)

Others

Hyperparams

Analyzing a transformer model's parameter count, compute, intermediate activations, and KV cache
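
A back-of-the-envelope estimate from that kind of analysis: the KV cache holds one key and one value vector per token per layer, so its size is 2 x batch x seq_len x n_layers x hidden_size elements (for models with grouped-query attention, replace hidden_size with the smaller n_kv_heads x head_dim). A small sketch; the example numbers are illustrative, not from the article.

def kv_cache_bytes(batch, seq_len, n_layers, hidden_size, bytes_per_elem=2):
    # factor 2 = keys + values; bytes_per_elem=2 assumes fp16/bf16
    return 2 * batch * seq_len * n_layers * hidden_size * bytes_per_elem

# e.g. batch=1, 4096 tokens, 32 layers, hidden size 4096, fp16 -> about 2.1 GB
print(kv_cache_bytes(1, 4096, 32, 4096) / 1e9)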

batch size

The effect of batch size on training

The difference between batch and epoch in neural networks

RLHF

Implementing RLHF: Learning to Summarize with trlX

RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

A code walkthrough of the RLHF PPO algorithm

offline RL, online RL, on-policy RL, off-policy RL

rl

AGI

Weak-to-Strong Generalization: Eliciting Strong Capabilities with Weak Supervision

Insights

BERT vs. LLMs:

BERT: pros are (1) bidirectional encoding (it attends to both left and right context) and (2) modest data and compute requirements; the drawback is that it only suits understanding tasks.

LLMs: pros are (1) both understanding and generation and (2) handling diverse tasks (translation, summarization, dialogue) without per-task fine-tuning; the drawback is heavy data and compute consumption.

