
LLMs Technology

Full pipeline: LLM design, training, fine-tuning, compression, deployment, hosting, and inference. Optimization points: inference acceleration (operator fusion, KV cache, multi-GPU parallelism, quantization/compression) and checkpoint resumption.

API

PyTorch:

Cloud-Native Elastic AI Training, Part 2: The Design and Implementation of Elastic Distributed Training in PyTorch 1.9.0

Model Download

Hugging Face:

1. git lfs: download into the current directory

git lfs install
git clone https://huggingface.co/Qwen/Qwen2-0.5B

2. Loading a model that is not present locally downloads it automatically, to ~/.cache/huggingface/ by default; cache_dir specifies the download path

from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B", cache_dir = ".")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B", cache_dir = ".")

3. huggingface_hub download, to ~/.cache/huggingface/ by default; cache_dir or local_dir specifies the download path (the two behave differently)

Download a single file:

from huggingface_hub import hf_hub_download
hf_hub_download(repo_id="Qwen/Qwen2-0.5B", filename="model.safetensors", local_dir = ".")

Download an entire repository:

from huggingface_hub import snapshot_download
snapshot_download(repo_id="Qwen/Qwen2-0.5B", local_dir = ".")

4. huggingface-cli command-line download (see the official CLI Guides)

huggingface-cli download gpt2 config.json model.safetensors --local-dir .

(Optional) huggingface-cli login

pip install -U "huggingface_hub[cli]"
huggingface-cli login --token $HUGGINGFACE_TOKEN --add-to-git-credential

ModelScope

1. git lfs: download into the current directory

git lfs install
git clone https://www.modelscope.cn/qwen/Qwen-7B-Chat.git

2. Command-line download of multiple files, to ~/.cache/modelscope by default; cache_dir specifies the path

modelscope download --model 'AI-ModelScope/gpt2' config.json model.safetensors --cache_dir '.'

3. SDK download

from modelscope import snapshot_download
model_dir = snapshot_download('qwen/Qwen-7B-Chat', local_dir = ".")

Embedding

FlagEmbedding

Embedding Projector

OpenAI API

Principles and practice of embeddings

[OpenAI tiktoken, a fast BPE tokenizer](https://blog.csdn.net/lovechris00/article/details/129889317)

Vector Database: chroma, milvus, faiss
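
As a minimal illustration of how such a vector database is used, here is a faiss sketch; the random vectors stand in for real embeddings and the dimension 768 is just an assumption. Inner product over L2-normalized vectors equals cosine similarity.

import numpy as np
import faiss

dim = 768                                                 # embedding dimension (assumption)
doc_vecs = np.random.rand(1000, dim).astype("float32")    # stand-ins for document embeddings
query_vecs = np.random.rand(5, dim).astype("float32")     # stand-ins for query embeddings
faiss.normalize_L2(doc_vecs)
faiss.normalize_L2(query_vecs)

index = faiss.IndexFlatIP(dim)                             # exact inner-product (cosine) search
index.add(doc_vecs)
scores, ids = index.search(query_vecs, 4)                  # top-4 neighbors for each query
print(ids.shape, scores.shape)                             # (5, 4) (5, 4)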

Compiler

tvm

Fine Tuning

LoRA

QLoRA

LLaMA Factory

PEFT

FastChat

unsloth

Firefly

SWIFT

Towards a Unified View of Parameter-Efficient Transfer Learning

LLM fine-tuning techniques: SFT supervised fine-tuning, LoRA, P-tuning v2, and Freeze supervised fine-tuning

LLM fine-tuning techniques: SFT, LoRA, and Freeze supervised fine-tuning

LLM fine-tuning techniques: LoRA and QLoRA

QLoRA: 4-bit quantization plus LoRA, building a personal knowledge base on DB-GPT with a 33B LLM on a single RTX 3090

Fine-tuning LLMs with LoRA and QLoRA: insights from hundreds of experiments

Practical tips for fine-tuning LLMs with LoRA

LoRA or full-parameter fine-tuning for LLMs? An in-depth analysis based on LLaMA 2

How well does Qwen-7B perform? Fine-tuning practice with Firefly, with excellent results
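
To make the LoRA idea concrete, a minimal PEFT sketch is shown below, reusing the Qwen2-0.5B checkpoint from the download section; the target_modules listed are a common choice for attention projections and are an assumption, not taken from any of the articles above.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")
lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                        # rank of the low-rank update
    lora_alpha=16,              # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the small LoRA adapters are trainable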

Quantization

GPTQ

AWQ

SqueezeLLM

FP8 KV Cache

Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA

A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes
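
For reference, a minimal bitsandbytes 4-bit (NF4) loading sketch in the style used by QLoRA; it assumes a CUDA GPU and the bitsandbytes package, and reuses the Qwen2-0.5B checkpoint from above.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 data type from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # do matmuls in bf16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-0.5B", quantization_config=bnb_cfg, device_map="auto"
)
print(model.get_memory_footprint())         # rough weight memory in bytes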

Training / Inference

llama.cpp

llama.cpp, a deployment tool for large models

Transformers

DeepSpeed

The DeepSpeed ZeRO series: pushing GPU memory optimization to the limit

DeepSpeed: Extreme-scale model training for everyone

Getting Started with DeepSpeed for Inferencing Transformer based Models

torchrun

Accelerate

onnxruntime

How to convert a PyTorch model to ONNX and run inference with ONNX Runtime
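
The basic export-and-run flow looks roughly like this; the toy linear model is just a placeholder, since exporting a full LLM involves more steps.

import numpy as np
import torch
import onnxruntime as ort

model = torch.nn.Linear(4, 2).eval()        # toy model standing in for a real network
dummy = torch.randn(1, 4)
torch.onnx.export(
    model, dummy, "toy.onnx",
    input_names=["x"], output_names=["y"],
    dynamic_axes={"x": {0: "batch"}, "y": {0: "batch"}},  # allow variable batch size
)

sess = ort.InferenceSession("toy.onnx", providers=["CPUExecutionProvider"])
(out,) = sess.run(["y"], {"x": np.random.randn(3, 4).astype(np.float32)})
print(out.shape)                            # (3, 2)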

openvino

mindspore

PaddlePaddle

An overview of LLM inference frameworks

A guide to LLM inference and performance

Releasing PyTorch GPU memory: disable gradient computation during inference, and explicitly free unused GPU memory afterwards to avoid memory build-up. The excerpt below is from an embedding backend; EmbeddingBackend, LOCAL_EMBED_PATH, and debug_logger are defined elsewhere in that project, and the base class is assumed to provide self._tokenizer.

import time

import numpy as np
import torch
from transformers import AutoModel

class EmbeddingTorchBackend(EmbeddingBackend):
    def __init__(self, use_cpu: bool):
        super().__init__(use_cpu)
        self.return_tensors = "pt"
        self._model = AutoModel.from_pretrained(LOCAL_EMBED_PATH, return_dict=False)
        if use_cpu:
            self.device = torch.device('cpu')
        else:
            self.device = torch.device('cuda')
        self._model = self._model.to(self.device)
        print("embedding device:", self.device)

    def get_embedding(self, sentences, max_length):
        inputs_pt = self._tokenizer(sentences, padding=True, truncation=True, max_length=max_length,
                                    return_tensors=self.return_tensors)
        inputs_pt = {k: v.to(self.device) for k, v in inputs_pt.items()}
        start_time = time.time()
        with torch.no_grad():  # disable gradient computation for inference
            outputs_pt = self._model(**inputs_pt)
        torch.cuda.empty_cache()  # release unused cached GPU memory
        debug_logger.info(f"torch embedding infer time: {time.time() - start_time}")
        embedding = outputs_pt[0][:, 0].cpu().detach().numpy()
        debug_logger.info(f'embedding shape: {embedding.shape}')
        norm_arr = np.linalg.norm(embedding, axis=1, keepdims=True)
        embeddings_normalized = embedding / norm_arr
        return embeddings_normalized.tolist()

LLM serving engines (performance benchmark):

vllm

TensorRT-LLM

text-generation-inference

lmdeploy
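
As a quick taste of one of these engines, a minimal vLLM offline-inference sketch (assuming the vllm package is installed and a GPU is available; the Qwen2-0.5B checkpoint is reused from the download section):

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2-0.5B")                 # downloads from the Hub if not cached
params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)
outputs = llm.generate(["Hello, my name is"], params)
for out in outputs:
    print(out.outputs[0].text)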

Toolchain

LangChain: a framework for developing applications powered by LLMs

LangChain for Java: Supercharge your Java application with the power of LLMs

LangChain and LangSmith: a dual guide to building and tuning LLM-powered applications

LlamaIndex: data framework for your LLM application

llmware: The Ultimate Toolkit for Building LLM Apps

Ollama: Get up and running with large language models locally

lmstudio.js: LM Studio TypeScript SDK

OpenXLab (浦源)

Toolkit

Weights & Biases

tensorboard

dvc

Kubeflow

mlflow

API Calls

FastAPI

Sanic

OpenAI API

from openai import OpenAI
# import httpx

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
    # # to solve 307 redirection problem
    # http_client=httpx.Client(
    #     base_url="http://localhost:8000/v1",
    #     follow_redirects=True,
    # ),
)

models = client.models.list()
model = models.data[0].id

print(model)

completion = client.chat.completions.create(
    model=model,
    messages=[
        # {"role": "system", "content": ""},
        {"role": "user", "content": "hello"}
    ],
    temperature=0.7,
    max_tokens=4096,
    top_p=0.95,
    frequency_penalty=0,
    presence_penalty=0,
    stop=None
)

print(completion.choices[0].message.content)
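
For token-by-token output, the same request can be made with stream=True; a minimal sketch reusing the client and model above:

stream = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "hello"}],
    temperature=0.7,
    stream=True,
)
for chunk in stream:
    # each chunk carries an incremental delta; the first may contain only the role
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()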

AzureOpenAI API

General usage:

import os
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-02-01",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT")
)

chat_completion = client.chat.completions.create(
    model="DEPLOYMENT_NAME",
    messages=[
        {"role": "user", "content": "x=3, what is x+y?"},
        {"role": "assistant", "content": "Please tell me the value of y"},
        {"role": "user", "content": "y is 100, so what is the result?"}
    ]
)

print(chat_completion.choices[0].message.content)

Only applicable to gpt-35-turbo-instruct:

import os
from openai import AzureOpenAI
    
client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-02-01",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT")
)

deployment_name = 'REPLACE_WITH_YOUR_DEPLOYMENT_NAME'  # the custom name you chose when deploying the model; use a gpt-35-turbo-instruct deployment
    
# Send a completion call to generate an answer
print('Sending a test completion job')
start_phrase = 'Write a tagline for an ice cream shop. '
response = client.completions.create(model=deployment_name, prompt=start_phrase, max_tokens=10)
print(start_phrase+response.choices[0].text)

vllm API Calls Template

Notes on vLLM online inference pitfalls: deploying streaming inference, calling the OpenAI-compatible API, and calling GLM-4-9B-Chat with requests

WebUI

Streamlit

gradio

gradio share=True: expose a public share link
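
A minimal launch sketch (the echo function is just a placeholder):

import gradio as gr

def echo(text):
    return text

demo = gr.Interface(fn=echo, inputs="text", outputs="text")
demo.launch(server_port=8001, share=True)  # share=True requests a public *.gradio.live URL

On a machine where gradio cannot download its frp tunnel client, the launch fails with the error below: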

Could not create share link. Missing file: /data/jiangyy/miniconda3/envs/glm/lib/python3.10/site-packages/gradio/frpc_linux_amd64_v0.2. 

Please check your internet connection. This can happen if your antivirus software blocks the download of this file. You can install manually by following these steps: 

1. Download this file: https://cdn-media.huggingface.co/frpc-gradio-0.2/frpc_linux_amd64
2. Rename the downloaded file to: frpc_linux_amd64_v0.2
3. Move the file to this location: /data/jiangyy/miniconda3/envs/glm/lib/python3.10/site-packages/gradio

Even after downloading the file, it still failed: Could not create share link. Please check your internet connection or our status page: https://status.gradio.app

A further thing to try: https://github.com/gradio-app/gradio/pull/6091

Final fix: chmod +x frpc_linux_amd64_v0.2

Running on local URL:  http://127.0.0.1:8001
Running on public URL: https://c3f57738b574623792.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)

Others

Hyperparams

Analyzing a transformer model's parameter count, compute, intermediate activations, and KV cache
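
A back-of-the-envelope estimate from that kind of analysis: the KV cache holds one key and one value vector per token per layer, so its size is 2 x batch x seq_len x n_layers x hidden_size elements (for models with grouped-query attention, replace hidden_size with the smaller n_kv_heads x head_dim). A small sketch; the example numbers are illustrative, not from the article.

def kv_cache_bytes(batch, seq_len, n_layers, hidden_size, bytes_per_elem=2):
    # factor 2 = keys + values; bytes_per_elem=2 assumes fp16/bf16
    return 2 * batch * seq_len * n_layers * hidden_size * bytes_per_elem

# e.g. batch=1, 4096 tokens, 32 layers, hidden size 4096, fp16 -> about 2.1 GB
print(kv_cache_bytes(1, 4096, 32, 4096) / 1e9)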

batch size

The effect of batch size on training

The difference between batch and epoch in neural networks

RLHF

Implementing RLHF: Learning to Summarize with trlX

RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

A code walkthrough of the RLHF PPO algorithm

offline RL, online RL, on-policy RL, off-policy RL

rl

AGI

Weak-to-Strong Generalization: Eliciting Strong Capabilities with Weak Supervision

Insights

BERT vs. LLMs:

BERT: pros are (1) bidirectional encoding (it attends to both left and right context) and (2) modest data and compute requirements; the drawback is that it only suits understanding tasks.

LLMs: pros are (1) both understanding and generation and (2) handling diverse tasks (translation, summarization, dialogue) without per-task fine-tuning; the drawback is heavy data and compute consumption.

