Open-Source AI Cookbook documentation

使用 Hugging Face 和 Milvus 构建 RAG 系统

Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

Open In Colab

使用 Hugging Face 和 Milvus 构建 RAG 系统

作者: 张晨

Milvus 是一个广受欢迎的开源向量数据库,为人工智能应用提供高性能和可扩展的向量相似性搜索。在本教程中,我们将向您展示如何使用 Hugging Face 和 Milvus 构建 RAG(检索增强生成)流程。

RAG 系统将检索系统与 LLM 相结合。该系统首先使用向量数据库 Milvus 从语料库中检索相关文档,然后使用 Hugging Face 的 LLM 根据检索到的文档生成回答。

准备

依赖关系和环境

! pip install --upgrade pymilvus sentence-transformers huggingface-hub langchain_community langchain-text-splitters pypdf tqdm

如果您使用 Google Colab,要启用刚刚安装的依赖项,您可能需要 重新启动运行时(单击屏幕顶部的“运行时”菜单,然后从下拉菜单中选择“重新启动会话”) 。

另外,我们建议您配置您的 Hugging Face User Access Token,并将其设置在您的环境变量中,因为我们将使用 Hugging Face Hub 的 LLM 。如果不设置 token 环境变量,您可能会受到较低的请求速率限制。

import os

os.environ["HF_TOKEN"] = "hf_..."

准备数据

我们使用 AI Act PDF 作为我们 RAG 中的私有知识。这是一个针对人工智能的监管框架,具有不同风险级别,对应或多或少的监管。

%%bash

if [ ! -f "The-AI-Act.pdf" ]; then
    wget -q https://artificialintelligenceact.eu/wp-content/uploads/2021/08/The-AI-Act.pdf
fi

我们使用 LangChain 的 PyPDFLoader 从 PDF 中提取文本,然后将文本拆分为更小的块。默认情况下,我们将块大小设置为 1000,重叠设置为 200,这意味着每个块将有近 1000 个字符,两个块之间的重叠将是 200 个字符。

>>> from langchain_community.document_loaders import PyPDFLoader

>>> loader = PyPDFLoader("The-AI-Act.pdf")
>>> docs = loader.load()
>>> print(len(docs))
108
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(docs)
text_lines = [chunk.page_content for chunk in chunks]

准备 embedding 模型

定义一个函数来生成文本 embedding。我们使用 BGE embedding模型 作为示例,但您也可以将其更改为任何其他 embedding 模型,例如在 MTEB排行榜 上选择。

from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("BAAI/bge-small-en-v1.5")


def emb_text(text):
    return embedding_model.encode([text], normalize_embeddings=True).tolist()[0]

生成测试 embedding 并打印其大小和前几个元素。

>>> test_embedding = emb_text("This is a test")
>>> embedding_dim = len(test_embedding)
>>> print(embedding_dim)
>>> print(test_embedding[:10])
384
[-0.07660683244466782, 0.025316666811704636, 0.012505513615906239, 0.004595153499394655, 0.025780051946640015, 0.03816710412502289, 0.08050819486379623, 0.003035430097952485, 0.02439221926033497, 0.0048803347162902355]

将数据加载到 Milvus 中

创建 collection

from pymilvus import MilvusClient

milvus_client = MilvusClient(uri="./hf_milvus_demo.db")

collection_name = "rag_collection"

对于 MilvusClient 的参数:

  • uri 设置为本地文件,例如 ./hf_milvus_demo.db ,是最方便的方法,因为它会自动使用 Milvus Lite 将所有数据存储在此文件中。
  • 如果您有大量数据,例如超过一百万个向量,您可以在 Docker 或 Kubernetes 上设置性能更高的 Milvus 服务器。在此设置中,请使用服务器 uri,例如 http://localhost:19530 作为您的 uri
  • 如果您想使用 Milvus 的全托管云服务 Zilliz Cloud ,请调整 uritoken,分别对应 Zilliz Cloud 中 Public Endpoint 和 Api key

检查 collection 是否已存在,如果存在则将其删除。

if milvus_client.has_collection(collection_name):
    milvus_client.drop_collection(collection_name)

使用指定参数创建一个新 collection。

如果我们不指定任何字段信息,Milvus 会自动创建一个默认的 id 字段作为主键,并创建一个 vector 字段来存储向量数据。保留的 JSON 字段用于存储未定义 schema 的字段及其值。

milvus_client.create_collection(
    collection_name=collection_name,
    dimension=embedding_dim,
    metric_type="IP",  # Inner product distance
    consistency_level="Strong",  # Strong consistency level
)

插入数据

迭代文本行,创建 embedding,然后将数据插入到 Milvus 中。

这是一个新字段 text,它是 collection schema 中的未定义字段。它将自动添加到保留的 JSON 动态字段中,从更高的层次看来,它可被视为普通字段。

from tqdm import tqdm

data = []

for i, line in enumerate(tqdm(text_lines, desc="Creating embeddings")):
    data.append({"id": i, "vector": emb_text(line), "text": line})

insert_res = milvus_client.insert(collection_name=collection_name, data=data)
insert_res["insert_count"]

构建 RAG

检索查询数据

让我们具体指定一个有关该语料的问题。

question = "What is the legal basis for the proposal?"

在 collection 中搜索问题并检索语义前 3 个匹配项。

search_res = milvus_client.search(
    collection_name=collection_name,
    data=[emb_text(question)],  # Use the `emb_text` function to convert the question to an embedding vector
    limit=3,  # Return top 3 results
    search_params={"metric_type": "IP", "params": {}},  # Inner product distance
    output_fields=["text"],  # Return the text field
)

我们来看看查询的搜索结果

>>> import json

>>> retrieved_lines_with_distances = [(res["entity"]["text"], res["distance"]) for res in search_res[0]]
>>> print(json.dumps(retrieved_lines_with_distances, indent=4))
[
    [
        "EN 6  EN 2. LEGAL  BASIS,  SUBSIDIARITY  AND  PROPORTIONALITY  \n2.1. Legal  basis  \nThe legal basis for the proposal is in the first place Article 114 of the Treaty on the \nFunctioning of the European Union (TFEU), which provides for the adoption of measures to \nensure the establishment and f unctioning of the internal market.  \nThis proposal constitutes a core part of the EU digital single market strategy. The primary \nobjective of this proposal is to ensure the proper functioning of the internal market by setting \nharmonised rules in particular on the development, placing on the Union market and the use \nof products and services making use of AI technologies or provided as stand -alone AI \nsystems. Some Member States are already considering national rules to ensure that AI is safe \nand is developed a nd used in compliance with fundamental rights obligations. This will likely \nlead to two main problems: i) a fragmentation of the internal market on essential elements",
        0.7412998080253601
    ],
    [
        "applications and prevent market fragmentation.  \nTo achieve those objectives, this proposal presents a balanced and proportionate horizontal \nregulatory approach to AI that is limited to the minimum necessary requirements to address \nthe risks and problems linked to AI, withou t unduly constraining or hindering technological \ndevelopment or otherwise disproportionately increasing the cost of placing AI solutions on \nthe market.  The proposal sets a robust and flexible legal framework. On the one hand, it is \ncomprehensive and future -proof in its fundamental regulatory choices, including the \nprinciple -based requirements that AI systems should comply with. On the other hand, it puts \nin place a proportionate regulatory system centred on a well -defined risk -based regulatory \napproach that  does not create unnecessary restrictions to trade, whereby legal intervention is \ntailored to those concrete situations where there is a justified cause for concern or where such",
        0.696428656578064
    ],
    [
        "approach that  does not create unnecessary restrictions to trade, whereby legal intervention is \ntailored to those concrete situations where there is a justified cause for concern or where such \nconcern can reasonably be anticipated in the near future. At the same time, t he legal \nframework includes flexible mechanisms that enable it to be dynamically adapted as the \ntechnology evolves and new concerning situations emerge.  \nThe proposal sets harmonised rules for the development, placement on the market and use of \nAI systems i n the Union following a proportionate risk -based approach. It proposes a single \nfuture -proof definition of AI. Certain particularly harmful AI practices are prohibited as \ncontravening Union values, while specific restrictions and safeguards are proposed in  relation \nto certain uses of remote biometric identification systems for the purpose of law enforcement. \nThe proposal lays down a solid risk methodology to define \u201chigh -risk\u201d AI systems that pose",
        0.6891457438468933
    ]
]

使用 LLM 获取 RAG 回答

在合并到 LLM 提示词之前,我们首先将检索到的文档列表展平为纯字符串。

context = "\n".join([line_with_distance[0] for line_with_distance in retrieved_lines_with_distances])

定义语言模型的提示词。该提示与从 Milvus 检索到的文档组合在一起。

PROMPT = """
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""

我们使用部署在 Hugging Face 推理服务上的 Mixtral-8x7B-Instruct-v0.1 根据提示词生成响应。

from huggingface_hub import InferenceClient

repo_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

llm_client = InferenceClient(model=repo_id, timeout=120)

我们设置提示词格式并生成最后的回答。

prompt = PROMPT.format(context=context, question=question)
>>> answer = llm_client.text_generation(
...     prompt,
...     max_new_tokens=1000,
... ).strip()
>>> print(answer)
The legal basis for the proposal is Article 114 of the Treaty on the Functioning of the European Union (TFEU), which provides for the adoption of measures to ensure the establishment and functioning of the internal market. The proposal aims to establish harmonized rules for the development, placing on the market, and use of AI systems in the Union following a proportionate risk-based approach.

恭喜!您已经使用 Hugging Face 和 Milvus 构建了 RAG 流程。

< > Update on GitHub