Vector filtering: EmbeddingsFilter

Running every retrieved document through a large language model (LLM) is expensive and slow.

EmbeddingsFilter offers a cheaper, faster option.

It converts the documents and your question into embeddings (numerical vector representations), then keeps only the documents whose embeddings are close to the question's, so most documents never have to bother the big model at all.

You set a threshold on the question-document relevance score (0 to 1), and anything below it is filtered out. For example:

question = "What is Task Decomposition?"

# Fetch and parse the URL
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
docs = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
docs = text_splitter.split_documents(docs)

# Index the data (only needed once; comment out on later runs)
# vectorstore.add_documents(docs)

# EmbeddingsFilter: keep documents whose similarity to the question is at least 0.7
embeddings_filter = EmbeddingsFilter(embeddings=embeddings_model, similarity_threshold=0.7)

# Filter documents directly
filtered_docs = embeddings_filter.compress_documents(docs[:10], question)

# Or wrap the base retriever so filtering happens on every retrieval
embedding_filter_retriever = ContextualCompressionRetriever(
    base_compressor=embeddings_filter, base_retriever=retriever
)
embedding_filter_docs = embedding_filter_retriever.get_relevant_documents(question)
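Under the hood the filter is essentially a cosine-similarity cutoff. A minimal sketch of the idea, assuming numpy and any LangChain-compatible embeddings model (filter_by_similarity is a hypothetical helper, not LangChain's internal code):

import numpy as np

def filter_by_similarity(question, docs, embeddings_model, threshold=0.7):
    # Embed the query and every document (standard Embeddings interface)
    q = np.array(embeddings_model.embed_query(question))
    d = np.array(embeddings_model.embed_documents([doc.page_content for doc in docs]))
    # Cosine similarity of each document against the query
    sims = d @ q / (np.linalg.norm(d, axis=1) * np.linalg.norm(q))
    # Keep only documents that clear the threshold
    return [doc for doc, s in zip(docs, sims) if s >= threshold]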

Document deduplication: EmbeddingsRedundantFilter


The filter looks at the upper triangle of the pairwise document-similarity matrix; any pair whose similarity exceeds the threshold (similarity_threshold: float = 0.95 by default) is considered redundant, and the duplicate is dropped.

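To make the mechanism concrete, here is a rough sketch of that upper-triangle logic, assuming numpy (deduplicate is a hypothetical helper, not the library's implementation):

import numpy as np

def deduplicate(docs, embeddings_model, threshold=0.95):
    vecs = np.array(embeddings_model.embed_documents([d.page_content for d in docs]))
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize
    sim = vecs @ vecs.T  # pairwise cosine-similarity matrix
    kept = []
    for i in range(len(docs)):
        # Drop document i if it is near-identical to any earlier kept document
        if all(sim[i, j] < threshold for j in kept):
            kept.append(i)
    return [docs[i] for i in kept]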

Usage

docs = [
    Document(page_content="粥余知识库"),
    Document(page_content="粥余知识库"),
    Document(page_content="同学你好!"),
    Document(page_content="同学你好!"),
]
redundant_filter = EmbeddingsRedundantFilter(embeddings=embeddings_model)
filtered_docs = redundant_filter.transform_documents(docs)

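With the default 0.95 threshold, each pair of identical documents collapses to a single copy, so only two of the four documents come back.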


One-stop document processing: DocumentCompressorPipeline
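DocumentCompressorPipeline chains multiple compressors and document transformers into a single compressor, passing the retrieved documents through each stage in order: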

# Deduplicate near-identical chunks
redundant_filter = EmbeddingsRedundantFilter(embeddings=embeddings_model)

# Filter by semantic similarity between the question and each chunk
relevant_filter = EmbeddingsFilter(embeddings=embeddings_model, similarity_threshold=0.6)

# Compress what remains with an LLM extractor
qianfan_compressor = LLMChainExtractor.from_llm(qianfan_chat)

# Build the pipeline: dedup -> relevance filter -> LLM compression
pipeline_compressor = DocumentCompressorPipeline(
    transformers=[redundant_filter, relevant_filter, qianfan_compressor]
)

# Build a contextual-compression retriever
compression_retriever = ContextualCompressionRetriever(
    base_compressor=pipeline_compressor, base_retriever=retriever
)
# Retrieve documents
compressed_docs = compression_retriever.get_relevant_documents(question)
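The ordering is deliberate: deduplication and the embeddings filter are cheap vector operations that shrink the batch first, so the expensive LLM extractor only sees the few chunks that survive.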


Long-Context Reorder: reordering multiple retrieved documents

Related paper: "Lost in the Middle: How Language Models Use Long Contexts" (https://arxiv.org/abs/2307.03172)
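The paper shows that models use information at the beginning and end of a long context far more reliably than information buried in the middle; LongContextReorder therefore moves the most relevant documents to the two ends of the list and the least relevant into the middle.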


Usage

# Reorder the retrieved documents
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
docs_pre = retriever.get_relevant_documents('Decomposition')
reordering = LongContextReorder()
reordered_docs = reordering.transform_documents(docs_pre)
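The reordering pattern is easy to see on a toy list. A sketch that mirrors the "lost in the middle" strategy, assuming the input is sorted most-relevant-first as retrievers return it (litm_reorder is a hypothetical name):

def litm_reorder(docs):
    # Most relevant items end up at the two ends; least relevant in the middle.
    result = []
    for i, doc in enumerate(reversed(docs)):
        if i % 2 == 1:
            result.append(doc)      # send to the back
        else:
            result.insert(0, doc)   # send to the front
    return result

print(litm_reorder([1, 2, 3, 4, 5]))  # [1, 3, 5, 4, 2]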


Before reordering:

[Figure: document order before reordering]

After reordering:

[Figure: document order after reordering]

Caveats for long contexts

See LangChain's multi-needle-in-a-haystack study: https://blog.langchain.dev/multi-needle-in-a-haystack/
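The post inserts several facts ("needles") at varying depths in a long document and measures how reliably a model can retrieve all of them.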

In long prompts (even up to around 1 MB of text), where a fact is placed changes how reliably it is recalled.

[Figure: multi-needle retrieval results]

[Figure: an example of correct recall]

[Figures: examples of incomplete recall]

[Figure: recall when the facts sit at different document depths]

Full code

from uuid import uuid4
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import (
    LLMChainExtractor, LLMChainFilter, EmbeddingsFilter, DocumentCompressorPipeline,
)
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.chat_models import AzureChatOpenAI
from langchain_community.chat_models.baidu_qianfan_endpoint import QianfanChatEndpoint
from langchain_community.document_loaders.web_base import WebBaseLoader
from langchain_community.document_transformers import EmbeddingsRedundantFilter, LongContextReorder
from langchain_community.llms.chatglm3 import ChatGLM3
from langchain_community.vectorstores.elasticsearch import ElasticsearchStore
import os
from langchain_community.embeddings import QianfanEmbeddingsEndpoint, HuggingFaceEmbeddings
from langchain_core.documents import Document
from langchain_core.messages import SystemMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_text_splitters import RecursiveCharacterTextSplitter
from zhipuai import ZhipuAI

# LangSmith tracing (optional; comment out if unused)
unique_id = uuid4().hex[0:8]
os.environ["LANGCHAIN_PROJECT"] = f"Document Processing Pipeline Walkthrough - {unique_id}"
os.environ["LANGCHAIN_TRACING_V2"] = 'true'
os.environ["LANGCHAIN_API_KEY"] = os.getenv('MY_LANGCHAIN_API_KEY')

# Local BGE embedding model
bge_en_v1p5_model_path = "D:\\LLM\\Bge_models\\bge-base-en-v1.5"

# Run the embeddings on GPU
embeddings_model = HuggingFaceEmbeddings(
    model_name=bge_en_v1p5_model_path,
    model_kwargs={'device': 'cuda:0'},
    encode_kwargs={'batch_size': 32, 'normalize_embeddings': True},
)

# Vector store (Elasticsearch)
vectorstore = ElasticsearchStore(
    es_url=os.environ['ELASTIC_HOST_HTTP'],
    index_name="index_sd_1024_vectors",
    embedding=embeddings_model,
    es_user="elastic",
    vector_query_field='question_vectors',
    es_password=os.environ['ELASTIC_ACCESS_PASSWORD']
)

retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Azure OpenAI
os.environ["AZURE_OPENAI_API_KEY"] = os.getenv('MY_AZURE_OPENAI_API_KEY')
os.environ["AZURE_OPENAI_ENDPOINT"] = os.getenv('MY_AZURE_OPENAI_ENDPOINT')
DEPLOYMENT_NAME_GPT3P5 = os.getenv('MY_DEPLOYMENT_NAME_GPT3P5')
azure_chat = AzureChatOpenAI(
    openai_api_version="2023-05-15",
    azure_deployment=DEPLOYMENT_NAME_GPT3P5,
    temperature=0
)

os.environ["QIANFAN_ACCESS_KEY"] = os.getenv('MY_QIANFAN_ACCESS_KEY')
os.environ["QIANFAN_SECRET_KEY"] = os.getenv('MY_QIANFAN_SECRET_KEY')

# Qianfan chat model
qianfan_chat = QianfanChatEndpoint(
    model="ERNIE-Bot-4"
)

messages = [
    SystemMessage(content="You are an intelligent AI assistant, named ChatGLM3."),
]
# Local ChatGLM3-6B model behind an OpenAI-compatible endpoint
LOCAL_GLM3_6B_ENDPOINT = "http://127.0.0.1:8000/v1/chat/completions"
local_glm3_chat = ChatGLM3(
    endpoint_url=LOCAL_GLM3_6B_ENDPOINT,
    max_tokens=(1024 * 32),
    prefix_messages=messages,
    top_p=0.9,
    temperature=0,
    stream=True,
)

if __name__ == '__main__':

    question = "What is Task Decomposition?"

    # Fetch and parse the URL
    loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
    docs = loader.load()

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
    docs = text_splitter.split_documents(docs)

    # Index the data (only needed once; comment out on later runs)
    # vectorstore.add_documents(docs)

    # EmbeddingsFilter: keep documents whose similarity to the question is at least 0.7
    embeddings_filter = EmbeddingsFilter(embeddings=embeddings_model, similarity_threshold=0.7)

    # Filter documents directly
    filtered_docs = embeddings_filter.compress_documents(docs[:10], question)

    embedding_filter_retriever = ContextualCompressionRetriever(
        base_compressor=embeddings_filter, base_retriever=retriever
    )
    # embedding_filter_docs = embedding_filter_retriever.get_relevant_documents(question)

    docs = [
        Document(page_content="粥余知识库"),
        Document(page_content="粥余知识库"),
        Document(page_content="同学你好!"),
        Document(page_content="同学你好!"),
    ]
    # Deduplicate near-identical documents
    redundant_filter = EmbeddingsRedundantFilter(embeddings=embeddings_model)
    # filtered_docs = redundant_filter.transform_documents(docs)

    # Filter by semantic similarity between the question and each chunk
    relevant_filter = EmbeddingsFilter(embeddings=embeddings_model, similarity_threshold=0.6)

    # LLM-based document compression
    qianfan_compressor = LLMChainExtractor.from_llm(qianfan_chat)

    # Pipeline: dedup -> relevance filter -> LLM compression
    pipeline_compressor = DocumentCompressorPipeline(
        transformers=[redundant_filter, relevant_filter, qianfan_compressor]
    )

    compression_retriever = ContextualCompressionRetriever(
        base_compressor=pipeline_compressor, base_retriever=retriever
    )
    # compressed_docs = compression_retriever.get_relevant_documents(question)

    # Reorder for long contexts
    retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
    docs_pre = retriever.get_relevant_documents('Decomposition')
    reordering = LongContextReorder()
    reordered_docs = reordering.transform_documents(docs_pre)