Building a Text Vector Knowledge Base Search with Node.js, OpenAI, and MongoDB

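Vector search is a foundational capability for building intelligent Q&A and AI search systems. This article shows how to use Node.js, the OpenAI Embedding API, and MongoDB to build a lightweight text vector knowledge base search system that supports semantic retrieval and recall of similar content.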

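Goal

Implement a complete, runnable pipeline:

Source document
→ Text chunking
→ Vector generation with OpenAI
→ Storage in MongoDB
→ User query
→ Retrieval of similar content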

📦 Environment Setup

Node.js 18+
MongoDB, running locally or on Atlas
OpenAI API Key

Install dependencies

npm install axios mongodb dotenv

.env configuration:

OPENAI_API_KEY=your-openai-api-key
MONGO_URI=mongodb://localhost:27017
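
Both scripts in this article depend on these two variables, so it can help to fail fast when one of them is missing instead of getting an opaque authentication or connection error later. The following is a minimal sketch; `requireEnv` is a hypothetical helper introduced here for illustration and is not part of dotenv.

```javascript
// Hypothetical helper: read a required variable from an env-like object
// and throw a descriptive error if it is missing or blank.
function requireEnv(name, env = process.env) {
  const value = env[name];
  if (value === undefined || value.trim() === '') {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Example with an in-memory object instead of process.env:
const sampleEnv = {
  OPENAI_API_KEY: 'test-key',
  MONGO_URI: 'mongodb://localhost:27017',
};
const mongoUri = requireEnv('MONGO_URI', sampleEnv);
```

In a real script the second argument would be omitted so that the helper reads from `process.env` after `require('dotenv').config()` has run.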

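Text Chunking and Vector Storage

We split the document into fixed-length chunks, convert each chunk into a vector through the OpenAI API, and store the results in MongoDB.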
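
Plain fixed-length slicing, as used in the index.js listing, can cut a sentence in half at a chunk boundary, which loses context on both sides. One common mitigation is to let consecutive chunks overlap. The sketch below illustrates the idea; the function name `splitTextWithOverlap` and the `overlap` parameter are assumptions made for this example and do not appear in the original code.

```javascript
// Hypothetical variant of a fixed-size splitter: each chunk repeats the
// last `overlap` characters of the previous chunk.
function splitTextWithOverlap(text, size = 200, overlap = 50) {
  if (size <= 0) throw new RangeError('size must be positive');
  if (overlap < 0 || overlap >= size) {
    throw new RangeError('overlap must be in the range [0, size)');
  }
  const step = size - overlap; // how far the window advances each time
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    // Stop once a chunk has reached the end of the text.
    if (start + size >= text.length) break;
  }
  return chunks;
}

const demoChunks = splitTextWithOverlap('abcdefghij', 4, 2);
// demoChunks is ['abcd', 'cdef', 'efgh', 'ghij']
```

A larger overlap preserves more boundary context but stores and embeds more duplicated text, so the value is a trade-off to tune per corpus.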

// index.js
require('dotenv').config();
const axios = require('axios');
const { MongoClient } = require('mongodb');

const OPENAI_API_KEY = process.env.OPENAI_API_KEY;
const MONGO_URI = process.env.MONGO_URI;

// Split text into consecutive chunks of at most `size` characters.
const splitText = (text, size = 200) =>
  Array.from({ length: Math.ceil(text.length / size) },
    (_, i) => text.slice(i * size, (i + 1) * size));

// Request an embedding vector for a single piece of text.
async function getEmbedding(text) {
  const res = await axios.post(
    'https://api.openai.com/v1/embeddings',
    { input: text, model: 'text-embedding-ada-002' },
    { headers: { Authorization: `Bearer ${OPENAI_API_KEY}` } }
  );
  return res.data.data[0].embedding;
}

// Embed each chunk and insert it into MongoDB together with its vector.
async function storeChunks(chunks) {
  const client = new MongoClient(MONGO_URI);
  try {
    await client.connect();
    const col = client.db('vector_store').collection('docs');

    for (const chunk of chunks) {
      const embedding = await getEmbedding(chunk);
      await col.insertOne({ chunk, embedding });
    }
  } finally {
    // Close the connection even if an embedding request or insert fails.
    await client.close();
  }
}

(async () => {
  const text = 'This is a knowledge base document used for testing. It will be split into chunks and stored as vectors.';
  const chunks = splitText(text);
  await storeChunks(chunks);
})().catch(err => {
  console.error(err);
  process.exitCode = 1;
});
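
The loop in `storeChunks` issues one HTTP request and one insert per chunk. The embeddings endpoint also accepts an array of strings as `input`, so chunks can be sent in groups. The sketch below shows one way to do that. It assumes the global `fetch` available in Node.js 18+ in place of axios, and the helper names `toBatches`, `embedAll`, and `openAiEmbedBatch`, as well as the batch size of 64, are illustrative choices rather than part of the original code.

```javascript
// Split an array into consecutive groups of at most `batchSize` items.
function toBatches(items, batchSize) {
  if (batchSize <= 0) throw new RangeError('batchSize must be positive');
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Embed all chunks batch by batch. `embedBatch` is an injected async
// function (texts) => embeddings, which keeps this logic testable offline.
async function embedAll(chunks, embedBatch, batchSize = 64) {
  const docs = [];
  for (const batch of toBatches(chunks, batchSize)) {
    const embeddings = await embedBatch(batch);
    if (embeddings.length !== batch.length) {
      throw new Error('embedding count does not match batch size');
    }
    batch.forEach((chunk, i) => docs.push({ chunk, embedding: embeddings[i] }));
  }
  return docs;
}

// One possible `embedBatch`, using fetch instead of axios (an assumption).
async function openAiEmbedBatch(texts, apiKey) {
  const res = await fetch('https://api.openai.com/v1/embeddings', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({ input: texts, model: 'text-embedding-ada-002' }),
  });
  if (!res.ok) throw new Error(`Embedding request failed: ${res.status}`);
  const json = await res.json();
  // Each returned item carries an `index`; sort to restore input order.
  return json.data.sort((a, b) => a.index - b.index).map(d => d.embedding);
}
```

The resulting array of `{ chunk, embedding }` documents could then be written with a single `col.insertMany(docs)` call instead of one `insertOne` per chunk.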

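Local Semantic Retrieval (Cosine Similarity)

After the user enters a question, we compute its embedding, calculate its cosine similarity against every vector stored in the database, and return the most relevant chunks.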

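// search.js
require('dotenv').config();
const axios = require('axios');
const { MongoClient } = require('mongodb');

const OPENAI_API_KEY = process.env.OPENAI_API_KEY;
const MONGO_URI = process.env.MONGO_URI;

// Same embedding helper as in index.js, repeated so this file runs on its own.
async function getEmbedding(text) {
  const res = await axios.post(
    'https://api.openai.com/v1/embeddings',
    { input: text, model: 'text-embedding-ada-002' },
    { headers: { Authorization: `Bearer ${OPENAI_API_KEY}` } }
  );
  return res.data.data[0].embedding;
}

// Cosine similarity: dot product divided by the product of the vector norms.
const similarity = (a, b) =>
  a.reduce((sum, val, i) => sum + val * b[i], 0) /
  (Math.hypot(...a) * Math.hypot(...b));

async function search(query, topK = 3) {
  const client = new MongoClient(MONGO_URI);
  try {
    await client.connect();
    const col = client.db('vector_store').collection('docs');

    const docs = await col.find({}).toArray();
    const queryVec = await getEmbedding(query);

    const results = docs
      .map(doc => ({ ...doc, score: similarity(queryVec, doc.embedding) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK);

    for (const r of results) {
      console.log(`\n[Score: ${r.score.toFixed(4)}]`);
      console.log(r.chunk);
    }
  } finally {
    await client.close();
  }
}

search('What is vector storage?').catch(err => {
  console.error(err);
  process.exitCode = 1;
});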
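
The brute-force search recomputes the norm of every stored vector on each query. If vectors are scaled to unit length once, cosine similarity reduces to a plain dot product. The sketch below demonstrates this on toy two-dimensional vectors; the names `normalize`, `dot`, `rank`, and the `unitEmbedding` field are assumptions introduced for the example.

```javascript
// Scale a vector to unit length so that cosine similarity becomes a dot product.
function normalize(vec) {
  const norm = Math.sqrt(vec.reduce((sum, v) => sum + v * v, 0));
  if (norm === 0) throw new Error('cannot normalize a zero vector');
  return vec.map(v => v / norm);
}

const dot = (a, b) => a.reduce((sum, v, i) => sum + v * b[i], 0);

// Rank documents whose vectors were normalized ahead of time.
function rank(queryVec, docs, topK = 3) {
  const q = normalize(queryVec);
  return docs
    .map(doc => ({ chunk: doc.chunk, score: dot(q, doc.unitEmbedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

// Toy data: in practice `unitEmbedding` would be computed once at ingest time.
const toyDocs = [
  { chunk: 'x-axis', unitEmbedding: normalize([2, 0]) },
  { chunk: 'diagonal', unitEmbedding: normalize([1, 1]) },
  { chunk: 'y-axis', unitEmbedding: normalize([0, 3]) },
];
const ranked = rank([5, 0], toyDocs, 2);
// ranked[0] is the 'x-axis' chunk, ranked[1] is the 'diagonal' chunk
```

This still scans every document, so it only trims per-query arithmetic; it does not change the linear cost of the search.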

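Using MongoDB Atlas Vector Search (Optional)

If you use MongoDB Atlas, you can enable a native vector index so that the database performs the similarity search instead of the application scanning every document.

Vector index configuration (Atlas → Search Index → Create Index). Name the index vector_index so that it matches the query below: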

{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 1536,
      "similarity": "cosine"
    }
  ]
}
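
The index can also be created from code rather than through the Atlas UI. The sketch below assumes a version of the MongoDB Node.js driver whose `createSearchIndex` method accepts a `type` field of `'vectorSearch'`; check this against the driver version you have installed. The helper names `buildVectorIndexModel` and `createVectorIndex` are hypothetical.

```javascript
// Build an index description for an Atlas Vector Search index.
// The `type: 'vectorSearch'` field is an assumption about the driver version.
function buildVectorIndexModel(name, path, numDimensions) {
  return {
    name,
    type: 'vectorSearch',
    definition: {
      fields: [
        { type: 'vector', path, numDimensions, similarity: 'cosine' },
      ],
    },
  };
}

// `col` is a MongoDB collection handle; the call resolves to the index name.
async function createVectorIndex(col, model) {
  return col.createSearchIndex(model);
}

// 1536 matches the output size of text-embedding-ada-002.
const indexModel = buildVectorIndexModel('vector_index', 'embedding', 1536);
```

Index builds on Atlas are asynchronous, so queries may return no results until the index has finished building.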

Update the retrieval logic:

const pipeline = [
  {
    $vectorSearch: {
      index: "vector_index",
      path: "embedding",
      queryVector: queryVec,
      numCandidates: 100,
      limit: topK
    }
  },
  {
    $project: {
      chunk: 1,
      score: { $meta: "vectorSearchScore" }
    }
  }
];

const results = await col.aggregate(pipeline).toArray();
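
To keep `numCandidates` in step with `topK`, the pipeline can be produced by a small builder function. The sketch below is one way to do it; the function name `buildVectorSearchPipeline`, the `candidateFactor` parameter, its default of 20, and the cap of 10000 are assumptions to verify against the Atlas documentation, not values taken from the original code.

```javascript
// Hypothetical builder for the $vectorSearch pipeline.
function buildVectorSearchPipeline(queryVec, topK = 3, candidateFactor = 20) {
  if (!Array.isArray(queryVec) || queryVec.length === 0) {
    throw new TypeError('queryVec must be a non-empty array of numbers');
  }
  // Consider more candidates than the final limit, never fewer than topK.
  // The 10000 cap is an assumed upper bound; confirm it for your cluster.
  const numCandidates = Math.min(10000, Math.max(topK, topK * candidateFactor));
  return [
    {
      $vectorSearch: {
        index: 'vector_index',
        path: 'embedding',
        queryVector: queryVec,
        numCandidates,
        limit: topK,
      },
    },
    {
      // Return only the chunk text and its similarity score.
      $project: {
        _id: 0,
        chunk: 1,
        score: { $meta: 'vectorSearchScore' },
      },
    },
  ];
}

const demoPipeline = buildVectorSearchPipeline([0.1, 0.2], 5);
```

Inside `search`, the result would then be obtained with `col.aggregate(buildVectorSearchPipeline(queryVec, topK)).toArray()`.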

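This project implements a basic text vector knowledge base with the following capabilities:

Document chunking and semantic vector generation
Local retrieval by cosine similarity
Can be extended to MongoDB Atlas Vector Search
Can later be connected to an LLM (such as ChatGPT) for answer generation, forming a RAG question-answering system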
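
As a rough illustration of the RAG step, the retrieved chunks can be assembled into a prompt before being sent to an LLM. The sketch below covers only prompt construction. The function name `buildRagPrompt` and the prompt wording are assumptions, and the call to the model itself is left out.

```javascript
// Hypothetical helper: combine retrieved chunks and the user's question
// into a single prompt string for a chat or completion model.
function buildRagPrompt(question, results) {
  const context = results
    .map((r, i) => `[${i + 1}] ${r.chunk}`)
    .join('\n');
  return [
    'Answer the question using only the context below.',
    'If the context is insufficient, say so.',
    '',
    'Context:',
    context,
    '',
    `Question: ${question}`,
  ].join('\n');
}

const demoPrompt = buildRagPrompt('What is vector storage?', [
  { chunk: 'Vectors are stored in MongoDB.', score: 0.9 },
  { chunk: 'Chunks are embedded with OpenAI.', score: 0.8 },
]);
```

Numbering the chunks makes it possible to ask the model to cite which passage supports its answer.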

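🧠 Further Directions

✅ Import PDF, web page, and Markdown sources
✅ Connect an LLM to generate answers (build a RAG system)
✅ Add a front-end page with a chat UI
✅ Build a permission-aware knowledge base for multiple users and multiple knowledge sources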
