cocainecong / tangseng

Tangseng is a Go-based search engine and information retrieval system that supports full-text search and vector search.

Home Page: https://cocainecong.github.io/tangseng/

License: Apache License 2.0

Go 64.19% Makefile 0.33% Python 35.17% Shell 0.30%
boltdb etcd gin inverted-index losertree lsm-tree search-engine segment dockcer-compose docker

tangseng's Introduction

Tangseng: a search engine based on Go

Click here for the detailed project documentation

Project Architecture & Features

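  1. gin is the HTTP framework, grpc is the RPC framework, and etcd handles service discovery.
  2. The services are split into a user module, a favorites module, an index platform, a search engine (text module), and a search engine (image module).
  3. A distributed crawler fetches data and sends it to a kafka cluster, where it is consumed and persisted to the database. (The crawler isn't written yet, but that doesn't stop me from making promises...)
  4. Text search in the search engine module is set up separately: it stores the index in boltdb, uses mapreduce to speed up index construction, and stores the index as roaring bitmaps.
  5. Term suggestion is implemented with a trie tree (an algorithmic model to assist suggestion is planned for later).
  6. Image search uses ResNet50 for vectorized queries plus a Milvus or Faiss vector database for lookup (work has started... deep learning is really hard...).
  7. Multi-path recall is supported: Go performs inverted-index recall and Python performs vector recall. The two are connected through grpc calls and their results are fused.
  8. Ranking algorithms such as TF-IDF and BM25 are supported.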
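
To make items 4 and 8 concrete, here is a minimal sketch of inverted-index recall followed by BM25 ranking. It is not the project's implementation: the whitespace tokenizer, the function names, the sample documents, and the parameter defaults (`k1=1.5`, `b=0.75`) are all assumptions chosen for illustration, and plain dictionaries stand in for boltdb and roaring bitmaps.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    doc_len = {}
    for doc_id, text in docs.items():
        tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
        doc_len[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index, doc_len


def bm25_search(query, index, doc_len, k1=1.5, b=0.75):
    """Recall documents containing any query term and rank them by BM25."""
    n_docs = len(doc_len)
    avg_len = sum(doc_len.values()) / n_docs
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        # Smoothed IDF that stays non-negative even for very common terms.
        idf = math.log(1 + (n_docs - len(postings) + 0.5) / (len(postings) + 0.5))
        for doc_id, tf in postings.items():
            # Length normalization: longer documents need more hits to score the same.
            norm = tf + k1 * (1 - b + b * doc_len[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample documents.
docs = {
    1: "journey to the west",
    2: "the west wing",
    3: "journey of a monk journey",
}
index, doc_len = build_index(docs)
results = bm25_search("journey monk", index, doc_len)
```

In this sketch only documents that contain at least one query term are scored, which mirrors the recall-then-rank split described above.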

Overall project architecture

🧑🏻‍💻 Frontend

Written entirely in React; still under development.

react-tangseng

Roadmap

Architecture

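  • Introduce service degradation and circuit breaking
  • Introduce jaeger for end-to-end tracing (from Go through to Python)
  • Introduce skywalking or prometheus for monitoring
  • Extract the dao init and fetch the relevant database instance by key
  • Separate hot and cold data (following the es approach; the key question is the criterion for deciding hot versus cold, and perhaps it could live in middleware?)
  • For now mysql is enough to store the forward index, but later we may go straight to OLAP; starrocks can query a single table with hundreds of millions of rows in milliseconds, whereas mysql would have needed database and table sharding long before that scale..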

Features

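  • Index building is too slow; add concurrency to the index-building code
  • Index compression: change the inverted index table to store offsets and use mmap
  • Relevance calculation needs more thought: TF-IDF, BM25
  • Use a prefix tree (trie) to store suggestion data
  • Compress the prefix tree with Huffman coding
  • When building the index, pass a file stream instead of a file path
  • In python, introduce a bert model to recommend terms for word segmentation and expose a grpc interface
  • Support consistent-hash sharded storage for the inverted index and the trie tree
  • Word vectors
  • pagerank
  • Separate the build and recall processes of the trie tree
  • Add the ik tokenizer for word segmentation
  • Build an index platform that separates compute from storage and index building from recall
  • Union, intersection, and difference operations (bitwise operations)
  • Pagination
  • Sorting
  • Correct the input query, e.g. “陆加嘴” --> “陆家嘴”
  • Suggest completions for the input, e.g. “东方明” suggests --> “东方明珠”
  • Indexing is currently block-based; later, see whether it can be changed to distributed mapreduce index building (6.824 lab1)
  • On top of the previous item, add dynamic indexing (not yet sure the previous item is even feasible...)
  • Rework the inverted index to store docid values in a roaring bitmap (so hard)
  • Implement a TF class
  • Running one search right after another without clearing the cache makes the result a union with the previous search
  • Optimize the ranker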
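
The prefix-tree suggestion items in the list above can be sketched as follows. This is an illustrative minimal trie, not the project's code: the class names, the `limit` parameter, and the sample entries are assumptions.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.is_end = False  # True if a stored entry ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def suggest(self, prefix, limit=5):
        """Return up to `limit` stored entries that start with `prefix`."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []  # no stored entry has this prefix
        found = []
        stack = [(node, prefix)]
        while stack and len(found) < limit:
            current, word = stack.pop()
            if current.is_end:
                found.append(word)
            # Push children in reverse order so they are popped in sorted order.
            for ch in sorted(current.children, reverse=True):
                stack.append((current.children[ch], word + ch))
        return found


trie = Trie()
for entry in ["东方明珠", "东方航空", "陆家嘴"]:  # hypothetical entries
    trie.insert(entry)
suggestions = trie.suggest("东方明")
```

Keeping `insert` and `suggest` as separate operations also matches the roadmap item about splitting the trie's build and recall phases, although a real deployment would persist the tree rather than hold it in memory.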

Text search

Quick Start

Start the environment!

make env-up

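A small dataset is available at source_data/movies_data.csv

Start Python!

  1. Make sure Python is installed and that the version is >= 3.9; mine is 3.10.2

    python --version
  2. Create a venv environment

    python -m venv venv
  3. Activate the venv Python environment

    macos:

    source venv/bin/activate

    windows:

    I'll add Windows support once I've cleaned up my C drive... I haven't run it on Windows yet...

  4. Install third-party dependencies

    pip install -r requirements.txt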

Start Golang!

Any golang version >= go 1.16 will do. My go version is 1.18.6

  1. Download third-party dependencies

    go mod tidy
  2. From the project directory, run

    make run-xxx(user,favorite ...)
    # e.g.:
    # make run-user
    # make run-favorite
    # see the Makefile for details

Contributing

Before submitting a PR, please read CONTRIBUTING_CN.md

tangseng's People

Contributors

3927o, cocainecong, dependabot[bot], lyt122, mutezebra, my0sotis, starryskyli

tangseng's Issues

Feature Request: Add Jaeger to provide project with the capability of distributed tracing

Background

This is a project that spans across go and python. We hope to provide certain observability for each request in order to monitor the project.

Introduction to Jaeger

Jaeger is an open-source, end-to-end distributed tracing system that helps developers monitor and troubleshoot complex, microservices-based architectures. It provides insights into the flow of requests within and between services, allowing developers to visualize the path of a request as it traverses through various components of a distributed system.

In the context of Go (Golang), Jaeger has a client library called "Jaeger-Client-Go" that allows developers to instrument their Go applications for distributed tracing. The Jaeger client library integrates with the OpenTracing API, which is a set of vendor-neutral APIs for distributed tracing.

Problem Description

We hope to be able to monitor the request chain of this project, providing us with a more intuitive, concise, and clear chain analysis during production or debugging.

Newer versions of Jaeger support OpenTelemetry. For more details, please visit here.

So you should choose the client SDK carefully, so that it supports distributed tracing for both the Go and the Python parts of the project (the choice should be made for both together).

How to solve this problem

The coding strategy is similar to another issue (#53). First complete the distributed tracing implementation on the Go side, submit it and get it through code review, and then start on the Python version. This will not take up much of your time.
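
Whichever SDK is chosen, the trace context has to cross the Go-to-Python gRPC boundary, usually as request metadata. The sketch below illustrates the W3C Trace Context `traceparent` value (`version-traceid-spanid-flags`), which OpenTelemetry SDKs typically propagate. It is only an illustration of the format: in practice the SDK's propagator does this, and the helper names here are assumptions.

```python
import re
import secrets

# version "00", 32 hex chars of trace id, 16 hex chars of span id, 2 hex chars of flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace, as the first service handling a request would."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(parent):
    """Continue the caller's trace: same trace id, fresh span id."""
    match = TRACEPARENT_RE.match(parent)
    if match is None:
        raise ValueError(f"malformed traceparent: {parent!r}")
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = new_traceparent()            # e.g. created on the Go side
outgoing = child_traceparent(incoming)  # e.g. derived on the Python side
```

Because both values share one trace id, a tracing backend can stitch the Go and Python spans into a single request chain.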

SSL connection failure error when running the project

As the title says. The error output is as follows:

Codespace/tangseng$ python main.py
Traceback (most recent call last):
  File "/home/rong/.local/lib/python3.10/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/home/rong/.local/lib/python3.10/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/home/rong/.local/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
OSError: [Errno 101] Network is unreachable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/rong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 715, in urlopen
    httplib_response = self._make_request(
  File "/home/rong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 404, in _make_request
    self._validate_conn(conn)
  File "/home/rong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1058, in _validate_conn
    conn.connect()
  File "/home/rong/.local/lib/python3.10/site-packages/urllib3/connection.py", line 363, in connect
    self.sock = conn = self._new_conn()
  File "/home/rong/.local/lib/python3.10/site-packages/urllib3/connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f6d3b474cd0>: Failed to establish a new connection: [Errno 101] Network is unreachable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/rong/.local/lib/python3.10/site-packages/requests/adapters.py", line 440, in send
    resp = conn.urlopen(
  File "/home/rong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 799, in urlopen
    retries = retries.increment(
  File "/home/rong/.local/lib/python3.10/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/models/uer/sbert-base-chinese-nli (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f6d3b474cd0>: Failed to establish a new connection: [Errno 101] Network is unreachable'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/d/Codespace/tangseng/main.py", line 14, in <module>
    from app.search_vector.service.search_vector import serve
  File "/mnt/d/Codespace/tangseng/app/search_vector/service/search_vector.py", line 9, in <module>
    from ..config.config import DEFAULT_MILVUS_TABLE_NAME, VECTOR_ADDR
  File "/mnt/d/Codespace/tangseng/app/search_vector/config/config.py", line 78, in <module>
    TRANSFORMER_MODEL = SentenceTransformer(TRANSFORMER_MODEL_NAME)
  File "/home/rong/.local/lib/python3.10/site-packages/sentence_transformers/SentenceTransformer.py", line 87, in __init__
    snapshot_download(model_name_or_path,
  File "/home/rong/.local/lib/python3.10/site-packages/sentence_transformers/util.py", line 442, in snapshot_download
    model_info = _api.model_info(repo_id=repo_id, revision=revision, token=token)
  File "/home/rong/.local/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/rong/.local/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 1677, in model_info
    r = get_session().get(path, headers=headers, timeout=timeout, params=params)
  File "/home/rong/.local/lib/python3.10/site-packages/requests/sessions.py", line 542, in get
    return self.request('GET', url, **kwargs)
  File "/home/rong/.local/lib/python3.10/site-packages/requests/sessions.py", line 529, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/rong/.local/lib/python3.10/site-packages/requests/sessions.py", line 645, in send
    r = adapter.send(request, **kwargs)
  File "/home/rong/.local/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 63, in send
    return super().send(request, *args, **kwargs)
  File "/home/rong/.local/lib/python3.10/site-packages/requests/adapters.py", line 519, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/models/uer/sbert-base-chinese-nli (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f6d3b474cd0>: Failed to establish a new connection: [Errno 101] Network is unreachable'))"), '(Request ID: ab538c3f-6bbc-4c72-901e-38b7ecef0529)')

How can I resolve this?
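
The traceback shows `[Errno 101] Network is unreachable` while `SentenceTransformer(TRANSFORMER_MODEL_NAME)` tries to reach huggingface.co, so the failure is a connectivity problem rather than an SSL handshake error. One possible workaround, offered as a sketch and not as the project's fix, is to download the model once on a machine with access and load it from a local directory, since `SentenceTransformer` also accepts a local path. The helper name and the directory naming scheme below are assumptions.

```python
from pathlib import Path


def resolve_model_source(model_name, local_root):
    """Prefer a local copy of the model; fall back to the hub name.

    Assumes (hypothetically) that a downloaded model is stored under
    `local_root` with "/" in its name replaced by "_".
    """
    candidate = Path(local_root) / model_name.replace("/", "_")
    if candidate.is_dir():
        return str(candidate)
    return model_name


# Possible use in config.py (hypothetical "./models" directory):
# TRANSFORMER_MODEL = SentenceTransformer(
#     resolve_model_source(TRANSFORMER_MODEL_NAME, "./models"))
```

With no local copy present the function returns the hub name unchanged, so behavior on a machine with network access stays the same.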

Proposal: Provide a Docker container image for the project.

Refer to DeepMD-kit

I believe this project can provide Docker images to simplify usage, while also offering more comprehensive introductory documentation.

At the same time, we can use images to create Docker containers and configure the required Python settings inside them (as we all know, configuring Python on a physical machine is complex).
Once the volume mapping is in place, we can develop inside the containers without affecting the host machine's Python.

If possible, I will try to submit a pull request in the next few days to implement this feature.

But first, I need to know if this proposal is feasible.


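In fact, we can use Docker to sidestep complex environment setup. The current Docker plan focuses on dev; a production deployment plan will be considered once the former is merged.

Container for developers

  1. Package the entire project source code into a Docker image and push it to a cloud provider's container registry (or docker hub). The image comes with everything the project needs preconfigured, except middleware such as redis and etcd.
  2. On a separate physical machine, pull the image and set up a volume mapping that maps the source code directory to a location outside the container.
  3. Make changes to the project in this mapped directory.
  4. After making changes, rerun the Makefile build commands inside the container and start the services inside the container.

This may require the following changes:

  1. Modify the Makefile so that some of its targets (e.g. make xxx) are restricted to running inside the container. We can achieve this by setting a special environment variable.
  2. Write a Dockerfile. It will be very simple and contain only COPY/ADD.

Note that this project uses docker-compose to start middleware quickly; we need to run this compose setup in the external environment rather than inside the container.

In other words, the container only provides the environment needed for building and deploying; everything else has to be done outside it. In this setup the container acts more like a virtual machine environment.

Besides making dev easier, this also sidesteps compatibility problems on Windows (for example, Windows needs some special adjustments before it can run shell scripts).

Container for deployment

We can add a new Dockerfile that implements the common commands for compiling binaries, packages the project, and puts all compiled binaries and python files into a single image.

We write a start-container.sh script, and the script itself explicitly states that users must not call it directly.

When starting a container from this image, we pass a parameter through to start-container.sh so that the script knows which service to run.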
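
The proposal describes this dispatch as a shell script. The same logic is sketched below in Python purely for illustration: the service names follow the Makefile targets mentioned in the README (`run-user`, `run-favorite`), while the binary paths and the `search_vector` entry are hypothetical.

```python
import sys

# Hypothetical mapping from service name to the command that starts it.
SERVICE_COMMANDS = {
    "user": ["./bin/user"],
    "favorite": ["./bin/favorite"],
    "search_vector": [sys.executable, "main.py"],
}


def resolve_command(argv):
    """Map the single container argument to the command for that service."""
    if len(argv) != 1:
        raise SystemExit("usage: start-container <service>")
    service = argv[0]
    try:
        return SERVICE_COMMANDS[service]
    except KeyError:
        known = ", ".join(sorted(SERVICE_COMMANDS))
        raise SystemExit(f"unknown service {service!r}; expected one of: {known}")


command = resolve_command(["user"])
```

An entrypoint built this way would hand `command` to the operating system (for example with `os.execvp`), so one image can start any service depending on the argument given at container start.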

Feature Request: Add Prometheus as project's monitor

Background

This is a project that spans across go and python. We hope to provide certain observability for each request in order to monitor the project.

Introduction to Prometheus

Prometheus is an open-source monitoring and alerting toolkit designed for reliability and scalability of applications. It was originally developed at SoundCloud and later open-sourced. Prometheus is part of the Cloud Native Computing Foundation (CNCF) and has gained widespread adoption in the container orchestration ecosystem, particularly with Kubernetes.

In the context of Go (Golang), Prometheus is often used with a Go client library that allows developers to instrument their Go applications for monitoring and exposing metrics. The official Prometheus Go client library provides a set of functions and utilities to help developers expose custom metrics from their Go applications and make them available for scraping by a Prometheus server.

Problem Description

We hope to integrate Prometheus into this project.

Prometheus also has a Python client SDK, which makes the Python part of the project observable as well.

How to solve this problem

  1. First, provide support for Prometheus in the Go language part of this project.
  2. After completing the previous task, submit a PR for code review.
  3. Once the code review is approved, if there is remaining time, provide support for Prometheus in the Python project.
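
For orientation, the sketch below shows what a scraped counter looks like in the Prometheus text exposition format (`# HELP`, `# TYPE`, then one sample per label set). It is a standard-library illustration only: a real integration would use the official client libraries (client_golang for Go, prometheus_client for Python), and the class name, the metric name, and the labels are assumptions. Label values are not escaped here.

```python
from collections import defaultdict


class SketchCounter:
    """A minimal labelled counter rendered in the Prometheus text format."""

    def __init__(self, name, help_text, label_names):
        self.name = name
        self.help_text = help_text
        self.label_names = tuple(label_names)
        self._values = defaultdict(float)  # label values tuple -> running total

    def inc(self, *label_values, amount=1.0):
        if len(label_values) != len(self.label_names):
            raise ValueError("label count mismatch")
        self._values[label_values] += amount

    def render(self):
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for label_values, value in sorted(self._values.items()):
            labels = ",".join(
                f'{key}="{val}"' for key, val in zip(self.label_names, label_values)
            )
            lines.append(f"{self.name}{{{labels}}} {value}")
        return "\n".join(lines) + "\n"


# Hypothetical metric: requests handled per service.
requests_total = SketchCounter(
    "tangseng_requests_total", "Total requests handled.", ["service"]
)
requests_total.inc("search")
requests_total.inc("search")
requests_total.inc("user")
output = requests_total.render()
```

A Prometheus server would scrape text like `output` from an HTTP endpoint exposed by each service, which is why the Go and Python parts can be instrumented independently.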

🎈V0.1.3 Release Plan

  • use a more engineering-oriented error code wrapper.
  • update tangseng pages docs
  • add Prometheus as project's monitor
  • add Jaeger to provide project with the capability of distributed tracing

If you want to contribute to this project, don't be shy: share your ideas.😘

🎈V0.1.1 Release Plan

The main branch is currently under development. Don't run the project until we have finished coding. 🤣

  • build an index platform to separate index construction from query recall.
  • support Kafka cluster consumer groups.
  • support mapreduce to build the index.
  • optimize ranking calculation, such as tfidf, bm25 and so on.

If you want to contribute to this project, don't be shy: share your ideas.😘

🎈V0.1.2 Release Plan

  • add docs with github pages.
  • refactor the older mapreduce.
  • support vector indexing with milvus.

We will support image-by-image search in the next version, not this one 🤣 because I have been so busy lately 🥲
If you want to contribute to this project, don't be shy: share your ideas.😘
