
bertopic-tutorial's Introduction

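Using the template code

  1. The template code is main.ipynb.

  2. Install dependencies and run the code

    • The whole tutorial is based on Python 3.10.x.
    • Download and install Anaconda first.
    • Install HDBSCAN before the other packages: conda install -c conda-forge hdbscan
    • Install the remaining dependencies: pip install torch==2.0.1 transformers==4.29.1 tqdm==4.65.0 numpy==1.23.2 jieba==0.42.1 bertopic==0.15.0
    • If the code reports other missing dependencies at run time, install them yourself.
    • Then run main.ipynb directly to see the results.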
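
Before opening main.ipynb, it can help to confirm that the environment matches the pinned versions listed above. The following is a minimal sketch, not part of the repository: the helper name `find_mismatches` is hypothetical, and it relies only on the standard-library `importlib.metadata` module.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the pip install command in this README.
PINNED = {
    "torch": "2.0.1",
    "transformers": "4.29.1",
    "tqdm": "4.65.0",
    "numpy": "1.23.2",
    "jieba": "0.42.1",
    "bertopic": "0.15.0",
}


def find_mismatches(pinned):
    """Return {package: (expected, installed)} for each package whose
    installed version differs from the pin. `installed` is None when
    the package is not installed at all."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    # Print one line per package that is missing or at a different version.
    for name, (expected, installed) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

The comparison is an exact string match, so a build that reports a local version suffix (as GPU builds of torch often do) would be listed even though the base version agrees.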

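Using your own data

  1. Prepare the data: first, prepare a 文本.txt (texts) file and a 时间.txt (timestamps) file and put them in the data/ directory.

    • 文本.txt holds the corpus before word segmentation, one document per line.
    • 时间.txt holds the year of each text.
    • 文本.txt and 时间.txt have the same number of lines; for example, 1000 lines each means 1000 texts and their corresponding years.
    • Put both files in data/; the files already in data/ serve as examples.
  2. Segment the text: go to the 分词/ directory and run cut_word.py, which generates data/切词.txt.

    • The user dictionary is configured in 分词/userdict.txt.
    • Stop words are configured in 分词/stopwords.txt.
  3. Generate embeddings: go to the embedding/ directory and run one of the ipynb files, which generates emb.npy.

    • For example, embedding_bert.ipynb uses the bert-base-chinese model to generate the embeddings.
    • For example, embedding_sentence_transformer.ipynb uses a Sentence-Transformers model to generate the embeddings.
    • Copy the generated embeddings into the data/ directory and rename the file to embedding_bbc.npy or similar; follow the naming of the files in data/.
    • To use an online GPU platform such as autodl, upload data/文本.txt and the embedding_xxx.ipynb from the 线上平台代码/ directory to the platform, run it there, and download the generated embedding file to your local machine.
  4. Run: run main.ipynb to generate the clustering results.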
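
Since each line of 时间.txt is paired with the same line of 文本.txt, a quick alignment check before segmentation can catch a mismatch early. The sketch below is illustrative and not taken from the repository: the function name `load_aligned` is hypothetical, and UTF-8 encoding of both files is an assumption.

```python
def read_lines(path):
    """Read a text file and return its lines without trailing newlines.
    UTF-8 is assumed here; adjust the encoding if your files differ."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def load_aligned(text_path, time_path):
    """Pair each document with the year on the same line number.

    Raises ValueError when the two files have different line counts,
    because the pairing would then be ambiguous.
    """
    docs = read_lines(text_path)
    years = read_lines(time_path)
    if len(docs) != len(years):
        raise ValueError(
            f"{text_path} has {len(docs)} lines but "
            f"{time_path} has {len(years)} lines"
        )
    # Years are kept as stripped strings; convert them as your analysis needs.
    return [(doc, year.strip()) for doc, year in zip(docs, years)]
```

With the layout described above, the call would be `load_aligned("data/文本.txt", "data/时间.txt")`. Keeping the years as strings avoids guessing at the exact format of 时间.txt.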

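Project directory

  1. data/: the data the project needs
  2. embedding/: code for generating embeddings, to be run locally
  3. 线上平台代码/: also code for generating embeddings, but for running on online GPU platforms such as autodl
  4. 笔记/: notes used in the video course
  5. test/: code examples written during the video course
  6. main.ipynb: the BERTopic template file; you can adapt it for your own project ⭐
  7. README.md: project description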

bertopic-tutorial's People

Contributors

lynn1885
