Git Product home page Git Product logo

blocked-sort-based-indexing's Introduction

Blocked-Sort-Based-Indexing

要求

  1. 使用基于块的排序索引构建算法BSBI
  2. 在前几个作业基础上支持短语查询
  3. 实现词典和倒排索引表的压缩

BSBI

短语查询

短语查询可以使用带位置信息的索引表或者简单的双词索引。
这次作业我们就水一水用双词索引吧!

双词索引

每两个连续的词组成词对(作为短语)来索引。 比如文本片段“Friend, Romans, Countrymen”会产生两个词对:

  • friends romans
  • romans countrymen
    索引构建时将每个词对看成一个词项放入词典中。
    查询时,将查询拆分成基于双词的布尔查询式。
    例如:stanford university palo alto,处理方法:
    stanford university AND university palo AND palo alto

词典压缩

我们把词典看成一个单一的字符串,词项用指向这个单一字符串某一位置的指针表示。

倒排记录表压缩

倒排记录表的压缩使用对间隔编码。
computer: 283047 283154 283159 改写为 computer: 283047 107 5
这样可以降低词存储docID的开销
然而我们用Python不知道能不能真的降低。。。。


实现

倒排索引表的设计,倒排索引表每一项有三个部分:

  1. 词项(包括双词)指针
  2. 倒排记录:对应的文档ID
    文档docID采用对间隔编码的方式进行存储
    这样我们倒排索引表的格式就得到了:
    指向词项的指针: docID, 与第一个的间隔,与第二个的间隔

blocked-sort-based-indexing's People

Contributors

gitdxj avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.