
lucene-7.x-9.x's People

Contributors

luxugang


lucene-7.x-9.x's Issues

What are the head and tail queues in BooleanScorer for?

Reading the 9.1.0 source, I can't see what special purpose the head and tail queues in BooleanScorer serve; a single queue seems like it would be enough. Your article describes it as follows:
[screenshot from the article]

But looking at the source, the lowest-priority entry in head is actually just the one with the globally smallest next doc ID, so why keep separate head and tail queues?
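
A guess at the intent, hedged since it is written from memory of the 9.x code rather than re-reading it: the tail queue is sized minShouldMatch - 1, so for a plain disjunction (minShouldMatch == 1) it never holds anything and the two queues do degenerate into one, which would explain what you observed. The split only pays off when minShouldMatch > 1: the cheapest clauses are parked in tail and advanced only when the clauses already positioned in the current window could, together with them, still reach minShouldMatch. A self-contained toy sketch (not Lucene code) of that partitioning idea:

import java.util.Comparator;
import java.util.PriorityQueue;

public class HeadTailSketch {

    // a stand-in for a clause's disjunct iterator: sorted doc IDs plus a cost
    static final class SubIterator {
        final int[] docs;
        int pos = 0;
        SubIterator(int... docs) { this.docs = docs; }
        int nextDoc() { return pos < docs.length ? docs[pos] : Integer.MAX_VALUE; }
        long cost() { return docs.length; }
    }

    public static void main(String[] args) {
        int minShouldMatch = 2;

        // head: ordered by the next doc each clause is positioned on
        PriorityQueue<SubIterator> head =
            new PriorityQueue<>(Comparator.comparingInt(SubIterator::nextDoc));
        // tail: keeps at most (minShouldMatch - 1) of the cheapest clauses
        PriorityQueue<SubIterator> tail =
            new PriorityQueue<>(Comparator.comparingLong(SubIterator::cost).reversed());

        SubIterator[] clauses = {
            new SubIterator(1, 5, 9), new SubIterator(2, 5, 7, 8, 11), new SubIterator(5, 9)
        };
        for (SubIterator it : clauses) {
            tail.add(it);
            if (tail.size() > minShouldMatch - 1) {
                head.add(tail.poll()); // evict the most expensive clauses into head
            }
        }
        // Clauses left in tail are advanced lazily, only when the head clauses inside the
        // current window cannot reach minShouldMatch on their own. With minShouldMatch == 1
        // the tail capacity is 0 and everything lands in head, i.e. the behaviour you saw.
        System.out.println("head=" + head.size() + " tail=" + tail.size());
    }
}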

Some questions about the WAND algorithm

We are currently on Elasticsearch 5.4 and would like to backport the WAND algorithm from newer Lucene versions into our current Lucene to speed up term queries. Is this feasible?
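
For reference, a minimal, self-contained sketch (not Lucene's WANDScorer and not tied to ES 5.4) of the core pruning test you would be porting: each clause exposes an upper bound on the score it can contribute, and a candidate document (or a whole block, in the block-max variant) is skipped when the sum of the bounds of the clauses that could match it cannot beat the current k-th best score. The practical obstacle with a backport is usually not the scorer itself but that Lucene 8+ stores per-block impact/max-score metadata in the postings, which the 5.x index format does not have.

public class WandSketch {

    // upper bounds of what each clause can contribute, and which clauses match the candidate
    static boolean canCompete(float[] clauseMaxScores, boolean[] clauseMatchesDoc,
                              float minCompetitiveScore) {
        float upperBound = 0f;
        for (int i = 0; i < clauseMaxScores.length; i++) {
            if (clauseMatchesDoc[i]) {
                upperBound += clauseMaxScores[i];
            }
        }
        // if even the most optimistic total cannot beat the current k-th best score,
        // the candidate can be skipped without computing its real score
        return upperBound >= minCompetitiveScore;
    }

    public static void main(String[] args) {
        float[] maxScores = {1.2f, 0.4f, 0.7f};
        boolean[] matches = {true, false, true};
        System.out.println(canCompete(maxScores, matches, 2.5f)); // false -> prune
        System.out.println(canCompete(maxScores, matches, 1.5f)); // true  -> score it
    }
}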

Questions about making an ES index support decompression with two different compression algorithms

Hello. We need to add zstd compression to the indices of a production ES 7.10.0 cluster (Lucene 8.7.0), and we want the old LZ4-compressed indices to switch to zstd transparently, without users noticing.

Below is my ZSTDCompressor implementation: when compressing, it writes a magic number before writing dictLength. I also defined a proxy, ZstdProxyDecompressor, that can decompress data written with either algorithm; at decompression time it picks the algorithm based on the magic number. Is this approach viable? The idea is that after upgrading the cluster image, the index's compression algorithm is switched to the zstd implementation; old data stays readable through the dual-algorithm decompressor, while new or updated data is written with zstd.

My own tests of decompressing both formats seem fine so far, and I am now trying to modify the codec of the ES index to change its compression algorithm.

However, since I haven't studied Lucene or ES in depth before, I don't know whether there are risks I haven't considered. Any pointers would be appreciated, thanks!

This is the implementation class that supports the zstd compression algorithm:

public class ZstdWithPresetDictCompressionMode extends CompressionMode {
    private static final int NUM_SUB_BLOCKS = 10;
    private final int level;
    public static final int defaultLevel = 3;
    private static final int DICT_SIZE_FACTOR = 6;

    /**
     * default constructor
     */
    ZstdWithPresetDictCompressionMode() {
        this.level = defaultLevel;
    }

    /**
     * compression mode for a given compression level
     */
    ZstdWithPresetDictCompressionMode(int level) {
        this.level = level;
    }

    @Override
    public Compressor newCompressor() {
        return new ZSTDCompressor(level);
    }

    @Override
    public Decompressor newDecompressor() {
        return new ZSTDDecompressor();
    }

    /**
     * zstandard compressor
     */
    private static final class ZSTDCompressor extends Compressor {

        int compressionLevel;
        byte[] compressedBuffer;

        /**
         * compressor with a given compression level
         */
        public ZSTDCompressor(int compressionLevel) {
            this.compressionLevel = compressionLevel;
            compressedBuffer = BytesRef.EMPTY_BYTES;
        }

        @Override
        public void close() throws IOException {
        }

        /*reusable compress function*/
        private void doCompress(byte[] bytes, int off, int len, ZstdCompressCtx cctx, DataOutput out)
            throws IOException {
            if (len == 0) {
                out.writeVInt(0);
                return;
            }
            final int maxCompressedLength = (int) Zstd.compressBound(len);
            compressedBuffer = ArrayUtil.grow(compressedBuffer, maxCompressedLength);

            int compressedSize =
                cctx.compressByteArray(compressedBuffer, 0, compressedBuffer.length, bytes, off, len);

            out.writeVInt(compressedSize);
            out.writeBytes(compressedBuffer, compressedSize);
        }

        @Override
        public void compress(byte[] bytes, int off, int len, DataOutput out) throws IOException {
            final int dictLength = len / (NUM_SUB_BLOCKS * DICT_SIZE_FACTOR);
            final int blockLength = (len - dictLength + NUM_SUB_BLOCKS - 1) / NUM_SUB_BLOCKS;

            // write the ZSTD magic number (-666666) before dictLength
            out.writeVInt(Comporessions.ZSTD_MAGIC_NUMBER);

            out.writeVInt(dictLength);
            out.writeVInt(blockLength);

            final int end = off + len;

            try (ZstdCompressCtx cctx = new ZstdCompressCtx()) {
                cctx.setLevel(this.compressionLevel);

                // dictionary compression first
                doCompress(bytes, off, dictLength, cctx, out);
                cctx.loadDict(new ZstdDictCompress(bytes, off, dictLength, this.compressionLevel));

                for (int start = off + dictLength; start < end; start += blockLength) {
                    int l = Math.min(blockLength, off + len - start);
                    doCompress(bytes, start, l, cctx, out);
                }
            }
        }
    }

    /**
     * zstandard decompressor
     */
    private static final class ZSTDDecompressor extends Decompressor {

        private byte[] compressed;

        /**
         * default decompressor
         */
        public ZSTDDecompressor() {
            compressed = BytesRef.EMPTY_BYTES;
        }

        /*reusable decompress function*/
        private void doDecompress(
            DataInput in, ZstdDecompressCtx dctx, BytesRef bytes, int decompressedLen)
            throws IOException {
            final int compressedLength = in.readVInt();
            if (compressedLength == 0) {
                return;
            }

            compressed = ArrayUtil.grow(compressed, compressedLength);
            in.readBytes(compressed, 0, compressedLength);

            bytes.bytes = ArrayUtil.grow(bytes.bytes, bytes.length + decompressedLen);
            int uncompressed =
                dctx.decompressByteArray(
                    bytes.bytes, bytes.length, decompressedLen, compressed, 0, compressedLength);

            if (decompressedLen != uncompressed) {
                throw new IllegalStateException(decompressedLen + " " + uncompressed);
            }
            bytes.length += uncompressed;
        }

        @Override
        public void decompress(DataInput in, int originalLength, int offset, int length, BytesRef bytes)
            throws IOException {
            assert offset + length <= originalLength;

            if (length == 0) {
                bytes.length = 0;
                return;
            }

            final int magicNumber = in.readVInt();
            // if the magic number is not -666666, the logic is broken somewhere
            if (magicNumber != Comporessions.ZSTD_MAGIC_NUMBER) {
                throw new IllegalArgumentException("Unknown magicNumber:[" + magicNumber + "]");
            }

            final int dictLength = in.readVInt();
            final int blockLength = in.readVInt();
            bytes.bytes = ArrayUtil.grow(bytes.bytes, dictLength);
            bytes.offset = bytes.length = 0;

            try (ZstdDecompressCtx dctx = new ZstdDecompressCtx()) {

                // decompress dictionary first
                doDecompress(in, dctx, bytes, dictLength);

                dctx.loadDict(new ZstdDictDecompress(bytes.bytes, 0, dictLength));

                int offsetInBlock = dictLength;
                int offsetInBytesRef = offset;

                // Skip unneeded blocks
                while (offsetInBlock + blockLength < offset) {
                    final int compressedLength = in.readVInt();
                    in.skipBytes(compressedLength);
                    offsetInBlock += blockLength;
                    offsetInBytesRef -= blockLength;
                }

                // Read blocks that intersect with the interval we need
                while (offsetInBlock < offset + length) {
                    bytes.bytes = ArrayUtil.grow(bytes.bytes, bytes.length + blockLength);
                    int l = Math.min(blockLength, originalLength - offsetInBlock);
                    doDecompress(in, dctx, bytes, l);
                    offsetInBlock += blockLength;
                }

                bytes.offset = offsetInBytesRef;
                bytes.length = length;
                assert bytes.isValid();
            }
        }

        @Override
        public Decompressor clone() {
            return new ZSTDDecompressor();
        }
    }
}

This is the proxy class that supports decompressing data written with either compression algorithm:

public class ZstdProxyDecompressor extends Decompressor {

    private final ZstdAdapterDecompressor zstdAdapterDecompressor;

    private final AdapterDecompressor adapterDecompressor;


    public ZstdProxyDecompressor(ZstdAdapterDecompressor zstdAdapterDecompressor, AdapterDecompressor adapterDecompressor) {
        this.zstdAdapterDecompressor = zstdAdapterDecompressor;
        this.adapterDecompressor = adapterDecompressor;
    }

    @Override
    public void decompress(DataInput in, int originalLength, int offset, int length, BytesRef bytes)
        throws IOException {

        assert offset + length <= originalLength;

        if (length == 0) {
            bytes.length = 0;
            return;
        }

        final int magicNumber = in.readVInt();
        // if the data is not zstd-compressed, fall back to the original algorithm's decompressor
        if (magicNumber == Comporessions.ZSTD_MAGIC_NUMBER) {
            zstdAdapterDecompressor.decompress(in, originalLength, offset, length, bytes, magicNumber);
        } else {
            adapterDecompressor.decompress(in, originalLength, offset, length, bytes, magicNumber);
        }

    }

    @Override
    public Decompressor clone() {
        return new ZstdProxyDecompressor(this.zstdAdapterDecompressor.clone(), this.adapterDecompressor.clone());
    }

}

This is the LZ4 implementation. It is identical to the stock Lucene implementation, except that it takes an extra parameter, firstVInt, which it uses as the dictLength:

public final class LZ4AdapterDecompressor extends AdapterDecompressor {

    private int[] compressedLengths;
    private byte[] buffer;

    public LZ4AdapterDecompressor() {
        compressedLengths = new int[0];
        buffer = new byte[0];
    }

    private int readCompressedLengths(DataInput in, int originalLength, int dictLength, int blockLength) throws IOException {
        in.readVInt(); // compressed length of the dictionary, unused
        int totalLength = dictLength;
        int i = 0;
        while (totalLength < originalLength) {
            compressedLengths = ArrayUtil.grow(compressedLengths, i + 1);
            compressedLengths[i++] = in.readVInt();
            totalLength += blockLength;
        }
        return i;
    }

    @Override
    public void decompress(DataInput in, int originalLength, int offset, int length, BytesRef bytes, int firstVInt)
        throws IOException {
        assert offset + length <= originalLength;

        if (length == 0) {
            bytes.length = 0;
            return;
        }

        final int dictLength = firstVInt;
        final int blockLength = in.readVInt();

        final int numBlocks = readCompressedLengths(in, originalLength, dictLength, blockLength);

        buffer = ArrayUtil.grow(buffer, dictLength + blockLength);
        bytes.length = 0;
        // Read the dictionary
        if (LZ4.decompress(in, dictLength, buffer, 0) != dictLength) {
            throw new CorruptIndexException("Illegal dict length", in);
        }

        int offsetInBlock = dictLength;
        int offsetInBytesRef = offset;
        if (offset >= dictLength) {
            offsetInBytesRef -= dictLength;

            // Skip unneeded blocks
            int numBytesToSkip = 0;
            for (int i = 0; i < numBlocks && offsetInBlock + blockLength < offset; ++i) {
                int compressedBlockLength = compressedLengths[i];
                numBytesToSkip += compressedBlockLength;
                offsetInBlock += blockLength;
                offsetInBytesRef -= blockLength;
            }
            in.skipBytes(numBytesToSkip);
        } else {
            // The dictionary contains some bytes we need, copy its content to the BytesRef
            bytes.bytes = ArrayUtil.grow(bytes.bytes, dictLength);
            System.arraycopy(buffer, 0, bytes.bytes, 0, dictLength);
            bytes.length = dictLength;
        }

        // Read blocks that intersect with the interval we need
        while (offsetInBlock < offset + length) {
            final int bytesToDecompress = Math.min(blockLength, offset + length - offsetInBlock);
            LZ4.decompress(in, bytesToDecompress, buffer, dictLength);
            bytes.bytes = ArrayUtil.grow(bytes.bytes, bytes.length + bytesToDecompress);
            System.arraycopy(buffer, dictLength, bytes.bytes, bytes.length, bytesToDecompress);
            bytes.length += bytesToDecompress;
            offsetInBlock += blockLength;
        }

        bytes.offset = offsetInBytesRef;
        bytes.length = length;
        assert bytes.isValid();

    }

    @Override
    public AdapterDecompressor clone() {
        return new LZ4AdapterDecompressor();
    }

}

This mirrors the zstd implementation above, except that it takes an extra parameter, firstVInt, carrying the first VInt the proxy already consumed (used here as the magic number):

public final class ZstdAdapterDecompressor extends AdapterDecompressor {

    byte[] compressed;

    /**
     * default decompressor
     */
    public ZstdAdapterDecompressor() {
        compressed = BytesRef.EMPTY_BYTES;
    }

    /*reusable decompress function*/
    private void doDecompress(
        DataInput in, ZstdDecompressCtx dctx, BytesRef bytes, int decompressedLen)
        throws IOException {
        final int compressedLength = in.readVInt();
        if (compressedLength == 0) {
            return;
        }

        compressed = ArrayUtil.grow(compressed, compressedLength);
        in.readBytes(compressed, 0, compressedLength);

        bytes.bytes = ArrayUtil.grow(bytes.bytes, bytes.length + decompressedLen);
        int uncompressed =
            dctx.decompressByteArray(
                bytes.bytes, bytes.length, decompressedLen, compressed, 0, compressedLength);

        if (decompressedLen != uncompressed) {
            throw new IllegalStateException(decompressedLen + " " + uncompressed);
        }
        bytes.length += uncompressed;
    }

    @Override
    public void decompress(DataInput in, int originalLength, int offset, int length, BytesRef bytes, int firstVInt)
        throws IOException {
        assert offset + length <= originalLength;

        if (length == 0) {
            bytes.length = 0;
            return;
        }

        final int magicNumber = firstVInt;
        // if the magic number is not -666666, the logic is broken somewhere
        if (magicNumber != Comporessions.ZSTD_MAGIC_NUMBER) {
            throw new IllegalArgumentException("Unknown magicNumber:[" + magicNumber + "]");
        }

        final int dictLength = in.readVInt();
        final int blockLength = in.readVInt();
        bytes.bytes = ArrayUtil.grow(bytes.bytes, dictLength);
        bytes.offset = bytes.length = 0;

        try (ZstdDecompressCtx dctx = new ZstdDecompressCtx()) {

            // decompress dictionary first
            doDecompress(in, dctx, bytes, dictLength);

            dctx.loadDict(new ZstdDictDecompress(bytes.bytes, 0, dictLength));

            int offsetInBlock = dictLength;
            int offsetInBytesRef = offset;

            // Skip unneeded blocks
            while (offsetInBlock + blockLength < offset) {
                final int compressedLength = in.readVInt();
                in.skipBytes(compressedLength);
                offsetInBlock += blockLength;
                offsetInBytesRef -= blockLength;
            }

            // Read blocks that intersect with the interval we need
            while (offsetInBlock < offset + length) {
                bytes.bytes = ArrayUtil.grow(bytes.bytes, bytes.length + blockLength);
                int l = Math.min(blockLength, originalLength - offsetInBlock);
                doDecompress(in, dctx, bytes, l);
                offsetInBlock += blockLength;
            }

            bytes.offset = offsetInBytesRef;
            bytes.length = length;
            assert bytes.isValid();
        }
    }

    @Override
    public ZstdAdapterDecompressor clone() {
        return new ZstdAdapterDecompressor();
    }

}
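
Not an answer to the risk question, but for context, a minimal sketch of how a CompressionMode like the one above is typically wired into stored fields, assuming Lucene 8.7's CompressingStoredFieldsFormat constructor (formatName, compressionMode, chunkSize, maxDocsPerChunk, blockShift). The codec name, chunk sizes and delegate below are illustrative, not tuned or tested values, and on the ES side the codec still has to be registered (SPI plus the ES codec service) so that both old and new segments can be opened:

import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.StoredFieldsFormat;
import org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat;

public class ZstdCodec extends FilterCodec {

    private final StoredFieldsFormat storedFieldsFormat =
        new CompressingStoredFieldsFormat(
            "ZstdStoredFields",                       // format name written into the segment
            new ZstdWithPresetDictCompressionMode(),  // the mode defined above (assuming it is visible from this package)
            16 * 1024,                                // chunk size (illustrative)
            128,                                      // max docs per chunk (illustrative)
            10);                                      // block shift of the fields index (illustrative)

    public ZstdCodec() {
        super("ZstdCodec", Codec.forName("Lucene87")); // delegate everything else to Lucene87
    }

    @Override
    public StoredFieldsFormat storedFieldsFormat() {
        return storedFieldsFormat;
    }
}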

Which parts of a Lucene search happen in memory, and which on disk?

The term dictionary index is cached in memory as an FST. After the term dictionary index resolves the keyword to the position of its block in the term dictionary, Lucene goes to disk to look up the term, which greatly reduces the number of disk I/Os.
1. (memory) The .tip file is loaded into memory, and the FST is used to find the position of the matching suffix block in the .tim file;
2. (memory) From that block position, the suffix and the associated postings metadata are read;
3. (memory) Using the postings metadata found in .tim, the doc IDs and term frequencies are located in the .doc file, completing the search;
4. (disk) Once the documents are located, Lucene goes to the .fdx field index and the .fdt file and uses the forward index to fetch the target documents.
Is this description correct?
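
For reference, the same stages expressed via the public API, as a small self-contained sketch (index path, field and term are placeholders). Note that only the .tip FST is held in memory up front; the .tim, .doc and .fdx/.fdt reads go through the directory implementation and only hit RAM when the page cache already holds those blocks:

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class LookupStages {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")))) {
            for (LeafReaderContext leaf : reader.leaves()) {
                Terms terms = leaf.reader().terms("content");         // term dictionary of one segment
                if (terms == null) continue;
                TermsEnum te = terms.iterator();
                if (!te.seekExact(new BytesRef("lucene"))) continue;  // FST (.tip) -> block in .tim
                PostingsEnum postings = te.postings(null, PostingsEnum.FREQS); // .doc: doc IDs + freqs
                for (int doc = postings.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = postings.nextDoc()) {
                    // stored fields (.fdx/.fdt) are only read when we actually fetch the document
                    String title = leaf.reader().document(doc).get("title");
                    System.out.println(doc + " freq=" + postings.freq() + " title=" + title);
                }
            }
        }
    }
}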

SkipList

Lucene uses skip lists to speed up lookups. Wouldn't binary search over a sorted array be faster than a skip list? Is the skip list chosen because it handles insertions and deletions more efficiently than a sorted array?

Fuzzy spelling correction

Could you explain how spelling correction is implemented? I've read the material available online but still only half understand it.
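
At query time the core trick is that FuzzyQuery compiles the term and a maximum edit distance into a Levenshtein automaton and intersects it with the term dictionary's FST, so only terms within that edit distance are ever visited; spell correction proper (suggesting a replacement) lives in the lucene-suggest module (e.g. DirectSpellChecker) and reuses the same automaton intersection plus frequency-based ranking of the candidates. A tiny illustration, with made-up field and term names:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;

public class FuzzyExample {
    public static void main(String[] args) {
        // matches terms within 2 edits of "lucene" in field "content", e.g. "lucane", "lucen"
        FuzzyQuery q = new FuzzyQuery(new Term("content", "lucene"), 2);
        System.out.println(q);
    }
}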

How can a Lucene query string be converted directly into a Query object?

I tried QueryParser, but it requires a default field to be passed in, and I don't know which fields appear in my string — there may be one, several, or a combination such as:
+fieldA:F +(+(((fieldB:cd)^300.0 (fieldC:sc)^200.0 (fieldD:china)^100.0 +(((fieldA:[1663210081 TO 1663814881])^300.0 (fieldA:[1662605281 TO 1663814881])^200.0)~1))~1) +(((fieldA:[1663210081 TO 1663814881])^300.0 (fieldA:[1662605281 TO 1663814881])^200.0)~1))

How can this be turned into a Query object that I can hand to Lucene to search with?
Any guidance would be appreciated, thanks 🙏
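
The classic QueryParser's default field is only used for clauses that do not name a field themselves; explicitly fielded clauses such as fieldB:cd keep their own field, so you can pass any default field and still parse strings that mix several fields. One caveat: a string like the one above looks like Query.toString() output, which is not guaranteed to parse back into an identical Query (for example the ~1 minimum-should-match on groups and ranges over numeric/point fields do not round-trip), so where possible it is better to keep the original Query or rebuild it programmatically. A minimal sketch, with placeholder analyzer and default field:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class ParseQueryString {
    public static void main(String[] args) throws Exception {
        String queryString = "+fieldA:F +(fieldB:cd^300.0 fieldC:sc^200.0)";
        QueryParser parser = new QueryParser("fieldA", new StandardAnalyzer()); // "fieldA" is only the default
        Query query = parser.parse(queryString);
        System.out.println(query);
    }
}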

A question from a fan

Where is the LSM tree used in Lucene? Some blogs say that string range search is backed by LSM-tree storage underneath — do you have a post that covers the underlying principles of this part?

A question about sorting

A quick question: is there any difference between sorting the documents in advance before adding them to the index, and using indexSort?
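
There is a difference: if you sort the documents yourself, the order is just whatever the insertion order happened to be — Lucene neither records nor enforces it, so deletes, updates and merges can break it and queries cannot rely on it. With IndexWriterConfig.setIndexSort the sort is recorded in the segment metadata and re-applied at every flush and merge, which is what lets sorted-index optimizations (e.g. early termination on a matching sort) use it. A minimal sketch; the field name is a placeholder and the field must be indexed with doc values of the matching type:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.ByteBuffersDirectory;

public class IndexSortExample {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        // every flushed and merged segment will be stored sorted by "timestamp"
        config.setIndexSort(new Sort(new SortField("timestamp", SortField.Type.LONG)));
        try (IndexWriter writer = new IndexWriter(new ByteBuffersDirectory(), config)) {
            // add documents in any order here; Lucene reorders them at flush/merge time
        }
    }
}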

Some questions about flush, commit, and merge

During a merge, the segment set that IndexWriter hands to the MergePolicy is a SegmentInfos object, which should include all segments. But segments that have been flushed and not yet committed have no segments_N entry and may still sit in the file system cache — how do these segments take part in the merge?

Lucene index write strategy

A question: Lucene writes its index in an LSM-like way, and the current merge policy looks a lot like size-tiered compaction. Does it also suffer from space amplification?

MatchAllDocsQuery

Could you explain MatchAllDocsQuery? I see it does not rewrite the query — how does it fetch all doc IDs?
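
As far as I understand it, MatchAllDocsQuery has nothing to rewrite: its Weight creates, for each segment, a constant-score scorer whose iterator is simply DocIdSetIterator.all(maxDoc), i.e. every doc ID from 0 to maxDoc - 1, with deleted docs filtered out as for any other query. A tiny usage sketch; "reader" is assumed to be an already-open DirectoryReader:

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.TopDocs;

public class MatchAllExample {
    static TopDocs firstTen(DirectoryReader reader) throws Exception {
        IndexSearcher searcher = new IndexSearcher(reader);
        return searcher.search(new MatchAllDocsQuery(), 10); // top 10 of all live documents
    }
}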
