
lucene-7.x-9.x's People

Contributors

luxugang


lucene-7.x-9.x's Issues

What are the head and tail queues in BooleanScorer for?

Reading the 9.1.0 source, I can't see what special purpose the head and tail queues in BooleanScorer serve; a single queue seems like it would be enough. Your article describes it as follows:
[screenshot from the article]

But looking at the source, the lowest-priority entry in head is actually just the one with the globally smallest next doc ID, so why keep separate head and tail queues?
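
A guess at the intent, hedged since it is written from memory of the 9.x code rather than re-reading it: the tail queue is sized minShouldMatch - 1, so for a plain disjunction (minShouldMatch == 1) it never holds anything and the two queues do degenerate into one, which would explain what you observed. The split only pays off when minShouldMatch > 1: the cheapest clauses are parked in tail and advanced only when the clauses already positioned in the current window could, together with them, still reach minShouldMatch. A self-contained toy sketch (not Lucene code) of that partitioning idea:

import java.util.Comparator;
import java.util.PriorityQueue;

public class HeadTailSketch {

    // a stand-in for a clause's disjunct iterator: sorted doc IDs plus a cost
    static final class SubIterator {
        final int[] docs;
        int pos = 0;
        SubIterator(int... docs) { this.docs = docs; }
        int nextDoc() { return pos < docs.length ? docs[pos] : Integer.MAX_VALUE; }
        long cost() { return docs.length; }
    }

    public static void main(String[] args) {
        int minShouldMatch = 2;

        // head: ordered by the next doc each clause is positioned on
        PriorityQueue<SubIterator> head =
            new PriorityQueue<>(Comparator.comparingInt(SubIterator::nextDoc));
        // tail: keeps at most (minShouldMatch - 1) of the cheapest clauses
        PriorityQueue<SubIterator> tail =
            new PriorityQueue<>(Comparator.comparingLong(SubIterator::cost).reversed());

        SubIterator[] clauses = {
            new SubIterator(1, 5, 9), new SubIterator(2, 5, 7, 8, 11), new SubIterator(5, 9)
        };
        for (SubIterator it : clauses) {
            tail.add(it);
            if (tail.size() > minShouldMatch - 1) {
                head.add(tail.poll()); // evict the most expensive clauses into head
            }
        }
        // Clauses left in tail are advanced lazily, only when the head clauses inside the
        // current window cannot reach minShouldMatch on their own. With minShouldMatch == 1
        // the tail capacity is 0 and everything lands in head, i.e. the behaviour you saw.
        System.out.println("head=" + head.size() + " tail=" + tail.size());
    }
}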

Some questions about the WAND algorithm

We are currently on Elasticsearch 5.4 and would like to backport the WAND algorithm from newer Lucene versions into our current Lucene to speed up term queries. Is this feasible?
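
For reference, a minimal, self-contained sketch (not Lucene's WANDScorer and not tied to ES 5.4) of the core pruning test you would be porting: each clause exposes an upper bound on the score it can contribute, and a candidate document (or a whole block, in the block-max variant) is skipped when the sum of the bounds of the clauses that could match it cannot beat the current k-th best score. The practical obstacle with a backport is usually not the scorer itself but that Lucene 8+ stores per-block impact/max-score metadata in the postings, which the 5.x index format does not have.

public class WandSketch {

    // upper bounds of what each clause can contribute, and which clauses match the candidate
    static boolean canCompete(float[] clauseMaxScores, boolean[] clauseMatchesDoc,
                              float minCompetitiveScore) {
        float upperBound = 0f;
        for (int i = 0; i < clauseMaxScores.length; i++) {
            if (clauseMatchesDoc[i]) {
                upperBound += clauseMaxScores[i];
            }
        }
        // if even the most optimistic total cannot beat the current k-th best score,
        // the candidate can be skipped without computing its real score
        return upperBound >= minCompetitiveScore;
    }

    public static void main(String[] args) {
        float[] maxScores = {1.2f, 0.4f, 0.7f};
        boolean[] matches = {true, false, true};
        System.out.println(canCompete(maxScores, matches, 2.5f)); // false -> prune
        System.out.println(canCompete(maxScores, matches, 1.5f)); // true  -> score it
    }
}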

Questions about making an ES index support decompression with two different compression algorithms

Hello. We need to add zstd compression to the indices of a production ES 7.10.0 cluster (Lucene 8.7.0), and we want the old LZ4-compressed indices to switch to zstd transparently, without users noticing.

Below is my ZSTDCompressor implementation: when compressing, it writes a magic number before writing dictLength. I also defined a proxy, ZstdProxyDecompressor, that can decompress data written with either algorithm; at decompression time it picks the algorithm based on the magic number. Is this approach viable? The idea is that after upgrading the cluster image, the index's compression algorithm is switched to the zstd implementation; old data stays readable through the dual-algorithm decompressor, while new or updated data is written with zstd.

My own tests of decompressing both formats seem fine so far, and I am now trying to modify the codec of the ES index to change its compression algorithm.

However, since I haven't studied Lucene or ES in depth before, I don't know whether there are risks I haven't considered. Any pointers would be appreciated, thanks!

This is the implementation class that supports the zstd compression algorithm:

public class ZstdWithPresetDictCompressionMode extends CompressionMode {
    private static final int NUM_SUB_BLOCKS = 10;
    private final int level;
    public static final int defaultLevel = 3;
    private static final int DICT_SIZE_FACTOR = 6;

    /**
     * default constructor
     */
    ZstdWithPresetDictCompressionMode() {
        this.level = defaultLevel;
    }

    /**
     * compression mode for a given compression level
     */
    ZstdWithPresetDictCompressionMode(int level) {
        this.level = level;
    }

    @Override
    public Compressor newCompressor() {
        return new ZSTDCompressor(level);
    }

    @Override
    public Decompressor newDecompressor() {
        return new ZSTDDecompressor();
    }

    /**
     * zstandard compressor
     */
    private static final class ZSTDCompressor extends Compressor {

        int compressionLevel;
        byte[] compressedBuffer;

        /**
         * compressor with a given compression level
         */
        public ZSTDCompressor(int compressionLevel) {
            this.compressionLevel = compressionLevel;
            compressedBuffer = BytesRef.EMPTY_BYTES;
        }

        @Override
        public void close() throws IOException {
        }

        /*reusable compress function*/
        private void doCompress(byte[] bytes, int off, int len, ZstdCompressCtx cctx, DataOutput out)
            throws IOException {
            if (len == 0) {
                out.writeVInt(0);
                return;
            }
            final int maxCompressedLength = (int) Zstd.compressBound(len);
            compressedBuffer = ArrayUtil.grow(compressedBuffer, maxCompressedLength);

            int compressedSize =
                cctx.compressByteArray(compressedBuffer, 0, compressedBuffer.length, bytes, off, len);

            out.writeVInt(compressedSize);
            out.writeBytes(compressedBuffer, compressedSize);
        }

        @Override
        public void compress(byte[] bytes, int off, int len, DataOutput out) throws IOException {
            final int dictLength = len / (NUM_SUB_BLOCKS * DICT_SIZE_FACTOR);
            final int blockLength = (len - dictLength + NUM_SUB_BLOCKS - 1) / NUM_SUB_BLOCKS;

            // write the ZSTD magic number (-666666) before dictLength
            out.writeVInt(Comporessions.ZSTD_MAGIC_NUMBER);

            out.writeVInt(dictLength);
            out.writeVInt(blockLength);

            final int end = off + len;

            try (ZstdCompressCtx cctx = new ZstdCompressCtx()) {
                cctx.setLevel(this.compressionLevel);

                // dictionary compression first
                doCompress(bytes, off, dictLength, cctx, out);
                cctx.loadDict(new ZstdDictCompress(bytes, off, dictLength, this.compressionLevel));

                for (int start = off + dictLength; start < end; start += blockLength) {
                    int l = Math.min(blockLength, off + len - start);
                    doCompress(bytes, start, l, cctx, out);
                }
            }
        }
    }

    /**
     * zstandard decompressor
     */
    private static final class ZSTDDecompressor extends Decompressor {

        private byte[] compressed;

        /**
         * default decompressor
         */
        public ZSTDDecompressor() {
            compressed = BytesRef.EMPTY_BYTES;
        }

        /*reusable decompress function*/
        private void doDecompress(
            DataInput in, ZstdDecompressCtx dctx, BytesRef bytes, int decompressedLen)
            throws IOException {
            final int compressedLength = in.readVInt();
            if (compressedLength == 0) {
                return;
            }

            compressed = ArrayUtil.grow(compressed, compressedLength);
            in.readBytes(compressed, 0, compressedLength);

            bytes.bytes = ArrayUtil.grow(bytes.bytes, bytes.length + decompressedLen);
            int uncompressed =
                dctx.decompressByteArray(
                    bytes.bytes, bytes.length, decompressedLen, compressed, 0, compressedLength);

            if (decompressedLen != uncompressed) {
                throw new IllegalStateException(decompressedLen + " " + uncompressed);
            }
            bytes.length += uncompressed;
        }

        @Override
        public void decompress(DataInput in, int originalLength, int offset, int length, BytesRef bytes)
            throws IOException {
            assert offset + length <= originalLength;

            if (length == 0) {
                bytes.length = 0;
                return;
            }

            final int magicNumber = in.readVInt();
            // if the magic number is not -666666, the logic is broken somewhere
            if (magicNumber != Comporessions.ZSTD_MAGIC_NUMBER) {
                throw new IllegalArgumentException("Unknown magicNumber:[" + magicNumber + "]");
            }

            final int dictLength = in.readVInt();
            final int blockLength = in.readVInt();
            bytes.bytes = ArrayUtil.grow(bytes.bytes, dictLength);
            bytes.offset = bytes.length = 0;

            try (ZstdDecompressCtx dctx = new ZstdDecompressCtx()) {

                // decompress dictionary first
                doDecompress(in, dctx, bytes, dictLength);

                dctx.loadDict(new ZstdDictDecompress(bytes.bytes, 0, dictLength));

                int offsetInBlock = dictLength;
                int offsetInBytesRef = offset;

                // Skip unneeded blocks
                while (offsetInBlock + blockLength < offset) {
                    final int compressedLength = in.readVInt();
                    in.skipBytes(compressedLength);
                    offsetInBlock += blockLength;
                    offsetInBytesRef -= blockLength;
                }

                // Read blocks that intersect with the interval we need
                while (offsetInBlock < offset + length) {
                    bytes.bytes = ArrayUtil.grow(bytes.bytes, bytes.length + blockLength);
                    int l = Math.min(blockLength, originalLength - offsetInBlock);
                    doDecompress(in, dctx, bytes, l);
                    offsetInBlock += blockLength;
                }

                bytes.offset = offsetInBytesRef;
                bytes.length = length;
                assert bytes.isValid();
            }
        }

        @Override
        public Decompressor clone() {
            return new ZSTDDecompressor();
        }
    }
}

This is the proxy class that supports decompressing data written with either compression algorithm:

public class ZstdProxyDecompressor extends Decompressor {

    private final ZstdAdapterDecompressor zstdAdapterDecompressor;

    private final AdapterDecompressor adapterDecompressor;


    public ZstdProxyDecompressor(ZstdAdapterDecompressor zstdAdapterDecompressor, AdapterDecompressor adapterDecompressor) {
        this.zstdAdapterDecompressor = zstdAdapterDecompressor;
        this.adapterDecompressor = adapterDecompressor;
    }

    @Override
    public void decompress(DataInput in, int originalLength, int offset, int length, BytesRef bytes)
        throws IOException {

        assert offset + length <= originalLength;

        if (length == 0) {
            bytes.length = 0;
            return;
        }

        final int magicNumber = in.readVInt();
        // if the data is not zstd-compressed, fall back to the original algorithm's decompressor
        if (magicNumber == Comporessions.ZSTD_MAGIC_NUMBER) {
            zstdAdapterDecompressor.decompress(in, originalLength, offset, length, bytes, magicNumber);
        } else {
            adapterDecompressor.decompress(in, originalLength, offset, length, bytes, magicNumber);
        }

    }

    @Override
    public Decompressor clone() {
        return new ZstdProxyDecompressor(this.zstdAdapterDecompressor.clone(), this.adapterDecompressor.clone());
    }

}

This is the LZ4 implementation. It is identical to the stock Lucene implementation, except that it takes an extra parameter, firstVInt, which it uses as the dictLength:

public final class LZ4AdapterDecompressor extends AdapterDecompressor {

    private int[] compressedLengths;
    private byte[] buffer;

    public LZ4AdapterDecompressor() {
        compressedLengths = new int[0];
        buffer = new byte[0];
    }

    private int readCompressedLengths(DataInput in, int originalLength, int dictLength, int blockLength) throws IOException {
        in.readVInt(); // compressed length of the dictionary, unused
        int totalLength = dictLength;
        int i = 0;
        while (totalLength < originalLength) {
            compressedLengths = ArrayUtil.grow(compressedLengths, i + 1);
            compressedLengths[i++] = in.readVInt();
            totalLength += blockLength;
        }
        return i;
    }

    @Override
    public void decompress(DataInput in, int originalLength, int offset, int length, BytesRef bytes, int firstVInt)
        throws IOException {
        assert offset + length <= originalLength;

        if (length == 0) {
            bytes.length = 0;
            return;
        }

        final int dictLength = firstVInt;
        final int blockLength = in.readVInt();

        final int numBlocks = readCompressedLengths(in, originalLength, dictLength, blockLength);

        buffer = ArrayUtil.grow(buffer, dictLength + blockLength);
        bytes.length = 0;
        // Read the dictionary
        if (LZ4.decompress(in, dictLength, buffer, 0) != dictLength) {
            throw new CorruptIndexException("Illegal dict length", in);
        }

        int offsetInBlock = dictLength;
        int offsetInBytesRef = offset;
        if (offset >= dictLength) {
            offsetInBytesRef -= dictLength;

            // Skip unneeded blocks
            int numBytesToSkip = 0;
            for (int i = 0; i < numBlocks && offsetInBlock + blockLength < offset; ++i) {
                int compressedBlockLength = compressedLengths[i];
                numBytesToSkip += compressedBlockLength;
                offsetInBlock += blockLength;
                offsetInBytesRef -= blockLength;
            }
            in.skipBytes(numBytesToSkip);
        } else {
            // The dictionary contains some bytes we need, copy its content to the BytesRef
            bytes.bytes = ArrayUtil.grow(bytes.bytes, dictLength);
            System.arraycopy(buffer, 0, bytes.bytes, 0, dictLength);
            bytes.length = dictLength;
        }

        // Read blocks that intersect with the interval we need
        while (offsetInBlock < offset + length) {
            final int bytesToDecompress = Math.min(blockLength, offset + length - offsetInBlock);
            LZ4.decompress(in, bytesToDecompress, buffer, dictLength);
            bytes.bytes = ArrayUtil.grow(bytes.bytes, bytes.length + bytesToDecompress);
            System.arraycopy(buffer, dictLength, bytes.bytes, bytes.length, bytesToDecompress);
            bytes.length += bytesToDecompress;
            offsetInBlock += blockLength;
        }

        bytes.offset = offsetInBytesRef;
        bytes.length = length;
        assert bytes.isValid();

    }

    @Override
    public AdapterDecompressor clone() {
        return new LZ4AdapterDecompressor();
    }

}

This mirrors the zstd implementation above, except that it takes an extra parameter, firstVInt, carrying the first VInt the proxy already consumed (used here as the magic number):

public final class ZstdAdapterDecompressor extends AdapterDecompressor {

    byte[] compressed;

    /**
     * default decompressor
     */
    public ZstdAdapterDecompressor() {
        compressed = BytesRef.EMPTY_BYTES;
    }

    /*reusable decompress function*/
    private void doDecompress(
        DataInput in, ZstdDecompressCtx dctx, BytesRef bytes, int decompressedLen)
        throws IOException {
        final int compressedLength = in.readVInt();
        if (compressedLength == 0) {
            return;
        }

        compressed = ArrayUtil.grow(compressed, compressedLength);
        in.readBytes(compressed, 0, compressedLength);

        bytes.bytes = ArrayUtil.grow(bytes.bytes, bytes.length + decompressedLen);
        int uncompressed =
            dctx.decompressByteArray(
                bytes.bytes, bytes.length, decompressedLen, compressed, 0, compressedLength);

        if (decompressedLen != uncompressed) {
            throw new IllegalStateException(decompressedLen + " " + uncompressed);
        }
        bytes.length += uncompressed;
    }

    @Override
    public void decompress(DataInput in, int originalLength, int offset, int length, BytesRef bytes, int firstVInt)
        throws IOException {
        assert offset + length <= originalLength;

        if (length == 0) {
            bytes.length = 0;
            return;
        }

        final int magicNumber = firstVInt;
        // if the magic number is not -666666, the logic is broken somewhere
        if (magicNumber != Comporessions.ZSTD_MAGIC_NUMBER) {
            throw new IllegalArgumentException("Unknown magicNumber:[" + magicNumber + "]");
        }

        final int dictLength = in.readVInt();
        final int blockLength = in.readVInt();
        bytes.bytes = ArrayUtil.grow(bytes.bytes, dictLength);
        bytes.offset = bytes.length = 0;

        try (ZstdDecompressCtx dctx = new ZstdDecompressCtx()) {

            // decompress dictionary first
            doDecompress(in, dctx, bytes, dictLength);

            dctx.loadDict(new ZstdDictDecompress(bytes.bytes, 0, dictLength));

            int offsetInBlock = dictLength;
            int offsetInBytesRef = offset;

            // Skip unneeded blocks
            while (offsetInBlock + blockLength < offset) {
                final int compressedLength = in.readVInt();
                in.skipBytes(compressedLength);
                offsetInBlock += blockLength;
                offsetInBytesRef -= blockLength;
            }

            // Read blocks that intersect with the interval we need
            while (offsetInBlock < offset + length) {
                bytes.bytes = ArrayUtil.grow(bytes.bytes, bytes.length + blockLength);
                int l = Math.min(blockLength, originalLength - offsetInBlock);
                doDecompress(in, dctx, bytes, l);
                offsetInBlock += blockLength;
            }

            bytes.offset = offsetInBytesRef;
            bytes.length = length;
            assert bytes.isValid();
        }
    }

    @Override
    public ZstdAdapterDecompressor clone() {
        return new ZstdAdapterDecompressor();
    }

}
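
Not an answer to the risk question, but for context, a minimal sketch of how a CompressionMode like the one above is typically wired into stored fields, assuming Lucene 8.7's CompressingStoredFieldsFormat constructor (formatName, compressionMode, chunkSize, maxDocsPerChunk, blockShift). The codec name, chunk sizes and delegate below are illustrative, not tuned or tested values, and on the ES side the codec still has to be registered (SPI plus the ES codec service) so that both old and new segments can be opened:

import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.StoredFieldsFormat;
import org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat;

public class ZstdCodec extends FilterCodec {

    private final StoredFieldsFormat storedFieldsFormat =
        new CompressingStoredFieldsFormat(
            "ZstdStoredFields",                       // format name written into the segment
            new ZstdWithPresetDictCompressionMode(),  // the mode defined above (assuming it is visible from this package)
            16 * 1024,                                // chunk size (illustrative)
            128,                                      // max docs per chunk (illustrative)
            10);                                      // block shift of the fields index (illustrative)

    public ZstdCodec() {
        super("ZstdCodec", Codec.forName("Lucene87")); // delegate everything else to Lucene87
    }

    @Override
    public StoredFieldsFormat storedFieldsFormat() {
        return storedFieldsFormat;
    }
}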

Which parts of a Lucene search happen in memory, and which on disk?

The term dictionary index is cached in memory as an FST. After the term dictionary index resolves the keyword to the position of its block in the term dictionary, Lucene goes to disk to look up the term, which greatly reduces the number of disk I/Os.
1. (memory) The .tip file is loaded into memory, and the FST is used to find the position of the matching suffix block in the .tim file;
2. (memory) From that block position, the suffix and the associated postings metadata are read;
3. (memory) Using the postings metadata found in .tim, the doc IDs and term frequencies are located in the .doc file, completing the search;
4. (disk) Once the documents are located, Lucene goes to the .fdx field index and the .fdt file and uses the forward index to fetch the target documents.
Is this description correct?
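
For reference, the same stages expressed via the public API, as a small self-contained sketch (index path, field and term are placeholders). Note that only the .tip FST is held in memory up front; the .tim, .doc and .fdx/.fdt reads go through the directory implementation and only hit RAM when the page cache already holds those blocks:

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class LookupStages {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")))) {
            for (LeafReaderContext leaf : reader.leaves()) {
                Terms terms = leaf.reader().terms("content");         // term dictionary of one segment
                if (terms == null) continue;
                TermsEnum te = terms.iterator();
                if (!te.seekExact(new BytesRef("lucene"))) continue;  // FST (.tip) -> block in .tim
                PostingsEnum postings = te.postings(null, PostingsEnum.FREQS); // .doc: doc IDs + freqs
                for (int doc = postings.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = postings.nextDoc()) {
                    // stored fields (.fdx/.fdt) are only read when we actually fetch the document
                    String title = leaf.reader().document(doc).get("title");
                    System.out.println(doc + " freq=" + postings.freq() + " title=" + title);
                }
            }
        }
    }
}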

SkipList

Lucene uses skip lists to speed up lookups. Wouldn't binary search over a sorted array be faster than a skip list? Is the skip list chosen because it handles insertions and deletions more efficiently than a sorted array?

Fuzzy spelling correction

Could you explain how spelling correction is implemented? I've read the material available online but still only half understand it.
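
At query time the core trick is that FuzzyQuery compiles the term and a maximum edit distance into a Levenshtein automaton and intersects it with the term dictionary's FST, so only terms within that edit distance are ever visited; spell correction proper (suggesting a replacement) lives in the lucene-suggest module (e.g. DirectSpellChecker) and reuses the same automaton intersection plus frequency-based ranking of the candidates. A tiny illustration, with made-up field and term names:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;

public class FuzzyExample {
    public static void main(String[] args) {
        // matches terms within 2 edits of "lucene" in field "content", e.g. "lucane", "lucen"
        FuzzyQuery q = new FuzzyQuery(new Term("content", "lucene"), 2);
        System.out.println(q);
    }
}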

How can a Lucene query string be converted directly into a Query object?

I tried QueryParser, but it requires a default field to be passed in, and I don't know which fields appear in my string — there may be one, several, or a combination such as:
+fieldA:F +(+(((fieldB:cd)^300.0 (fieldC:sc)^200.0 (fieldD:china)^100.0 +(((fieldA:[1663210081 TO 1663814881])^300.0 (fieldA:[1662605281 TO 1663814881])^200.0)~1))~1) +(((fieldA:[1663210081 TO 1663814881])^300.0 (fieldA:[1662605281 TO 1663814881])^200.0)~1))

How can this be turned into a Query object that I can hand to Lucene to search with?
Any guidance would be appreciated, thanks 🙏
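
The classic QueryParser's default field is only used for clauses that do not name a field themselves; explicitly fielded clauses such as fieldB:cd keep their own field, so you can pass any default field and still parse strings that mix several fields. One caveat: a string like the one above looks like Query.toString() output, which is not guaranteed to parse back into an identical Query (for example the ~1 minimum-should-match on groups and ranges over numeric/point fields do not round-trip), so where possible it is better to keep the original Query or rebuild it programmatically. A minimal sketch, with placeholder analyzer and default field:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class ParseQueryString {
    public static void main(String[] args) throws Exception {
        String queryString = "+fieldA:F +(fieldB:cd^300.0 fieldC:sc^200.0)";
        QueryParser parser = new QueryParser("fieldA", new StandardAnalyzer()); // "fieldA" is only the default
        Query query = parser.parse(queryString);
        System.out.println(query);
    }
}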

A question from a fan

Where is the LSM tree used in Lucene? Some blogs say that string range search is backed by LSM-tree storage underneath — do you have a post that covers the underlying principles of this part?

A question about sorting

A quick question: is there any difference between sorting the documents in advance before adding them to the index, and using indexSort?
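
There is a difference: if you sort the documents yourself, the order is just whatever the insertion order happened to be — Lucene neither records nor enforces it, so deletes, updates and merges can break it and queries cannot rely on it. With IndexWriterConfig.setIndexSort the sort is recorded in the segment metadata and re-applied at every flush and merge, which is what lets sorted-index optimizations (e.g. early termination on a matching sort) use it. A minimal sketch; the field name is a placeholder and the field must be indexed with doc values of the matching type:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.ByteBuffersDirectory;

public class IndexSortExample {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        // every flushed and merged segment will be stored sorted by "timestamp"
        config.setIndexSort(new Sort(new SortField("timestamp", SortField.Type.LONG)));
        try (IndexWriter writer = new IndexWriter(new ByteBuffersDirectory(), config)) {
            // add documents in any order here; Lucene reorders them at flush/merge time
        }
    }
}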

Some questions about flush, commit, and merge

During a merge, the segment set that IndexWriter hands to the MergePolicy is a SegmentInfos object, which should include all segments. But segments that have been flushed and not yet committed have no segments_N entry and may still sit in the file system cache — how do these segments take part in the merge?

Lucene index write strategy

A question: Lucene writes its index in an LSM-like way, and the current merge policy looks a lot like size-tiered compaction. Does it also suffer from space amplification?

MatchAllDocsQuery

Could you explain MatchAllDocsQuery? I see it does not rewrite the query — how does it fetch all doc IDs?
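
As far as I understand it, MatchAllDocsQuery has nothing to rewrite: its Weight creates, for each segment, a constant-score scorer whose iterator is simply DocIdSetIterator.all(maxDoc), i.e. every doc ID from 0 to maxDoc - 1, with deleted docs filtered out as for any other query. A tiny usage sketch; "reader" is assumed to be an already-open DirectoryReader:

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.TopDocs;

public class MatchAllExample {
    static TopDocs firstTen(DirectoryReader reader) throws Exception {
        IndexSearcher searcher = new IndexSearcher(reader);
        return searcher.search(new MatchAllDocsQuery(), 10); // top 10 of all live documents
    }
}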
