
Welcome to Apache Lucene.NET


Powerful Full-text search for .NET

Apache Lucene.NET is an open-source full-text search library written in C#. It is a port of the popular Java Apache Lucene project.

Apache Lucene.NET is a .NET library providing powerful indexing and search features, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities.

Lucene.NET version 4.8 (still in beta) runs everywhere .NET runs, including Windows, Unix, macOS, Android, and iOS.
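For a first taste of the API, a minimal index-and-search round trip against the 4.8.0 beta might look like the sketch below. It assumes the Lucene.Net and Lucene.Net.QueryParser NuGet packages are installed; the "title" field name and sample text are arbitrary.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

// Compatibility version passed to analyzers and other components.
const LuceneVersion version = LuceneVersion.LUCENE_48;

using var dir = new RAMDirectory(); // in-memory; use FSDirectory.Open(path) for disk
var analyzer = new StandardAnalyzer(version);

// Index a single document.
using (var writer = new IndexWriter(dir, new IndexWriterConfig(version, analyzer)))
{
    writer.AddDocument(new Document
    {
        new TextField("title", "Powerful full-text search for .NET", Field.Store.YES)
    });
}

// Search it back.
using var reader = DirectoryReader.Open(dir);
var searcher = new IndexSearcher(reader);
var query = new QueryParser(version, "title", analyzer).Parse("search");
TopDocs hits = searcher.Search(query, 10);
Console.WriteLine($"Found {hits.TotalHits} hit(s)");
```
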

The Apache Lucene.NET website is at: http://lucenenet.apache.org

Supported Frameworks

Lucene.NET 3.0.3

  • .NET Framework 4.0
  • .NET Framework 3.5

Lucene.NET 4.8.0

  • .NET Standard 2.1
  • .NET Standard 2.0
  • .NET Framework 4.5+

Status

Latest Release Version: Lucene.NET 3.0.3

Working toward Lucene.NET 4.8.0 (currently in BETA)

  • The beta version is extremely stable
  • More than 7,800 passing unit tests
  • Integrates well with .NET 6.0, .NET 5.0 and .NET Core 2+
  • Supports .NET Standard 2.1 and .NET Standard 2.0
  • Supports .NET Framework 4.5+
  • Some developers already use it in production environments

Download

Lucene.NET 3.0.3

Core Library


PM> Install-Package Lucene.Net
All Packages

Lucene.NET 4.8.0

Core Library


PM> Install-Package Lucene.Net -Pre
All Packages

Documentation

We have preliminary documentation for Lucene.NET 4.8.0 on the Lucene.NET Website.

The API is similar to Java Lucene 4.8.0, which you may also find helpful to review.

NOTE: We are working on fixing issues with the documentation, but could use more help since it is a massive project. See #206.

Legacy Versions

Demos & Tools

The Lucene.Net.Demo project contains several demos implemented as simple console applications; they can be copied and pasted into Visual Studio or compiled on the command line.

There is also a dotnet command line tool available on NuGet. It contains all of the demos as well as tools for maintaining your Lucene.NET index, with operations such as splitting, merging, listing segment info, fixing, deleting segments, and upgrading. Always be sure to back up your index before running any commands against it!

dotnet tool install lucene-cli -g --version 4.8.0-beta00015

NOTE: The version of the CLI you install should match the version of Lucene.NET you use.

Once installed, you can explore the commands and options that are available by entering the command lucene.

lucene-cli Documentation

How to Contribute

We love getting contributions! Read our Contribution Guide or read on for ways that you can help.

Join Mailing Lists

How to Join Mailing Lists

Ask a Question

If you have a general how-to question or need help from the Lucene.NET community, please subscribe to the user mailing list by sending an email to user-subscribe@lucenenet.apache.org and then follow the instructions to verify your email address. Note that you only need to subscribe once.

After you have subscribed to the mailing list, email your message to user@lucenenet.apache.org.

Alternatively, you can get help via StackOverflow's active community.

Please do not submit general how-to questions to GitHub; use GitHub for bug reports and tasks only.

Report a Bug

To report a bug, please use the GitHub issue tracker.

NOTE: In the past, the Lucene.NET project used the JIRA issue tracker, which has now been deprecated. However, we are keeping it active for tracking legacy issues. Please submit any new issues to GitHub.

Start a Discussion

To start a development discussion regarding the technical features of Lucene.NET, please subscribe to the dev mailing list by sending an email to dev-subscribe@lucenenet.apache.org and then follow the instructions to verify your email address. Note that you only need to subscribe once.

After you have subscribed to the mailing list, email your message to dev@lucenenet.apache.org.

Submit a Pull Request

Before you start working on a pull request, please read our Contributing guide.

Building and Testing

Command Line

Prerequisites
  1. PowerShell 5.0 or higher (see this question to check your PowerShell version)
  2. .NET 7.0 SDK or higher
Execution

NOTE: If the project is open in Visual Studio, its background restore may interfere with these commands. It is recommended to close all instances of Visual Studio that have Lucene.Net.sln open before executing.

To build the source, clone or download and unzip the repository. For specific releases, download and unzip the .src.zip file from the download page of the specific version. From the repository or distribution root, execute the build command from a command prompt and include the desired options from the build options table below:

Windows
> build [options]
Linux or macOS
./build [options]

NOTE: On Linux and macOS, the build file will need to be given permission to run using the command chmod u+x build before the first execution.

Build Options

The following options are case-insensitive. Each option has a short form indicated by a single - and a long form indicated by --. Options that require a value must be followed by a space and then the value, similar to running the dotnet CLI.

Short Long Description Example
‑config ‑‑configuration The build configuration ("Release" or "Debug"). build ‑‑configuration Debug
‑mp ‑‑maximum-parallel-jobs The maximum number of parallel jobs to run during testing. If not supplied, the default is 8. build ‑t ‑mp 10
‑pv ‑‑package-version The NuGet package version. If not supplied, will use the version from the Version.proj file. build ‑pv 4.8.0‑beta00001
‑t ‑‑test Runs the tests after building. This option does not require a value. Note that testing typically takes around 40 minutes with 8 parallel jobs. build ‑t
‑fv ‑‑file-version The assembly file version. If not supplied, defaults to the --package-version value (excluding any pre-release label). The assembly version will be derived from the major version component of the passed in value, excluding the minor, build and revision components. build ‑pv 4.8.0‑beta00001 ‑fv 4.8.0

For example, the following command creates a Release build with NuGet package version 4.8.0‑ci00015 and file version 4.8.0. The assembly version will be derived from the major version component of the passed in value, excluding the minor, build and revision components (in this case 4.0.0).

Windows
> build ‑‑configuration Release ‑pv 4.8.0‑ci00015 ‑fv 4.8.0
Linux or macOS
./build ‑‑configuration Release ‑pv 4.8.0‑ci00015 ‑fv 4.8.0

In the above example, we are using "ci" in the package version to indicate this is not a publicly released beta version but rather the output of a continuous integration build from master which occurred after beta00014 but before beta00015 was released.

NuGet packages are output by the build to the /_artifacts/NuGetPackages/ directory. Test results (if applicable) are output to the /_artifacts/TestResults/ directory.

You can set up Visual Studio to read the NuGet packages like any NuGet feed by following these steps:

  1. In Visual Studio, right-click the solution in Solution Explorer, and choose "Manage NuGet Packages for Solution"
  2. Click the gear icon next to the Package sources dropdown.
  3. Click the + icon (for add)
  4. Give the source a name such as Lucene.Net Local Packages
  5. Click the ... button next to the Source field, and choose the /_artifacts/NuGetPackages folder on your local system.
  6. Click OK

Then all you need to do is choose the Lucene.Net Local Packages feed from the dropdown (in the NuGet Package Manager) and you can search for, install, and update the NuGet packages just as you can with any Internet-based feed.
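If you prefer the command line, the same feed can be registered with the dotnet CLI instead of through the Visual Studio dialog (adjust the placeholder path to wherever you cloned the repository):

```shell
dotnet nuget add source "/path/to/lucenenet/_artifacts/NuGetPackages" --name "Lucene.Net Local Packages"
```
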

Visual Studio

Prerequisites

  1. Visual Studio 2022 or higher
  2. .NET 7.0 SDK or higher

Execution

  1. Open Lucene.Net.sln in Visual Studio.
  2. Choose the target framework to test by opening .build/TestTargetFramework.props and uncommenting the corresponding <TargetFramework> (and commenting all others).
  3. Build a project or the entire solution, and wait for Visual Studio to discover the tests - this may take several minutes.
  4. Run or debug the tests in Test Explorer, optionally using the desired filters.

NOTE: When running tests in Visual Studio, be sure to set the default processor architecture to 64 bit to avoid running out of virtual memory on some tests.

Azure DevOps

We have set up our azure-pipelines.yml file with sensible defaults so anyone with an Azure DevOps account can build Lucene.NET and run the tests with minimal effort. Even a free Azure DevOps account will work, but tests will run much faster if the project is set up as public, which enables up to 10 parallel jobs to run simultaneously.

Prerequisites

  1. An Azure DevOps account.
  2. A fork of this repository either on GitHub or Azure DevOps. The rest of these instructions assume a GitHub fork.

Execution

If you don't already have a pipeline set up:
  1. Create an Azure DevOps organization. If you already have one that you wish to use, you may skip this step.
  2. Create an Azure DevOps project. We recommend naming the project Lucene.NET. Note that if you are using a free Azure DevOps account, you should choose to make the project public in order to enable 10 parallel jobs. If you make the project private, you will only get 1 parallel job. Also, if disabling features, make sure to leave Pipelines enabled.
  3. Create an Azure DevOps pipeline.
    • Click on "Pipelines" from the left menu.
    • Click the "Create Pipeline" or "New Pipeline" button, depending on whether any pipelines already exist.
    • Select GitHub as the location to find the YAML file.
    • Select the fork of this repository you created in "Prerequisites". Note that if this is a new Azure DevOps account, you may need to set up extra permissions to access your GitHub account.
    • Next, a "Review your YAML" page is presented showing the contents of azure-pipelines.yml. There is documentation near the top of the file indicating the variables that can be set up to enable additional options, but note that the default configuration will automatically run the build and all of the tests.
    • Click the "Run" button at the top right of the page.
If you already have a pipeline set up:
  1. Click on "Pipelines" from the left menu.
  2. Select the pipeline you wish to run.
  3. Click the "Queue" button on the upper right.
  4. (Optional) Select the branch and override any variables in the pipeline for this run.
  5. Click the "Run" button.

Note that after the build is complete, the nuget artifact contains .nupkg files which may be downloaded to your local machine, where you can set up a local folder to act as a NuGet feed.

It is also possible to add an Azure DevOps feed ID to a new variable named ArtifactFeedID, but we are getting mixed results due to permission issues.


Issues

Custom StopWord Analyzer - Exception Cannot read from a closed TextReader.

Hello,
We are trying to convert from v3.0.3 to v4.8.0-beta00007 on .NET Framework 4.5.

We previously had a Custom StopWords Analyzer that inherited from Analyzer. After upgrading, there is an abstract method that needs to be implemented named:
TokenStreamComponents CreateComponents(string fieldName, TextReader reader)

Following the documentation from https://lucenenet.apache.org/download/version-4.html to implement this method, we are getting exception: "Cannot read from a closed TextReader."

Here is our implementation:

protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
{
    Analyzer analyzer = new StandardAnalyzer(_luceneVersion, reader);
    TokenStream ts = analyzer.GetTokenStream(fieldName, reader);
    var tokenizer = new StandardTokenizer(_luceneVersion, reader);

    try
    {
        ts.Reset(); // Resets this stream to the beginning. (Required)
        while (ts.IncrementToken())
        {
        }
        ts.End();   // Perform end-of-stream operations, e.g. set the final offset.
    }
    catch (Exception ex)
    {
        _ = ex.Message;
        throw;
    }
    finally
    {
        ts.Dispose();
    }
    return new TokenStreamComponents(tokenizer, ts);
}

The exception occurs on ts.IncrementToken().

Thanks
Roy
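One common cause of this exception is consuming the TokenStream inside CreateComponents, as in the report above: CreateComponents should only construct and return the component chain, which Lucene.NET then resets and consumes itself. A sketch of the usual stop-word pattern is shown below; the exact filter chain is illustrative, and the _luceneVersion and _stopWords fields are assumed from the report.

```csharp
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

public sealed class CustomStopWordAnalyzer : Analyzer
{
    private readonly LuceneVersion _luceneVersion;
    private readonly CharArraySet _stopWords;

    public CustomStopWordAnalyzer(LuceneVersion luceneVersion, CharArraySet stopWords)
    {
        _luceneVersion = luceneVersion;
        _stopWords = stopWords;
    }

    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        // Build the chain once and return it; do not call Reset()/IncrementToken() here.
        var tokenizer = new StandardTokenizer(_luceneVersion, reader);
        TokenStream stream = new StandardFilter(_luceneVersion, tokenizer);
        stream = new LowerCaseFilter(_luceneVersion, stream);
        stream = new StopFilter(_luceneVersion, stream, _stopWords);
        return new TokenStreamComponents(tokenizer, stream);
    }
}
```
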

Port RandomizedTesting Test Runner

While the test framework can run with the standard NUnit test runner and we have managed to make it work, it is missing the ability to reproduce a random test failure, which is crucial for debugging some of the more complicated random tests. Lucene uses a custom JUnit test runner called RandomizedTesting to accomplish this.

It also includes some other nice features

  1. Ensure when a test fails, the random seeds are included in error messages and logs
  2. Code analysis to ensure the tests are set up properly
  3. Run tests in a random order to ensure they are not dependent upon each other

Preliminary analysis shows that the API for NUnit allows building custom test runners and it is a close enough match to implement the functionality. Most likely, this will require a custom adapter as well, so the test runner can integrate into Visual Studio and dotnet test/mstest.
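To illustrate what the missing piece would need to provide, a minimal NUnit base class that derives each test's Random from a run-wide seed (so a failing run can be replayed by exporting that seed) might look like the following sketch. The class and the TEST_SEED environment variable are hypothetical, not part of Lucene.NET's test framework.

```csharp
using System;
using NUnit.Framework;
using NUnit.Framework.Interfaces;

// Hypothetical base class: sketch only, not part of Lucene.NET.
public abstract class RandomizedTestBase
{
    // Run-wide seed: read from the environment when replaying a failure,
    // otherwise picked fresh for this run.
    private static readonly int RunSeed =
        int.TryParse(Environment.GetEnvironmentVariable("TEST_SEED"), out var s)
            ? s
            : Environment.TickCount;

    protected Random Random { get; private set; }

    [SetUp]
    public void SeedRandom()
    {
        // Mix in the test name using a stable hash (string.GetHashCode is
        // randomized per process on .NET Core) so each test gets its own
        // deterministic stream.
        Random = new Random(RunSeed ^ StableHash(TestContext.CurrentContext.Test.FullName));
    }

    [TearDown]
    public void ReportSeed()
    {
        if (TestContext.CurrentContext.Result.Outcome.Status == TestStatus.Failed)
            TestContext.Out.WriteLine($"Reproduce with TEST_SEED={RunSeed}");
    }

    private static int StableHash(string s)
    {
        unchecked
        {
            int h = 17;
            foreach (char c in s) h = h * 31 + c;
            return h;
        }
    }
}
```
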

Do note that without doing some pretty heavy refactoring on the Codec, Directory, and LuceneTestCase classes, it is not possible to run tests in parallel within the same AppDomain because the codecs use a static variable to turn codec impersonation on/off. For now, it would probably be best to run tests serially.

JIRA link - [LUCENENET-627] created by nightowl888

Build warning since v4.8.0-beta00008

Since updating to v4.8.0-beta00008, when I build my project, I get this build error:

Analyzer 'Lucene.Net.CodeAnalysis.Lucene1000_TokenStreamOrItsIncrementTokenMethodMustBeSealedAnalyzer' threw an exception of type 'System.IO.FileNotFoundException' with message 'Could not load file or assembly 'Microsoft.CodeAnalysis.VisualBasic, Version=3.4.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35'. The system cannot find the file specified.'.

I've created an empty project and added all the references my project uses and that builds fine.

Known Failing Tests on Lucene.Net

All 7500+ tests pass most of the time, but there are still a few that fail under the random conditions we test them in. This is a complete list of the known tests that fail and what is known about them. Note that these may not be all of the ways the tests can fail.

  • Lucene.Net.Index.TestTermVectorsWriter::TestTermVectorCorruption() and Lucene.Net.Index.TestIndexWriterReader::TestDuringAddIndexes()
    • Fails with a message like System.IO.IOException : The process cannot access the file '/tmp/LuceneTemp/index-NIOFSDirectory-fkqg2vah/write.lock' because it is being used by another process.
    • Happens on macOS and Linux, but not on Windows
  • Lucene.Net.Search.TestSearchAfter::TestQueries()
    • Related to current culture
    • Happens on netcoreapp2.1 on macOS
    • Happens on netcoreapp1.0 on Linux
  • Lucene.Net.Util.TestVersionComparer::TestVersions()
    • Related to current culture
    • Happens on netcoreapp2.1 on macOS
  • Lucene.Net.Index.TestIndexWriter::TestTwoThreadsInterruptDeadlock()
    • Happens on every target framework/OS
    • Fixed in #525
  • Lucene.Net.Index.TestIndexWriter::TestThreadInterruptDeadlock()
    • Happens on every target framework/OS
    • Fixed in #525
    • The error message doesn't occur when using lock (this) in IndexWriter.CommitInternal() to make PrepareCommitInternal()
    • The test does not fail if both of the following lines are commented in TestIndexWriter.IndexerThreadInterrupt::Run()

      w.UpdateDocument(new Term("id", idField.GetStringValue()), doc);

      w.AddDocument(doc);
    • The test does not fail if RAMDirectory is used instead of MockDirectoryWrapper
    • It is quite possible that the problem is with MockDirectoryWrapper or one of its subcomponents, although the locking behavior of IndexWriter is also suspicious
    • Fails with message:
FAILED; unexpected exception
 System.InvalidOperationException: cannot close: prepareCommit was already called with no corresponding call to commit
 at Lucene.Net.Index.IndexWriter.CloseInternal(Boolean waitForMerges, Boolean doFlush) in F:\Projects\lucenenet\src\Lucene.Net\Index\IndexWriter.cs:line 1157
 at Lucene.Net.Index.IndexWriter.Dispose(Boolean waitForMerges) in F:\Projects\lucenenet\src\Lucene.Net\Index\IndexWriter.cs:line 1096
 at Lucene.Net.Index.IndexWriter.Dispose() in F:\Projects\lucenenet\src\Lucene.Net\Index\IndexWriter.cs:line 1053
 at Lucene.Net.Index.TestIndexWriter.IndexerThreadInterrupt.Run() in F:\Projects\lucenenet\src\Lucene.Net.Tests\Index\TestIndexWriter.cs:line 1223
  • Lucene.Net.Classification.KNearestNeighborClassifierTest::TestPerformance()
    • Happens on netcoreapp2.1 on Windows
    • Happens on net451 on Windows
  • Lucene.Net.Analysis.NGram.EdgeNGramTokenizerTest::TestFullUTF8Range()
  • Lucene.Net.Analysis.NGram.NGramTokenizerTest::TestFullUTF8Range()
    • Happens on net451 on Windows
    • Happens on netcoreapp2.1 on macOS
  • Lucene.Net.Search.VectorHighlight.SimpleFragmentsBuilderTest::TestRandomDiscreteMultiValueHighlighting()
    • Happens on Windows
    • Not related to current culture
    • Definitely a problem with the highlighter, not the test or test framework
    • As of 2020-07-23, this test no longer fails even after running continuously with different seeds for 8 minutes. I suspect that the J2N structural equality comparer fix NightOwl888/J2N@33382d9 inadvertently fixed this issue.
  • Lucene.Net.QueryParsers.Flexible.Standard.TestQPHelper::TestDateRange()
    • Does not happen on Windows, but happens on both Linux and macOS
    • Related to current culture - fails with new CultureInfo("ar") specifically (and others)
    • As of 2020-08-11, this test hasn't failed for several hundred test runs on Azure DevOps.
    • 2021-11-19 - The test started failing again after fixing the random seed functionality in the test framework. It was fixed in #543.
    • Stack trace
System.FormatException : String '1‏‏/1‏‏/2002' was not recognized as a valid DateTime.
 at System.DateTimeParse.Parse(ReadOnlySpan`1 s, DateTimeFormatInfo dtfi, DateTimeStyles styles)
 at Lucene.Net.QueryParsers.Flexible.Standard.TestQPHelper.AssertDateRangeQueryEquals(StandardQueryParser qp, String field, String startDate, String endDate, DateTime endDateInclusive, Resolution resolution) in D:\a\1\s\src\Lucene.Net.Tests.QueryParser\Flexible\Standard\TestQPHelper.cs:line 837
 at Lucene.Net.QueryParsers.Flexible.Standard.TestQPHelper.TestDateRange() in D:\a\1\s\src\Lucene.Net.Tests.QueryParser\Flexible\Standard\TestQPHelper.cs:line 826
  • Lucene.Net.Expressions.TestExpressionSorts::TestQueries()

    • Happens on .NET Framework on Windows, x86, in Release mode only (if not Release, x86, or different target framework, does not fail)
    • Does not fail when optimizations are disabled in Lucene.Net.dll
    • Fixed in #529
  • Lucene.Net.Sandbox.TestSlowFuzzyQuery::TestTieBreaker()

    • Happens on .NET Framework on Windows, x86, in Release mode only (if not Release, x86, or different target framework, does not fail)
    • Does not fail when optimizations are disabled in Lucene.Net.dll
    • Fixed in #529
  • Lucene.Net.Sandbox.TestSlowFuzzyQuery::TestTokenLengthOpt()

    • Happens on .NET Framework on Windows, x86, in Release mode only (if not Release, x86, or different target framework, does not fail)
    • Does not fail when optimizations are disabled in Lucene.Net.dll
    • Fixed in #529
  • Lucene.Net.Search.TestBooleanQuery::TestBS2DisjunctionNextVsAdvance()

    • Happens on .NET Framework on Windows, x86, in Release mode only (if not Release, x86, or different target framework, does not fail)
    • Does not fail when optimizations are disabled in Lucene.Net.dll
    • Fixed in #529
  • Lucene.Net.Search.TestFuzzyQuery::TestTieBreaker()

    • Happens on .NET Framework on Windows, x86, in Release mode only (if not Release, x86, or different target framework, does not fail)
    • Does not fail when optimizations are disabled in Lucene.Net.dll
    • Fixed in #529
  • Lucene.Net.Search.TestSearchAfter::TestQueries()

    • Happens on .NET Framework on Windows, x86, in Release mode only (if not Release, x86, or different target framework, does not fail)
    • Does not fail when optimizations are disabled in Lucene.Net.dll
    • Fixed in #529
  • Lucene.Net.Search.TestTopDocsMerge::TestSort_1()

    • Happens on .NET Framework on Windows, x86, in Release mode only (if not Release, x86, or different target framework, does not fail)
    • Does not fail when optimizations are disabled in Lucene.Net.dll
    • Fixed in #529
  • Lucene.Net.Search.TestTopDocsMerge::TestSort_2()

    • Happens on .NET Framework on Windows, x86, in Release mode only (if not Release, x86, or different target framework, does not fail)
    • Does not fail when optimizations are disabled in Lucene.Net.dll
    • Fixed in #529
  • Lucene.Net.Index.TestTaskMergeScheduler::TestSubclassTaskMergeScheduler()

    • 2020-02-18 - Deprecated TaskMergeScheduler and deleted its tests, so this is no longer applicable
    • Happens rarely
    • Stack Trace:
System.ObjectDisposedException : Cannot access a disposed object.
at System.Threading.ReaderWriterLockSlim.TryEnterReadLockCore(TimeoutTracker timeout)
at System.Threading.ReaderWriterLockSlim.EnterReadLock()
at Lucene.Net.Support.Threading.ReaderWriterLockSlimExtensions.ReadLockToken..ctor(ReaderWriterLockSlim sync) in D:\a\1\s\src\Lucene.Net\Support\Threading\ReaderWriterLockSlimExtensions.cs:line 40
at Lucene.Net.Support.Threading.ReaderWriterLockSlimExtensions.Read(ReaderWriterLockSlim obj) in D:\a\1\s\src\Lucene.Net\Support\Threading\ReaderWriterLockSlimExtensions.cs:line 74
at Lucene.Net.Index.TaskMergeScheduler.MergeThread.get_IsAlive() in D:\a\1\s\src\Lucene.Net\Support\Index\TaskMergeScheduler.cs:line 535
at Lucene.Net.Index.TaskMergeScheduler.Sync() in D:\a\1\s\src\Lucene.Net\Support\Index\TaskMergeScheduler.cs:line 191
at Lucene.Net.Index.TestTaskMergeScheduler.TestSubclassTaskMergeScheduler() in D:\a\1\s\src\Lucene.Net.Tests\Support\Index\TestTaskMergeScheduler.cs:line 106
  • Lucene.Net.Analysis.Th.TestThaiAnalyzer::TestRandomStrings()
    • Happens so rarely that it requires [FindFirstFailingSeed] or [Repeat(100)] attribute to force it to occur
    • Does not occur when these lines are commented - this is definitely a concurrency problem. See #313 (comment).
    • See #313
  • Lucene.Net.Analysis.Th.TestThaiAnalyzer::TestRandomHugeStrings()
    • Happens so rarely that it requires [FindFirstFailingSeed] or [Repeat(100)] attribute to force it to occur
    • Does not occur when these lines are commented - this is definitely a concurrency problem. See #313 (comment).
    • See #313
  • Lucene.Net.Analysis.Icu.TestICUFoldingFilter::TestRandomStrings()
    • This turned out to be an issue with culture sensitivity that was introduced in 4.8.0-beta00008. See #321.
  • Lucene.Net.Analysis.Icu.TestICUTokenizerCJK::TestRandomStrings()
    • Happens so rarely that it requires [FindFirstFailingSeed] or [Repeat(100)] attribute to force it to occur
    • Does not occur when these lines are commented - this is definitely a concurrency problem. See #313 (comment).
    • The temporary patch in #328 makes this test pass and is proof that this is a concurrency problem with BreakIterator, and ffc8f2a can be reverted once it is fixed.
  • Lucene.Net.Analysis.Icu.TestICUTokenizerCJK::TestRandomHugeStrings()
    • Happens so rarely that it requires [FindFirstFailingSeed] or [Repeat(100)] attribute to force it to occur
    • Does not occur when these lines are commented - this is definitely a concurrency problem. See #313 (comment).
    • The temporary patch in #328 makes this test pass and is proof that this is a concurrency problem with BreakIterator, and ffc8f2a can be reverted once it is fixed.
  • Lucene.Net.Analysis.Icu.TestICUTokenizer::TestRandomStrings()
    • Happens so rarely that it requires [FindFirstFailingSeed] or [Repeat(100)] attribute to force it to occur
    • Does not occur when these lines are commented - this is definitely a concurrency problem. See #313 (comment).
    • The temporary patch in #328 makes this test pass and is proof that this is a concurrency problem with BreakIterator, and ffc8f2a can be reverted once it is fixed.
  • Lucene.Net.Analysis.Icu.TestICUTokenizer::TestRandomHugeStrings()
    • Happens so rarely that it requires [FindFirstFailingSeed] or [Repeat(100)] attribute to force it to occur
    • Does not occur when these lines are commented - this is definitely a concurrency problem. See #313 (comment).
    • The temporary patch in #328 makes this test pass and is proof that this is a concurrency problem with BreakIterator, and ffc8f2a can be reverted once it is fixed.
  • Lucene.Net.Index.TestIndexWriterWithThreads::TestRollbackAndCommitWithThreads()
    • Happens so rarely that it requires [FindFirstFailingSeed] or [Repeat(100)] attribute to force it to occur
    • Began appearing when #326 was put into place
    • Patched in #335
  • Lucene.Net.Analysis.Icu.TestICUFoldingFilterFactory::Test()
    • Only fails on Ubuntu 18.04 when running on GitHub Actions hosted agent
    • Does not fail using Ubuntu 18.04 on Azure DevOps
    • Stack Trace:
Lucene.Net.Diagnostics.AssertionException : End() called before IncrementToken() returned false!  1) Expected: resume, Actual: résumé

term 0, output[i] = resume, termAtt = résumé

at Lucene.Net.Analysis.MockTokenizer.End() in /home/runner/work/lucenenet/lucenenet/src/Lucene.Net.TestFramework/Analysis/MockTokenizer.cs:line 336
at Lucene.Net.Analysis.BaseTokenStreamTestCase.AssertTokenStreamContents(TokenStream ts, String[] output, Int32[] startOffsets, Int32[] endOffsets, String[] types, Int32[] posIncrements, Int32[] posLengths, Nullable`1 finalOffset, Nullable`1 finalPosInc, Boolean[] keywordAtts, Boolean offsetsAreCorrect, Byte[][] payloads) in /home/runner/work/lucenenet/lucenenet/src/Lucene.Net.TestFramework/Analysis/BaseTokenStreamTestCase.cs:line 415
at Lucene.Net.Analysis.BaseTokenStreamTestCase.AssertTokenStreamContents(TokenStream ts, String[] output, Int32[] startOffsets, Int32[] endOffsets, String[] types, Int32[] posIncrements, Int32[] posLengths, Nullable`1 finalOffset, Boolean[] keywordAtts, Boolean offsetsAreCorrect) in /home/runner/work/lucenenet/lucenenet/src/Lucene.Net.TestFramework/Analysis/BaseTokenStreamTestCase.cs:line 428
at Lucene.Net.Analysis.BaseTokenStreamTestCase.AssertTokenStreamContents(TokenStream ts, String[] output, Int32[] startOffsets, Int32[] endOffsets, String[] types, Int32[] posIncrements, Int32[] posLengths, Nullable`1 finalOffset, Boolean offsetsAreCorrect) in /home/runner/work/lucenenet/lucenenet/src/Lucene.Net.TestFramework/Analysis/BaseTokenStreamTestCase.cs:line 433
at Lucene.Net.Analysis.BaseTokenStreamTestCase.AssertTokenStreamContents(TokenStream ts, String[] output, Int32[] startOffsets, Int32[] endOffsets, String[] types, Int32[] posIncrements, Int32[] posLengths, Nullable`1 finalOffset) in /home/runner/work/lucenenet/lucenenet/src/Lucene.Net.TestFramework/Analysis/BaseTokenStreamTestCase.cs:line 438
at Lucene.Net.Analysis.BaseTokenStreamTestCase.AssertTokenStreamContents(TokenStream ts, String[] output) in /home/runner/work/lucenenet/lucenenet/src/Lucene.Net.TestFramework/Analysis/BaseTokenStreamTestCase.cs:line 453
at Lucene.Net.Analysis.Icu.TestICUFoldingFilterFactory.Test() in /home/runner/work/lucenenet/lucenenet/src/Lucene.Net.Tests.Analysis.ICU/Analysis/Icu/TestICUFoldingFilterFactory.cs:line 39

  • Lucene.Net.Analysis.Core.TestRandomChains::TestRandomChains_() and Lucene.Net.Analysis.Core.TestRandomChains::TestRandomChainsWithLargeStrings()
    • Happens so rarely that it requires [FindFirstFailingSeed] or [Repeat(100)] attribute to force it to occur (this no longer works after fixing the random seed functionality in the test framework; we need to wait for it to happen in Azure DevOps)
    • Began appearing when #326 was put into place
    • 2021-11-19 - The cause is now known - see #271 (comment).
    • Upon multiple runs, the only common failure point seems to be HunspellStemFilter
    • Stack Trace:
Test Name:	TestRandomChainsWithLargeStrings
Test Outcome:	Failed
Result Message:	
Thread threw exception: Lucene.Net.Diagnostics.AssertionException: accCount=1 vs existing accept=False states=[0:1]
   at Lucene.Net.Util.Automaton.BasicOperations.Determinize(Automaton a) in F:\Projects\lucenenet\src\Lucene.Net\Util\Automaton\BasicOperations.cs:line 874
   at Lucene.Net.Util.Automaton.MinimizationOperations.MinimizeHopcroft(Automaton a) in F:\Projects\lucenenet\src\Lucene.Net\Util\Automaton\MinimizationOperations.cs:line 67
   at Lucene.Net.Util.Automaton.RegExp.ToAutomaton(IDictionary`2 automata, IAutomatonProvider automaton_provider) in F:\Projects\lucenenet\src\Lucene.Net\Util\Automaton\RegExp.cs:line 616
   at Lucene.Net.Analysis.Hunspell.Dictionary.ParseAffix(SortedDictionary`2 affixes, String header, TextReader reader, String conditionPattern, IDictionary`2 seenPatterns, IDictionary`2 seenStrips) in F:\Projects\lucenenet\src\Lucene.Net.Analysis.Common\Analysis\Hunspell\Dictionary.cs:line 498
   at Lucene.Net.Analysis.Hunspell.Dictionary.ReadAffixFile(Stream affixStream, Encoding decoder) in F:\Projects\lucenenet\src\Lucene.Net.Analysis.Common\Analysis\Hunspell\Dictionary.cs:line 301
   at Lucene.Net.Analysis.Hunspell.Dictionary..ctor(Stream affix, IList`1 dictionaries, Boolean ignoreCase) in F:\Projects\lucenenet\src\Lucene.Net.Analysis.Common\Analysis\Hunspell\Dictionary.cs:line 162
   at Lucene.Net.Analysis.Hunspell.Dictionary..ctor(Stream affix, Stream dictionary) in F:\Projects\lucenenet\src\Lucene.Net.Analysis.Common\Analysis\Hunspell\Dictionary.cs:line 125
   at Lucene.Net.Analysis.Core.TestRandomChains.DictionaryArgProducer.Create(Random random) in F:\Projects\lucenenet\src\Lucene.Net.Tests.Analysis.Common\Analysis\Core\TestRandomChains.cs:line 511
   at Lucene.Net.Analysis.Core.TestRandomChains.NewRandomArg[T](Random random, Type paramType) in F:\Projects\lucenenet\src\Lucene.Net.Tests.Analysis.Common\Analysis\Core\TestRandomChains.cs:line 747
   at Lucene.Net.Analysis.Core.TestRandomChains.NewFilterArgs(Random random, TokenStream stream, Type[] paramTypes) in F:\Projects\lucenenet\src\Lucene.Net.Tests.Analysis.Common\Analysis\Core\TestRandomChains.cs:line 810
   at Lucene.Net.Analysis.Core.TestRandomChains.MockRandomAnalyzer.NewFilterChain(Random random, Tokenizer tokenizer, Boolean offsetsAreCorrect) in F:\Projects\lucenenet\src\Lucene.Net.Tests.Analysis.Common\Analysis\Core\TestRandomChains.cs:line 1023
   at Lucene.Net.Analysis.Core.TestRandomChains.MockRandomAnalyzer.CreateComponents(String fieldName, TextReader reader) in F:\Projects\lucenenet\src\Lucene.Net.Tests.Analysis.Common\Analysis\Core\TestRandomChains.cs:line 849
   at Lucene.Net.Analysis.Analyzer.GetTokenStream(String fieldName, TextReader reader) in F:\Projects\lucenenet\src\Lucene.Net\Analysis\Analyzer.cs:line 265
   at Lucene.Net.Analysis.BaseTokenStreamTestCase.CheckAnalysisConsistency(Random random, Analyzer a, Boolean useCharFilter, String text, Boolean offsetsAreCorrect, Field field) in F:\Projects\lucenenet\src\Lucene.Net.TestFramework\Analysis\BaseTokenStreamTestCase.cs:line 1044
   at Lucene.Net.Analysis.BaseTokenStreamTestCase.CheckRandomData(Random random, Analyzer a, Int32 iterations, Int32 maxWordLength, Boolean useCharFilter, Boolean simple, Boolean offsetsAreCorrect, RandomIndexWriter iw) in F:\Projects\lucenenet\src\Lucene.Net.TestFramework\Analysis\BaseTokenStreamTestCase.cs:line 929
   at Lucene.Net.Analysis.BaseTokenStreamTestCase.AnalysisThread.Run() in F:\Projects\lucenenet\src\Lucene.Net.TestFramework\Analysis\BaseTokenStreamTestCase.cs:line 735
Result StandardOutput:	
Exception from random analyzer: 
charfilters=
  HTMLStripCharFilter([System.IO.StringReader])
  PatternReplaceCharFilter([a, , Lucene.Net.Analysis.CharFilters.HTMLStripCharFilter])
tokenizer=
  ReversePathHierarchyTokenizer([Lucene.Net.Analysis.Core.TestRandomChains+CheckThatYouDidntReadAnythingReaderWrapper, 蒫, 醙, 24])
filters=
  CJKBigramFilter([ValidatingTokenFilter@368e0a2 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,type=word,positionLength=1, 83])
  IndonesianStemFilter([ValidatingTokenFilter@25a042e term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,type=word,positionLength=1,keyword=False, false])
  HunspellStemFilter([ValidatingTokenFilter@20aee5b term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,type=word,positionLength=1,keyword=False, Lucene.Net.Analysis.Hunspell.Dictionary, true, false])
  LatvianStemFilter([ValidatingTokenFilter@6f125a term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,type=word,positionLength=1,keyword=False])
offsetsAreCorrect=False
Result StandardError:	TEST FAIL: useCharFilter=True text='smfmwktkf \\'\\'

Note we now have an azure-pipelines.yml configuration file in our repository that anyone can use to set up a build pipeline and see the tests run on Windows, macOS and Linux by creating a free Azure DevOps account. If you create a public project to run the tests in, it will take roughly an hour to see the test results (a private project will take significantly longer on the free subscription because it only provides a single parallel job).

Repeatability

The failing tests are now marked with the AwaitsFixAttribute. By default, the test framework ignores tests marked with this attribute. Either of the following changes will make the AwaitsFixAttribute tests run so they can be debugged.

  1. Comment out the AwaitsFixAttribute on the test in question.
  2. Add a file named lucene.testsettings.json in the directory of the test project or a parent directory of the test project with the content:
{
  "tests": {
    "awaitsfix": "true"
  }
}

To run the tests on a specific target framework, edit the TestTargetFramework.props file. Note that the last property with a given name wins, so to test on .NET Core 3.1 instead of .NET 5, change the file as follows:

Before

    <!--<TargetFramework>net48</TargetFramework>-->
    <!--<TargetFramework>netcoreapp2.1</TargetFramework>-->
    <!--<TargetFramework>netcoreapp3.1</TargetFramework>-->
    <TargetFramework>net5.0</TargetFramework>

After

    <!--<TargetFramework>net48</TargetFramework>-->
    <!--<TargetFramework>netcoreapp2.1</TargetFramework>-->
    <TargetFramework>netcoreapp3.1</TargetFramework>
    <!--<TargetFramework>net5.0</TargetFramework>-->

NOTES

  • If you want to work on one of these issues, please open a new GitHub issue and make it refer to this one, so your efforts aren't duplicated by someone else. Assign the issue to yourself. If you can't work out the issue, make sure that you unassign yourself and comment on it below that it is still unresolved and up for grabs.
  • Sometimes problems can be spotted just by comparing the Lucene 4.8.0 code against Lucene.Net 4.8.0 code.
  • The code should be checked to make sure there were no translation problems from Java to C#. This may be easier than it sounds, as you can type phrases like "java HashMap equivalent c#" into Google to find the answers easily.
  • We can change the code to properly match the behavior of Java, but no cheating by changing the conditions of the test! Unless, of course, the test conditions were translated to C# wrong, which has been known to happen.
  • Random failures can often be made to happen more frequently by adding a RepeatAttribute to the top of the test. Try running 30 or 40 times and you will see the failure much more often.
  • If you find a solution to make the test pass, please open a PR on GitHub or alternatively post the solution here so we can try it ourselves.
  • If you get the same warm fuzzy feeling we do when we make a test green, feel free to fix another one.

Also, let us know if you find any failing test that is not on this list.

JIRA link - [LUCENENET-619] created by nightowl888

.NETify the public API where appropriate

Although we haven't abandoned the line-by-line port of Java Lucene, there are many idioms in Java that make little to no sense in a .NET assembly. The API can be changed to provide a conventional .NET experience while still keeping the port of the Java logic straightforward.

  • Change Getxxx() and Setxxx() methods to .NET Properties
  • Implement the dispose pattern properly. Avoid finalizers unless absolutely necessary; they are expensive, and most of the classes in use already have finalizers that will be called.
  • Convert Java Iterator-style classes (see TermEnum, TermDocs and others) to implement IEnumerable
  • When catching and rethrowing exceptions, use throw; instead of throw ex; to maintain the stack trace
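A minimal sketch of the first and last points above (Example is an illustrative type, not an actual Lucene.NET class):

```csharp
using System;

public class Example
{
    // A Java-style GetValue()/SetValue(value) pair becomes a .NET property.
    public int Value { get; set; }

    public static void Rethrow()
    {
        try
        {
            throw new InvalidOperationException("original failure");
        }
        catch (InvalidOperationException)
        {
            // "throw;" preserves the original stack trace;
            // "throw ex;" would reset it to this frame.
            throw;
        }
    }
}
```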

JIRA link - [LUCENENET-467] created by ccurrens

Application pool crashes due to missing index file

This problem is critical because when it happens the application pool crashes and the entire website stops working with 503 error code.

Errors are visible in Windows Event Log (Application), attached.

When index files are deleted and index is rebuilt the problem goes away.

I was not able to reproduce the problem by copying corrupted index files (also attached) to another server.

Googling the error message suggests it comes from the OpenInput method in CompoundFileReader.cs:

    if (entry == null)
        throw new System.IO.IOException("No sub-file with id " + id + " found");

JIRA link - [LUCENENET-646] created by jbogusz

The design and implementation of a better ThreadLocal<T>

This issue was first reported in LUCENENET-640.

When using a very high number of threads for long periods of time, the garbage collector will cause all threads to block for several seconds. More details here.

There is also a solution being tested, which is blogged about here.

Ideally, we should adapt the fix after it has been thoroughly tested, provided Microsoft doesn't fix the problem with ThreadLocal first.

JIRA link - [LUCENENET-644] created by nightowl888

Sequential IndexWriter performance in concurrent environments

When creating Lucene.Net indices in parallel, sequential-like performance is experienced. Profiling 8 concurrent IndexWriter instances writing in parallel shows that WeakIdentityMap::IdentityWeakReference::Equals spends most of its time in garbage collection (94.91%), as does TokenStream::AssertFinal (87.09%), in my preliminary tests (see screenshots).

The WeakIdentityMap implementation uses an IdentityWeakReference as its key, which is implemented as a class. Inspection shows this class is merely a wrapper around System.Runtime.InteropServices.GCHandle (as can be seen in the Mono project); wrapping that struct in a struct rather than a class would eliminate much of this immense garbage collection pressure.
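A minimal sketch of that struct-based wrapper idea (names are illustrative, not the actual Lucene.NET types). Wrapping GCHandle in a struct avoids one heap allocation per weak key, which is where the GC savings would come from:

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

internal struct WeakIdentityKey : IEquatable<WeakIdentityKey>
{
    private GCHandle handle;
    private readonly int hashCode;

    public WeakIdentityKey(object target)
    {
        handle = GCHandle.Alloc(target, GCHandleType.Weak);
        hashCode = RuntimeHelpers.GetHashCode(target); // identity hash of the target
    }

    public object Target => handle.IsAllocated ? handle.Target : null;

    // Identity (reference) equality, as required for a weak *identity* map.
    public bool Equals(WeakIdentityKey other) => ReferenceEquals(Target, other.Target);

    public override int GetHashCode() => hashCode;

    // Unlike a class, a struct has no finalizer, so the handle must be
    // freed explicitly when the entry is evicted from the map.
    public void Free()
    {
        if (handle.IsAllocated) handle.Free();
    }
}
```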

JIRA link - [LUCENENET-640] created by matt2843

NullReferenceException in DrillSideways.Search - ReqExclScorer.GetCost

We have migrated all our Lucene 3.0 code to Lucene 4.8. However
when searching with DrillSideways.Search we sometimes get a NullReferenceException with this stacktrace:

System.NullReferenceException: Object reference not set to an instance of an object.
at Lucene.Net.Search.ReqExclScorer.GetCost() in C:\BuildAgent\work\b1b63ca15b99dddb\src\Lucene.Net\Search\ReqExclScorer.cs:line 148
at Lucene.Net.Facet.DrillSidewaysScorer.Score(ICollector collector, Int32 maxDoc) in C:\BuildAgent\work\b1b63ca15b99dddb\src\Lucene.Net.Facet\DrillSidewaysScorer.cs:line 139
at Lucene.Net.Search.IndexSearcher.Search(IList`1 leaves, Weight weight, ICollector collector) in C:\BuildAgent\work\b1b63ca15b99dddb\src\Lucene.Net\Search\IndexSearcher.cs:line 649
at Lucene.Net.Facet.DrillSideways.Search(DrillDownQuery query, ICollector hitCollector) in C:\BuildAgent\work\b1b63ca15b99dddb\src\Lucene.Net.Facet\DrillSideways.cs:line 194
at Lucene.Net.Facet.DrillSideways.Search(ScoreDoc after, DrillDownQuery query, Int32 topN) in C:\BuildAgent\work\b1b63ca15b99dddb\src\Lucene.Net.Facet\DrillSideways.cs:line 249

I managed to reproduce this in a unit test. If you add this unit test to, for example, TestDrillSideways.cs, it will throw a NullReferenceException when run. This unit test should return 0 results because the criteria "Age != 23 AND Name == e" match nothing. I sometimes see the same issue when the query returns multiple results, but that is currently harder to reproduce in a unit test.

[Test]
public virtual void TestFacetNRE()
{
    Directory dir = NewDirectory();
    Directory taxoDir = NewDirectory();

    // Writes facet ords to a separate directory from the main index:
    var taxoWriter = new DirectoryTaxonomyWriter(taxoDir, OpenMode.CREATE);

    FacetsConfig config = new FacetsConfig();

    RandomIndexWriter writer = new RandomIndexWriter(Random(), dir, Similarity, TimeZone);

    Document doc = new Document();
    doc.Add(new Field("Name", "John", Documents.StringField.TYPE_STORED));
    doc.Add(new Field("Age", "19", Documents.StringField.TYPE_STORED));
    doc.Add(new FacetField("Function", "Developer"));
    writer.AddDocument(config.Build(taxoWriter, doc));

    doc = new Document();
    doc.Add(new Field("Name", "Steven", Documents.StringField.TYPE_STORED));
    doc.Add(new Field("Age", "23", Documents.StringField.TYPE_STORED));
    doc.Add(new FacetField("Function", "Sales"));
    writer.AddDocument(config.Build(taxoWriter, doc));

    // NRT open
    IndexSearcher searcher = NewSearcher(writer.Reader);

    // NRT open
    var taxoReader = new DirectoryTaxonomyReader(taxoWriter);

    DrillSideways ds = new DrillSideways(searcher, config, taxoReader);

    var query = new BooleanQuery(true);
    query.Add(new TermQuery(new Term("Age", "23")), Occur.MUST_NOT);
    query.Add(new WildcardQuery(new Term("Name", "*e*")), Occur.MUST);

    var mydrillDownQuery = new DrillDownQuery(config, query);
    mydrillDownQuery.Add("Function", "Developer");

    var z = ds.Search(mydrillDownQuery, null, null, 10, null, false, false);

    IOUtils.Dispose(searcher.IndexReader, taxoReader, writer, taxoWriter, dir, taxoDir);
}

JIRA link - [LUCENENET-598] created by harold.harkema

Lucene & Memory Mapped Files

This came in on the user mailing list on 15-July-2019 and was originally reported by Vincent Van Den Berghe ([email protected])

 

Hello everyone,

 

I've just had an interesting performance debugging session, and one of the things I've learned is probably applicable for Lucene.NET.

I'll give it here with no guarantees, hoping that it might be useful to someone.

 

Lucene uses memory mapped files for reading, most notably via MemoryMappedFileByteBuffer. Profiling indicated that there are 2 calls that have quite some overhead:

 

        public override ByteBuffer Get(byte[] dst, int offset, int length)

        public override byte Get()

 

These calls spend their time in 2 methods of MemoryMappedViewAccessor:

 

    public int ReadArray<T>(long position, T[] array, int offset, int count) where T : struct;
    public byte ReadByte(long position);

 

The implementation of both contains a lot of overhead, especially ReadArray: apart from the parameter validation, this method makes sure that the generic parameter T is properly aligned. This is irrelevant in our use case, since T is byte. But because the method implementation doesn't make any assumptions on T (other than the fact that is must be a value type, which is the generic constraint), every call goes through the same motions, every time.

Microsoft should have provided specializations for common value types, and certainly for byte arrays. Sadly, this is not the case.

The other one, ReadByte, acquires and releases the (unsafe) pointer before dereferencing it to return a single byte.

 

A way to do this more efficiently (while avoiding unsafe code), is to acquire the pointer handle associated with the view accessor, and use that pointer to marshal information back to the caller.

To do this, MemoryMappedFileByteBuffer needs one extra member variable to hold the address:

 

       private long m_Ptr;

 

 

Then, the 2 MemoryMappedFileByteBuffer constructors need to be rewritten as follows (mainly to avoid code duplication):

 

    public MemoryMappedFileByteBuffer(MemoryMappedViewAccessor accessor, int capacity)
        : this(accessor, capacity, 0)
    {
    }

    public MemoryMappedFileByteBuffer(MemoryMappedViewAccessor accessor, int capacity, int offset)
        : base(capacity)
    {
        this.accessor = accessor;
        this.offset = offset;
        System.Runtime.CompilerServices.RuntimeHelpers.PrepareConstrainedRegions();
        try
        {
        }
        finally
        {
            bool success = false;
            accessor.SafeMemoryMappedViewHandle.DangerousAddRef(ref success);
            m_Ptr = accessor.SafeMemoryMappedViewHandle.DangerousGetHandle().ToInt64() + accessor.PointerOffset;
        }
    }

 

The only thing this does is get the pointer handle. Yes, the method has the word "Dangerous" in it, but it's perfectly safe. Note that this requires .NET 4.5.1 or later, because we want the starting position of the view from the beginning of the memory mapped file via the PointerOffset property, which is unavailable in earlier .NET releases.

What the constructor does is obtain a 64-bit quantity representing the start of the memory mapped view. The special construct with an "empty try block" conforms to the documentation regarding constrained execution regions (although I think it's more of a cargo-cult thing, since constrained execution doesn't solve a lot of problems in this case).

 

Finally, the Dispose method needs to be extended to release the pointer handle using DangerousRelease:

 

    public void Dispose()
    {
        if (accessor != null)
        {
            accessor.SafeMemoryMappedViewHandle.DangerousRelease();
            accessor.Dispose();
            accessor = null;
        }
    }

 

At this point, we can replace the ReadArray call in ByteBuffer Get(byte[] dst, int offset, int length) with:

    Marshal.Copy(new IntPtr(m_Ptr + Ix(NextGetIndex(length))), dst, offset, length);

 

And the Get() method (which used ReadByte) becomes:

    public override byte Get()
    {
        return Marshal.ReadByte(new IntPtr(m_Ptr + Ix(NextGetIndex())));
    }

 

 

The Marshal class contains various read methods for various data types (ReadInt16, ReadInt32), and it would be possible to rewrite all the other methods that currently assemble these types byte by byte. This is left as an exercise for the reader. In any case, these methods have much less overhead than the corresponding methods on the memory view accessor.

 

In my measurements, even when files reside on slow devices, the performance improvements are noticeable: I'm seeing improvements of 5%, especially for large segments. If you have slow I/O, the slow I/O still dominates, of course: no such thing as a free lunch and all that.

 

As I said, no guarantees. Have fun with it! If you find something that is unacceptable, let me know.

 

 

Vincent

 

JIRA link - [LUCENENET-629] created by nightowl888

Missing BufferedChecksum

Hi @NightOwl888, I just noticed that the latest preview version marks the BufferedChecksum class internal as of commit 6e88977.

As we were using it in our code, just to understand: is this functionality something we should re-implement on our side, or will it eventually be exposed again by Lucene.NET?

Fully document Codec Factories and include usage samples

We have diverged from Lucene by replacing the NamedSPILoader and SPIClassIterator with abstract factories to load codec types (possibly from a dependency injection container). While the API docs are mostly complete, they are missing usage samples of how to extend the factories to register custom codec, doc values format, and postings format types.

JIRA link - [LUCENENET-625] created by nightowl888

Setup tests on additional platforms that .NET Standard supports

We have recently set up an Azure DevOps YAML build configuration, which makes it easier to add testing environments to our CI process. Adding tests to ensure support on macOS and Linux has already been completed, and (fortunately) the number of bugs found was fairly minimal (most were addressed in beta00006). However, there are other platforms that .NET Standard supports, commonly in use by Lucene.NET users, on which we should also set up our tests to run.

Looking through the JIRA and GitHub issues, 4 platforms that have been mentioned are:

We should do a survey to find out what other platforms may be commonly in use, determine how important it is to test on them, and of course take into consideration whether the tests require special setup that Azure Pipelines doesn't provide on any of the Microsoft-hosted agents.

JIRA link - [LUCENENET-633] created by nightowl888

TokenStream.IncrementToken() is called after Dispose() is called

When overriding Dispose(bool) in TokenStream subclasses, Lucene.Net will call Dispose() before it is done using the TokenStream, and then call IncrementToken() again.

 

The behavior can be observed in Lucene.Net.Collation.TestICUCollationKeyFilter.TestCollationKeySort() when ICUCollationKeyFilter implements Dispose().

JIRA link - [LUCENENET-611] created by nightowl888

Need tests to ensure 2-way index/codec compatibility with Lucene

Lucene came with several tests to ensure backward compatibility with indexes, and included many zipped index archives which are used to ensure Lucene.NET can read indexes that were produced by Lucene.

However, we have no tests to ensure that indexes produced by Lucene.NET 4.8.0 can be read by Lucene 4.8.0.

A way this could be done:

1. Create a command line utility that is part of the Lucene.NET build that produces a series of test index cases (at least 1 case per codec/doc values format/postings format combination). Unlike the existing compatibility tests that zip the indexes into embedded resources, we should be creating indexes based on the current build of Lucene.NET.
2. Create a Java/JUnit test project that depends on Lucene 4.8.0 and add a test per test case.
3. Add build/test tasks to azure-pipelines.yml to run the .NET utility to produce the indexes and then execute Java/JUnit tests on each target framework/OS/platform

These tests should be designed in such a way that when we upgrade to the next version of Lucene the tests are simple to upgrade as well.

JIRA link - [LUCENENET-613] created by nightowl888

Convert Java Iterator classes to implement IEnumerable<T>

The Iterator pattern in Java is equivalent to IEnumerable<T> in .NET. Classes that were ported directly from Java using the Iterator pattern cannot be used with LINQ or foreach blocks in .NET.

Java's Next() is equivalent to .NET's MoveNext(), and in the case below, Term() maps to a .NET Current property. In cases like this, TermEnum would need to become an abstract class with Term and DocFreq properties, returned from another class or method that implements IEnumerable<T>.

 
	public abstract class TermEnum : IDisposable
	{
		public abstract bool Next();
		public abstract Term Term();
		public abstract int DocFreq();
		public abstract void Close();
		public abstract void Dispose();
	}

would instead look something like:

 
	public abstract class TermFreq
	{
		public abstract Term Term { get; }
		public abstract int DocFreq { get; }
	}

	public abstract class TermEnum : IEnumerable<TermFreq>, IDisposable
	{
		// ...
	}

Keep in mind that if the class being converted implements IDisposable, the class enumerating the terms (in this case TermEnum) should implement both IEnumerable<T> and IDisposable. This is no change for the user, as the compiler automatically calls Dispose() on the enumerator when it is used in a foreach loop.
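A minimal sketch of the enumerable shape described above (the class names and the backing list are illustrative; the real implementation would wrap the underlying index structures):

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public class TermFreq
{
    public TermFreq(string term, int docFreq)
    {
        Term = term;
        DocFreq = docFreq;
    }

    public string Term { get; }
    public int DocFreq { get; }
}

public class TermEnumerable : IEnumerable<TermFreq>, IDisposable
{
    // Stands in for whatever index structure backs the real enumerator.
    private readonly IList<TermFreq> terms;

    public TermEnumerable(IList<TermFreq> terms)
    {
        this.terms = terms;
    }

    public IEnumerator<TermFreq> GetEnumerator()
    {
        // A C# iterator replaces the Java Next()/Term()/DocFreq() triple.
        foreach (var t in terms)
            yield return t;
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();

    public void Dispose()
    {
        // Release underlying index resources here.
    }
}
```

With this shape, both foreach and LINQ work directly over the terms.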

JIRA link - [LUCENENET-469] created by ccurrens

Performance decrease when migrating from 3.0.3 to 4.8.0

Hello,

I'm currently migrating a project that uses Lucene.Net from .NET Framework to .NET Core.

I had to migrate from version 3.0.3 of Lucene to 4.8.0 (beta0007), but I noticed a performance decrease when measuring response times while querying. The response time using Lucene 4.8.0 is two to three times higher than with version 3.0.3.

I read some forum topics where other users noticed this performance decrease too. Are you aware of this? Do you plan to do something to improve the performance of Lucene 4.8.0?

Best regards

JIRA link - [LUCENENET-647] created by Mo Dje

Make RandomizedTesting into a separate library

We have ported the generators from RandomizedTesting that Lucene.NET needs to support its tests and added this code to the TestFramework (unlike Lucene), both as a static utility class and as extension methods. The generators we have now in the TestFramework are probably enough to cover most random testing use cases for end users.

That said, it would be useful for the .NET community at large if the random generators were made into a separate library so randomized test data can be easily generated without referencing the Lucene.NET test framework. The latest version of RandomizedTesting has more options for generating random data than what we have ported.

The library should be named after RandomizedTesting (possibly RandomizedTesting.Generators) so that someday if the test runner is also ported (see #264), there is a logical way to integrate it.

This task would make a great little project for anyone who wants to contribute to the Lucene.NET effort.

JIRA link - [LUCENENET-635] created by nightowl888

Reduce locking in FieldCacheImpl::Cache::Get

We noticed a lot of contention in FieldCacheImpl::Cache::Get (our queries use a lot of query time joins + sorting, so we hit the field cache a lot).

We use a SearcherManager with warm-up queries to populate the field cache so we would expect it to be initialized in most cases before we hit it for actual requests.

The implementation seems to lock even on the happy path (when everything is already initialized). This seems to be a by-product of the choice of data structures (the underlying WeakDictionary, WeakHashMap, etc. are not thread-safe), so locking is required in case the dictionary gets resized.

Ideally we could be using thread-safe data structures and only lock when initializing the data.
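One possible direction, sketched here under the stated goal rather than as the actual FieldCacheImpl code: ConcurrentDictionary gives a lock-free read on the happy path, and Lazy&lt;T&gt; ensures the value factory runs at most once per key. (The real cache also needs weak keys so index readers can be collected; that concern is ignored in this sketch.)

```csharp
using System;
using System.Collections.Concurrent;

public class LockFreeReadCache<TKey, TValue>
{
    private readonly ConcurrentDictionary<TKey, Lazy<TValue>> map =
        new ConcurrentDictionary<TKey, Lazy<TValue>>();

    public TValue Get(TKey key, Func<TKey, TValue> create)
    {
        // Reads on existing keys take no lock; Lazy<TValue> (with its default
        // ExecutionAndPublication mode) guarantees create() runs at most once
        // per key even when two threads race on GetOrAdd.
        return map.GetOrAdd(key, k => new Lazy<TValue>(() => create(k))).Value;
    }
}
```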

JIRA link - [LUCENENET-610] created by sthmathew

Fix Random Seed Functionality in TestFramework

The random seed that is set to StringHelper.GOOD_FAST_HASH_SEED through the tests.seed system property in Java is currently the same as the seed that NUnit uses in NUnit.Framework.TestContext.CurrentContext.Random. If we continue using NUnit's Random property, more research is needed to figure out how to set that seed the way the NUnit designers intended, so it can be documented for end users.

During testing, I noticed that NUnit does set the seed to a consistent pseudo-random value, so the same test can be run over and over with exactly the same random setup. However, the seed is re-generated when the code in the project under test is changed. I noted that NUnit writes the seed to a file named nunit_random_seed.tmp in the bin folder of the test project when run in Visual Studio.

A Typical Testing Scenario in Lucene

  1. Test fails
  2. Test failure generates an error message that includes the random seed as a hexadecimal string (that represents a long)
  3. Developer sets the tests.seed system property to the same hexadecimal string that caused the test to fail
  4. Debugging can take place because the pseudo-randomly generated conditions are exactly the same as they were when the test failed
  5. A fix is devised

Testing in .NET

All we have been able to do is get the same test to fail multiple times, but that breaks as soon as any code is changed. We could probably revert back if we save a copy of the nunit_random_seed.tmp, but this is very complicated to do in addition to debugging, and not very intuitive.

We are missing:

  1. The test framework reporting the seed that it was using when the test failed
  2. The ability to set the random seed

I suspect that although setting the seed cannot be done publicly, we can hack it by using .NET Reflection to set an internal field. There was some discussion about setting it through an attribute, but it doesn't sound like that has been implemented.

I haven't looked into what it will take to extend NUnit to attach the random seed to the test message that is reported. One option (but not a very good one) would be to change the Lucene.Net.TestFramework.Assert class to add the random seed to all test messages. The main thing we need is to read the seed value. The only place it seems to be exposed is the bin/nunit_random_seed.tmp file, but it would be preferable to read it through a property within the test session.

Testing Lucene.NET Itself

For end users, the above would solve all of the issues. However, for Lucene.NET we often need to generate the same conditions as Java to determine where the execution paths diverge. This is a problem because the NUnit Random class (probably) isn't the same implementation as the Java Random class. Also, in Java the random seed is a long, but in .NET, it is an int.

We have ported the Java Random class in J2N, named Randomizer. Ideally, we would use this implementation during testing. However, NUnit doesn't seem to have a way to inject a custom implementation of Random nor does it have a way to read the seed it uses to seed a new instance of Randomizer for the test framework or a way to override that setting manually during debugging.

Ideal Solution

  1. By default, the test framework uses a seed generated by the same hook that NUnit generates their seed. Alternatively, we could use System.DateTime.UtcNow.Ticks.
  2. The seed generated would be a long.
  3. The generated seed would be used to seed the J2N.Randomizer class, which the getter of the LuceneTestFramework.Random property would provide and cache.
  4. The test framework would ensure the seed is always output along with failure test messages, as a hexadecimal string.
  5. The "System Properties" feature allows tests.seed to be set to a hexadecimal string which, if set, overrides the auto-generated random seed used in step 3.

Note that the ideal solution doesn't necessarily involve NUnit in the equation. With Java's Random class ported and "System Properties" solved in a way that doesn't involve hacking the system's environment variables when debugging, we are much closer to fixing this to make it compatible with Java Lucene so we can test apples to apples.

The main thing we are missing is writing the random seed out into the test message with hexadecimal formatting (the same string that is used in Lucene). Other than that, setting up the system property to override the automatically generated seed should be fairly straightforward.
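As a sketch of that hexadecimal formatting, assuming a long seed as with J2N.Randomizer (the seed value here is arbitrary):

```csharp
using System;

// An arbitrary long seed, formatted the way a failure message would report it
// and parsed back the way a developer would re-apply it via tests.seed.
long seed = 0x517ACEEDL;
string hex = seed.ToString("X");         // hexadecimal string for the failure message
long parsed = Convert.ToInt64(hex, 16);  // parse it back when debugging
Console.WriteLine($"seed=0x{hex}");
```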

API: Review to ensure IDisposable is being used correctly and disposable pattern implemented correctly

There have been several issues found recently with the disposable pattern not being implemented correctly. I have also been made aware that there is at least one class (Lucene.Net.Store.Lock, if I recall correctly) that is designed to be re-opened after it is closed.

We need a review to ensure all classes that implement disposable are doing it correctly and have correctly implemented the dispose pattern both for sealed and unsealed types. We also need to have a close look at whether any classes should be reverted back to using Close() instead of Dispose() on account that the class instance was designed to be used again after the Dispose() call.

We should also explicitly check to ensure that Dispose() is set up to be called more than once safely. That is, after the first call, all additional calls should be a no-op.
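For reference, the standard dispose pattern for an unsealed type, with the "safe to call more than once" guard described above (Resource is an illustrative class name, not a Lucene.NET type):

```csharp
using System;

public class Resource : IDisposable
{
    private bool disposed;

    public bool IsDisposed => disposed;

    public void Dispose()
    {
        Dispose(true);
        // Skip finalization; relevant if this type or a subclass has a finalizer.
        GC.SuppressFinalize(this);
    }

    protected virtual void Dispose(bool disposing)
    {
        if (disposed) return; // second and later calls are a no-op
        if (disposing)
        {
            // Release managed resources here.
        }
        // Release unmanaged resources here.
        disposed = true;
    }
}
```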

JIRA link - [LUCENENET-626] created by nightowl888

Complete ICU4N to production release

ICU4J is Lucene's biggest dependency. Several attempts have been made to utilize various alternatives:

  1. ICU4NET
  2. icu-dotnet

But we ran into several issues:

  1. Lack of support for 32/64 bit
  2. Lack of support for .NET Standard Platforms
  3. Lack of features, and problems when trying to implement them
  4. Lack of thread safety

We finally ended up doing a direct port of about 40% of ICU4J's features in order to support Lucene.NET. The project is named ICU4N, and is progressing in an external GitHub repository. There are several up-for-grabs issues that we could use some help with to get Lucene.NET into production.

https://github.com/NightOwl888/ICU4N

 

JIRA link - [LUCENENET-628] created by nightowl888

Docs - store .net code snippets beside the converted markdown

The initial part of the docs conversion process is running a csproj executable that converts all of the javadoc HTML files, based on their 4.8.0 release tag, into consumable Markdown files compatible with our API docs (which use DocFx). NOTE: These are not the same as the API docs generated from the /// comments in our code; they are the extra javadoc files that document things like namespaces and libraries.

Each time this conversion process executes, it overwrites all of the Markdown files. That means we don't want to be manually changing the output Markdown files: if we did, we would have to analyze every change made by the process and revert the parts of the changed files that we had knowingly changed, which is neither fun nor sustainable (there are a lot of files!).

So instead of manually changing the output files, we have written code that applies various tweaks to these files to ensure they are output correctly. Some of these tweaks are specific to certain namespaces, projects, etc. For embedded code snippets we should be able to do the same thing:

  • Store .NET code snippets in separate files. We'll need to come up with a scheme to associate each snippet with the location where it needs to be merged in.
  • During the conversion process, detect Java code snippets, then look up our own external files to see if a snippet for that section is available; if so, replace the Java snippet with the .NET one.
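
One possible shape for that lookup, with an entirely hypothetical naming scheme (one snippet file per document and snippet index; the scheme itself is the open question in this issue):

```csharp
using System;
using System.IO;

// Hypothetical sketch: when the converter meets the Nth Java snippet in
// a javadoc file, look for a .NET replacement on disk under a naming
// scheme such as <snippetRoot>/<doc path>/<N>.cs.txt.
static string ReplaceSnippet(string javaSnippet, string docPath, int snippetIndex, string snippetRoot)
{
    string candidate = Path.Combine(snippetRoot, docPath, snippetIndex + ".cs.txt");
    return File.Exists(candidate)
        ? File.ReadAllText(candidate) // use the hand-written C# snippet
        : javaSnippet;                // otherwise keep the original Java

}

// No snippet file exists here, so the Java code passes through unchanged.
string result = ReplaceSnippet("int x = 1;", "core/index", 0, "snippets");
Console.WriteLine(result == "int x = 1;"); // True
```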

~15(+) hrs

Error using Lucene.Net.Facet 4.8.0-beta00005 with Xamarin.iOS

I'm using Lucene.Net.Facet 4.8.0-beta00005 in a big Xamarin project.

With Xamarin.Android and Xamarin.UWP it's all right.

But with Xamarin.iOS on a device (iPad), I'm receiving this error:

Attempting to JIT compile method 'Lucene.Net.Support.LurchTable`2<Lucene.Net.Facet.Taxonomy.FacetLabel, Lucene.Net.Facet.Taxonomy.Directory.DirectoryTaxonomyReader/Int32Class>:InternalInsert<Lucene.Net.Support.LurchTable`2/Add2Info<Lucene.Net.Facet.Taxonomy.FacetLabel, Lucene.Net.Facet.Taxonomy.Directory.DirectoryTaxonomyReader/Int32Class>> (int,Lucene.Net.Facet.Taxonomy.FacetLabel,int&,Lucene.Net.Support.LurchTable`2/Add2Info<Lucene.Net.Facet.Taxonomy.FacetLabel, Lucene.Net.Facet.Taxonomy.Directory.DirectoryTaxonomyReader/Int32Class>&)' while running in aot-only mode. See https://developer.xamarin.com/guides/ios/advanced_topics/limitations/ for more information.

at Lucene.Net.Support.LurchTable`2[TKey,TValue].Insert[T] (TKey key, T& value) <0x2570f48 + 0x000e0> in <063e095c95d945a4ace32ab83d1227eb#2ae0fea9ea4eacaef83bf2e9713bb8ea>:0
at (wrapper unknown) System.Object.gsharedvt_in()
at Lucene.Net.Support.LurchTable`2[TKey,TValue].AddOrUpdate (TKey key, TValue addValue, Lucene.Net.Support.KeyValueUpdate`2[TKey,TValue] fnUpdate) <0x232824c + 0x0013b> in <063e095c95d945a4ace32ab83d1227eb#2ae0fea9ea4eacaef83bf2e9713bb8ea>:0
at Lucene.Net.Facet.Taxonomy.LRUHashMap`2[TKey,TValue].Put (TKey key, TValue value) <0x2c487f8 + 0x0015b> in <79d3a7b905954d0993025c09c5d087ce#2ae0fea9ea4eacaef83bf2e9713bb8ea>:0
at Lucene.Net.Facet.Taxonomy.Directory.DirectoryTaxonomyReader.GetOrdinal (Lucene.Net.Facet.Taxonomy.FacetLabel cp) <0x2c51970 + 0x0019b> in <79d3a7b905954d0993025c09c5d087ce#2ae0fea9ea4eacaef83bf2e9713bb8ea>:0
at Lucene.Net.Facet.Taxonomy.Int32TaxonomyFacets.GetTopChildren (System.Int32 topN, System.String dim, System.String[] path) <0x2c481dc + 0x0008f> in <79d3a7b905954d0993025c09c5d087ce#2ae0fea9ea4eacaef83bf2e9713bb8ea>:0
at Login.MyMB.Lucene.Client.LuceneArticoliSearcher.GetListaArticoloXRicercaAvanzataConRicercaSemplice (System.Collections.Generic.List`1[T] listParametri) <0x224add0 + 0x001bb> in <8f49891e0f0546e185aba7424d294ef7#2ae0fea9ea4eacaef83bf2e9713bb8ea>:0
at Login.MyMB.Lucene.Client.LuceneArticoliSearcher.GetListaArticoloConRicercaSemplice (System.Collections.Generic.List`1[T] listParametri) <0x224afbc + 0x0009f> in <8f49891e0f0546e185aba7424d294ef7#2ae0fea9ea4eacaef83bf2e9713bb8ea>:0
at MyMB.Forms.RicercaLucene.RicercaArticoloLucene.GetListaArticoliXRicercaSemplice (Login.MyMB.Interface.IAmbiente ambiente, Login.MyMB.Lucene.Client.LuceneArticoliSearcher las, System.Collections.Generic.List`1[T] ListParametri, System.Boolean isAbilitataRicercaBarcode) <0xe47fc0 + 0x000e7> in <f1bb3149abe145459612794f1a096634#2ae0fea9ea4eacaef83bf2e9713bb8ea>:0
...............................

At the link https://docs.microsoft.com/it-it/xamarin/ios/internals/limitations , I found the problem cause (I suppose...):

Value types as Dictionary keys: Using a value type as a Dictionary<TKey, TValue> key is problematic, as the default Dictionary constructor attempts to use EqualityComparer<T>.Default. EqualityComparer<T>.Default, in turn, attempts to use reflection to instantiate a new type which implements the IEqualityComparer<T> interface. This works for reference types (as the reflection + create-a-new-type step is skipped), but for value types it crashes and burns rather quickly once you attempt to use it on the device. Workaround: Manually implement the IEqualityComparer<T> interface in a new type and provide an instance of that type to the Dictionary<TKey, TValue>(IEqualityComparer<T>) constructor.
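
The workaround described in the quote looks roughly like this (an illustrative sketch; PointKey and PointKeyComparer are hypothetical types, not part of Lucene.NET):

```csharp
using System;
using System.Collections.Generic;

// Sketch of the quoted workaround: pass an explicit IEqualityComparer<TKey>
// so the AOT runtime never has to build EqualityComparer<TKey>.Default
// for a value-type key via reflection.
var map = new Dictionary<PointKey, string>(new PointKeyComparer());
map[new PointKey(1, 2)] = "hit";
Console.WriteLine(map[new PointKey(1, 2)]); // prints "hit"

public readonly struct PointKey
{
    public readonly int X, Y;
    public PointKey(int x, int y) { X = x; Y = y; }
}

public sealed class PointKeyComparer : IEqualityComparer<PointKey>
{
    public bool Equals(PointKey a, PointKey b) => a.X == b.X && a.Y == b.Y;
    public int GetHashCode(PointKey k) => (k.X * 397) ^ k.Y;
}
```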

So, what can I do? Thank you in advance, Enrico Caltran +393357485560 [email protected]

JIRA link - [LUCENENET-602] created by enycaltran

Docs - Build/Deploy Automation

Building the API documentation files for a given release is currently a manual process and also involves manually updating the websites.

As part of our build/deploy pipeline we need to automate as much of this as possible.

  • document the docs building procedure
  • document the website updating procedure
  • create new 4.8.0-beta00008 api docs release
  • automate more of the docs building procedure including passing in version numbers to populate global variables and branch targets
    • Need to figure out how to use global variables in the md with docfx
  • update website with new links
  • ensure the website has correct download links

Fix issues with culture in Lucene.Net.Queries

In Java, number-to-string and string-to-number conversions were nearly all invariant. It is not clear whether that is the correct decision in .NET; if not, we need overloads that accept a culture. Priority is lower because technically this can be done out of band with the release.
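
To illustrate the concern (de-DE is used here only as an example of a culture with a different decimal separator):

```csharp
using System;
using System.Globalization;

// In a culture such as de-DE the decimal separator is a comma, so
// converting through the current culture can corrupt numeric strings
// that an index or query parser expects in invariant form.
double value = 3.14;
string german = value.ToString(new CultureInfo("de-DE"));        // "3,14"
string invariant = value.ToString(CultureInfo.InvariantCulture); // "3.14"
Console.WriteLine($"{german} vs {invariant}");
```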

JIRA link - [LUCENENET-631] created by nightowl888

Deprecated - Error using Lucene.Net.Facet 4.8.0-beta00005 with Xamarin.iOS

I'm using Lucene.Net.Facet 4.8.0-beta00005 in a big Xamarin project.

With Xamarin.Android and Xamarin.UWP it's all right.

But with Xamarin.iOS on a device (iPad), I'm receiving this error:

Attempting to JIT compile method 'Lucene.Net.Support.LurchTable`2<Lucene.Net.Facet.Taxonomy.FacetLabel, Lucene.Net.Facet.Taxonomy.Directory.DirectoryTaxonomyReader/Int32Class>:InternalInsert<Lucene.Net.Support.LurchTable`2/Add2Info<Lucene.Net.Facet.Taxonomy.FacetLabel, Lucene.Net.Facet.Taxonomy.Directory.DirectoryTaxonomyReader/Int32Class>> (int,Lucene.Net.Facet.Taxonomy.FacetLabel,int&,Lucene.Net.Support.LurchTable`2/Add2Info<Lucene.Net.Facet.Taxonomy.FacetLabel, Lucene.Net.Facet.Taxonomy.Directory.DirectoryTaxonomyReader/Int32Class>&)' while running in aot-only mode. See https://developer.xamarin.com/guides/ios/advanced_topics/limitations/ for more information.

at Lucene.Net.Support.LurchTable`2[TKey,TValue].Insert[T] (TKey key, T& value) <0x2570f48 + 0x000e0> in <063e095c95d945a4ace32ab83d1227eb#2ae0fea9ea4eacaef83bf2e9713bb8ea>:0
at (wrapper unknown) System.Object.gsharedvt_in()
at Lucene.Net.Support.LurchTable`2[TKey,TValue].AddOrUpdate (TKey key, TValue addValue, Lucene.Net.Support.KeyValueUpdate`2[TKey,TValue] fnUpdate) <0x232824c + 0x0013b> in <063e095c95d945a4ace32ab83d1227eb#2ae0fea9ea4eacaef83bf2e9713bb8ea>:0
at Lucene.Net.Facet.Taxonomy.LRUHashMap`2[TKey,TValue].Put (TKey key, TValue value) <0x2c487f8 + 0x0015b> in <79d3a7b905954d0993025c09c5d087ce#2ae0fea9ea4eacaef83bf2e9713bb8ea>:0
at Lucene.Net.Facet.Taxonomy.Directory.DirectoryTaxonomyReader.GetOrdinal (Lucene.Net.Facet.Taxonomy.FacetLabel cp) <0x2c51970 + 0x0019b> in <79d3a7b905954d0993025c09c5d087ce#2ae0fea9ea4eacaef83bf2e9713bb8ea>:0
at Lucene.Net.Facet.Taxonomy.Int32TaxonomyFacets.GetTopChildren (System.Int32 topN, System.String dim, System.String[] path) <0x2c481dc + 0x0008f> in <79d3a7b905954d0993025c09c5d087ce#2ae0fea9ea4eacaef83bf2e9713bb8ea>:0
at Login.MyMB.Lucene.Client.LuceneArticoliSearcher.GetListaArticoloXRicercaAvanzataConRicercaSemplice (System.Collections.Generic.List`1[T] listParametri) <0x224add0 + 0x001bb> in <8f49891e0f0546e185aba7424d294ef7#2ae0fea9ea4eacaef83bf2e9713bb8ea>:0
at Login.MyMB.Lucene.Client.LuceneArticoliSearcher.GetListaArticoloConRicercaSemplice (System.Collections.Generic.List`1[T] listParametri) <0x224afbc + 0x0009f> in <8f49891e0f0546e185aba7424d294ef7#2ae0fea9ea4eacaef83bf2e9713bb8ea>:0
at MyMB.Forms.RicercaLucene.RicercaArticoloLucene.GetListaArticoliXRicercaSemplice (Login.MyMB.Interface.IAmbiente ambiente, Login.MyMB.Lucene.Client.LuceneArticoliSearcher las, System.Collections.Generic.List`1[T] ListParametri, System.Boolean isAbilitataRicercaBarcode) <0xe47fc0 + 0x000e7> in <f1bb3149abe145459612794f1a096634#2ae0fea9ea4eacaef83bf2e9713bb8ea>:0
...............................

At the link https://docs.microsoft.com/it-it/xamarin/ios/internals/limitations , I found the problem cause (I suppose...):

Value types as Dictionary keys: Using a value type as a Dictionary<TKey, TValue> key is problematic, as the default Dictionary constructor attempts to use EqualityComparer<T>.Default. EqualityComparer<T>.Default, in turn, attempts to use reflection to instantiate a new type which implements the IEqualityComparer<T> interface. This works for reference types (as the reflection + create-a-new-type step is skipped), but for value types it crashes and burns rather quickly once you attempt to use it on the device. Workaround: Manually implement the IEqualityComparer<T> interface in a new type and provide an instance of that type to the Dictionary<TKey, TValue>(IEqualityComparer<T>) constructor.

So, what can I do? Thank you in advance, Enrico Caltran +393357485560 [email protected]

[ORIGINAL JIRA TICKET] - https://issues.apache.org/jira/browse/LUCENENET-602

Replace Spanish suffixes by Portuguese suffixes in the Portuguese snowball stemmer

On PortugueseStemmer.cs[1], there are a few suffixes in the PortugueseStemmer which I believe were copied by mistake from SpanishStemmer[2]:

  • "log\u00EDas" should be "logias" (line 137)
  • "log\u00EDa" should be "logia" (line 113)
  • "uciones" should be "uções" (line 139)
  • "uci\u00F3n" should be "ução" (line 120)

For more details, see the original report on nltk project:
nltk/nltk#754

[1] https://github.com/apache/lucene.net/blob/master/src/contrib/Snowball/SF/Snowball/Ext/PortugueseStemmer.cs

[2] https://github.com/apache/lucene.net/blob/master/src/contrib/Snowball/SF/Snowball/Ext/SpanishStemmer.cs

JIRA link - [LUCENENET-547] created by he7d3r

Automate Generation of QueryParser to C#

The Lucene team is using a tool called javacc to generate the main business logic behind the query parsers. If we had a similar tool it could help:

  • Speed up the process of porting/upgrading QueryParser
  • Reduce the number of bugs in these modules caused by doing it manually
  • Most importantly, QueryParser could potentially be generated without using exceptions for control flow

The javacc tool uses a configuration file as input and creates java code as output. Here are some examples of those configuration files:

This has not been fully researched, but there are at least 2 potential ways we could approach this:

  1. Find a similar tool to javacc in .NET that supports similar options that were used in javacc, and create a converter tool to change the javacc configuration into a configuration that the .NET tool supports.
  2. Do a direct port of javacc to C#, and fix its logic to use a more efficient control flow mechanism than exceptions (perhaps goto would be the most direct replacement).

According to this document, a port of javacc should be our first choice because of the performance benchmarks of the resulting code. That would also eliminate the risk of a .NET tool not supporting an option that we need, either now or for some future version of Lucene.

JIRA link - [LUCENENET-620] created by nightowl888

[Serializable] Classes

In Lucene.Net 3.0.3 several classes were marked with the [Serializable] attribute. The same has been done to several of the classes in the Lucene.Net (core), but most of the classes in the sub-projects are still not serializable.

Some of the legacy tests that were carried over required certain classes to be serializable (LUCENENET-170 and LUCENENET-338), which is how this issue was first discovered.

At the very least, all Queries, Filters, and Analyzers should be marked [Serializable], but it is unclear what criteria version 3.0.3 used to determine which other classes should be serializable. We need a clear strategy for this as well as the task to be done.

JIRA link - [LUCENENET-574] created by nightowl888

Investigate Slow Tests in Lucene.Net.Util Namespace

As part of #261, here is a list of known tests in the Lucene.Net.Util namespace that take significantly longer in .NET than they did in Java. The percentage is the amount of time that .NET takes relative to Java (for example, if the .NET test runs 10x longer, the percentage will be 1000%).

These tests need to be investigated to determine what is causing the slowness. Fixing these issues will likely ripple across the entire project reducing the time the tests take to run as well as impacting end users.

Slow Running Tests

  • Lucene.Net.Util.TestDocIdBitSet.TestAgainstBitSet (11583% - Now 206%)
  • Lucene.Net.Util.TestMathUtil.TestAcoshMethod (7400% - Now 200%)
  • Lucene.Net.Facet.Taxonomy.WriterCache.TestCharBlockArray.TestArray (5498% - Now 171%)
  • Lucene.Net.Util.TestAttributeSource.TestCaptureState (4850% - Now 1300%)
  • Lucene.Net.Util.TestBroadWord.TestPerfSelectAllBitsBroad (4267% - Now 733%)
  • Lucene.Net.Util.TestFixedBitSet.TestAgainstBitSet (2999% - Now 26%)
  • Lucene.Net.Util.Automaton.TestBasicOperations.TestEmptyLanguageConcatenate (2750% - Now 300%)
  • Lucene.Net.Util.TestRecyclingByteBlockAllocator.TestAllocate (2333% - Now 33%)
  • Lucene.Net.Util.TestPackedInts.TestAppendingLongBuffer (1597% - Now 27%)
  • Lucene.Net.Util.TestIdentityHashSet.TestCheck (1176% - Now 863%)
  • Lucene.Net.Util.TestPagedBytes.TestDataInputOutput (1015% - Now 51%)
  • Lucene.Net.Util.TestByteBlockPool.TestReadAndWrite (1008% - Now 158%)
  • Lucene.Net.Util.TestWAH8DocIdSet.TestAgainstBitSet (895% - Now 67%)
  • Lucene.Net.Util.TestPagedBytes.TestDataInputOutput2 (849% - Now 106%)
  • Lucene.Net.Util.TestBasicOperations.TestEmptySingletonConcatenate (775%)
  • Lucene.Net.Util.TestBytesRefHash.TestSize (762% - Now 87%)
  • Lucene.Net.Util.TestOpenBitSet.TestAgainstBitSet (720% - Now 155%)
  • Lucene.Net.Util.TestNumericUtils.TestRandomSplit (708%)
  • Lucene.Net.Util.Fst.TestBytesStore.TestRandom (681% - Now 98%)
  • Lucene.Net.Util.Packed.TestEliasFanoSequence.TestAdvanceToAndBackToMultiples (555% - Now 131%)
  • Lucene.Net.Util.TestOfflineSorter.TestSmallRandom (555% - Now 30%)
  • Lucene.Net.Util.TestBytesRefHash.TestSort (414% - Now 73%)
  • Lucene.Net.Util.TestBytesRefHash.TestGet (343% - Now 62%)
  • Lucene.Net.Util.TestBytesRefHash.TestAddByPoolOffset (331% - Now 110%)
  • Lucene.Net.Util.TestUnicodeUtil.TestCodePointCount (294% - Now 32%)
  • Lucene.Net.Util.Fst.TestFsts.TestRandomWords (289%)
  • Lucene.Net.Util.TestBytesRefHash.TestFind (284% - Now 65%)
  • Lucene.Net.Util.TestBytesRefHash.TestCompact (273% - Now 80%)
  • Lucene.Net.Util.TestUnicodeUtil.TestUTF8toUTF32 (256% - Now 34%)
  • Lucene.Net.Util.TestOfflineSorter.TestIntermediateMerges (242% - Now 30%)
  • Lucene.Net.Util.TestOpenBitSet.TestSmall (188% - Now 40%)

These were not apples-to-apples comparisons: the Java tests were run on an older, slower machine that was outclassed in RAM, CPU, and disk read/write speed by the machine used for the .NET tests. The percentage is the approximate value measured by the test frameworks, but it probably significantly understates the difference a proper benchmark on the same machine would show.

BitSet vs BitArray

The TestAgainstBitSet test is slow in almost every test that utilizes it.

One potential reason is that we are using the .NET BitArray class where Java Lucene used BitSet. While remapping to .NET APIs is the preferred way of porting, in this case the two classes have completely different behaviors: BitArray is fixed size, while BitSet automatically expands as needed. To make a BitArray expand, we must reallocate it, and those extra allocations may be contributing to the slowness of these tests.

Note that J2N has a C# implementation of BitSet that would be acceptable to use instead of BitArray if this is indeed one of the contributors.
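
The behavioral difference is easy to demonstrate: java.util.BitSet grows on demand, while BitArray must be resized explicitly (the EnsureCapacity helper below is illustrative, not Lucene.NET code):

```csharp
using System;
using System.Collections;

// Java's BitSet grows automatically; .NET's BitArray does not. Setting a
// bit past the end of a BitArray throws, so ported code has to grow it
// manually, and each growth reallocates the backing storage.
static void EnsureCapacity(BitArray bits, int index)
{
    if (index >= bits.Length)
        bits.Length = Math.Max(index + 1, bits.Length * 2); // reallocates
}

var bits = new BitArray(8);
EnsureCapacity(bits, 100); // BitSet would do this implicitly on set(100)
bits.Set(100, true);
Console.WriteLine(bits.Length); // 101
```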

Docs - convert java snippets to c#

Requires: #283

Once we have an idea of how this works (#283), we can start the snippet conversion. This could be broken down into smaller tasks and listed as up for grabs.

~15 hrs


This issue no longer depends on #283, since that issue was addressed by #396. So this issue is no longer blocked.

Here is a list of all of the documentation files that will require a review to update code examples. Note that links are a different issue that is being worked on in #300. So in this issue, we should focus on:

  1. Updating Java code examples to C#. Make sure the examples will compile the way they are written after the conversion.
  2. Updating references to Java technologies (types, the JVM, the class path, etc.) to their .NET equivalents. Some of this will require research. Some things, such as links to Java-related Wikipedia articles, are probably best left alone; other things, such as types that exist in Java but not in .NET and that are still required for the docs to flow, might need to be clarified for the average .NET developer.
  3. Make sure the formatting of the Markdown document is similar to how it was in Lucene - that is, make sure the numbered and unordered lists come up correctly, tables are legible, etc.
  4. Look out for issues with API usability and open issues about them if any are found.

NOTE: The main documents to focus on here are named overview.md and package.md. The rest could probably just use a cursory review.

Fix asserts that are failing on .NET Standard 2.0

There are a few asserts that fail on .NET Standard 2.0. They have been conditionally compiled out of the code because on .NET Core 2.1 they cause the test runner to crash fatally; however, they are a sign that something could be seriously wrong with the codecs (possibly causing index corruption).

Here are the known failures:

JIRA link - [LUCENENET-624] created by nightowl888

Finish implementation of "System Properties" for .NET

In Java, System Properties are file-based properties that can be overridden for the specific environment (via environment variables).

To implement similar functionality in .NET, we have added the Lucene.Net.Support.SystemProperties class, which currently just reads/writes environment variables.

With the release of the TestFramework, it is now more important to have hierarchical, file-based configuration properties that can be used in test projects to control the features of the TestFramework. It should also still be possible to specify them as environment variables that override the file-based settings.

We should use a JSON-based file format, ideally following an existing convention in .NET.

The closest match in .NET Core appears to be the Microsoft.Extensions.Configuration API. We need to investigate using this API as a replacement for Lucene.Net.Support.SystemProperties, and come up with a read-write hierarchical file-based configuration solution that can be overridden by environment variables on any platform.
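
A minimal sketch of that approach using the Microsoft.Extensions.Configuration packages; the file name lucene.testsettings.json, the lucene: environment-variable prefix, and the tests:seed key are illustrative, not an established convention:

```csharp
// Sketch only: requires the Microsoft.Extensions.Configuration,
// Microsoft.Extensions.Configuration.Json, and
// Microsoft.Extensions.Configuration.EnvironmentVariables NuGet packages.
using Microsoft.Extensions.Configuration;

IConfiguration config = new ConfigurationBuilder()
    .AddJsonFile("lucene.testsettings.json", optional: true) // file-based defaults
    .AddEnvironmentVariables(prefix: "lucene:")              // environment overrides
    .Build();

// Hierarchical keys use ':' as the separator,
// e.g. { "tests": { "seed": "cafebabe" } } is read back as "tests:seed".
string seed = config["tests:seed"];
```

Note that this API is read-oriented; the read-write requirement above would still need a custom provider or a separate write path.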

JIRA link - [LUCENENET-638] created by nightowl888

FacetsCollector.Search n Parameter unclear

Hello, I am scratching my head over the purpose of the parameter n in the method FacetsCollector.Search(..., n, ...). From various posts out there on the internet (including Java questions), it looks like the argument limits the results to facets that deliver n+ hits. But this is not the behavior I observe: in my case, whatever number I use, the results are the same. Here is the code I use, based on the sample I found here.

One suspicious place in the code is that the return value of the Search method is not used, but this is the same as in the referenced GitHub sample.

Any help is appreciated. Thank you!

            BooleanQuery searchQuery = this.CreateLuceneSearchQuery(performCourseSearchRequest);

            var facetsCollector = new FacetsCollector();

            /*
             * This value should theoretically limit the result to return only facets with at least N results (hits). For some reason this
             * seems not being working and regardless what the number is, it returns always everything.
             */
            const int NumHitsLimit = 1;

            // Return value is not used, the searchQuery affects the referenced facetsCollector
            _ = FacetsCollector.Search(catalogSearchContainer.GetCurrentIndexSearcher(), searchQuery, NumHitsLimit, facetsCollector);

            var facets = new FastTaxonomyFacetCounts(catalogSearchContainer.GetCurrentTaxonomyReader(), FacetsConfigFactory.CreateFacetsConfig(), facetsCollector);

            foreach (var facetField in facetFields)
            {
                FacetResult facetResult = facets.GetTopChildren(MaxFacetChildrenHits, facetField);

                if (facetResult != null)
                {
                    facetResults.Add(new FacetItem(facetResult));
                }
            }

            return facetResults;

RegexpQuery doesn't maintain any reference to the Regexp

We're using your parser to parse lucene queries as part of a conversion mechanism we have, so we traverse the AST in order to build up our own representation.

The one missing part right now is that a RegexpQuery does not give access to the original Regexp, or even the Term.

Let's say you have:

var ast = (RegexpQuery)parser.Parse("foo:/ab?def/", "all");

There seems to be no way to access ab?def, as a term, string or regexp. By this point, it has already been turned into an automaton and the original term has been thrown away or is in a protected property.

The parser should not make assumptions like this IMO. It is up to the consumer what we do with the AST, rather than forcing us to use your "automaton" representation. Would it be possible to somehow expose the untouched, original value of the query?

Identify/Fix Bottlenecks

We need some help profiling Lucene.NET to identify and/or fix any potential bottlenecks. Please make us aware of anything you find by opening a new GitHub issue or by letting us know on the dev mailing list.

JIRA link - [LUCENENET-630] created by nightowl888

Create NuGet Icons

We need icons for the NuGet packages of Lucene.NET's dependent projects:

  • J2N
    • No specific preference to design, incorporating aspects of .NET and Java logos
  • ICU4N
    • Should incorporate the official Unicode icon (don't much care for the ICU project's logo)
  • Morfologik.Stemming
    • No specific preference to design
  • Spatial4n
    • Ideally showing a circle, square, and triangle in the design

JIRA link - [LUCENENET-639] created by nightowl888

Factor out IChecksum and replace with System.Security.Cryptography.HashAlgorithm

IChecksum was brought over from Java in order to plug in the CRC32 class.

However, .NET already has its own abstract class, HashAlgorithm, which is used as the base class for all cryptographic hash algorithms. There are also several third-party implementations of CRC32 that derive from HashAlgorithm that we could use (provided we drop support for .NET Framework < 4.6.1). Crc32.NET claims to be one of the fastest implementations.

IChecksum is only utilized in a couple of places, but Lucene.NET's types should also be renamed accordingly to show they use HashAlgorithm instead of IChecksum:

  • BufferedChecksum > BufferedHashAlgorithm
  • BufferedChecksumIndexInput > BufferedHashAlgorithmIndexInput

The APIs of IChecksum and HashAlgorithm differ, but since they serve the exact same purpose, we should be able to make this work with a bit of refactoring.
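
For reference, HashAlgorithm already supports the incremental Update/GetValue style that IChecksum callers rely on, via TransformBlock/TransformFinalBlock. SHA256 is used below only as a stand-in, since the base class library has no built-in CRC32; a HashAlgorithm-based CRC32 such as the one in Crc32.NET would slot into the same shape:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// HashAlgorithm's incremental API, analogous to IChecksum's
// Update(...)/GetValue(). SHA256 stands in for a CRC32 implementation.
byte[] part1 = Encoding.UTF8.GetBytes("hello ");
byte[] part2 = Encoding.UTF8.GetBytes("world");

using HashAlgorithm hasher = SHA256.Create();
hasher.TransformBlock(part1, 0, part1.Length, null, 0); // like Update(...)
hasher.TransformFinalBlock(part2, 0, part2.Length);     // finish the stream
byte[] digest = hasher.Hash;                            // like GetValue()

Console.WriteLine(Convert.ToHexString(digest));
```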

JIRA link - [LUCENENET-637] created by nightowl888

Port ConditionalWeakTable from .NET Core 3.x to .NET Standard 2.0

As per LUCENENET-610, the WeakDictionary that has been created to support FieldCache and a few other Lucene.NET features does not perform well enough in highly concurrent environments.

The ConditionalWeakTable is a suitable replacement on .NET Standard 2.1, but on .NET Standard 2.0, it doesn't expose the enumerator or the AddOrUpdate method. All of Lucene.NET's usages require one or the other.

So, it would definitely be worth the effort to port the full implementation to .NET Standard 2.0.

This port should be added to J2N in the J2N.Runtime.CompilerServices namespace (to match .NET's namespace convention).
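
For context, this is the kind of usage that already works on all targets; AddOrUpdate and enumeration, the members noted above, are what .NET Standard 2.0 lacks:

```csharp
using System;
using System.Runtime.CompilerServices;

// ConditionalWeakTable attaches extra state to a key object without
// keeping the key alive; the entry disappears when the key is collected.
var cache = new ConditionalWeakTable<object, string>();
var key = new object();

// GetValue creates the value on first access and caches it; this member
// is available on all targets, including .NET Standard 2.0.
string value = cache.GetValue(key, _ => "computed once");
Console.WriteLine(value); // prints "computed once"
```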

JIRA link - [LUCENENET-636] created by nightowl888

Review tests for Lucene.Net assembly

Occasionally, tests are still being found with important lines commented out because the functionality didn't exist at the time they were ported. Additionally, some test conditions have been changed from what they were in Java (in some cases this is a necessary change due to a difference in platforms; in other cases it is a bug). We need some assurance that none of our green tests are false positives.

A line-by-line review would be best, but at the very least we should be checking that the implementations are complete and the test conditions are the same as in Java Lucene 4.8.0. Checking at the method level to ensure they all exist and have the right attributes has been completed already on Lucene.Net (core).

The tests in question need to be analyzed to ensure:

  1. That the test conditions weren't changed from what they were in Lucene (without some very good reason)
  2. That none of the asserts or parts of the test setup were simply commented out to make the test pass rather than fixing the underlying problem
  3. (for tests other than Lucene.Net (core)): The test methods/classes are not skipped in Lucene.NET unless they were skipped in Lucene or have some known problem

There may be other unforeseen issues with the tests (such as an incorrect translation of the line), as well, which is why a line-by-line comparison would be best.

Some Other Things to Look For

  1. Missing [Test] attributes. Note that in Lucene, tests are run by naming convention for any parameterless method whose name starts with test, but in .NET the attribute is required.
  2. Formatting of arrays over multiple lines
  3. Stripped comments which would provide needed context
  4. Partially implemented classes/tests
  5. Method calls such as Convert.ToString(int) that require an explicit CultureInfo.InvariantCulture argument to match Java
  6. APIs that accept/return object that are using value types (boxing/unboxing)
  7. APIs that require casting in order to use them, which is something we should strive to solve
  8. APIs that accept/return non-generic collections, which we should strive to make generic

While it isn't the most important part of the task, the tests are also the easiest place to spot usability issues with the API, so if any are discovered (that aren't already marked LUCENENET TODO) we should open new issues for them as well.

The test projects that need review are:

  • Lucene.Net.Tests._A-D
  • Lucene.Net.Tests._E-I
  • Lucene.Net.Tests._I-J (See #926)
  • Lucene.Net.Tests._J-S (See #914)
  • Lucene.Net.Tests._T-Z (See #897)
  • Lucene.Net.Tests.Classification (See #420)
  • Lucene.Net.Tests.Codecs (See #425)
  • Lucene.Net.Tests.Expressions (See #435)
  • Lucene.Net.Tests.Facet (See #411)
  • Lucene.Net.Tests.Join (See #414)
  • Lucene.Net.Tests.Queries (See #432)

JIRA link - [LUCENENET-632] created by nightowl888
