shazwazza / examine

A .NET indexing and search engine powered by Lucene.Net

Home Page: https://shazwazza.github.io/Examine/

C# 70.48% CSS 0.60% PowerShell 0.48% HTML 28.45%

lucene search-engine aspnet indexing searching fluent-api csharp searching-engine hacktoberfest

examine's People

stevetemple michaelulmann girish66 sud33p pmgrove rhayesbite fcingolani matijagrcic peteduncanson marcstoecker ngschumacher bardurhj carael robbaman zeffrin alain-es fspezi emmagarland modulexcite snowattitudes jclementson oliverpicton jt3432 bowserm premkumaranand fidelitylife lars-erik jbreuer enterprisewide kevinjump derickrhodes m-khosravi estei stantoxt brights-ideas alindgren audioproject2017 jenistonj zhhbo mrafea-sa protherj ruo2012 ja0b renick robsiera martijnkooij leopangan ismailmayat dino-herbert guojianbin marciogoularte antonfisher1 profcinders senthil-curated-ref perplexdaniel laniatech awesomedotnetcore bhaidar simonhartfield qsdev mortenbock michael-artlist seanrockster benjaminc matrixdekoder bielu jmayntzhusen davecs1 jonhzy163 andyfelton2 olibos fredzhangau phananhtruc98 tvarshney nobot lranger nzdev nikrimington ronaldbarendse dvdvliet lindeberg mastermann enkelmedia jasonc08 hummans bangush sayedgt johnbawesome sayeduzzamancuet matthewcare irac-ding xtmhm2000 joecklau zharonar nasa03 qaz734913414 bubdm jaandrews aakkssqq bergmania

examine's Issues

Small performance improvement when index node count is zero

If the indexed node count is 0, there is no need to commit, merge, or raise events.

Null check required in ToExamineXml when the value of a dictionary item is null/empty

Lucene.Net.Store.AlreadyClosedException: this IndexReader is closed - during shutdown

When running stress tests that shut down the appdomain very often while also trying to view search results, an exception may occur:

Lucene.Net.Store.AlreadyClosedException: this IndexReader is closed

This is because when the app domain is shut down, the reader is closed while another request might still be iterating it. So we need to close the reader only at the last possible moment, just before the appdomain terminates.

OrderBy and OrderByDescending not working

I'm trying to search the index and have it return the results ordered, but it isn't working for me; the results are always returned in the same order. Here is my query:

BaseSearchProvider searcher = ExamineManager.Instance.SearchProviderCollection["FooSearcher"];
ISearchCriteria searchCriteria = searcher.CreateSearchCriteria();
IBooleanOperation boolOperation = searchCriteria.NodeTypeAlias("fooNodeTypeAlias");
boolOperation = boolOperation.And().OrderBy("fooName");
ISearchResults results = searcher.Search(boolOperation.Compile());

Here is my index:
<IndexSet SetName="FooIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/{machinename}/Foo/">
  <IndexAttributeFields>
    <add Name="id" />
    <add Name="nodeName"/>
    <add Name="nodeTypeAlias" />
  </IndexAttributeFields>
  <IndexUserFields>
    <add Name="fooName" EnableSorting="true" />
  </IndexUserFields>
</IndexSet>

And here is my searcher:
<add name="FooIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine" supportUnpublished="false" supportProtected="true" analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" />

Edit: I'm using Examine 0.1.70.0

Getting total count from search without loading all results into memory

I was wondering if it is possible to get the total number of results from the search query without loading all results into memory. My understanding is that BaseSearchProvider.Search() will load everything into memory; is this correct?

My main use for this is for paging.

SimpleDataIndexer can operate much more efficiently when adding nodes to the index

Previous changes to Examine allow an IEnumerable collection to be passed in for indexing so that it is resolved lazily, but the SimpleDataIndexer wasn't updated to support this feature.

Wrong files showing while querying the index path using lucene.

Hi everyone,

I'm new to Lucene. I have an issue when querying the index file (path) while searching for a string in the files (say, doc files). I loop through all the hits using the following code.

string indexFileLocation = txtRootDirectory.Text.Trim();
Lucene.Net.Store.Directory dir = Lucene.Net.Store.FSDirectory.GetDirectory(indexFileLocation, false);
Lucene.Net.Search.IndexSearcher searcher = new Lucene.Net.Search.IndexSearcher(dir, false);
Lucene.Net.Index.Term searchTerm = new Lucene.Net.Index.Term("content", txtSearch.Text.Trim());
Lucene.Net.Search.Query query = new Lucene.Net.Search.TermQuery(searchTerm);
Lucene.Net.Search.Hits hits = searcher.Search(query);

for (int i = 0; i < hits.Length(); i++)
{
    Lucene.Net.Documents.Document doc = hits.Doc(i);
    StringBuilder contentValue = new StringBuilder();
    contentValue.Append(doc.Get("content"));
    string id = doc.Get("id");
    lblSearchResults.Text += id + "<br />";
}

But unfortunately, I've been getting the same search results, the same file name as follows.

I couldn't figure out whether I'm doing something wrong in my code. Please help me out.

Thanks in advance.

Enable adding/indexing of documents with duplicate fields (array fields)

What I need is the ability to index documents with multi-value fields, e.g. tags. Currently there is no way to add a document with many values for the same field.

Q: How to use 'ExamineManager.Instance.ReIndexNode()' with custom data?

Hi!
I've got a custom (non-Umbraco) indexer set up and working properly by looping through my custom data and creating a "SimpleDataSet" for each item. Now I am adding functionality to update the index when operations happen on the custom objects.

I have successfully set up a "Remove from Index" function to run on object delete by looking up the index nodeId for the object, and passing it to 'ExamineManager.Instance.DeleteFromIndex(...)'

Now I'd like to add operations to run on object create and update which would add just the current object to the index. I was looking at 'ExamineManager.Instance.ReIndexNode()' which expects an "XElement" as the representation of the index data, but I am unclear what format that needs to be in, or how to convert a SimpleDataSet into an XElement.

Is it possible to only index a single object? I'd rather not have to run 'ExamineManager.Instance.IndexAll()' every time something is added or changed... But perhaps that isn't possible?

NRT Readers

We can support Near Real Time readers in Examine (yes, even in v1!). Previously I thought this was only possible via the ctor, but I have managed to come up with a nice solution.

Need to be able to add an iterator to the index queue instead of raw data

For example, when indexing thousands of items, it would be much better if we can queue up an iterator for the worker thread to process instead of queuing up already serialized items which can consume memory.

Push Search methods into BaseSearchProvider or create an interface and use it

Hi, I am trying to integrate Elasticsearch into Umbraco using Examine, but I hit a roadblock: there are a few places in Umbraco where the SearchProvider is cast to the specific LuceneSearcher in order to use a few extra Search methods implemented on that specific searcher.

Would it be possible to create two abstract methods
BaseLuceneSearcher: ISearchResults Search(ISearchCriteria searchParams, int maxResults)
LuceneSearcher: ISearchResults Search(string searchText, bool useWildcards, string indexType)

in the BaseSearchProvider, or create an interface for the methods that can be implemented by other providers?

This would make it possible to avoid the casts in Umbraco that make it difficult to implement SearchProviders based on engines other than Lucene.Net.

Next up would be to refactor the usage of UmbracoContentIndexer in Umbraco, as it is also tied to Lucene.Net, but that's another story. :-)

Azure providers are missing in this repository?

Hi,

I searched the entire solution and I cannot find the mentioned Azure providers anywhere. Could you please point me at the right location?

Thanks.

When using BaseLuceneSearcher and passing in a max result count, the TotalItemCount is incorrect

For example, specifying a max result of 3 will return 3 results; however, TotalItemCount should return the actual total number of matches, not just the limited amount.

StackOverflow Exception when running v0.1.67 with Umbraco v7.2.8

I previously had an Umbraco v7.2.8 installation running with Examine v0.1.66 for a while now. After upgrading Examine to v0.1.67, the application fails to start with the following exception occurring every 30 seconds or so.

In order to reproduce the error, create a v7.2.8 installation of Umbraco and upgrade its Examine NuGet package to v0.1.67. Delete the Examine indexes folder, then navigate to the site for the first time; the above exception should occur. As a fix, I have downgraded the Examine NuGet package to v0.1.66, which seems to allow the application to start correctly.

I understand this is probably not much to go on. I have checked Umbraco's log files and my system's Event Logs, and nothing is logged relating to the exception. I can replicate this using a fresh install of Umbraco via NuGet. I'm guessing the issue is related to the following changes.

v0.1.66...v0.1.67

The exception occurs within LuceneIndex.cs according to Visual Studio:

Support for multiple fields with the same name

Currently if you use DocumentWriting event and create multiple fields with the same name, it will index just fine since Lucene supports that. However, when you search you will get a dictionary error because it is trying to add the same field to the dictionary multiple times.

Since we cannot change the dictionary result since that is a breaking change, we'll support this by doing the following:

The normal dictionary of values will contain the first value; however, there's a new method on the SearchResult object: public IEnumerable<string> GetValues(string key), which will give you all of the values indexed for that field. This method will never return null; if a key doesn't exist at all, an empty collection is returned.

Add backported AzureDirectory to work with lucene 2.9.4.1

AzureDirectory on Nuget has moved to a later version of Lucene, we need to keep it as supporting 2.9.4.1 but with the bug fixes, so we'll release a separate version of that to help with some Umbraco related bits.

Issue in Italian PC

I noticed that when the indexing process runs under the "it-IT" culture and the data to index contains values of types like Double, an exception is thrown.

I found the line of code where it happens, in the
Examine.LuceneEngine.Providers.LuceneIndexer.TryConvert<T>(string var, out object parsedVal) method.

The tc.ConvertFrom(val) line tries to convert the val string to type T. If T is Double or Float and val contains decimal digits (like "1234.567"), the conversion fails because the dot character is not the decimal separator in the "it-IT" culture.

I think I have the solution.
This code seems to solve the issue:
parsedVal = (T)tc.ConvertFrom(null, System.Globalization.CultureInfo.InvariantCulture, val);

Is this a good solution? Is it possible to apply this patch in Examine?

Thanks
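
The proposed fix can be sketched in isolation. The helper below is a hypothetical stand-in, not Examine's actual TryConvert implementation; the class and method names are assumptions made for illustration. It only shows that passing CultureInfo.InvariantCulture to TypeConverter.ConvertFrom makes "1234.567" parse the same way regardless of the machine's current culture.

```csharp
using System;
using System.ComponentModel;
using System.Globalization;

public static class InvariantConversion
{
    // Hypothetical helper (not Examine's API): converts a string to T using
    // the invariant culture, so '.' is always treated as the decimal separator.
    public static bool TryConvertInvariant<T>(string val, out object parsedVal)
    {
        try
        {
            TypeConverter tc = TypeDescriptor.GetConverter(typeof(T));
            // The three-argument overload takes an explicit culture; the
            // one-argument overload falls back to the current culture.
            parsedVal = (T)tc.ConvertFrom(null, CultureInfo.InvariantCulture, val);
            return true;
        }
        catch (Exception)
        {
            // The converter throws when the text is not valid for T.
            parsedVal = null;
            return false;
        }
    }
}
```

Calling InvariantConversion.TryConvertInvariant<double>("1234.567", out var parsed) should succeed on an "it-IT" machine as well, because the current culture is never consulted.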

Turn AutomaticallyOptimize = false by default

We shouldn't have AutomaticallyOptimize as true by default, it should be false. Optimization comes with a large overhead and we don't really want sites to suddenly start optimizing large indexes which could cause slowness, etc...

Also note that optimization for lucene is more or less a legacy thing:
http://blog.trifork.com/2011/11/21/simon-says-optimize-is-bad-for-you/

OutOfMemoryException Building Index - Need to make the enumeration more lazy

I've setup an index using a SimpleDataIndexer that is trying to index data from a database table with around 2 million rows. I'm using Umbraco's database object to query the data which appears to use a DataReader to read a row at a time. In my SimpleDataService I'm looping through the objects returned from Umbraco and yielding a new SimpleDataSet. Am I doing something wrong or is indexing this much data just not supported?

Search results: retrieve only specific fields instead of returning the whole doc.

Hi,

I have been reading Examine's documentation and searching the Umbraco forums, but couldn't find anything.
I was wondering if it is possible to retrieve only specific fields instead of returning the whole doc in the search results.
If not, would it be difficult to implement this feature (I am considering submitting a PR)?

I have found the following doc/examples on the internet:
http://lucene.apache.org/core/2_9_4/api/all/org/apache/lucene/document/FieldSelector.html
http://lucene.apache.org/core/2_9_4/api/all/org/apache/lucene/document/FieldSelectorResult.html
http://kb.ucla.edu/articles/why-are-lucenes-stored-fields-so-slow-to-access
Would this be the correct way to implement this feature?

Thanks,
Alain

Ability to pass in Lucene boolean query to the Lucene searcher

Examine search issue on Umbraco

LuceneIndexer - don't allow rebuild/optimize during shutdown

During app shutdown, once cancellation is requested, do not allow rebuild or optimize

Indexer Error when trying to rebuild the index

I have a simple console app to index the data from a SQL table. I am receiving the following error:

Value cannot be null
at System.Web.Configuration.ProvidersHelper.InstantiateProvider(ProviderSettings providerSettings, Type providerType)
at System.Web.Configuration.ProvidersHelper.InstantiateProviders(ProviderSettingsCollection configProviders, ProviderCollection providers, Type providerType)
at Examine.ExamineManager.EnsureProviders() in X:\Projects\Examine\Examine\Projects\Examine\ExamineManager.cs:line 96
at Examine.ExamineManager.get_IndexProviderCollection() in X:\Projects\Examine\Examine\Projects\Examine\ExamineManager.cs:line 72

on this line: ExamineManager.Instance.IndexProviderCollection["Simple2Indexer"].RebuildIndex();

Here is what my app.config looks like:

<configSections>
    <!-- For more information on Entity Framework configuration, visit http://go.microsoft.com/fwlink/?LinkID=237468 -->
    <section name="Examine" type="Examine.Config.ExamineSettings, Examine" requirePermission="false" />
    <section name="ExamineLuceneIndexSets" type="Examine.LuceneEngine.Config.IndexSets, Examine" requirePermission="false" />

  </configSections>

  <Examine RebuildOnAppStart="false">
    <ExamineIndexProviders>
      <providers>
        <add name="Simple2Indexer" type="Examine.LuceneEngine.Providers.SimpleDataIndexer, Examine" dataService="LucenePOC.Data.ForumDataReaderService,LucenePOC.Data"  indexTypes="TestType" runAsync="false"/>
        <add name="SecondIndexer" type="Examine.LuceneEngine.Providers.SimpleDataIndexer, Examine" dataService="LucenePOC.Data.ForumDataReaderService,LucenePOC.Data" indexTypes="TestType2" runAsync="false"/>
      </providers>
    </ExamineIndexProviders>
    <ExamineSearchProviders defaultProvider="Simple2Searcher">
      <providers>
        <add name="Simple2Searcher" type="Examine.LuceneEngine.Providers.LuceneSearcher, Examine"  />
        <add name="MultiIndexSearcher" type="Examine.LuceneEngine.Providers.MultiIndexSearcher, Examine"
         indexSets="Simple2IndexSet,SecondIndexSet" />
      </providers>
    </ExamineSearchProviders>
  </Examine>
  <ExamineLuceneIndexSets>
    <IndexSet SetName="Simple2IndexSet" IndexPath="F:\Temp\Examine\SimpleIndexSet2">
      <IndexUserFields>
        <add Name="Id" />
        <add Name="Link" />
        <add Name="Module" />
        <add Name="Section" />
        <add Name="Message" />
        <add Name="CreatedBy" />
        <add Name="CreatedOn" />
        <add Name="ModifiedBy" />
        <add Name="ModifiedOn" />
      </IndexUserFields>
    </IndexSet>
    <IndexSet SetName="SecondIndexSet" IndexPath="F:\Temp\Examine\SimpleIndexSet2">
      <IndexUserFields>
        <add Name="Id" />
        <add Name="Link" />
        <add Name="Module" />
        <add Name="Section" />
        <add Name="Message" />
        <add Name="CreatedBy" />
        <add Name="CreatedOn" />
        <add Name="ModifiedBy" />
        <add Name="ModifiedOn" />
      </IndexUserFields>
    </IndexSet>
  </ExamineLuceneIndexSets>

My packages.config looks like this:

<packages>
  <package id="EntityFramework" version="6.1.3" targetFramework="net452" />
  <package id="Examine" version="0.1.70.0" targetFramework="net452" />
  <package id="Lucene.Net" version="2.9.4.1" targetFramework="net452" />
  <package id="SharpZipLib" version="0.86.0" targetFramework="net452" />
</packages>

Can you please let me know what I am missing? Thanks

GroupedOr with a boosted stop word causes an Object Null Reference

I ran into a problem today where if I boosted a stop word and passed it to a GroupedOr, it would explode with an Object Null Reference Exception in Examine.LuceneEngine.SearchCriteria.LuceneSearchCriteria line 308.

I was using Umbraco 7.1.8. I haven't had time to see if there are easier ways to reproduce.

This is a trimmed down, example version of the code I was using to cause the problem.
var searchPhrase = "the united states";
var searchTerms = searchPhrase.RemoveStopWords().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
var siteSearcher = ExamineManager.Instance.SearchProviderCollection["ExternalSearcher"];
var searchCriteria = siteSearcher.CreateSearchCriteria(BooleanOperation.Or);
var query = searchCriteria.GroupedOr(new [] {"nodeName", "navigationTitle"}, searchTerms.Select(t => t.Boost(10)).ToArray());

In case anyone runs into this, the quick workaround is to use this cool string extension method I found called RemoveStopWords(). You can just strip the stop words out before you search with them.

Allow settings parentNodeId per content type

Currently on an indexer, you can set a parentNodeId, however the umbraco content indexer can index both content and media so it's impossible to set a parentNodeId that is relevant for both. We should be allowed to set one per content type.

Stacktrace when entering search term with unmatched double quotes

Entering a search term that contains an unmatched double-quote character (") results in a stacktrace (below). If the number of double-quote characters is even (0, 2, 4) it works fine, but an odd number fails. An example search string that causes this error is as follows:

http://[removed]/search?q=zzz"

The site is using an older version of Umbraco (I believe it's 7.1.6), so the line number in the stacktrace (62) doesn't match up with the latest version of Examine (looking at the code, I think it should be line 79 in the current version).

I realise this isn't the most helpful bug report. I don't have access to the source of the website, so I'm afraid I can't give any useful information about the exact versions of software in use (and won't be able to test a fixed version).

~rbsec

Last Error: System.Web.HttpCompileException
Controller: SearchResultPage
Action: ArticleList
Exception: mscorlib
Length cannot be less than zero. Parameter name: length

System.String.InternalSubStringWithChecks(Int32 startIndex, Int32 length, Boolean fAlwaysCopy)
Examine.StringExtensions.RemoveStopWords(String searchText) in x:\Projects\Examine\Examine\Projects\Examine\StringExtensions.cs:line 62
Umbraco.Extensions.Services.SiteSearchService.GetPagesAndEvents(String searchTerm, Nullable`1 enf, Nullable`1 wildcardMaxLenth)
Umbraco.Extensions.Services.ContentService.SearchResults_viewModelGet(String queryString, Nullable`1 enf, Nullable`1 wildcardMaxLength)
Umbraco.Extensions.Controllers.SearchResultPageController.ArticleList()
lambda_method(Closure , ControllerBase , Object[] )
System.Web.Mvc.ReflectedActionDescriptor.Execute(ControllerContext controllerContext, IDictionary`2 parameters)
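
Until the library handles this input itself, a caller could guard against it before the text reaches RemoveStopWords. The following is a minimal sketch under stated assumptions: the QuoteBalancer class and its BalanceQuotes method are hypothetical names, not part of Examine, and dropping the last quote is only one possible policy (escaping it or appending a closing quote are alternatives).

```csharp
public static class QuoteBalancer
{
    // Hypothetical pre-processing step: if the text contains an odd number
    // of double quotes, remove the last one so every quote has a partner.
    public static string BalanceQuotes(string searchText)
    {
        if (string.IsNullOrEmpty(searchText))
        {
            return string.Empty;
        }

        int quoteCount = 0;
        foreach (char c in searchText)
        {
            if (c == '"') quoteCount++;
        }

        if (quoteCount % 2 == 0)
        {
            return searchText; // already balanced, leave untouched
        }

        int lastQuote = searchText.LastIndexOf('"');
        return searchText.Remove(lastQuote, 1);
    }
}
```

With this guard, the failing input zzz" becomes zzz, while a balanced phrase such as "united states" passes through unchanged.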

Create directoryFactory option so that indexes can use any directory type

This feature allows us to not have to create sub-classed indexers/searchers to use custom directories.

Problem with GroupedNot

When using the GroupedNot method with one field and multiple values, only the first value is added to the query:

E.g.
searchCriteria.NodeTypeAlias("myDocumentTypeAlias");
searchCriteria.GroupedNot(new[] { "id" }.ToList(), new [] {"1","2","3"});

output query:
LuceneQuery: {+__NodeTypeAlias:myDocumentTypeAlias +(-id:1)}

Also, the produced query does not work because of the additional + sign right before the opening bracket.
The correct query should look like this:

+__NodeTypeAlias:myDocumentTypeAlias -id:(1 2 3)
or
+__NodeTypeAlias:myDocumentTypeAlias -id:1 -id:2 -id:3
or
+__NodeTypeAlias:myDocumentTypeAlias -(id:1 id:2 id:3)
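
To make the expected output concrete, here is a small string-building sketch that produces the second form listed above (one prohibited clause per field/value pair). It is not how Examine builds queries internally; ExclusionQuery and BuildNotClauses are hypothetical names, and the values are assumed to need no escaping.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class ExclusionQuery
{
    // Hypothetical helper: emits "-field:value" for every combination of
    // field and value, separated by spaces.
    public static string BuildNotClauses(IEnumerable<string> fields, IEnumerable<string> values)
    {
        List<string> valueList = values.ToList(); // enumerate values once
        IEnumerable<string> clauses = fields.SelectMany(
            field => valueList.Select(value => "-" + field + ":" + value));
        return string.Join(" ", clauses);
    }
}
```

For the field id and the values 1, 2 and 3 this yields -id:1 -id:2 -id:3, which can be appended after the +__NodeTypeAlias clause.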

RebuildOnAppStart removed - it will be up to the application to rebuild if necessary

LuceneIndexer - allow pending adds to be processed when shutdown is requested

Currently when shutdown is requested, all pending adds are cancelled. We want to allow a small window of opportunity for a pending add during a shutdown to get processed.

Exception in background thread killing w3wp

Seeing this exception in Umbraco v8 when displaying the Examine dashboard.

   Lucene.Net.Store.AlreadyClosedException
   at Lucene.Net.Index.IndexReader.EnsureOpen()
   at Lucene.Net.Index.IndexReader.IncRef()
   at Examine.LuceneEngine.Cru.SearcherManager.AcquireSearcher()
   at Examine.LuceneEngine.Cru.SearcherManager.get_IsSearcherCurrent()
   at Examine.LuceneEngine.Cru.NrtManager.MaybeReopen(Boolean)
   at Examine.LuceneEngine.Cru.NrtManagerReopener.Start()
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
   at System.Threading.ThreadHelper.ThreadStart()

No idea why it is throwing, but an unhandled exception in a background thread kills w3wp entirely. Creating the issue to keep a reference and details.
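
Independently of the root cause, a background loop can be written so that an exception in one iteration is reported instead of escaping the thread. This is a generic sketch, not the NrtManagerReopener code; SafeBackgroundLoop, its parameters and the callback shape are all assumptions for illustration.

```csharp
using System;
using System.Threading;

public static class SafeBackgroundLoop
{
    // Runs 'work' repeatedly on a background thread until cancellation.
    // Exceptions thrown by 'work' are handed to 'onError' rather than
    // propagating out of the thread, where they would end the process.
    public static Thread Start(Action work, Action<Exception> onError,
                               CancellationToken token, TimeSpan interval)
    {
        var thread = new Thread(() =>
        {
            while (!token.IsCancellationRequested)
            {
                try
                {
                    work();
                }
                catch (Exception ex)
                {
                    onError(ex); // e.g. log it; must not throw itself
                }

                // Sleep for the interval, but wake immediately on cancellation.
                token.WaitHandle.WaitOne(interval);
            }
        })
        { IsBackground = true };

        thread.Start();
        return thread;
    }
}
```

The trade-off is that swallowing the exception keeps the worker process alive but can hide a persistent fault, so the error callback should at least log what it receives.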

When getting multiple values from a document with the same field name GetValues returns empty if only one result

GetValues should return a result even if there is only one value; currently it only returns values if there is more than one.
As a workaround for now, you'd have to Union the GetValues("myField") result with the result["myField"] value.

Umbraco Examine internal index becomes corrupt regularly

Hi,

On an Umbraco 7.3.2 instance with Examine 0.1.68 installed, we have the out of the box internal index set up:

ExamineIndex.config:

 <IndexSet SetName="InternalIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/Internal/"/>

ExamineSettings.config:

<ExamineIndexProviders>
  <providers>
    <add name="InternalIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine" supportUnpublished="true" supportProtected="true" analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net"/>
  </providers>
</ExamineIndexProviders>
<ExamineSearchProviders defaultProvider="ExternalSearcher">
  <providers>
    <add name="InternalSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net" enableLeadingWildcard="true" enableDefaultEventHandler="true"/>
  </providers>
</ExamineSearchProviders>

We have enabled wildcard search and updates on content saving for this internal index using these settings:
enableLeadingWildcard="true"
enableDefaultEventHandler="true"

This works fine most of the time, but then it stops working completely, with no error in the logs. We can see the index has become corrupt because searching for some content in Umbraco brings no results back. After doing a full site republish the issue gets fixed and the index is back to normal, but this happens regularly.

Is there something we can do to stop the internal index becoming corrupt?

Thank you.

Query strings containing lucene-recognizable boolean operators causes QueryParseException.

I don't know if this is a known issue, but:

Query strings containing Lucene-recognizable boolean operators cause a QueryParseException.
These include: (!, ||, &&, NOT, OR, AND).

Example from our.umbraco.org: https://our.umbraco.org/search?q=OR
The same happens with other Umbraco sites using Examine.
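
One way a site could defend against this on its own side is to neutralize operator tokens in user input before handing it to the query parser. The sketch below assumes the classic Lucene query syntax, in which the word operators are only recognized in upper case and special characters can be escaped with a backslash. OperatorNeutralizer is a hypothetical helper, not an Examine API, and whether the remaining text still produces a useful query depends on the analyzer in use.

```csharp
using System;
using System.Text;

public static class OperatorNeutralizer
{
    private static readonly string[] Keywords = { "AND", "OR", "NOT" };

    // Lower-cases stand-alone AND/OR/NOT tokens so they are treated as plain
    // terms, and backslash-escapes the symbol operators !, | and &.
    public static string Neutralize(string searchText)
    {
        if (string.IsNullOrWhiteSpace(searchText))
        {
            return string.Empty;
        }

        string[] tokens = searchText.Split(
            new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);

        var result = new StringBuilder();
        foreach (string token in tokens)
        {
            string term = Array.IndexOf(Keywords, token) >= 0
                ? token.ToLowerInvariant()
                : token;

            if (result.Length > 0) result.Append(' ');

            foreach (char c in term)
            {
                if (c == '!' || c == '|' || c == '&') result.Append('\\');
                result.Append(c);
            }
        }

        return result.ToString();
    }
}
```

For the example above, the input OR becomes or, and cats && dogs becomes cats \&\& dogs.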

Support of Lucene 3.0.3 and .net 4.5

The current version of Examine is built on top of Lucene 2.9.4.1 and targets .NET Framework 4.0.
It would be nice to have it built on top of the latest version of Lucene (currently 3.0.3) and .NET 4.5.1.

Examine indexing unpublished nodes despite SupportUnpublishedContent = false

Examine indexes published child nodes of unpublished parents both while rebuilding the index and while listening to AfterUpdateDocumentCache/AfterClearDocumentCache.

When rebuilding an index from scratch, published child nodes of unpublished parents are included.
I gave up digging for answers after following:

@UmbracoExamine\BaseUmbracoIndexer.cs line 315

    protected virtual XDocument GetXDocument(string xPath, string type)
    {
        if (this.SupportUnpublishedContent)
        {
            return DataService.ContentService.GetLatestContentByXPath(xPath);
        }
        else
        {
            return DataService.ContentService.GetPublishedContentByXPath(xPath);
        }
    }

@UmbracoExamine\DataServices\UmbracoContentService.cs line 41

    public XDocument GetPublishedContentByXPath(string xpath)
    {
        return library.GetXmlNodeByXPath(xpath).ToXDocument();
    }

@Umbraco.Web\umbraco.presentation\library.cs line 1416

        //TODO: WTF, why is this here? This won't matter if there's an UmbracoContext or not, it will call the same underlying method!
        // only difference is that the UmbracoContext way will check if its in preview mode.
        private static XmlDocument GetThreadsafeXmlDocument()
        {
            return UmbracoContext.Current != null
                       ? UmbracoContext.Current.GetXml()
                       : content.Instance.XmlContent;
        }

        /// <summary>
        /// Queries the umbraco Xml cache with the specified Xpath query
        /// </summary>
        /// <param name="xpathQuery">The XPath query</param>
        /// <returns>Returns nodes matching the xpath query as a XpathNodeIterator</returns>
        public static XPathNodeIterator GetXmlNodeByXPath(string xpathQuery)
        {
            XPathNavigator xp = GetThreadsafeXmlDocument().CreateNavigator();

            return xp.Select(xpathQuery);
        }

@Umbraco.Web\umbraco.presentation\UmbracoContext.cs line 84

        public XmlDocument GetXml()
        {
            var umbracoContext = Umbraco.Web.UmbracoContext.Current;
            var cache = umbracoContext.ContentCache.InnerCache as Umbraco.Web.PublishedCache.XmlPublishedCache.PublishedContentCache;
            if (cache == null)
                throw new InvalidOperationException("Unsupported IPublishedContentCache, only the Xml one is supported.");

            return cache.GetXml(umbracoContext, umbracoContext.InPreviewMode);
        }

It would seem Umbraco's published document cache is the root cause.

Also, when unpublishing a node, AfterClearDocumentCache does not fire for the children of the unpublished node, leaving the children in the index.

Expose Lucene internal properties like QueryParser on the LuceneSearchCriteria

RebuildIndex doesn't need to clear out the index, we can just ctor a new Writer

Currently we clear out the documents to rebuild but that is unnecessary, here's the Lucene docs:

The create argument to the constructor determines whether a new index is created, or whether an existing index is opened. Note that you can open an index with create=true even while readers are using the index. The old readers will continue to search the "point in time" snapshot they had opened, and won't see the newly created index until they re-open. There are also constructors with no create argument which will create a new index if there is not already an index at the provided path and otherwise open the existing index.

https://lucene.apache.org/core/2_9_4/api/all/org/apache/lucene/index/IndexWriter.html

When using German Analyzer for SearchProvider and IndexProvider, query is incorrect.

Hi,

When using GermanAnalyzer for the search provider, the LuceneBooleanOperation Compile() method generates a query that looks like this:
{ SearchIndexType: content, LuceneQuery: +(+(contents:searchedword*)) +__IndexType:con }

On the other hand, the StandardAnalyzer, which is used for English, generates the following query when used as the search provider's analyzer:
{ SearchIndexType: content, LuceneQuery: +(+(contents:searchedword*)) +__IndexType:content }

After further investigation, it seems that the field __IndexType is being tokenized and stemmed by the GermanAnalyzer, so the word content becomes con.
With the search condition __IndexType:con, Lucene returns 0 results, as __IndexType only ever holds the value content or media.

I'm not sure how to fix it, as the project is complex.
After a brief investigation I found that the following line is missing a 4th parameter that would prevent this field from being analyzed:

this.search.FieldInternal( LuceneExamineIndexer.IndexTypeFieldName, new ExamineValue(Examineness.Explicit, this.search.SearchIndexType.ToString().ToLower()), BooleanClause.Occur.MUST);

I have a workaround for this now, but it is hacky.
When I have some spare time I will try to create a pull request.

Create new Examine.Directory.Sync to support sync'd directories

Deletions do not commit to index until a subsequent re-index of another node.

I'm using Examine in an MVC web application and have had issues with deletions not committing to the Lucene index. My project has been set up with a DataService that implements ISimpleDataService and quite happily indexes the data in my Entity Framework database.

However, I have discovered that when issuing a delete operation through the ExamineManager class, the node passed to it is not deleted immediately from the index. The node is eventually deleted when a new or existing node is indexed, but until then it remains in the search index and appears in search results. This causes a 404 within my application, because the corresponding database entry has already been removed while the index entry lingers in the Lucene index.

I decided to have a peek at the source files for Examine to try to understand what is going on, and in the file Examine/LuceneEngine/Providers/LuceneIndexer.cs I believe I have found an erroneous function call that is causing this issue.

On line 1521 of the aforementioned file is the following method.

[SecuritySafeCritical]
private void ProcessQueueItem(IndexOperation item, ICollection<IndexedNode> indexedNodes, IndexWriter writer)
{
    switch (item.Operation)
    {
        case IndexOperationType.Add:
            if (ValidateDocument(item.Item.DataToIndex))
            {
                //var added = ProcessIndexQueueItem(item, inMemoryWriter);
                var added = ProcessIndexQueueItem(item, writer);
                indexedNodes.Add(added);
            }
            else
            {
                //do the delete but no commit - it may or may not exist in the index but since it is not
                // valid it should definitely not be there.
                ProcessDeleteQueueItem(item, writer, false);

                OnIgnoringNode(new IndexingNodeDataEventArgs(item.Item.DataToIndex, int.Parse(item.Item.Id), null, item.Item.IndexType));
            }
            break;
        case IndexOperationType.Delete:
            ProcessDeleteQueueItem(item, writer, false);
            break;
        default:
            throw new ArgumentOutOfRangeException();
    }
}

For the IndexOperationType.Delete case, I believe the final parameter of the ProcessDeleteQueueItem should be set to true, as it is a flag as to whether to commit the change or not.

As it stands, I believe the delete operation does not commit its own change, so the deletion only reaches the index when a subsequent re-index operation is processed and commits any outstanding changes.
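To illustrate the suspected behaviour, here is a toy model of a writer with buffered changes. It is not Examine or Lucene.Net code: the BufferedIndex class and its members are hypothetical, and it assumes that searches only see committed state.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a writer that buffers changes until a commit.
// This is NOT Lucene.Net's IndexWriter or Examine's LuceneIndexer.
public class BufferedIndex
{
    private readonly HashSet<string> _committed = new HashSet<string>();
    private readonly List<Action<HashSet<string>>> _pending = new List<Action<HashSet<string>>>();

    public void Add(string id, bool commit)
    {
        _pending.Add(set => set.Add(id));
        if (commit) Commit();
    }

    public void Delete(string id, bool commit)
    {
        _pending.Add(set => set.Remove(id));
        if (commit) Commit();
    }

    // Applies every buffered change, in order, to the committed state.
    public void Commit()
    {
        foreach (var change in _pending) change(_committed);
        _pending.Clear();
    }

    // Assumption: searches only ever see committed state.
    public bool IsSearchable(string id)
    {
        return _committed.Contains(id);
    }
}
```

In this model, a Delete with commit set to false leaves the node searchable until some later operation commits, which matches the symptom described above.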

I've not managed to test this out yet, but was wondering if you could confirm my suspicions or not.

Kind regards,
Tim

Raw lucene sub-query

I was wondering whether there is a way to add raw Lucene sub-queries.

If not, would it be very difficult to implement a method like RawQuery(string rawQuery)? That would help when creating complex queries with the Fluent API.

Example:

var query = searchCriteria
    .Fields("nodeName","hello")
    .And().RawQuery(" +(metaTitle:hello metaDescription:goodbye)")
    .Compile();

SimpleDataIndexer should not put the result from DataService.GetAllData into memory

Currently the result of DataService.GetAllData is put into memory and AddNodesToIndex is then called with the whole in-memory collection. Instead, we can iterate over the enumerable and call AddNodesToIndex incrementally, so that we are not doubling up on memory usage.
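One way this could look is a small batching helper that streams any enumerable in fixed-size chunks. This is a minimal sketch: the BatchingExtensions class and the InBatches method are hypothetical names, not part of Examine, and the batch size is left to the caller.

```csharp
using System;
using System.Collections.Generic;

public static class BatchingExtensions
{
    // Yields the source in fixed-size batches without materialising all of it at once.
    public static IEnumerable<List<T>> InBatches<T>(this IEnumerable<T> source, int batchSize)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));
        return Iterate(source, batchSize);
    }

    private static IEnumerable<List<T>> Iterate<T>(IEnumerable<T> source, int batchSize)
    {
        var batch = new List<T>(batchSize);
        foreach (var item in source)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<T>(batchSize); // fresh list so earlier batches are not mutated
            }
        }
        if (batch.Count > 0) yield return batch; // trailing partial batch
    }
}
```

The indexer could then loop over the batches of the GetAllData result and pass each one to AddNodesToIndex, so only one batch is held in memory at a time. This only helps if the data service returns a lazily evaluated enumerable in the first place.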

When app pool is shutting down there's potential for a YSOD because the BlockingCollection is disposed

YSOD:

Server Error in '/' Application.

The collection has been disposed.
Object name: 'BlockingCollection'.

Description: An unhandled exception occurred during the execution of the current web request. Please review the stack trace for more information about the error and where it originated in the code. 

Exception Details: System.ObjectDisposedException: The collection has been disposed.
Object name: 'BlockingCollection'.

Source Error: 


Line 1615:            }
Line 1616:            else
Line 1617:            {
Line 1618:                OnIndexingError(
Line 1619:                    new IndexingErrorEventArgs(

Source File: x:\Projects\Examine\Examine\Projects\Examine\LuceneEngine\Providers\LuceneIndexer.cs    Line: 1617 

Stack Trace: 


[ObjectDisposedException: The collection has been disposed.
Object name: 'BlockingCollection'.]
   System.Collections.Concurrent.BlockingCollection`1.CheckDisposed() +2116463
   System.Collections.Concurrent.BlockingCollection`1.TryAddWithNoTimeValidation(T item, Int32 millisecondsTimeout, CancellationToken cancellationToken) +52
   Examine.LuceneEngine.Providers.LuceneIndexer.EnqueueIndexOperation(IndexOperation op) in x:\Projects\Examine\Examine\Projects\Examine\LuceneEngine\Providers\LuceneIndexer.cs:1617
   Examine.LuceneEngine.Providers.LuceneIndexer.IndexAll(String type) in x:\Projects\Examine\Examine\Projects\Examine\LuceneEngine\Providers\LuceneIndexer.cs:829
   UmbracoExamine.BaseUmbracoIndexer.IndexAll(String type) in X:\Projects\Umbraco\Umbraco_7.4\src\UmbracoExamine\BaseUmbracoIndexer.cs:279
   UmbracoExamine.BaseUmbracoIndexer.PerformIndexRebuild() in X:\Projects\Umbraco\Umbraco_7.4\src\UmbracoExamine\BaseUmbracoIndexer.cs:353
   UmbracoExamine.BaseUmbracoIndexer.RebuildIndex() in X:\Projects\Umbraco\Umbraco_7.4\src\UmbracoExamine\BaseUmbracoIndexer.cs:265
   UmbracoExamine.UmbracoContentIndexer.RebuildIndex() in X:\Projects\Umbraco\Umbraco_7.4\src\UmbracoExamine\UmbracoContentIndexer.cs:483
   Overflow.Controllers.TestController.Index(RenderModel render) in x:\Projects\Umbraco\Umbraco_7.4\src\Umbraco.Web.UI\App_Code\UmbContactController.cs:117
   lambda_method(Closure , ControllerBase , Object[] ) +139
   System.Web.Mvc.ReflectedActionDescriptor.Execute(ControllerContext controllerContext, IDictionary`2 parameters) +229
   System.Web.Mvc.ControllerActionInvoker.InvokeActionMethod(ControllerContext controllerContext, ActionDescriptor actionDescriptor, IDictionary`2 parameters) +35
   System.Web.Mvc.Async.AsyncControllerActionInvoker.<BeginInvokeSynchronousActionMethod>b__39(IAsyncResult asyncResult, ActionInvocation innerInvokeState) +39
   System.Web.Mvc.Async.WrappedAsyncResult`2.CallEndDelegate(IAsyncResult asyncResult) +71
   System.Web.Mvc.Async.AsyncControllerActionInvoker.EndInvokeActionMethod(IAsyncResult asyncResult) +42
   System.Web.Mvc.Async.AsyncInvocationWithFilters.<InvokeActionMethodFilterAsynchronouslyRecursive>b__3d() +72
   System.Web.Mvc.Async.<>c__DisplayClass46.<InvokeActionMethodFilterAsynchronouslyRecursive>b__3f() +386
   System.Web.Mvc.Async.<>c__DisplayClass46.<InvokeActionMethodFilterAsynchronouslyRecursive>b__3f() +386
   System.Web.Mvc.Async.<>c__DisplayClass46.<InvokeActionMethodFilterAsynchronouslyRecursive>b__3f() +386
   System.Web.Mvc.Async.AsyncControllerActionInvoker.EndInvokeActionMethodWithFilters(IAsyncResult asyncResult) +42
   System.Web.Mvc.Async.<>c__DisplayClass2b.<BeginInvokeAction>b__1c() +38
   System.Web.Mvc.Async.<>c__DisplayClass21.<BeginInvokeAction>b__1e(IAsyncResult asyncResult) +186
   System.Web.Mvc.Async.AsyncControllerActionInvoker.EndInvokeAction(IAsyncResult asyncResult) +38
   System.Web.Mvc.Controller.<BeginExecuteCore>b__1d(IAsyncResult asyncResult, ExecuteCoreState innerState) +29
   System.Web.Mvc.Async.WrappedAsyncVoid`1.CallEndDelegate(IAsyncResult asyncResult) +67
   System.Web.Mvc.Controller.EndExecuteCore(IAsyncResult asyncResult) +53
   System.Web.Mvc.Async.WrappedAsyncVoid`1.CallEndDelegate(IAsyncResult asyncResult) +36
   System.Web.Mvc.Controller.EndExecute(IAsyncResult asyncResult) +38
   System.Web.Mvc.MvcHandler.<BeginProcessRequest>b__5(IAsyncResult asyncResult, ProcessRequestState innerState) +44
   System.Web.Mvc.Async.WrappedAsyncVoid`1.CallEndDelegate(IAsyncResult asyncResult) +67
   System.Web.Mvc.MvcHandler.EndProcessRequest(IAsyncResult asyncResult) +38
   System.Web.CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute() +399
   System.Web.HttpApplication.ExecuteStep(IExecutionStep step, Boolean& completedSynchronously) +137

This can be replicated by doing this in a GET request (not that you should ever do this):

var doc = Services.ContentService.GetById(CurrentPage.Id);
var xml = doc.ToXml();
//add an icon attribute to get indexed
xml.Add(new XAttribute("icon", doc.ContentType.Icon));

ApplicationContext.RestartApplicationPool(HttpContext);

ExamineManager.Instance.IndexProviderCollection["InternalIndexer"].RebuildIndex();
ExamineManager.Instance.IndexProviderCollection["InternalIndexer"].ReIndexNode(xml, IndexTypes.Content);
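A possible way to avoid the YSOD is to treat a disposed queue as "shutting down" rather than as an error. The sketch below is not the actual fix in Examine: the SafeQueue class and TryEnqueue method are hypothetical, and it assumes that silently dropping an operation during shutdown is acceptable.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical wrapper; not part of Examine.
public class SafeQueue<T> : IDisposable
{
    private readonly BlockingCollection<T> _queue = new BlockingCollection<T>();

    // Returns false instead of throwing when the queue has been shut down.
    public bool TryEnqueue(T item)
    {
        try
        {
            return _queue.TryAdd(item);
        }
        catch (ObjectDisposedException)
        {
            return false; // the collection was disposed, e.g. during app pool shutdown
        }
        catch (InvalidOperationException)
        {
            return false; // CompleteAdding was called, so no more items are accepted
        }
    }

    public void Dispose()
    {
        _queue.Dispose();
    }
}
```

The caller can then decide whether to log the dropped operation or ignore it when TryEnqueue returns false.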

Use a new IndexWriterTracker to track IndexWriters across LuceneSearcher and LuceneIndexer to have NRT by default

We have NRT built into Examine v1 based on the ctor overloads, but not based on the standard Examine config/provider model. We could achieve this by creating an IndexWriterTracker, similar to the DirectoryTracker that we already use, so that there is only one IndexWriter per Directory, which would make NRT the default.

This has some consequences, though, because many of the GetIndexWriter, etc... methods are virtual and are overridden in Umbraco, so it would be hard to force the usage of NRT. Perhaps we can support both, and libraries that want NRT will need to adjust their overrides.
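The "one IndexWriter per Directory" idea could be sketched roughly as below. The WriterTracker class is hypothetical and generic over the writer type so that it does not depend on Lucene.Net; keying by a case-insensitive path string is an assumption made for the example, not how DirectoryTracker necessarily works.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical tracker: hands out one shared writer per directory key.
public class WriterTracker<TWriter>
{
    private readonly ConcurrentDictionary<string, Lazy<TWriter>> _writers =
        new ConcurrentDictionary<string, Lazy<TWriter>>(StringComparer.OrdinalIgnoreCase);

    public TWriter GetWriter(string directoryPath, Func<string, TWriter> factory)
    {
        // GetOrAdd may create more than one Lazy under contention, but only the
        // stored one is ever evaluated, so the factory runs once per key.
        return _writers
            .GetOrAdd(directoryPath, path => new Lazy<TWriter>(() => factory(path)))
            .Value;
    }
}
```

Both the searcher and the indexer would ask the tracker for the writer, so they share the same instance for a given directory.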

Need to make the config abstract so it can be mocked or overridden or replaced

Currently it's very strongly tied to the config file. We need to abstract this out somehow, or at least make it replaceable, so that we can configure it outside of a web app.

Race condition could occur when an index doesn't exist

The searcher and the indexer could simultaneously attempt to initialize an index in a Lucene directory if it doesn't exist. I'm not sure if this has ever happened, but it is theoretically possible: there is legacy code in the searcher that creates an index at its location if one doesn't exist, even though this is the responsibility of the indexer.

If there are multiple app restarts AlreadyClosedException may occur

After running some tests with high concurrency and adding multiple app restarts into the mix, we end up with error logs such as:

2015-03-30 13:25:30,307 [49] ERROR UmbracoExamine.DataServices.UmbracoLogService - [Thread 71] Provider=InternalIndexer, NodeId=-1
System.Exception: IndexSet: InternalIndexSet, Lucene.Net.Store.AlreadyClosedException: this IndexWriter is closed
at Lucene.Net.Index.IndexWriter.EnsureOpen(Boolean includePendingClose)
at Lucene.Net.Index.IndexWriter.EnsureOpen()
at Lucene.Net.Index.IndexWriter.Commit(IDictionary`2 commitUserData)
at Examine.LuceneEngine.Providers.LuceneIndexer.ForceProcessQueueItems(Boolean block) in x:\Projects\Examine\Examine\Projects\Examine\LuceneEngine\Providers\LuceneIndexer.cs:line 1456

This is due to the way that the index writer is closed in some cases during disposal. We block the shutdown thread until the writing has finished and then commit, but in some cases another thread might be trying to commit the last batch and then we close too early. To fix this we simply track the number of active entries in ForceProcessQueueItems and, during disposal, wait until this is zero.
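The counting approach could look something like the sketch below. It is an illustration only, not the code in LuceneIndexer: the ActiveOperationTracker class and its members are hypothetical, and the timeout on the wait is an added assumption so that shutdown cannot block forever.

```csharp
using System;
using System.Threading;

// Hypothetical helper: counts in-flight operations and lets disposal wait for them.
public class ActiveOperationTracker
{
    private readonly object _sync = new object();
    private int _active;
    private bool _closing;

    // Returns false once shutdown has begun, so no new work starts.
    public bool TryEnter()
    {
        lock (_sync)
        {
            if (_closing) return false;
            _active++;
            return true;
        }
    }

    // Must be called exactly once for every successful TryEnter.
    public void Exit()
    {
        lock (_sync)
        {
            _active--;
            if (_active == 0) Monitor.PulseAll(_sync); // wake any waiting disposer
        }
    }

    // Blocks until every in-flight operation has exited or the timeout elapses.
    public bool WaitForIdle(TimeSpan timeout)
    {
        lock (_sync)
        {
            _closing = true;
            var deadline = DateTime.UtcNow + timeout;
            while (_active > 0)
            {
                var remaining = deadline - DateTime.UtcNow;
                if (remaining <= TimeSpan.Zero) return false;
                Monitor.Wait(_sync, remaining);
            }
            return true;
        }
    }
}
```

Each pass through the commit path would call TryEnter before touching the writer and Exit in a finally block, and disposal would call WaitForIdle before closing the writer.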