stractorg / stract

web search done right

Home Page: https://stract.com

License: GNU Affero General Public License v3.0

Rust 90.91% TypeScript 2.91% Python 0.54% Just 0.07% JavaScript 0.13% CSS 0.02% Svelte 5.42% Shell 0.01%

rust search search-engine web

stract's Introduction

Stract is an open source web search engine hosted at stract.com targeted towards tinkerers and developers.

💡 Features

Keyword search that respects your search query.
Fully independent search index with our own crawler.
Advanced query syntax (site:, intitle: etc.).
DDG-style !bang syntax
Wikipedia and stackoverflow sidebar
De-rank websites with third-party trackers
Use optics to almost endlessly customize your search results.
- Limit your searches to blogs, indieweb, educational content etc.
- Customize how signals are combined during search for the final search result
Prioritize links (centrality) from the sites you trust.
Explore the web and find sites similar to the ones you like.
And much more!

👩‍💻 Setup

We recommend that everyone use the hosted version at stract.com, but you can also follow the steps outlined in CONTRIBUTING.md to set up the engine locally.

‍💼 License

Stract is offered under the terms defined in the LICENSE.md file.

📬 Contact

You can contact us at [email protected] or come hang out in our Discord or Matrix server.

🏆 Thank you!

We truly stand on the shoulders of giants and this project would not have been even remotely feasible without them. An especially huge thank you to

The authors and contributors of Tantivy for providing the inverted index library on which Stract is built.
The commoncrawl organization for crawling the web and making the dataset readily available. Even though we have our own crawler now, commoncrawl has been a huge help in the early stages of development.

💰 Funding

This project is currently funded through NGI0 Entrust, a fund established by NLnet with financial support from the European Commission's Next Generation Internet program. Learn more at the NLnet project page.

stract's Issues

Support for phrase queries

Term proximity using sliding window of bi-grams and phrase query with slop

Create a `configure` subcommand that sets up the dev environment

The subcommand should be hidden under something like a dev feature. The subcommand should create a small index, entity index, autosuggest file etc. We need to host a truncated Wikipedia dump and a single WARC file on our servers that get downloaded by the configure command.

The just local command can probably be removed after this has been implemented.

I hope you won't have to resort to the CPU cost of serving ads

"Well, currently we don't. We are bootstrapped and trying to keep costs low. In the future we will have clearly labelled, contextual ads based on your current search query and a subscription option without ads. Just to re-iterate: we will only use your current search to match ads and will never track you across searches."
I hope you won't have to resort to the CPU cost of serving ads.

Re-write the query parser to Lalrpop and add support for advanced search queries.

The user should be able to search for e.g. "test site:stackoverflow.com" and only websites containing "test" from stackoverflow should be returned.

To support this, we first need to re-write the query parser to Lalrpop as this will make it easier to parse the more advanced query syntax. It will also allow us to use the same parser library when we implement support for goggles.

Ideally we would support most of the operators mentioned here.
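
As a rough illustration of the behaviour described above, here is a minimal hand-written sketch (not Lalrpop, and not Stract's actual parser). The `Term` enum and `parse_query` function are hypothetical names introduced only for this example; it assumes operators are whitespace-separated tokens of the form `site:value` or `intitle:value`.

```rust
/// Hypothetical query term representation, for illustration only.
#[derive(Debug, PartialEq)]
enum Term {
    /// A plain keyword such as `test`.
    Simple(String),
    /// A `site:` restriction such as `site:stackoverflow.com`.
    Site(String),
    /// An `intitle:` restriction.
    InTitle(String),
}

/// Splits the query on whitespace and classifies each token by its prefix.
fn parse_query(query: &str) -> Vec<Term> {
    query
        .split_whitespace()
        .map(|token| {
            if let Some(site) = token.strip_prefix("site:") {
                Term::Site(site.to_string())
            } else if let Some(title) = token.strip_prefix("intitle:") {
                Term::InTitle(title.to_string())
            } else {
                Term::Simple(token.to_string())
            }
        })
        .collect()
}
```

A grammar-based parser would additionally handle quoting and nesting, which this sketch deliberately ignores.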

Old value of safe-search used when going back to "Search"

If one goes to the settings after performing a search, toggles safe-search, and then goes back to /search, then ?ss= is still set to the old value. However, any new search correctly uses the new value of safe-search.

This is in line with what we expect when storing the query in session storage, but perhaps some parts of url.search should not be synced.

What are your thoughts on this?

Topic specific centrality during ranking

We should calculate a centrality score for a fixed number of topics.
I think this can be done by considering a set of websites that have been classified as belonging to a specific topic, and only considering outbound links from those sites. This is the same principle as how personalized centrality will be implemented.

We can use the following datasets for site classification:
https://www.kaggle.com/datasets/shawon10/url-classification-dataset-dmoz
https://www.kaggle.com/datasets/hetulmehta/website-classification

I don't know how to add your search engine to Firefox

I tried it, I think it's great and I like the unique results it's producing, but the chance I will continue to use it if it's not one of the search engines registered in my browser is exactly zero. I would love it if you made that possible or explained how we can do it, so that we can actually use this cool search engine instead of just seeing it once and forgetting it forever.

Support for `similar:example.com` queries

It would be nice to be able to restrict the search for sites that are similar to a given webpage(s). It can probably be implemented by doing a url search in the index, take the top N results and perform a MoreLikeThisQuery with Occur::Should between the N queries and an Occur::Must at the end.
This will require the parser to have access to the index it builds the query for, so that it can perform the initial url search.

Bm25 requires at least one term

An empty search term crashes the page

Add more scripts to tokenizer with tests

Just like we support Han, Arabic etc. we should support more scripts. I think a full list can be found here.

features that would be great to have

there are some features that I miss in trystract.com

definitions (for a query like 'define enthusiasm')
calculator widget
weather widget
instant answers

[Suggestion] Docker-compose

It would be nice to release stract as a docker image and to allow people to easily host stract with a docker compose

Create docker compose file
Host built images for dev and release channels

Stackoverflow sidebar and small answer cards

The answer cards should be present under most stackoverflow results

Add ability to filter by date

Could probably be implemented by performing a range query on the 'last updated' field

Suspicious char/byte offset mixing

These are some places where byte offsets might be incorrectly used in place of char offsets or vice versa.

stract/core/src/spell/spell_checker.rs

Lines 37 to 43 in 0ac07ab

    let word_len = word.chars().count();
    if word_len > 1 {
        for i in 0..word_len {
            let delete = word
                .char_indices()
                .filter(|(j, _)| *j != i)

In this one the chars are counted, but char_indices produces byte offsets.
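
One possible way to keep both indices in the same unit, sketched under the assumption that the loop is meant to produce every single-character deletion of `word`. The function name `single_deletes` is hypothetical and not taken from the repository.

```rust
/// Returns every string obtained by deleting exactly one character from `word`.
/// Both `i` and `j` are character positions, so multi-byte characters are handled.
fn single_deletes(word: &str) -> Vec<String> {
    let word_len = word.chars().count();
    let mut deletes = Vec::new();
    if word_len > 1 {
        for i in 0..word_len {
            let delete: String = word
                .chars()
                .enumerate() // yields char positions, unlike `char_indices`
                .filter(|(j, _)| *j != i)
                .map(|(_, c)| c)
                .collect();
            deletes.push(delete);
        }
    }
    deletes
}
```

With `char_indices`, a word such as `añb` yields byte offsets 0, 1 and 3, so the iteration with `i == 2` deletes nothing.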

stract/core/src/search_prettifier/entity.rs

Lines 53 to 55 in 0ac07ab

    if let Some(first_bullet) = value.find('•') {
        value = value.chars().skip(first_bullet + 1).collect();
    }

find returns a byte offset, but it is used as a char offset.
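
A minimal sketch of one way to stay in byte offsets throughout, assuming the intent is to drop everything up to and including the first bullet. The helper name `after_first_bullet` is hypothetical.

```rust
/// Returns the part of `value` that follows the first '•', or `value` unchanged
/// if there is no bullet.
fn after_first_bullet(value: &str) -> &str {
    match value.find('•') {
        // `find` yields a byte offset, so slice by bytes and skip the
        // bullet's own UTF-8 length (3 bytes) rather than a single unit.
        Some(first_bullet) => &value[first_bullet + '•'.len_utf8()..],
        None => value,
    }
}
```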

stract/core/src/search_prettifier/mod.rs

Line 107 in 0ac07ab

pretty_url = pretty_url.chars().take(pretty_url.len() - 1).collect();

.len() returns the length in bytes, but it is used as a char count.
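
If the goal of that line is simply to drop the last character, a sketch like the following avoids counting altogether. The helper name `drop_last_char` is hypothetical.

```rust
/// Returns `s` without its final character, regardless of how many bytes
/// that character occupies.
fn drop_last_char(s: &str) -> String {
    let mut chars = s.chars();
    chars.next_back(); // consume the last char, if any
    chars.as_str().to_string()
}
```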

stract/core/src/api/autosuggest.rs

Lines 28 to 34 in 0ac07ab

    let idx = suggestion
        .chars()
        .zip(query.chars())
        .position(|(suggestion_char, query_char)| suggestion_char != query_char)
        .unwrap_or(query.len());
    let mut new_suggestion: String = suggestion.chars().take(idx).collect();

.position() returns a char offset, but in .unwrap_or(query.len()) a byte offset is used. The resulting .take(idx) might use the byte offset given by query.len().
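
A sketch of a small change that keeps `idx` a char offset in both branches, assuming the fallback is meant to be "the whole query matched". The function name `shared_prefix` is hypothetical.

```rust
/// Returns the leading characters of `suggestion` that match `query`.
fn shared_prefix(suggestion: &str, query: &str) -> String {
    let idx = suggestion
        .chars()
        .zip(query.chars())
        .position(|(suggestion_char, query_char)| suggestion_char != query_char)
        // Fall back to the query's length in chars, not bytes.
        .unwrap_or_else(|| query.chars().count());
    suggestion.chars().take(idx).collect()
}
```

For the query `ñandú` (5 chars, 7 bytes), the byte-based fallback would take 7 chars of the suggestion instead of 5.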

Entity sidebar should ignore special characters

Searching for x-men does not currently return any sidebar, as the query is tokenized as ['x-men'] whereas it should be ['x', 'men']. We should use the same tokenizer as the title field.

Display issue with non-Unicode encoded results

Noticed that the search preview extracted from pages that are not Unicode-encoded (at least one that is in windows-1252) doesn’t display right:

There might be a way to detect the page encoding and get the preview right?

Dark mode for mobile navigation drop down

Currently the mobile navigation drop down displays with a white background even in dark mode.

Parse DMOZ site descriptions

Some sites don't provide a description meta tag in the HTML. For these sites, we should take the description from DMOZ unless they have specified the NOODP tag.

Use rustls instead of openssl

Bug: Search term & optic disappear when using back button

Search for something (e.g. "Land Cruiser 80")
Optional: add optic
Click on first result
Return to search results
Search term (& optic) has (have) disappeared

edit: iOS 17 + Firefox

Sometimes terms are not properly highlighted in snippet

Searching for "linus torvalds" fails to highlight "torvalds" for some reason

Support for DDG-like !bang queries

Results

Why, when I search for "Stract", none of the results include trystract, or even have stract in the title?

Test if PGO can improve performance

It would be fun to see if PGO can improve the search performance in the final application.

Goggle sites cannot start with numbers

This leads to an ambiguous grammar in the parser if we implement it naively.

Only show image if query matches image title or description

We could probably just store a hashset of the image text + caption and check if they match during search. We will only need to deserialize the hashset for the results we show to the user, so it hopefully wouldn't be too expensive
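
A minimal sketch of the hashset idea, under the assumptions that terms are lowercased whitespace-separated words and that "match" means every query term appears in the image text. The function names are hypothetical and the issue does not specify the matching rule.

```rust
use std::collections::HashSet;

/// Builds the set of lowercased words from an image's title and caption.
fn image_terms(title: &str, caption: &str) -> HashSet<String> {
    title
        .split_whitespace()
        .chain(caption.split_whitespace())
        .map(|term| term.to_lowercase())
        .collect()
}

/// Returns true if the query is non-empty and all of its terms occur in `terms`.
fn query_matches_image(query: &str, terms: &HashSet<String>) -> bool {
    let mut query_terms = query.split_whitespace().peekable();
    // Guard against the empty query, for which `all` would be vacuously true.
    query_terms.peek().is_some()
        && query_terms.all(|term| terms.contains(&term.to_lowercase()))
}
```

Requiring only one matching term instead would be a matter of replacing `all` with `any`.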

Spell checker

Good starting points are this blog post from Peter Norvig and this blog post by Wolf Garbe. We can probably build the spell checker data structure from all the terms in the index.

Tapping suggestions doesn't work on iOS

Tapping a search bar suggestion on iOS does nothing except hide the suggestions.

Use page centrality during ranking

We already calculate the centrality for each specific webpage. This should be used during ranking and probably be prioritised higher than the host centrality.

Adding this should also give an extra boost to homepages, since they tend to have a higher centrality score, and should also mitigate the issue we have with multiple results from the same site.

Document setup steps

install rust
install clang
(optional) install cargo-watch
(optional) install just
(optional) install git-lfs and download data

Missing URL encoding for WolframAlpha bang

The bang !wa 1 + 5 + 102 redirects to WolframAlpha correctly, but the expression does not seem to be encoded correctly, since the input query on WA is 1 5 102 missing the +'s.

The given URL query is ?i=1+++5+++102, while the expected is ?i=1+%2B+5+%2B+102.
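
To illustrate the expected encoding, here is a self-contained sketch of `application/x-www-form-urlencoded` value encoding. The function name `form_urlencode` is hypothetical; in practice an existing URL library would normally be used rather than a hand-rolled encoder.

```rust
/// Encodes `input` as a form-urlencoded query value: spaces become `+`,
/// ASCII alphanumerics and `*-._` are kept, everything else is percent-encoded.
fn form_urlencode(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for byte in input.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'*' | b'-' | b'.' | b'_' => {
                out.push(byte as char)
            }
            b' ' => out.push('+'),
            // A literal `+` ends up here and becomes `%2B`.
            other => out.push_str(&format!("%{:02X}", other)),
        }
    }
    out
}
```

Under these rules, `1 + 5 + 102` encodes to `1+%2B+5+%2B+102`, which matches the expected query in the report.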

Getting started

I've compiled stract via cargo build --release. What do I do next?

How much disk space is required?

I can run the indexer / crawler / scraper via stract indexer, stract crawler and stract autosuggest-scrape.

do you need to run the crawler first?

I can run the search servers via stract search-server and stract entity-search-server.

I can run the API server via stract api.

Empty optic selector for repeated search

This issue tracks the problem identified in #97.

If a second search is performed without an optic, the optic selector gets empty whereas it should have "No Optic" still selected.

Increase indexing performance by only using detect-lang when guessing website region

~19% of time during indexing seems to be used to detect the language at various stages. Most of these can probably be omitted.

[QUESTION] Can Hearchco scrape you?

Hello, this project looks interesting. As a fellow open source developer, I wanted to ask for permission to add your main instance (stract.com) to the Hearchco meta-search engine. If it is fine by you, we will scrape it; if not, I won't add it, out of respect for your contributions to the open source community.

Take number of ads on website into account during ranking

We can get the data from Who Tracks Me and store it as a fastfield in the inverted index

Explore page accessibility

The input should have type=url or similar for correct keyboard on mobile
A loading indicator should be shown on changes

Index segments should not be merged on commit, but should be explicitly merged by the indexer after commit.

Also have a merge policy that merges M segments into N of approximately equal size
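
One way such a policy could choose its groups is a greedy "largest first into the smallest group" heuristic. The sketch below works on plain segment sizes and is an assumption about how the grouping might be planned; it does not use or mirror any index library's merge-policy API, and `plan_merges` is a hypothetical name.

```rust
/// Distributes `segment_sizes` over `n` groups so that the group totals are
/// roughly balanced. Each returned group lists the sizes to merge together.
fn plan_merges(segment_sizes: &[u64], n: usize) -> Vec<Vec<u64>> {
    let mut groups: Vec<Vec<u64>> = vec![Vec::new(); n];
    if n == 0 {
        return groups;
    }
    let mut totals = vec![0u64; n];
    // Place the largest segments first so smaller ones can even things out.
    let mut sorted = segment_sizes.to_vec();
    sorted.sort_unstable_by(|a, b| b.cmp(a));
    for size in sorted {
        // Pick the group with the smallest running total (first one on ties).
        let idx = totals
            .iter()
            .enumerate()
            .min_by_key(|&(_, total)| *total)
            .map(|(i, _)| i)
            .expect("n > 0, so there is at least one group");
        groups[idx].push(size);
        totals[idx] += size;
    }
    groups
}
```

The heuristic is not optimal, but it is simple and keeps the resulting segments within a reasonable factor of each other.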

Don't remove punctuation during tokenization

Currently "example.com" gets tokenized to ["example", "com"]. This should be tokenized to ["example", ".", "com"] instead, as this will give far better search results for navigational searches and searches like "c++". The tokenizer needs to be changed in both the indexing and search stages.

Text containing whitespace, such as "example query", should still be tokenized as ["example", "query"].
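
The desired behaviour can be sketched with a small standalone function. It assumes each non-alphanumeric, non-whitespace character becomes its own token; `tokenize` is a hypothetical name and this is not the tokenizer used in the index.

```rust
/// Splits `text` into alphanumeric runs and single punctuation tokens,
/// discarding whitespace.
fn tokenize(text: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut current = String::new();
    for c in text.chars() {
        if c.is_alphanumeric() {
            current.push(c);
        } else {
            // Any other character ends the current alphanumeric run.
            if !current.is_empty() {
                tokens.push(std::mem::take(&mut current));
            }
            // Punctuation is kept as its own token; whitespace is dropped.
            if !c.is_whitespace() {
                tokens.push(c.to_string());
            }
        }
    }
    if !current.is_empty() {
        tokens.push(current);
    }
    tokens
}
```

With this rule, "c++" becomes ["c", "+", "+"], so it no longer collapses to the same tokens as "c".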

Cache webpage parsing functions

Add Safe search

Search term becomes [object Object] when choosing a suggestion from the list of search suggestions

Bug?: Guillemets are not recognised as quotation marks

I’m guessing you use "…" to enable the search for fixed terms, which is great.

There’s a funny side effect when you use non-English style quotation marks, where they are recognised as part of the search term(s). (E.g. „…“, «…», …)

Still not sure whether a bug or a feature. :-)
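
If this were treated as a bug, one possible approach is to normalize typographic quotation marks before parsing the query. The sketch below covers only a handful of marks chosen for illustration, and `normalize_quotes` is a hypothetical name.

```rust
/// Replaces a few common typographic double quotation marks with ASCII `"`.
fn normalize_quotes(query: &str) -> String {
    query
        .chars()
        .map(|c| match c {
            '„' | '“' | '”' | '‟' | '«' | '»' => '"',
            other => other,
        })
        .collect()
}
```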
