
stract's Introduction




Stract is an open source web search engine hosted at stract.com targeted towards tinkerers and developers.




💡 Features

  • Keyword search that respects your search query.
  • Fully independent search index with our own crawler.
  • Advanced query syntax (site:, intitle: etc.).
  • DDG-style !bang syntax
  • Wikipedia and Stack Overflow sidebar
  • De-rank websites with third-party trackers
  • Use optics to almost endlessly customize your search results.
    • Limit your searches to blogs, indieweb, educational content etc.
    • Customize how signals are combined during search for the final search result
  • Prioritize links (centrality) from the sites you trust.
  • Explore the web and find sites similar to the ones you like.
  • And much more!

👩‍💻 Setup

We recommend that everyone use the hosted version at stract.com, but you can also follow the steps outlined in CONTRIBUTING.md to set up the engine locally.

💼 License

Stract is offered under the terms defined under the LICENSE.md file.

📬 Contact

You can contact us at [email protected] or come hang out in our Discord or Matrix server.

🏆 Thank you!

We truly stand on the shoulders of giants, and this project would not have been even remotely feasible without them. An especially huge thank you to:

  • The authors and contributors of Tantivy for providing the inverted index library on which Stract is built.
  • The commoncrawl organization for crawling the web and making the dataset readily available. Even though we have our own crawler now, commoncrawl has been a huge help in the early stages of development.

💰 Funding

This project is currently funded through NGI0 Entrust, a fund established by NLnet with financial support from the European Commission's Next Generation Internet program. Learn more at the NLnet project page.


stract's People

Contributors

a0m0rajab, andypiper, crispypin, dependabot[bot], jmillerv, lamemakes, mikkeldenker, oeb25, peulicke, sekoiatree


stract's Issues

Create a `configure` subcommand that sets up the dev environment

The subcommand should be hidden behind something like a dev feature. It should create a small index, entity index, autosuggest file, etc. We need to host a truncated Wikipedia dump and a single WARC file on our servers for the configure command to download.

The just local command can probably be removed after this has been implemented.

I hope you won't have to resort to the CPU cost of serving ads

"Well, currently we don't. We are bootstrapped and trying to keep costs low. In the future we will have clearly labelled, contextual ads based on your current search query and a subscription option without ads. Just to re-iterate; we will only use your current search to match ads and will never track you across searches."
I hope you won't have to resort to the CPU cost of serving ads.

Rewrite the query parser in Lalrpop and add support for advanced search queries.

The user should be able to search for e.g. "test site:stackoverflow.com" and only websites containing "test" from stackoverflow should be returned.

To support this, we first need to re-write the query parser to Lalrpop as this will make it easier to parse the more advanced query syntax. It will also allow us to use the same parser library when we implement support for goggles.

Ideally we would support most of the operators mentioned here.
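To make the intended behaviour concrete, here is a minimal sketch of pulling a `site:` operator out of a query. It is hand-written rather than generated with Lalrpop, and the names `ParsedQuery` and `parse_query` are hypothetical, not taken from the Stract codebase.

```rust
/// A parsed query: free-text terms plus an optional `site:` restriction.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq, Default)]
pub struct ParsedQuery {
    pub terms: Vec<String>,
    pub site: Option<String>,
}

/// Splits the input on whitespace and extracts a `site:` operator.
/// If `site:` appears more than once, the last occurrence wins.
pub fn parse_query(input: &str) -> ParsedQuery {
    let mut parsed = ParsedQuery::default();
    for token in input.split_whitespace() {
        match token.strip_prefix("site:") {
            // A non-empty value after `site:` restricts results to that site.
            Some(site) if !site.is_empty() => parsed.site = Some(site.to_string()),
            // Everything else is treated as a plain search term.
            _ => parsed.terms.push(token.to_string()),
        }
    }
    parsed
}
```

For "test site:stackoverflow.com" this yields the single term "test" and the site restriction "stackoverflow.com". A grammar-based parser would additionally handle quoting and the other operators.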

Old value of safe-search used when going back to "Search"

If one goes to the settings after performing a search, toggles safe-search, and then goes back to /search, then ?ss= is still set to the old value. However, any new search correctly uses the new value of safe-search.

This is in line with what we expect when storing the query in session storage, but perhaps some parts of url.search should not be synced.

What are your thoughts on this?

Topic specific centrality during ranking

We should calculate a centrality score for a fixed number of topics.
I think this can be done by taking a set of websites that have been classified as belonging to a specific topic and considering only the outbound links from those sites. This is the same principle as how personalized centrality will be implemented.

We can use the following datasets for site classification:
https://www.kaggle.com/datasets/shawon10/url-classification-dataset-dmoz
https://www.kaggle.com/datasets/hetulmehta/website-classification
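As a rough illustration of "only consider outbound links from those sites", the sketch below counts how many distinct topic hosts link to each target. This plain in-link count is a crude stand-in for a real centrality measure, and the function name and data shapes are assumptions made for the example.

```rust
use std::collections::{HashMap, HashSet};

/// For every link target, counts how many distinct topic hosts link to it.
/// Links whose source is not a topic host are ignored, as are self-links.
pub fn topic_inlink_counts<'a>(
    links: &[(&'a str, &'a str)],
    topic_hosts: &HashSet<&str>,
) -> HashMap<&'a str, usize> {
    let mut seen: HashSet<(&str, &str)> = HashSet::new();
    let mut counts = HashMap::new();
    for &(from, to) in links {
        // `seen.insert` returns false for a repeated (from, to) pair,
        // so each topic host contributes at most once per target.
        if topic_hosts.contains(from) && from != to && seen.insert((from, to)) {
            *counts.entry(to).or_insert(0) += 1;
        }
    }
    counts
}
```

Running this once per topic, each with its own set of classified hosts, would give one score table per topic.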

I don't know how to add your search engine to Firefox

I tried it, I think it's great and I like the unique results it's producing, but the chance I will continue to use it if it's not one of the search engines registered in my browser is exactly zero. I would love it if you made that possible or explained how we can do it, so that we can actually use this cool search engine instead of just seeing it once and forgetting it forever.

Support for `similar:example.com` queries

It would be nice to be able to restrict the search to sites that are similar to one or more given webpages. It can probably be implemented by doing a url search in the index, taking the top N results and performing a MoreLikeThisQuery with Occur::Should between the N queries and an Occur::Must at the end.
This will require the parser to have access to the index it builds the query for, in order to perform the initial url search.

features that would be great to have

there are some features that I miss in trystract.com

  1. definitions (for a query like 'define enthusiasm')
  2. calculator widget
  3. weather widget
  4. instant answers

[Suggestion] Docker-compose

It would be nice to release Stract as a Docker image and let people easily host it with Docker Compose.

  1. Create docker compose file
  2. Host built images for dev and release channels

Suspicious char/byte offset mixing

These are some places where byte offsets might be incorrectly used in place of char offsets or vice versa.


    let word_len = word.chars().count();
    if word_len > 1 {
        for i in 0..word_len {
            let delete = word
                .char_indices()
                .filter(|(j, _)| *j != i)

In this one the chars are counted, but char_indices produces byte offsets.
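One possible fix, sketched under the assumption that the loop is meant to produce every variant of the word with a single character removed (the quoted snippet is truncated, so what happens to `delete` afterwards is unknown). The function name `single_deletions` is hypothetical.

```rust
/// Returns every string obtained by deleting exactly one character from `word`.
pub fn single_deletions(word: &str) -> Vec<String> {
    let word_len = word.chars().count();
    let mut out = Vec::new();
    if word_len > 1 {
        for i in 0..word_len {
            // `chars().enumerate()` yields character positions, which is
            // the same unit as `word_len`; `char_indices()` yields bytes.
            let delete: String = word
                .chars()
                .enumerate()
                .filter(|(j, _)| *j != i)
                .map(|(_, c)| c)
                .collect();
            out.push(delete);
        }
    }
    out
}
```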


    if let Some(first_bullet) = value.find('•') {
        value = value.chars().skip(first_bullet + 1).collect();
    }

find returns a byte offset, but it is used as a char offset.
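A sketch of one way to keep the units consistent here: stay in byte offsets and slice the string directly. The helper name is hypothetical, and the assumed intent is to drop everything up to and including the first bullet.

```rust
/// Returns the part of `value` after the first '•', or all of `value`
/// if it contains no bullet.
pub fn strip_through_first_bullet(value: &str) -> &str {
    match value.find('•') {
        // `find` returns a byte offset, so slice by bytes and skip the
        // bullet's full UTF-8 length instead of assuming one byte.
        Some(first_bullet) => &value[first_bullet + '•'.len_utf8()..],
        None => value,
    }
}
```

Slicing is safe here because both ends of the range fall on char boundaries: `find` returns the start of the match, and `len_utf8` advances past exactly one char.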


    pretty_url = pretty_url.chars().take(pretty_url.len() - 1).collect();

.len() returns a length in bytes, but is used as a char count.
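Assuming the intent is to drop the final character (for example a trailing slash), a char-safe version could look like the sketch below. The helper name is hypothetical.

```rust
/// Returns `s` without its last character; an empty string stays empty.
pub fn drop_last_char(s: &str) -> String {
    let mut chars = s.chars();
    // Discards the final char, if any, without touching byte offsets.
    chars.next_back();
    chars.as_str().to_string()
}
```

With an input such as "example.com/é/", the byte length minus one equals the char count, so the original `take` would keep every character and leave the trailing slash in place.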


    let idx = suggestion
        .chars()
        .zip(query.chars())
        .position(|(suggestion_char, query_char)| suggestion_char != query_char)
        .unwrap_or(query.len());
    let mut new_suggestion: String = suggestion.chars().take(idx).collect();

.position() returns a char offset, but in .unwrap_or(query.len()) a byte offset is used. The resulting .take(idx) might use the byte offset given by query.len().
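A minimal sketch of keeping both branches in char units, assuming the goal is to take the part of the suggestion that matches the query. The function name `shared_prefix` is hypothetical.

```rust
/// Returns the leading part of `suggestion` that matches `query`,
/// compared character by character.
pub fn shared_prefix(suggestion: &str, query: &str) -> String {
    let idx = suggestion
        .chars()
        .zip(query.chars())
        .position(|(suggestion_char, query_char)| suggestion_char != query_char)
        // The fallback is a char count, the same unit `position` returns.
        .unwrap_or_else(|| query.chars().count());
    suggestion.chars().take(idx).collect()
}
```

For the suggestion "café au lait" and the query "café", the byte-based fallback would be 5 and pull in the following space, while the char-based fallback is 4.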

Display issue with non-Unicode encoded results

I noticed that the search preview for pages that are not Unicode-encoded (at least one that is in windows-1252) doesn't display correctly:

[Screenshot, 2024-02-09 10:43:50]

Might there be a way to detect the page encoding and get the preview right?

Parse DMOZ site descriptions

Some sites don't provide a description meta tag in the HTML. For these sites, we should take the description from DMOZ unless they have specified the NOODP tag.

Results

Why, when I search for "Stract", none of the results include trystract, or even have stract in the title?

Use page centrality during ranking

We already calculate the centrality for each specific webpage. This should be used during ranking and probably be prioritised higher than the host centrality.

Adding this should also give an extra boost to homepages, since they tend to have a higher centrality score, and should mitigate the issue we have with multiple results from the same site.

Document setup steps

  • install rust
  • install clang
  • (optional) install cargo-watch
  • (optional) install just
  • (optional) install git-lfs and download data

Getting started

I've compiled stract via cargo build --release. What do I do next?

How much disk space is required?

I can run the indexer / crawler / scraper via stract indexer, stract crawler and stract autosuggest-scrape.

  • do you need to run the crawler first?

I can run the search servers via stract search-server and stract entity-search-server.

I can run the API server via stract api.

Empty optic selector for repeated search

This issue tracks the problem identified in #97.

If a second search is performed without an optic, the optic selector becomes empty, whereas it should still have "No Optic" selected.

[QUESTION] Can Hearchco scrape you?

Hello, this project looks interesting. As a fellow open source developer, I wanted to ask for permission to add your main instance (stract.com) to the Hearchco meta-search engine. Is it fine by you for us to scrape it? If not, I won't add it, out of respect for your contributions to the open source community.

Explore page accessibility

  • The input should have type=url or similar for correct keyboard on mobile
  • A loading indicator should be shown on changes

Don't remove punctuation during tokenization

Currently "example.com" gets tokenized to ["example", "com"]. This should be tokenized to ["example", ".", "com"] instead, as this will give far better search results for navigational searches and searches like "c++". The tokenizer needs to be changed in both the indexing and the search stage.

Text containing whitespace, such as "example query", should still be tokenized as ["example", "query"].
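The desired behaviour can be sketched with a small standalone tokenizer. This is an illustration only: it does not use Tantivy's tokenizer API, and treating every non-alphanumeric, non-whitespace character as its own token is an assumption.

```rust
/// Splits text into alphanumeric words and single-character punctuation
/// tokens. Whitespace separates tokens and is itself dropped.
pub fn tokenize(text: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut word = String::new();
    for c in text.chars() {
        if c.is_alphanumeric() {
            word.push(c);
            continue;
        }
        // Any non-alphanumeric char ends the current word.
        if !word.is_empty() {
            tokens.push(std::mem::take(&mut word));
        }
        // Punctuation becomes its own token; whitespace is skipped.
        if !c.is_whitespace() {
            tokens.push(c.to_string());
        }
    }
    if !word.is_empty() {
        tokens.push(word);
    }
    tokens
}
```

With this scheme "c++" becomes ["c", "+", "+"], so a query for it no longer collapses to the single token "c".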

Bug?: Guillemets are not recognised as quotation marks

I’m guessing you use "…" to enable the search for fixed terms, which is great.

There’s a funny side effect when you use non-English style quotation marks, where they are recognised as part of the search term(s). (E.g. „…“, «…», …)

Still not sure whether a bug or a feature. :-)
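If this were treated as a bug, one option would be to normalize other quotation styles to the ASCII double quote before parsing. The sketch below assumes that approach; the set of characters it maps and the function name are illustrative choices, not something the project has specified.

```rust
/// Replaces common non-ASCII double quotation marks with '"'.
/// The character set here is an assumption, not an exhaustive list.
pub fn normalize_quotes(query: &str) -> String {
    query
        .chars()
        .map(|c| match c {
            '„' | '“' | '”' | '«' | '»' => '"',
            other => other,
        })
        .collect()
}
```

After normalization, «fixed term» and „fixed term“ both reach the parser as "fixed term".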

Multiple results from same site

We should strive to have a diverse set of results in the search from multiple sites. The ranking for subsequent results from the same site should therefore be penalised.

I think this can be done with a custom collector implementation, by pre-hashing the site name for every website and storing it in a fast field.
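The penalty itself can be illustrated independently of the collector. The sketch below re-ranks an already scored list as a post-processing step, which is not the collector-based design proposed above; the `Hit` type, the `diversify` name and the multiplicative penalty are all assumptions made for the example.

```rust
use std::collections::HashMap;

/// A scored search result (hypothetical type for this sketch).
#[derive(Debug, Clone, PartialEq)]
pub struct Hit {
    pub site: String,
    pub score: f64,
}

/// Scales the n-th repeat of a site by `penalty^n`, then re-sorts.
/// With `penalty` below 1.0, later results from a site are pushed down.
pub fn diversify(mut hits: Vec<Hit>, penalty: f64) -> Vec<Hit> {
    // Start from the original ranking, best score first.
    hits.sort_by(|a, b| b.score.total_cmp(&a.score));
    let mut seen: HashMap<String, i32> = HashMap::new();
    for hit in hits.iter_mut() {
        let count = seen.entry(hit.site.clone()).or_insert(0);
        // The first result from a site keeps its score (penalty^0 = 1).
        hit.score *= penalty.powi(*count);
        *count += 1;
    }
    hits.sort_by(|a, b| b.score.total_cmp(&a.score));
    hits
}
```

In a collector, the same counter could be keyed on the pre-hashed site value read from the fast field instead of on the site string.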

Entity sidebar

Show entities from Wikipedia in the sidebar during search.

I wonder how do you manage to store the data

"By default, we do store some usage statistics in order to improve the search results. Specifically the following information is stored for each search:"
I wonder how do you manage to store the data (using spinning rust for example)
