ahmia-index's Introduction

Ahmia index

The Ahmia search engine uses Elasticsearch indexes to save website text.

Installation

  • Install Elasticsearch 8
  • Install Python3 and pip
  • Install the Python packages required, preferably in a virtual environment, with:
pip install -r requirements.txt

Configuration

example.env contains some default values that should work out of the box. Copy this to .env to create your own instance of environment settings:

cp example.env .env

Review the .env file to ensure that it fits your needs. Make any modifications needed there.
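To make the KEY=VALUE format concrete, here is a minimal stdlib-only sketch of how such a file could be parsed. It is an illustration, not the project's own loader, and the key names `ES_URL` and `ES_USER` are hypothetical; the real keys are whatever example.env defines.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Drop surrounding whitespace and optional quotes around the value.
        settings[key.strip()] = value.strip().strip("'\"")
    return settings


# Hypothetical content: these key names are assumptions for illustration.
sample = """
# Elasticsearch connection
ES_URL=https://localhost:9200/
ES_USER="elastic"
"""

settings = parse_env(sample)
print(settings["ES_URL"])
```

In practice you would pass the contents of your own .env file instead of the inline sample.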

Elasticsearch

The default configuration is enough to run the index in dev mode. Here is a suggested configuration for a more secure setup:

/etc/security/limits.conf

elasticsearch - nofile unlimited
elasticsearch soft memlock unlimited
elasticsearch hard memlock unlimited

/etc/default/elasticsearch

MAX_OPEN_FILES=unlimited
MAX_LOCKED_MEMORY=unlimited

/etc/elasticsearch/elasticsearch.yml

bootstrap.memory_lock: true

/etc/elasticsearch/jvm.options

-Xms15g
-Xmx15g

Start the service

sudo systemctl start elasticsearch

Give users permissions to use the HTTPS cert

Any user on the system can read the certificate file, which is generally acceptable for a public certificate authority (CA) certificate as it does not contain sensitive private keys.

sudo mkdir -p /usr/local/share/ca-certificates/
sudo cp /etc/elasticsearch/certs/http_ca.crt /usr/local/share/ca-certificates/
sudo chmod 644 /usr/local/share/ca-certificates/http_ca.crt

Init mappings

Set the mappings when running for the first time:

bash setup_index.sh

Alternatively, you can set up the indices manually, along these lines:

curl -i --cacert /usr/local/share/ca-certificates/http_ca.crt -u elastic -XPUT \
'https://localhost:9200/tor-2024-01/' \
-H 'Content-Type: application/json' -d "@./mappings_tor.json"
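The index name in the curl example follows a `tor-YYYY-MM` pattern. As a small sketch, assuming that pattern holds for every month, the name and URL for a given date could be derived like this. The helper names `monthly_index_name` and `index_url` are hypothetical and not part of the repository.

```python
from datetime import date


def monthly_index_name(prefix, day):
    """Return a name such as 'tor-2024-01' for the month containing `day`."""
    return f"{prefix}-{day.year:04d}-{day.month:02d}"


def index_url(base_url, prefix, day):
    """Build the URL that a PUT request would target to create the index."""
    return f"{base_url.rstrip('/')}/{monthly_index_name(prefix, day)}/"


print(index_url("https://localhost:9200", "tor", date(2024, 1, 15)))
```

The resulting URL can be substituted into the curl command above; the sketch itself makes no network request.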

Keep the latest-tor alias pointed at the latest monthly indices

Run this the first time you deploy and then once per month:

python point_to_indexes.py
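For orientation, the sketch below builds the kind of request body that the Elasticsearch `_aliases` endpoint accepts for adding and removing indices from an alias. It is an assumption about what an alias update looks like in general, not a description of what point_to_indexes.py actually does; the function name `alias_actions` and the example index names are hypothetical.

```python
import json
from datetime import date


def alias_actions(alias, prefix, day, remove_index=None):
    """Build an _aliases body that adds this month's index to `alias`.

    If `remove_index` is given, the same request also detaches that index.
    """
    index = f"{prefix}-{day.year:04d}-{day.month:02d}"
    actions = [{"add": {"index": index, "alias": alias}}]
    if remove_index:
        actions.append({"remove": {"index": remove_index, "alias": alias}})
    return {"actions": actions}


body = alias_actions("latest-tor", "tor", date(2024, 2, 1), remove_index="tor-2023-12")
print(json.dumps(body, indent=2))
```

A body like this would be sent as a POST to `/_aliases`; adding and removing in one request lets Elasticsearch apply both changes together.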

Filter some abuse sites

bash call_filtering.sh

Crontab

# Execute child abuse text filtering over the index every hour
30 * * * * cd /home/juha/ahmia-index && bash wrap_filtering.sh > ./crontab_filter.log 2>&1
# First of Each Month:
10 04 01 * * cd /home/juha/ahmia-index && python point_to_indexes.py --add > ./add_alias.log 2>&1
# On 6th of Each Month
10 04 06 * * cd /home/juha/ahmia-index && python point_to_indexes.py --rm > ./remove_alias.log 2>&1

Keep Elasticsearch running: autorestart

sudo apt install restartd

# Add the following line to /etc/restartd.conf
elasticsearch "elasticsearch" "echo 'Elasticsearch is not running!' >>/tmp/restartd_restart.out && service elasticsearch restart >> /tmp/restartd_restart.out" "echo 'Elasticsearch is running!' >/tmp/restartd.out"

sudo service restartd restart

ahmia-index's People

Contributors

chamalis, iriahi, juhanurmi, mdhash, mikerah

ahmia-index's Issues

Change the way index aliases are updated

Currently, the cron job that handles the updates of latest-tor and latest-i2p:

10 04 16 * * cd /usr/local/home/juha/ahmia-index && python point_to_indexes.py > ./change_alias.log 2>&1

is executed on the 16th of every month.

That means the data from the first 15 days of each month is unavailable during that period. One easy way to fix this is to add the new index on the 1st of the month and remove the old one (2 months older) on the 16th.

That increases the maximum amount of indexed data kept online to 2.5 months, but it's worth the effort.
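The proposal can be sketched as a small function that lists which monthly indices would sit behind the alias on a given day. This is one reading of the scheme described above (current and previous month always online, the index two months back online until the 16th); the function names and the `tor-YYYY-MM` naming are assumptions for illustration.

```python
from datetime import date


def shift_month(year, month, delta):
    """Return (year, month) moved by `delta` months, handling year rollover."""
    total = year * 12 + (month - 1) + delta
    return total // 12, total % 12 + 1


def online_indices(prefix, day, removal_day=16):
    """List the monthly indices behind the alias on `day` under the proposal."""
    offsets = [0, -1]
    if day.day < removal_day:
        # The index two months back is only removed on the removal day.
        offsets.append(-2)
    names = []
    for offset in offsets:
        year, month = shift_month(day.year, day.month, offset)
        names.append(f"{prefix}-{year:04d}-{month:02d}")
    return names


print(online_indices("tor", date(2024, 1, 10)))
print(online_indices("tor", date(2024, 1, 20)))
```

Just before the removal on the 16th, two full months plus half of the current month are online, which matches the 2.5-month figure.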

Avoid potential dependency conflicts between ahmia-index and urllib3

Hi, as shown in the following full dependency graph of ahmia-index, ahmia-index requires urllib3 (the latest version), while the installed version of requests (2.22.0) requires urllib3>=1.21.1,<1.26.

According to pip's “first found wins” installation strategy, urllib3 1.25.3 is the version that actually gets installed.

Although the first-found version, urllib3 1.25.3, happens to satisfy the later dependency constraint (urllib3>=1.21.1,<1.26), it will lead to a build failure once the developers release a newer version of urllib3.

Dependency tree:

ahmia-index(version range:)
| +-beautifulsoup4(version range:==4.6.0)
| +-certifi(version range:==2017.4.17)
| +-chardet(version range:==3.0.4)
| +-idna(version range:==2.5)
| +-python-decouple(version range:==3.1)
| +-requests(version range:>=2.20.0)
| | +-chardet(version range:>=3.0.2,<3.1.0)
| | +-idna(version range:>=2.5,<2.9)
| | +-urllib3(version range:>=1.21.1,<1.26)
| | +-certifi(version range:>=2017.4.17)
| +-urllib3(version range:>=1.24.2)

Thanks for your attention.
Best,
Neolith
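The constraint discussed in this issue can be checked with a few lines of Python. The sketch below is a simplified comparison for plain dotted numeric versions only; it is not pip's resolver and does not implement the full version specifier rules, and the helper names are hypothetical.

```python
def parse_version(text):
    """Turn '1.25.3' into (1, 25, 3). Only plain numeric versions are handled."""
    return tuple(int(part) for part in text.strip().split("."))


def satisfies(version, spec):
    """Check `version` against a comma-separated spec such as '>=1.21.1,<1.26'."""
    current = parse_version(version)
    for clause in spec.split(","):
        clause = clause.strip()
        # Two-character operators are tried first so '>=' is not read as '>'.
        for op in (">=", "<=", "==", ">", "<"):
            if clause.startswith(op):
                bound = parse_version(clause[len(op):])
                results = {
                    ">=": current >= bound,
                    "<=": current <= bound,
                    "==": current == bound,
                    ">": current > bound,
                    "<": current < bound,
                }
                if not results[op]:
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


requests_range = ">=1.21.1,<1.26"
print(satisfies("1.25.3", requests_range))
print(satisfies("1.26.0", requests_range))
```

A urllib3 release at or above 1.26 still satisfies the project's own `>=1.24.2` requirement but falls outside the range that requests 2.22.0 accepts, which is the conflict the issue describes.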

Elasticsearch v7.15.2 seems unsupported

At least the call_filtering.sh script returns only HTTP 400 errors for the actual filtering operations.

The README's statement to use 6.5+ is misleading if 7+ is not supported, unless I misunderstood it.
