semantic-sh


semantic-sh is a SimHash implementation to detect and group similar texts by leveraging the power of word vectors and transformer-based language models such as BERT.

Documentation

Requirements

  • fasttext
  • transformers
  • pytorch
  • numpy
  • flask

Installing via pip

$ pip install semantic-sh

Usage

from semantic_sh import SemanticSimHash

Use with BERT:

sh = SemanticSimHash(model_type='bert-base-multilingual-cased', dim=768)

Use with fasttext:

sh = SemanticSimHash(model_type='fasttext', dim=300, model_path='/path/to/cc.en.300.bin')

Use with GloVe:

sh = SemanticSimHash(model_type='glove', dim=300, model_path='/path/to/glove.6B.300d.txt')

Use with word2vec:

sh = SemanticSimHash(model_type='word2vec', dim=300, model_path='/path/to/en.w2v.txt')

Additional parameters

Customize the threshold (default: 0) and hash length (default: 256-bit), and add a stop words list.

sh = SemanticSimHash(model_type='fasttext', key_size=128, dim=300, model_path='/path/to/fasttext_vectors.bin', thresh=0.8, stop_words=['the', 'i', 'you', 'he', 'she', 'it', 'we', 'they'])

Note: BERT-based models do not require a stop words list.

Hash your text

sh.get_hash(['<your_text_0>', '<your_text_1>'])

Add document

Add your documents to the proper groups

sh.add_document(['<your_text_0>', '<your_text_1>'])

Find similar

Get all documents in the same group as the given text

sh.find_similar('<your_text>')

Get Hamming Distance between 2 texts

sh.get_distance('<first_text>', '<second_text>')

Go through all document groups

Get all similar document groups that contain more than one document

for docs in sh.get_similar_groups():
    print(docs)

Save data

Save added documents, hash function, model and parameters

sh.save('model.dat')

Load from saved file

Load all parameters, documents, hash function and model from a saved file

sh = SemanticSimHash.load('model.dat')
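
Putting the pieces together, here is a minimal end-to-end sketch using the calls above (the fasttext path is a placeholder; substitute your own vectors):

from semantic_sh import SemanticSimHash

# placeholder path: point this at your own fasttext vectors
sh = SemanticSimHash(model_type='fasttext', dim=300, model_path='/path/to/cc.en.300.bin')

# index a few documents, then query for similar ones
sh.add_document(['First example document', 'Second example document'])
print(sh.find_similar('First example document'))

# persist everything and restore it later
sh.save('model.dat')
sh = SemanticSimHash.load('model.dat')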

API Server

Easily deploy a simple text similarity engine on the web.

Installation

$ git clone https://github.com/KeremZaman/semantic-sh.git

Standalone Usage

usage: server.py [-h] [--host HOST] [--port PORT] [--model-type MODEL_TYPE]
                 [--model-path MODEL_PATH] [--key-size KEY_SIZE] [--dim DIM]
                 [--stop-words [STOP_WORDS [STOP_WORDS ...]]]
                 [--load-from LOAD_FROM]

optional arguments:
  -h, --help            show this help message and exit

app:
  --host HOST
  --port PORT

model:
  --model-type MODEL_TYPE
                        Type of model to run: fasttext or any pretrained model
                        name from huggingface/transformers
  --model-path MODEL_PATH
                        Path to vector files of fasttext models
  --key-size KEY_SIZE   Hash length in bits
  --dim DIM             Dimension of text representations according to chosen
                        model type
  --stop-words [STOP_WORDS [STOP_WORDS ...]]
                        List of stop words to exclude

loader:
  --load-from LOAD_FROM
                        Load previously saved state

Using with WSGI Container

from gevent.pywsgi import WSGIServer
from server import init_app

app = init_app(params)  # same params as used to initialize the SemanticSimHash object

http_server = WSGIServer(('', 5000), app)
http_server.serve_forever()

NOTE: The sample code uses gevent, but you can use any WSGI container that works with a Flask app object instead.

API Reference

POST /api/hash

Return the hashes of the given documents

Request Body

{
    "documents": [
        "Here is the first document",
        "and second document"
    ]
}       

Response Body

{
    "hashes": [
        "0x7f636944d8c8",
        "0x5d134944428a4"
    ]
}    
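
As an illustration, a minimal client sketch with requests (assuming the server runs locally on Flask's default port 5000):

import requests

# POST the documents to the hash endpoint and read back their hashes
resp = requests.post('http://localhost:5000/api/hash',
                     json={'documents': ['Here is the first document',
                                         'and second document']})
print(resp.json()['hashes'])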

POST /api/add

Add the given documents and return the hash and custom ID of each document

Request Body

{
    "documents": [
        "Here is the first document",
        "and second document"
    ]
}
        

Response Body

{
    "documents": [
        {
            "id": 1,
            "hash": "0x5d134944428a4"
        },
        {
            "id": 2,
            "hash": "0x7f636944d8c8"
        }
    ]
}

POST /api/find-similar

Return documents similar to the given text

Request Body

{
    "text": "Here is the text"
}       

Response Body

{
    "similar_texts": [
        "Here is the text",
        "First text here",
        "Here is text"
    ]
}    

POST /api/distance

Return the Hamming distance between the source and target texts

Request Body

{
    "src": "Here is the source text",
    "tgt": "Target text for measuring distance"
}       

Response Body

{
    "distance": 21
}    

GET /api/similarity-groups

Return buckets containing more than one document ID


GET /api/text/<int:id>

Return the document with the given ID
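
A client sketch for these two GET endpoints (assuming a local server; the document ID 1 is a placeholder, and a JSON response is assumed for the groups endpoint):

import requests

# buckets that contain more than one document ID
groups = requests.get('http://localhost:5000/api/similarity-groups').json()
print(groups)

# fetch the document stored under ID 1 (placeholder ID)
doc = requests.get('http://localhost:5000/api/text/1')
print(doc.text)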

With Docker

Run the API server on port 4000

docker run -ti -p 4000:4000 -v `pwd`/data:/opt/data semantic-sh:latest --port=4000 --model-type=bert-base-multilingual-cased --model-path=/opt/data

With docker-compose

Run the API server on port 4000

docker-compose up -d semantic-sh

Some Implementation Details

This is a simplified implementation of SimHash: random hyperplane vectors are generated, and each bit of the hash is set to 1 or 0 according to the sign of the dot product between one of these vectors and the representation of the text.
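
An illustrative sketch of this scheme (not the library's actual code; it assumes the text has already been embedded as a numpy vector):

import numpy as np

def simhash(embedding, key_size=256):
    # draw one random hyperplane per hash bit; a fixed seed keeps the
    # hyperplanes identical across calls so hashes stay comparable
    rng = np.random.default_rng(0)
    proj = rng.normal(0, 1, (key_size, embedding.shape[0]))

    # each bit is 1 when the dot product with its hyperplane is positive
    bits = proj @ embedding > 0

    # pack the bits into a single integer hash
    h = 0
    for b in bits:
        h = (h << 1) | int(b)
    return h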

License

MIT


semantic-sh's Issues

add/return custom attributes

Hi,

Hope you are all well!

It would be useful to add custom attributes like the doc_id when indexing or retrieving similar documents.

Thanks in advance for any insights or inputs on that :-)

Cheers,
X

make a rest service with flask

Hi,

Hope you are all well!

Do you think it is possible to make a RESTful web service of semantic-sh, or to dockerize it?

Cheers,
X

save/load custom models

Hi,

Hope you are all well!

Is it possible to save/dump models and to load them again afterwards, avoiding re-indexing all documents? I have 230k of them.

Cheers,
X

semantic-sh crash after some document addition

Hi,

Hope you are all well!

I tried to push 300k abstracts into semantic-sh, but the server crashes at some point without any debug information.

Is there a way to enable some debug information?
Do you want me to prepare a dump of my dataset for you?

Thanks in advance for any replies or insights about that.

Cheers,
X

get hash endpoint error

Hi,

Hope you are all well!

I tried to get the hash of an abstract and it triggered the following error:

semantic-sh_1                 |  * Tip: There are .env or .flaskenv files present. Do "pip install python-dotenv" to use them.
semantic-sh_1                 |  * Serving Flask app "server" (lazy loading)
semantic-sh_1                 |  * Environment: production
semantic-sh_1                 |    WARNING: This is a development server. Do not use it in a production deployment.
semantic-sh_1                 |    Use a production WSGI server instead.
semantic-sh_1                 |  * Debug mode: off
semantic-sh_1                 | /opt/service/semantic_sh/semantic_sh.py:51: FutureWarning: arrays to stack must be passed as a "sequence" type such as list or tuple. Support for non-sequence iterables such as generators is deprecated as of NumPy 1.16 and will raise an error in the future.
semantic-sh_1                 |   return np.vstack((np.random.normal(0, 1, dim) for i in range(0, key_size)))
semantic-sh_1                 |  * Running on http://0.0.0.0:5001/ (Press CTRL+C to quit)
semantic-sh_1                 | Truncation was not explicitely activated but `max_length` is provided a specific value, please use `truncation=True` to explicitely truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
semantic-sh_1                 | [2020-08-09 06:48:59,687] ERROR in app: Exception on /api/hash [GET]
semantic-sh_1                 | Traceback (most recent call last):
semantic-sh_1                 |   File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 2447, in wsgi_app
semantic-sh_1                 |     response = self.full_dispatch_request()
semantic-sh_1                 |   File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1952, in full_dispatch_request
semantic-sh_1                 |     rv = self.handle_user_exception(e)
semantic-sh_1                 |   File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1821, in handle_user_exception
semantic-sh_1                 |     reraise(exc_type, exc_value, tb)
semantic-sh_1                 |   File "/usr/local/lib/python3.6/dist-packages/flask/_compat.py", line 39, in reraise
semantic-sh_1                 |     raise value
semantic-sh_1                 |   File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1950, in full_dispatch_request
semantic-sh_1                 |     rv = self.dispatch_request()
semantic-sh_1                 |   File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1936, in dispatch_request
semantic-sh_1                 |     return self.view_functions[rule.endpoint](**req.view_args)
semantic-sh_1                 |   File "./server.py", line 20, in generate_hash
semantic-sh_1                 |     return hex(sh.get_hash(txt))
semantic-sh_1                 |   File "/opt/service/semantic_sh/semantic_sh.py", line 88, in get_hash
semantic-sh_1                 |     y = np.matmul(self._proj, enc)
semantic-sh_1                 | ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 768 is different from 300)
semantic-sh_1                 | 51.210.37.251 - - [09/Aug/2020 06:48:59] "GET /api/hash?text=Recent+work+has+demonstrated+substantial+gains+on+many+NLP+tasks+and+benchmarks+by+pre-training+on+a+large+corpus+of+text+followed+by+fine-tuning+on+a+specific+task.+While+typically+task-agnostic+in+architecture%2C+this+method+still+requires+task-specific+fine-tuning+datasets+of+thousands+or+tens+of+thousands+of+examples.+By+contrast%2C+humans+can+generally+perform+a+new+language+task+from+only+a+few+examples+or+from+simple+instructions+-+something+which+current+NLP+systems+still+largely+struggle+to+do.+Here+we+show+that+scaling+up+language+models+greatly+improves+task-agnostic%2C+few-shot+performance%2C+sometimes+even+reaching+competitiveness+with+prior+state-of-the-art+fine-tuning+approaches.+Specifically%2C+we+train+GPT-3%2C+an+autoregressive+language+model+with+175+billion+parameters%2C+10x+more+than+any+previous+non-sparse+language+model%2C+and+test+its+performance+in+the+few-shot+setting.+For+all+tasks%2C+GPT-3+is+applied+without+any+gradient+updates+or+fine-tuning%2C+with+tasks+and+few-shot+demonstrations+specified+purely+via+text+interaction+with+the+model.+GPT-3+achieves+strong+performance+on+many+NLP+datasets%2C+including+translation%2C+question-answering%2C+and+cloze+tasks%2C+as+well+as+several+tasks+that+require+on-the-fly+reasoning+or+domain+adaptation%2C+such+as+unscrambling+words%2C+using+a+novel+word+in+a+sentence%2C+or+performing+3-digit+arithmetic.+At+the+same+time%2C+we+also+identify+some+datasets+where+GPT-3%27s+few-shot+learning+still+struggles%2C+as+well+as+some+datasets+where+GPT-3+faces+methodological+issues+related+to+training+on+large+web+corpora.+Finally%2C+we+find+that+GPT-3+can+generate+samples+of+news+articles+which+human+evaluators+have+difficulty+distinguishing+from+articles+written+by+humans.+We+discuss+broader+societal+impacts+of+this+finding+and+of+GPT-3+in+general. HTTP/1.1" 500 -

Any idea how to sort this out? Is it related to the server configuration?

Cheers,
X
