Hi all, It seems like there has been a breaking change in HF that causes the sci-b

Update HuggingFace Sci-base example (declutr, CLOSED)

johngiorgi commented on May 27, 2024

Update HuggingFace Sci-base example

from declutr.

Comments (2)

FL33TW00D commented on May 27, 2024

@JohnGiorgi thanks very much for the quick response and resolution.

JohnGiorgi commented on May 27, 2024

Hi @FL33TW00D,

Hmm, that code you copied from the model card does look out of date. Good catch.

I believe this is what the model card should read; it is similar to what is in this repo's README.

import torch
from scipy.spatial.distance import cosine

from transformers import AutoModel, AutoTokenizer

# Load the model
tokenizer = AutoTokenizer.from_pretrained("johngiorgi/declutr-sci-base")
model = AutoModel.from_pretrained("johngiorgi/declutr-sci-base")

# Prepare some text to embed
text = [
    "Oncogenic KRAS mutations are common in cancer.",
    "Notably, c-Raf has recently been found essential for development of K-Ras-driven NSCLCs.",
]
inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt")

# Embed the text
with torch.no_grad():
    sequence_output = model(**inputs)[0]

# Mean pool the token-level embeddings to get sentence-level embeddings
embeddings = torch.sum(
    sequence_output * inputs["attention_mask"].unsqueeze(-1), dim=1
) / torch.clamp(torch.sum(inputs["attention_mask"], dim=1, keepdim=True), min=1e-9)

# Compute a semantic similarity via the cosine distance
semantic_sim = 1 - cosine(embeddings[0], embeddings[1])
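For anyone who wants to check the pooling logic without downloading the model, here is a minimal pure-Python sketch of the same two steps (masked mean pooling, then cosine similarity) on toy vectors. The helper names `mean_pool` and `cosine_similarity` and the toy numbers are illustrative assumptions, not part of DeCLUTR or transformers.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding.

    Hypothetical helper that mirrors the torch expression above for a
    single sequence given as a list of per-token vectors.
    """
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vector)]
            count += 1
    # Same guard as torch.clamp(..., min=1e-9): avoids dividing by zero
    # when every position is padding.
    denom = max(count, 1e-9)
    return [t / denom for t in total]


def cosine_similarity(a, b):
    """Cosine similarity, i.e. 1 minus scipy's cosine distance."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy data: the third token of the first sequence is padding (mask 0),
# so its large values must not affect the pooled vector.
tokens_a = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]
mask_a = [1, 1, 0]
tokens_b = [[4.0, 2.0]]
mask_b = [1]

pooled_a = mean_pool(tokens_a, mask_a)  # [2.0, 1.0]
pooled_b = mean_pool(tokens_b, mask_b)  # [4.0, 2.0]
similarity = cosine_similarity(pooled_a, pooled_b)
print(pooled_a, pooled_b, similarity)
```

The two pooled vectors point in the same direction, so the similarity comes out at (approximately) 1. If the padded token were averaged in, the first vector would be skewed, which is why the attention mask is applied before summing.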