johnsnowlabs / nlu

One line for thousands of state-of-the-art NLP models in hundreds of languages. The fastest and most accurate way to solve text problems.

License: Apache License 2.0

nlu natural-language-understanding sentiment-classifier text-classification transformers language-detection named-entity-recognition seq2seq t5 lemmatizer

nlu's Introduction

John Snow Labs: State-of-the-art NLP in Python

The John Snow Labs library provides a simple & unified Python API for delivering enterprise-grade natural language processing solutions:

  1. 15,000+ free NLP models in 250+ languages in one line of code. Production-grade, Scalable, trainable, and 100% open-source.
  2. Open-source libraries for Responsible AI (NLP Test), Explainable AI (NLP Display), and No-Code AI (NLP Lab).
  3. 1,000+ healthcare NLP models and 1,000+ legal & finance NLP models with a John Snow Labs license subscription.

Homepage: https://www.johnsnowlabs.com/

Docs & Demos: https://nlp.johnsnowlabs.com/

Features

Powered by John Snow Labs Enterprise-Grade Ecosystem:

  • 🚀 Spark-NLP: State of the art NLP at scale!
  • 🤖 NLU: 1 line of code to conquer NLP!
  • 🕶 Visual NLP: Empower your NLP with a set of eyes!
  • 💊 Healthcare NLP: Heal the world with NLP!
  • ⚖ Legal NLP: Bring justice with NLP!
  • 💲 Finance NLP: Understand Financial Markets with NLP!
  • 🎨 NLP-Display: Visualize and Explain NLP!
  • 📊 NLP-Test: Deliver Reliable, Safe and Effective Models!
  • 🔬 NLP-Lab: No-Code Tool to Annotate & Train new Models!

Installation

! pip install johnsnowlabs

from johnsnowlabs import nlp
nlp.load('emotion').predict('Wow that was easy!')

See the documentation for more details.

Usage

These are examples of getting things done with one line of code. See the General Concepts Documentation for building custom pipelines.

# Example of Named Entity Recognition
nlp.load('ner').predict("Dr. John Snow is a British physician born in 1813")

Returns :

entities    entities_class   entities_confidence
John Snow   PERSON           0.9746
British     NORP             0.9928
1813        DATE             0.5841

# Example of Question Answering 
nlp.load('answer_question').predict("What is the capital of France")

Returns :

text                            answer
What is the capital of France   Paris

# Example of Sentiment classification
nlp.load('sentiment').predict("Well this was easy!")

Returns :

text                  sentiment_class   sentiment_confidence
Well this was easy!   pos               0.999901

nlp.load('ner').viz('Bill goes to New York')

Returns:
(NER visualization)

For a full overview see the 1-liners Reference and the Workshop.

Use Licensed Products

To use John Snow Labs' paid products like Healthcare NLP, Visual NLP, Legal NLP, or Finance NLP, get a license key and then call nlp.install() to use it:

! pip install johnsnowlabs
# Install paid libraries via a browser login to connect to your account
from johnsnowlabs import nlp
nlp.install()
# Start a licensed session
nlp.start()
nlp.load('en.med_ner.oncology_wip').predict("Woman is on  chemotherapy, carboplatin 300 mg/m2.")

Licensed Usage

These are examples of getting things done with one line of code using licensed models. See the General Concepts Documentation for building custom pipelines.

# visualize entity resolution ICD-10-CM codes
nlp.load('en.resolve.icd10cm.augmented')\
    .viz('Patient with history of prior tobacco use, nausea, nose bleeding and chronic renal insufficiency.')

returns: (entity resolution visualization)

# Temporal Relationship Extraction & Visualization
nlp.load('relation.temporal_events')\
    .viz('The patient developed cancer after a mercury poisoning in 1999 ')

returns: (temporal relation visualization)

Helpful Resources

Take a look at the official Johnsnowlabs page: https://nlp.johnsnowlabs.com for user documentation and examples.

Resource | Description
General Concepts | General concepts in the Johnsnowlabs library
Overview of 1-liners | Most commonly used models and their results
Overview of 1-liners for healthcare | Most commonly used healthcare models and their results
Overview of all 1-liner Notebooks | 100+ tutorials on how to use the 1-liners on text datasets for various problems and from various sources like Twitter, Chinese news, crypto news headlines, airline traffic communication, and product review classifier training
Connect with us on Slack | Problems, questions or suggestions? We have a very active and helpful community of over 2,000 AI enthusiasts putting Johnsnowlabs products to good use
Discussion Forum | Want a more in-depth discussion with the community? Post a thread in our discussion forum
Github Issues | Report a bug
Custom Installation | Custom installations, Air-Gap mode and other alternatives
The nlp.load(<Model>) function | Load any model or pipeline in one line of code
The nlp.load(<Model>).predict(data) function | Predict on strings, lists of strings, Numpy arrays, Pandas, Modin and Spark DataFrames
The nlp.load(<train.Model>).fit(data) function | Train a text classifier for 2-class, N-class, multi-N-class, Named-Entity-Recognition or Part-of-Speech tagging
The nlp.load(<Model>).viz(data) function | Visualize the results of Word Embedding Similarity Matrix, Named Entity Recognizers, Dependency Trees & Parts of Speech, Entity Resolution, Entity Linking or Entity Status Assertion
The nlp.load(<Model>).viz_streamlit(data) function | Display an interactive GUI which lets you explore and test every model and feature in the Johnsnowlabs 1-liner repertoire in one click
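Below is a minimal sketch of the predict and fit entry points listed above, using a pandas DataFrame as input; the 'train.sentiment' reference and the label column name are illustrative assumptions for the <train.Model> pattern, not a verified recipe.

import pandas as pd
from johnsnowlabs import nlp

# predict() accepts strings, lists of strings, Numpy arrays, Pandas, Modin and Spark DataFrames.
df = pd.DataFrame({'text': ['I love NLU!', 'This is terrible.']})
predictions = nlp.load('sentiment').predict(df)
print(predictions)

# fit() trains a model when a trainable <train.Model> reference is loaded.
# 'train.sentiment' and the column names below are assumptions used for illustration only.
# labeled_df = pd.DataFrame({'text': ['great product', 'awful product'], 'y': ['positive', 'negative']})
# trained_pipe = nlp.load('train.sentiment').fit(labeled_df)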

License

This library is licensed under the Apache 2.0 license. John Snow Labs' paid products are subject to this End User License Agreement.
By calling nlp.install() to add them to your environment, you agree to its terms and conditions.

nlu's People

Contributors

agsfer, ahmedlone127, alexott, arkajyotichakraborty, brollb, c-k-loan, davebulaval, dcecchini, dependabot[bot], devintdha, diatrambitas, gadde5300, josejuanmartinez, luca-martial, mahmoodbayeshi, maziyarpanahi, meryem1425, milyiyo, murat-karadag, rajeshkppt, roverrwe, skocer, sonurdogan


nlu's Issues

logging

Hi!
I would like to ask whether it is possible to turn off logging, or change the logging level, from a Python script that uses the nlu library.
Even a simple 'import nlu' generates lines of logs, and loading models produces tons of them...

Before importing nlu, I tried creating the PySpark context and setting the desired log level, as suggested by the message printed during 'import nlu': Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel)
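A minimal sketch of that attempt, assuming the intent is to create the Spark session and lower its log level before importing nlu; note that setLogLevel only controls Spark's own log4j output, not messages emitted by TensorFlow or other native libraries.

from pyspark.sql import SparkSession

# Create (or reuse) the Spark session and lower its log level before importing nlu.
spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

import nlu
pipe = nlu.load('sentiment')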

But it doesn't seem to help, actually the opposite, I can't load models and do the predictions then...

The other approach was setting the logging level of all possible loggers (nlu, py4j, py4j.java_gateway) to CRITICAL in my case:

logging.getLogger('nlu').setLevel(logging.CRITICAL)
logging.getLogger('py4j').setLevel(logging.CRITICAL)
logging.getLogger('py4j.java_gateway').setLevel(logging.CRITICAL)

But it also didn't help.
There are still messages from e.g. WARN SparkSession$Builder, WARN ApacheUtils, I tensorflow/core/platform/cpu_feature_guard.cc:142], etc...

How to set the batch size?

Hi,

The prediction process takes a long time to finish, so I checked the GPU memory usage and found that it only uses 3 GB of memory (I have a 16 GB GPU).
I want to set a larger batch size to speed up the process, but I can't find the argument.
How do I set the batch size when using the predict function?

import nlu
pipe = nlu.load('xx.embed_sentence.labse', gpu=True)
pipe.predict(text, output_level='document')
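For what it's worth, a hedged sketch of one possible approach: Spark NLP annotators such as BertSentenceEmbeddings expose setBatchSize(), but the pipe.components / .model access path used below is an assumption about NLU's internals and may differ between versions.

import nlu

pipe = nlu.load('xx.embed_sentence.labse', gpu=True)

# Assumption: the pipe stores its wrapped Spark NLP annotators in pipe.components,
# each exposing the underlying annotator as .model. Larger batches use more GPU memory.
for component in pipe.components:
    model = getattr(component, 'model', None)
    if model is not None and hasattr(model, 'setBatchSize'):
        model.setBatchSize(128)

text = ["First sentence to embed.", "Second sentence to embed."]
print(pipe.predict(text, output_level='document'))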

Thanks

Unable to load en.ner.dl.bert

I have the following code:

documents = ["Open my files on oceans.", "Open my presentation on oceans.", "open my presentation on week 6 day 3"]
nlu_model = nlu.load('en.ner.dl.bert')
nlu_model.predict(documents, output_level='token') 

The nlu.load('en.ner.dl.bert') part causes an error that I am not sure how to fix:

ner_dl_bert download started this may take some time.
Approximate size to download 15.4 MB
[OK!]
pos_anc download started this may take some time.
Approximate size to download 4.3 MB
[OK!]
bert_base_cased download started this may take some time.
Approximate size to download 389.1 MB
[OK!]
<class 'AttributeError'>
'NoneType' object has no attribute '__set_missing_model_attributes__'
Something went wrong during loading and fitting the pipe. Check the other prints for more information and also verbose mode. Did you use a correct model reference?
The NLU components could not be properly created. Please check previous print messages and Verbose mode for further info

My environment:
ubuntu 20.10
python 3.7.9
pyspark 2.4.7
spark-nlp 2.6.5

I appreciate your help

Unsupported class file major version 58

Hi,
First, thanks a lot for this nice work!
It seems that you used Spark and the Java version of TensorFlow to accomplish this (with Python as a wrapper), right?

I tried to install the nlu package on my Python 3.8, which for now doesn't work (and that's okay, see #11).
So I created a virtual environment with Python 3.7.

Launching ipython, 'import nlu' works.

However, when I try to do as the docs say: nlu.load('sentiment').predict('Why is NLU is awesome? Because of the sauce!') (in fact just nlu.load('sentiment') is enough to cause the crash)

it returns:

<class 'pyspark.sql.utils.IllegalArgumentException'>
'Unsupported class file major version 58'

I'm using Archlinux kernel zen 5.9.1 on a new virtual env with only wheel and nlu installed on python 3.7.9

I have java 8 / 11 and 14 installed
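As a note, class file major version 58 corresponds to Java 14 (and 55 to Java 11), while the Spark 2.x releases that nlu depended on at the time only support Java 8. A hedged sketch of pinning the JVM before importing nlu; the JAVA_HOME path below is an example and must be adapted to the local Java 8 installation.

import os

# Point PySpark at a Java 8 JVM before importing nlu; adjust the path to the local Java 8 install.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk"

import nlu
nlu.load('sentiment').predict('Why is NLU is awesome? Because of the sauce!')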

Model Loading

I am loading the model like this:

import sparknlp
import nlu

spark = sparknlp.start()
df = spark.read.csv("nlp_data.csv")
res = nlu.load("pos").predict(df[["text"]].rdd.flatMap(lambda x: x).collect())
print(res)
spark.stop()

Each time I get the following messages in my console:

com.johnsnowlabs.nlp#spark-nlp_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-3f17e4b8-0bdf-40c5-9879-d62f9c2dc974;1.0
        confs: [default]
        found com.johnsnowlabs.nlp#spark-nlp_2.12;5.2.3 in central
        found com.typesafe#config;1.4.2 in central
        found org.rocksdb#rocksdbjni;6.29.5 in central
        found com.amazonaws#aws-java-sdk-s3;1.12.500 in central
        found com.amazonaws#aws-java-sdk-kms;1.12.500 in central
        found com.amazonaws#aws-java-sdk-core;1.12.500 in central
        found commons-logging#commons-logging;1.1.3 in central
        found commons-codec#commons-codec;1.15 in central
        found org.apache.httpcomponents#httpclient;4.5.13 in central
        found org.apache.httpcomponents#httpcore;4.4.13 in central
        found software.amazon.ion#ion-java;1.0.2 in central
        found com.fasterxml.jackson.dataformat#jackson-dataformat-cbor;2.12.6 in central
        found joda-time#joda-time;2.8.1 in central
        found com.amazonaws#jmespath-java;1.12.500 in central
        found com.github.universal-automata#liblevenshtein;3.0.0 in central
        found com.google.protobuf#protobuf-java-util;3.0.0-beta-3 in central
        found com.google.protobuf#protobuf-java;3.0.0-beta-3 in central
        found com.google.code.gson#gson;2.3 in central
        found it.unimi.dsi#fastutil;7.0.12 in central
        found org.projectlombok#lombok;1.16.8 in central
        found com.google.cloud#google-cloud-storage;2.20.1 in central
        found com.google.guava#guava;31.1-jre in central
        found com.google.guava#failureaccess;1.0.1 in central
        found com.google.guava#listenablefuture;9999.0-empty-to-avoid-conflict-with-guava in central
        found com.google.errorprone#error_prone_annotations;2.18.0 in central
        found com.google.j2objc#j2objc-annotations;1.3 in central
        found com.google.http-client#google-http-client;1.43.0 in central
        found io.opencensus#opencensus-contrib-http-util;0.31.1 in central
        found com.google.http-client#google-http-client-jackson2;1.43.0 in central
        found com.google.http-client#google-http-client-gson;1.43.0 in central
        found com.google.api-client#google-api-client;2.2.0 in central
        found com.google.oauth-client#google-oauth-client;1.34.1 in central
        found com.google.http-client#google-http-client-apache-v2;1.43.0 in central
        found com.google.apis#google-api-services-storage;v1-rev20220705-2.0.0 in central
        found com.google.code.gson#gson;2.10.1 in central
        found com.google.cloud#google-cloud-core;2.12.0 in central
        found io.grpc#grpc-context;1.53.0 in central
        found com.google.auto.value#auto-value-annotations;1.10.1 in central
        found com.google.auto.value#auto-value;1.10.1 in central
        found javax.annotation#javax.annotation-api;1.3.2 in central
        found com.google.cloud#google-cloud-core-http;2.12.0 in central
        found com.google.http-client#google-http-client-appengine;1.43.0 in central
        found com.google.api#gax-httpjson;0.108.2 in central
        found com.google.cloud#google-cloud-core-grpc;2.12.0 in central
        found io.grpc#grpc-alts;1.53.0 in central
        found io.grpc#grpc-grpclb;1.53.0 in central
        found org.conscrypt#conscrypt-openjdk-uber;2.5.2 in central
        found io.grpc#grpc-auth;1.53.0 in central
        found io.grpc#grpc-protobuf;1.53.0 in central
        found io.grpc#grpc-protobuf-lite;1.53.0 in central
        found io.grpc#grpc-core;1.53.0 in central
        found com.google.api#gax;2.23.2 in central
        found com.google.api#gax-grpc;2.23.2 in central
        found com.google.auth#google-auth-library-credentials;1.16.0 in central
        found com.google.auth#google-auth-library-oauth2-http;1.16.0 in central
        found com.google.api#api-common;2.6.2 in central
        found io.opencensus#opencensus-api;0.31.1 in central
        found com.google.api.grpc#proto-google-iam-v1;1.9.2 in central
        found com.google.protobuf#protobuf-java;3.21.12 in central
        found com.google.protobuf#protobuf-java-util;3.21.12 in central
        found com.google.api.grpc#proto-google-common-protos;2.14.2 in central
        found org.threeten#threetenbp;1.6.5 in central
        found com.google.api.grpc#proto-google-cloud-storage-v2;2.20.1-alpha in central
        found com.google.api.grpc#grpc-google-cloud-storage-v2;2.20.1-alpha in central
        found com.google.api.grpc#gapic-google-cloud-storage-v2;2.20.1-alpha in central
        found com.fasterxml.jackson.core#jackson-core;2.14.2 in central
        found com.google.code.findbugs#jsr305;3.0.2 in central
        found io.grpc#grpc-api;1.53.0 in central
        found io.grpc#grpc-stub;1.53.0 in central
        found org.checkerframework#checker-qual;3.31.0 in central
        found io.perfmark#perfmark-api;0.26.0 in central
        found com.google.android#annotations;4.1.1.4 in central
        found org.codehaus.mojo#animal-sniffer-annotations;1.22 in central
        found io.opencensus#opencensus-proto;0.2.0 in central
        found io.grpc#grpc-services;1.53.0 in central
        found com.google.re2j#re2j;1.6 in central
        found io.grpc#grpc-netty-shaded;1.53.0 in central
        found io.grpc#grpc-googleapis;1.53.0 in central
        found io.grpc#grpc-xds;1.53.0 in central
        found com.navigamez#greex;1.0 in central
        found dk.brics.automaton#automaton;1.11-8 in central
        found com.johnsnowlabs.nlp#tensorflow-cpu_2.12;0.4.4 in central
        found com.microsoft.onnxruntime#onnxruntime;1.16.3 in central
:: resolution report :: resolve 1966ms :: artifacts dl 54ms
        :: modules in use:
        com.amazonaws#aws-java-sdk-core;1.12.500 from central in [default]
        com.amazonaws#aws-java-sdk-kms;1.12.500 from central in [default]
        com.amazonaws#aws-java-sdk-s3;1.12.500 from central in [default]
        com.amazonaws#jmespath-java;1.12.500 from central in [default]
        com.fasterxml.jackson.core#jackson-core;2.14.2 from central in [default]
        com.fasterxml.jackson.dataformat#jackson-dataformat-cbor;2.12.6 from central in [default]
        com.github.universal-automata#liblevenshtein;3.0.0 from central in [default]
        com.google.android#annotations;4.1.1.4 from central in [default]
        com.google.api#api-common;2.6.2 from central in [default]
        com.google.api#gax;2.23.2 from central in [default]
        com.google.api#gax-grpc;2.23.2 from central in [default]
        com.google.api#gax-httpjson;0.108.2 from central in [default]
        com.google.api-client#google-api-client;2.2.0 from central in [default]
        com.google.api.grpc#gapic-google-cloud-storage-v2;2.20.1-alpha from central in [default]
        com.google.api.grpc#grpc-google-cloud-storage-v2;2.20.1-alpha from central in [default]
        com.google.api.grpc#proto-google-cloud-storage-v2;2.20.1-alpha from central in [default]
        com.google.api.grpc#proto-google-common-protos;2.14.2 from central in [default]
        com.google.api.grpc#proto-google-iam-v1;1.9.2 from central in [default]
        com.google.apis#google-api-services-storage;v1-rev20220705-2.0.0 from central in [default]
        com.google.auth#google-auth-library-credentials;1.16.0 from central in [default]
        com.google.auth#google-auth-library-oauth2-http;1.16.0 from central in [default]
        com.google.auto.value#auto-value;1.10.1 from central in [default]
        com.google.auto.value#auto-value-annotations;1.10.1 from central in [default]
        com.google.cloud#google-cloud-core;2.12.0 from central in [default]
        com.google.cloud#google-cloud-core-grpc;2.12.0 from central in [default]
        com.google.cloud#google-cloud-core-http;2.12.0 from central in [default]
        com.google.cloud#google-cloud-storage;2.20.1 from central in [default]
        com.google.code.findbugs#jsr305;3.0.2 from central in [default]
        com.google.code.gson#gson;2.10.1 from central in [default]
        com.google.errorprone#error_prone_annotations;2.18.0 from central in [default]
        com.google.guava#failureaccess;1.0.1 from central in [default]
        com.google.guava#guava;31.1-jre from central in [default]
        com.google.guava#listenablefuture;9999.0-empty-to-avoid-conflict-with-guava from central in [default]
        com.google.http-client#google-http-client;1.43.0 from central in [default]
        com.google.http-client#google-http-client-apache-v2;1.43.0 from central in [default]
        com.google.http-client#google-http-client-appengine;1.43.0 from central in [default]
        com.google.http-client#google-http-client-gson;1.43.0 from central in [default]
        com.google.http-client#google-http-client-jackson2;1.43.0 from central in [default]
        com.google.j2objc#j2objc-annotations;1.3 from central in [default]
        com.google.oauth-client#google-oauth-client;1.34.1 from central in [default]
        com.google.protobuf#protobuf-java;3.21.12 from central in [default]
        com.google.protobuf#protobuf-java-util;3.21.12 from central in [default]
        com.google.re2j#re2j;1.6 from central in [default]
        com.johnsnowlabs.nlp#spark-nlp_2.12;5.2.3 from central in [default]
        com.johnsnowlabs.nlp#tensorflow-cpu_2.12;0.4.4 from central in [default]
        com.microsoft.onnxruntime#onnxruntime;1.16.3 from central in [default]
        com.navigamez#greex;1.0 from central in [default]
        com.typesafe#config;1.4.2 from central in [default]
        commons-codec#commons-codec;1.15 from central in [default]
        commons-logging#commons-logging;1.1.3 from central in [default]
        dk.brics.automaton#automaton;1.11-8 from central in [default]
        io.grpc#grpc-alts;1.53.0 from central in [default]
        io.grpc#grpc-api;1.53.0 from central in [default]
        io.grpc#grpc-auth;1.53.0 from central in [default]
        io.grpc#grpc-context;1.53.0 from central in [default]
        io.grpc#grpc-core;1.53.0 from central in [default]
        io.grpc#grpc-googleapis;1.53.0 from central in [default]
        io.grpc#grpc-grpclb;1.53.0 from central in [default]
        io.grpc#grpc-netty-shaded;1.53.0 from central in [default]
        io.grpc#grpc-protobuf;1.53.0 from central in [default]
        io.grpc#grpc-protobuf-lite;1.53.0 from central in [default]
        io.grpc#grpc-services;1.53.0 from central in [default]
        io.grpc#grpc-stub;1.53.0 from central in [default]
        io.grpc#grpc-xds;1.53.0 from central in [default]
        io.opencensus#opencensus-api;0.31.1 from central in [default]
        io.opencensus#opencensus-contrib-http-util;0.31.1 from central in [default]
        io.opencensus#opencensus-proto;0.2.0 from central in [default]
        io.perfmark#perfmark-api;0.26.0 from central in [default]
        it.unimi.dsi#fastutil;7.0.12 from central in [default]
        javax.annotation#javax.annotation-api;1.3.2 from central in [default]
        joda-time#joda-time;2.8.1 from central in [default]
        org.apache.httpcomponents#httpclient;4.5.13 from central in [default]
        org.apache.httpcomponents#httpcore;4.4.13 from central in [default]
        org.checkerframework#checker-qual;3.31.0 from central in [default]
        org.codehaus.mojo#animal-sniffer-annotations;1.22 from central in [default]
        org.conscrypt#conscrypt-openjdk-uber;2.5.2 from central in [default]
        org.projectlombok#lombok;1.16.8 from central in [default]
        org.rocksdb#rocksdbjni;6.29.5 from central in [default]
        org.threeten#threetenbp;1.6.5 from central in [default]
        software.amazon.ion#ion-java;1.0.2 from central in [default]
        :: evicted modules:
        commons-logging#commons-logging;1.2 by [commons-logging#commons-logging;1.1.3] in [default]
        commons-codec#commons-codec;1.11 by [commons-codec#commons-codec;1.15] in [default]
        com.google.protobuf#protobuf-java-util;3.0.0-beta-3 by [com.google.protobuf#protobuf-java-util;3.21.12] in [default]
        com.google.protobuf#protobuf-java;3.0.0-beta-3 by [com.google.protobuf#protobuf-java;3.21.12] in [default]
        com.google.code.gson#gson;2.3 by [com.google.code.gson#gson;2.10.1] in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   85  |   0   |   0   |   5   ||   80  |   0   |
        ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-3f17e4b8-0bdf-40c5-9879-d62f9c2dc974
        confs: [default]
        0 artifacts copied, 80 already retrieved (0kB/27ms)


pos_anc download started this may take some time.
Approximate size to download 3.9 MB
[ / ]pos_anc download started this may take some time.
Approximate size to download 3.9 MB
[ — ]Download done! Loading the resource.
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[ | ]sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[ / ]Download done! Loading the resource.
[ — ]2024-02-06 14:43:45.340048: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[OK!]

Is it indicating that I am downloading the model(s) from the internet again and again, or am I downloading them from the jar files?
I assume that the jar files are now on my local system, since it took some time when I first installed spark-nlp, and now it just prints the jar information almost immediately when I run the code.

DataFrame problem with pyspark and pandas interaction

When executing the following code, an error occurs

from johnsnowlabs import nlp

pipeline = nlp.load('sentiment')
pipeline.predict("I love this Documentation! It's so good!")
...
Approximate size to download 354.6 KB
Download done! Loading the resource.
[OK!]
Warning::Spark Session already created, some configs may not take.
Traceback (most recent call last):
  File "/home/user/Documents/test/nlu/test_maen.py", line 8, in <module>
    pipeline.predict("I love this Documentation! It's so good!")
  File "/home/user/Documents/test/nlu/.venv/lib64/python3.10/site-packages/nlu/pipe/pipeline.py", line 468, in predict
    return __predict__(self, data, output_level, positions, keep_stranger_features, metadata, multithread,
  File "/home/user/Documents/test/nlu/.venv/lib64/python3.10/site-packages/nlu/pipe/utils/predict_helper.py", line 166, in __predict__
    pipe.fit()
  File "/home/user/Documents/test/nlu/.venv/lib64/python3.10/site-packages/nlu/pipe/pipeline.py", line 202, in fit
    self.vanilla_transformer_pipe = self.spark_estimator_pipe.fit(self.get_sample_spark_dataframe())
  File "/home/user/Documents/test/nlu/.venv/lib64/python3.10/site-packages/nlu/pipe/pipeline.py", line 101, in get_sample_spark_dataframe
    return sparknlp.start().createDataFrame(data=text_df)
  File "/home/user/Documents/test/nlu/.venv/lib64/python3.10/site-packages/pyspark/sql/session.py", line 603, in createDataFrame
    return super(SparkSession, self).createDataFrame(
  File "/home/user/Documents/test/nlu/.venv/lib64/python3.10/site-packages/pyspark/sql/pandas/conversion.py", line 299, in createDataFrame
    data = self._convert_from_pandas(data, schema, timezone)
  File "/home/user/Documents/test/nlu/.venv/lib64/python3.10/site-packages/pyspark/sql/pandas/conversion.py", line 327, in _convert_from_pandas
    for column, series in pdf.iteritems():
  File "/home/user/Documents/test/nlu/.venv/lib64/python3.10/site-packages/pandas/core/generic.py", line 6202, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'iteritems'. Did you mean: 'isetitem'?

There is a solution for this error on Stack Overflow.
Maybe you should specify the right pandas version in the dependencies of the johnsnowlabs module?
For example pandas >= 1.3.5, < 2
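A hedged sketch of that idea, assuming the failure comes only from pandas 2.x removing DataFrame.iteritems (which this PySpark version still calls): either pin pandas below 2.0, or restore the old alias before loading the pipeline.

import pandas as pd

# Restore the alias removed in pandas 2.0 so PySpark's pandas-to-Spark conversion keeps working.
if not hasattr(pd.DataFrame, "iteritems"):
    pd.DataFrame.iteritems = pd.DataFrame.items

from johnsnowlabs import nlp

pipeline = nlp.load('sentiment')
pipeline.predict("I love this Documentation! It's so good!")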

Platform - Fedora Linux 36

openjdk version "11.0.19" 2023-04-18
OpenJDK Runtime Environment (Red_Hat-11.0.19.0.7-2.fc36) (build 11.0.19+7)
OpenJDK 64-Bit Server VM (Red_Hat-11.0.19.0.7-2.fc36) (build 11.0.19+7, mixed mode, sharing)

Something went wrong during loading and fitting the pipe...

I saw this error occur in the closed issues and I believe it was fixed in a later version. I'm not sure if this is the same issue as well.

system:
Windows 10
Python 3.8.8
Pyspark 3.0.2
NLU 3.1.1
Spark 3.1.2

An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel.
: java.lang.UnsatisfiedLinkError:

Exception:
Something went wrong during loading and fitting the pipe. Check the other prints for more information and also verbose mode. Did you use a correct model reference?

Breaking dependencies

Hello, I'm trying to run your library in WSL, but an error occurs with dependencies. The full trace is below:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/callmarl/workzone/nlp/env/lib/python3.9/site-packages/nlu/pipe/pipeline.py", line 468, in predict
    return __predict__(self, data, output_level, positions, keep_stranger_features, metadata, multithread,
  File "/home/callmarl/workzone/nlp/env/lib/python3.9/site-packages/nlu/pipe/utils/predict_helper.py", line 166, in __predict__
    pipe.fit()
  File "/home/callmarl/workzone/nlp/env/lib/python3.9/site-packages/nlu/pipe/pipeline.py", line 202, in fit
    self.vanilla_transformer_pipe = self.spark_estimator_pipe.fit(self.get_sample_spark_dataframe())
  File "/home/callmarl/workzone/nlp/env/lib/python3.9/site-packages/nlu/pipe/pipeline.py", line 101, in get_sample_spark_dataframe
    return sparknlp.start().createDataFrame(data=text_df)
  File "/home/callmarl/workzone/nlp/env/lib/python3.9/site-packages/pyspark/sql/session.py", line 673, in createDataFrame
    return super(SparkSession, self).createDataFrame(
  File "/home/callmarl/workzone/nlp/env/lib/python3.9/site-packages/pyspark/sql/pandas/conversion.py", line 299, in createDataFrame
    data = self._convert_from_pandas(data, schema, timezone)
  File "/home/callmarl/workzone/nlp/env/lib/python3.9/site-packages/pyspark/sql/pandas/conversion.py", line 331, in _convert_from_pandas
    for column, series in pdf.iteritems():
  File "/home/callmarl/workzone/nlp/env/lib/python3.9/site-packages/pandas/core/generic.py", line 6202, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'iteritems'
callmarl@LAPTOP-QS9M6N2F ~/workzone/nlp % python --version
Python 3.9.2
callmarl@LAPTOP-QS9M6N2F ~/workzone/nlp % pip freeze
asttokens==2.4.0
backcall==0.2.0
certifi==2023.7.22
charset-normalizer==3.2.0
click==8.1.7
colorama==0.4.6
databricks-api==0.9.0
databricks-cli==0.17.7
dataclasses==0.6
decorator==5.1.1
exceptiongroup==1.1.3
executing==1.2.0
idna==3.4
ipython==8.15.0
jedi==0.19.0
johnsnowlabs==5.0.7
matplotlib-inline==0.1.6
nlu==5.0.0
numpy==1.25.2
oauthlib==3.2.2
pandas==2.1.0
parso==0.8.3
pexpect==4.8.0
pickleshare==0.7.5
pkg_resources==0.0.0
prompt-toolkit==3.0.39
ptyprocess==0.7.0
pure-eval==0.2.2
py4j==0.10.9
pyarrow==13.0.0
pydantic==1.10.11
Pygments==2.16.1
PyJWT==2.8.0
pyspark==3.1.2
python-dateutil==2.8.2
pytz==2023.3.post1
requests==2.31.0
six==1.16.0
spark-nlp==5.0.2
spark-nlp-display==4.1
stack-data==0.6.2
svgwrite==1.4
tabulate==0.9.0
traitlets==5.9.0
typing_extensions==4.7.1
tzdata==2023.3
urllib3==1.26.16
wcwidth==0.2.6

using NLU-biobert for entity linking or word embedding

I just wanted to enquire how one can use this model for entity linking. I believe I did see some linking and POS tagging, but is there documentation that shows matching words to their meaning rather than just matching by similarity? I want to load a Spark database, use the model to perform word embedding by meaning on the whole dataset, store the output in another dataframe, and also be able to measure its performance with various metrics.
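A minimal sketch of embedding a Spark DataFrame with the BioBERT model referenced elsewhere in these issues; the 'text' column name and treating the result as a token-level DataFrame are assumptions, and this only produces embeddings, not entity linking.

import nlu

# spark_df is assumed to be an existing Spark DataFrame with a 'text' column;
# predict() also accepts pandas DataFrames, lists of strings, and plain strings.
pipe = nlu.load('en.embed.biobert.pubmed_pmc_base_cased')
embeddings_df = pipe.predict(spark_df, output_level='token')
print(embeddings_df.head())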

Issue with nlu.load('sentiment')

I'm trying to follow the example at nlu/examples/colab/component_examples/sequence2sequence/translation_demo.ipynb, but I keep getting this error when calling nlu.load('sentiment').

My code:

import nlu 
nlu.load('sentiment').predict('I love NLU! <3') 

My error:

analyze_sentiment download started this may take some time.
Approx size to download 4.9 MB
[OK!]
<class 'pyspark.sql.utils.IllegalArgumentException'>
'Unsupported class file major version 55'
Something went wrong during loading and fitting the pipe. Check the other prints for more information and also verbose mode. Did you use a correct model reference?
<nlu.NluError at 0x7f046d214be0>

Unable to pip install nlu (macOS BigSur, Python3.9)

Hello,
I was able to pip install nlu before upgrading macOS.
After the upgrade, I wanted to get a clean environment, and when I tried to install nlu again I got this error:

pablos-MBP:spark pablo$ pip install nlu
Defaulting to user installation because normal site-packages is not writeable
Collecting nlu
  Using cached nlu-1.0.2-py3-none-any.whl (150 kB)
Collecting pyarrow>=0.16.0
  Using cached pyarrow-1.0.1.tar.gz (1.3 MB)
  Installing build dependencies ... error
  ERROR: Command errored out with exit status 1:
   command: /usr/local/opt/[email protected]/bin/python3.9 /Users/pablo/Library/Python/3.9/lib/python/site-packages/pip install --ignore-installed --no-user --prefix /private/var/folders/8j/lbsf0k851g391m73x6y10rsr0000gn/T/pip-build-env-xh65myma/overlay --no-warn-script-location --no-binary :none: --only-binary :none: -i https://pypi.org/simple -- 'cython >= 0.29' 'numpy==1.14.5; python_version<'"'"'3.7'"'"'' 'numpy==1.16.0; python_version>='"'"'3.7'"'"'' setuptools setuptools_scm wheel
       cwd: None
  Complete output (4217 lines):

If you want, I can share the 4217 lines of the complete error. It is probably the same error as the other ticket about compatibility with Python 3.8, in this case 3.9, so is it really any 3.7+?

combining 'sentiment' and 'emotion' models causes crash

I'm working in a Google Colab notebook and I set up via

!wget http://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash

import nlu

a quick version check nlu.version() confirms 3.4.2

Several of the official tutorial notebooks (for ex.: XLNet) create a multi-model pipeline that includes both 'sentiment' and 'emotion'.

Direct copy of content from the notebook:

import pandas as pd

# Download the dataset 
!wget -N https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sarcasm/train-balanced-sarcasm.csv -P /tmp

# Load dataset to Pandas
df = pd.read_csv('/tmp/train-balanced-sarcasm.csv')

pipe = nlu.load('sentiment pos xlnet emotion') 

df['text'] = df['comment']

max_rows = 200

predictions = pipe.predict(df.iloc[0:100][['comment','label']], output_level='token')

predictions

However, running a prediction on this pipe results in the following error:


sentimentdl_glove_imdb download started this may take some time.
Approximate size to download 8.7 MB
[OK!]
pos_anc download started this may take some time.
Approximate size to download 3.9 MB
[OK!]
xlnet_base_cased download started this may take some time.
Approximate size to download 417.5 MB
[OK!]
classifierdl_use_emotion download started this may take some time.
Approximate size to download 21.3 MB
[OK!]
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
---------------------------------------------------------------------------
IllegalArgumentException                  Traceback (most recent call last)
<ipython-input-1-9b2e4a06bf65> in <module>()
     34 
     35 # NLU to gives us one row per embedded word by specifying the output level
---> 36 predictions = pipe.predict( df.iloc[0:5][['text','label']], output_level='token' )
     37 
     38 display(predictions)

9 frames
/usr/local/lib/python3.7/dist-packages/pyspark/sql/utils.py in raise_from(e)

IllegalArgumentException: requirement failed: Wrong or missing inputCols annotators in SentimentDLModel_6c1a68f3f2c7.

Current inputCols: sentence_embeddings@glove_100d. Dataset's columns:
(column_name=text,is_nlp_annotator=false)
(column_name=document,is_nlp_annotator=true,type=document)
(column_name=sentence,is_nlp_annotator=true,type=document)
(column_name=sentence_embeddings@tfhub_use,is_nlp_annotator=true,type=sentence_embeddings).
Make sure such annotators exist in your pipeline, with the right output names and that they have following annotator types: sentence_embeddings

Having experimented with various combinations of models, it turns out that the problem is caused whenever 'sentiment' and 'emotion' models are specified in the same pipeline (regardless of pipeline order or what other models are listed).

Running pipe = nlu.load('emotion ANY OTHER MODELS') or pipe = nlu.load('sentiment ANY OTHER MODELS') will be successful, so it really appears to be only a result of combining 'sentiment' and 'emotion'

Is this a known bug? Does anyone have any suggestions for fixing?

My temporary solution has been to run emoPipe = nlu.load('emotion').predict() in isolation, then inner join the resulting dataframe to the resulting df of pipe = nlu.load('sentiment pos xlnet').predict().
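A sketch of that workaround, under the assumptions that both predict() calls return pandas DataFrames indexed by the same input rows and that the emotion output lands in a column literally named 'emotion' (neither assumption verified):

import nlu

sample = df.iloc[0:100][['comment', 'label']]

emo_df = nlu.load('emotion').predict(sample)
rest_df = nlu.load('sentiment pos xlnet').predict(sample)

# Inner-join on the shared index; 'emotion' as the output column name is an assumption.
predictions = rest_df.join(emo_df[['emotion']], how='inner')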

However, I would like to understand better what is failing and to know if there is a way to streamline the inclusion of all models.

Thanks

error while using biobert PubMed PMC

Hi, I am very interested in this NLU BioBERT library. It's easy to implement yet understandable. However, I faced difficulties while trying to use NLU BioBERT for my project. I want to run this code:

import nlu

embeddings_df2 = nlu.load('en.embed.biobert.pubmed_pmc_base_cased', gpu=True).predict(df['text'], output_level='token')
embeddings_df2

I am using Google Colab with a GPU. After approximately 40 minutes, it suddenly stopped and produced the following error:

biobert_pubmed_pmc_base_cased download started this may take some time.
Approximate size to download 386.7 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


Exception happened during processing of request from ('127.0.0.1', 40522)
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py", line 1207, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py", line 1033, in send_command
response = connection.send_command(command)
File "/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py", line 1212, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
Traceback (most recent call last):
File "/usr/lib/python3.7/socketserver.py", line 316, in _handle_request_noblock
self.process_request(request, client_address)
File "/usr/lib/python3.7/socketserver.py", line 347, in process_request
self.finish_request(request, client_address)
File "/usr/lib/python3.7/socketserver.py", line 360, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "/usr/lib/python3.7/socketserver.py", line 720, in init
self.handle()
File "/usr/local/lib/python3.7/dist-packages/pyspark/accumulators.py", line 268, in handle
poll(accum_updates)
File "/usr/local/lib/python3.7/dist-packages/pyspark/accumulators.py", line 241, in poll
if func():
File "/usr/local/lib/python3.7/dist-packages/pyspark/accumulators.py", line 245, in accum_updates
num_updates = read_int(self.rfile)
File "/usr/local/lib/python3.7/dist-packages/pyspark/serializers.py", line 595, in read_int
raise EOFError
EOFError

ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:35473)
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/nlu/pipe/pipeline.py", line 438, in predict
self.configure_light_pipe_usage(data.count(), multithread)
File "/usr/local/lib/python3.7/dist-packages/pyspark/sql/dataframe.py", line 585, in count
return int(self._jdf.count())
File "/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py", line 1305, in call
answer, self.gateway_client, self.target_id, self.name)
File "/usr/local/lib/python3.7/dist-packages/pyspark/sql/utils.py", line 128, in deco
return f(*a, **kw)
File "/usr/local/lib/python3.7/dist-packages/py4j/protocol.py", line 336, in get_return_value
format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling o1231.count

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py", line 977, in _get_connection
connection = self.deque.pop()
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py", line 1115, in start
self.socket.connect((self.address, self.port))
ConnectionRefusedError: [Errno 111] Connection refused
Exception occured
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/nlu/pipe/pipeline.py", line 438, in predict
self.configure_light_pipe_usage(data.count(), multithread)
File "/usr/local/lib/python3.7/dist-packages/pyspark/sql/dataframe.py", line 585, in count
return int(self._jdf.count())
File "/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py", line 1305, in call
answer, self.gateway_client, self.target_id, self.name)
File "/usr/local/lib/python3.7/dist-packages/pyspark/sql/utils.py", line 128, in deco
return f(*a, **kw)
File "/usr/local/lib/python3.7/dist-packages/py4j/protocol.py", line 336, in get_return_value
format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling o1231.count

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/nlu/pipe/pipeline.py", line 435, in predict
data, stranger_features, output_datatype = DataConversionUtils.to_spark_df(data, self.spark, self.raw_text_column)
TypeError: cannot unpack non-iterable NoneType object
Exception occured
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/nlu/pipe/pipeline.py", line 438, in predict
self.configure_light_pipe_usage(data.count(), multithread)
File "/usr/local/lib/python3.7/dist-packages/pyspark/sql/dataframe.py", line 585, in count
return int(self._jdf.count())
File "/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py", line 1305, in call
answer, self.gateway_client, self.target_id, self.name)
File "/usr/local/lib/python3.7/dist-packages/pyspark/sql/utils.py", line 128, in deco
return f(*a, **kw)
File "/usr/local/lib/python3.7/dist-packages/py4j/protocol.py", line 336, in get_return_value
format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling o1231.count

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/nlu/pipe/pipeline.py", line 435, in predict
data, stranger_features, output_datatype = DataConversionUtils.to_spark_df(data, self.spark, self.raw_text_column)
TypeError: cannot unpack non-iterable NoneType object
Exception occured
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/nlu/pipe/pipeline.py", line 438, in predict
self.configure_light_pipe_usage(data.count(), multithread)
File "/usr/local/lib/python3.7/dist-packages/pyspark/sql/dataframe.py", line 585, in count
return int(self._jdf.count())
File "/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py", line 1305, in call
answer, self.gateway_client, self.target_id, self.name)
File "/usr/local/lib/python3.7/dist-packages/pyspark/sql/utils.py", line 128, in deco
return f(*a, **kw)
File "/usr/local/lib/python3.7/dist-packages/py4j/protocol.py", line 336, in get_return_value
format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling o1231.count

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/nlu/pipe/pipeline.py", line 435, in predict
data, stranger_features, output_datatype = DataConversionUtils.to_spark_df(data, self.spark, self.raw_text_column)
TypeError: cannot unpack non-iterable NoneType object
Exception occured
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/nlu/pipe/pipeline.py", line 438, in predict
self.configure_light_pipe_usage(data.count(), multithread)
File "/usr/local/lib/python3.7/dist-packages/pyspark/sql/dataframe.py", line 585, in count
return int(self._jdf.count())
File "/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py", line 1305, in call
answer, self.gateway_client, self.target_id, self.name)
File "/usr/local/lib/python3.7/dist-packages/pyspark/sql/utils.py", line 128, in deco
return f(*a, **kw)
File "/usr/local/lib/python3.7/dist-packages/py4j/protocol.py", line 336, in get_return_value
format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling o1231.count

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/nlu/pipe/pipeline.py", line 435, in predict
data, stranger_features, output_datatype = DataConversionUtils.to_spark_df(data, self.spark, self.raw_text_column)
TypeError: cannot unpack non-iterable NoneType object
Exception occured
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/nlu/pipe/pipeline.py", line 438, in predict
self.configure_light_pipe_usage(data.count(), multithread)
File "/usr/local/lib/python3.7/dist-packages/pyspark/sql/dataframe.py", line 585, in count
return int(self._jdf.count())
File "/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py", line 1305, in call
answer, self.gateway_client, self.target_id, self.name)
File "/usr/local/lib/python3.7/dist-packages/pyspark/sql/utils.py", line 128, in deco
return f(*a, **kw)
File "/usr/local/lib/python3.7/dist-packages/py4j/protocol.py", line 336, in get_return_value
format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling o1231.count

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/nlu/pipe/pipeline.py", line 435, in predict
data, stranger_features, output_datatype = DataConversionUtils.to_spark_df(data, self.spark, self.raw_text_column)
TypeError: cannot unpack non-iterable NoneType object
ERROR:nlu:Exception occured
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/nlu/pipe/pipeline.py", line 438, in predict
self.configure_light_pipe_usage(data.count(), multithread)
File "/usr/local/lib/python3.7/dist-packages/pyspark/sql/dataframe.py", line 585, in count
return int(self._jdf.count())
File "/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py", line 1305, in call
answer, self.gateway_client, self.target_id, self.name)
File "/usr/local/lib/python3.7/dist-packages/pyspark/sql/utils.py", line 128, in deco
return f(*a, **kw)
File "/usr/local/lib/python3.7/dist-packages/py4j/protocol.py", line 336, in get_return_value
format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling o1231.count

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/nlu/pipe/pipeline.py", line 435, in predict
data, stranger_features, output_datatype = DataConversionUtils.to_spark_df(data, self.spark, self.raw_text_column)
TypeError: cannot unpack non-iterable NoneType object
No accepted Data type or usable columns found or applying the NLU models failed.
Make sure that the first column you pass to .predict() is the one that nlu should predict on OR rename the column you want to predict on to 'text'
On try to reset restart Jupyter session and run the setup script again, you might have used too much memory
Full Stacktrace was (<class 'TypeError'>, TypeError('cannot unpack non-iterable NoneType object'), <traceback object at 0x7f4ed5dd60f0>)
Additional info:
<class 'TypeError'> pipeline.py 435
cannot unpack non-iterable NoneType object
Stuck? Contact us on Slack! https://join.slack.com/t/spark-nlp/shared_invite/zt-lutct9gm-kuUazcyFKhuGY3_0AMkxqA

I already tried 2-3 times. In my opinion it is probably due to exceeding the available RAM. However, I already activated the GPU itself. Any solution for this? Thanks in advance.

nlu support on Python 3.8

On 'import nlu', it looks like pyspark/cloudpickle.py is failing with:

TypeError: an integer is required (got type bytes). After some research, I found this is an issue with running PySpark on Python 3.8. I am not sure if this is the only cause, but if it is, I recommend declaring a requirement of Python<3.8.

GPU support

Hi,
I am using the Marian Models for translation.
It works fine, but I am assuming it runs only on the CPU.
(I am using the following code:
pipe_translate = nlu.load('hu.translate_to.en')
translate = pipe_translate.predict("Sziasztok, mi a helyzet?")
and the predict part takes about 5 seconds, and I have an A100 GPU;
I don't think this should take so long...)
I can't figure out how to use the GPU, or how to check whether it is using the GPU...
(print(tf.test.gpu_device_name()) shows that the GPU is there...)
Where can I find some documentation/info about this issue?
I had some issues with CUDA and Java installation, but right now these look fine...
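Other issues in this repo pass gpu=True to nlu.load (for example for LaBSE and BioBERT), so that flag is presumably how GPU execution is requested here as well; whether the Marian model then actually runs on the A100 would still need to be verified.

import nlu

# gpu=True is the flag used elsewhere in these issues to request GPU execution.
pipe_translate = nlu.load('hu.translate_to.en', gpu=True)
translate = pipe_translate.predict("Sziasztok, mi a helyzet?")
print(translate)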

Thanks

Unknown environment issue with BioBert

I am using the nlu BioBert mapper to improve upon a tool that already exists called text2term. A few weeks ago, I was able to get the tool working on a personal computer (Mac), but shortly after when I switched to my new work computer (also Mac, same OS but with an Apple Chip instead of Intel), the program no longer worked even with the same source code, Python, and Java version.

A coworker recreated the issue with an Apple Chip computer, Python 3.9.5, and Java 17. If you have any insights, please let me know.

Here are the co-requirements, as well as the versions and the error:
Python 3.10.6 (Also tried 3.9.13)
Java version "1.8.0_341" (Also tried Java 16)
requirements.txt:

Owlready2==0.36
argparse==1.4.0
pandas==1.4.1
numpy==1.23.2
gensim==4.1.2
scipy==1.8.0
scikit-learn==1.0.2
setuptools==60.9.3
requests==2.27.1
tqdm==4.62.3
sparse_dot_topn==0.3.1
bioregistry==0.4.63
nltk==3.7
rapidfuzz==2.0.5
shortuuid==1.0.9

Error:

ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/site-packages/py4j/clientserver.py", line 516, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/site-packages/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/site-packages/py4j/clientserver.py", line 539, in send_command
    raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending or receiving
[OK!]
Traceback (most recent call last):
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/site-packages/nlu/pipe/component_resolution.py", line 276, in get_trained_component_for_nlp_model_ref
    component.get_pretrained_model(nlp_ref, lang, model_bucket),
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/site-packages/nlu/components/embeddings/sentence_bert/BertSentenceEmbedding.py", line 13, in get_pretrained_model
    return BertSentenceEmbeddings.pretrained(name,language,bucket) \
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/site-packages/sparknlp/annotator/embeddings/bert_sentence_embeddings.py", line 231, in pretrained
    return ResourceDownloader.downloadModel(BertSentenceEmbeddings, name, lang, remote_loc)
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/site-packages/sparknlp/pretrained/resource_downloader.py", line 40, in downloadModel
    j_obj = _internal._DownloadModel(reader.name, name, language, remote_loc, j_dwn).apply()
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/site-packages/sparknlp/internal/__init__.py", line 317, in __init__
    super(_DownloadModel, self).__init__("com.johnsnowlabs.nlp.pretrained." + validator + ".downloadModel", reader,
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/site-packages/sparknlp/internal/extended_java_wrapper.py", line 26, in __init__
    self._java_obj = self.new_java_obj(java_obj, *args)
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/site-packages/sparknlp/internal/extended_java_wrapper.py", line 36, in new_java_obj
    return self._new_java_obj(java_class, *args)
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/site-packages/pyspark/ml/wrapper.py", line 86, in _new_java_obj
    return java_obj(*java_args)
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/site-packages/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/site-packages/pyspark/sql/utils.py", line 190, in deco
    return f(*a, **kw)
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/site-packages/py4j/protocol.py", line 334, in get_return_value
    raise Py4JError(
py4j.protocol.Py4JError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/site-packages/nlu/__init__.py", line 234, in load
    nlu_component = nlu_ref_to_component(nlu_ref)
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/site-packages/nlu/pipe/component_resolution.py", line 160, in nlu_ref_to_component
    resolved_component = get_trained_component_for_nlp_model_ref(lang, nlu_ref, nlp_ref, license_type, model_params)
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/site-packages/nlu/pipe/component_resolution.py", line 287, in get_trained_component_for_nlp_model_ref
    raise ValueError(f'Failure making component, nlp_ref={nlp_ref}, nlu_ref={nlu_ref}, lang={lang}, \n err={e}')
ValueError: Failure making component, nlp_ref=sent_biobert_pmc_base_cased, nlu_ref=en.embed_sentence.biobert.pmc_base_cased, lang=en, 
 err=An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/jason/Documents/GitHub/ontology-mapper/text2term/__main__.py", line 48, in <module>
    Text2Term().map_file(arguments.source, arguments.target, output_file=arguments.output, csv_columns=csv_columns,
  File "/Users/jason/Documents/GitHub/ontology-mapper/text2term/t2t.py", line 63, in map_file
    return self.map(source_terms, target_ontology, source_terms_ids=source_terms_ids, base_iris=base_iris,
  File "/Users/jason/Documents/GitHub/ontology-mapper/text2term/t2t.py", line 115, in map
    self._do_biobert_mapping(source_terms, target_terms, biobert_file)
  File "/Users/jason/Documents/GitHub/ontology-mapper/text2term/t2t.py", line 161, in _do_biobert_mapping
    biobert = BioBertMapper(ontology_terms)
  File "/Users/jason/Documents/GitHub/ontology-mapper/text2term/biobert_mapper.py", line 28, in __init__
    self.biobert = self.load_biobert()
  File "/Users/jason/Documents/GitHub/ontology-mapper/text2term/biobert_mapper.py", line 34, in load_biobert
    biobert = nlu.load('en.embed_sentence.biobert.pmc_base_cased')
  File "/Users/jason/.pyenv/versions/3.10.6/lib/python3.10/site-packages/nlu/__init__.py", line 249, in load
    raise Exception(
Exception: Something went wrong during creating the Spark NLP model_anno_obj for your request =  en.embed_sentence.biobert.pmc_base_cased Did you use a NLU Spell?

Error while trying to load nlu.load('embed_sentence.bert')

I am trying to create a sentence similarity model using Spark NLP, but I am getting the two different errors below.

sent_small_bert_L2_128 download started this may take some time.
Approximate size to download 16.1 MB
[OK!]

IllegalArgumentException Traceback (most recent call last)
File c:\users\ramesar2\appdata\local\programs\python\python38\lib\site-packages\nlu\pipe\component_resolution.py:276, in get_trained_component_for_nlp_model_ref(lang, nlu_ref, nlp_ref, license_type, model_configs)
274 if component.get_pretrained_model:
275 component = component.set_metadata(
--> 276 component.get_pretrained_model(nlp_ref, lang, model_bucket),
277 nlu_ref, nlp_ref, lang, False, license_type)
278 else:

File c:\users\ramesar2\appdata\local\programs\python\python38\lib\site-packages\nlu\components\embeddings\sentence_bert\BertSentenceEmbedding.py:13, in BertSentence.get_pretrained_model(name, language, bucket)
11 @staticmethod
12 def get_pretrained_model(name, language, bucket=None):
---> 13 return BertSentenceEmbeddings.pretrained(name,language,bucket)
14 .setInputCols('sentence')
15 .setOutputCol("sentence_embeddings")

File c:\users\ramesar2\appdata\local\programs\python\python38\lib\site-packages\sparknlp\annotator\embeddings\bert_sentence_embeddings.py:231, in BertSentenceEmbeddings.pretrained(name, lang, remote_loc)
230 from sparknlp.pretrained import ResourceDownloader
--> 231 return ResourceDownloader.downloadModel(BertSentenceEmbeddings, name, lang, remote_loc)

File c:\users\ramesar2\appdata\local\programs\python\python38\lib\site-packages\sparknlp\pretrained\resource_downloader.py:40, in ResourceDownloader.downloadModel(reader, name, language, remote_loc, j_dwn)
39 try:
---> 40 j_obj = _internal._DownloadModel(reader.name, name, language, remote_loc, j_dwn).apply()
41 except Py4JJavaError as e:

File c:\users\ramesar2\appdata\local\programs\python\python38\lib\site-packages\sparknlp\internal\__init__.py:317, in _DownloadModel.__init__(self, reader, name, language, remote_loc, validator)
316 def __init__(self, reader, name, language, remote_loc, validator):
--> 317 super(_DownloadModel, self).__init__("com.johnsnowlabs.nlp.pretrained." + validator + ".downloadModel", reader,
318 name, language, remote_loc)

File c:\users\ramesar2\appdata\local\programs\python\python38\lib\site-packages\sparknlp\internal\extended_java_wrapper.py:26, in ExtendedJavaWrapper.__init__(self, java_obj, *args)
25 self.sc = SparkContext._active_spark_context
---> 26 self._java_obj = self.new_java_obj(java_obj, *args)
27 self.java_obj = self._java_obj

File c:\users\ramesar2\appdata\local\programs\python\python38\lib\site-packages\sparknlp\internal\extended_java_wrapper.py:36, in ExtendedJavaWrapper.new_java_obj(self, java_class, *args)
35 def new_java_obj(self, java_class, *args):
---> 36 return self._new_java_obj(java_class, *args)

File c:\users\ramesar2\appdata\local\programs\python\python38\lib\site-packages\pyspark\ml\wrapper.py:69, in JavaWrapper._new_java_obj(java_class, *args)
68 java_args = [_py2java(sc, arg) for arg in args]
---> 69 return java_obj(*java_args)

File c:\users\ramesar2\appdata\local\programs\python\python38\lib\site-packages\py4j\java_gateway.py:1304, in JavaMember.__call__(self, *args)
1303 answer = self.gateway_client.send_command(command)
-> 1304 return_value = get_return_value(
1305 answer, self.gateway_client, self.target_id, self.name)
1307 for temp_arg in temp_args:

File c:\users\ramesar2\appdata\local\programs\python\python38\lib\site-packages\pyspark\sql\utils.py:134, in capture_sql_exception.<locals>.deco(*a, **kw)
131 if not isinstance(converted, UnknownException):
132 # Hide where the exception came from that shows a non-Pythonic
133 # JVM exception message.
--> 134 raise_from(converted)
135 else:

File <string>:3, in raise_from(e)

IllegalArgumentException: requirement failed: Was not found appropriate resource to download for request: ResourceRequest(sent_small_bert_L2_128,Some(en),public/models,4.0.2,3.3.0) with downloader: com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader@c7c973f

During handling of the above exception, another exception occurred:

ValueError Traceback (most recent call last)
File c:\users\ramesar2\appdata\local\programs\python\python38\lib\site-packages\nlu\__init__.py:234, in load(request, path, verbose, gpu, streamlit_caching, m1_chip)
233 continue
--> 234 nlu_component = nlu_ref_to_component(nlu_ref)
235 # if we get a list of components, then the NLU reference is a pipeline, we do not need to check order

File c:\users\ramesar2\appdata\local\programs\python\python38\lib\site-packages\nlu\pipe\component_resolution.py:160, in nlu_ref_to_component(nlu_ref, detect_lang, authenticated)
159 else:
--> 160 resolved_component = get_trained_component_for_nlp_model_ref(lang, nlu_ref, nlp_ref, license_type, model_params)
162 if resolved_component is None:

File c:\users\ramesar2\appdata\local\programs\python\python38\lib\site-packages\nlu\pipe\component_resolution.py:287, in get_trained_component_for_nlp_model_ref(lang, nlu_ref, nlp_ref, license_type, model_configs)
286 except Exception as e:
--> 287 raise ValueError(f'Failure making component, nlp_ref={nlp_ref}, nlu_ref={nlu_ref}, lang={lang}, \n err={e}')
289 return component

ValueError: Failure making component, nlp_ref=sent_small_bert_L2_128, nlu_ref=embed_sentence.bert, lang=en,
err=requirement failed: Was not found appropriate resource to download for request: ResourceRequest(sent_small_bert_L2_128,Some(en),public/models,4.0.2,3.3.0) with downloader: com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader@c7c973f

During handling of the above exception, another exception occurred:

Exception Traceback (most recent call last)
Cell In [16], line 2
1 import nlu
----> 2 pipe = nlu.load('embed_sentence.bert')
3 print("pipe",pipe)

File c:\users\ramesar2\appdata\local\programs\python\python38\lib\site-packages\nlu\__init__.py:249, in load(request, path, verbose, gpu, streamlit_caching, m1_chip)
247 print(e[1])
248 print(err)
--> 249 raise Exception(
250 f"Something went wrong during creating the Spark NLP model_anno_obj for your request = {request} Did you use a NLU Spell?")
251 # Complete Spark NLP Pipeline, which is defined as a DAG given by the starting Annotators
252 try:

Exception: Something went wrong during creating the Spark NLP model_anno_obj for your request = embed_sentence.bert Did you use a NLU Spell?

Error when loading match.datetime component

import nlu
nlu.load('match.datetime').predict('In the years 2000/01/01 to 2010/01/01 a lot of things happened')

Running it in Colab with pip install nlu pyspark==3.0.2, I get this error:
Exception: Something went wrong during loading and fitting the pipe. Check the other prints for more information and also verbose mode. Did you use a correct model reference?

Elmo does not work

I installed all the packages and ran the examples, but Elmo does not work; two different errors come up (the error screenshots were attached).

Please help me!

Java problems when using the library

Hello, I followed all the installation steps in the documentation, but it was not enough to get the library working.

Then I had to install the JDK and specify the interpreter and the path to the JDK:

import os
from johnsnowlabs import nlp

os.environ["PYSPARK_DRIVER_PYTHON"] = "D:\\myproject\\nlp_command\\.venv\\Scripts"
os.environ["JAVA_HOME"] = "C:\\Program Files\\Java\\jdk-20"

pipeline = nlp.load('sentiment')
# pipeline.predict("I love this Documentation! It's so good!")

But I'm still getting the "Java gateway process exited before sending its port number" error.

Platform - windows 10

java version "20.0.2" 2023-07-18
Java(TM) SE Runtime Environment (build 20.0.2+9-78)
Java HotSpot(TM) 64-Bit Server VM (build 20.0.2+9-78, mixed mode, sharing)

load error

I get the following error when running:

import nlu
nlu.load('elmo')

using configuration:
OS: Windows 10
Java version: 1.8.0_311 (Java 8)
Pyspark โ€“ version: 3.1.2

:: loading settings :: url = jar:file:/C:/Spark/spark-3.2.0-bin-hadoop3.2/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: C:\Users\Lukas\.ivy2\cache
The jars for the packages stored in: C:\Users\Lukas\.ivy2\jars
com.johnsnowlabs.nlp#spark-nlp_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-f9a2f2a7-e7ac-44f5-a922-ae1493621cbc;1.0
confs: [default]
found com.johnsnowlabs.nlp#spark-nlp_2.12;3.3.4 in central
found com.typesafe#config;1.4.1 in central
found org.rocksdb#rocksdbjni;6.5.3 in central
found com.amazonaws#aws-java-sdk-bundle;1.11.603 in central
found com.github.universal-automata#liblevenshtein;3.0.0 in central
found com.google.code.findbugs#annotations;3.0.1 in central
found net.jcip#jcip-annotations;1.0 in central
found com.google.code.findbugs#jsr305;3.0.1 in central
found com.google.protobuf#protobuf-java-util;3.0.0-beta-3 in central
found com.google.protobuf#protobuf-java;3.0.0-beta-3 in central
found com.google.code.gson#gson;2.3 in central
found it.unimi.dsi#fastutil;7.0.12 in central
found org.projectlombok#lombok;1.16.8 in central
found org.slf4j#slf4j-api;1.7.21 in central
found com.navigamez#greex;1.0 in central
found dk.brics.automaton#automaton;1.11-8 in central
found org.json4s#json4s-ext_2.12;3.5.3 in central
found joda-time#joda-time;2.9.5 in central
found org.joda#joda-convert;1.8.1 in central
found com.johnsnowlabs.nlp#tensorflow-cpu_2.12;0.3.3 in central
found net.sf.trove4j#trove4j;3.0.3 in central
:: resolution report :: resolve 391ms :: artifacts dl 16ms
:: modules in use:
com.amazonaws#aws-java-sdk-bundle;1.11.603 from central in [default]
com.github.universal-automata#liblevenshtein;3.0.0 from central in [default]
com.google.code.findbugs#annotations;3.0.1 from central in [default]
com.google.code.findbugs#jsr305;3.0.1 from central in [default]
com.google.code.gson#gson;2.3 from central in [default]
com.google.protobuf#protobuf-java;3.0.0-beta-3 from central in [default]
com.google.protobuf#protobuf-java-util;3.0.0-beta-3 from central in [default]
com.johnsnowlabs.nlp#spark-nlp_2.12;3.3.4 from central in [default]
com.johnsnowlabs.nlp#tensorflow-cpu_2.12;0.3.3 from central in [default]
com.navigamez#greex;1.0 from central in [default]
com.typesafe#config;1.4.1 from central in [default]
dk.brics.automaton#automaton;1.11-8 from central in [default]
it.unimi.dsi#fastutil;7.0.12 from central in [default]
joda-time#joda-time;2.9.5 from central in [default]
net.jcip#jcip-annotations;1.0 from central in [default]
net.sf.trove4j#trove4j;3.0.3 from central in [default]
org.joda#joda-convert;1.8.1 from central in [default]
org.json4s#json4s-ext_2.12;3.5.3 from central in [default]
org.projectlombok#lombok;1.16.8 from central in [default]
org.rocksdb#rocksdbjni;6.5.3 from central in [default]
org.slf4j#slf4j-api;1.7.21 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 21 | 0 | 0 | 0 || 21 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-f9a2f2a7-e7ac-44f5-a922-ae1493621cbc
confs: [default]
0 artifacts copied, 21 already retrieved (0kB/0ms)
22/01/14 17:30:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
elmo download started this may take some time.
22/01/14 17:31:05 WARN ProcfsMetricsGetter: Exception when trying to compute pagesize, as a result reporting of ProcessTree metrics is stopped
EXCEPTION: Could not resolve singular Component for type=elmo and nlp_ref=elmo and nlu_ref=elmo and lang =en
Traceback (most recent call last):
File "D:.venv\python3.8_nlu\lib\site-packages\nlu\pipe\component_resolution.py", line 708, in construct_component_from_identifier
return Embeddings(get_default=False, nlp_ref=nlp_ref, nlu_ref=nlu_ref, lang=language,
File "D:.venv\python3.8_nlu\lib\site-packages\nlu\components\embedding.py", line 98, in init
else : self.model =SparkNLPElmo.get_pretrained_model(nlp_ref, lang)
File "D:.venv\python3.8_nlu\lib\site-packages\nlu\components\embeddings\elmo\spark_nlp_elmo.py", line 14, in get_pretrained_model
return ElmoEmbeddings.pretrained(name,language)
File "D:.venv\python3.8_nlu\lib\site-packages\sparknlp\annotator.py", line 7760, in pretrained
return ResourceDownloader.downloadModel(ElmoEmbeddings, name, lang, remote_loc)
File "D:.venv\python3.8_nlu\lib\site-packages\sparknlp\pretrained.py", line 50, in downloadModel
file_size = _internal._GetResourceSize(name, language, remote_loc).apply()
File "D:.venv\python3.8_nlu\lib\site-packages\sparknlp\internal.py", line 231, in init
super(_GetResourceSize, self).init(
File "D:.venv\python3.8_nlu\lib\site-packages\sparknlp\internal.py", line 165, in init
self._java_obj = self.new_java_obj(java_obj, *args)
File "D:.venv\python3.8_nlu\lib\site-packages\sparknlp\internal.py", line 175, in new_java_obj
return self._new_java_obj(java_class, *args)
File "D:.venv\python3.8_nlu\lib\site-packages\pyspark\ml\wrapper.py", line 66, in _new_java_obj
return java_obj(*java_args)
File "D:.venv\python3.8_nlu\lib\site-packages\py4j\java_gateway.py", line 1304, in call
return_value = get_return_value(
File "D:.venv\python3.8_nlu\lib\site-packages\pyspark\sql\utils.py", line 111, in deco
return f(*a, **kw)
File "D:.venv\python3.8_nlu\lib\site-packages\py4j\protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize.
: java.lang.NoClassDefFoundError: org/json4s/package$MappingException
at org.json4s.ext.EnumNameSerializer.deserialize(EnumSerializer.scala:53)
at org.json4s.Formats$$anonfun$customDeserializer$1.applyOrElse(Formats.scala:66)
at org.json4s.Formats$$anonfun$customDeserializer$1.applyOrElse(Formats.scala:66)
at scala.collection.TraversableOnce.collectFirst(TraversableOnce.scala:180)
at scala.collection.TraversableOnce.collectFirst$(TraversableOnce.scala:167)
at scala.collection.AbstractTraversable.collectFirst(Traversable.scala:108)
at org.json4s.Formats$.customDeserializer(Formats.scala:66)
at org.json4s.Extraction$.customOrElse(Extraction.scala:775)
at org.json4s.Extraction$.extract(Extraction.scala:454)
at org.json4s.Extraction$.extract(Extraction.scala:56)
at org.json4s.ExtractableJsonAstNode.extract(ExtractableJsonAstNode.scala:22)
at com.johnsnowlabs.util.JsonParser$.parseObject(JsonParser.scala:28)
at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.parseJson(ResourceMetadata.scala:101)
at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$$anonfun$readResources$1.applyOrElse(ResourceMetadata.scala:129)
at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$$anonfun$readResources$1.applyOrElse(ResourceMetadata.scala:128)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
at scala.collection.Iterator$$anon$13.next(Iterator.scala:593)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:184)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:47)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
at scala.collection.AbstractIterator.to(Iterator.scala:1431)
at scala.collection.TraversableOnce.toList(TraversableOnce.scala:350)
at scala.collection.TraversableOnce.toList$(TraversableOnce.scala:350)
at scala.collection.AbstractIterator.toList(Iterator.scala:1431)
at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.readResources(ResourceMetadata.scala:128)
at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.readResources(ResourceMetadata.scala:123)
at com.johnsnowlabs.client.aws.AWSGateway.getMetadata(AWSGateway.scala:78)
at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.downloadMetadataIfNeed(S3ResourceDownloader.scala:62)
at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.resolveLink(S3ResourceDownloader.scala:68)
at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.getDownloadSize(S3ResourceDownloader.scala:145)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.getDownloadSize(ResourceDownloader.scala:445)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.getDownloadSize(ResourceDownloader.scala:577)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize(ResourceDownloader.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.json4s.package$MappingException
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 51 more

Traceback (most recent call last):
File "D:.venv\python3.8_nlu\lib\site-packages\nlu_init_.py", line 236, in load
nlu_component = nlu_ref_to_component(nlu_ref, authenticated=is_authenticated)
File "D:.venv\python3.8_nlu\lib\site-packages\nlu\pipe\component_resolution.py", line 171, in nlu_ref_to_component
resolved_component = resolve_component_from_parsed_query_data(language, component_type, dataset,
File "D:.venv\python3.8_nlu\lib\site-packages\nlu\pipe\component_resolution.py", line 320, in resolve_component_from_parsed_query_data
raise ValueError(f'EXCEPTION : Could not create NLU component for nlp_ref={nlp_ref} and nlu_ref={nlu_ref}')
ValueError: EXCEPTION : Could not create NLU component for nlp_ref=elmo and nlu_ref=elmo

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "", line 1, in
File "D:.venv\python3.8_nlu\lib\site-packages\nlu_init_.py", line 255, in load
raise Exception(
Exception: Something went wrong during loading and fitting the pipe. Check the other prints for more information and also verbose mode. Did you use a correct model reference?

bad casing for nlp ref

Some nlp refs in the spell book have wrong casing.
Double-check against the Models Hub / S3 metadata and fix them.

Does the latest version support python==3.8.10?

hx@hx-image:~$ streamlit run https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/examples/streamlit/01_dashboard.py
Traceback (most recent call last):
File "/home/hx/.local/bin/streamlit", line 5, in
from streamlit.web.cli import main
File "/home/hx/.local/lib/python3.8/site-packages/streamlit/init.py", line 55, in
from streamlit.delta_generator import DeltaGenerator as _DeltaGenerator
File "/home/hx/.local/lib/python3.8/site-packages/streamlit/delta_generator.py", line 38, in
from streamlit import config, cursor, env_util, logger, runtime, type_util, util
File "/home/hx/.local/lib/python3.8/site-packages/streamlit/cursor.py", line 18, in
from streamlit.runtime.scriptrunner import get_script_run_ctx
File "/home/hx/.local/lib/python3.8/site-packages/streamlit/runtime/init.py", line 16, in
from streamlit.runtime.runtime import Runtime as Runtime
File "/home/hx/.local/lib/python3.8/site-packages/streamlit/runtime/runtime.py", line 28, in
from streamlit.runtime.app_session import AppSession
File "/home/hx/.local/lib/python3.8/site-packages/streamlit/runtime/app_session.py", line 35, in
from streamlit.runtime import caching, legacy_caching
File "/home/hx/.local/lib/python3.8/site-packages/streamlit/runtime/caching/init.py", line 21, in
from streamlit.runtime.state.session_state import WidgetMetadata
File "/home/hx/.local/lib/python3.8/site-packages/streamlit/runtime/state/init.py", line 16, in
from streamlit.runtime.state.safe_session_state import (
File "/home/hx/.local/lib/python3.8/site-packages/streamlit/runtime/state/safe_session_state.py", line 20, in
from streamlit.runtime.state.session_state import (
File "/home/hx/.local/lib/python3.8/site-packages/streamlit/runtime/state/session_state.py", line 44, in
from streamlit.type_util import ValueFieldName, is_array_value_field_name
File "/home/hx/.local/lib/python3.8/site-packages/streamlit/type_util.py", line 35, in
import pyarrow as pa
File "/home/hx/.local/lib/python3.8/site-packages/pyarrow/init.py", line 65, in
import pyarrow.lib as _lib
File "pyarrow/compat.pxi", line 43, in init pyarrow.lib
File "/home/hx/.local/lib/python3.8/site-packages/cloudpickle/init.py", line 3, in
from cloudpickle.cloudpickle import *
File "/home/hx/.local/lib/python3.8/site-packages/cloudpickle/cloudpickle.py", line 167, in
_cell_set_template_code = _make_cell_set_template_code()
File "/home/hx/.local/lib/python3.8/site-packages/cloudpickle/cloudpickle.py", line 148, in _make_cell_set_template_code
return types.CodeType(
TypeError: an integer is required (got type bytes)

spark nlu load error

I am trying to explore the NLU models first and then the NLU Healthcare models.
The nlu.load('emotion') step is failing; the logs are attached below.

OS โ€“ Linux RHEL
Pyspark โ€“ version 3.0.1
Command used for install - python3 -m pip install nlu pyspark==3.0.1 --trusted-host pypi.org --trusted-host files.pythonhosted.org
I created a Python venv and installed NLU with the command above.

I also tried reinstalling with the command below:
python3 -m pip install --upgrade nlu streamlit pyspark==3.0.2

Code below:
import nlu
pp=nlu.load('emotion')
classifierdl_use_emotion download started this may take some time.
Approximate size to download 21.3 MB
[ / ]
An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel.
: java.lang.NoClassDefFoundError: org/tensorflow/Tensor
at com.johnsnowlabs.ml.tensorflow.TensorflowWrapper$.read(TensorflowWrapper.scala:397)
at com.johnsnowlabs.ml.tensorflow.ReadTensorflowModel.readTensorflowModel(TensorflowSerializeModel.scala:145)
at com.johnsnowlabs.ml.tensorflow.ReadTensorflowModel.readTensorflowModel$(TensorflowSerializeModel.scala:120)
at com.johnsnowlabs.nlp.annotators.classifier.dl.ClassifierDLModel$.readTensorflowModel(ClassifierDLModel.scala:291)
at com.johnsnowlabs.nlp.annotators.classifier.dl.ReadClassifierDLTensorflowModel.readTensorflow(ClassifierDLModel.scala:278)
at com.johnsnowlabs.nlp.annotators.classifier.dl.ReadClassifierDLTensorflowModel.readTensorflow$(ClassifierDLModel.scala:276)
at com.johnsnowlabs.nlp.annotators.classifier.dl.ClassifierDLModel$.readTensorflow(ClassifierDLModel.scala:291)
at com.johnsnowlabs.nlp.annotators.classifier.dl.ReadClassifierDLTensorflowModel.$anonfun$$init$$1(ClassifierDLModel.scala:285)
at com.johnsnowlabs.nlp.annotators.classifier.dl.ReadClassifierDLTensorflowModel.$anonfun$$init$$1$adapted(ClassifierDLModel.scala:285)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$onRead$1(ParamsAndFeaturesReadable.scala:47)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$onRead$1$adapted(ParamsAndFeaturesReadable.scala:46)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.onRead(ParamsAndFeaturesReadable.scala:46)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$read$1(ParamsAndFeaturesReadable.scala:57)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$read$1$adapted(ParamsAndFeaturesReadable.scala:57)
at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:35)
at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:24)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:333)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:327)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.downloadModel(ResourceDownloader.scala:456)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel(ResourceDownloader.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.tensorflow.Tensor
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 34 more
[OK!]
EXCEPTION: Could not resolve singular Component for type=emotion and nlp_ref=classifierdl_use_emotion and nlu_ref=emotion and lang =en
Traceback (most recent call last):
File "/apps/sparknlp/spark-nlu/lib64/python3.6/site-packages/nlu/pipe/component_resolution.py", line 852, in construct_component_from_identifier
is_licensed=is_licensed)
File "/apps/sparknlp/spark-nlu/lib64/python3.6/site-packages/nlu/components/classifier.py", line 69, in init
else : self.model = ClassifierDl.get_pretrained_model(nlp_ref, language)
File "/apps/sparknlp/spark-nlu/lib64/python3.6/site-packages/nlu/components/classifiers/classifier_dl/classifier_dl.py", line 11, in get_pretrained_model
return ClassifierDLModel.pretrained(name,language,bucket)
File "/apps/sparknlp/spark-nlu/lib64/python3.6/site-packages/sparknlp/annotator.py", line 8063, in pretrained
return ResourceDownloader.downloadModel(ClassifierDLModel, name, lang, remote_loc)
File "/apps/sparknlp/spark-nlu/lib64/python3.6/site-packages/sparknlp/pretrained.py", line 62, in downloadModel
raise e
File "/apps/sparknlp/spark-nlu/lib64/python3.6/site-packages/sparknlp/pretrained.py", line 59, in downloadModel
j_obj = _internal._DownloadModel(reader.name, name, language, remote_loc, j_dwn).apply()
File "/apps/sparknlp/spark-nlu/lib64/python3.6/site-packages/sparknlp/internal.py", line 214, in init
name, language, remote_loc)
File "/apps/sparknlp/spark-nlu/lib64/python3.6/site-packages/sparknlp/internal.py", line 165, in init
self._java_obj = self.new_java_obj(java_obj, *args)
File "/apps/sparknlp/spark-nlu/lib64/python3.6/site-packages/sparknlp/internal.py", line 175, in new_java_obj
return self._new_java_obj(java_class, *args)
File "/apps/sparknlp/spark-nlu/lib64/python3.6/site-packages/pyspark/ml/wrapper.py", line 69, in _new_java_obj
return java_obj(*java_args)
File "/apps/sparknlp/spark-nlu/lib64/python3.6/site-packages/py4j/java_gateway.py", line 1305, in call
answer, self.gateway_client, self.target_id, self.name)
File "/apps/sparknlp/spark-nlu/lib64/python3.6/site-packages/pyspark/sql/utils.py", line 128, in deco
return f(*a, **kw)
File "/apps/sparknlp/spark-nlu/lib64/python3.6/site-packages/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel.
: java.lang.NoClassDefFoundError: org/tensorflow/Tensor
at com.johnsnowlabs.ml.tensorflow.TensorflowWrapper$.read(TensorflowWrapper.scala:397)
at com.johnsnowlabs.ml.tensorflow.ReadTensorflowModel.readTensorflowModel(TensorflowSerializeModel.scala:145)
at com.johnsnowlabs.ml.tensorflow.ReadTensorflowModel.readTensorflowModel$(TensorflowSerializeModel.scala:120)
at com.johnsnowlabs.nlp.annotators.classifier.dl.ClassifierDLModel$.readTensorflowModel(ClassifierDLModel.scala:291)
at com.johnsnowlabs.nlp.annotators.classifier.dl.ReadClassifierDLTensorflowModel.readTensorflow(ClassifierDLModel.scala:278)
at com.johnsnowlabs.nlp.annotators.classifier.dl.ReadClassifierDLTensorflowModel.readTensorflow$(ClassifierDLModel.scala:276)
at com.johnsnowlabs.nlp.annotators.classifier.dl.ClassifierDLModel$.readTensorflow(ClassifierDLModel.scala:291)
at com.johnsnowlabs.nlp.annotators.classifier.dl.ReadClassifierDLTensorflowModel.$anonfun$$init$$1(ClassifierDLModel.scala:285)
at com.johnsnowlabs.nlp.annotators.classifier.dl.ReadClassifierDLTensorflowModel.$anonfun$$init$$1$adapted(ClassifierDLModel.scala:285)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$onRead$1(ParamsAndFeaturesReadable.scala:47)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$onRead$1$adapted(ParamsAndFeaturesReadable.scala:46)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.onRead(ParamsAndFeaturesReadable.scala:46)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$read$1(ParamsAndFeaturesReadable.scala:57)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$read$1$adapted(ParamsAndFeaturesReadable.scala:57)
at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:35)
at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:24)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:333)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:327)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.downloadModel(ResourceDownloader.scala:456)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel(ResourceDownloader.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.tensorflow.Tensor
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 34 more


ValueError Traceback (most recent call last)
/apps/sparknlp/spark-nlu/lib64/python3.6/site-packages/nlu/__init__.py in load(request, path, verbose, gpu, streamlit_caching)
341 if nlu_ref == '': continue
--> 342 nlu_component = nlu_ref_to_component(nlu_ref, authenticated=is_authenticated)
343 # if we get a list of components, then the NLU reference is a pipeline, we do not need to check order

/apps/sparknlp/spark-nlu/lib64/python3.6/site-packages/nlu/pipe/component_resolution.py in nlu_ref_to_component(nlu_reference, detect_lang, authenticated, is_recursive_call)
322 authenticated=authenticated,
--> 323 is_recursive_call=is_recursive_call)
324 if resolved_component is None:

/apps/sparknlp/spark-nlu/lib64/python3.6/site-packages/nlu/pipe/component_resolution.py in resolve_component_from_parsed_query_data(lang, component_type, dataset, component_embeddings, nlu_ref, trainable, path, authenticated, is_recursive_call)
467 if constructed_component is None:
--> 468 raise ValueError(f'EXCEPTION : Could not create NLU component for nlp_ref={nlp_ref} and nlu_ref={nlu_ref}')
469 else:

ValueError: EXCEPTION : Could not create NLU component for nlp_ref=classifierdl_use_emotion and nlu_ref=emotion

During handling of the above exception, another exception occurred:

Exception Traceback (most recent call last)
in <module>
----> 1 pp=nlu.load('emotion')

/apps/sparknlp/spark-nlu/lib64/python3.6/site-packages/nlu/__init__.py in load(request, path, verbose, gpu, streamlit_caching)
360 print(e[1])
361 raise Exception(
--> 362 "Something went wrong during loading and fitting the pipe. Check the other prints for more information and also verbose mode. Did you use a correct model reference?")
363
364

Exception: Something went wrong during loading and fitting the pipe. Check the other prints for more information and also verbose mode. Did you use a correct model reference?

Embed Japanese Sentences with Bert

Hi, thanks for such a convenient tool! I would like to ask the authors: does this tool support embedding Japanese sentences with BERT? Thank you.
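
For reference, a hedged sketch of one way Japanese sentences could be embedded with a multilingual sentence embedder; the spell name below is an assumption and should be verified against the Models Hub before use:

import nlu

# 'xx.embed_sentence.labse' is assumed here (LaBSE is a multilingual sentence embedder that covers Japanese)
pipe = nlu.load('xx.embed_sentence.labse')
pipe.predict('これはとても便利なツールです。')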

[New Feature] LLMs for Machine Translation of slot-annotated data

Describe the feature
Expanding SLU to new languages requires a lot of manual data annotation. To significantly reduce this work, LLMs can be used to machine-translate slot-annotated data, e.g.
"play me <a> Dune <a> on <b> Youtube <b>" => "Spiele mir <a> Dune <a> auf <b> Youtube <b>"

Such a feature is especially useful for expanding on-device SLU to new languages, since high-quality multilingual transformers/LLMs cannot be used as the core SLU model in that setting.

Expected behavior
The MT-LLM pipeline expects English sentences annotated in a generic <> tags format (for example: "play me <a> Dune <a> on <b> Youtube <b>") and outputs the translated sentence in the same format ("Spiele mir <a> Dune <a> auf <b> Youtube <b>"). This data format can easily be converted to BIO annotation and to other popular NLU formats.

Additional context
https://paperswithcode.com/paper/large-language-models-for-expansion-of-spoken

In our recent work, we fine-tuned an MT-LLM called BigTranslate for MT of slot-annotated NLU data. We used the parallel Amazon MASSIVE dataset for fine-tuning. There is a significant performance improvement after fine-tuning (compared to zero-shot LLM-based machine translation) on the multiATIS++ benchmark.

Here you can find the fine-tuned BigTranslate: https://huggingface.co/Samsung/BigTranslateSlotTranslator
Here you can find the code for fine-tuning plus the code for NLU training: https://github.com/samsung/mt-llm-nlu

In summary, we are wondering how we can merge our work into this project, and which parts of it might be useful here (e.g., scripts for converting between BIO and the tags format?).
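
As a concrete illustration of the data format above, here is a minimal, hypothetical sketch (not part of nlu or of the linked repository) for converting the generic tags format into BIO labels; the slot names and whitespace tokenization are assumptions:

import re

def tags_to_bio(sentence: str):
    """Convert '<a> Dune <a> on <b> Youtube <b>' style annotations into BIO token labels."""
    tokens, labels = [], []
    current = None   # slot name we are currently inside, e.g. 'a'
    begin = False    # True until the first token of the current span has been labeled
    for piece in sentence.split():
        match = re.fullmatch(r"<(\w+)>", piece)
        if match:
            # in this format the same tag both opens and closes a span
            if current is None:
                current, begin = match.group(1), True
            else:
                current, begin = None, False
            continue
        tokens.append(piece)
        if current is None:
            labels.append("O")
        else:
            labels.append(("B-" if begin else "I-") + current)
            begin = False
    return tokens, labels

print(tags_to_bio("play me <a> Dune <a> on <b> Youtube <b>"))
# (['play', 'me', 'Dune', 'on', 'Youtube'], ['O', 'O', 'B-a', 'O', 'B-b'])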

`pyspark.sql.utils.IllegalArgumentException` on fresh install

  • Using Windows 10, same errors in Ubuntu WSL
  • Java version: openjdk version "1.8.0_282" (equivalent to JDK 8)
  • Installed with pip
  • in Python 3.6: >>> import nlu without errors
>>> nlu.load('tokenize').predict('Each word and symbol in a sentence will generate token.') # From the homepage
Ivy Default Cache set to: C:\Users\USERNAME\.ivy2\cache
The jars for the packages stored in: C:\Users\USERNAME\.ivy2\jars
:: loading settings :: url = jar:file:/C:/Users/USERNAME/.conda/envs/py36nlp/Lib/site-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.johnsnowlabs.nlp#spark-nlp_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-f4225ffb-68be-4e92-a6fc-c8cf6d7928e2;1.0
        confs: [default]
        found com.johnsnowlabs.nlp#spark-nlp_2.11;2.7.5 in central
        found com.typesafe#config;1.3.0 in central
        found org.rocksdb#rocksdbjni;6.5.3 in central
        found com.amazonaws#aws-java-sdk;1.7.4 in central
        found commons-logging#commons-logging;1.1.1 in central
        found org.apache.httpcomponents#httpclient;4.2 in central
        found org.apache.httpcomponents#httpcore;4.2 in central
        found commons-codec#commons-codec;1.3 in central
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.ivy.util.url.IvyAuthenticator (file:/C:/Users/USERNAME/.conda/envs/py36nlp/Lib/site-packages/pyspark/jars/ivy-2.4.0.jar) to field java.net.Authenticator.theAuthenticator
WARNING: Please consider reporting this to the maintainers of org.apache.ivy.util.url.IvyAuthenticator
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
        found joda-time#joda-time;2.10.10 in central
        [2.10.10] joda-time#joda-time;[2.2,)
        found com.github.universal-automata#liblevenshtein;3.0.0 in central
        found com.google.code.findbugs#annotations;3.0.1 in central
        found net.jcip#jcip-annotations;1.0 in central
        found com.google.code.findbugs#jsr305;3.0.1 in central
        found com.google.protobuf#protobuf-java-util;3.0.0-beta-3 in central
        found com.google.protobuf#protobuf-java;3.0.0-beta-3 in central
        found com.google.code.gson#gson;2.3 in central
        found it.unimi.dsi#fastutil;7.0.12 in central
        found org.projectlombok#lombok;1.16.8 in central
        found org.slf4j#slf4j-api;1.7.21 in central
        found com.navigamez#greex;1.0 in central
        found dk.brics.automaton#automaton;1.11-8 in central
        found org.json4s#json4s-ext_2.11;3.5.3 in central
        found org.joda#joda-convert;1.8.1 in central
        found org.tensorflow#tensorflow;1.15.0 in central
        found org.tensorflow#libtensorflow;1.15.0 in central
        found org.tensorflow#libtensorflow_jni;1.15.0 in central
        found net.sf.trove4j#trove4j;3.0.3 in central
:: resolution report :: resolve 1184ms :: artifacts dl 28ms
        :: modules in use:
        com.amazonaws#aws-java-sdk;1.7.4 from central in [default]
        com.github.universal-automata#liblevenshtein;3.0.0 from central in [default]
        com.google.code.findbugs#annotations;3.0.1 from central in [default]
        com.google.code.findbugs#jsr305;3.0.1 from central in [default]
        com.google.code.gson#gson;2.3 from central in [default]
        com.google.protobuf#protobuf-java;3.0.0-beta-3 from central in [default]
        com.google.protobuf#protobuf-java-util;3.0.0-beta-3 from central in [default]
        com.johnsnowlabs.nlp#spark-nlp_2.11;2.7.5 from central in [default]
        com.navigamez#greex;1.0 from central in [default]
        com.typesafe#config;1.3.0 from central in [default]
        commons-codec#commons-codec;1.3 from central in [default]
        commons-logging#commons-logging;1.1.1 from central in [default]
        dk.brics.automaton#automaton;1.11-8 from central in [default]
        it.unimi.dsi#fastutil;7.0.12 from central in [default]
        joda-time#joda-time;2.10.10 from central in [default]
        net.jcip#jcip-annotations;1.0 from central in [default]
        net.sf.trove4j#trove4j;3.0.3 from central in [default]
        org.apache.httpcomponents#httpclient;4.2 from central in [default]
        org.apache.httpcomponents#httpcore;4.2 from central in [default]
        org.joda#joda-convert;1.8.1 from central in [default]
        org.json4s#json4s-ext_2.11;3.5.3 from central in [default]
        org.projectlombok#lombok;1.16.8 from central in [default]
        org.rocksdb#rocksdbjni;6.5.3 from central in [default]
        org.slf4j#slf4j-api;1.7.21 from central in [default]
        org.tensorflow#libtensorflow;1.15.0 from central in [default]
        org.tensorflow#libtensorflow_jni;1.15.0 from central in [default]
        org.tensorflow#tensorflow;1.15.0 from central in [default]
        :: evicted modules:
        commons-codec#commons-codec;1.6 by [commons-codec#commons-codec;1.3] in [default]
        joda-time#joda-time;2.9.5 by [joda-time#joda-time;2.10.10] in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   29  |   1   |   0   |   2   ||   27  |   0   |
        ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-f4225ffb-68be-4e92-a6fc-c8cf6d7928e2
        confs: [default]
        0 artifacts copied, 27 already retrieved (0kB/17ms)
21/03/08 16:01:46 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
        at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379)
        at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394)
        at org.apache.hadoop.util.Shell.<clinit>(Shell.java:387)
        at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
        at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2823)
        at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2818)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2684)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
        at org.apache.spark.deploy.DependencyUtils$.org$apache$spark$deploy$DependencyUtils$$resolveGlobPath(DependencyUtils.scala:191)
        at org.apache.spark.deploy.DependencyUtils$$anonfun$resolveGlobPaths$2.apply(DependencyUtils.scala:147)
        at org.apache.spark.deploy.DependencyUtils$$anonfun$resolveGlobPaths$2.apply(DependencyUtils.scala:145)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
        at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
        at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
        at org.apache.spark.deploy.DependencyUtils$.resolveGlobPaths(DependencyUtils.scala:145)
        at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$3.apply(SparkSubmit.scala:343)
        at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$3.apply(SparkSubmit.scala:343)
        at scala.Option.map(Option.scala:146)
        at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:343)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:774)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
21/03/08 16:01:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
No accepted Data type or usable columns found or applying the NLU models failed.
Make sure that the first column you pass to .predict() is the one that nlu should predict on OR rename the column you want to predict on to 'text'
If you are on Google Collab, click on Run time and try factory reset Runtime run the setup script again, you might have used too much memory
On Kaggle try to reset restart session and run the setup script again, you might have used too much memory
Full Stacktrace: see bottom
Additional info:
<class 'pyspark.sql.utils.IllegalArgumentException'> pipeline.py 1380
Stuck? Contact us on Slack! https://join.slack.com/t/spark-nlp/shared_invite/zt-lutct9gm-kuUazcyFKhuGY3_0AMkxqA

The same errors occur when running nlu.load('tokenize').predict('Each word and symbol in a sentence will generate token.')
Full stack trace:

Full Stacktrace was (<class 'pyspark.sql.utils.IllegalArgumentException'>, IllegalArgumentException('Unsupported class file major version 55', 'org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:166)
         at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:148)
         at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:136)
         at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:237)
         at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:50)
         at org.apache.spark.util.FieldAccessFinder$$anon$4$$anonfun$visitMethodInsn$7.apply(ClosureCleaner.scala:845)
         at org.apache.spark.util.FieldAccessFinder$$anon$4$$anonfun$visitMethodInsn$7.apply(ClosureCleaner.scala:828)
         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
         at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
         at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
         at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
         at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
         at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:134)
         at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
         at org.apache.spark.util.FieldAccessFinder$$anon$4.visitMethodInsn(ClosureCleaner.scala:828)
         at org.apache.xbean.asm6.ClassReader.readCode(ClassReader.java:2175)
         at org.apache.xbean.asm6.ClassReader.readMethod(ClassReader.java:1238)
         at org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:631)
         at org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:355)
         at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:272)
         at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:271)
         at scala.collection.immutable.List.foreach(List.scala:392)
         at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:271)
         at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:163)
         at org.apache.spark.SparkContext.clean(SparkContext.scala:2326)
         at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:820)
         at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:819)
         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
         at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
         at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:819)
         at org.apache.spark.sql.execution.python.EvalPythonExec.doExecute(EvalPythonExec.scala:89)
         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
         at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
         at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
         at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
         at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:391)
         at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:43)
         at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:627)
         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
         at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
         at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
         at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
         at org.apache.spark.sql.execution.GenerateExec.doExecute(GenerateExec.scala:80)
         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
         at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
         at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
         at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
         at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:391)
         at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:43)
         at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:627)
         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
         at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
         at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
         at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
         at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
         at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296)
         at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3263)
         at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3260)
         at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
         at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
         at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
         at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
         at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withAction(Dataset.scala:3369)
         at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3260)
         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
         at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
         at java.base/java.lang.reflect.Method.invoke(Method.java:566)
         at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
         at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
         at py4j.Gateway.invoke(Gateway.java:282)
         at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
         at py4j.commands.CallCommand.execute(CallCommand.java:79)
         at py4j.GatewayConnection.run(GatewayConnection.java:238)
         at java.base/java.lang.Thread.run(Thread.java:834)'), <traceback object at 0x0000024A29857188>)

Could not locate executable null\bin\winutils.exe

Hi, thanks for the package, I'm starting to explore it and it looks good so far!
I've just faced some minor issues when trying to run it on my Windows machine and thought I'd give a heads-up here in case someone runs into a similar problem.

First, you need to run Python as admin, because folders are created (to store the downloaded models, I presume) and this causes errors if no permission is granted. Is there a way to choose where these models are downloaded to? That might help with this.

Second, after the installation steps in the instructions, I got this error when trying to run nlu:

Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

I found the answer to this problem here: https://stackoverflow.com/a/50430966/11483674
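
For anyone else hitting this, a minimal sketch of the usual workaround, assuming winutils.exe has already been downloaded into C:\hadoop\bin (the path is only an example; any folder works as long as HADOOP_HOME points to it):

import os

# Tell Hadoop/Spark where to find bin\winutils.exe (example path, adjust to where you placed it)
os.environ["HADOOP_HOME"] = "C:\\hadoop"
os.environ["PATH"] = os.environ["HADOOP_HOME"] + "\\bin;" + os.environ["PATH"]

import nlu
nlu.load('tokenize').predict('Each word and symbol in a sentence will generate a token.')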

Missing embedding results

Hello,

It seems like embeddings are not returned when running any of the embedding predictions. Sentiment and other models do return results fine though.

Any ideas what I could be missing here?

(screenshot attached to the issue)

nlu 3.3.0
pyspark 3.0.2
py4j 0.10.9
spark-nlp 3.3.2
running on google colab
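
One thing worth checking (a guess rather than a confirmed fix, since the screenshot is not reproduced here): embeddings are attached at a specific output level, so requesting it explicitly and inspecting the returned columns may surface them:

import nlu

pipe = nlu.load('embed_sentence.bert')
df = pipe.predict('I love NLU! <3', output_level='sentence')
print(df.columns)  # look for an embedding column in the returned DataFrame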

Remove the hard dependency on the pyspark

Right now, the nlu package has a hard dependency on pyspark, which makes it hard to use with the Databricks runtime or other compatible Spark runtimes. Instead, this package should either rely entirely on an implicit dependency, or use something like the findspark package, as done here.

P.S. The spark-nlp package itself doesn't depend on pyspark.
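
A minimal sketch of the suggested approach, assuming findspark is installed and an existing Spark installation is discoverable (e.g. via SPARK_HOME); this is the proposal, not how nlu currently works:

import findspark
findspark.init()  # make the pre-installed Spark/pyspark importable instead of relying on a pinned pip dependency

import nlu
pipe = nlu.load('sentiment')
print(pipe.predict('Databricks runtime without a pinned pyspark'))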

Installing Java

Hello,
I wanted to know if it's possible to use nlu in a 'non-notebook' editor like VS Code.
I tried to, but I get this error:
Exception: Java gateway process exited before sending its port number

I looked into the Colab file and I think I need to paste this:

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]

but I don't have this directory (/usr/lib/jvm/java-8-openjdk-amd64) on my machine.

Could you please send it to me, or post it on GitHub, if that's the solution?
Thanks for your help
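
A minimal sketch of what usually resolves this: nlu does not ship a JVM, so a JDK (typically 8 or 11) has to be installed separately, and JAVA_HOME must point at wherever it actually lives on your machine; the path below is only an example:

import os

# Example path only; set this to wherever your JDK is actually installed
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]

import nlu
nlu.load('sentiment').predict('This also works from VS Code once the JVM is found.')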

predict() - pyspark IndexError on python 3.11.4

Python version: 3.11.4
pyspark version: 3.1.2

model.predict('I love NLU! <3')
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]

Warning::Spark Session already created, some configs may not take.
Traceback (most recent call last):
  File "/Users//miniconda3/lib/python3.11/site-packages/pyspark/serializers.py", line 437, in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users//miniconda3/lib/python3.11/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 72, in dumps
    cp.dump(obj)
  File "/Users//miniconda3/lib/python3.11/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 540, in dump
    return Pickler.dump(self, obj)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users//miniconda3/lib/python3.11/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 630, in reducer_override
    return self._function_reduce(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users//miniconda3/lib/python3.11/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 503, in _function_reduce
    return self._dynamic_function_reduce(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users//miniconda3/lib/python3.11/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 484, in _dynamic_function_reduce
    state = _function_getstate(func)
            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users//miniconda3/lib/python3.11/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 156, in _function_getstate
    f_globals_ref = _extract_code_globals(func.__code__)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users//miniconda3/lib/python3.11/site-packages/pyspark/cloudpickle/cloudpickle.py", line 236, in _extract_code_globals
    out_names = {names[oparg] for _, oparg in _walk_global_ops(co)}
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users//miniconda3/lib/python3.11/site-packages/pyspark/cloudpickle/cloudpickle.py", line 236, in <setcomp>
    out_names = {names[oparg] for _, oparg in _walk_global_ops(co)}
                 ~~~~~^^^^^^^
IndexError: tuple index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users//nlu/nlu/pipe/pipeline.py", line 485, in predict
    return __predict__(self, data, output_level, positions, keep_stranger_features, metadata, multithread,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users//nlu/nlu/pipe/utils/predict_helper.py", line 267, in __predict__
    pipe.fit()
  File "/Users//nlu/nlu/pipe/pipeline.py", line 204, in fit
    self.vanilla_transformer_pipe = self.spark_estimator_pipe.fit(self.get_sample_spark_dataframe())
                                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users//nlu/nlu/pipe/pipeline.py", line 103, in get_sample_spark_dataframe
    return sparknlp.start().createDataFrame(data=text_df)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users//miniconda3/lib/python3.11/site-packages/pyspark/sql/session.py", line 673, in createDataFrame
    return super(SparkSession, self).createDataFrame(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users//miniconda3/lib/python3.11/site-packages/pyspark/sql/pandas/conversion.py", line 300, in createDataFrame
    return self._create_dataframe(data, schema, samplingRatio, verifySchema)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users//miniconda3/lib/python3.11/site-packages/pyspark/sql/session.py", line 701, in _create_dataframe
    jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users//miniconda3/lib/python3.11/site-packages/pyspark/rdd.py", line 2618, in _to_java_object_rdd
    return self.ctx._jvm.SerDeUtil.pythonToJava(rdd._jrdd, True)
                                                ^^^^^^^^^
  File "/Users//miniconda3/lib/python3.11/site-packages/pyspark/rdd.py", line 2949, in _jrdd
    wrapped_func = _wrap_function(self.ctx, self.func, self._prev_jrdd_deserializer,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users//miniconda3/lib/python3.11/site-packages/pyspark/rdd.py", line 2828, in _wrap_function
    pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
                                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users//miniconda3/lib/python3.11/site-packages/pyspark/rdd.py", line 2814, in _prepare_for_python_RDD
    pickled_command = ser.dumps(command)
                      ^^^^^^^^^^^^^^^^^^
  File "/Users//miniconda3/lib/python3.11/site-packages/pyspark/serializers.py", line 447, in dumps
    raise pickle.PicklingError(msg)
_pickle.PicklingError: Could not serialize object: IndexError: tuple index out of range
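For context (not part of the original report): pyspark 3.1.x bundles a cloudpickle version that predates Python 3.11's bytecode changes, which is the likely cause of the IndexError above. A minimal sketch of a fail-fast guard, with the version bounds as assumptions:

import sys
import pyspark

# Sketch: fail with a clear message instead of a deep cloudpickle IndexError
# when the interpreter is newer than the installed pyspark supports.
# The "pyspark >= 3.4 supports Python 3.11" bound is an assumption.
spark_major, spark_minor = (int(p) for p in pyspark.__version__.split(".")[:2])
if sys.version_info >= (3, 11) and (spark_major, spark_minor) < (3, 4):
    raise RuntimeError(
        f"pyspark {pyspark.__version__} does not support Python "
        f"{sys.version_info.major}.{sys.version_info.minor}; "
        "use Python 3.10 or upgrade pyspark."
    )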

Error while downloading hebrewner

I am running the official docker image of nlp-server and trying to run NER on a Hebrew sentence, but the model fails to download. I also tried to download the model manually and got an access denied error.

Using NLU for BioBERT embeddings -- takes a really long time on a list of 10,000 words, and even on 1 word

Hi, we are working on generating BioBERT embeddings for our project. When we run it on a single word, it takes about a second. When we run it on a list of 10,000 words, it either times out or takes hours to run. Is this normal? Below is how we are using it:

def load_biobert(self):
    # Load BioBERT model (for sentence-type embeddings)
    self.logger.info("Loading BioBERT model...")
    start = time.time()
    biobert = nlu.load('en.embed_sentence.biobert.pmc_base_cased')
    end = time.time()
    self.logger.info('done (BioBERT loading time: %.2fs seconds)', end - start)
    return biobert

def get_biobert_embeddings(self, strings):
    embedding_list = []
    for string in strings:
        self.logger.debug("...Generating embedding for: %s", string)
        embedding_list.append(self.get_biobert_embedding(string))
    return embedding_list

def get_biobert_embedding(self, string):
    embedding = self.biobert.predict(string, output_level='sentence', get_embeddings=True)
    return embedding.sentence_embedding_biobert.values[0]
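A possible speed-up worth noting here (a sketch, not from the original issue): predict() also accepts a list of texts, so embedding the whole batch in one call avoids paying the Spark pipeline overhead once per word. The model reference and column name follow the snippet above:

import nlu

# Sketch: one predict() call over the whole batch instead of a per-word loop.
biobert = nlu.load('en.embed_sentence.biobert.pmc_base_cased')

strings = ['carcinoma', 'metastasis', 'chemotherapy']          # example inputs
df = biobert.predict(strings, output_level='sentence', get_embeddings=True)

# One embedding row per input; column name follows the snippet above.
embedding_list = df.sentence_embedding_biobert.tolist()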
