
llm-applications's Introduction

LLM Applications

A comprehensive guide to building RAG-based LLM applications for production.

In this guide, we will learn how to:

  • 💻 Develop a retrieval augmented generation (RAG) based LLM application from scratch.
  • 🚀 Scale the major components (load, chunk, embed, index, serve, etc.) in our application.
  • ✅ Evaluate different configurations of our application to optimize for both per-component (e.g. retrieval_score) and overall performance (quality_score).
  • 🔀 Implement a hybrid LLM routing approach to bridge the gap between OSS and closed-source LLMs.
  • 📦 Serve the application in a highly scalable and available manner.
  • 💥 Share the 1st-order and 2nd-order impacts LLM applications have had on our products.

Setup

API keys

We'll be using OpenAI to access ChatGPT models like gpt-3.5-turbo, gpt-4, etc. and Anyscale Endpoints to access OSS LLMs like Llama-2-70b. Be sure to create your accounts for both and have your credentials ready.

Compute

Local: You could run this on your local laptop, but we highly recommend using a setup with access to GPUs. You can set this up on your own or on [Anyscale](http://anyscale.com/).
Anyscale
  • Start a new Anyscale workspace on staging using a g3.8xlarge head node, which has 2 GPUs and 32 CPUs. We can also add GPU worker nodes to run the workloads faster. If you're not on Anyscale, you can configure a similar instance on your cloud.
  • Use the default_cluster_env_2.6.2_py39 cluster environment.
  • Use the us-west-2 region if you'd like to use the artifacts in our shared storage (source docs, vector DB dumps, etc.).

Repository

git clone https://github.com/ray-project/llm-applications.git .
git config --global user.name <GITHUB-USERNAME>
git config --global user.email <EMAIL-ADDRESS>

Data

Our data is already available at /efs/shared_storage/goku/docs.ray.io/en/master/ (on Staging, us-east-1), but if you want to load it yourself, run this bash command (change /desired/output/directory, making sure it's on the shared storage so that it's accessible to the workers):

export EFS_DIR=/desired/output/directory
wget -e robots=off --recursive --no-clobber --page-requisites \
  --html-extension --convert-links --restrict-file-names=windows \
  --domains docs.ray.io --no-parent --accept=html --retry-on-http-error=429 \
  -P $EFS_DIR https://docs.ray.io/en/master/

Environment

Then set up the environment by installing the dependencies and specifying the values in your .env file (see Credentials below):

pip install --user -r requirements.txt
export PYTHONPATH=$PYTHONPATH:$PWD
pre-commit install
pre-commit autoupdate

Credentials

touch .env
# Add environment variables to .env
OPENAI_API_BASE="https://api.openai.com/v1"
OPENAI_API_KEY=""  # https://platform.openai.com/account/api-keys
ANYSCALE_API_BASE="https://api.endpoints.anyscale.com/v1"
ANYSCALE_API_KEY=""  # https://app.endpoints.anyscale.com/credentials
DB_CONNECTION_STRING="dbname=postgres user=postgres host=localhost password=postgres"
source .env
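
These variables are then propagated to the Ray workers when we initialize the cluster; a minimal sketch of that initialization, mirroring the ray.init call from rag.ipynb quoted in the issues below:

import os
import ray

# Propagate the credentials from .env to every Ray worker.
ray.init(runtime_env={
    "env_vars": {
        "OPENAI_API_BASE": os.environ["OPENAI_API_BASE"],
        "OPENAI_API_KEY": os.environ["OPENAI_API_KEY"],
        "ANYSCALE_API_BASE": os.environ["ANYSCALE_API_BASE"],
        "ANYSCALE_API_KEY": os.environ["ANYSCALE_API_KEY"],
        "DB_CONNECTION_STRING": os.environ["DB_CONNECTION_STRING"],
    },
})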

Now we're ready to go through the rag.ipynb interactive notebook to develop and serve our LLM application!

Learn more

  • If your team is investing heavily in developing LLM applications, reach out to us to learn more about how Ray and Anyscale can help you scale and productionize everything.
  • Start serving (and fine-tuning) OSS LLMs with Anyscale Endpoints ($1/M tokens for Llama-3-70b); private endpoints are available upon request (1M free tokens trial).
  • Learn more about how companies like OpenAI, Netflix, Pinterest, Verizon, Instacart and others leverage Ray and Anyscale for their AI workloads at the Ray Summit 2024 this Sept 18-20 in San Francisco.

llm-applications's People

Contributors

eltociear, gokumohandas, kevin85421, maxpumperla, pcmoritz


llm-applications's Issues

setup-pgvector.sh is corrupting /etc/sudoers file

Running the script results in the /etc/sudoers file containing only "ray ALL=(ALL:ALL) NOPASSWD:ALL".

The line

echo 'ray ALL=(ALL:ALL) NOPASSWD:ALL' | sudo tee /etc/sudoers

needs to be changed to append instead of overwrite:

echo 'ray ALL=(ALL:ALL) NOPASSWD:ALL' | sudo tee -a /etc/sudoers

can't run the notebook locally

Hello, I'm very interested in this work and am trying to run it locally.

However, I am stuck at this cell:

# Extract sections
sections_ds = ds.flat_map(extract_sections)
sections_ds.count()

sections_ds.count() throws the following error. Any idea what might solve this issue?

{
	"name": "RayTaskError(FileNotFoundError)",
	"message": "ray::FlatMap(extract_sections)() (pid=153397, ip=192.168.1.82)
  File \"/home/sylvain/miniconda3/envs/ray_pgvector/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_operator.py\", line 405, in _map_task
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File \"/home/sylvain/miniconda3/envs/ray_pgvector/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py\", line 345, in __call__
    for data in iter:
  File \"/home/sylvain/miniconda3/envs/ray_pgvector/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py\", line 171, in __call__
    yield from self._row_fn(input, ctx)
  File \"/home/sylvain/miniconda3/envs/ray_pgvector/lib/python3.10/site-packages/ray/data/_internal/planner/plan_udf_map_op.py\", line 245, in transform_fn
    for out_row in fn(row):
  File \"/home/sylvain/miniconda3/envs/ray_pgvector/lib/python3.10/site-packages/ray/data/_internal/planner/plan_udf_map_op.py\", line 119, in fn
    return op_fn(item, *fn_args, **fn_kwargs)
  File \"/tmp/ray/session_2023-10-11_12-45-18_995895_152214/runtime_resources/working_dir_files/_ray_pkg_74b1a494592133c8/rag/data.py\", line 29, in extract_sections
    with open(record[\"path\"], \"r\", encoding=\"utf-8\") as html_file:
FileNotFoundError: [Errno 2] No such file or directory: 'docs.ray.io/en/master/tune.html'",
	"stack": "---------------------------------------------------------------------------
ObjectRefStreamEndOfStreamError           Traceback (most recent call last)
File python/ray/_raylet.pyx:345, in ray._raylet.StreamingObjectRefGenerator._next_sync()

File python/ray/_raylet.pyx:4533, in ray._raylet.CoreWorker.try_read_next_object_ref_stream()

File python/ray/_raylet.pyx:443, in ray._raylet.check_status()

ObjectRefStreamEndOfStreamError: 

During handling of the above exception, another exception occurred:

StopIteration                             Traceback (most recent call last)
File ~/miniconda3/envs/ray_pgvector/lib/python3.10/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py:80, in DataOpTask.on_waitable_ready(self)
     79 try:
---> 80     meta = ray.get(next(self._streaming_gen))
     81 except StopIteration:
     82     # The generator should always yield 2 values (block and metadata)
     83     # each time. If we get a StopIteration here, it means an error
   (...)
     86     # TODO(hchen): Ray Core should have a better interface for
     87     # detecting and obtaining the exception.

File python/ray/_raylet.pyx:300, in ray._raylet.StreamingObjectRefGenerator.__next__()

File python/ray/_raylet.pyx:351, in ray._raylet.StreamingObjectRefGenerator._next_sync()

StopIteration: 

During handling of the above exception, another exception occurred:

RayTaskError(FileNotFoundError)           Traceback (most recent call last)
/home/sylvain/Documents/471/LLM/ray_pgvector/llm-applications/ray_pgvector.ipynb Cell 20 line 4
      1 # Extract sections
      2 #ray.data.DataContext.get_current().execution_options.verbose_progress = True
      3 sections_ds = ds.flat_map(extract_sections)
----> 4 sections_ds.count()

File ~/miniconda3/envs/ray_pgvector/lib/python3.10/site-packages/ray/data/dataset.py:2498, in Dataset.count(self)
   2492     return meta_count
   2494 get_num_rows = cached_remote_fn(_get_num_rows)
   2496 return sum(
   2497     ray.get(
-> 2498         [get_num_rows.remote(block) for block in self.get_internal_block_refs()]
   2499     )
   2500 )

File ~/miniconda3/envs/ray_pgvector/lib/python3.10/site-packages/ray/data/dataset.py:4799, in Dataset.get_internal_block_refs(self)
   4780 @ConsumptionAPI(pattern=\"Time complexity:\")
   4781 @DeveloperAPI
   4782 def get_internal_block_refs(self) -> List[ObjectRef[Block]]:
   4783     \"\"\"Get a list of references to the underlying blocks of this dataset.
   4784 
   4785     This function can be used for zero-copy access to the data. It blocks
   (...)
   4797         A list of references to this dataset's blocks.
   4798     \"\"\"
-> 4799     blocks = self._plan.execute().get_blocks()
   4800     self._synchronize_progress_bar()
   4801     return blocks

File ~/miniconda3/envs/ray_pgvector/lib/python3.10/site-packages/ray/data/_internal/plan.py:591, in ExecutionPlan.execute(self, allow_clear_input_blocks, force_read, preserve_order)
    589 else:
    590     executor = BulkExecutor(copy.deepcopy(context.execution_options))
--> 591 blocks = execute_to_legacy_block_list(
    592     executor,
    593     self,
    594     allow_clear_input_blocks=allow_clear_input_blocks,
    595     dataset_uuid=self._dataset_uuid,
    596     preserve_order=preserve_order,
    597 )
    598 # TODO(ekl) we shouldn't need to set this in the future once we move
    599 # to a fully lazy execution model, unless .materialize() is used. Th
    600 # reason we need it right now is since the user may iterate over a
    601 # Dataset multiple times after fully executing it once.
    602 if not self._run_by_consumer:

File ~/miniconda3/envs/ray_pgvector/lib/python3.10/site-packages/ray/data/_internal/execution/legacy_compat.py:119, in execute_to_legacy_block_list(executor, plan, allow_clear_input_blocks, dataset_uuid, preserve_order)
    112 dag, stats = _get_execution_dag(
    113     executor,
    114     plan,
    115     allow_clear_input_blocks,
    116     preserve_order,
    117 )
    118 bundles = executor.execute(dag, initial_stats=stats)
--> 119 block_list = _bundles_to_block_list(bundles)
    120 # Set the stats UUID after execution finishes.
    121 _set_stats_uuid_recursive(executor.get_stats(), dataset_uuid)

File ~/miniconda3/envs/ray_pgvector/lib/python3.10/site-packages/ray/data/_internal/execution/legacy_compat.py:357, in _bundles_to_block_list(bundles)
    355 blocks, metadata = [], []
    356 owns_blocks = True
--> 357 for ref_bundle in bundles:
    358     if not ref_bundle.owns_blocks:
    359         owns_blocks = False

File ~/miniconda3/envs/ray_pgvector/lib/python3.10/site-packages/ray/data/_internal/execution/interfaces/executor.py:37, in OutputIterator.__next__(self)
     36 def __next__(self) -> RefBundle:
---> 37     return self.get_next()

File ~/miniconda3/envs/ray_pgvector/lib/python3.10/site-packages/ray/data/_internal/execution/streaming_executor.py:129, in StreamingExecutor.execute.<locals>.StreamIterator.get_next(self, output_split_idx)
    127         raise StopIteration
    128 elif isinstance(item, Exception):
--> 129     raise item
    130 else:
    131     # Otherwise return a concrete RefBundle.
    132     if self._outer._global_info:

File ~/miniconda3/envs/ray_pgvector/lib/python3.10/site-packages/ray/data/_internal/execution/streaming_executor.py:187, in StreamingExecutor.run(self)
    181 \"\"\"Run the control loop in a helper thread.
    182 
    183 Results are returned via the output node's outqueue.
    184 \"\"\"
    185 try:
    186     # Run scheduling loop until complete.
--> 187     while self._scheduling_loop_step(self._topology) and not self._shutdown:
    188         pass
    189 except Exception as e:
    190     # Propagate it to the result iterator.

File ~/miniconda3/envs/ray_pgvector/lib/python3.10/site-packages/ray/data/_internal/execution/streaming_executor.py:235, in StreamingExecutor._scheduling_loop_step(self, topology)
    230     logger.get_logger().info(\"Scheduling loop step...\")
    232 # Note: calling process_completed_tasks() is expensive since it incurs
    233 # ray.wait() overhead, so make sure to allow multiple dispatch per call for
    234 # greater parallelism.
--> 235 process_completed_tasks(topology)
    237 # Dispatch as many operators as we can for completed tasks.
    238 limits = self._get_or_refresh_resource_limits()

File ~/miniconda3/envs/ray_pgvector/lib/python3.10/site-packages/ray/data/_internal/execution/streaming_executor_state.py:333, in process_completed_tasks(topology)
    326     ready, _ = ray.wait(
    327         list(active_tasks.keys()),
    328         num_returns=len(active_tasks),
    329         fetch_local=False,
    330         timeout=0.1,
    331     )
    332     for ref in ready:
--> 333         active_tasks[ref].on_waitable_ready()
    335 # Pull any operator outputs into the streaming op state.
    336 for op, op_state in topology.items():

File ~/miniconda3/envs/ray_pgvector/lib/python3.10/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py:88, in DataOpTask.on_waitable_ready(self)
     80     meta = ray.get(next(self._streaming_gen))
     81 except StopIteration:
     82     # The generator should always yield 2 values (block and metadata)
     83     # each time. If we get a StopIteration here, it means an error
   (...)
     86     # TODO(hchen): Ray Core should have a better interface for
     87     # detecting and obtaining the exception.
---> 88     ex = ray.get(block_ref)
     89     self._task_done_callback()
     90     raise ex

File ~/miniconda3/envs/ray_pgvector/lib/python3.10/site-packages/ray/_private/auto_init_hook.py:24, in wrap_auto_init.<locals>.auto_init_wrapper(*args, **kwargs)
     21 @wraps(fn)
     22 def auto_init_wrapper(*args, **kwargs):
     23     auto_init_ray()
---> 24     return fn(*args, **kwargs)

File ~/miniconda3/envs/ray_pgvector/lib/python3.10/site-packages/ray/_private/client_mode_hook.py:103, in client_mode_hook.<locals>.wrapper(*args, **kwargs)
    101     if func.__name__ != \"init\" or is_client_mode_enabled_by_default:
    102         return getattr(ray, func.__name__)(*args, **kwargs)
--> 103 return func(*args, **kwargs)

File ~/miniconda3/envs/ray_pgvector/lib/python3.10/site-packages/ray/_private/worker.py:2547, in get(object_refs, timeout)
   2545     worker.core_worker.dump_object_store_memory_usage()
   2546 if isinstance(value, RayTaskError):
-> 2547     raise value.as_instanceof_cause()
   2548 else:
   2549     raise value

RayTaskError(FileNotFoundError): ray::FlatMap(extract_sections)() (pid=153397, ip=192.168.1.82)
  File \"/home/sylvain/miniconda3/envs/ray_pgvector/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_operator.py\", line 405, in _map_task
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File \"/home/sylvain/miniconda3/envs/ray_pgvector/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py\", line 345, in __call__
    for data in iter:
  File \"/home/sylvain/miniconda3/envs/ray_pgvector/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py\", line 171, in __call__
    yield from self._row_fn(input, ctx)
  File \"/home/sylvain/miniconda3/envs/ray_pgvector/lib/python3.10/site-packages/ray/data/_internal/planner/plan_udf_map_op.py\", line 245, in transform_fn
    for out_row in fn(row):
  File \"/home/sylvain/miniconda3/envs/ray_pgvector/lib/python3.10/site-packages/ray/data/_internal/planner/plan_udf_map_op.py\", line 119, in fn
    return op_fn(item, *fn_args, **fn_kwargs)
  File \"/tmp/ray/session_2023-10-11_12-45-18_995895_152214/runtime_resources/working_dir_files/_ray_pkg_74b1a494592133c8/rag/data.py\", line 29, in extract_sections
    with open(record[\"path\"], \"r\", encoding=\"utf-8\") as html_file:
FileNotFoundError: [Errno 2] No such file or directory: 'docs.ray.io/en/master/tune.html'"
}
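
The failing open() call received the relative path 'docs.ray.io/en/master/tune.html', which only resolves if the worker's working directory contains the downloaded docs. A minimal sketch of one fix, building the dataset from absolute paths (assumes the docs were downloaded with the wget command from the Data section, into EFS_DIR as named in rag.config):

from pathlib import Path
import ray

EFS_DIR = Path("/desired/output/directory")  # wherever the docs were downloaded

# Absolute paths let every Ray worker open the files regardless of its cwd.
docs_dir = EFS_DIR / "docs.ray.io/en/master/"
ds = ray.data.from_items([{"path": str(p)} for p in docs_dir.rglob("*.html")])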

Can we update another type of database object via Ray's map_batches?

I want to use map_batches to update a database built with langchain FAISS, but I cannot get the correct answer.
Is it because the distributed approach is not suitable for this kind of update?

from langchain.vectorstores import FAISS
from ray.data import ActorPoolStrategy

db = FAISS.from_texts(["start"], embedding_model)

def update_db(batch):
    global db
    db.add_texts(batch['text'])
    # log the db size
    print(len(db.docstore._dict))  # 2
    return {}

demo_data.map_batches(
    update_db,
    batch_size=10,
    compute=ActorPoolStrategy(size=1)).count()
print("-----")
print(len(db.docstore._dict))  # 1
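
With compute=ActorPoolStrategy, update_db runs in a separate actor process, so it mutates that worker's copy of the global db; the driver's db is never touched, which is why the driver still sees only 1 document. A sketch of one workaround, streaming the batches back to the driver and updating the index there (assumes the demo_data dataset and embedding_model from the report):

from langchain.vectorstores import FAISS

db = FAISS.from_texts(["start"], embedding_model)

# iter_batches pulls batches into the driver process, so the additions
# happen on the one FAISS index we can actually observe afterwards.
for batch in demo_data.iter_batches(batch_size=10):
    db.add_texts(list(batch["text"]))

print(len(db.docstore._dict))  # now reflects every added text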

Failed to unpickle serialized exception

Tried to run rag.ipynb.

Environment:

  • Windows 10
  • python-3.11.5
  • ray-2.8.0
  • pydantic-1.10.13
{'GPU': 1.0,
 'node:__internal_head__': 1.0,
 'memory': 25100702516.0,
 'node:127.0.0.1': 1.0,
 'object_store_memory': 12550351257.0,
 'CPU': 6.0}

I got the error when I tried to embed chunks with the OpenAI embedding model. I made a tiny change to the code: I embed sections instead of chunks, since the sections are small enough.

import os  # needed for os.environ below

from langchain.embeddings import OpenAIEmbeddings
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
import numpy as np
from ray.data import ActorPoolStrategy

def get_embedding_model(embedding_model_name, model_kwargs, encode_kwargs):
    if embedding_model_name == "text-embedding-ada-002":
        embedding_model = OpenAIEmbeddings(
            model=embedding_model_name,
            openai_api_base="https://api.openai.com/v1",
            openai_api_key=os.environ["OPENAI_API_KEY"])
    else:
        embedding_model = HuggingFaceEmbeddings(
            model_name=embedding_model_name,  # also works with model_path
            model_kwargs=model_kwargs,
            encode_kwargs=encode_kwargs)
    return embedding_model

class EmbedChunks:
    def __init__(self, model_name):
        self.embedding_model = get_embedding_model(
            embedding_model_name=model_name,
            model_kwargs={"device": "cuda"},
            encode_kwargs={"device": "cuda", "batch_size": 100})
    def __call__(self, batch):
        embeddings = self.embedding_model.embed_documents(batch["text"])
        return {"text": batch["text"], "source": batch["source"], "embeddings": embeddings}
        
# Embed chunks
embedding_model_name = "text-embedding-ada-002"
embedded_chunks = sections_ds.map_batches(
    EmbedChunks,
    fn_constructor_kwargs={"model_name": embedding_model_name},
    batch_size=100, 
    num_gpus=1,
    compute=ActorPoolStrategy(size=1))
    
# Sample
sample = embedded_chunks.take(1)
print ("embedding size:", len(sample[0]["embeddings"]))
print (sample[0]["text"])

Here is the error

2024-01-09 16:14:42,335	INFO dataset.py:2383 -- Tip: Use `take_batch()` instead of `take() / show()` to return records in pandas or numpy batch format.
2024-01-09 16:14:42,339	INFO streaming_executor.py:104 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[FlatMap(extract_spec_from_patent)] -> ActorPoolMapOperator[MapBatches(EmbedChunks)] -> LimitOperator[limit=1]
2024-01-09 16:14:42,340	INFO streaming_executor.py:105 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2024-01-09 16:14:42,340	INFO streaming_executor.py:107 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
2024-01-09 16:14:42,358	INFO actor_pool_map_operator.py:114 -- MapBatches(EmbedChunks): Waiting for 1 pool actors to start...
2024-01-09 16:14:47,419	ERROR serialization.py:406 -- Failed to unpickle serialized exception
Traceback (most recent call last):
  File "python\ray\_raylet.pyx", line 347, in ray._raylet.StreamingObjectRefGenerator._next_sync
  File "python\ray\_raylet.pyx", line 4643, in ray._raylet.CoreWorker.try_read_next_object_ref_stream
  File "python\ray\_raylet.pyx", line 447, in ray._raylet.check_status
ray.exceptions.ObjectRefStreamEndOfStreamError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\data\_internal\execution\interfaces\physical_operator.py", line 80, in on_data_ready
    meta = ray.get(next(self._streaming_gen))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "python\ray\_raylet.pyx", line 302, in ray._raylet.StreamingObjectRefGenerator.__next__
  File "python\ray\_raylet.pyx", line 365, in ray._raylet.StreamingObjectRefGenerator._next_sync
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\exceptions.py", line 46, in from_ray_exception
    return pickle.loads(ray_exception.serialized_exception)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: APIStatusError.__init__() missing 2 required keyword-only arguments: 'response' and 'body'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\_private\serialization.py", line 404, in deserialize_objects
    obj = self._deserialize_object(data, metadata, object_ref)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\_private\serialization.py", line 293, in _deserialize_object
    return RayError.from_bytes(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\exceptions.py", line 40, in from_bytes
    return RayError.from_ray_exception(ray_exception)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\exceptions.py", line 49, in from_ray_exception
    raise RuntimeError(msg) from e
RuntimeError: Failed to unpickle serialized exception
2024-01-09 16:14:47,444	WARNING actor_pool_map_operator.py:271 -- To ensure full parallelization across an actor pool of size 1, the Dataset should consist of at least 1 distinct blocks. Consider increasing the parallelism when creating the Dataset.
(MapWorker(MapBatches(EmbedChunks)) pid=11460) C:\arrow\cpp\src\arrow\filesystem\s3fs.cc:2904:  arrow::fs::FinalizeS3 was not called even though S3 was initialized.  This could lead to a segmentation fault at exit
---------------------------------------------------------------------------
ObjectRefStreamEndOfStreamError           Traceback (most recent call last)
File python\ray\_raylet.pyx:347, in ray._raylet.StreamingObjectRefGenerator._next_sync()

File python\ray\_raylet.pyx:4643, in ray._raylet.CoreWorker.try_read_next_object_ref_stream()

File python\ray\_raylet.pyx:447, in ray._raylet.check_status()

ObjectRefStreamEndOfStreamError: 

During handling of the above exception, another exception occurred:

StopIteration                             Traceback (most recent call last)
File D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\data\_internal\execution\interfaces\physical_operator.py:80, in DataOpTask.on_data_ready(self, max_blocks_to_read)
     79 try:
---> 80     meta = ray.get(next(self._streaming_gen))
     81 except StopIteration:
     82     # The generator should always yield 2 values (block and metadata)
     83     # each time. If we get a StopIteration here, it means an error
   (...)
     86     # TODO(hchen): Ray Core should have a better interface for
     87     # detecting and obtaining the exception.

File python\ray\_raylet.pyx:302, in ray._raylet.StreamingObjectRefGenerator.__next__()

File python\ray\_raylet.pyx:365, in ray._raylet.StreamingObjectRefGenerator._next_sync()

StopIteration: 

During handling of the above exception, another exception occurred:

RaySystemError                            Traceback (most recent call last)
Cell In[23], line 2
      1 # Sample
----> 2 sample = embedded_chunks.take(1)
      3 print ("embedding size:", len(sample[0]["embeddings"]))
      4 print (sample[0]["text"])

File D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\data\dataset.py:2390, in Dataset.take(self, limit)
   2387 output = []
   2389 limited_ds = self.limit(limit)
-> 2390 for row in limited_ds.iter_rows():
   2391     output.append(row)
   2392     if len(output) >= limit:

File D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\data\iterator.py:219, in DataIterator.iter_rows.<locals>._wrapped_iterator()
    218 def _wrapped_iterator():
--> 219     for batch in batch_iterable:
    220         batch = BlockAccessor.for_block(BlockAccessor.batch_to_block(batch))
    221         for row in batch.iter_rows(public_row_format=True):

File D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\data\iterator.py:164, in DataIterator.iter_batches.<locals>._create_iterator()
    159 time_start = time.perf_counter()
    160 # Iterate through the dataset from the start each time
    161 # _iterator_gen is called.
    162 # This allows multiple iterations of the dataset without
    163 # needing to explicitly call `iter_batches()` multiple times.
--> 164 block_iterator, stats, blocks_owned_by_consumer = self._to_block_iterator()
    166 iterator = iter(
    167     iter_batches(
    168         block_iterator,
   (...)
    179     )
    180 )
    182 for batch in iterator:

File D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\data\_internal\iterator\iterator_impl.py:32, in DataIteratorImpl._to_block_iterator(self)
     24 def _to_block_iterator(
     25     self,
     26 ) -> Tuple[
   (...)
     29     bool,
     30 ]:
     31     ds = self._base_dataset
---> 32     block_iterator, stats, executor = ds._plan.execute_to_iterator()
     33     ds._current_executor = executor
     34     return block_iterator, stats, False

File D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\data\_internal\plan.py:548, in ExecutionPlan.execute_to_iterator(self, allow_clear_input_blocks, force_read)
    546 gen = iter(block_iter)
    547 try:
--> 548     block_iter = itertools.chain([next(gen)], gen)
    549 except StopIteration:
    550     pass

File D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\data\_internal\execution\legacy_compat.py:54, in execute_to_legacy_block_iterator(executor, plan, allow_clear_input_blocks, dataset_uuid)
     50 """Same as execute_to_legacy_bundle_iterator but returning blocks and metadata."""
     51 bundle_iter = execute_to_legacy_bundle_iterator(
     52     executor, plan, allow_clear_input_blocks, dataset_uuid
     53 )
---> 54 for bundle in bundle_iter:
     55     for block, metadata in bundle.blocks:
     56         yield block, metadata

File D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\data\_internal\execution\interfaces\executor.py:37, in OutputIterator.__next__(self)
     36 def __next__(self) -> RefBundle:
---> 37     return self.get_next()

File D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\data\_internal\execution\streaming_executor.py:141, in StreamingExecutor.execute.<locals>.StreamIterator.get_next(self, output_split_idx)
    139         raise StopIteration
    140 elif isinstance(item, Exception):
--> 141     raise item
    142 else:
    143     # Otherwise return a concrete RefBundle.
    144     if self._outer._global_info:

File D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\data\_internal\execution\streaming_executor.py:201, in StreamingExecutor.run(self)
    195 """Run the control loop in a helper thread.
    196 
    197 Results are returned via the output node's outqueue.
    198 """
    199 try:
    200     # Run scheduling loop until complete.
--> 201     while self._scheduling_loop_step(self._topology) and not self._shutdown:
    202         pass
    203 except Exception as e:
    204     # Propagate it to the result iterator.

File D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\data\_internal\execution\streaming_executor.py:252, in StreamingExecutor._scheduling_loop_step(self, topology)
    247     logger.get_logger().info("Scheduling loop step...")
    249 # Note: calling process_completed_tasks() is expensive since it incurs
    250 # ray.wait() overhead, so make sure to allow multiple dispatch per call for
    251 # greater parallelism.
--> 252 process_completed_tasks(topology, self._backpressure_policies)
    254 # Dispatch as many operators as we can for completed tasks.
    255 limits = self._get_or_refresh_resource_limits()

File D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\data\_internal\execution\streaming_executor_state.py:365, in process_completed_tasks(topology, backpressure_policies)
    363 state, task = active_tasks.pop(ref)
    364 if isinstance(task, DataOpTask):
--> 365     num_blocks_read = task.on_data_ready(
    366         max_blocks_to_read_per_op.get(state, None)
    367     )
    368     if state in max_blocks_to_read_per_op:
    369         max_blocks_to_read_per_op[state] -= num_blocks_read

File D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\data\_internal\execution\interfaces\physical_operator.py:88, in DataOpTask.on_data_ready(self, max_blocks_to_read)
     80     meta = ray.get(next(self._streaming_gen))
     81 except StopIteration:
     82     # The generator should always yield 2 values (block and metadata)
     83     # each time. If we get a StopIteration here, it means an error
   (...)
     86     # TODO(hchen): Ray Core should have a better interface for
     87     # detecting and obtaining the exception.
---> 88     ex = ray.get(block_ref)
     89     self._task_done_callback()
     90     raise ex

File D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\_private\auto_init_hook.py:24, in wrap_auto_init.<locals>.auto_init_wrapper(*args, **kwargs)
     21 @wraps(fn)
     22 def auto_init_wrapper(*args, **kwargs):
     23     auto_init_ray()
---> 24     return fn(*args, **kwargs)

File D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\_private\client_mode_hook.py:103, in client_mode_hook.<locals>.wrapper(*args, **kwargs)
    101     if func.__name__ != "init" or is_client_mode_enabled_by_default:
    102         return getattr(ray, func.__name__)(*args, **kwargs)
--> 103 return func(*args, **kwargs)

File D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\_private\worker.py:2565, in get(object_refs, timeout)
   2563             raise value.as_instanceof_cause()
   2564         else:
-> 2565             raise value
   2567 if is_individual_id:
   2568     values = values[0]

RaySystemError: System error: Failed to unpickle serialized exception
traceback: Traceback (most recent call last):
  File "python\ray\_raylet.pyx", line 347, in ray._raylet.StreamingObjectRefGenerator._next_sync
  File "python\ray\_raylet.pyx", line 4643, in ray._raylet.CoreWorker.try_read_next_object_ref_stream
  File "python\ray\_raylet.pyx", line 447, in ray._raylet.check_status
ray.exceptions.ObjectRefStreamEndOfStreamError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\data\_internal\execution\interfaces\physical_operator.py", line 80, in on_data_ready
    meta = ray.get(next(self._streaming_gen))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "python\ray\_raylet.pyx", line 302, in ray._raylet.StreamingObjectRefGenerator.__next__
  File "python\ray\_raylet.pyx", line 365, in ray._raylet.StreamingObjectRefGenerator._next_sync
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\exceptions.py", line 46, in from_ray_exception
    return pickle.loads(ray_exception.serialized_exception)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: APIStatusError.__init__() missing 2 required keyword-only arguments: 'response' and 'body'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\_private\serialization.py", line 404, in deserialize_objects
    obj = self._deserialize_object(data, metadata, object_ref)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\_private\serialization.py", line 293, in _deserialize_object
    return RayError.from_bytes(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\exceptions.py", line 40, in from_bytes
    return RayError.from_ray_exception(ray_exception)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\exceptions.py", line 49, in from_ray_exception
    raise RuntimeError(msg) from e
RuntimeError: Failed to unpickle serialized exception
2024-01-09 16:14:52,924	ERROR serialization.py:406 -- Failed to unpickle serialized exception
Traceback (most recent call last):
  File "D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\exceptions.py", line 46, in from_ray_exception
    return pickle.loads(ray_exception.serialized_exception)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: APIStatusError.__init__() missing 2 required keyword-only arguments: 'response' and 'body'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\_private\serialization.py", line 404, in deserialize_objects
    obj = self._deserialize_object(data, metadata, object_ref)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\_private\serialization.py", line 293, in _deserialize_object
    return RayError.from_bytes(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\exceptions.py", line 40, in from_bytes
    return RayError.from_ray_exception(ray_exception)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\exceptions.py", line 49, in from_ray_exception
    raise RuntimeError(msg) from e
RuntimeError: Failed to unpickle serialized exception
2024-01-09 16:14:52,924	ERROR worker.py:406 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): System error: Failed to unpickle serialized exception
traceback: Traceback (most recent call last):
  File "D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\exceptions.py", line 46, in from_ray_exception
    return pickle.loads(ray_exception.serialized_exception)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: APIStatusError.__init__() missing 2 required keyword-only arguments: 'response' and 'body'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\_private\serialization.py", line 404, in deserialize_objects
    obj = self._deserialize_object(data, metadata, object_ref)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\_private\serialization.py", line 293, in _deserialize_object
    return RayError.from_bytes(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\exceptions.py", line 40, in from_bytes
    return RayError.from_ray_exception(ray_exception)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Users\JIA\miniconda3\envs\patrag\Lib\site-packages\ray\exceptions.py", line 49, in from_ray_exception
    raise RuntimeError(msg) from e
RuntimeError: Failed to unpickle serialized exception
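
The real failure is masked here: the worker raised an openai APIStatusError, and Ray cannot reconstruct that exception class when unpickling it across processes (its __init__ requires the response and body keyword arguments), so only "Failed to unpickle serialized exception" surfaces. A sketch of one way to expose the underlying error, converting it to a plain exception inside the UDF (reuses the EmbedChunks class and get_embedding_model helper from the report):

class EmbedChunks:
    def __init__(self, model_name):
        self.embedding_model = get_embedding_model(
            embedding_model_name=model_name,
            model_kwargs={"device": "cuda"},
            encode_kwargs={"device": "cuda", "batch_size": 100})

    def __call__(self, batch):
        try:
            embeddings = self.embedding_model.embed_documents(batch["text"])
        except Exception as e:
            # A plain RuntimeError pickles cleanly, so the original message
            # (often an auth or rate-limit error from the API) is preserved.
            raise RuntimeError(f"{type(e).__name__}: {e}") from None
        return {"text": batch["text"], "source": batch["source"], "embeddings": embeddings}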


All cluster resources being claimed by actors?

In the notebook, calling

# Embed chunks
embedding_model_name = "thenlper/gte-base"
embedded_chunks = chunks_ds.map_batches(
    EmbedChunks,
    fn_constructor_kwargs={"model_name": embedding_model_name},
    batch_size=100, 
    num_gpus=1,
    compute=ActorPoolStrategy(size=2))

# Sample
sample = embedded_chunks.take(1)

results in:

======== Autoscaler status: 2023-09-19 10:15:05.945390 ========
Node status
---------------------------------------------------------------
Healthy:
 1 node_39e554d28e4f63b9d3360ffdf267014a901a29d1601c039967717f26
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 1.0/32.0 CPU
 1.0/1.0 GPU
 0B/10.09GiB memory
 11.70MiB/5.05GiB object_store_memory

Demands:
 {'CPU': 1.0, 'GPU': 1.0}: 1+ pending tasks/actors
(autoscaler +2m17s) Warning: The following resource request cannot be scheduled right now: {'CPU': 1.0, 'GPU': 1.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.
(autoscaler +2m52s) Warning: The following resource request cannot be scheduled right now: {'CPU': 1.0, 'GPU': 1.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.

Any solution? I have tried changing ActorPoolStrategy to size 1 and reducing batch_size, but the result is the same.
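
The autoscaler status shows the cluster has exactly one GPU and it is already fully claimed (1.0/1.0 GPU), so a second num_gpus=1 actor can never be scheduled; if size=1 also hangs, a previous run's actor may still be holding the GPU until the Ray session is restarted. One possible workaround on a single-GPU machine is fractional GPU allocation; a sketch (assumes EmbedChunks and chunks_ds from the notebook):

from ray.data import ActorPoolStrategy

embedded_chunks = chunks_ds.map_batches(
    EmbedChunks,
    fn_constructor_kwargs={"model_name": "thenlper/gte-base"},
    batch_size=100,
    num_gpus=0.5,  # let two actors share the single physical GPU
    compute=ActorPoolStrategy(size=2))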

Unable to import config

To reproduce:

  1. from rag.config import ROOT_DIR

Error:

ModuleNotFoundError                       Traceback (most recent call last)
----> 1 from rag.config import ROOT_DIR

ModuleNotFoundError: No module named 'rag.config'

  2. Example shown in the ipynb

Credentials

ray.init(runtime_env={
    "env_vars": {
        "OPENAI_API_BASE": os.environ["OPENAI_API_BASE"],
        "OPENAI_API_KEY": os.environ["OPENAI_API_KEY"],
        "ANYSCALE_API_BASE": os.environ["ANYSCALE_API_BASE"],
        "ANYSCALE_API_KEY": os.environ["ANYSCALE_API_KEY"],
        "DB_CONNECTION_STRING": os.environ["DB_CONNECTION_STRING"],
    },
    "working_dir": str(ROOT_DIR),
})

Gives the output:
Python version: 3.10.8
Ray version: 2.7.0
Dashboard: http://session-5ljni527x7edt2q6px7nuaejct.i.anyscaleuserdata-staging.com/

The output I get:

Python version: 3.9.15
Ray version: 2.9.1

I am not able to access that dashboard.

  3. Anyscale Platform access:
    I am trying to access the Anyscale Platform, but it says it needs an invitation to proceed and then fails to send the invitation email. Can you please share how to get set up on the Anyscale Platform?

  4. ! export EFS_DIR=$(python -c "from rag.config import EFS_DIR; print(EFS_DIR)")
    ! wget -e robots=off --recursive --no-clobber --page-requisites \
      --html-extension --convert-links --restrict-file-names=windows \
      --domains docs.ray.io --no-parent --accept=html --retry-on-http-error=429 \
      -P $EFS_DIR https://docs.ray.io/en/master/

Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'rag.config'
Both --no-clobber and --convert-links were specified, only --convert-links will be used.
--2024-02-01 19:48:54--  https://docs.ray.io/en/master/
Resolving docs.ray.io (docs.ray.io)... 104.18.1.163, 104.18.0.163
Connecting to docs.ray.io (docs.ray.io)|104.18.1.163|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
/mnt/shared_storage/ray-assistant-data/docs.ray.io/en/master: No such file or directory
/mnt/shared_storage/ray-assistant-data/docs.ray.io/en/master/index.html: No such file or directory

Cannot write to '/mnt/shared_storage/ray-assistant-data/docs.ray.io/en/master/index.html' (Success).
Converted links in 0 files in 0 seconds.
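
Both the import error and the failing EFS_DIR lookup trace back to the rag package not being importable; the setup section handles this with export PYTHONPATH=$PYTHONPATH:$PWD from the repo root. The same fix from inside Python, as a sketch (the checkout path is hypothetical):

import sys

# Make the repo root importable so `rag.config` resolves.
sys.path.insert(0, "/path/to/llm-applications")  # hypothetical checkout location

from rag.config import ROOT_DIR, EFS_DIR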
