
feature-store-api's Introduction

Hopsworks Feature Store


HSFS is the library to interact with the Hopsworks Feature Store. The library makes creating new features, feature groups and training datasets easy.

The library is environment independent and can be used in two modes:

  • Spark mode: For data engineering jobs that create and write features into the feature store or generate training datasets. It requires a Spark environment such as the one provided in the Hopsworks platform or Databricks. In Spark mode, HSFS provides bindings both for Python and JVM languages.

  • Python mode: For data science jobs to explore the features available in the feature store, generate training datasets and feed them into a training pipeline. Python mode requires only a Python interpreter and can be used from Python jobs and Jupyter kernels inside Hopsworks, as well as from Amazon SageMaker or Kubeflow.

The library automatically configures itself based on the environment it runs in. However, to connect from an external environment such as Databricks or AWS SageMaker, additional connection information, such as host and port, is required. For more information, check out the Hopsworks documentation.

Getting Started On Hopsworks

Get started easily by registering an account on Hopsworks Serverless. Create your project and a new API key. In a new Python environment with Python 3.8 or higher, install the client library using pip:

# Get all Hopsworks SDKs: Feature Store, Model Serving and Platform SDK
pip install hopsworks
# or a minimal install with only the Feature Store SDK
pip install hsfs[python]
# if using zsh don't forget the quotes
pip install 'hsfs[python]'

You can start a notebook, instantiate a connection, and get the project's feature store handle.

import hopsworks

project = hopsworks.login() # you will be prompted for your API key
fs = project.get_feature_store()

or using hsfs directly:

import hsfs

connection = hsfs.connection(
    host="c.app.hopsworks.ai", #
    project="your-project",
    api_key_value="your-api-key",
)
fs = connection.get_feature_store()

Create a new feature group to start inserting feature values.

fg = fs.create_feature_group("rain",
                        version=1,
                        description="Rain features",
                        primary_key=['date', 'location_id'],
                        online_enabled=True)

fg.save(dataframe)

Upsert new data into a feature group created with time_travel_format="HUDI".

fg.insert(upsert_df)

Retrieve the commit timeline metadata of a feature group created with time_travel_format="HUDI".

fg.commit_details()

"Reading feature group as of specific point in time".

fg = fs.get_feature_group("rain", 1)
fg.read("2020-10-20 07:34:11").show()

Read updates that occurred between specified points in time.

fg = fs.get_feature_group("rain", 1)
fg.read_changes("2020-10-20 07:31:38", "2020-10-20 07:34:11").show()

Join features together

feature_join = (rain_fg.select_all()
                    .join(temperature_fg.select_all(), on=["date", "location_id"])
                    .join(location_fg.select_all()))
feature_join.show(5)

Join feature groups as of a specific point in time.

feature_join = (rain_fg.select_all()
                    .join(temperature_fg.select_all(), on=["date", "location_id"])
                    .join(location_fg.select_all())
                    .as_of("2020-10-31"))
feature_join.show(5)

Join feature groups as of different points in time.

rain_fg_q = rain_fg.select_all().as_of("2020-10-20 07:41:43")
temperature_fg_q = temperature_fg.select_all().as_of("2020-10-20 07:32:33")
location_fg_q = location_fg.select_all().as_of("2020-10-20 07:33:08")
joined_features_q = rain_fg_q.join(temperature_fg_q).join(location_fg_q)

Use the query object to create a training dataset:

td = fs.create_training_dataset("rain_dataset",
                                version=1,
                                data_format="tfrecords",
                                description="A test training dataset saved in TfRecords format",
                                splits={'train': 0.7, 'test': 0.2, 'validate': 0.1})

td.save(feature_join)

A short introduction to the Scala API:

import com.logicalclocks.hsfs._
val connection = HopsworksConnection.builder().build()
val fs = connection.getFeatureStore();
val attendances_features_fg = fs.getFeatureGroup("games_features", 1);
attendances_features_fg.show(1)

You can find more examples on how to use the library in our hops-examples repository.

Usage

Usage data is collected to improve the quality of the library. It is turned on by default if the backend is "c.app.hopsworks.ai". To turn it off, use one of the following ways:

# use environment variable
import os
os.environ["ENABLE_HOPSWORKS_USAGE"] = "false"

# use `disable_usage_logging`
import hsfs
hsfs.disable_usage_logging()

The source code can be found in python/hsfs/usage.py.

Documentation

Documentation is available at Hopsworks Feature Store Documentation.

Issues

For general questions about the usage of Hopsworks and the Feature Store please open a topic on Hopsworks Community.

Please report any issue using the GitHub issue tracker.

Please attach the client environment, printed by the code below, to the issue:

import hopsworks
import hsfs
hopsworks.login().get_feature_store()
print(hsfs.get_env())

Contributing

If you would like to contribute to this library, please see the Contribution Guidelines.

feature-store-api's People

Contributors

aversey, berthoug, bubriks, davitbzh, dependabot[bot], dhananjay-mk, ermiasg, gibchikafa, javierdlrm, jimdowling, kennethmhc, kouzant, lovew-lc, maismail, manu-sj, maxxx-zh, mklepium, moritzmeister, o-alex, rktraz, robzor92, siroibaf, smkniazi, tdoehmen, tkakantousis, vatj, yiksanchan

feature-store-api's Issues

External client requests do not verify hostname

self._verify = self._get_verify(self._host, trust_store_path)

When sending a request, the boolean parameter 'verify' indicates whether the hostname has to be verified. This value is calculated using the '_get_verify' method in the base client.

In the case of external clients, this value is always false, since the hostname is passed as a parameter to '_get_verify' and is always different from the string "true". Instead, the parameter passed should probably be 'hostname_verification'.
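
A minimal sketch, reusing the method and parameter names from the snippet above, of how the flag could instead be derived from a 'hostname_verification' setting. This is an illustration of the proposed change, not the actual client code:

def _get_verify(self, hostname_verification, trust_store_path=None):
    # Verify against the supplied trust store when hostname verification is enabled.
    if hostname_verification:
        return trust_store_path if trust_store_path else True
    return False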

Save write options for training datasets

We provide a way for users to pass write options to Spark when writing a training dataset. We should store these options in Hopsworks and retrieve them when reading/appending to the training dataset.
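
For context, a hedged sketch of how a user passes such options today; the write_options parameter comes from the Python API, while the option key below is only a placeholder:

# The option key is a placeholder; any Spark writer option could be passed here.
td.save(feature_join, write_options={"example.spark.option": "value"})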

Prefix for joining to avoid clashing

Allow providing a prefix in case there are feature clashing.

(fg.select([column_name, column_name1])
   .join(fg1.select([alex1, alex2, column_name]), prefix="fg1_"))

Add support for training datasets numpy/hd5 formats

Spark can't handle these formats, so in hops-util-py we were uploading them using Pydoop.

We need to port that concept to the feature store API and make sure it works on Databricks and when writing to S3 as well.
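
As a rough illustration of the hops-util-py approach, a hedged sketch that serializes a numpy array in memory and writes the bytes to HDFS through Pydoop; the target path is an assumption made up for the example:

import io

import numpy as np
import pydoop.hdfs as hdfs

array = np.random.rand(100, 10)
buffer = io.BytesIO()
np.save(buffer, array)  # serialize the array to an in-memory .npy payload
# Write the raw bytes to an HDFS path; Spark is not involved.
hdfs.dump(buffer.getvalue(), "/Projects/my_project/my_td/rain_dataset.npy")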

hopsworks client not working on pure python on hopsworks

FileNotFoundError: [Errno 2] No such file or directory: '/srv/hops/jupyter/Projects/demo_featurestore_admin000/demo_featurestore_admin000__meb10000/fc53a2e41a134bb458d2bb7b40d710fb74bd88dae215af9fd55766f72a20ef0f/certificates/demo_featurestore_admin000__meb10000__kstore.key'

Python client can't handle empty responses from the backend

exogenous_fg_meta = fs.get_feature_group('exogenous_fg', 1)
exogenous_fg_meta.delete()

This piece of code results in this exception:

An error was encountered:
Expecting value: line 1 column 1 (char 0)
Traceback (most recent call last):
  File "/srv/hops/anaconda/envs/theenv/lib/python3.6/site-packages/hsfs/feature_group.py", line 164, in delete
    self._feature_group_engine.delete(self)
  File "/srv/hops/anaconda/envs/theenv/lib/python3.6/site-packages/hsfs/core/feature_group_engine.py", line 103, in delete
    self._feature_group_api.delete(feature_group)
  File "/srv/hops/anaconda/envs/theenv/lib/python3.6/site-packages/hsfs/core/feature_group_api.py", line 114, in delete
    _client._send_request("DELETE", path_params)
  File "/srv/hops/anaconda/envs/theenv/lib/python3.6/site-packages/hsfs/decorators.py", line 35, in if_connected
    return fn(inst, *args, **kwargs)
  File "/srv/hops/anaconda/envs/theenv/lib/python3.6/site-packages/hsfs/client/base.py", line 151, in _send_request
    return response.json()
  File "/srv/hops/anaconda/envs/theenv/lib/python3.6/site-packages/requests/models.py", line 897, in json
    return complexjson.loads(self.text, **kwargs)
  File "/srv/hops/anaconda/envs/theenv/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/srv/hops/anaconda/envs/theenv/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/srv/hops/anaconda/envs/theenv/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
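
A hedged sketch of a possible fix inside the client's _send_request (names taken from the traceback above): only decode JSON when the backend actually returned a body.

# Inside _send_request, after the status code has been checked:
if response.content:
    return response.json()
return None  # e.g. an empty body returned by a DELETE call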

Expose method to get online storage connector to the user

Currently the user can request a storage connector from the feature store object. However, that call does not fetch the online storage connector.

We should allow users to fetch it. However, it might also make sense to unify the REST calls in the backend.
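
From the user's side, the desired call could look roughly like this; get_online_storage_connector is a hypothetical method name used only for illustration:

fs = connection.get_feature_store()
# Hypothetical: fetch the connector that backs the online feature store.
online_connector = fs.get_online_storage_connector()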

splits inode foreign key

We should add a foreign key to the inode of a split, so that the metadata gets deleted when the user deletes a split from the Datasets browser.

Problem: Spark recreates the directory when writing, thereby creating a new inode. In append mode this wouldn't be a problem; however, tfrecords don't support append.

operations should check for duplicated columns

It's easy when joining feature groups together to end up with duplicated columns. Spark refuses to write if there are duplicated columns in the dataframe.

When we check the schema we should also check that there are no duplicated columns.
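
A minimal sketch of such a check on the dataframe's columns before writing (not the library's actual validation code):

from collections import Counter

# Fail fast if the joined dataframe contains the same column name more than once.
duplicated = [name for name, count in Counter(dataframe.columns).items() if count > 1]
if duplicated:
    raise ValueError(f"Duplicated columns in dataframe: {duplicated}")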

Don't set spark type as online type in Java API

Currently, when we parse the features from a feature group, we set the Spark type as the online type. The online type should be either empty (Hopsworks takes care of converting it) or a valid online type.
