feast-dev / feast Goto Github PK

Feature Store for Machine Learning

License: Apache License 2.0

Makefile 0.58% Go 6.48% Dockerfile 0.23% Shell 0.85% Python 74.38% HCL 1.07% Java 7.51% Mustache 0.15% HTML 0.11% CSS 0.03% TypeScript 8.37% JavaScript 0.23%

machine-learning features ml big-data feature-store python mlops data-engineering data-science data-quality

feast's Introduction

Overview

Feast (Feature Store) is an open source feature store for machine learning. Feast is the fastest path to manage existing infrastructure to productionize analytic data for model training and online inference.

Feast allows ML platform teams to:

Make features consistently available for training and serving by managing an offline store (to process historical data for scale-out batch scoring or model training), a low-latency online store (to power real-time prediction), and a battle-tested feature server (to serve pre-computed features online).
Avoid data leakage by generating point-in-time correct feature sets so data scientists can focus on feature engineering rather than debugging error-prone dataset joining logic. This ensure that future feature values do not leak to models during training.
Decouple ML from data infrastructure by providing a single data access layer that abstracts feature storage from feature retrieval, ensuring models remain portable as you move from training models to serving models, from batch models to realtime models, and from one data infra system to another.

Please see our documentation for more information about the project.

📐 Architecture

The above architecture is the minimal Feast deployment. Want to run the full Feast on Snowflake/GCP/AWS? Click here.

🐣 Getting Started

1. Install Feast

pip install feast

2. Create a feature repository

feast init my_feature_repo
cd my_feature_repo/feature_repo

3. Register your feature definitions and set up your feature store

feast apply

4. Explore your data in the web UI (experimental)

feast ui

5. Build a training dataset

from feast import FeatureStore
import pandas as pd
from datetime import datetime

entity_df = pd.DataFrame.from_dict({
    "driver_id": [1001, 1002, 1003, 1004],
    "event_timestamp": [
        datetime(2021, 4, 12, 10, 59, 42),
        datetime(2021, 4, 12, 8,  12, 10),
        datetime(2021, 4, 12, 16, 40, 26),
        datetime(2021, 4, 12, 15, 1 , 12)
    ]
})

store = FeatureStore(repo_path=".")

training_df = store.get_historical_features(
    entity_df=entity_df,
    features = [
        'driver_hourly_stats:conv_rate',
        'driver_hourly_stats:acc_rate',
        'driver_hourly_stats:avg_daily_trips'
    ],
).to_df()

print(training_df.head())

# Train model
# model = ml.fit(training_df)

            event_timestamp  driver_id  conv_rate  acc_rate  avg_daily_trips
0 2021-04-12 08:12:10+00:00       1002   0.713465  0.597095              531
1 2021-04-12 10:59:42+00:00       1001   0.072752  0.044344               11
2 2021-04-12 15:01:12+00:00       1004   0.658182  0.079150              220
3 2021-04-12 16:40:26+00:00       1003   0.162092  0.309035              959

6. Load feature values into your online store

CURRENT_TIME=$(date -u +"%Y-%m-%dT%H:%M:%S")
feast materialize-incremental $CURRENT_TIME

Materializing feature view driver_hourly_stats from 2021-04-14 to 2021-04-15 done!

7. Read online features at low latency

from pprint import pprint
from feast import FeatureStore

store = FeatureStore(repo_path=".")

feature_vector = store.get_online_features(
    features=[
        'driver_hourly_stats:conv_rate',
        'driver_hourly_stats:acc_rate',
        'driver_hourly_stats:avg_daily_trips'
    ],
    entity_rows=[{"driver_id": 1001}]
).to_dict()

pprint(feature_vector)

# Make prediction
# model.predict(feature_vector)

{
    "driver_id": [1001],
    "driver_hourly_stats__conv_rate": [0.49274],
    "driver_hourly_stats__acc_rate": [0.92743],
    "driver_hourly_stats__avg_daily_trips": [72]
}

📦 Functionality and Roadmap

The list below contains the functionality that contributors are planning to develop for Feast.

We welcome contribution to all items in the roadmap!
Data Sources
- Snowflake source
- Redshift source
- BigQuery source
- Parquet file source
- Azure Synapse + Azure SQL source (contrib plugin)
- Hive (community plugin)
- Postgres (contrib plugin)
- Spark (contrib plugin)
- Kafka / Kinesis sources (via push support into the online store)
Offline Stores
Online Stores
Feature Engineering
- On-demand Transformations (Beta release. See RFC)
- Streaming Transformations (Alpha release. See RFC)
- Batch transformation (In progress. See RFC)
Streaming
Deployments
- AWS Lambda (Alpha release. See RFC)
- Kubernetes (See guide)
Feature Serving
- Python Client
- Python feature server
- Java feature server (alpha)
- Go feature server (alpha)
Data Quality Management (See RFC)
- Data profiling and validation (Great Expectations)
Feature Discovery and Governance
- Python SDK for browsing feature registry
- CLI for browsing feature registry
- Model-centric feature tracking (feature services)
- Amundsen integration (see Feast extractor)
- DataHub integration (see DataHub Feast docs)
- Feast Web UI (Beta release. See docs)

🎓 Important Resources

Please refer to the official documentation at Documentation

👋 Contributing

Feast is a community project and is still under active development. Please have a look at our contributing and development guides if you want to contribute to the project:

✨ Contributors

Thanks goes to these incredible people:

feast's People

Contributors

Stargazers

Watchers

Forkers

zhilingc pradithya tims baskaranz feast-ci-bot mansiib sepam juanvera mbrukman shafiahmed lazynerd huaizhengzhang lovehoroscoper stevenlol mounirrouissi lavkesh utopiazh kiminh devjyotipatra thirteen37 steven-tjg stevencasey firdasafridi m9t diffblue-benchmarks lewisbakkero gauravkumar37 budi yangzz945 redwizard100 kmarquardsen tspannhw arjunmantri fusubacon 42eric allenzyx romanwozniak himanshubhardwaj scnakandala gabrielwen cloudtwter jmelching davidheryanto oopsoutofmemory developerswithpassion goldminer09 gappelgren manoj2411 peterjrichens accraze pmatern smadarasmi abhishekms1047 devendratomar ananth3010 ttanay adamjm souravbadami gongyanchao humasak khorshuheng bcui6611 mfaarabi zulily fcambarieri kiranvajja smithla02 kaihendry lumiere-ml woop rukminivs mikenyaga cloga nicholaistaalung wzpy jeffwan geota david30907d postmates hypertensorai lgvital yanson hopesdad mitakora pjaselin yennanliu mailmahee ashwin-kapes gracechristina imjuanleonard agoda-com chanukyapekala pmjacinto miguelperalvo ashwinath joostrothweiler sabarnwa cloudbow akansh97531 fabito

feast's Issues

Add build process for docker images

To begin with, we can just push the docker images to a public gcr repository. Later we can see about migrating it to Dockerhub.

Will use google cloud build for this.

Vulnerability in dependency (webpack-dev-server)

vulnerability found in ui/package-lock.json : CVE-2018-14732

More information
low severity
Vulnerable versions: < 3.1.11
Patched version: 3.1.11
An issue was discovered in lib/Server.js in webpack-dev-server before 3.1.11. Attackers are able to steal developer's code because the origin of requests is not checked by the WebSocket server, which is used for HMR (Hot Module Replacement). Anyone can receive the HMR message sent by the WebSocket server via a ws://127.0.0.1:8080/ connection from any origin.

Possible Solution

Upgrade webpack-dev-server to version 3.1.11 or later

How does Feast handle feature creation?

Since this will be an important topic to cover, I thought it would be wise to create an issue where we could both explain and discuss how feature creation fits in to Feast.

The current state is that feature creation is left up to the producer of features. Feast allows these features to be ingested for training and serving, but the act of engineering features is left to upstream processes. Feast does support referring to feature transformation code from feature specifications.

The question then becomes, how are users able to trace features that are ingested back to raw data sources? What would be the standard approach to engineering this upstream system and how do feature specifications connect to code and data?

Removal of storing historical value of feature in serving storage

Is your feature request related to a problem? Please describe.
Storing historical value of feature in serving store consume a lot of storage space. It becomes a real problem for limited storage such as Redis.

Describe the solution you'd like
Remove the storing feature as time series data functionality and only store the latest value of feature (which currently we already have)

Impact

Ingestion
Removal of code which writes feature as time series data in Redis and Bigtable.
Serving
Serving currently has 1 API which can request LAST feature or LIST of feature.
If the time series data is removed then LIST request will always return empty feature.
Currently, LAST request ignore time range filter, it has to be changed to respect it to allow filtering stale feature.
Client
Client will have to always use LAST request. (currently it uses LIST when time range filtering is not requested)

To be discussed further

Should we add new API to serving service which simplify request and response since we know it will always return LAST? Currently the request and response is complex to cater the needs LAST and LIST.

Thoughts?
@tims @zhilingc @woop @baskaranz @mansi @budi

Remove feature "granularity" and relegate to metadata

[edit] Granularities are no longer required in FeatureRow, or FeatureSpecs, as we have removed history from serving store and the serving api. Thus there is also no requirement for it to be in the warehouse store. Additionally the notion of granularity has proven to be confusing to end users. History of issue kept below:

I'd like to discuss feature granularities.

What is granularity

Currently we have a fixed set of feast granularities {seconds, minutes, hours, days}.
It is not always obvious what the feast granularity refers to.

In general a feature is handled by a few different datetimes throughout it's lifecycle:

the window duration of an aggregation (this is upstream to feast)
the trigger frequency that an event is emitted per key, likely irregular if more than once per window (this is upstream to feast)
the ingestion event timestamp that Feast receives during ingestion, determined by the feature creator
the storage event timestamp used to store and retrieve features in Feast, determined by feast.

The storage event timestamp is derived by rounding the ingestion event timestamp to start of the granularity for all the features in a feature row. Eg: for a granularity of 1 hour, we round the ingestion timestamp to the start of the enclosing hour.

For example, say we have a feature that is aggregated over a 1 hour fixed windows and triggered every one minute. Each minute an update of the 1 hour window aggregation is provided. We would naturally use a 1 hour granularity for this. The ingestion event timestamp should be within the one hour window. The storage event timestamp would be the start of the window.

Another example, say we have a feature that is aggregated over a 10 minute sliding window, and triggered only once at the end of every window. In this case, the feast granularity actually needs to be 1 minute. Which can seem confusing.

Limitations of current approach

Feast rounds the ingested timestamps to a granularity provided by creation, this seemed a convenience, but it hinders the use of custom granularities and it can cause confusion.

For example: because the granularities are an enum and there is not 5 minute option. If we wanted to store and overwrite a new key every five minutes, we would need to use a finer granularity and manually round the ingestion timestamps to the 5 minute marks during feature creation.

Another example: Lets say we have a feature called "product.day.sold". As it is updated throughout the day, it could represent the number of products sold on that day so far, or just as easily it could represent the number of products sold in the last 24 hours at the time it was updated. It could also represent the last 7 days of sold products as it stood on that particular day. Basically the meaning of this feature is determined by how the feature was created. The feature granularity is not enough information, and could be misleading when feature creators are forced to workaround it's limitations.

I suggest that instead of attempting to handle granularities, we should just require that rounding the timestamps should always happen during feature creation, not within Feast, and we should simply store features against the event timestamp provided.

The problem of how to serve keys if do not have a fixed granularity, is not as bad as it sounds.

firstly, it is only an issue at all when a feature is requested across a time range, not "latest". And "latest" is the most common request.
secondly, our currently supported stores, BigTable and Redis, both support scans across key date ranges (Redis via our bucketing approach).

Another problem is how do we prevent feature creators from over polluting a key space with far too granular timestamps? We will still have this problem regardless, as a feature creator can always use the "seconds" granularity.

My proposal

The storage event timestamp should be the same thing as ingestion event timestamp.
We should drop granularity from FeatureRow and ignore it for ingestion and storage purposes.
We should drop the requirement that granularity is part of the featureId. So instead of {entityName}.{granularity}.{featureName}, it should just be {entityName}.{featureName}.
BigQuery tables (which are currently separated by granularities, should instead be separated by a feature's group)

We would be committing to a requirement that timely short scans across a key range are supported by all stores.

Benefits

An easier to understand data model.
Enables storing at custom granularities.
Simplified code

What do people think?
Is there an issue with serving I have missed?

Jackson dependency issues

Current Behavior

Running the core application fails with error:

Factory method 'httpPutFormContentFilter' threw exception; nested exception is java.lang.NoClassDefFoundError: com/fasterxml/jackson/databind/exc/InvalidDefinitionException","message":"Failed to instantiate [org.springframework.boot.web.servlet.filter.OrderedHttpPutFormContentFilter]: Factory method 'httpPutFormContentFilter' threw exception; nested exception is java.lang.NoClassDefFoundError: com/fasterxml/jackson/databind/exc/InvalidDefinitionException","name":"org.springframework.beans.BeanInstantiationException","cause":{"commonElementCount":0,"localizedMessage":"com/fasterxml/jackson/databind/exc/InvalidDefinitionException","message":"com/fasterxml/jackson/databind/exc/InvalidDefinitionException","name":"java.lang.NoClassDefFoundError"
...

As of the python sdk merge. Tests run and pass fine.

Specifications

Version: master

Possible Solution

Exclude Jackson dependencies from Jinjava and pin Jackson to a specific version.

Bump Apache Beam SDK version

Currently ingestion uses Apache Beam V2.5.0 which is compatible with Flink runner 1.4.x.
This version of flink doesn't have REST API to access metrics, e.g: /jobs/<jobid>/metrics which is added in Flink 1.5.x.
In order to use it we have to upgrade ingestion's BEAM SDK to either 2.8.0, 2.7.0 or 2.6.0

Spring boot CLI logs show up as JSON

Expected Behavior

CLI logs show up in a human-readable format

Current Behavior

CLI logs show up as JSON:

Steps to reproduce

mvn clean compile
java -jar ./core/target/feast-core-0.1.0.jar

Specifications

Version: 0.1.0
Platform: Linux
Subsystem: Debian

Thanks!

Deduplicate list of storages in specs service

Expected Behavior

Ingestion should retrieve a set of storages from core, not a list with multiples.

Current Behavior

Given an import job with features sharing stores, the ingestion job will query for the full list of storage ids with duplicates. This will fail the job since the output list of storage specs will be smaller than the list of ids.

Possible Solution

Deduplicating the list of required storage specs will fix this problem (in Specs.java)

Python SDK

Creating an issue to track the addition of a Python SDK for Feast:

RFC: Python SDK
PR: #47

Create helm package repo to host helm charts

Add ability in the repo to host helm charts

Create a release

I think it's important that we separate any new development from bug fixes. We should have a release out of Feast that is stable and can be used in production.

If we want to develop new breaking changes then that should be prioritized for a future release, and development should maybe continue on a separate branch.

@tims @pradithya @zhilingc

Add build/test triggering for every PR and on the master branch

It would be nice to have automated build/test run for every GitHub PR to ensure that new proposed changes aren't breaking the build.

It's also not clear what are the set of commands needed to run to ensure that a proposed change is not breaking anything, particularly given that there is code in Java, Go, and Python in a single repo.

Running the following command from an unmodified master branch:

make build-java

fails with:

[ERROR] Errors: 
[ERROR]   KafkaFeatureRowDeserializerTest.feast.source.kafka.deserializer.KafkaFeatureRowDeserializerTest » SessionExpired
[ERROR] Tests run: 112, Failures: 0, Errors: 1, Skipped: 0

If we had a build status badge on the README.md file that I could see the latest build status, it would help me to understand that this is an existing and known flaky/broken test, and it would incentivize others to help fix or disable this test to get to a green build.

Option to add service account to core deployment.

The feast deployment requires access to the following components:

gcs: read/write

As well as the following, depending on your usecase:

bigquery: read/write
bigtable: read/write/tableAdmin

There should be an option to add service account to core deployment with required roles.

Add support for ingesting from BigQuery SQL statement / View

Currently Feast only support ingesting feature from BigQuery Table.

Get ingestion metrics when running on Flink runner

Is your feature request related to a problem? Please describe.
Currently pulling metrics from Flink Runner is not implemented yet.

Describe the solution you'd like
Use Flink REST API to pull metrics from

Removal of ingestion's profile for different runner

To simplify deployment process, packaging ingestion as different profile based on the runner will be removed.

Kafka source

The kafka IO PR #19 needs some follow ups.

It appears to have broken the build? Or another merge removed an import. IOException symbol not found in ImportJob.
Import job gets google application default creds and uses options.setGcpCredential? Why? I've not needed to do that before if I set them in my environment. It would also seems odd to get an error logged about it if you're using the direct or flink runner and not using GCP at all.
ImportJobOptions should not inherit FlinkPipelineOptions and GcpOptions, as that suggests it should always use Flink and GCP. These should already be registered by default, so you can just make use of them using options.as(GcpOptions.class) if needed.

Allow for registering feature that doesn't have warehouse store

Registering feature that doesn't have warehouse store will currently return error.
There is a use case that some feature doesn't need warehouse store, so the restriction could be removed.
Additionally, ingestion will need to be updated to do no operation for warehouse storing operation.

Create CI

Allow for disabling of metrics pushing

Is your feature request related to a problem? Please describe.
Without a valid statsd endpoint, the entire deployment will fail. While useful, we shouldn't make it mandatory for users to deploy the tick stack together with Feast. Core still maintains the latest metrics retrieved even in the absence of a separate db.

Describe the solution you'd like
Allow for users to toggle metrics pushing to an external db on/off.

Errors during kafka deserializer (passing) test execution

Below error is thrown while running the kafka deserializer test, even though the test passes, which could break the build
Caused by: java.io.FileNotFoundException: /var/folders/kd/yyfwdyk15b32xdjs1j9j3qsr0000gp/T/kafka-5510291321770207219/topic-3198cdeb-d825-4092-8c6d-bd27db077c5c-0/00000000000000000001.snapshot (No such file or directory)

Runtime Dependency Error After Upgrade to Beam 2.9.0

During job submission following error is thrown

java.lang.NoSuchMethodError: com.google.api.client.http.HttpRequest.setWriteTimeout(I)Lcom/google/api/client/http/HttpRequest;
        at org.apache.beam.sdk.util.RetryHttpRequestInitializer.initialize(RetryHttpRequestInitializer.java:227)
        at com.google.cloud.hadoop.util.ChainingHttpRequestInitializer.initialize(ChainingHttpRequestInitializer.java:52)
        at com.google.api.client.http.HttpRequestFactory.buildRequest(HttpRequestFactory.java:93)
        at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.buildHttpRequest(AbstractGoogleClientRequest.java:381)
        at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:499)
        at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:432)
        at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:549)
        at com.google.cloud.hadoop.util.ResilientOperation$AbstractGoogleClientRequestExecutor.call(ResilientOperation.java:163)
        at com.google.cloud.hadoop.util.ResilientOperation.retry(ResilientOperation.java:64)

This is due to dependency to com.google.http-client was not updated accordingly

Branch conflicts with tag when using cloud build trigger

We should remove the branch field in the Terraform script for creating a cloud build trigger.

Go packages in protos use incorrect repo

The go packages in the protos are still pointing at gojektech.
option go_package = "github.com/gojektech/..."

Should be:
option go_package = "github.com/gojek/..."

Job Execution Failure with NullPointerException

Expected Behavior
Ingestion job execution should complete successfully

Current Behavior
Running a job using Dataflow Runner throw following exception


java.lang.RuntimeException: java.lang.NullPointerException
	at org.apache.beam.runners.dataflow.worker.IntrinsicMapTaskExecutorFactory$1.typedApply(IntrinsicMapTaskExecutorFactory.java:193)
	at org.apache.beam.runners.dataflow.worker.IntrinsicMapTaskExecutorFactory$1.typedApply(IntrinsicMapTaskExecutorFactory.java:164)
	at org.apache.beam.runners.dataflow.worker.graph.Networks$TypeSafeNodeFunction.apply(Networks.java:63)
	at org.apache.beam.runners.dataflow.worker.graph.Networks$TypeSafeNodeFunction.apply(Networks.java:50)
	at org.apache.beam.runners.dataflow.worker.graph.Networks.replaceDirectedNetworkNodes(Networks.java:87)
	at org.apache.beam.runners.dataflow.worker.IntrinsicMapTaskExecutorFactory.create(IntrinsicMapTaskExecutorFactory.java:124)
	at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:337)
	at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:291)
	at org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:135)
	at org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:115)
	at org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:102)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
	at org.apache.beam.sdk.values.TupleTag.hashCode(TupleTag.java:162)
	at org.apache.beam.runners.dataflow.worker.repackaged.com.google.common.collect.RegularImmutableMap.fromEntryArray(RegularImmutableMap.java:72)
	at org.apache.beam.runners.dataflow.worker.repackaged.com.google.common.collect.ImmutableMap$Builder.build(ImmutableMap.java:300)
	at org.apache.beam.runners.dataflow.worker.IntrinsicMapTaskExecutorFactory.createParDoOperation(IntrinsicMapTaskExecutorFactory.java:268)
	at org.apache.beam.runners.dataflow.worker.IntrinsicMapTaskExecutorFactory.access$000(IntrinsicMapTaskExecutorFactory.java:85)
	at org.apache.beam.runners.dataflow.worker.IntrinsicMapTaskExecutorFactory$1.typedApply(IntrinsicMapTaskExecutorFactory.java:182)

Steps to reproduce
Run any ingestion job on commit 11114c5

Possible Cause
The job is successful if commit 11114c5 is excluded from build. So, probably it has something to do with it. (note: PR #45 has to be included when trying in order to complete job submission step)

Include ui into core's build

Is your feature request related to a problem? Please describe.
Currently building UI is outside core's component build process even though it should be packaged as one

Describe the solution you'd like
Build UI and core together and include the UI artifact into core's jar file

Helm chart for deploying feast using Flink as runner

Describe the solution you'd like
Update helm chart to support deploying Feast with Flink runner

Change config to yaml

Is your feature request related to a problem? Please describe.
Configuration for feast is currently done using environment variables only. As the number of parameters grows in size, this method of configuring the deployment becomes increasingly unwieldy.

Describe the solution you'd like
We want to move away from KVs, which are limited when it comes to nested config, eg. for job options. Currently we set it as a json string.

Describe alternatives you've considered
If we use yaml, it would potentially look something like this:

---
# CORE CONFIG #
grpc.port: 6565
http.port: 8080
feast.jobs:
  coreUri: localhost:6565
  runner: DataflowRunner
  options:
    project: my-dataflow-runner-project
    region: asia-east1
    tempLocation: gs://feast-temp-bucket #also required for bq ingestion
    subnetwork: regions/asia-east1/subnetworks/default
  executable: feast-ingestion.jar
  errorsStore:
    type: file.json
    options:
      path: gs://feast-errors
  monitoring:
    period: 5000
    initialDelay: 600000
    statsd:
      host: localhost
      port: 8125

# DB CONFIG #
spring.jpa.properties.hibernate.format_sql: true
spring.datasource.url: "jdbc:postgresql://localhost:5432/postgres"
spring.datasource.username: postgres
spring.datasource.password: password
spring.jpa.hibernate.naming.physical-strategy: org.hibernate.boot.model.naming.PhysicalNamingStrategyStandardImpl
spring.jpa.hibernate.ddl-auto: update

# APP METRICS #
management.metrics.export.simple.enabled: false
management.metrics.export.statsd.enabled: true
management.metrics.export.statsd.host: localhost
management.metrics.export.statsd.port: 8125

We can still continue to support env vars for overriding config.

Update CLI to reflect recent changes in API

Namely: add the new UIService methods to the CLI.

[FlinkRunner] Core should not follow remote flink runner job to completion

Expected Behavior

Core should submit the jar to the flink cluster and then the subprocess should terminate, instead of following the job to completion.

Current Behavior

Running a job using the FlinkRunner is a blocking call that runs indefinitely (streaming) or until completion (batch)

Steps to reproduce

Run a job on a flink cluster

Possible Solution

Should follow the same process as DataflowRunner: submit job -> return job id -> allow the jobmanager to keep track of status of the job

We do need to retrieve the Flink job id from the Flink cluster in that case.

Ingestion should fail immediately when there are no valid stores

Expected Behavior

Starting a job with invalid stores (e.g. using redis as a warehouse store) should not be allowed, and should fail quickly - ideally at the graph building step of ingestion.

Current Behavior

Starting a job with invalid stores will successfully send the job to the runner, which will run to completion (or indefinitely, in the case of streaming jobs). The errors will be logged, but the pipeline will run with no problems.

Steps to reproduce

Register a feature with its warehouse sink pointing to a serving store (e.g. redis)
Run a job (direct runner is the best way to view errors)
Pipeline runs successfully, a successful response is returned to the caller

Possible Solution

PR #11 is a band-aid solution to this problem: it checks the store types at registration, ensuring that a feature is unable to use a serving store for warehousing, but ideally ingestion error out properly during graph building.

Create CONTRIBUTING documentation

Create proper OWNERS files for each sub-component

In order to speed up reviews and getting PRs merged, we should have proper OWNERS files defined. These YAML files are used by Prow to assign reviewers and approvers for specific sub-directories of the project. This means the appropriate people are assigned to PRs faster, and they can review and approve the PR faster because they are also more familiar with that area of the code

The OWNERS files also allow us to give localized /lgtm and /approve access to new contributors.

Please see this document for more details.

Create Getting Started documentation

Is your feature request related to a problem? Please describe.
The Getting Started doc will make it easier for new developers to come in, test out, and contribute to Feast.

This doc might include:

Installation
Usage
Dev Environment Setup

Error Store should not require a storage spec

The error store should be defined on deployment (of core), and should not require a spec.

Ability to pass runner option during ingestion job submission

Runner option is currently hardcoded in deployment (specifically core's deployment). Some ingestion jobs might need to override those value without redeploying feast.
One way to do it is by adding runnerOptions field in the SubmitImportJobRequest as follow:

 message SubmitImportJobRequest {
        feast.specs.ImportSpec importSpec = 1;
        string name = 2; 
        map<string, string> runnerOptions = 3; // NEW, optional
    }

This runnerOptions is then used to override default runner option and then passed to the runner.

thought? @woop @zhilingc

[FlinkRunner] Ingestion job tries to connect to every store available in core

Expected Behavior

Should only connect to the relevant stores specified in the FeatureSpec.

Current Behavior

Will pull and attempt to connect to all the stores once the job is sent to the remote Flink cluster. If there are stores it cannot connect to, the entire job will fail.

Steps to reproduce

Register a storage that cannot be reached (e.g. 1.2.3.4:6379)
Run a job (not using that store) on a remote flink cluster

Possible Solution

Since we only pull specs at the graph building stage now, ingestion shouldn't have to retrieve and cache all available stores, only the stores it needs.

Fix unit tests script

I've set up Prow to run unit tests. It clones Feast and then runs make test. The Makefile is here

Currently the tests are implemented, but they throw an error when running mvn test. This needs to be fixed, and then the test command in the Makefile should be updated.

Go tests failing for CLI

Current Behavior

Go tests are failing for the CLI because the test cases are hardcoded to the author's timezone (mine).

Steps to reproduce

Run test on printer_test.go: go test printer_test.go

Possible Solution

Either fix the timezone to UTC or make the expected test output timezone sensitive as well.

TF file for cloud build trigger broken

Current Behavior

Getting the following error:

Error: Error parsing .../projects/feast/testing/tf/cloud-build/cloud-build.tf: At 5:176: illegal char escape

Due to the github tag regex.

Add python tests to unit testing script

Currently only java code is tested before being merged into master. Might want to add checks for any python code thats added to the sdk.

Move source types into their own package and discover them using java.util.ServiceLoader

Is your feature request related to a problem? Please describe.
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
The ingestion sources are currently mingled in with all the other ingestion code in the transforms package. This makes it hard to distinguish what all the sources are and makes it trickier for people to see how they might extend feast to support new sources.

Describe the solution you'd like
I'd like to create a source package in the ingestion module, and create common factory type class that can create FeatureIO.Read instances and have a function for returning their type string. New source implementations need only implement this interface and use an @AutoService annotation to make it available.

The import job can then load all available source factories from the class path and find the one with the matching type string against the import spec.

This follows the same pattern we use for storage types and allows us a path for including new source types as external dependencies, or moving existing ones into their own module. And then making them available simply by adding their jars on the class path.

Option to create a resource without overwriting existing records

The core API currently only supports apply methods, which, if there is an existing resource with the same id, will update the existing record. There is no distinction between the response after creating a new resource vs. updating one.

There should be a way to create a resource - this action should fail if there is an existing resource already registered in the system.

We can either alter the existing apply method to take in an argument or add separate create methods.

FeatureRow proto to wrap FeatureRowKey and FeatureRowMessage

Since Feast should support the ingestion of data from Kafka topics (in the format of KV) using KafkaIO, the FeatureRow proto should be wrapped to have FeatureRowKey (K) and FeatureRowMessage (V).

Coalesce FeatureRows for improved "latest" value consistency in serving stores

Is your feature request related to a problem? Please describe.
Now that we plan to store features in the serving store as Latest only, instead of the full history. We need to add logic to ensure writes of feature values from older rows don't clobber new ones. This is especially a problem for batch, but we'd like to solve it generally for streaming as well.

Describe the solution you'd like
For batch streams, we will combine feature rows by key. Where the key is entity name + entity key.

For streaming, we will also do this, but we emit rows with a timer, keeping the previous value in state for future combines. We will evict state with another timeout timer, this will allow for very large key infrequently updated key spaces to be less demanding.

The feature rows should be stored in the warehouse store in the combined form, not as their raw separate rows, because we want it to reflect the rows as they are written to the serving stores. This ensures that a the equivalent of a Select * AS OF, query would reflect what was available in the serving stores.

We should also make sure that we can turn this feature off as an option in each import spec, as this could introduce performance hits in very high throughput streams with large backlogs. And is also not required for batch jobs that know they only have one row per entity key.

Describe alternatives you've considered
Alternatives are to keep each feature row separately in the serving stores and combine them in the serving api (which is what we started with). This is not as performant as many use cases require, and we'd prefer to optimize for serving speed. We are willing to accept that we cannot guarantee true consistency using the above approach, as it's possible that data already in the store may conflict with data being ingested in a new job. But we will assume that incoming feature data is eventually consistent and will overwrite any issues.

Additional context
The PR I'm working on for this issue is WIP and will require #87 to be finished first, as we should be grouping on the {entityName}+{entityKey}, not with the event timestamp + granularity. It would be more work to deal with the latest values in the stores store separately, so it's best we wait.

Newest (latest?) value of a feature

In an effort to make one of my creation pipeline bearable for Feast's ingestion, I tried to merge feature rows into one row if they have the same entity key, entity id, granularity, and event timestamp (basically everything in FeatureRowKey), but then I realized that there are no notion of "newer feature row" in Feast.

Try 1: Merge by FeatureRowKey.
Suppose I want to merge Feature Row by creating a KV of FeatureRowKey and FeatureRow.. This will work only if there are no conflicting feature ids within a feature row. If there are, we can't compare them because we can't tell which one is the "newest".

Try 2: None granularity.
By messing around on the creation part, I can disregard Granularity entirely, group by entity key and entity name, and produce a FeatureRow every time a new feature arrives (or even windowed). This will eliminate the need of comparing for "newest" feature if there are conflicting feature ids, but also takes up more resources. Not to mention that it'll bloat result counts which defeats the above effort purpose in the first place.

Are there any workaround on this?

I think if we disregard Granularity and think of event timestamp as processing timestamp, we can leave the creation part with more room to the way it produces FeatureRow.

I myself would rather have my FeatureRow produced less frequent, fat, and full of values if my resources allowed it, rather than very frequent, lean, and full of nulls.

Related : #53 #17

Toggle data stores using flags in feature specifications

I'm considering whether it is cleaner to have options within feature specifications like useFeatureWarehouse: false to enable or disable a specific type of storage option (serving, warehouse).

Thoughts?

Related to: #5

Ingestion is not ignoring unknown feature in streaming source

Expected Behavior

Ingestion should ignore feature ID in FeatureRow that was not specified in Import spec

Current Behavior

Ingestion tried to ingest the unknown feature and throw following exception:

"transform":"Convert feature types","message":"Unknown feature myentity.none.unknown_feature, spec was not initialized","stackTrace":"java.lang.IllegalArgumentException: Unknown feature myentity.none.unknown_feature, spec was not initialized\n\tat com.google.common.base.Preconditions.checkArgument(Preconditions.java:141)\n\tat feast.ingestion.model.Specs.getFeatureSpec(Specs.java:148)\n\tat feast.ingestion.transform.fn.ConvertTypesDoFn.processElementImpl(ConvertTypesDoFn.java:44)\n\tat feast.ingestion.transform.fn.BaseFeatureDoFn.baseProcessElement(BaseFeatureDoFn.java:41)\n\tat feast.ingestion.transform.fn.ConvertTypesDoFn$DoFnInvoker.invokeProcessElement(Unknown Source)\n\tat org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:275)\n\tat org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:240)\n\tat org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:325)\n\tat org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)\n\tat org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)\n\tat org.apache.beam.runners.dataflow.worker.SimpleParDoFn$1.output(SimpleParDoFn.java:272)\n\tat org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:309)\n\tat org.apache.beam.runners.core.SimpleDoFnRunner.access$700(SimpleDoFnRunner.java:77)\n\tat org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:621)\n\tat org.apache.beam.sdk.transforms.DoFnOutputReceivers$WindowedContextOutputReceiver.output(DoFnOutputReceivers.java:71)\n\tat

Steps to reproduce

Run ingestion with streaming source (PubSub / Kafka) and publish FeatureRow with unknown feature inside the stream.

Vulnerability in dependency (core - jackson-databind )

8 com.fasterxml.jackson.core:jackson-databind vulnerabilities found in core/pom.xml a day ago
https://github.com/gojek/feast/network/alert/core/pom.xml/com.fasterxml.jackson.core:jackson-databind/open

Remediation

Upgrade com.fasterxml.jackson.core:jackson-databind to version 2.9.8 or later. For example:

<dependency>
  <groupId>com.fasterxml.jackson.core</groupId>
  <artifactId>jackson-databind</artifactId>
  <version>[2.9.8,)</version>
</dependency>

feast-dev / feast Goto Github PK

feast's Introduction

Overview

📐 Architecture

🐣 Getting Started

1. Install Feast

2. Create a feature repository

3. Register your feature definitions and set up your feature store

4. Explore your data in the web UI (experimental)

5. Build a training dataset

6. Load feature values into your online store

7. Read online features at low latency

📦 Functionality and Roadmap

🎓 Important Resources

👋 Contributing

✨ Contributors

feast's People

Contributors

Stargazers

Watchers

Forkers

feast's Issues

Possible Solution

What is granularity

Limitations of current approach

My proposal

Benefits

Current Behavior

Specifications

Possible Solution

Expected Behavior

Current Behavior

Steps to reproduce

Specifications

Expected Behavior

Current Behavior

Possible Solution

Expected Behavior

Current Behavior

Steps to reproduce

Possible Solution

Expected Behavior

Current Behavior

Steps to reproduce

Possible Solution

Expected Behavior

Current Behavior

Steps to reproduce

Possible Solution

Current Behavior

Steps to reproduce

Possible Solution

Current Behavior

Expected Behavior

Current Behavior

Steps to reproduce

Remediation

Recommend Projects

Recommend Topics

Recommend Org