aws-solutions / improving-forecast-accuracy-with-machine-learning

The Improving Forecast Accuracy with Machine Learning solution generates, tests, compares, and iterates on Amazon Forecast forecasts. The solution automatically produces forecasts and generates visualization dashboards for Amazon QuickSight or Amazon SageMaker Jupyter Notebooks—providing a quick, easy, drag-and-drop interface that displays time series input and forecasted output.

Home Page: https://aws.amazon.com/solutions/implementations/improving-forecast-accuracy-with-machine-learning

License: Apache License 2.0

Languages: Python 99.18%, Shell 0.22%, Jupyter Notebook 0.60%
Topics: amazon-forecast, generating-forecasts, multiple-forecasts, developing-forecasts

improving-forecast-accuracy-with-machine-learning's Introduction

Deprecation Notice

As of 01/01/2024, Improving Forecast Accuracy with Machine Learning has been deprecated and will not be receiving any additional features or updates.

Improving Forecast Accuracy with Machine Learning

The Improving Forecast Accuracy with Machine Learning solution is designed to help organizations that rely on generating accurate forecasts and that store historical demand time series data. Whether organizations are developing forecasts for the first time or optimizing their current pipeline, this solution reduces the overhead of generating forecasts from time series data, related time series data, and item metadata.

This solution supports multiple forecasts and per-forecast parameter configuration to reduce the repetitive work of generating multiple forecasts. An AWS Step Functions state machine eliminates the undifferentiated heavy lifting of creating Amazon Forecast datasets, dataset groups, predictors, and forecasts, allowing developers and data scientists to focus on the accuracy of their forecasts. Amazon Forecast predictors and forecasts can be updated as item demand data, related time series data, and item metadata are refreshed, which allows for A/B testing against different sets of related time series data and item metadata.

After forecast exports and Athena tables are created, the solution automatically creates an interactive and shareable analysis in Amazon QuickSight with pre-set visuals and configurations to analyze your forecasts.

To better capture and alert users of data quality issues, a configurable alert function can also be deployed with Amazon Simple Notification Service (Amazon SNS). This notifies the user on success and failure of the automated forecasting job, reducing the need for users to monitor their forecast workflow.
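For example, a minimal boto3 sketch for adding another email subscriber to the deployed notification topic; the topic ARN and address below are placeholders, not values the stack actually outputs:

import boto3

sns = boto3.client("sns")

# Placeholder ARN: substitute the SNS topic created by your stack.
topic_arn = "arn:aws:sns:us-east-1:123456789012:forecast-stack-NotificationTopic"

# Subscribe an additional address; the recipient must confirm via email.
sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint="analyst@example.com")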

This guide provides infrastructure and configuration information for planning and deploying the solution in AWS.

Architecture

The following describes the architecture of the solution:

[Architecture diagram]

The AWS CloudFormation template deploys the resources required to automate your Amazon Forecast usage and deployments. Based on the capabilities of the solution, the architecture is divided into three parts: Data Preparation, Forecasting, and Data Visualization. The template includes the following components:

  • An Amazon Simple Storage Service (Amazon S3) bucket for Amazon Forecast configuration where you specify configuration settings for your dataset groups, datasets, predictors and forecasts, as well as the datasets themselves.
  • An Amazon S3 event notification that triggers when new datasets are uploaded to the related Amazon S3 bucket.
  • An AWS Step Functions state machine that orchestrates a series of AWS Lambda functions to build, train, and deploy your machine learning (ML) models in Amazon Forecast.
  • An Amazon Simple Notification Service (Amazon SNS) topic and email subscription that notify an administrator of the results of the Step Functions state machine.
  • An optional Amazon SageMaker Notebook Instance that data scientists and developers can use to prepare and process data, and evaluate your Forecast output.
  • An AWS Glue job that combines your input data and forecast output into a single view that can be queried with standard SQL using Amazon Athena (see the example query after this list).
  • An Amazon QuickSight analysis to help visualize forecast output.
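For illustration, a hedged sketch of querying the consolidated view through Athena with boto3; the database, table, column, and output-location names are assumptions, not the names the AWS Glue job actually creates:

import boto3

athena = boto3.client("athena")

# All names below are placeholders; substitute the database/table your stack creates.
response = athena.start_query_execution(
    QueryString=(
        "SELECT item_id, timestamp, target_value, p50 "
        "FROM forecast_export_consolidated "
        "ORDER BY item_id, timestamp LIMIT 100"
    ),
    QueryExecutionContext={"Database": "forecast_data"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)
print(response["QueryExecutionId"])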

Note: Upgrading to v1.4.0 from earlier versions is not supported. Please redeploy the stack and copy your configuration and data to the newly created forecast data bucket.

Note: As of v1.2.0, all AWS CloudFormation template resources are created by the AWS CDK and AWS Solutions Constructs. Stateful CloudFormation resources keep the same logical IDs across releases, making the solution upgradable in place.

AWS CDK Constructs

AWS CDK Solutions Constructs make it easier to consistently create well-architected applications. All AWS Solutions Constructs are reviewed by AWS and use best practices established by the AWS Well-Architected Framework. This solution is built from AWS CDK Solutions Constructs.

Getting Started

You can launch this solution with one click from AWS Solutions Implementations.

To customize or contribute to the solution, follow the steps below:

Prerequisites

The following procedure assumes that all OS-level configuration has been completed.

1. Clone the repository

git clone https://github.com/aws-solutions/improving-forecast-accuracy-with-machine-learning

2. Build the solution for deployment

Follow the steps in source/infrastructure/README.md to deploy the solution using the AWS CDK.

3. Test or demo the solution

To test the solution or run a demo, follow the synthetic data generation instructions in source/synthetic/README.md.

Collection of operational metrics

This solution collects anonymous operational metrics to help AWS improve the quality and features of the solution. For more information, including how to disable this capability, please see the implementation guide.


Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

improving-forecast-accuracy-with-machine-learning's People

Contributors

aassadza, amazon-auto, bios6, dscpinheiro, fhoueto-amz, knihit, mahedi99, pwrmiller, saccomi, tabdunabi


improving-forecast-accuracy-with-machine-learning's Issues

RTS features must be numeric (demo data has strings)

Describe the bug

As per the Amazon Forecast Related Time Series validation doc:

  • Related time series feature data must be of the int or float datatypes.

The pre-prepared nyctaxi_weather_auto demo data uses a string field, day_hour_name, which produces the following error notification (email):

There was an error running the forecast job for dataset group nyctaxi_weather_auto

Message: An error occurred (InvalidInputException) when calling the CreatePredictor operation: The attribute(s) [day_hour_name] present in the RELATED_TIME_SERIES schema should be of numeric type such as integer or float, or be added as a forecast dimension

To Reproduce

Deploy the solution with the optional pre-prepared NYC taxi demo data (and manually download/re-upload the datasets to kick off the pipeline if #9 is not yet fixed).

Expected behavior

Pipeline should deploy and create forecasts without errors on the demo dataset.

Please complete the following information about the solution:

  • Version: v1.4.0


  • Region: us-east-1
  • Was the solution modified from the version published on this repository? No
  • If the answer to the previous question was yes, are the changes available on GitHub?
  • Have you checked your service quotas for the services this solution uses?
  • Were there any errors in the CloudWatch Logs?


Additional context

This field is not present in the TTS schema, so I'm not sure it would be possible to add it to the list of forecast dimensions even if it made sense (which for this field I believe it might not).
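A possible workaround until the demo data is fixed, sketched with pandas; the RTS file name is assumed from the dataset group name above:

import pandas as pd

# Drop the string column that CreatePredictor rejects, then re-upload the file
# to the data bucket to re-trigger the pipeline. File name is an assumption.
rts = pd.read_csv("nyctaxi_weather_auto.related.csv")
rts = rts.drop(columns=["day_hour_name"])
rts.to_csv("nyctaxi_weather_auto.related.csv", index=False)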

Data Import Lambda function

In the import data step, the documentation states: "To determine if your data must be imported, the solution checks whether data is present, and if its specification has changed, or if the .csv being imported has a different number of rows". I assume this refers to createdatasetimportjob/handler.py, but I don't see which part of the code performs these checks (row count, specification).

In the code, I only see a check for whether the dataset exists as follows:

if dataset_import.status == Status.DOES_NOT_EXIST:
    dataset_import.create()

Thanks in advance

On uploading retail CSV files, the step function triggers but gets stuck at predictor creation and never ends

Describe the bug
On uploading retail CSV files, the step function triggers but gets stuck at predictor creation and never ends.

To Reproduce
Deploy via the CloudFormation stack, upload the forecast config file for retail, and upload the demand CSV files to the S3 data bucket; the step function will trigger but will stay stuck forever.

Expected behavior
The step function should complete.

Please complete the following information about the solution:

  • Version: Latest


  • Region: us-east-1
  • Was the solution modified from the version published on this repository? - No
  • If the answer to the previous question was yes, are the changes available on GitHub?
  • Have you checked your service quotas for the services this solution uses? - Yes
  • Were there any errors in the CloudWatch Logs? - Lambda logs says that resource creation pending and it keep going in loops

Screenshots

Screenshot 2023-09-01 at 1 16 29 PM Screenshot 2023-09-01 at 1 34 43 PM


Missing key or value for Dataset.Domain error occurs

Thanks for this nice project.
By using this project, I think I would be able to forecast the future, and I'm really excited!

By the way, when I was trying to use this software, I hit an error that I couldn't solve. I describe below how I got it.

I created my Environment through the link in this page.

I uploaded forecast-defaults.yaml, located at source/example/1-defaults in this project, to the S3 bucket whose name starts with forecast-stack-data-bucket-.

After that, I created a train folder and uploaded ts.csv and ts.metadata.csv, which were created by following this README.md. The command I used to generate ts.csv and ts.metadata.csv was ./create_synthetic_data.py --start 2000-01-01 --length 105120.

Then the step function started automatically, but the Create-DatasetGroup Lambda function raised the error shown below.

"statesError": {
    "Error": "ValueError",
    "Cause": "{\"errorMessage\": \"configuration item missing key or value for Dataset.Domain\", \"errorType\": \"ValueError\", \"stackTrace\": [\"  File \\\"/var/task/shared/helpers.py\\\", line 62, in wrapper\\n    (status, output) = f(event, context)\\n\", \"  File \\\"/var/task/handler.py\\\", line 30, in createdatasetgroup\\n    dataset_groups = config.dataset_groups(dataset_file)\\n\", \"  File \\\"/var/task/shared/config.py\\\", line 249, in dataset_groups\\n    ds = self.dataset(dataset_file)\\n\", \"  File \\\"/var/task/shared/config.py\\\", line 149, in dataset\\n    \\\"dataset_domain\\\": self.dataset_domain(dataset_file),\\n\", \"  File \\\"/var/task/shared/config.py\\\", line 98, in dataset_domain\\n    domain = self.config_item(dataset_file, \\\"Dataset.Domain\\\")\\n\", \"  File \\\"/var/task/shared/config.py\\\", line 88, in config_item\\n    raise ValueError(f\\\"configuration item missing key or value for {item}\\\")\\n\"]}"
  }

I don't know whether this is a bug or a configuration issue. I couldn't find any documentation about this error and have no idea how to fix it. If you know how, please let me know.

Thanks.
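Not a fix, but a quick diagnostic sketch to check the uploaded defaults file before the state machine does; the key names are inferred from the ValueError above and the METRICS example later on this page, so treat them as assumptions:

import yaml  # pip install pyyaml

with open("forecast-defaults.yaml") as f:
    cfg = yaml.safe_load(f)

# Key names inferred from the error message; adjust to your config layout.
for name, section in cfg.items():
    if not isinstance(section, dict):
        continue
    for i, dataset in enumerate(section.get("Datasets") or []):
        if not dataset.get("Domain"):
            print(f"{name}: dataset #{i} is missing Dataset.Domain")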

QuickSight analysis creation failed in step function

Describe the bug
QuickSight analysis creation fails in the step function.

To Reproduce
Upload a sample CSV file; the step function will trigger but fails at the QuickSight analysis creation phase.

Expected behavior
A QuickSight dashboard should have been created.

Please complete the following information about the solution:

  • Version: Latest


  • Region: us-east-1
  • Was the solution modified from the version published on this repository? No
  • If the answer to the previous question was yes, are the changes available on GitHub?
  • Have you checked your service quotas for the services this solution uses? Yes
  • Were there any errors in the CloudWatch Logs? Yes

Screenshots
Screenshot 2023-09-01 at 4 11 32 PM
Screenshot 2023-09-01 at 4 11 07 PM


RTS & Metadata error when used in 'Default' config

Describe the bug

When using the special Default configuration (for example, when a deployment has only one config), setting up an RTS or item metadata dataset causes errors, because the solution tries to start the import for these datasets without first creating them.

To Reproduce

  • Set up forecast_defaults.yaml with only a Default configuration, and in this default configuration specify RTS and Metadata datasets as well as TTS.
  • Upload a set of testdata.csv, testdata.related.csv, and testdata.metadata.csv to S3 (all at once)

The state machine will error, because the solution tries to import testdata.related.csv and testdata.metadata.csv before actually creating the datasets in the DSG.

Expected behavior

The solution should create all three datasets, import data, and train predictors, as it would if the same settings were made in an explicitly named configuration instead of the Default.

Please complete the following information about the solution:

  • Version: v1.4.0


  • Region: ap-southeast-1
  • Was the solution modified from the version published on this repository? No
  • If the answer to the previous question was yes, are the changes available on GitHub?
  • Have you checked your service quotas for the services this solution uses?
  • Were there any errors in the CloudWatch Logs?

Screenshots

N/A

Additional context

This seems to be caused by the override in Config.required_datasets(), which sets the list of "required" dataset types to [TTS] whenever the Default configuration is being used, regardless of what is actually specified in the Default configuration. See the illustrative sketch below.
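To illustrate, a hypothetical reconstruction of the suspected logic; this is not the actual shared/config.py source, and every name below is an assumption:

TTS, RTS, METADATA = "TARGET_TIME_SERIES", "RELATED_TIME_SERIES", "ITEM_METADATA"

def required_datasets(declared_types, is_default_config):
    # Suspected bug: under the Default configuration only TTS is treated as
    # required, so RTS/metadata imports start before their datasets exist in
    # the dataset group, regardless of what Default actually declares.
    if is_default_config:
        return [TTS]
    return declared_types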

add SupplementaryFeatures options (such as public holidays)

The solution does not currently allow adding SupplementaryFeatures such as public holidays. It overwrites InputDataConfig when creating the predictor without taking into account any SupplementaryFeatures passed in the forecast-defaults config file.
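For reference, the legacy CreatePredictor API does accept SupplementaryFeatures inside InputDataConfig, so a pass-through is what's being requested. A minimal boto3 sketch; the ARN, names, and horizon are placeholders:

import boto3

forecast = boto3.client("forecast")

# Placeholder values throughout; the point is the SupplementaryFeatures entry,
# which enables the built-in holiday calendar for a given country.
forecast.create_predictor(
    PredictorName="predictor_with_holidays",
    ForecastHorizon=72,
    PerformAutoML=True,
    InputDataConfig={
        "DatasetGroupArn": "arn:aws:forecast:us-east-1:123456789012:dataset-group/demo",
        "SupplementaryFeatures": [{"Name": "holiday", "Value": "US"}],
    },
    FeaturizationConfig={"ForecastFrequency": "D"},
)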

Deploy with data downloader not automatically kicking off pipeline

In the past I've mainly used the Amazon Forecast pre-PoC workshop templates to deploy this solution with demo data, but since the upgrade came out I thought I'd try the direct method from the canonical template listed on the solution page.

Unfortunately, it seems that (because of the dependency order of resources) the standard install template doesn't wait for the state machine triggers to be configured before loading the demo data into the data bucket, and therefore doesn't automatically kick off the first round of forecasting.

For me this is not ideal: it's helpful for the one-click deploy with demo data to also trigger the Forecast pipeline, for an easy demonstration of the functionality. At the moment we need to manually kick the pipeline, e.g. by downloading and re-uploading one of the S3 data files; one possible workaround is sketched below.
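One way to kick the pipeline without downloading and re-uploading is an in-place S3 copy that replaces the object metadata, which emits a fresh ObjectCreated event; a hedged sketch with placeholder bucket and key names, assuming the bucket notification matches ObjectCreated events from copies:

import boto3

s3 = boto3.client("s3")

# Placeholder names; point these at the stack's data bucket and a dataset key.
bucket, key = "forecast-stack-data-bucket-example", "train/taxi.csv"

# Copying an object onto itself requires replacing its metadata; the copy
# emits a new ObjectCreated event, which should re-trigger the state machine.
s3.copy_object(
    Bucket=bucket,
    Key=key,
    CopySource={"Bucket": bucket, "Key": key},
    MetadataDirective="REPLACE",
)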

S3 bucket upload not triggering Step Functions, and AttributeError: NoneType

Describe the bug

Hi! I was able to create the stack without any errors, but when a modified version of the CSV file is uploaded to the S3 bucket, the step function is not automatically triggered and the Amazon Forecast AutoPredictor is not trained. I get the following error when I manually click Start execution in Step Functions.

[ERROR] AttributeError: 'NoneType' object has no attribute 'endswith'
Traceback (most recent call last):
File "/opt/python/shared/helpers.py", line 68, in wrapper
(status, output) = f(event, context)
File "/var/task/handler.py", line 29, in createdatasetgroup
dataset_file = DatasetFile(event.get("dataset_file"), event.get("bucket"))
File "/opt/python/shared/Dataset/dataset_file.py", line 36, in init
if key.endswith(".related.csv")

It seems to say there is no related time series CSV file, but that's optional according to the documentation. I therefore included a related time series CSV file, but it still gives the error. I provided both target and related time series datasets in the CloudFormation stack, but the value is still None.

I believe this error occurs because no event triggered the step function (I clicked the Start execution button instead). The AWS instructions for deploying Forecast models with CloudFormation stacks say that I should just upload a file to S3 and it will automatically trigger the step function that deploys the Forecast model. When I upload a modified version of the CSV file to S3, nothing happens and no error appears. Could you please help me figure out why the stack is not triggering automatically? Thanks.
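The traceback suggests the handler reads bucket and dataset_file from the execution input, which is empty when Start execution is clicked with the default payload. A hedged sketch of supplying that input manually; the key names come from the traceback, while the ARN, bucket, and key values are placeholders:

import json

import boto3

sfn = boto3.client("stepfunctions")

# Key names taken from the traceback (event.get("bucket"), event.get("dataset_file"));
# all values below are placeholders for your deployment.
sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:forecast-workflow",
    input=json.dumps({
        "bucket": "forecast-stack-data-bucket-example",
        "dataset_file": "train/metrics.csv",
    }),
)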

To Reproduce
https://docs.aws.amazon.com/solutions/latest/improving-forecast-accuracy-with-machine-learning/automated-deployment.html
Using the improving-forecast-accuracy-with-machine-learning.template and the following yaml configuration file:

Default:
  DatasetGroup:
    Domain: METRICS

  Datasets:
    - Domain: METRICS
      DatasetType: TARGET_TIME_SERIES
      DataFrequency: 15min
      TimestampFormat: yyyy-MM-dd HH:mm:ss
      Schema:
        Attributes:
          - AttributeName: timestamp
            AttributeType: timestamp
          - AttributeName: metric_name
            AttributeType: string
          - AttributeName: metric_value
            AttributeType: float

  AutoPredictor:
    PredictorName: predictor_v1
    MaxAge: 100
    ForecastHorizon: 3
    ForecastFrequency: 15min

  Forecast:
    ForecastTypes:
      - "0.10"
      - "0.50"
      - "0.90"

No additional advanced settings added when creating the stack.

Please complete the following information about the solution:

  • Version: [e.g. v1.4.0]

[Feature] Use existing (or explicitly-named new?) data bucket

Is your feature request related to a problem? Please describe.

On a recent project implementing this solution, we wanted to be able to deploy the stack against an existing S3 data bucket, or possibly set a particular name for the newly created data bucket. Doing so required customizing the CDK source.

Describe the feature you'd like

It would be great if the stack could expose an optional CloudFormation parameter letting users set the name of a pre-existing data bucket to hook into, instead of creating a new one.

As a second (less important) possibility, it would be nice if the stack supported explicitly naming a newly created data bucket.

Additional context

N/A

Stack deployment fails if stack name starts with `AWS`

Describe the bug
The stack deployment fails if the stack name starts with 'AWS'. CloudFormation throws an error when creating the AppRegistry construct. This may be a problem for customers whose stack names begin with 'AWS'.

To Reproduce

  • Go to the CloudFormation console
  • Click the Create stack dropdown and select With new resources
  • Specify the template with the appropriate S3 URL and click Next
  • In the stack name field, give any name starting with 'AWS', for example Aws-Improving-Forecast-ML
  • Fill out the rest of the fields as appropriate and select Create stack

This will start deploying the stack, and eventually will throw an error while creating the AppRegistry resource, causing the stack deployment to fail.

Expected behavior
The stack is expected to deploy successfully regardless of the stack name given by the customer.

Please complete the following information about the solution:

  • Version: v1.5.1
  • Region: [e.g. us-east-1]
  • Was the solution modified from the version published on this repository?
  • If the answer to the previous question was yes, are the changes available on GitHub?
  • Have you checked your service quotas for the services this solution uses?
  • Were there any errors in the CloudWatch Logs?

build-and-test dependencies install issues

Trying to build the CDK from source today I encountered a couple of (small) issues:

(1) In source/infrastructure/README we seem to be inconsistent about which folder the commands are run from: first cd'ing to source/infrastructure, but then running pip install -r source/requirements-build-and-test.txt.

I found that, for pip to resolve the editable dependencies like -e cdk_solution_helper_py/helpers_cdk, I had to be inside the source folder and run pip install -r requirements-build-and-test.txt.

(2) For a reason I don't quite understand, pip errored out due to a conflict with the resolved black release requiring click>=8.0.0, even though the version for black is left unspecified in requirements-build-and-test.txt. I'm not sure what's going on with the dependency resolution, as I'd have expected it to just pick a version of black that satisfies the other constraints, but for now I worked around it by explicitly requesting black==21.12b0.

Fails to create Jupyter Notebook instance: 'import boto3' not found

Describe the bug
The stack deployment fails when it is deployed with a Jupyter Notebook. CloudFormation throws an error when creating the Notebook instance. This may be a problem for customers who deploy the stack with a Notebook instance.

To Reproduce

  1. Create new stack from here: https://s3.amazonaws.com/solutions-reference/improving-forecast-accuracy-with-machine-learning/latest/improving-forecast-accuracy-with-machine-learning.template
  2. Set parameter Deploy Jupyter Notebook to Yes
  3. Deploy stack
  4. Stack deployment will fail with an error of failed to create: [NotebookInstance]

Expected behavior
The stack should deploy successfully, with the Notebook instance created.

Please complete the following information about the solution:

  • Version: v1.5.2
  • Region: us-east-1
  • Was the solution modified from the version published on this repository?
  • If the answer to the previous question was yes, are the changes available on GitHub?
  • Have you checked your service quotas for the services this solution uses?
  • Were there any errors in the CloudWatch Logs?

Notebook Instance Lifecycle Config 'arn:aws:sagemaker:us-east-1:<account-no>:notebook-instance-lifecycle-config/notebooklifecycleconfig48b89718-5ummmwwwzwqr' for Notebook Instance 'arn:aws:sagemaker:us-east-1:<account-no>:notebook-instance/forecast-stack-deployment-issue-aws-forecast-visualization' took longer than 5 minutes. Please check your CloudWatch logs for more details if your Notebook Instance has Internet access.
