
anomaly-detection's People

Contributors

jykfan, pims


anomaly-detection's Issues

Bug - historical results cannot be replicated

For example:

On 2016-03-21, the log messages say account 51846 has an anomaly, but rerunning the same anomaly detection algorithm with the same parameters produces a "not anomaly" result. Note: the current run of DBSCAN says that NO date in March is an anomaly, which rules out off-by-one timezone confusion.

 73294 2016-03-21 09:20:07,534 - root - INFO - anom_date=2016-03-20 account_id=51846
 73295 2016-03-21 09:20:07,554 - root - INFO - Success insertion into anomaly_results_raw for account_id=51846 target_date=2016-03-21 00:00:00 alg_id=1 row_id=1469045.
 73296 2016-03-21 09:20:07,558 - root - INFO - anom_date=2016-03-20 account_id=51846
 73297 2016-03-21 09:20:07,566 - root - INFO - Success insertion into anomaly_results_raw for account_id=51846 target_date=2016-03-21 00:00:00 alg_id=4 row_id=1469046.
 73298 2016-03-21 09:20:07,569 - root - INFO - anom_date=2016-03-20 account_id=51846
 73299 2016-03-21 09:20:07,579 - root - INFO - Success insertion into anomaly_results_raw for account_id=51846 target_date=2016-03-21 00:00:00 alg_id=3 row_id=1469047.
 73300 2016-03-21 09:20:24,611 - root - INFO - one_data_length=176

This was found by rerunning the algorithms against past user responses.

On 2016-03-21

actu: [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
pred: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0] <= This should be all 1's. Bug here.

On 2016-03-22

actu: [0, 0, 1, 0, 1, 0, 0, 0]

pred: [0, 0, 0, 0, 1, 1, 0, 0]
pred: [0, 0, 1, 0, 0, 0, 1, 0] <= minus 1 day
pred: [0, 1, 1, 0, 0, 0, 0, 0] <= minus 2 days

The problem seems to disappear on 2016-03-28, which may explain the sudden improvement in accuracy.

On 2016-03-28

actu: [1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1]
pred: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
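To put numbers on the comparisons above, here is a minimal sketch that scores each rerun against the recorded responses. It assumes `actu` is the user response per flagged date and `pred` is the rerun output for the same dates, in the same order; the `agreement` helper and the run labels are hypothetical names, not part of the repo.

```python
def agreement(actual, predicted):
    """Fraction of positions where the rerun matches the recorded response."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("cannot score an empty run")
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)


# Vectors copied from the issue text above.
RUNS = {
    "2016-03-21": ([1, 0, 0, 1, 0, 0, 0, 0, 1, 0],
                   [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
    "2016-03-22": ([0, 0, 1, 0, 1, 0, 0, 0],
                   [0, 0, 0, 0, 1, 1, 0, 0]),
    "2016-03-22 minus 1 day": ([0, 0, 1, 0, 1, 0, 0, 0],
                               [0, 0, 1, 0, 0, 0, 1, 0]),
    "2016-03-22 minus 2 days": ([0, 0, 1, 0, 1, 0, 0, 0],
                                [0, 1, 1, 0, 0, 0, 0, 0]),
}

if __name__ == "__main__":
    for label, (actu, pred) in RUNS.items():
        # How many dates the rerun still flags; the original run flagged all of them.
        print(label, "agreement=%.2f" % agreement(actu, pred),
              "flagged=%d/%d" % (sum(pred), len(pred)))
```

The `flagged` count is probably the more useful signal for this bug: if every date in `actu` was flagged by the original run, a faithful rerun should flag all of them again, regardless of how the user responded.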

Duplicates

The current query for pulling data is as follows:

SELECT SUM(ambient_light), count(1), date_trunc('hour', local_utc_ts) AS hour
  FROM prod_sense_data 
  WHERE account_id = 26173 
  AND local_utc_ts > '2016-01-25' 
  AND local_utc_ts < '2016-01-26' 
  AND extract('hour' from local_utc_ts) < 6
  GROUP BY hour
  ORDER BY hour ASC;

The problem is that there are duplicate rows, so count(1) can exceed 60 for a single hour. The duplication is also uneven within an hour for a given user, which means sum(ambient_light) can be biased toward whichever points happen to be duplicated.

Some options to combat this:

  • DBSCAN currently runs on sum(ambient_light) rather than avg(), for convenience and under the assumption that there are no duplicates. Normalizing the sum() by the row count will help with de-biasing.
  • Can we add a unique id in Redshift to make sure there are no duplicates?
  • Can we edit the above SQL query to filter out duplicates?
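The first and third options can be sketched on the client side in Python. This is only an illustration under a stated assumption: that a reading is uniquely identified by (account_id, timestamp), so any repeated pair is a duplicate. The function name `hourly_light` and the row layout are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def hourly_light(rows, dedupe=True):
    """Aggregate (account_id, ts, ambient_light) rows per hour.

    Returns {hour: (sum, count, mean)}, ordered by hour.
    """
    if dedupe:
        # Assumption: (account_id, ts) identifies one reading; keep the first copy.
        first_seen = {}
        for account_id, ts, light in rows:
            first_seen.setdefault((account_id, ts), light)
        rows = [(acct, ts, light) for (acct, ts), light in first_seen.items()]

    buckets = defaultdict(list)
    for _, ts, light in rows:
        # Same bucketing as date_trunc('hour', ...) in the query above.
        buckets[ts.replace(minute=0, second=0, microsecond=0)].append(light)

    return {hour: (sum(vals), len(vals), sum(vals) / len(vals))
            for hour, vals in sorted(buckets.items())}


if __name__ == "__main__":
    sample = [
        (26173, datetime(2016, 1, 25, 0, 0), 10.0),
        (26173, datetime(2016, 1, 25, 0, 0), 10.0),  # duplicate row
        (26173, datetime(2016, 1, 25, 0, 1), 20.0),
    ]
    print(hourly_light(sample, dedupe=False))
    print(hourly_light(sample, dedupe=True))
```

Dividing by the count alone removes the scale problem but not the bias, since a duplicated minute still counts twice in the mean. Deduplicating first addresses both.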

supervisor error debugging

SUMMARY

  • When supervisor tries to start the process on branch jyfan/multi_alg, I get a spawn error.
  • When I run python run.py configs/prod.yml directly instead of through supervisor, the process starts successfully.
  • When supervisor tries to start the process on branch master, it starts successfully.

So the supervisor error is caused by some change I made on branch jyfan/multi_alg. I am trying to figure out which change did it, but I don't understand how running the Python script via supervisor differs from running it directly.

cc @pims
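One way to narrow this down: a process started by supervisor can see a different working directory, environment, and user than one started from an interactive shell. The sketch below records those at startup so the two launches can be compared. It is a hypothetical helper, not code from the repo, and it only helps if the process gets far enough to run it; otherwise the stderr log that supervisor keeps for the program (for example via supervisorctl tail anomaly stderr) is the first place to look.

```python
import json
import os
import sys


def runtime_snapshot():
    """Collect the launch context that commonly differs between supervisor and a shell."""
    return {
        "cwd": os.getcwd(),
        "executable": sys.executable,
        "python_version": sys.version.split()[0],
        "path": os.environ.get("PATH", ""),
        "pythonpath": os.environ.get("PYTHONPATH", ""),
        "home": os.environ.get("HOME", ""),
        # os.getuid is only available on POSIX systems.
        "uid": os.getuid() if hasattr(os, "getuid") else None,
        "argv": list(sys.argv),
    }


def diff_snapshots(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key whose values differ."""
    return {key: (a.get(key), b.get(key))
            for key in sorted(set(a) | set(b))
            if a.get(key) != b.get(key)}


if __name__ == "__main__":
    # Written to stderr so it ends up in the process log in both launch modes.
    sys.stderr.write(json.dumps(runtime_snapshot(), indent=2, sort_keys=True) + "\n")
```

Calling `runtime_snapshot()` at the top of run.py under both launch methods and passing the two results to `diff_snapshots` would show whether the new branch depends on something (a relative path, an environment variable, a module on PYTHONPATH) that only exists in the interactive shell.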

ubuntu@ip-10-0-0-47:~/anomaly-detection$ git branch
* jyfan/multi_alg
  master
ubuntu@ip-10-0-0-47:~/anomaly-detection$ git checkout master
Switched to branch 'master'
Your branch is up-to-date with 'origin/master'.
ubuntu@ip-10-0-0-47:~/anomaly-detection$ supervisorctl start anomaly:*
anomaly: started
ubuntu@ip-10-0-0-47:~/anomaly-detection$ supervisorctl status
anomaly                          RUNNING   pid 15331, uptime 0:00:21
ubuntu@ip-10-0-0-47:~/anomaly-detection$ supervisorctl stop anomaly:*                                                                                                                         
anomaly: stopped
ubuntu@ip-10-0-0-47:~/anomaly-detection$ supervisorctl status                                                                                                                                 
anomaly                          STOPPED   Jan 13 07:19 PM
ubuntu@ip-10-0-0-47:~/anomaly-detection$ git checkout jyfan/multi_alg
Switched to branch 'jyfan/multi_alg'
Your branch is up-to-date with 'origin/jyfan/multi_alg'.
ubuntu@ip-10-0-0-47:~/anomaly-detection$ git pull
Username for 'https://github.com': jykfan
Password for 'https://[email protected]': 
Already up-to-date.
ubuntu@ip-10-0-0-47:~/anomaly-detection$ supervisorctl status
anomaly                          STOPPED   Jan 13 07:19 PM
ubuntu@ip-10-0-0-47:~/anomaly-detection$ supervisorctl start anomaly:*
anomaly: ERROR (spawn error)
ubuntu@ip-10-0-0-47:~/anomaly-detection$ supervisorctl status
anomaly                          FATAL     Exited too quickly (process log may have details)
ubuntu@ip-10-0-0-47:~/anomaly-detection$ python run.py configs/prod.yml 
2016-01-13 19:24:07,558 - __main__ - INFO - test
2016-01-13 19:24:07,781 - root - DEBUG - Found 17500 account_ids
2016-01-13 19:24:16,727 - root - INFO - 2016-01-01 is an anomaly for account 32769
2016-01-13 19:24:16,727 - root - INFO - query: INSERT INTO anomaly_results_raw (account_id, target_date, anomaly_days, alg_id) VALUES ('32769', '2016-01-13T00:00:00'::timestamp, ARRAY['2016-01-01T00:00:00'::timestamp], '1') RETURNING id
2016-01-13 19:24:16,734 - root - INFO - Success insertion into anomaly_results_raw for account_id=32769 target_date=2016-01-13 00:00:00 alg_id=1 row_id=281150.
^CTraceback (most recent call last):
  File "run.py", line 47, in <module>
    main()
  File "run.py", line 41, in main
    app.run(account_id, conn_sensors, conn_anomaly, dbscan_params[dbscan_params_i])        
  File "/home/ubuntu/anomaly-detection/app/logic.py", line 100, in run
    ORDER BY hour ASC""", dict(account_id=account_id, start=thirty_days_ago, end=now))
  File "<string>", line 8, in __new__
KeyboardInterrupt
