dpcs-team / dpcs
Data Powered Crash Solver
ISSUE TAKEN
I need to fix them ASAP to unlock CI testing.
ETA: 29.03, will try earlier
During a meeting with Wojciech Jaworski PhD, we came up with the idea that it may be really helpful to create a Stack Overflow crawler that will try to match log fragments from users' questions with the captured log.
Think about it, research, write about 0.5 pages summing up your ideas and present it during the meeting.
Reward: 6p
Catch errors produced by ld.so (or ld-linux.so). Should be implemented as a service.
The server should use Python and a REST microframework. The API is available here: http://docs.dpcs.apiary.io/
Given the huge number of potential machines (2e6 active users contributing data), check whether HDFS is the right choice for the backend. Maybe there are other options that should be taken into consideration. Think of the possible paths we may choose in the future: memory-expensive system logs, computationally-expensive deep learning algorithms, sets of heuristics, NLP algorithms (shared knowledge), etc. Research current trends, write about 0.5 pages summing up your ideas and present it during the meeting.
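To ground the backend discussion, a back-of-envelope volume estimate helps. All three per-user figures below are assumptions for illustration only, not measurements from the project:

```python
# Rough, illustrative figures only -- every constant here is an assumption,
# except the 2e6 active-user count quoted in the issue text.
active_users = 2_000_000        # "2e6 active users" from the issue
reports_per_user_year = 50      # assumed average crash frequency
bytes_per_report = 10_000       # assumed ~10 KB per report (JSON + stderr)

yearly_bytes = active_users * reports_per_user_year * bytes_per_report
print(yearly_bytes / 1e12)      # terabytes per year
```

Under these assumptions the raw crash-report stream is on the order of 1 TB/year, which is comfortably within a single-cluster HDFS deployment but already enough that lifetime retention and lifelong system logs deserve a separate estimate.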
Description of the data preprocessing techniques we plan to use in the project:
Write a description of classification algorithms we plan to use in the project:
Main algorithm: Neural network
Supporting algorithms: Multiclass logistic regression
Maybe a simple set of heuristics would solve the problem efficiently and fast without using any fancy ML tools? Keep in mind we will be dealing with massive data and will also have to create an offline version of the algorithm for client-side classification, so it's worth thinking about how we can transfer the gathered knowledge into a more compact form. Think about it, research present trends, write about 0.5 pages summing up your ideas and present it during the meeting + write some specs for future implementation.
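As a minimal sketch of the "simple set of heuristics" idea: match common stderr patterns against a rule table before falling back to any ML model. The patterns and suggested fixes below are invented for the example:

```python
import re

# Toy rule table: (stderr pattern, canned suggestion). The contents are
# made up for the sketch; a real table would be distilled from data.
RULES = [
    (re.compile(r"No such file or directory"), "check that the path exists"),
    (re.compile(r"Permission denied"), "check file permissions / run with sudo"),
    (re.compile(r"command not found"), "install the missing package"),
]

def heuristic_fix(stderr_output):
    for pattern, fix in RULES:
        if pattern.search(stderr_output):
            return fix
    return None  # no rule matched -> hand over to the ML classifier

print(heuristic_fix("bash: foo: command not found"))
```

A table like this is trivially cheap to evaluate on the client, which is one answer to the "compact form" question above.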
I think the cleanest way to install a Python application into a system, even from a Debian package, is to use Python's setup.py script.
We need to create it and plug it into the Debian package installation script so that it is executed during the dpkg -i process.
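A minimal setup.py sketch for the client could look like the following. The package name, module path, and dependency list are assumptions based on other issues in this tracker (the dpcs-settings script name comes from the traceback reported below), not the project's actual layout:

```python
# setup.py -- packaging sketch; names and paths are assumptions.
from setuptools import setup, find_packages

setup(
    name='dpcs-client',
    version='0.1',
    packages=find_packages(),
    install_requires=['requests'],
    entry_points={
        'console_scripts': [
            # hypothetical module path for the settings entry point
            'dpcs-settings = dpcs.settings:main',
        ],
    },
)
```

The Debian postinst script could then invoke this setup.py so that dpkg -i performs the Python-level installation.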
Following #35 - write a description of a method for finding solutions using Stack Overflow data: how exactly would it be done, and is it really possible? Some experiment would be advisable - a piece of code to check if it is really possible.
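A quick feasibility probe could query the public Stack Exchange API with a fragment of a captured error line. The /search/advanced endpoint and its parameters are part of the documented Stack Exchange API v2.3; only the URL construction is shown here, so the sketch runs without network access:

```python
from urllib.parse import urlencode

def build_search_url(error_fragment):
    # /search/advanced with sort=relevance does free-text matching on 'q';
    # no API key is needed for low request volumes.
    params = urlencode({
        'q': error_fragment,
        'sort': 'relevance',
        'order': 'desc',
        'site': 'stackoverflow',
    })
    return 'https://api.stackexchange.com/2.3/search/advanced?' + params

print(build_search_url('ImportError: No module named pgi'))
```

Fetching the URL with requests (already a client dependency, and it transparently handles the API's gzip-compressed responses) and inspecting the returned item titles would show whether raw log lines match real questions.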
from urllib2 import Request, urlopen

values = """
{
  "crash_report": {
    "application": {
      "name": "Google Chrome",
      "version": "48.0.2564.116"
    },
    "system_info": {
      "version": "14.04.1 LTS"
    },
    "exit_code": 1,
    "stderr_output": "Lines from stdErr."
  }
}
"""
# note: the 'text/json' media type below is what triggers the 415
headers = {
    'Content-Type': 'text/json'
}
request = Request('http://54.93.105.103:8000/vd1/crash-reports/', data=values, headers=headers)
response_body = urlopen(request).read()
print response_body
results in
HTTPError: HTTP Error 415: Unsupported Media Type
See http://docs.dpcs.apiary.io/#reference/crashes/crash-report-collection/send-a-new-report
The bug can also be reproduced when using our client.
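A plausible explanation is that 'text/json' is not a registered media type, and most REST parsers only accept 'application/json'. A sketch of the corrected request follows (shown in Python 3's urllib; only the request object is built here, since actually sending it is the real test against the server):

```python
from urllib.request import Request

# Same payload as the repro above, abbreviated; the fix under test is
# only the Content-Type header.
values = b'{"crash_report": {"exit_code": 1, "stderr_output": "Lines from stdErr."}}'

request = Request('http://54.93.105.103:8000/vd1/crash-reports/',
                  data=values,
                  headers={'Content-Type': 'application/json'})

# urllib stores header keys capitalized, hence 'Content-type' here
print(request.get_header('Content-type'))
```

If the server still returns 415 with 'application/json', the parser configuration on the server side is the next place to look.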
Key points:
Think about heuristics that can help us cluster crashes - what information do we need and how do we process it? Research present trends, write about 0.5 pages summing up your ideas and present it during the meeting + write some specs for future implementation.
Following #37 - write a description of that method. Some experiment would be advisable - a piece of code to get it working on real data, to check if it is really possible in a simple case.
Describe a couple of possible use cases (who's using it? how? why is it needed?) and usage examples (how does the program work? what is the data pipeline?).
Technical overview of the project - short description of the program itself.
Server
a) Communication with client, security - detecting spam, loops, attacks, etc
b) HDFS, Spark
c) Cleaning the database
Client
a) REST, problems with reading from terminals
b) offline classification
Pipeline description
We will have to release our app in an offline version - meaning we need to be able to do classification on the user's machine. Maybe a simple set of heuristics, constructed from knowledge gathered on the server, would solve the problem efficiently and fast without using any fancy ML tools? Think about it, research present trends, write about 0.5 pages summing up your ideas and present it during the meeting + write some specs for future implementation.
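One concrete shape for "knowledge gathered on the server" is a plain list of (pattern, solution) rules serialized as JSON, which the client can evaluate with nothing but the standard library. The rule contents below are invented for the sketch:

```python
import json
import re

# Server side: distil whatever the models learned into flat rules.
# These two rules are made-up examples.
server_side_rules = [
    {"pattern": r"No module named (\w+)", "solution": "install the missing Python module"},
    {"pattern": r"Segmentation fault", "solution": "reinstall the crashing package"},
]
payload = json.dumps(server_side_rules)   # what the server would ship to clients

# Client side: pure-stdlib evaluation, no ML dependencies.
def classify_offline(stderr_output, payload):
    for rule in json.loads(payload):
        if re.search(rule["pattern"], stderr_output):
            return rule["solution"]
    return None

print(classify_offline("ImportError: No module named pgi", payload))
```

The appeal of this transport format is that the offline client stays tiny and the server can re-ship an updated rule file whenever its models improve.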
There is a problem of outdated solutions in the database. How do we prevent that? In other words: we get information about the package versions installed on the client's machine - how do we check whether a solution is still applicable? If it hasn't worked, how do we save that info? Deleting solutions doesn't seem like the optimal path, since there will still be users running old software, but maybe it will have to be done, because there's no easy way to differentiate between cases in our algorithm, or our resources simply won't be able to handle that much historical data. Think of the possible paths we may choose in the future - memory-expensive lifelong system logs, computationally-expensive deep learning algorithms, sets of heuristics, NLP algorithms (shared knowledge), etc. Write about 0.5 pages summing up your ideas and present it during the meeting.
ISSUE TAKEN
In the previous iteration, we collected a lot of ideas. Now the plan is to merge these concepts into a single scientific document (say, 4-5 A4 pages), polish it and send it for review to a few machine learning experts to get feedback.
This process will probably be iterative, so after the first version of the document we will have to apply feedback or new ideas. After a few iterations, once we receive positive feedback, it will be time to start coding!
There is a trivial metric of "correctly solved problems", but since we lack data in the early phases of the project, perhaps we would like to share it between algorithms.
There is also the problem of weighting false positives against false negatives - we probably want an approach that values not breaking anything more than fixing more errors (or not?). And how do we know we broke something?
Apart from being important feedback on how to develop the program and which approaches work best, it may also be used to generate some stats about "overall saved time", etc.
Research, write about 0.5 pages summing up your ideas and present it during the meeting + write some specs for future implementation.
Reward: 3p
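The asymmetric-cost idea above can be made concrete with a toy scoring function in which an applied fix that makes things worse is penalized much more heavily than a crash left unsolved. The weights are arbitrary placeholders, not project decisions:

```python
# Placeholder costs: a "broke something" event (false positive) is assumed
# an order of magnitude worse than a crash we failed to fix.
COST_BROKE_SOMETHING = 10.0   # applied fix made things worse
COST_MISSED_FIX = 1.0         # crash left unsolved

def score(fixed, missed, broke):
    """Net value of a batch of classifications under the assumed costs."""
    return fixed - COST_MISSED_FIX * missed - COST_BROKE_SOMETHING * broke

print(score(fixed=50, missed=20, broke=3))  # 50 - 20 - 30 = 0.0
```

Even this toy version makes the open question visible: detecting the "broke" count requires some follow-up signal from the client, which is exactly the "how do we know we broke something?" problem.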
The package should install the client application and all its requirements. So far, we use Python 2, gi and requests.
After 1-2 years, a big part of collected logs and discovered bugs will be fixed and not important anymore.
However, some of the bugs may appear again or be important for people on older versions of the system.
When should we delete a log from our database? Should we create a separate database for older bugs?
Think about it, research, write about 0.5 pages summing up your ideas and present it during the meeting.
https://github.com/DPCS-team/DPCS/tree/master/server/frontend
and
https://github.com/DPCS-team/DPCS/tree/master/server/alpha
Smart renaming / moving folders will be appreciated.
Features:
Word count, word bigram, TF, IDF, package name, package version, basic system info (possible size limitations?)
Main algorithm: Affinity propagation
Supporting algorithms: Spectral clustering
Heuristics: thefuck, ideas from Microsoft
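The listed features and main algorithm wire together directly in scikit-learn: TF-IDF over word unigrams and bigrams, clustered with AffinityPropagation. The four stderr lines are invented sample data, so this is a shape check rather than an evaluation:

```python
# Sketch: TF-IDF (word counts, bigrams, TF, IDF) + AffinityPropagation,
# matching the feature/algorithm lists above. Sample logs are invented.
from sklearn.cluster import AffinityPropagation
from sklearn.feature_extraction.text import TfidfVectorizer

logs = [
    "ImportError: No module named pgi",
    "ImportError: No module named requests",
    "Segmentation fault (core dumped)",
    "segmentation fault in libfoo.so",
]
X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(logs)
labels = AffinityPropagation().fit_predict(X.toarray())
print(labels)  # one cluster id per log line
```

Package name and version would enter as extra columns appended to the TF-IDF matrix; the size-limitation question from the feature list is about how large that combined matrix gets at 2e6 users.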
During Wednesday's meeting Marek explained his idea for the DPCS logic: to compare "systems' lifelong logs" - data structures consisting of all changes made to the system from the installation to the present. It would require us to:
Short overview of the algorithms used - both clustering and classification.
Can be done after completion of other parts of the paper.
Reward: 6p
Implement a solution catching input from all terminals (pts, tty). Should be implemented as a service/daemon.
Reward: 2p
Integrate our application with a free CI system. Additionally, plug in the PEP8 check.
Reproduction steps:
{
  "crash_report": {
    "application": {
      "name": "Google Chrome",
      "version": "48.0.2564.116"
    },
    "system_info": {
      "version": "14.04.1 LTS"
    },
    "exit_code": 1,
    "stderr_output": "Lines from stdErr."
  }
}
We should write unit tests for the client application. In a perfect scenario, we'd like to have 100% test coverage.
Please use pytest as the framework.
I've read that there is a way to create and manage repos as an organization, thus having the possibility to create teams and better manage permission levels: https://help.github.com/articles/permission-levels-for-an-organization/. What do you think about it?
Reward: 5p
Implement a robust solution for guarding our API and data transferred from the client. Take care of protecting our applications from the most popular attacks.
ubuntu@ip-172-31-23-235:~/DPCS$ dpcs-settings
Traceback (most recent call last):
File "/usr/local/bin/dpcs-settings", line 4, in
import pgi as gi
ImportError: No module named pgi
Repro:
clean 14.04 ubuntu
sudo apt install debhelper python-requests python-gi automake cdbs
cd DPCS/client
make builddeb
cd ..
dpkg -i dpcs_0.1_all.deb
dpcs-settings
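The traceback suggests the script unconditionally imports pgi, while the Debian package only depends on python-gi (module name gi). A likely fix is to fall back between the two bindings; the helper below generalizes that pattern and is demonstrated on stdlib module names so it runs anywhere:

```python
import importlib

def import_first(names):
    """Return the first importable module from names.

    Sketch of the fix: call import_first(("gi", "pgi")) in dpcs-settings so
    that python-gi (apt) is preferred and pgi (pip) is a fallback.
    """
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError("none of %s could be imported" % (names,))

# Demonstrated on stdlib names so the sketch runs without gi/pgi installed:
print(import_first(("definitely_missing_module_xyz", "json")).__name__)
```

Alternatively, the Debian package could simply declare python-pgi as a dependency, but preferring the packaged gi binding avoids an extra pip-only dependency.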
General usage: problem description, proposed solution.
Can be done after completing other parts of the paper.
Reward: 6p.
Modify the source code of bash so that it catches stderr of applications with non-zero exit codes. Then it should send a signal to our client service (a part of this issue as well).
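Before patching bash itself, the behavior can be approximated in plain bash as an experiment: capture a command's stderr, and on a non-zero exit status hand it to the client service. The localhost URL and naive JSON quoting below are assumptions modeled on the apiary example, and real stderr would need proper JSON escaping:

```shell
# Sketch only: no bash source modification, just a wrapper function.
run_and_report() {
    local errfile status
    errfile=$(mktemp)
    "$@" 2>"$errfile"          # capture stderr (not echoed live in this sketch)
    status=$?
    if [ "$status" -ne 0 ]; then
        cat "$errfile" >&2     # replay stderr for the user
        # Hand the captured output to the local client service; URL and
        # payload shape are assumptions, and the quoting here is naive.
        curl -s --max-time 2 -X POST -H 'Content-Type: application/json' \
             --data "{\"exit_code\": $status, \"stderr_output\": \"$(cat "$errfile")\"}" \
             http://localhost:8000/vd1/crash-reports/ || true
    fi
    rm -f "$errfile"
    return "$status"
}
```

The real bash patch would hook the same two pieces of information (exit status, stderr stream) at the point where the shell reaps the child process, signalling the client service instead of shelling out to curl.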
Create a list of potential guests from Europe who could give an interesting speech on 23/24 April. Write an invitation email to each person.
Definition of done:
Send the board members the list of guests and your proposed invitation message.
Describe similar projects: how do they work, and what are the similarities and differences?
Take into account:
Red Hat Access
Pulse
Entropy
Cluebox
We will soon release an alpha version and start collecting the data.
Our research team should be able to look at it and test their algorithms on the server, without downloading this data (because it's sensitive).
The proposed solution is to use a multiuser IPython notebook, so everyone can load data from the database, run an algorithm (using scikit-learn, TensorFlow) and check the results. This will let us gain some practical knowledge about both the prototyped algorithms and the collected data.
But we are not sure if it's
a) safe
b) easy to set-up
c) the best possible way.
The scope of this ticket is to explore this area and create a small (0.5 A4) summary about advantages / disadvantages of this solution, possible alternatives and recommended way of resolving this problem.
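For reference, the "multiuser IPython notebook" idea maps most directly to JupyterHub today. A minimal configuration sketch follows; the values are illustrative defaults, not tested against the project's infrastructure:

```python
# jupyterhub_config.py -- minimal sketch: system-account (PAM) login and
# a per-user notebook server on the research machine. All values here are
# illustrative, not project decisions.
c.JupyterHub.ip = '0.0.0.0'
c.JupyterHub.port = 8000
c.JupyterHub.authenticator_class = 'jupyterhub.auth.PAMAuthenticator'
c.Spawner.notebook_dir = '~/notebooks'
```

Whether this satisfies the "safe" criterion depends on the points the ticket raises: notebooks execute arbitrary code as the logged-in user, so database credentials and raw-data access still need to be restricted at the OS or database layer.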