dpcs-team / dpcs
Data Powered Crash Solver
ISSUE TAKEN
I need to fix them ASAP to unlock CI testing.
ETA: 29.03, will try earlier
During a meeting with Wojciech Jaworski PhD, we came up with the idea that it may be really helpful to create a Stack Overflow crawler that will try to match log fragments from users' questions with the captured log.
Think about it, research, write about 0.5 pages summing up your ideas and present it during the meeting.
Reward: 6p
Catch errors produced by ld.so (or ld-linux.so). Should be implemented as a service.
The server should use Python and a REST microframework. The API is available here: http://docs.dpcs.apiary.io/
Given the huge number of potential machines (2e6 active users contributing data), check whether HDFS is the right choice for the backend. Maybe there are other options that should be taken into consideration. Think of the possible paths we may choose in the future: memory-expensive system logs, computationally-expensive deep learning algorithms, sets of heuristics, NLP algorithms (shared knowledge), etc. Research current trends, write about 0.5 pages summing up your ideas and present it during the meeting.
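To ground the backend discussion, a back-of-envelope volume estimate helps. All three per-user figures below are assumptions for illustration only, not measurements from the project:

```python
# Rough, illustrative figures only -- every constant here is an assumption,
# except the 2e6 active-user count quoted in the issue text.
active_users = 2_000_000        # "2e6 active users" from the issue
reports_per_user_year = 50      # assumed average crash frequency
bytes_per_report = 10_000       # assumed ~10 KB per report (JSON + stderr)

yearly_bytes = active_users * reports_per_user_year * bytes_per_report
print(yearly_bytes / 1e12)      # terabytes per year
```

Under these assumptions the raw crash-report stream is on the order of 1 TB/year, which is comfortably within a single-cluster HDFS deployment but already enough that lifetime retention and lifelong system logs deserve a separate estimate.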
Description of the data preprocessing techniques we plan to use in the project:
Write a description of classification algorithms we plan to use in the project:
Main algorithm: Neural network
Supporting algorithms: Multiclass logistic regression
Maybe a simple set of heuristics would solve the problem efficiently and fast without using any fancy ML tools? Keep in mind we will be dealing with massive data and will also have to create an offline version of the algorithm for client-side classification, so it's worth thinking about how we can transfer the gathered knowledge into a more compact form. Think about it, research present trends, write about 0.5 pages summing up your ideas and present it during the meeting + write some specs for future implementation.
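As a minimal sketch of the "simple set of heuristics" idea: match common stderr patterns against a rule table before falling back to any ML model. The patterns and suggested fixes below are invented for the example:

```python
import re

# Toy rule table: (stderr pattern, canned suggestion). The contents are
# made up for the sketch; a real table would be distilled from data.
RULES = [
    (re.compile(r"No such file or directory"), "check that the path exists"),
    (re.compile(r"Permission denied"), "check file permissions / run with sudo"),
    (re.compile(r"command not found"), "install the missing package"),
]

def heuristic_fix(stderr_output):
    for pattern, fix in RULES:
        if pattern.search(stderr_output):
            return fix
    return None  # no rule matched -> hand over to the ML classifier

print(heuristic_fix("bash: foo: command not found"))
```

A table like this is trivially cheap to evaluate on the client, which is one answer to the "compact form" question above.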
I think the cleanest way to install a Python application into a system, even from a Debian package, is to use Python's setup.py script.
We need to create it and plug it into the Debian package installation script so that it is executed during the dpkg -i process.
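A minimal setup.py sketch for the client could look like the following. The package name, module path, and dependency list are assumptions based on other issues in this tracker (the dpcs-settings script name comes from the traceback reported below), not the project's actual layout:

```python
# setup.py -- packaging sketch; names and paths are assumptions.
from setuptools import setup, find_packages

setup(
    name='dpcs-client',
    version='0.1',
    packages=find_packages(),
    install_requires=['requests'],
    entry_points={
        'console_scripts': [
            # hypothetical module path for the settings entry point
            'dpcs-settings = dpcs.settings:main',
        ],
    },
)
```

The Debian postinst script could then invoke this setup.py so that dpkg -i performs the Python-level installation.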
Following #35 - write a description of a method for finding solutions using Stack Overflow data: how exactly would it be done, and is it really possible? Some experiment would be advisable - a piece of code to check if it is really possible.
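A quick feasibility probe could query the public Stack Exchange API with a fragment of a captured error line. The /search/advanced endpoint and its parameters are part of the documented Stack Exchange API v2.3; only the URL construction is shown here, so the sketch runs without network access:

```python
from urllib.parse import urlencode

def build_search_url(error_fragment):
    # /search/advanced with sort=relevance does free-text matching on 'q';
    # no API key is needed for low request volumes.
    params = urlencode({
        'q': error_fragment,
        'sort': 'relevance',
        'order': 'desc',
        'site': 'stackoverflow',
    })
    return 'https://api.stackexchange.com/2.3/search/advanced?' + params

print(build_search_url('ImportError: No module named pgi'))
```

Fetching the URL with requests (already a client dependency, and it transparently handles the API's gzip-compressed responses) and inspecting the returned item titles would show whether raw log lines match real questions.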
from urllib2 import Request, urlopen

values = """
{
  "crash_report": {
    "application": {
      "name": "Google Chrome",
      "version": "48.0.2564.116"
    },
    "system_info": {
      "version": "14.04.1 LTS"
    },
    "exit_code": 1,
    "stderr_output": "Lines from stdErr."
  }
}
"""
# note: the 'text/json' media type below is what triggers the 415
headers = {
    'Content-Type': 'text/json'
}
request = Request('http://54.93.105.103:8000/vd1/crash-reports/', data=values, headers=headers)
response_body = urlopen(request).read()
print response_body
results in
HTTPError: HTTP Error 415: Unsupported Media Type
See http://docs.dpcs.apiary.io/#reference/crashes/crash-report-collection/send-a-new-report
The bug can also be reproduced when using our client.
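A plausible explanation is that 'text/json' is not a registered media type, and most REST parsers only accept 'application/json'. A sketch of the corrected request follows (shown in Python 3's urllib; only the request object is built here, since actually sending it is the real test against the server):

```python
from urllib.request import Request

# Same payload as the repro above, abbreviated; the fix under test is
# only the Content-Type header.
values = b'{"crash_report": {"exit_code": 1, "stderr_output": "Lines from stdErr."}}'

request = Request('http://54.93.105.103:8000/vd1/crash-reports/',
                  data=values,
                  headers={'Content-Type': 'application/json'})

# urllib stores header keys capitalized, hence 'Content-type' here
print(request.get_header('Content-type'))
```

If the server still returns 415 with 'application/json', the parser configuration on the server side is the next place to look.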
Key points:
Think about heuristics that can help us cluster crashes - what information do we need and how do we process it? Research present trends, write about 0.5 pages summing up your ideas and present it during the meeting + write some specs for future implementation.
Following #37 - write a description of that method. Some experiment would be advisable - a piece of code to get it working on real data, to check if it is really possible in a simple case.
Describe a couple of possible use cases (who's using it? how? why is it needed?) and usage examples (how does the program work? what is the data pipeline?).
Technical overview of the project - short description of the program itself.
Server
a) Communication with client, security - detecting spam, loops, attacks, etc
b) HDFS, Spark
c) Cleaning the database
Client
a) REST, problems with reading from terminals
b) offline classification
Pipeline description
We will have to release our app in an offline version - meaning we need to be able to do classification on the user's machine. Maybe a simple set of heuristics, constructed from knowledge gathered on the server, would solve the problem efficiently and fast without using any fancy ML tools? Think about it, research present trends, write about 0.5 pages summing up your ideas and present it during the meeting + write some specs for future implementation.
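One concrete shape for "knowledge gathered on the server" is a plain list of (pattern, solution) rules serialized as JSON, which the client can evaluate with nothing but the standard library. The rule contents below are invented for the sketch:

```python
import json
import re

# Server side: distil whatever the models learned into flat rules.
# These two rules are made-up examples.
server_side_rules = [
    {"pattern": r"No module named (\w+)", "solution": "install the missing Python module"},
    {"pattern": r"Segmentation fault", "solution": "reinstall the crashing package"},
]
payload = json.dumps(server_side_rules)   # what the server would ship to clients

# Client side: pure-stdlib evaluation, no ML dependencies.
def classify_offline(stderr_output, payload):
    for rule in json.loads(payload):
        if re.search(rule["pattern"], stderr_output):
            return rule["solution"]
    return None

print(classify_offline("ImportError: No module named pgi", payload))
```

The appeal of this transport format is that the offline client stays tiny and the server can re-ship an updated rule file whenever its models improve.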
There is a problem of outdated solutions in the database. How do we prevent that? In other words: we get information about the package versions installed on the client's machine - how do we check whether a solution is still applicable? If it hasn't worked, how do we save that info? Deleting solutions doesn't seem like the optimal path, since there will still be users running old software, but maybe it will have to be done, because there's no easy way to differentiate between cases in our algorithm, or our resources simply won't be able to handle that much historical data. Think of the possible paths we may choose in the future - memory-expensive lifelong system logs, computationally-expensive deep learning algorithms, sets of heuristics, NLP algorithms (shared knowledge), etc. Write about 0.5 pages summing up your ideas and present it during the meeting.
ISSUE TAKEN
In the previous iteration, we collected a lot of ideas. Now the plan is to merge these concepts into a single scientific document (say, 4-5 A4 pages), polish it and send it for review to a few machine learning experts to get feedback.
This process will probably be iterative, so after the first version of the document we will have to apply feedback or new ideas. After a few iterations, once we receive positive feedback, it will be time to start coding!
There is a trivial metric of "correctly solved problems", but since we lack data in the early phases of the project, perhaps we would like to share it between algorithms.
There is also the problem of weighting false positives against false negatives - we probably want an approach that values not breaking anything more than fixing more errors (or not?). And how do we know we broke something?
Apart from being important feedback on how to develop the program and which approaches work best, it may also be used to generate some stats about "overall saved time", etc.
Research, write about 0.5 pages summing up your ideas and present it during the meeting + write some specs for future implementation.
Reward: 3p
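The asymmetric-cost idea above can be made concrete with a toy scoring function in which an applied fix that makes things worse is penalized much more heavily than a crash left unsolved. The weights are arbitrary placeholders, not project decisions:

```python
# Placeholder costs: a "broke something" event (false positive) is assumed
# an order of magnitude worse than a crash we failed to fix.
COST_BROKE_SOMETHING = 10.0   # applied fix made things worse
COST_MISSED_FIX = 1.0         # crash left unsolved

def score(fixed, missed, broke):
    """Net value of a batch of classifications under the assumed costs."""
    return fixed - COST_MISSED_FIX * missed - COST_BROKE_SOMETHING * broke

print(score(fixed=50, missed=20, broke=3))  # 50 - 20 - 30 = 0.0
```

Even this toy version makes the open question visible: detecting the "broke" count requires some follow-up signal from the client, which is exactly the "how do we know we broke something?" problem.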
The package should install the client application and all its requirements. So far, we use Python 2, gi and requests.
After 1-2 years, a big part of collected logs and discovered bugs will be fixed and not important anymore.
However, some of the bugs may appear again or be important for people on older versions of the system.
When should we delete a log from our database? Should we create a separate database for older bugs?
Think about it, research, write about 0.5 pages summing up your ideas and present it during the meeting.
https://github.com/DPCS-team/DPCS/tree/master/server/frontend
and
https://github.com/DPCS-team/DPCS/tree/master/server/alpha
Smart renaming / moving folders will be appreciated.
Features:
Word count, word bigram, TF, IDF, package name, package version, basic system info (possible size limitations?)
Main algorithm: Affinity propagation
Supporting algorithms: Spectral clustering
Heuristics: thefuck, ideas from Microsoft
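The listed features and main algorithm wire together directly in scikit-learn: TF-IDF over word unigrams and bigrams, clustered with AffinityPropagation. The four stderr lines are invented sample data, so this is a shape check rather than an evaluation:

```python
# Sketch: TF-IDF (word counts, bigrams, TF, IDF) + AffinityPropagation,
# matching the feature/algorithm lists above. Sample logs are invented.
from sklearn.cluster import AffinityPropagation
from sklearn.feature_extraction.text import TfidfVectorizer

logs = [
    "ImportError: No module named pgi",
    "ImportError: No module named requests",
    "Segmentation fault (core dumped)",
    "segmentation fault in libfoo.so",
]
X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(logs)
labels = AffinityPropagation().fit_predict(X.toarray())
print(labels)  # one cluster id per log line
```

Package name and version would enter as extra columns appended to the TF-IDF matrix; the size-limitation question from the feature list is about how large that combined matrix gets at 2e6 users.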
During Wednesday's meeting Marek explained his idea for the DPCS logic: to compare "systems' lifelong logs" - data structures consisting of all changes made to the system from the installation to the present. It would require us to:
Short overview of the algorithms used - both clustering and classification.
Can be done after completion of other parts of the paper.
Reward: 6p
Implement a solution catching input from all terminals (pts, tty). Should be implemented as a service/daemon.
Reward: 2p
Integrate our application with a free CI system. Additionally, plug in the PEP8 check.
Reproduction steps:
{
  "crash_report": {
    "application": {
      "name": "Google Chrome",
      "version": "48.0.2564.116"
    },
    "system_info": {
      "version": "14.04.1 LTS"
    },
    "exit_code": 1,
    "stderr_output": "Lines from stdErr."
  }
}
We should write unit tests for the client application. In a perfect scenario, we'd like to have 100% test coverage.
Please use pytest as the framework.
I've read that there is a way to create and manage repos as an organization, thus having the possibility to create teams and better manage permission levels: https://help.github.com/articles/permission-levels-for-an-organization/. What do you think about it?
Reward: 5p
Implement a robust solution for guarding our API and data transferred from the client. Take care of protecting our applications from the most popular attacks.
ubuntu@ip-172-31-23-235:~/DPCS$ dpcs-settings
Traceback (most recent call last):
File "/usr/local/bin/dpcs-settings", line 4, in
import pgi as gi
ImportError: No module named pgi
Repro:
clean 14.04 ubuntu
sudo apt install debhelper python-requests python-gi automake cdbs
cd DPCS/client
make builddeb
cd ..
dpkg -i dpcs_0.1_all.deb
dpcs-settings
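The traceback suggests the script unconditionally imports pgi, while the Debian package only depends on python-gi (module name gi). A likely fix is to fall back between the two bindings; the helper below generalizes that pattern and is demonstrated on stdlib module names so it runs anywhere:

```python
import importlib

def import_first(names):
    """Return the first importable module from names.

    Sketch of the fix: call import_first(("gi", "pgi")) in dpcs-settings so
    that python-gi (apt) is preferred and pgi (pip) is a fallback.
    """
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError("none of %s could be imported" % (names,))

# Demonstrated on stdlib names so the sketch runs without gi/pgi installed:
print(import_first(("definitely_missing_module_xyz", "json")).__name__)
```

Alternatively, the Debian package could simply declare python-pgi as a dependency, but preferring the packaged gi binding avoids an extra pip-only dependency.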
General usage: problem description, proposed solution.
Can be done after completing other parts of the paper.
Reward: 6p.
Modify the source code of bash so that it catches stderr of applications with non-zero exit codes. Then it should send a signal to our client service (a part of this issue as well).
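Before patching bash itself, the behavior can be approximated in plain bash as an experiment: capture a command's stderr, and on a non-zero exit status hand it to the client service. The localhost URL and naive JSON quoting below are assumptions modeled on the apiary example, and real stderr would need proper JSON escaping:

```shell
# Sketch only: no bash source modification, just a wrapper function.
run_and_report() {
    local errfile status
    errfile=$(mktemp)
    "$@" 2>"$errfile"          # capture stderr (not echoed live in this sketch)
    status=$?
    if [ "$status" -ne 0 ]; then
        cat "$errfile" >&2     # replay stderr for the user
        # Hand the captured output to the local client service; URL and
        # payload shape are assumptions, and the quoting here is naive.
        curl -s --max-time 2 -X POST -H 'Content-Type: application/json' \
             --data "{\"exit_code\": $status, \"stderr_output\": \"$(cat "$errfile")\"}" \
             http://localhost:8000/vd1/crash-reports/ || true
    fi
    rm -f "$errfile"
    return "$status"
}
```

The real bash patch would hook the same two pieces of information (exit status, stderr stream) at the point where the shell reaps the child process, signalling the client service instead of shelling out to curl.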
Create a list of potential guests from Europe who could give an interesting speech on 23/24 April. Write an invitation email to each person.
Definition of done:
Send the board members the list of guests and your proposed invitation message.
Describe similar projects: how do they work, and what are the similarities and differences?
Take into account:
Red Hat Access
Pulse
Entropy
Cluebox
We will soon release an alpha version and start collecting the data.
Our research team should be able to look at it and test their algorithms on the server, without downloading this data (because it's sensitive).
The proposed solution is to use a multiuser IPython notebook, so everyone can load data from the database, run an algorithm (using scikit-learn, TensorFlow) and check the results. This will let us gain some practical knowledge about both the prototyped algorithms and the collected data.
But we are not sure if it's
a) safe
b) easy to set-up
c) the best possible way.
The scope of this ticket is to explore this area and create a small (0.5 A4) summary about advantages / disadvantages of this solution, possible alternatives and recommended way of resolving this problem.
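For reference, the "multiuser IPython notebook" idea maps most directly to JupyterHub today. A minimal configuration sketch follows; the values are illustrative defaults, not tested against the project's infrastructure:

```python
# jupyterhub_config.py -- minimal sketch: system-account (PAM) login and
# a per-user notebook server on the research machine. All values here are
# illustrative, not project decisions.
c.JupyterHub.ip = '0.0.0.0'
c.JupyterHub.port = 8000
c.JupyterHub.authenticator_class = 'jupyterhub.auth.PAMAuthenticator'
c.Spawner.notebook_dir = '~/notebooks'
```

Whether this satisfies the "safe" criterion depends on the points the ticket raises: notebooks execute arbitrary code as the logged-in user, so database credentials and raw-data access still need to be restricted at the OS or database layer.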