
dpcs's People

Contributors

bdfhjk, dabler, dzjkb, gajczix, ignacy130, inexxt, konradczechowski, logvinovleon, lukrecjajestbe, matimath, msusik, patrikos94, patrykp2222, qiubit, radek-p, staronj, sylwekqaz, szymonpajzert, wisniak199

dpcs's Issues

PEP8 errors

I need to fix them ASAP to unlock CI testing.

ETA: 29.03, will try earlier

Research/ticket 57/The possibility of using a Stack Overflow crawler

During a meeting with Wojciech Jaworski PhD, we came up with the idea that it may be really helpful to create a Stack Overflow crawler that would try to match log fragments from users' questions with the captured log.

  1. Try to find an answer automatically
  2. Help with log clustering

Think about it, research, write about 0.5 pages summing up your ideas and present it during the meeting.

Research/ticket 50/HDFS or some other system

Given the huge number of potential machines (2e6 active users contributing data), check if HDFS is the right choice for the backend. Maybe there are other options that should be taken into consideration. Think of the possible paths we may choose in the future - memory-expensive system logs, computationally-expensive deep learning algorithms, sets of heuristics, NLP algorithms (shared knowledge), etc. Research current trends, write about 0.5 pages summing up your ideas and present it during the meeting.

Research/ticket 68/Paper - data preprocessing

Description of the data preprocessing techniques we plan to use in the project (a rough sketch follows the list):

  1. Normalization of system paths (~home, /opt/bin, /bin/, etc.) - heuristics
  2. Lowercasing, possessives ('s), timestamps, PII removal (emails, passwords) (library?)
  3. Optional translation
  4. Stopwords (do we need them?)
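
A minimal sketch of such a preprocessing step, assuming plain Python and the standard library only; the regexes, the home-directory value and the stopword list are illustrative placeholders, not final choices:

import re

# Illustrative stopword list - whether we need one at all is still open (point 4).
STOPWORDS = {"the", "a", "an", "is", "to", "of"}

def preprocess(log_line, home_dir="/home/user"):
    text = log_line.lower()                                                       # point 2: lowercase
    text = text.replace(home_dir, "~")                                            # point 1: normalize the home path
    text = re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", "<EMAIL>", text)                # point 2: PII - emails
    text = re.sub(r"\d{4}-\d{2}-\d{2}[ t]\d{2}:\d{2}:\d{2}", "<TIMESTAMP>", text) # point 2: timestamps
    text = re.sub(r"'s\b", "", text)                                              # point 2: possessive 's
    return " ".join(t for t in text.split() if t not in STOPWORDS)                # point 4: stopwords

print(preprocess("2016-03-29 12:00:01 John's browser in /home/user/Downloads crashed, mail john@example.com"))

Points 1 and 3 are only partially covered here; a dedicated library may be a better fit for PII removal than hand-written regexes.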

Research/ticket 70/Paper - classification

Write a description of the classification algorithms we plan to use in the project (a rough baseline sketch follows the list):

Main algorithm: Neural network
Supporting algorithms: Multiclass logistic regression
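
A minimal sketch of the supporting baseline, assuming scikit-learn and TF-IDF features over preprocessed stderr text; the tiny training set and the class labels below are purely illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative, hand-made examples: (preprocessed stderr text, known problem class).
logs = [
    "importerror no module named requests",
    "importerror no module named pgi",
    "e unable to locate package python-gi",
    "permission denied cannot open /etc/hosts",
]
labels = ["missing-python-module", "missing-python-module",
          "missing-apt-package", "permission-error"]

# TF-IDF features + logistic regression; scikit-learn handles the multiclass case itself.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(logs, labels)

print(clf.predict(["importerror no module named numpy"]))

The neural network would replace the logistic regression step once we have enough data to train it.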

Research/ticket 47/Explore heuristic algorithms (server-side classification)

Maybe a simple set of heuristics would solve the problem efficiently and quickly, without any fancy ML tools? Keep in mind we will be dealing with massive data and will also have to create an offline version of the algorithm for client-side classification, so it's worth thinking about how we can transfer the gathered knowledge into a more compact form. Think about it, research present trends, write about 0.5 pages summing up your ideas and present it during the meeting + write some specs for future implementation.

Client/ticket 53/Creating a python setup.py or alternative

I think the cleanest way to install a Python application into a system, even from a Debian package, is to use Python's setup.py script.

We need to create it and plug it into the Debian package installation scripts so that it is executed during the dpkg -i process.
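
A minimal setup.py sketch, assuming setuptools; the package name, dependency list and entry point are illustrative and would have to match the actual client layout:

from setuptools import setup, find_packages

setup(
    name="dpcs-client",                              # illustrative name
    version="0.1",
    packages=find_packages(),
    install_requires=["requests"],                   # illustrative dependency list
    entry_points={
        "console_scripts": [
            "dpcs-settings = dpcs.settings:main",    # hypothetical module path for the existing dpcs-settings script
        ],
    },
)

The Debian packaging scripts could then invoke python setup.py install at the right step instead of copying files by hand.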

Unsupported media type for the example from the Apiary

from urllib2 import Request, urlopen

values = """
  {
    "crash_report": {
      "application": {
        "name": "Google Chrome",
        "version": "48.0.2564.116"
      },
      "system_info": {
        "version": "14.04.1 LTS"
      },
      "exit_code": 1,
      "stderr_output": "Lines from stdErr."
    }
  }
"""

headers = {
  'Content-Type': 'text/json'
}
request = Request('http://54.93.105.103:8000/vd1/crash-reports/', data=values, headers=headers)

response_body = urlopen(request).read()
print response_body

results in

HTTPError: HTTP Error 415: Unsupported Media Type

See http://docs.dpcs.apiary.io/#reference/crashes/crash-report-collection/send-a-new-report

The bug can also be reproduced when you use our client.
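
A likely cause (an assumption, not verified against the server code) is the Content-Type header: the standard media type for JSON is application/json, and a server with strict content negotiation will answer text/json with 415. The minimal change to the example above would be:

headers = {
  'Content-Type': 'application/json'  # instead of 'text/json' (assumes the server only accepts the standard JSON media type)
}

If that turns out to be the fix, the Apiary example and the client should be updated to match.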

Research/ticket 65/Paper - introduction

Key points:

  1. Where did the idea come from? (idea from Canonical + us)
  2. UW ML RG overview - we're doing it to learn ML techniques, with a focus on teaching; a large group of people, some of them inexperienced
  3. Use cases
    a) new Linux users
    b) admins, manual scripting replacement
    c) normal users (Google + Stack Overflow)
  4. Plans to incorporate the app in Ubuntu 17.04

Research/ticket 67/Paper - technical overview

Technical overview of the project - short description of the program itself.

  1. Server
    a) Communication with client, security - detecting spam, loops, attacks, etc
    b) HDFS, Spark
    c) Cleaning the database

  2. Client
    a) REST, problems with reading from terminals
    b) offline classification

  3. Pipeline description

Research/ticket 46/Explore heuristic algorithms (client-side classification)

We will have to release our app in an offline version - meaning that we need to be able to do the classification on the user's machine. Maybe a simple set of heuristics, constructed from knowledge gathered on the server, would solve the problem efficiently and quickly, without any fancy ML tools? Think about it, research present trends, write about 0.5 pages summing up your ideas and present it during the meeting + write some specs for future implementation.
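
A minimal sketch of what such a compact, server-derived rule set could look like on the client; the regex-to-suggestion pairs below are illustrative assumptions, not rules we have actually mined:

import re

# Illustrative rules; in practice they would be generated server-side and shipped with the client.
RULES = [
    (r"No module named (\S+)", "Try installing the missing Python module: {0}"),
    (r"E: Unable to locate package (\S+)", "Check the package name or run apt-get update"),
    (r"Permission denied", "Try re-running the command with sudo"),
]

def classify_offline(stderr_text):
    """Return the first matching suggestion, or None if no heuristic applies."""
    for pattern, suggestion in RULES:
        match = re.search(pattern, stderr_text)
        if match:
            return suggestion.format(*match.groups())
    return None

print(classify_offline("ImportError: No module named pgi"))

A flat rule list like this is cheap to ship, update and evaluate, which is the main argument for trying it before any offline ML model.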

Research/ticket 52/Cleaning the database

There is a problem of having outdated solutions in the database. How do we prevent that? In other words: we get information about the package versions installed on the client's machine - how do we check if a solution is still applicable? If it hasn't worked, how do we save that info? Deleting solutions doesn't seem like the optimal path, since there will still be users running old software, but maybe it will have to be done, because there's no easy way to differentiate between cases in our algorithm, or simply our resources won't be able to handle that much historical data. Think of the possible paths we may choose in the future - memory-expensive lifelong system logs, computationally-expensive deep learning algorithms, sets of heuristics, NLP algorithms (shared knowledge), etc. Write about 0.5 pages summing up your ideas and present it during the meeting.

Research/ticket 54/Create an initial algorithm summary 1

In the previous iteration, we collected a lot of ideas. Now the plan is to merge these concepts into a single scientific document (let's say 4-5 A4 pages), polish it and send it for review to a few machine learning experts to get feedback.

This process will probably be iterative, so after the first version of the document we will have to apply feedback or new ideas. After a few iterations, once we receive positive feedback, it will be time to start coding! ☕

Research/ticket 51/Evaluating algorithm's efficiency

There is a trivial metric of "correctly solved problems", but as we are lacking data in the beginning phases of the project, perhaps we would like to share it between algorithms.
There is also the problem of weighting false positives vs. true negatives - we probably want to choose an approach which values not breaking anything more than fixing more errors (or not?). And how do we know we broke something?
Apart from being important feedback on how to develop the program and which approaches work best, it may also be used to generate some stats about "overall saved time", etc.
Research, write about 0.5 pages summing up your ideas and present it during the meeting + write some specs for future implementation.
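
One way to encode the "don't break anything" preference is a beta-weighted F-score, which trades recall (fixing more errors) against precision (not suggesting harmful fixes). A minimal sketch, assuming scikit-learn and that label 1 means "our suggested fix was correct"; the labels and the beta value are illustrative:

from sklearn.metrics import confusion_matrix, fbeta_score

# Illustrative evaluation labels: 1 = the suggested fix was correct, 0 = it was not.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# beta < 1 weights precision (avoiding wrong fixes) above recall (fixing more errors).
print("F0.5:", fbeta_score(y_true, y_pred, beta=0.5))
print(confusion_matrix(y_true, y_pred))

How we actually learn that a suggested fix broke something remains a separate, open question.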

Research/ticket 49/Comparing lifelong system logs

After 1-2 years, a big part of the collected logs and discovered bugs will be fixed and no longer important.

However, some of the bugs may appear again or be important for people on older versions of the system.

When should we delete a log from our database? Should we create a separate database for older bugs?

Think about it, research, write about 0.5 pages summing up your ideas and present it during the meeting.

Research/ticket 69/Paper - clustering

Features:
word count, word bigrams, TF, IDF, package name, package version, basic system info (possible size limitations?)
Main algorithm: Affinity propagation
Supporting algorithms: Spectral clustering
Heuristics: thefuck, ideas from Microsoft
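
A minimal sketch of the main path, assuming scikit-learn, with TF-IDF features over already-preprocessed logs; the sample logs are illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AffinityPropagation

# Illustrative preprocessed stderr logs.
logs = [
    "importerror no module named pgi",
    "importerror no module named requests",
    "e unable to locate package python-gi",
    "e unable to locate package cdbs",
    "permission denied /etc/hosts",
]

# TF-IDF features over the log corpus.
features = TfidfVectorizer().fit_transform(logs)

# Affinity propagation chooses the number of clusters itself, which suits an open-ended log stream.
clustering = AffinityPropagation().fit(features.toarray())
print(clustering.labels_)

Word counts, bigrams and the package/system metadata listed above would be appended as extra feature columns; how large that vector gets is exactly the size question raised above.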

Research/Ticket 58/History trees approach

During Wednesday's meeting Marek explained his idea for the DPCS logic: to compare "systems' lifelong logs" - data structures consisting of all changes made to the system from the installation to the present. It would require us to:

  1. Create these structures - creation during system installation, then maybe a daemon running in the background the whole time? Or maybe scanning the system once in a while and saving the changes since the last scan? Or adding a "DPCS protected" tag to specific directories and files and scanning only these? There are many ideas to explore.
  2. Manage them - updating, compressing, saving, uploading to the server, etc.
  3. Think of some comparison techniques - the original idea was: having a set of trees from people who didn't have the investigated error, compare them to the user's log and find the place where they differ; finally,
  4. Find a solution - given "correct" logs, how to construct a solution starting from an "incorrect" one.

Think about it, research, write about 0.5 pages summing up your ideas and present it during the meeting.
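
A very rough sketch of point 3, assuming a lifelong log can be flattened into a mapping from file path to a content hash; the snapshots below are illustrative:

def diff_snapshots(healthy, broken):
    """Return the paths whose state differs between a 'correct' snapshot and the user's one."""
    all_paths = set(healthy) | set(broken)
    return {path: (healthy.get(path), broken.get(path))
            for path in all_paths
            if healthy.get(path) != broken.get(path)}

# Illustrative snapshots: path -> hash of the file contents (a missing key means the file is absent).
healthy = {"/etc/apt/sources.list": "a1f3", "/usr/bin/python": "9c2e"}
broken = {"/etc/apt/sources.list": "b774", "/usr/bin/python": "9c2e"}

print(diff_snapshots(healthy, broken))   # -> {'/etc/apt/sources.list': ('a1f3', 'b774')}

How to turn the differing paths into an actual fix (point 4) is the hard, open part.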

General/CI system

Reward: 2p

Integrate our application with a free CI system. Additionally, plug in the PEP8 check.

Client/ticket 59/tests coverage

We should write unit tests for the client application. In a perfect scenario, we'd like to have 100% test coverage.

Please use pytest as the framework.
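
A minimal pytest sketch; the helper function below is a made-up stand-in, since the real tests would import functions from the client package instead:

# test_paths.py - run with: pytest (coverage can be measured with the pytest-cov plugin)

def normalize_path(path, home="/home/user"):
    """Toy stand-in for a client helper worth unit-testing."""
    return path.replace(home, "~")

def test_home_directory_is_normalized():
    assert normalize_path("/home/user/Downloads/a.log") == "~/Downloads/a.log"

def test_other_paths_are_untouched():
    assert normalize_path("/opt/bin/tool") == "/opt/bin/tool"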

General/Security

Reward: 5p

Implement a robust solution for guarding our API and data transferred from the client. Take care of protecting our applications from the most popular attacks.

python-pgi not found

ubuntu@ip-172-31-23-235:~/DPCS$ dpcs-settings
Traceback (most recent call last):
  File "/usr/local/bin/dpcs-settings", line 4, in <module>
    import pgi as gi
ImportError: No module named pgi

Repro:
clean 14.04 ubuntu
sudo apt install debhelper python-requests python-gi automake cdbs

cd DPCS/client
make builddeb

cd ..
dpkg -i dpcs_0.1_all.deb
dpcs-settings

Client/Bash plugin

Reward: 6p.

Modify the source code of bash so that it catches the stderr of applications with non-zero exit codes. Then, it should send a signal to our client service (a part of this issue as well).

Organization/Ticket 62/Find a guest for our hackathon

Create a list of potential guests from Europe who could give an interesting speech on 23/24 April. Write an email to this person.

Definition of done:
send the board members the list of guests and your proposed invitation message.

Research/ticket 55/Multiuser ipython notebooks on the server?

We will soon release an alpha version and start collecting the data.

Our research team should be able to look at it and test their algorithms on the server, without downloading this data (because it's sensitive).

The proposed solution is to use a multiuser IPython notebook, so everyone can load data from the database, run an algorithm (using scikit-learn, TensorFlow) and check the results. This will let us gain some practical knowledge about both the prototyped algorithms and the collected data.

But we are not sure if it's
a) safe
b) easy to set up
c) the best possible way.

The scope of this ticket is to explore this area and create a small (0.5 A4) summary of the advantages / disadvantages of this solution, possible alternatives, and the recommended way of resolving this problem.
