
fcc_r3_dataanalysis's Introduction

Fccr3_DA

This is part of a larger project. The tentative full name of this project is FreeCodeCamp (fCC) Community Resources Review. Although it borrows its name from fCC, its source of inspiration and origin, this project is NOT currently directly connected with the organization (www.freecodecamp.org).

Description

The project's mission is to offer users, primarily new developers, a curated list of relevant resources for learning programming.

This project is a Proof of Concept.

This section includes the code for the data mining and for the machine learning techniques used to classify the gathered resources (platforms), that is, online content mentioned by fCC social media users.

Currently, the data is extracted mainly from fCC chatrooms.

The scripts in this section are used to:

  • collect the resources from the chatrooms
  • extract information about their use in the chatrooms
  • add information about the collected platforms by visiting their main pages (a bot)
  • explore machine learning techniques to assign categories
  • collect and organize information about the subjects of the fCC curriculum (https://beta.freecodecamp.com/en/map)
  • retrieve information: assign weights for ranking based on the similarity of platform-specific content to fCC subject keywords
  • generate tests
  • solve some ETL issues
  • save data into Firebase
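
The information-retrieval step listed above (weights based on the similarity between platform content and fCC subject keywords) could be approached with a bag-of-words cosine similarity. The following is a minimal sketch using only the standard library; the function names, the tokenizer and the sample strings are assumptions made for illustration and are not taken from the repository.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists, using raw term counts."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def subject_weights(platform_text, subjects):
    """Return {subject: weight} for one platform (hypothetical data shape)."""
    tokens = tokenize(platform_text)
    return {
        name: cosine_similarity(tokens, tokenize(" ".join(keywords)))
        for name, keywords in subjects.items()
    }


# Hypothetical example data, not real project records
subjects = {
    "html_css": ["html", "css", "flexbox"],
    "algorithms": ["algorithm", "sorting", "recursion"],
}
weights = subject_weights("Learn HTML and CSS layouts with flexbox", subjects)
```

A platform whose text shares no token with a subject's keywords gets a weight of 0.0 for that subject, so only subjects with some overlap contribute to the ranking.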

This project is managed using the Kanban methodology (https://realtimeboard.com/blog/choose-between-agile-lean-scrum-kanban/#.WW5nlh9Nybk). This repository shows the progress of the data mining and machine learning work. Another repository shows some of the progress on the rendering side of the project (in Angular).

Installation

This repository and its content are still under construction and are not ready for download.

However, we can mention that:

  • it runs under a UNIX-like operating system (Ubuntu 16.04)
  • it uses Python 3.5.2 and IPython 4.2.0, both in Anaconda 4.1.1

For privacy reasons, the project won't include:

  • local directories
  • critical access information (databases, APIs); some of the APIs are public and the code can be replicated if you get API access for the corresponding platform; access to the database is restricted: only reading is public

Related projects

For more information about the related progress on the rendering of this project, please visit this repository: https://github.com/evaristoc/fCC_R3 (work-in-progress)

fcc_r3_dataanalysis's People

Contributors

evaristoc

Watchers

James Cloos

Forkers

chikezie122

fcc_r3_dataanalysis's Issues

Calculate ranking and save that in database

Problem description

Code has been added to calculate the ranking for EACH of the subjects. However, it is inefficient: the ranking is strongly based on relevance to the topic, which would hardly change. It would be better to load the stored ranking directly instead.

If the ranking is saved in the database, many other operations can be done directly with the existing ranking, without having to download the whole database and repeat the calculations.

User Story

  • (...)

Technical Specs

  • The ranking is extracted directly from the database
  • A different record structure should be found that allows different kinds of queries over the database based on the ranking values

Possible Solutions

  • In the Subject section of each platform, add the value that will be used for ranking.
  • A separate list with the ranking per subject.
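
The second option above (a separate list with the ranking per subject) could be derived from the per-platform values once and then stored. This is a minimal sketch; the record shape `{platform_id: {"subjects": {subject: score}}}`, the function name and the sample values are assumptions for illustration, not the actual database schema.

```python
def ranking_per_subject(platforms):
    """Build {subject: [(platform_id, score), ...]} sorted by descending score.

    `platforms` is assumed to look like
    {platform_id: {"subjects": {subject: score}}}.
    """
    ranking = {}
    for platform_id, record in platforms.items():
        for subject, score in record.get("subjects", {}).items():
            ranking.setdefault(subject, []).append((platform_id, score))
    for entries in ranking.values():
        # Highest score first; ties are broken by platform id for a stable order
        entries.sort(key=lambda item: (-item[1], item[0]))
    return ranking


# Hypothetical records, not real project data
platforms = {
    "platform_a": {"subjects": {"html_css": 0.9, "javascript": 0.7}},
    "platform_b": {"subjects": {"algorithms": 0.8, "javascript": 0.4}},
}
ranking = ranking_per_subject(platforms)
```

With a structure like this, a query such as "best platforms for a subject" reads one precomputed list instead of scanning every platform record.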

Milestones

  • (...)

Warnings and Notes

  • By doing so, if a change does occur that affects the ranking, there would be no general solution: all the data would have to be processed again.
  • Relates to evaristoc/fCC_R3#11

References

  • (...)

Data Revision

Problem Description

This issue tracks several issues that are already open, as well as milestone number 1.

User Story

  • (...)

Technical Specs

  • add the Wikipedia info to the current db
  • correct subject word lists
  • work on the ranking per subject
  • include a list of keywords based on subject relevance (not a good idea)
  • similar platforms: implement kNN/cosine similarity for comparisons between platforms based on ranking per subject; 10 per subject
  • other people's selection: identify other mentions of platforms and find similarities as above (?); 10 per subject
  • update last date (not at this iteration)

Possible Solutions

  • (...)

Milestones


Warnings and Notes:

References:

  • (...)

Checking the use of commoncrawl datasets; crawling references

Refactoring Code: Possible Structure

Steps //MAKES POINT PER TABLE, PREPARE A DATA FLOW//

  • Social Data Extraction: INPUT: different sources; OUTPUT: different raw data files
  • Platformtable Database pre: INPUT: different raw data files, affecting only those links that are included in rules; OUTPUT: a dataset that resembles the database in different aspects, keeping some areas empty for filling in
  • Crawled Platformtable Database pre: INPUT: a database-like dataset; OUTPUT: a database-like dataset with crawled data
  • Platformtable Dataset for classification: INPUT: a database-like dataset with crawled data required for the classification procedure; OUTPUT: a first dataset for classification
  • Crawled Platformtable Database pre with Classes: INPUT: a first dataset for classification + operational definition of classes dataset; OUTPUT: collating a full dataset with ALL the links classified, currently called botdata + Platformtable Database pre filled, including regex forms
  • Crawled Platformtable Database pre with Classes and Link Comparisons: INPUT: the dataset currently known as botdata + Crawled Platformtable Database pre with Classes; OUTPUT: collating data calculated based on what was found in botdata
  • Curriculum Data Extraction: INPUT: freecodecamp website or similar; OUTPUT: a dictionary per section with some data transforms + subjects, usually called cv
  • Crawled Platformtable Database pre with Classes and Link Comparisons with Subject Metrics added: INPUT: CV + BOW + Crawled Platformtable Database pre with Classes and Link Comparisons; OUTPUT: collating of data into the dataset pre filled with metrics of subject relevance
  • ETL of Crawled Platformtable Database pre with Classes and Link Comparisons with Subject Metrics added: verify compliance before loading, OBS: also includes other tables!!
  • Database file: platformstable: INPUT: ETL of Crawled Platformtable Database pre with Classes and Link Comparisons with Subject Metrics added; OUTPUT: updates in the firebase database
  • Database file: fcc_subjects: INPUT: ETL of Crawled Platformtable Database pre with Classes and Link Comparisons with Subject Metrics added; OUTPUT: extraction of unique fcc_subjects as captured from the whole list (regex ones)
  • Database file: cv: INPUT: CV dict file; OUTPUT: cv dict table

Keep a single format for data files

Currently they are all CSV, but delimiters and quote characters differ between them. The reason is that the information being gathered is varied and hard to standardise. However, a simple standard for the gathered information might be found.
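
One way to settle on a single format is to register one CSV dialect and use it for every read and write. The sketch below relies on the standard `csv` module; the dialect name `fcc_r3`, the chosen delimiter and quote character, and the helper names are assumptions for illustration, not a decision recorded in this issue.

```python
import csv
import io

# Hypothetical project-wide dialect: one delimiter and one quote character
csv.register_dialect(
    "fcc_r3",
    delimiter=",",
    quotechar='"',
    quoting=csv.QUOTE_MINIMAL,
    lineterminator="\n",
)


def write_rows(rows, handle):
    """Write rows to an open text handle using the shared dialect."""
    csv.writer(handle, dialect="fcc_r3").writerows(rows)


def read_rows(handle):
    """Read all rows from an open text handle using the shared dialect."""
    return list(csv.reader(handle, dialect="fcc_r3"))


# Round trip through an in-memory file, including a field that
# contains both the delimiter and the quote character
rows = [["url", "title"], ["https://example.com", 'Guide, "quick" start']]
buffer = io.StringIO()
write_rows(rows, buffer)
buffer.seek(0)
round_trip = read_rows(buffer)
```

Because quoting is handled by the dialect, fields containing commas or quotes survive the round trip without a custom delimiter per file.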

'platform' component: Insert similarity rating between platforms based on relevance to selected subject

Problem Description

(...)

User Story

  • The top10 "Similar Platforms" are those in the same category that are similar AND have the highest ranking for the selected subject: Scat_mat x Rankingsubj_diag
  • The top10 "Mentioned by" are those of any category mentioned by other users that are similar AND have the highest ranking for the selected subject: Sment_mat x Rankingsubj_diag
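
The products named above can be read as a platform-by-platform similarity matrix multiplied by a diagonal matrix holding each platform's ranking for the selected subject, which scales column j of the similarity matrix by the ranking of platform j. The sketch below follows that reading, which is an assumption; the function names and the 3-platform data are hypothetical.

```python
def scale_by_ranking(similarity, ranking):
    """Compute S x diag(r): column j of S is multiplied by ranking[j]."""
    return [
        [s_ij * ranking[j] for j, s_ij in enumerate(row)]
        for row in similarity
    ]


def top_k(scores, index, k=10):
    """Indices of the k best-scoring platforms for platform `index`, excluding itself."""
    candidates = [(score, j) for j, score in enumerate(scores[index]) if j != index]
    # Highest score first; ties are broken by the lower index
    candidates.sort(key=lambda item: (-item[0], item[1]))
    return [j for _, j in candidates[:k]]


# Hypothetical 3-platform example
similarity = [
    [1.0, 0.8, 0.5],
    [0.8, 1.0, 0.2],
    [0.5, 0.2, 1.0],
]
ranking = [0.3, 0.1, 0.9]  # ranking of each platform for the selected subject
scores = scale_by_ranking(similarity, ranking)
similar_to_first = top_k(scores, 0, k=2)
```

In this example the third platform outranks the second for the first platform: it is less similar (0.5 against 0.8) but ranks much higher for the subject (0.9 against 0.1).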

Technical Specs

  • (...)

Possible Solutions

  • Implement on the fly using Lambda functions: no, at least not now. This introduces several operational and technological issues that might not be practical
  • Look for a cloud service: this could be a good option too, to be explored...
  • Simple cosine similarity rating, likely kNN; sklearn implementation: everything is computed before the data is loaded. Some reasons why the solutions above were discarded are:
    • Firebase is not good for querying
    • They introduce additional, complex async behaviour and waiting
    • This project currently runs in the client browser, so it is slow and has little memory
    • The section for which this is implemented is not considered as critical as other areas right now
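
The precomputation option above could look roughly like the following: nearest neighbours by cosine similarity are calculated offline and only the resulting lists are loaded into the database. The issue mentions sklearn; this sketch uses plain Python instead to stay self-contained, and the vector layout, function names and values are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_neighbours(vectors, k=10):
    """Return {platform: [k most similar platforms]} from {platform: vector}."""
    neighbours = {}
    for name, vec in vectors.items():
        others = [
            (cosine(vec, other_vec), other_name)
            for other_name, other_vec in vectors.items()
            if other_name != name
        ]
        # Most similar first; ties are broken alphabetically
        others.sort(key=lambda item: (-item[0], item[1]))
        neighbours[name] = [other_name for _, other_name in others[:k]]
    return neighbours


# Hypothetical per-subject ranking vectors (same subject order for every platform)
vectors = {
    "platform_a": [0.9, 0.1, 0.0],
    "platform_b": [0.8, 0.2, 0.0],
    "platform_c": [0.0, 0.1, 0.9],
}
precomputed = nearest_neighbours(vectors, k=1)
```

The comparison is quadratic in the number of platforms, which is acceptable for an offline step but is one reason not to run it in the client browser.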

Milestones

  • Prepare a simple demo: top10 "Similar Platforms" are the first 10 of the same category that are similar; top10 "Mentioned by" are just the 10 most similar, regardless of category

Warnings and Notes:

References

A list of data errors in the database to correct!

platformstable:

  • urls as text? E.g. "Code Conventions for the JavaScript Programming Language"
  • urls as numbers
  • localhosts?
  • urls not pointing anywhere or in the wrong format
  • empty / incomplete urls
  • use of "wrong" characters in urls, e.g. '*'
  • duplicate domains (finding the canonical one is better!)
  • a standard should be found for crawl_errors, which show messages like "403:Forbidden" or "521: Web server is down" or "Site not found · GitHub Page"
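
Several of the errors listed above can be detected automatically before loading. The sketch below shows one possible set of checks with the standard library; the function names, the problem labels and the `(code, text)` form proposed for crawl_errors are assumptions for illustration, not an agreed standard.

```python
import re
from urllib.parse import urlsplit


def url_problem(value):
    """Return a short problem label for a stored url value, or None if it looks usable."""
    if not isinstance(value, str) or not value.strip():
        return "empty or not text"
    try:
        parts = urlsplit(value.strip())
    except ValueError:
        return "wrong format"
    if parts.scheme not in ("http", "https"):
        return "missing or unsupported scheme"
    host = parts.hostname or ""
    if host in ("localhost", "127.0.0.1"):
        return "localhost"
    if "." not in host:
        return "incomplete host"
    if "*" in value:
        return "unexpected character"
    return None


def normalize_crawl_error(message):
    """Split 'CODE: text' messages into (code, text); code is None when absent."""
    match = re.match(r"^\s*(\d{3})\s*:\s*(.*)$", message)
    if match:
        return int(match.group(1)), match.group(2).strip()
    return None, message.strip()
```

For example, `url_problem("http://localhost:4200/")` is labelled "localhost", and `normalize_crawl_error("403:Forbidden")` and `normalize_crawl_error("521: Web server is down")` both reduce to a numeric code plus a message despite the different spacing.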

Data didn't update users in the platform details section

Problem Description

When getting the data point users from any platform in platformstable, users was an empty list.

User Story

  • (...)

Technical Specs

  • (...)

Possible Solutions

  • (...)

Milestones

  • (...)

Warnings and Notes:

  • the problem is probably a missing update in a function like completing_db_with_data_from_botandcv or similar in main.py

References:

  • (...)
