Git Product home page Git Product logo

bacondegree's Introduction

About BaconDegree2

The parser and reader-class is from my teacher at uni, Henrik Bergström. Uses files and formats from: ftp://ftp.fu-berlin.de/pub/misc/movies/database

The task was performed with classmate Arpad Szell

Purpouse

Optimize reading large text files (1gig plus), and create a graph search connection between two nodes. The dataset used is IMDB actors, connections are made for common movies and colleagues. https://oracleofbacon.org/ provides more info.

What can you use it for?

Its mainly a fun example of algorithms and reading larger files in Java. Most likely Python, Perl, R and C++ is more relevant for this type of work, but this was a fun school-exercise that just expanded more than I thought when I started writing it.

Progress

Version Run-time
V1 Ca 3 hours
V2 1 min creating hash-map <actors,>, 2 min graph-search
V3 1 min creating hash-map <actors,>, 24 seconds graph-search

A Divide and Conquer Bidirectional Search: First Results

V1

Used a very innefficient "Build-Graph first"-technique. It took it about 3 hours to connect a second-degree Bacon Connection.

V2

Builds the graph as it searches for connected actors. Maps only about 30%-50% of actors each run until connection/path is found. Also, much quicker than V1, needs about 2-4 minutes to find a connection.

V3

Uses a Bi-Directional search that expands nodes both from goal, and from The presentation of connected actors is a little messy, but its a lot quicker, about 75% earlier running time. From ca 2 min graph-expansion to find a 4-degree bacon connection in V2, to 24 seconds graph-expansion. However, the reading time of the file, and creating an internal list in the temporary memory still takes about a minute.

Future Improvements

1. Memory Efficiency

Implement writing graph directly from file, making runnable on weaker processors. Both "wave-fronts" of Bi-Directional search expanded while reading. At the moment my Macbook but not my Windows stationary can manage reading my 1.2gig "actors.list" file. Implement "A Divide and Conquer Bidirectional Search: First Results" by Korf (1999) https://www.semanticscholar.org/paper/A-Divide-and-Conquer-Bidirectional-Search%3A-First-Korf/c250bd477f966b15b0296ab7c2e01a4a8928c279 if results are correct increase in time only slight, great decrease in temporary memory use.

Graph Bi-Directional Divide and Conquer (Image credits go to Korf 1999).

Another ad-hoc idea is to create a table of movie frequency first. Actors with a high-mean movie frequency, movies that are frequent in the set, will likely have more connections and are prioritized.

2. Internet access

Not having to access the file on secondary storage will make it easier for other users to use, and to run the program on any platform.

3. UI

Making the program runnable through a user-interface would be great. The data-set used is typical for fun endeavours, but it could be interesting to create a app measuring algorithmics efficiencies too "run algo 1, memory used X, time passed Y". We will see, maybe it can be an educational tool(run in advanced mode), as well as a fun game for guessing the connection-level of different actors.

3.2 Extracting usage data

Extracting the memory used and creating a graph with R java would be neat.

bacondegree's People

Contributors

andreasaar avatar

Watchers

James Cloos avatar  avatar Arpad Szell avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.