CSCI143: Big Data

Career information

About the Instructor

Name Mike Izbicki (call me Mike)
Email [email protected]
Office Adams 216
Office Hours See Issue #243
Zoom Link See Issue #227
Webpage izbicki.me
Research Machine Learning (see izbicki.me/research.html for some past projects)

Fun facts:

  1. grew up in San Clemente (~1hr south of Claremont, on the beach)
  2. 7 years in the navy
    1. nuclear submarine officer, personally converted >10g of uranium into pure energy
    2. worked at National Security Agency (NSA)
    3. left Navy as a conscientious objector
  3. phd/postdoc at UC Riverside
  4. taught in DPRK (i.e. North Korea)
  5. my wife is pregnant and due to have a baby April 18th
    1. I'll be taking 2 weeks paternity leave when the baby comes

About the Course

What is big data?

It depends entirely on who is talking:

  1. Most non-computer scientists (muggles) think anything bigger than 1 GB is big data
  2. Facebook considers "tens of petabytes" to be a "SMALL data problem"
  3. One of the biggest problems in industry is that people apply tools built for "Facebook big data" to "muggle big data"; a major goal of this course is to teach you why this is bad and how to avoid it
  4. For us, "big data" means:
    1. managing a cluster of computers to solve a computational problem; if it can be solved on a single computer, it's SMALL data
    2. all the interesting/applied parts of upper division computer science compressed into a single course

We will work with the following three datasets:

  1. All geolocated tweets sent from 2017-today, 4 terabytes
  2. The common crawl of the web since 2008, >1 petabyte
  3. The internet archive, >50 petabytes as of 2014

By the end of this course, you will build your own "google" search engine. You will manage a cluster of machines that work together to:

  1. download all the data from the internet
  2. extract key information from the HTML
  3. store it in a format suitable for sub-200ms queries
  4. and serve the data in a webpage
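
Steps 2 to 4 above can be sketched on a single machine with nothing but the Python standard library. This is only a toy under stated assumptions: the `PAGES` dictionary is made-up data standing in for step 1 (downloading), the names `TitleParser`, `extract_title`, `build_index`, and `search` are hypothetical, and an in-memory SQLite database stands in for the Postgres cluster used in the course.

```python
import sqlite3
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract_title(html):
    """Step 2: extract one piece of key information from the HTML."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


# Hypothetical pages standing in for step 1 (downloading the data).
PAGES = {
    "https://example.com/a": "<html><head><title>Big Data Basics</title></head></html>",
    "https://example.com/b": "<html><head><title>Cooking Pasta</title></head></html>",
}


def build_index(pages):
    """Step 3: store the extracted data in a queryable format."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, title TEXT)")
    conn.executemany(
        "INSERT INTO pages (url, title) VALUES (?, ?)",
        [(url, extract_title(html)) for url, html in pages.items()],
    )
    return conn


def search(conn, term):
    """Step 4 (minus the webpage): return URLs whose title contains term."""
    rows = conn.execute(
        "SELECT url FROM pages WHERE title LIKE ? ORDER BY url",
        (f"%{term}%",),
    )
    return [url for (url,) in rows]


if __name__ == "__main__":
    print(search(build_index(PAGES), "data"))
```

A `LIKE` scan reads every row, so it does not scale; it is used here only to keep the sketch short.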

In order to make your search engine scalable, we will use the following technologies:

  1. Docker containers

    1. used to easily deploy code to thousands of computers
    2. requires concepts from operating systems, networks, architecture; closely related to "virtual machines"
    3. widely used in industry, see https://stackshare.io/docker
  2. Databases

    1. stores and accesses the data efficiently
      1. application and database on same computer (SQLite, covered in CS40)
      2. application and database on different computers (Postgres), our focus
      3. database on a cluster of computers in the same datacenter (Postgres + extensions like Citus)
      4. database on a cluster of computers spread throughout the world (YugabyteDB, CockroachDB)
    2. SQL to manipulate data, python to build applications
    3. NoSQL (e.g. MongoDB, CouchDB) sucks and you should probably never use it (strongly held personal opinion)
    4. Postgres implements full text search in 70+ languages using custom libraries I've written
    5. Postgres is widely used in industry, see https://stackshare.io/postgresql
  3. With these technologies, you can create a fully functioning, highly scalable web business

    1. former CMC student Biniyam Asnake created the business NextDorm as his senior thesis (slightly different tech stack, but same ideas)

Who should take this course?

This course is designed for data science majors, not computer science majors. I'm happy to have CS majors in this course (and I think you'll find this course fun), but know that:

  1. you probably have not fully met the prereqs for this course
  2. some material in this course will duplicate material in your other CS courses
    1. this is especially true of CSCI133 Databases
    2. the course number CSCI143 comes from the fact that all CMC upper division CS courses start with CSCI14, and the 3 is for databases

Prerequisites:

  1. Discrete math: CSCI055 or MATH055

    1. Basic probability / counting
    2. Basic graph theory
  2. Foundations of data science: CSCI 036, ECON 122, or ECON 160

    1. Basic machine learning
    2. Basic SQL (also covered in CSCI040 Computing for the Web; not covered in any computer science class except CSCI133 Databases, which you should not take if you take this course)
    3. Regular expressions (for CS majors, typically covered in a theory of computing or compilers class)
  3. Data structures: CSCI046 or CSCI70 (Mudd) or CSCI62 (Pomona)

    1. All courses cover:
      1. Big-oh notation
      2. Balanced binary search trees
    2. CSCI046 covers:
      1. Basic Unix shell commands
      2. Advanced git
      3. Vim text editor
      4. Analyzing multi-gigabyte Twitter datasets
    3. The data structures prerequisite, CSCI040, covers:
      1. Markdown
      2. HTML / CSS
      3. Basic SQL
      4. Programming web servers with the flask library
      5. Web scraping with the requests and bs4 libraries
  4. Takeaway:

    1. I am expecting that you have basic familiarity with the Linux terminal, git, and SQL joins.
    2. If you haven't seen those concepts before, expect to spend extra time catching up during the weeks that use them.
    3. Depending on your background, you may also have to complete extra assignments.
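
As a rough self-check on the "SQL joins" expectation above, here is a minimal sketch using Python's built-in `sqlite3` module. The `users` and `tweets` tables and their contents are invented for illustration and are not course data. If the query reads naturally to you, you likely have the expected background.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema and rows, invented for this example.
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tweets (id INTEGER PRIMARY KEY, user_id INTEGER, text TEXT);
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO tweets VALUES (10, 1, 'hello'), (11, 1, 'big data'), (12, 2, 'hi');
""")

# Inner join: pair each tweet with its author, then count tweets per author.
rows = conn.execute("""
    SELECT users.name, COUNT(tweets.id) AS n
    FROM users
    JOIN tweets ON tweets.user_id = users.id
    GROUP BY users.name
    ORDER BY users.name
""").fetchall()

print(rows)
```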

Relation to other CS courses:

One purpose of this course is to provide DS majors with an overview of CS concepts. Therefore, there is a lot of material in this course that is covered in other upper division CS courses required for CS majors.

  1. Overlapping concepts

    1. CSCI105 Computer Systems (10% overlap)
      1. types of storage: tape vs HDD vs SSD vs NVMe vs RAM
      2. RAID
      3. parallel vs distributed architectures
    2. CSCI135 Operating Systems (10% overlap)
      1. permissions systems
      2. processes vs threads
      3. virtual machines vs containers
    3. CSCI125 Networking (10% overlap)
      1. private vs public networks
      2. IP addresses
      3. TCP ports
      4. virtual networks
    4. CSCI121 Software Development (10% overlap)
      1. version control systems (i.e. git)
      2. test driven development / continuous integration
      3. microservices vs monolithic architectures
      4. 12 factor applications
    5. CSCI133 Databases (50% overlap)
      1. SQL
      2. ACID/MVCC/transactions
      3. indexing techniques
    6. A lot of the concepts we'll be covering "should" be covered in other CS courses, but because CS professors are often more theory minded than practice minded, they don't get covered. In that sense, this course is similar to the Missing Semester of Your CS Education course taught at MIT.
  2. Concepts we don't cover from CSCI133 Databases

    1. relational algebra
    2. technical implementation details / C programming
    3. relationship between the database and operating system
  3. Big data concepts from a CS perspective that we will not talk about:

    1. Frameworks for distributed computation (e.g. Apache Hadoop, Apache Spark)
    2. Distributed Filesystems (e.g. HDFS, IPFS); we will talk about S3
    3. Geo-distributed databases

Textbook:

Big data is a rapidly changing field, and all currently printed textbooks are both incomplete and already out of date. Therefore, we won't be using a textbook. Instead, we will be using online documentation. The main references we will use are given below, but I will provide more specific links each week.

  1. Docker documentation

  2. PostgreSQL documentation

  3. SQLite documentation

  4. SQLAlchemy documentation

  5. 12 Factor Web Apps

Grades

Assignments:

  1. Weekly labs (worth 2**1 points)
  2. Weekly quizzes (worth 2**2 or 2**3 or 2**4 points)
  3. Weekly projects (worth 2**3 or 2**4 or 2**5 points)
  4. 2 exams (worth 2**6 points each)
    1. Non-graduating students will complete a final project due during finals week.
  5. Occasional extra credit assignments

Late Work Policy:

You lose 2**i points on every assignment, where i is the number of days late. It is usually better to submit a correct assignment late than an incorrect one on time.
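
To see how quickly the penalty grows, here is a small sketch of the arithmetic. Two things are assumptions rather than stated policy: an on-time submission (0 days late) loses nothing, and a score cannot drop below zero. The function names are hypothetical.

```python
def late_penalty(days_late):
    """Points lost for a submission that is `days_late` days late."""
    if days_late <= 0:
        return 0  # assumption: on-time work is not penalized
    return 2 ** days_late


def score_after_penalty(earned, days_late):
    """Score after the late penalty, floored at zero (an assumption)."""
    return max(0, earned - late_penalty(days_late))


if __name__ == "__main__":
    # A project worth 2**5 = 32 points, completed correctly.
    for days in range(6):
        print(days, score_after_penalty(32, days))
```

Because the penalty doubles each day, one or two days late costs little on a large project, while a lab worth 2**1 points loses everything after a single day.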

Grade Schedule:

Your final grade will be computed according to the following standard table, with the caveats described below.

If your grade satisfies then you earn
95 ≤ grade A
90 ≤ grade < 95 A-
87 ≤ grade < 90 B+
83 ≤ grade < 87 B
80 ≤ grade < 83 B-
77 ≤ grade < 80 C+
73 ≤ grade < 77 C
70 ≤ grade < 73 C-
67 ≤ grade < 70 D+
63 ≤ grade < 67 D
60 ≤ grade < 63 D-
60 > grade F
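
The table can also be read as a sorted list of cutoffs. The sketch below looks up a letter with Python's `bisect` module; it assumes `grade` is a percentage, and the names `CUTOFFS`, `LETTERS`, and `letter_grade` are illustrative only. It does not apply the caveats described below.

```python
from bisect import bisect_right

# Lower bounds from the table, in increasing order.
CUTOFFS = [60, 63, 67, 70, 73, 77, 80, 83, 87, 90, 95]
# LETTERS[k] is earned when exactly k cutoffs are <= grade.
LETTERS = ["F", "D-", "D", "D+", "C-", "C", "C+", "B-", "B", "B+", "A-", "A"]


def letter_grade(grade):
    """Map a percentage grade to a letter using the table above."""
    return LETTERS[bisect_right(CUTOFFS, grade)]


if __name__ == "__main__":
    for g in (59.9, 60, 86.9, 87, 94.9, 95):
        print(g, letter_grade(g))
```

`bisect_right` is used so that a grade exactly on a cutoff (for example 87) earns the higher letter, matching the `≤` in the table.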

Caveats:

There are 2 "caveat tasks" in this course. These tasks should be easy, and everyone will get full credit on the task just for completing the task. If you don't complete one of the tasks, however, your grade (from the table above) will be docked 10%. (For example, an A- grade would become a B- grade.) You have the entire semester (until I submit grades) to complete these tasks.

You can find the details about the caveat tasks at:

  1. caveat_tasks/typespeed.md
  2. caveat_tasks/culture.md

Academic Integrity

Technology Policy:

  1. You MUST complete all programming assignments on the lambda server.

  2. You MUST use either vim or emacs to complete all programming assignments. In particular, you may not use the GitHub text editor, VSCode, IDLE, or PyCharm for any reason.

    This also means you MAY NOT use the GitHub interface to edit files for a pull request.

  3. You MAY NOT share your lambda server credentials with anyone else.

Violations of any of these policies will be treated as academic integrity violations.

Collaboration Policy

  1. There are no restrictions on what you can post to GitHub Issues. In particular, you are highly encouraged to post detailed questions/answers/comments with lots of code.

  2. You are highly encouraged to collaborate with students

    1. in class/lab,

    2. in the QCL,

    3. and in office hours.

  3. You MAY NOT look at another student's code (or have another human look at your code) in any other context.

  4. You MAY NOT look at another student's code on GitHub.

    All projects are developed as open source projects, and so the code is published openly online. The benefits of this model include: (1) you actually learn how to develop/contribute to open source projects; (2) future employers see you have GitHub activity. Please do not abuse this privilege.

Accommodations

I've tried to design the course to be as accessible as possible for people with disabilities. (We'll talk a bit about how to design accessible software in class too!) If you need any further accommodations, please ask.

I want you to succeed and I'll make every effort to ensure that you can.
