Data Science Introduction

This repository provides an introduction to the field of data science, covering fundamental concepts, processes, and tools. It is designed to be a valuable resource for individuals at all levels of expertise, whether you are a novice data enthusiast, a budding data scientist, or an experienced professional looking to refresh your knowledge.

Data Science

Data science combines coding, math and statistics, and domain expertise to extract valuable insights and knowledge from data. Its key components are:

  • Coding: Gathering and preparing data, performing statistical analysis with programming languages such as R and Python, working with databases using SQL, using command-line tools, and searching data with regular expressions.
  • Math: Understanding mathematical concepts such as probability, algebra, and regression to analyze data effectively.
  • Domain Expertise: Possessing expertise in a specific field, understanding the goals, methods, and constraints related to that domain.

Different roles in data science require varying skills and knowledge. Some of these roles include:

  • Data Engineer: Focused on managing data pipelines, databases, and back-end hardware and software.
  • Big Data Engineer: Focused on computer science and mathematics, working with machine learning algorithms and building data products.
  • Research Scientist: Specializing in a specific domain, with a strong background in statistics and mathematics.
  • Analyst: Focused on business analytics, web analytics, and working with structured data using SQL.
  • Business Intelligence: Focused on extracting insights from business-relevant data.
  • Entrepreneur: Combining data and business knowledge to drive creative solutions.
  • Full-Stack Unicorn: Possessing a diverse skill set across multiple areas of data science.

Data science is a diverse field with various roles, skills, and goals.

Big Data

Big data involves handling data that is characterized by its volume, velocity, and variety. It requires expertise in coding, statistics, and domain knowledge to extract meaningful insights from massive datasets.

Data Science vs. Statistics

While data science and statistics share the common goal of analyzing data, they have distinct differences:

  • Most data scientists are not statisticians. Data science involves additional components such as coding and working with big data that are not typically emphasized in traditional statistics.
  • Machine learning and big data are areas that are not commonly associated with traditional statistics.
  • Data science is not merely a subset of statistics; it encompasses a broader range of skills and methodologies.

Data Gathering

Data gathering involves various methods, including utilizing existing data, APIs, and web scraping, as well as creating your own data through surveys and experiments.

Mathematical foundations play a crucial role in data science and include topics such as linear algebra, calculus, probability, Bayes' theorem, and understanding Big O notation.

Existing Data Sourcing

There are three primary sources for existing data:

  1. In-house: Data that is readily available within an organization. Considerations include data format, documentation, quality, and any restrictions on its use.
  2. Public: Data available from public sources, such as government repositories, research organizations, and open data platforms. Public data offers a wide range of topics, but it may have biases and privacy/confidentiality concerns.
  3. Commercial: Third-party data vendors provide access to a variety of data sets for a fee. Commercial data is typically well formatted and documented, but it can be expensive, may carry restrictions on use, and can raise privacy/confidentiality concerns.
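
As a small illustration of working with existing public data, the sketch below reads a CSV file straight from a URL with pandas and takes a first look at its quality. The URL and dataset are hypothetical placeholders, not a real source.

```python
# A minimal sketch of loading existing public data: pandas can read a CSV
# directly from a URL. The URL and columns below are hypothetical
# placeholders, not a real dataset.
import pandas as pd

url = "https://data.example.gov/open-data/air_quality.csv"  # hypothetical
df = pd.read_csv(url)

print(df.shape)        # rows and columns
print(df.head())       # first few records for a quick quality check
print(df.describe())   # summary statistics to spot obvious issues
```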

APIs

Application Programming Interfaces (APIs) provide a standardized way to access and retrieve data. Common kinds include REST APIs, social media APIs, and visual APIs. APIs can be accessed from programming languages such as R, Python, and Bash; a minimal request sketch follows the list below.

  • REST: Representational state transfer

    • Access to data via HTTP
    • JSON format
  • Social media APIs

    • Twitter, Facebook, Instagram, LinkedIn, etc.
  • Visual APIs

    • Google Maps, Flickr, YouTube, etc.
    • AccuWeather, OpenWeatherMap
  • Programming Languages

    • R, Python, Bash
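
To make the REST idea concrete, here is a minimal Python sketch of calling an HTTP endpoint and decoding its JSON response with the requests library. The endpoint URL and query parameters are hypothetical placeholders, not a real service.

```python
# A minimal sketch of calling a REST API over HTTP and parsing its JSON
# response. The endpoint URL and query parameters below are hypothetical
# placeholders, not a real service.
import requests

def fetch_json(url: str, params: dict) -> dict:
    """Send a GET request and return the decoded JSON body."""
    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()   # fail loudly on HTTP errors
    return response.json()        # REST APIs commonly return JSON

if __name__ == "__main__":
    data = fetch_json(
        "https://api.example.com/v1/observations",  # hypothetical endpoint
        {"city": "Zagreb", "units": "metric"},       # hypothetical parameters
    )
    print(data)
```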

Make Your Own Data

  1. Interviews

    • Structured interviews
      • Same questions, same order
      • PRO: Easy to analyze
      • CON: Not flexible
    • Unstructured interviews
      • PRO: Flexible
      • CON: Hard to analyze
    • Require time, interviewer training, and careful analysis
    • Best for new topics & audiences
  2. Card Sorting

    • Mental model (how people think about a topic intuitively)
    • Generative card sorting
      • Respondents create their own categories
      • Used to create a website
    • Evaluative card sorting
      • Respondents sort cards into pre-defined categories
      • Used to evaluate a website
    • Dendrograms (hierarchical clustering)
    • Digital card sorting tools
      • Optimal Workshop
      • UserZoom
      • UX Suite
  3. Lab Experiments

    • Cause and effect
    • Researcher controls the environment
    • Eye tracking in web design
    • Expensive, time-consuming, labor-intensive, requires expertise and training
  4. A/B Testing

    • Compare two versions of a website
    • Randomly assign users to one of two versions
    • Measure performance, response rate (clicks, purchases, etc.)
    • Implement the best version
    • Software: Optimizely, VWO
    • PRO: Easy, fast, cheap
    • CON: Limited to websites, to two versions at a time, and to the performance metric being tracked (a minimal analysis sketch follows this list)
  5. Surveys

    • Questionnaires
    • Closed-ended questions
    • Open-ended questions
    • In person, phone, mail, email, web
    • Survey platforms: SurveyMonkey, Google Forms, Typeform, Qualtrics
    • PRO: Easy, fast, cheap
    • CON: Limited to the questions asked and the answers given; watch out for response bias
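
To illustrate how A/B test results might be analyzed, the sketch below compares click counts for two page versions with a chi-square test of independence from SciPy. The visitor and click numbers are made up for illustration.

```python
# A minimal sketch of analyzing A/B test results with a chi-square test of
# independence (scipy). The counts below are made-up illustrative numbers,
# not real data.
from scipy.stats import chi2_contingency

# rows: version A, version B; columns: clicked, did not click
observed = [
    [120, 880],   # version A: 120 clicks out of 1000 visitors (hypothetical)
    [150, 850],   # version B: 150 clicks out of 1000 visitors (hypothetical)
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("The difference in click rate is statistically significant.")
else:
    print("No significant difference detected between the two versions.")
```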

Coding (Manipulate Data)

Data tools != data science

  • Spreadsheets, Tableau (visualization), Web data
  • Programming languages: R, Python, SQL
  • Other languages: C, C++, Java, Bash, Regex
  1. Applications

    • Excel, Google Sheets

      • Data transfer (CSV)
      • Data browsing, sorting, rearranging, find and replace
      • Formatting, transposing, tracking changes, pivot tables, arranging output
      • Tidy data principles
    • Tableau and Tableau Public

      • Visualization
      • Drag and drop interface
    • SPSS

      • Statistical Package for the Social Sciences
      • Automate data analysis with syntax
    • JASP

      • Open-source alternative to SPSS
      • Share analysis with OSF.io
    • Other statistical software

      • SAS, JMP, Stata, Minitab, MATLAB, Mathematica, Wolfram Alpha, and various data mining tools
    • HTML

      • Defines the structure of web pages
    • XML

      • Semi-structured data
      • Markup language, allows commenting and metadata
    • JSON

      • JavaScript Object Notation
      • Semi-structured data
      • Designed for data interchange (a small JSON-to-SQL sketch follows this list)
  2. Coding Languages

    • R

      • Free & open source
      • Extensive package ecosystem
      • Integrated Development Environments: RStudio, Jupyter
    • Python

      • General-purpose language for data science
      • Libraries: NumPy, SciPy, Pandas, Matplotlib, Seaborn, Scikit-learn
    • SQL

      • Structured Query Language
      • Relational databases, structured data
      • RDBMS: MySQL, PostgreSQL, SQLite, Oracle, Microsoft SQL Server
    • C, C++, Java

      • Low-level, high-performance languages
      • Used for data processing and analysis
    • Bash

      • Command-line interface scripting
      • Built-in utilities for data manipulation
      • Installable packages for extended functionality
    • Regex

      • Pattern matching for finding specific data
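
As a small end-to-end illustration of the tools listed above, the sketch below parses semi-structured JSON into a pandas DataFrame, stores it in a throwaway SQLite database, and queries it with SQL. The records are made-up example data.

```python
# A minimal sketch tying together the tools above: semi-structured JSON is
# loaded into a pandas DataFrame, stored in a SQLite database, and queried
# with SQL. The records below are made-up illustrative data.
import json
import sqlite3
import pandas as pd

raw = '''[
  {"name": "Ana",   "role": "Analyst",       "salary": 52000},
  {"name": "Marko", "role": "Data Engineer", "salary": 61000},
  {"name": "Iva",   "role": "Analyst",       "salary": 55000}
]'''

df = pd.DataFrame(json.loads(raw))        # JSON -> tidy tabular data

conn = sqlite3.connect(":memory:")        # throwaway in-memory database
df.to_sql("employees", conn, index=False)
result = pd.read_sql_query(
    "SELECT role, AVG(salary) AS avg_salary FROM employees GROUP BY role",
    conn,
)
conn.close()

print(result)
```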

Mathematical Foundations

  • Linear algebra
  • Systems of linear equations
  • Calculus
  • Big O notation
  • Probability
  • Bayes' theorem
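
Bayes' theorem is easiest to appreciate with numbers. The sketch below works through the classic diagnostic-test example; the prevalence and accuracy figures are assumed values chosen for illustration, not course data.

```python
# A worked example of Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B).
# The prevalence and test-accuracy numbers are hypothetical illustrations.
def posterior(prior: float, sensitivity: float, false_positive_rate: float) -> float:
    """Probability of the condition given a positive test result."""
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

# 1% prevalence, 95% sensitivity, 5% false-positive rate (all assumed)
print(f"P(condition | positive test) = {posterior(0.01, 0.95, 0.05):.3f}")
# ~0.161: even a fairly accurate test yields a modest posterior when the
# prior is low, which is why base rates matter.
```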

Statistics in Data Science

  • Descriptive statistics
  • Inferential statistics
  • Hypothesis testing
  • Estimation
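
As a brief sketch of how these ideas look in code, the example below computes descriptive summaries, runs a two-sample t-test, and builds a confidence-interval estimate on simulated data (the samples are randomly generated, not from any real study).

```python
# A minimal sketch of descriptive statistics, a hypothesis test, and an
# interval estimate, using NumPy and SciPy on simulated (made-up) data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=100, scale=15, size=50)   # simulated sample A
group_b = rng.normal(loc=106, scale=15, size=50)   # simulated sample B

# Descriptive statistics
print(f"A: mean={group_a.mean():.1f}, sd={group_a.std(ddof=1):.1f}")
print(f"B: mean={group_b.mean():.1f}, sd={group_b.std(ddof=1):.1f}")

# Inferential statistics: two-sample t-test of the difference in means
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# Estimation: 95% confidence interval for the mean of group A
ci = stats.t.interval(0.95, df=len(group_a) - 1,
                      loc=group_a.mean(),
                      scale=stats.sem(group_a))
print(f"95% CI for mean of A: ({ci[0]:.1f}, {ci[1]:.1f})")
```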
