Data Science Introduction

This repository provides an introduction to the field of data science, covering fundamental concepts, processes, and tools. It is designed to be a valuable resource for individuals at all levels of expertise, whether you are a novice data enthusiast, a budding data scientist, or an experienced professional looking to refresh your knowledge.

Data Science

Data science combines coding, math and statistics, and domain expertise to extract valuable insights and knowledge from data. Its key components are:

  • Coding: Gathering and preparing data, performing statistical analysis with programming languages such as R and Python, working with databases using SQL, using command-line tools, and searching data with regular expressions.
  • Math: Understanding mathematical concepts such as probability, algebra, and regression to analyze data effectively.
  • Domain Expertise: Possessing expertise in a specific field, understanding the goals, methods, and constraints related to that domain.

Different roles in data science require varying skills and knowledge. Some of these roles include:

  • Data Engineer: Focused on managing data pipelines, databases, and back-end hardware and software.
  • Big Data Engineer: Focused on computer science and mathematics, working with machine learning algorithms and building data products.
  • Research Scientist: Specializing in a specific domain, with a strong background in statistics and mathematics.
  • Analyst: Focused on business analytics, web analytics, and working with structured data using SQL.
  • Business Intelligence: Focused on extracting insights from business-relevant data.
  • Entrepreneur: Combining data and business knowledge to drive creative solutions.
  • Full-Stack Unicorn: Possessing a diverse skill set across multiple areas of data science.

Data science is a diverse field with various roles, skills, and goals.

Big Data

Big data involves handling data that is characterized by its volume, velocity, and variety. It requires expertise in coding, statistics, and domain knowledge to extract meaningful insights from massive datasets.

Data Science vs. Statistics

While data science and statistics share the common goal of analyzing data, they have distinct differences:

  • Most data scientists are not statisticians. Data science involves additional components such as coding and working with big data that are not typically emphasized in traditional statistics.
  • Machine learning and big data are areas that are not commonly associated with traditional statistics.
  • Data science is not merely a subset of statistics; it encompasses a broader range of skills and methodologies.

Data Gathering

Data gathering involves various methods, including utilizing existing data, APIs, and web scraping, as well as creating your own data through surveys and experiments.

Mathematical foundations play a crucial role in data science and include topics such as linear algebra, calculus, probability, Bayes' theorem, and understanding Big O notation.

Existing Data Sourcing

There are three primary sources for existing data:

  1. In-house: Data that is readily available within an organization. Considerations include data format, documentation, quality, and any restrictions on its use.
  2. Public: Data available from public sources, such as government repositories, research organizations, and open data platforms. Public data offers a wide range of topics, but it may have biases and privacy/confidentiality concerns.
  3. Commercial: Third-party data vendors provide access to a variety of data sets for a fee. Commercial data is typically well formatted and documented, but it can be expensive, may carry restrictions on use, and can raise privacy/confidentiality concerns.
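
As a small illustration of working with existing public data, the sketch below reads a CSV file straight from a URL with pandas and takes a first look at its quality. The URL and dataset are hypothetical placeholders, not a real source.

```python
# A minimal sketch of loading existing public data: pandas can read a CSV
# directly from a URL. The URL and columns below are hypothetical
# placeholders, not a real dataset.
import pandas as pd

url = "https://data.example.gov/open-data/air_quality.csv"  # hypothetical
df = pd.read_csv(url)

print(df.shape)        # rows and columns
print(df.head())       # first few records for a quick quality check
print(df.describe())   # summary statistics to spot obvious issues
```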

APIs

Application Programming Interfaces (APIs) provide a standardized way to access and retrieve data. Common kinds include REST APIs, social media APIs, and visual APIs. APIs can be accessed from programming languages such as R, Python, and Bash; a minimal request sketch follows the list below.

  • REST: Representational state transfer

    • Access to data via HTTP
    • JSON format
  • Social media APIs

    • Twitter, Facebook, Instagram, LinkedIn, etc.
  • Visual APIs

    • Google Maps, Flickr, YouTube, etc.
    • AccuWeather, OpenWeatherMap
  • Programming Languages

    • R, Python, Bash
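
To make the REST idea concrete, here is a minimal Python sketch of calling an HTTP endpoint and decoding its JSON response with the requests library. The endpoint URL and query parameters are hypothetical placeholders, not a real service.

```python
# A minimal sketch of calling a REST API over HTTP and parsing its JSON
# response. The endpoint URL and query parameters below are hypothetical
# placeholders, not a real service.
import requests

def fetch_json(url: str, params: dict) -> dict:
    """Send a GET request and return the decoded JSON body."""
    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()   # fail loudly on HTTP errors
    return response.json()        # REST APIs commonly return JSON

if __name__ == "__main__":
    data = fetch_json(
        "https://api.example.com/v1/observations",  # hypothetical endpoint
        {"city": "Zagreb", "units": "metric"},       # hypothetical parameters
    )
    print(data)
```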

Make Your Own Data

  1. Interviews

    • Structured interviews
      • Same questions, same order
      • PRO: Easy to analyze
      • CON: Not flexible
    • Unstructured interviews
      • PRO: Flexible
      • CON: Hard to analyze
    • Require time, interviewer training, and careful analysis
    • Best for new topics & audiences
  2. Card Sorting

    • Mental model (how people think about a topic intuitively)
    • Generative card sorting
      • Respondents create their own categories
      • Used to create a website
    • Evaluative card sorting
      • Respondents sort cards into pre-defined categories
      • Used to evaluate a website
    • Dendrograms (hierarchical clustering)
    • Digital card sorting tools
      • Optimal Workshop
      • UserZoom
      • UX Suite
  3. Lab Experiments

    • Cause and effect
    • Researcher controls the environment
    • Eye tracking in web design
    • Expensive, time-consuming, labor-intensive, requires expertise and training
  4. A/B Testing

    • Compare two versions of a website
    • Randomly assign users to one of two versions
    • Measure performance, response rate (clicks, purchases, etc.)
    • Implement the best version
    • Software: Optimizely, VWO
    • PRO: Easy, fast, cheap
    • CON: Limited to websites, to two versions at a time, and to the performance metric being tracked (a minimal analysis sketch follows this list)
  5. Surveys

    • Questionnaires
    • Closed-ended questions
    • Open-ended questions
    • In person, phone, mail, email, web
    • Survey platforms: SurveyMonkey, Google Forms, Typeform, Qualtrics
    • PRO: Easy, fast, cheap
    • CON: Limited to the questions asked and the answers given; watch out for response bias
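
To illustrate how A/B test results might be analyzed, the sketch below compares click counts for two page versions with a chi-square test of independence from SciPy. The visitor and click numbers are made up for illustration.

```python
# A minimal sketch of analyzing A/B test results with a chi-square test of
# independence (scipy). The counts below are made-up illustrative numbers,
# not real data.
from scipy.stats import chi2_contingency

# rows: version A, version B; columns: clicked, did not click
observed = [
    [120, 880],   # version A: 120 clicks out of 1000 visitors (hypothetical)
    [150, 850],   # version B: 150 clicks out of 1000 visitors (hypothetical)
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("The difference in click rate is statistically significant.")
else:
    print("No significant difference detected between the two versions.")
```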

Coding (Manipulate Data)

Data tools != data science

  • Spreadsheets, Tableau (visualization), Web data
  • Programming languages: R, Python, SQL
  • Other languages: C, C++, Java, Bash, Regex
  1. Applications

    • Excel, Google Sheets

      • Data transfer (CSV)
      • Data browsing, sorting, rearranging, find and replace
      • Formatting, transposing, tracking changes, pivot tables, arranging output
      • Tidy data principles
    • Tableau and Tableau Public

      • Visualization
      • Drag and drop interface
    • SPSS

      • Statistical Package for the Social Sciences
      • Automate data analysis with syntax
    • JASP

      • Open-source alternative to SPSS
      • Share analysis with OSF.io
    • Other statistical software

      • SAS, JMP, Stata, Minitab, MATLAB, Mathematica, Wolfram Alpha, and various data mining tools
    • HTML

      • Defines the structure of web pages
    • XML

      • Semi-structured data
      • Markup language, allows commenting and metadata
    • JSON

      • JavaScript Object Notation
      • Semi-structured data
      • Designed for data interchange (a small JSON-to-SQL sketch follows this list)
  2. Coding Languages

    • R

      • Free & open source
      • Extensive package ecosystem
      • Integrated Development Environments: RStudio, Jupyter
    • Python

      • General-purpose language for data science
      • Libraries: NumPy, SciPy, Pandas, Matplotlib, Seaborn, Scikit-learn
    • SQL

      • Structured Query Language
      • Relational databases, structured data
      • RDBMS: MySQL, PostgreSQL, SQLite, Oracle, Microsoft SQL Server
    • C, C++, Java

      • Low-level, high-performance languages
      • Used for data processing and analysis
    • Bash

      • Command-line interface scripting
      • Built-in utilities for data manipulation
      • Installable packages for extended functionality
    • Regex

      • Pattern matching for finding specific data
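
As a small end-to-end illustration of the tools listed above, the sketch below parses semi-structured JSON into a pandas DataFrame, stores it in a throwaway SQLite database, and queries it with SQL. The records are made-up example data.

```python
# A minimal sketch tying together the tools above: semi-structured JSON is
# loaded into a pandas DataFrame, stored in a SQLite database, and queried
# with SQL. The records below are made-up illustrative data.
import json
import sqlite3
import pandas as pd

raw = '''[
  {"name": "Ana",   "role": "Analyst",       "salary": 52000},
  {"name": "Marko", "role": "Data Engineer", "salary": 61000},
  {"name": "Iva",   "role": "Analyst",       "salary": 55000}
]'''

df = pd.DataFrame(json.loads(raw))        # JSON -> tidy tabular data

conn = sqlite3.connect(":memory:")        # throwaway in-memory database
df.to_sql("employees", conn, index=False)
result = pd.read_sql_query(
    "SELECT role, AVG(salary) AS avg_salary FROM employees GROUP BY role",
    conn,
)
conn.close()

print(result)
```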

Mathematical Foundations

  • Linear algebra
  • Systems of linear equations
  • Calculus
  • Big O notation
  • Probability
  • Bayes' theorem
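
Bayes' theorem is easiest to appreciate with numbers. The sketch below works through the classic diagnostic-test example; the prevalence and accuracy figures are assumed values chosen for illustration, not course data.

```python
# A worked example of Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B).
# The prevalence and test-accuracy numbers are hypothetical illustrations.
def posterior(prior: float, sensitivity: float, false_positive_rate: float) -> float:
    """Probability of the condition given a positive test result."""
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

# 1% prevalence, 95% sensitivity, 5% false-positive rate (all assumed)
print(f"P(condition | positive test) = {posterior(0.01, 0.95, 0.05):.3f}")
# ~0.161: even a fairly accurate test yields a modest posterior when the
# prior is low, which is why base rates matter.
```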

Statistics in Data Science

  • Descriptive statistics
  • Inferential statistics
  • Hypothesis testing
  • Estimation
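
As a brief sketch of how these ideas look in code, the example below computes descriptive summaries, runs a two-sample t-test, and builds a confidence-interval estimate on simulated data (the samples are randomly generated, not from any real study).

```python
# A minimal sketch of descriptive statistics, a hypothesis test, and an
# interval estimate, using NumPy and SciPy on simulated (made-up) data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=100, scale=15, size=50)   # simulated sample A
group_b = rng.normal(loc=106, scale=15, size=50)   # simulated sample B

# Descriptive statistics
print(f"A: mean={group_a.mean():.1f}, sd={group_a.std(ddof=1):.1f}")
print(f"B: mean={group_b.mean():.1f}, sd={group_b.std(ddof=1):.1f}")

# Inferential statistics: two-sample t-test of the difference in means
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# Estimation: 95% confidence interval for the mean of group A
ci = stats.t.interval(0.95, df=len(group_a) - 1,
                      loc=group_a.mean(),
                      scale=stats.sem(group_a))
print(f"95% CI for mean of A: ({ci[0]:.1f}, {ci[1]:.1f})")
```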
