Git Product home page Git Product logo

learn_spark's Introduction

Explore Spark SQL

I need to play with collect_set and expand my familiarity with both "group by" and "over" windowing.

Here is some python code to do set up a small dataset and demo some of these features.

  • create.py creates a table of people with some duplicate names, but distinct ids. This creates a situation where you can play with grouping the data by name allowing for explorations of the contrast between group by and partition by.
  • distinct.py uses the same data to consider how distinct mixes in or can be replaced.
  • aggregate_not.py shows what any_value() and first() do. They don't really aggregate, but they work in that context, picking a single value from the group sort-of like the aggregating functions min,max, avg etc. do. In that sense, they are aggregators.
  • parition.py This is a demo of rank() over different partitions. Like group-by in that you're dividing the result set into groups based on certain fields, over() is different in that there is no requirement on aggregating the variables that are not in the group-by phrase. You get every row, so there's no need to aggregate anything. Behind all this is an intuition that the columns listed in the group-by are keys to the result-set. Since you only get one row per "key", you have to aggregate. The key idea holds for the partitions created by over(), but you don't need to aggregate since you get all rows back.
  • both.py Using group-by and rank/over at the same time requires a clear understanding of the two and how they fit into the SQL query pipeline.

partitioning vs grouping

This page, https://towardsdatascience.com/the-6-steps-of-a-sql-select-statement-process-b3696a49a642 describes 6 stages to SQL processing.

  1. Getting Data (From, Join)
  2. Row Filter (Where)
  3. Grouping (Group by)
  4. Group Filter (Having)
  5. Return Expressions (Select)
  6. Order & Paging (Order by & Limit / Offset)

Group-by is known to happend after inital result set creating with aggregating functions and then the having filter. Learning rank/over was confusing for me because I wanted to see it like group-b. It is, and it is not. The initial result set has to be fetched, and it is grouped, but the aggregation and distillation to a fewer number of rows doesn't happen. Functions like rank() and row_Number() (?) are applied to every row within the context of the partition before the aggregation step. After this step, is when group-by is applied and it lets you do some mind-stretchings like partion on one set of columns and group-by a different set!

Challenge: find some creative uses or examples of grouping and paritioning to show a use of both in the same query...

Actually, with a lack of good keys in a dataset like people, and a denormalized table that includes many entities for each person, you might use rank() to create an artificial key for each person, and then want to count those rows for each.

TODO

  • I have figured out how to drop a table from within spark, so the run.sh script removes the file from the file system.
  • Explore how to see the impacts of nulls and open-joins on non-key attributes mistakenly included in the key

learn_spark's People

Contributors

chrisroederucdenver avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.