Git Product home page Git Product logo

postgresql_cursor's Introduction

PostgreSQLCursor for handling large Result Sets

Gem Version

PostgreSQLCursor extends ActiveRecord to allow for efficient processing of queries returning a large number of rows, and allows you to sort your result set.

In PostgreSQL, a cursor runs a query, from which you fetch a block of (say 1000) rows, process them, and continue fetching until the result set is exhausted. By fetching a smaller chunk of data, this reduces the amount of memory your application uses and prevents the potential crash of running out of memory.

Supports Rails/ActiveRecord v3.1 (v3.2 recommended) higher (including v5.0) and Ruby 1.9 and higher. Not all features work in ActiveRecord v3.1. Support for this gem will only be for officially supported versions of ActiveRecord and Ruby; others can try older versions of the gem.

Using Cursors

PostgreSQLCursor was developed to take advantage of PostgreSQL's cursors. Cursors allow the program to declare a cursor to run a given query returning "chunks" of rows to the application program while retaining the position of the full result set in the database. This overcomes all the disadvantages of using find_each and find_in_batches.

Also, with PostgreSQL, you have on option to have raw hashes of the row returned instead of the instantiated models. An informal benchmark showed that returning instances is a factor of 4 times slower than returning hashes. If you are can work with the data in this form, you will find better performance.

With PostgreSQL, you can work with cursors as follows:

Product.where("id>0").order("name").each_row { |hash| Product.process(hash) }

Product.where("id>0").each_instance { |product| product.process! }
Product.where("id>0").each_instance(block_size:100_000) { |product| product.process }

Product.each_row { |hash| Product.process(hash) }
Product.each_instance { |product| product.process }

Product.each_row_by_sql("select * from products") { |hash| Product.process(hash) }
Product.each_instance_by_sql("select * from products") { |product| product.process }

Cursors must be run in a transaction if you need to fetch each row yourself

Product.transaction do
  cursor = Product.all.each_row
  row = cursor.fetch                       #=> {"id"=>"1"}
  row = cursor.fetch(symbolize_keys:true)  #=> {:id =>"2"}
  cursor.close
end

All these methods take an options hash to control things more:

block_size:n      The number of rows to fetch from the database each time (default 1000)
while:value       Continue looping as long as the block returns this value
until:value       Continue looping until the block returns this value
connection:conn   Use this connection instead of the current Product connection
fraction:float    A value to set for the cursor_tuple_fraction variable.
                  PostgreSQL uses 0.1 (optimize for 10% of result set)
                  This library uses 1.0 (Optimize for 100% of the result set)
                  Do not override this value unless you understand it.
with_hold:boolean Keep the cursor "open" even after a commit.
cursor_name:string Give your cursor a name.

Notes:

  • Use cursors only for large result sets. They have more overhead with the database than ActiveRecord selecting all matching records.
  • Aliases each_hash and each_hash_by_sql are provided for each_row and each_row_by_sql if you prefer to express what types are being returned.

PostgreSQLCursor is an Enumerable

If you do not pass in a block, the cursor is returned, which mixes in the Enumerable libary. With that, you can pass it around, or chain in the awesome enumerable things like map and reduce. Furthermore, the cursors already act as lazy, but you can also chain in lazy when you want to keep the memory footprint small for rest of the process.

Product.each_row.map {|r| r["id"].to_i } #=> [1, 2, 3, ...]
Product.each_instance.map {|r| r.id }.each {|id| p id } #=> [1, 2, 3, ...]
Product.each_instance.lazy.inject(0) {|sum,r| sum +  r.quantity } #=> 499500

PostgreSQLCursor and collection rendering

You can render cursor collection, using enumeration as collection attribute.

render partial: "some_partial", collection: Product.each_instance
render partial: "some_partial", collection: Product.each_row
render partial: "some_partial", collection: Product.each_hash

Hashes vs. Instances

The each_row method returns the Hash of strings for speed (as this allows you to process a lot of rows). Hashes are returned with String values, and you must take care of any type conversion.

When you use each_instance, ActiveRecord lazily casts these strings into Ruby types (Time, Fixnum, etc.) only when you read the attribute.

If you find you need the types cast for your attributes, consider using each_instance insead. ActiveRecord's read casting algorithm will only cast the values you need and has become more efficient over time.

Select and Pluck

To limit the columns returned to just those you need, use .select(:id, :name) query method.

Product.select(:id, :name).each_row { |product| product.process }

Pluck is a great alternative instead of using a cursor. It does not instantiate the row, and builds an array of result values, and translates the values into ruby values (numbers, Timestamps. etc.). Using the cursor would still allow you to lazy load them in batches for very large sets.

You can also use the pluck_rows or pluck_instances if the results won't eat up too much memory.

Product.newly_arrived.pluck(:id) #=> [1, 2, 3, ...]
Product.newly_arrived.each_row { |hash| }
Product.select(:id).each_row.map {|r| r["id"].to_i } # cursor instead of pluck
Product.pluck_rows(:id) #=> ["1", "2", ...]
Product.pluck_instances(:id, :quantity) #=> [[1, 503], [2, 932], ...]

Associations and Eager Loading

ActiveRecord performs some magic when eager-loading associated row. It will usually not join the tables, and prefers to load the data in separate queries.

This library hooks onto the to_sql feature of the query builder. As a result, it can't do the join if ActiveRecord decided not to join, nor can it construct the association objects eagerly.

Locking and Updating Each Row (FOR UPDATE Queries)

When you use the AREL lock method, a "FOR UPDATE" clause is added to the query. This causes the block of rows returned from each FETCH operation (see the block_size option) to be locked for you to update. The lock is released on those rows once the block is exhausted and the next FETCH or CLOSE statement is executed.

This example will run through a large table and potentially update each row, locking only a set of rows at a time to allow concurrent use.

Product.lock.each_instance(block_size:100) do |p|
  p.update(price: p.price * 1.05)
end

Also, pay attention to the block_size you request. Locking large blocks of rows for an extended time can cause deadlocks or other performance issues in your application. On a busy table, or if the processing of each row consumes a lot of time or resources, try a block_size <= 10.

See the PostgreSQL Select Documentation for more information and limitations when using "FOR UPDATE" locking.

Background: Why PostgreSQL Cursors?

ActiveRecord is designed and optimized for web performance. In a web transaction, only a "page" of around 20 rows is returned to the user. When you do this

Product.where("id>0").each { |product| product.process }

The database returns all matching result set rows to ActiveRecord, which instantiates each row with the data returned. This function returns an array of all these rows to the caller.

Asynchronous, Background, or Offline processing may require processing a large amount of data. When there is a very large number of rows, this requires a lot more memory to hold the data. Ruby does not return that memory after processing the array, and the causes your process to "bloat". If you don't have enough memory, it will cause an exception.

ActiveRecord.find_each and find_in_batches

To solve this problem, ActiveRecord gives us two alternative methods that work in "chunks" of your data:

Product.where("id>0").find_each { |model| Product.process }

Product.where("id>0").find_in_batches do |batch|
  batch.each { |model| Product.process }
end

Optionally, you can specify a :batch_size option as the size of the "chunk", and defaults to 1000.

There are drawbacks with these methods:

  • You cannot specify the order, it will be ordered by the primary key (usually id)
  • The primary key must be numeric
  • The query is rerun for each chunk (1000 rows), starting at the next id sequence.
  • You cannot use overly complex queries as that will be rerun and incur more overhead.

How it works

Under the covers, the library calls the PostgreSQL cursor operations with the pseudo-code:

SET cursor_tuple_fraction TO 1.0;
DECLARE cursor_1 CURSOR WITH HOLD FOR select * from widgets;
loop
  rows = FETCH 100 FROM cursor_1;
  rows.each {|row| yield row}
until rows.size < 100;
CLOSE cursor_1;

Meta

Author

Allen Fair, @allenfair, github://afair

Note on Patches/Pull Requests

  • Fork the project.
  • Make your feature addition or bug fix.
  • Add tests for it. This is important so I don't break it in a future version unintentionally.
  • Commit, do not mess with rakefile, version, or history. (if you want to have your own version, that is fine but bump version in a commit by itself I can ignore when I pull)
  • Send me a pull request. Bonus points for topic branches.

Code of Conduct

This project adheres to the Open Code of Conduct. By participating, you are expected to honor this code.

Copyright

Copyright (c) 2010-2017 Allen Fair. See (MIT) LICENSE for details.

postgresql_cursor's People

Contributors

aaronbroadintrohive avatar afair avatar amatsuda avatar bbonamin avatar bibendi avatar biinari avatar bryant1410 avatar dmitrybochkarev avatar fgarces avatar pedrofcuba avatar prashantvithani avatar ravbaker avatar redterror avatar simi avatar toothrot avatar ukolovda avatar vzvu3k6k avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

postgresql_cursor's Issues

Publish version 0.5 to Rubygems

Hello there,

Great work!

I want to use the version for Rails 4 in another gem. Its not possible, since gemspec doesnt' support git branches as dependencies, but only dependencies from Rubygems.

Any chance version 0.5 will be published to Rubygems soon?

uninitialized constant ActiveRecord::ConnectionAdapters::PostgreSQLAdapter

For anyone seeing this error, I encountered it when adding postgresql_cursor to my Gemfile.

I had to use gem "postgresql_cursor", require: false and then require "postgresql_cursor" manually where I needed it.

This would be good to document, I think.

Rails 4.2.11, Ruby 2.5.3p105, Bundler 1.17.2

SQL Syntax Errors referencing a $1 placeholder

When running a query using an association, ActiveRecord has changed to use prepared statements with bind parameters, something like:

"select * from  comments where post_id = $1", [["post_id", 123]]

so when to_sql is called on the relation, the $1 is not yet interpolated.

Solution: use connection.unprepared_statement { to_sql } syntax to skip preparation. Cursor declaration includes preparation anyway.

Different column types and enum issues

each_instance incorrectly sets the column types, for example

class User < ActiveRecord::Base
  enum state: {
    active: 1,
  }
end

pp User.first
# #<User:0x00007f884e1bee40
#  id: 1,
#  created_at: Wed, 17 Apr 2019 14:33:34 AEST +10:00
#  state: "active">

pp User.each_instance.first
# #<User:0x00007f884e1bee40
#  id: 1,
#  created_at: 2019-04-16 22:24:59 UTC
#  state: 1>

Expected.
in the example above objects from both queries should be identical

rails -v => Rails 6.0.3.4
ruby -v => ruby 2.6.5p114 (2019-10-01 revision 67812) [x86_64-darwin19]

each_instance doesn't handle enum attributes?

First of all, great gem, love it ๐Ÿ‘

I've only run into a problem with enum attributes in ActiveRecord. If I define a model with an enum attribute, .each() outputs the symbolic enum value, but .each_instance() outputs the underlying numeric value. e.g.

class User < ApplicationRecord
enum gender: [:male, :female]
end

User.all.each {|u| puts u.gender} # Outputs "male", "male", "female"
User.all.each_instance {|u| puts u.gender} # Outputs 0, 0, 1

cursor_tuple_fraction doesn't get set

Looking at https://github.com/afair/postgresql_cursor/blob/master/lib/postgresql_cursor/cursor.rb#L298, the comment explains that the gem prefers to override PostgreSQL's default cursor_tuple_fraction = 0.1 to 1.0.

However because the arg frac=1.0 and there's an default on the options lookup @cursor_tuple_fraction ||= @options.fetch(:fraction) { 1.0 }, I believe the early return return @cursor_tuple_fraction if frac == @cursor_tuple_fraction will always fire unless you set a custom value not equal to 1.0.

This means both that the 1.0 default never gets applied, and also that it's impossible to apply a configuration value of 1.0.

I think the early return probably is an unnecessary optimization in terms of speed, and having any early return will possibly cause bugs since the default config on a given Postgres install might not be 0.1 anyway, so it's not possible to fully fix by changing to only return early if the desired value is 0.1.

Finally, the gem doesn't reset the value after the cursor is done, so it's potentially poisoning the connection settings for any other use not going through the gem.

Is it intentional that v0.6.5 does not support Rails 6?

Hi,

Thanks for the recent release. I've just tried to install 0.6.5 but it's not compatible with Rails 6 - was this intentional? We're likely to be a few months away from upgrading to Rails 7 but would really like to keep using this gem. Was the restriction intentional?

The following minimal Gemfile shows the problem when using bundle install:

source "https://rubygems.org"

gem "rails", "~> 6.1.7"
gem "postgresql_cursor", "~> 0.6.5"
Bundler could not find compatible versions for gem "activerecord":
  In Gemfile:
    postgresql_cursor (~> 0.6.5) was resolved to 0.6.5, which depends on
      activerecord (>= 7.0.0)

    rails (~> 6.1.7) was resolved to 6.1.7, which depends on
      activerecord (= 6.1.7)

Thanks,
Owen.

PG::ConnectionBad: PQsocket() can't get socket descriptor: ROLLBACK

Our app opens a cursor which can run for hours at a time.
On the longer runs (7+ hours) the cursor occasionally errors out with:

ActiveRecord::StatementInvalid
PG::ConnectionBad: PQsocket() can't get socket descriptor: ROLLBACK

I am trying to rule this gem out as the culprit. I'm not sure if this is masking a different error.
Should we be passing with_hold: true in the options?

Here is a snippet of the cursor loop:

# BATCH_SIZE = 10_000
 records.each_instance(block_size: BATCH_SIZE) { |record| @handler.call(record) }

Couple things that may or may not be pertinent:

  • We tried upgrading to 0.6.7 and see the same behavior
  • The records returned by the cursor have associations on them.
  • App makes other read only ActiveRecord calls within the block passed to #each_instance
    • these read only calls do not access the same table as the cursor

Question - Return a collection

I'am trying to use this gem in the following way:

sum = 0
Product.for_sale.each_instance {|p| p.sum = sum + p.balance ; sum += p.balance }
And return that as a collection. From the documentation, it says that each_instance returns the cursor as an enumerable. Is tthere a way to get a collection of Products?

Thanks!

Why is FOR UPDATE not supported?

The README says that this gem is not intended to support FOR UPDATE to process and update each row in place. But why not? I scanned the functionality of this gem and the PostgreSQL documentation for a bit but it's not obvious why FOR UPDATE is not supported.

each_instance does not respect ActiveRecord serialize?

I can pull together a more descriptive example if it's useful, but I have a JSON type in the db where I store arrays of stringified integers. I use serialize to serialize arrays of integers to arrays of strings upon save, and to deserialize back into arrays of integers upon load from the db. each_instance doesn't appear to do this; a standard ActiveRecord load does.

http://api.rubyonrails.org/classes/ActiveRecord/AttributeMethods/Serialization/ClassMethods.html

Deprecation warnings in Rails 6

I use #each_instance_batch method to iterate over batches, and when it tries to find column types for postgresql enum columns it warns

types[fname] = @connection.get_type_map.fetch(ftype, fmod) do |oid, mod|

/gems/2.7.0/gems/activerecord-6.0.3.4/lib/active_record/connection_adapters/abstract_adapter.rb:698: warning: deprecated Object#=~ is called on Integer; it always returns nil
/gems/2.7.0/gems/activerecord-6.0.3.4/lib/active_record/connection_adapters/abstract_adapter.rb:694: warning: deprecated Object#=~ is called on Integer; it always returns nil

ruby -v => ruby 2.7.1p83 (2020-03-31 revision a0c7c23c9c) [x86_64-linux]
rails -v => Rails 6.0.3.4

Rendering cursor as collection in RoR give error `undefined method 'size' for #<PostgreSQLCursor::Cursor`

Hello!

I try render a collection from cursor in Ruby on Rails (it may be useful for rendering big collections).
But got an error
undefined method 'size' for #<PostgreSQLCursor::Cursor:0x00007fcc4edd4dd0

Steps to reproduce:

Create user and database:

$ psql

postgres=# create user books with password 'books';
CREATE ROLE

postgres=# create database books_development owner books;
CREATE DATABASE

Create test application:

rails new books --database postgresql
cd books

Fix development block in config/database.yml file for connect to database.

Add controller and run server:

bundle add postgresql_cursor
rails g scaffold book name:string
rails db:migrate
rails s

Testing

Open in a browser: http://localhost:3000/books
It should show book list (now empty).

Append one or several books to list.

Modify view file for use render of collection

# app/views/books/index.html.erb
<p style="color: green"><%= notice %></p>

<h1>Books</h1>

<div id="books">
<%= render partial: "books/book", collection: @books %>
</div>

<%= link_to "New book", new_book_path %>

Link http://localhost:3000/books should show book list.

Modify view index.html.erb for use postgresql_cursor enum

Append .each_instance to books collection:

<%= render partial: "books/book", collection: @books.each_instance %>

Refresh book list in browser. It show an error:

NoMethodError in Books#index

Showing books/app/views/books/index.html.erb where line #6 raised:

undefined method `size' for #<PostgreSQLCursor::Cursor:0x00007fcc4edd4dd0 @sql="SELECT \"books\".* FROM \"books\"",  ...

Return enumerators when no block given

Won't it be more natural for the each_instance method (as well as other iterators) w/o a block to return a enumerator instead of the "raw" cursor?

Now I can do this explicitly:

MyModel.all.each_instance.to_enum.with_object([]) { |record, obj| obj << record }

I think this to_enum can be applied to cursor under the hood, namely here:
https://github.com/afair/postgresql_cursor/blob/master/lib/postgresql_cursor/active_record/sql_cursor.rb#L34

This would make API a bit more natural IMHO:

MyModel.all.each_instance.with_object([]) { |raw, obj| ... }

Bug: SyntaxError when using ActiveRecord none

We have discovered that this gem doesn't handle properly the ActiveRecord none method.

subscriptions = Subscription.none
subscriptions.each_row { |row| p row }

The above code raises:

     ActiveRecord::StatementInvalid:
       PG::SyntaxError: ERROR:  syntax error at end of input
       LINE 1: ...ursor_a7205fd123ab9169d9a3105 no scroll cursor for 
                                                                                                                     ^

The expected behavior is that no error is raised, because the above code is perfectly valid. For example the following code that uses ActiveRecord each does not raise any error:

subscriptions = Subscription.none
subscriptions.each { |row| p row }

I think that the problem is that this gem uses the to_sql method, which returns an empty string when none is used (somewhere in the filter chain).

Probably the best solution is to check if to_sql returns an empty string and in that case return an empty array of results.

Underlying SQL Errors Masked When Transaction Has Been Aborted

To reproduce run the following:

ActiveRecord::Base.each_row_by_sql('select * from bad_table') { }

This leads to the following exception:

ActiveRecord::StatementInvalid: PG::InFailedSqlTransaction: ERROR:  current transaction is aborted, commands ignored until end of transaction block
: close cursor_1

It would be much easier to troubleshoot issues if the original exception was thrown:

ActiveRecord::StatementInvalid: PG::UndefinedTable: ERROR:  relation "bad_table" does not exist
LINE 1: declare cursor_1 cursor for select * from bad_table
                                                  ^
: declare cursor_1 cursor for select * from bad_table

Rails 5 compatibility?

Does postgresql_cursor work with current Rails 5 pre-releases or is it better to expect that this gem will not be compatible with Rails 5 and will need major reworking? As the last commit was 14 months ago, have people moved on to alternative solutions, and if yes, which?

Declaring cursors with NO SCROLL?

The DECLARE command allows the NO SCROLL argument, which hints to PostgreSQL that you will only be scrolling forward, not backwards. If you do not specify this argument then PostgreSQL tries to figure out for you whether it should allow scrolling or not, although the documentation does not explain what that means.

It seems this gem is only designed for scrolling forward anyway, so would it be a good idea to always call DECLARE with NO SCROLL as a micro-optimization? According to the documentation, NO SCROLL was introduced in PostgreSQL 8.3, which is very old. For reference: Ubuntu 12.04 and Debian 7 both supply PostgreSQL 9.1.

Documentation on cursors: https://www.postgresql.org/docs/9.6/static/plpgsql-cursors.html

Rails 5 compatibility?

With Rails 5 around the corner, no recent commits to this project and a few forks with a few small optimiztions that have not been pulled back in, I'd like to ask if a different way of lazily plucking attributes has arisen recently or if postgresql_cursor is still suggested to use?

Thanks!

How to setup data to use `each_row` in an rspec test?

Currently, I have my main code working where I run each_row on my ActiveRecord Relation. But in the test, I have an array, and the original code used to be .each so the test used to pass. Now I am running into undefined method 'each_row' for #<Array:0x00007f8a584358e8>. Any suggestion on what the structure in the test should be that it supports each_row ? Any advice will be great! Thank you.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.