
Really fast CSV imports with Rails and PostgreSQL

06 Jun 2016

Let’s say that we have to import data from some social network. Every 5 minutes we will be provided with a CSV file containing the list of posts that were added, liked, or commented on since the beginning of the day, together with the likes and comments counts for each post.

As you can imagine, the file will grow in size during the day; we are also looking at not only inserting new posts but also updating the counts for the ones we have already imported.

Test scenario

I’ve created:

  • sample example.csv file containing 50,000 rows in the following format (external_post_id last, matching the import code below):
title,published_at,likes_count,comments_count,post_id 

  • PostgreSQL table with unique index on external_post_id
CREATE TABLE public.posts
(
  id integer NOT NULL DEFAULT nextval('posts_id_seq'::regclass),
  title character varying,
  published_at timestamp with time zone,
  likes_count integer NOT NULL DEFAULT 0,
  comments_count integer NOT NULL DEFAULT 0,
  external_post_id integer NOT NULL,
  CONSTRAINT posts_pkey PRIMARY KEY (id)
);

CREATE UNIQUE INDEX index_posts_on_external_post_id
  ON public.posts
  USING btree
  (external_post_id);

  • ActiveRecord model
class Post < ActiveRecord::Base
  validates :title, presence: true
  validates :external_post_id, presence: true
  validates :external_post_id, uniqueness: true
end

To better emulate the requirements, all rows from the CSV file with an odd external_post_id will already be present in the database.
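To make the scenario reproducible, here is a minimal sketch of how such a sample file could be generated with Ruby's standard CSV library (the titles, timestamps, and counts below are made up for illustration; only the column order matters):

```ruby
require 'csv'
require 'time'

# Hypothetical generator for the sample file. The column order matches the
# import code in this post: title, published_at, likes_count, comments_count,
# external_post_id.
def generate_sample_csv(path, rows: 50_000)
  CSV.open(path, 'w') do |csv|
    rows.times do |i|
      csv << ["Post ##{i + 1}", Time.now.utc.iso8601, rand(1000), rand(100), i + 1]
    end
  end
end

generate_sample_csv('example.csv', rows: 100) # small run for illustration
```

Rows with odd external_post_id would then be pre-inserted into posts (e.g. with Post.create!) before running the benchmarks.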

Simple approach

To establish a baseline, let’s start with a very simple version that uses first_or_create, provided by ActiveRecord out of the box:

CSV.foreach(Rails.root.join('example.csv')) do |row|
  data = {
    title: row[0],
    published_at: row[1],
    likes_count: row[2],
    comments_count: row[3]
  }

  Post.where(external_post_id: row[4]).first_or_create(data).update(data)
end

The result is expected but still pretty disappointing:

      user     system      total         real
112.940000   9.690000 122.630000 (159.846472)
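The timing tables in this post have the shape of Ruby's Benchmark output; here is a minimal sketch of how such numbers are obtained (the measured block is a stand-in, not the actual import):

```ruby
require 'benchmark'

# Benchmark.measure returns a Benchmark::Tms holding user, system, total,
# and real (wall-clock) times, which format like the tables in this post.
result = Benchmark.measure do
  100_000.times { |i| i * i } # stand-in for the CSV import loop
end

puts result.format("%u %y %t (%r)")
```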

Stepping up our game

We all know there are better ways to perform bulk imports with ActiveRecord. One of them is the activerecord-import gem, which also supports the PostgreSQL 9.5 ON CONFLICT feature.

According to the documentation, the fastest method is to use raw columns and arrays of values, which is exactly what we are going to do:

columns = [:title, :published_at, :likes_count, :comments_count, :external_post_id]
values = CSV.read(Rails.root.join('example.csv'))

Post.import columns, values, validate: false,
  on_duplicate_key_update: { conflict_target: [:external_post_id] }

The result is amazing – we are almost 23x faster than before!

    user     system      total         real
6.160000   0.060000   6.220000 (  6.974064)

Could we be even faster?


The fastest way to get data into PostgreSQL is to use the COPY command, but it comes with limitations. An especially important one in our context is the lack of upsert support.
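For reference, a bare COPY straight into posts would look like this (assuming a hypothetical server-side path), and it is exactly this form that offers no ON CONFLICT clause – a single duplicate external_post_id aborts the entire load:

```sql
-- Plain COPY: fast, but no upsert; one duplicate key aborts the whole command.
COPY posts (title, published_at, likes_count, comments_count, external_post_id)
FROM '/tmp/example.csv' WITH (FORMAT csv, DELIMITER ',');
```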

Until it becomes available we can achieve similar functionality using temporary tables (TEMP TABLE).

Post.connection.execute <<-SQL
  CREATE TEMP TABLE post_imports
  (
    title character varying,
    published_at timestamp with time zone,
    likes_count integer,
    comments_count integer,
    external_post_id integer
  )
SQL

File.open(Rails.root.join('example.csv'), 'r') do |file|
  Post.connection.raw_connection.copy_data %{copy post_imports from stdin with csv delimiter ',' quote '"'} do
    while line = file.gets do
      Post.connection.raw_connection.put_copy_data line
    end
  end
end

Post.connection.execute <<-SQL
  insert into posts(title, published_at, likes_count, comments_count, external_post_id)
  select title, published_at, likes_count, comments_count, external_post_id
  from post_imports
  on conflict(external_post_id) do
  update set
    title = EXCLUDED.title,
    published_at = EXCLUDED.published_at,
    likes_count = EXCLUDED.likes_count,
    comments_count = EXCLUDED.comments_count
  returning id
SQL

In the example above we create a temporary table post_imports and COPY all the data into it. Then we take advantage of the INSERT INTO ... SELECT ... and ON CONFLICT syntax to move the data from the temporary table into the posts table.
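A design note: temporary tables live for the whole database session, so repeated imports on the same connection would need a DROP or TRUNCATE in between. If the three steps run inside a single transaction, PostgreSQL can clean up automatically:

```sql
-- Variant (assuming the COPY and the INSERT run in the same transaction):
-- ON COMMIT DROP removes the staging table when the transaction commits.
BEGIN;
CREATE TEMP TABLE post_imports
(
  title character varying,
  published_at timestamp with time zone,
  likes_count integer,
  comments_count integer,
  external_post_id integer
) ON COMMIT DROP;
-- ... COPY into post_imports and INSERT ... ON CONFLICT as above ...
COMMIT;
```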

The code is longer and less pretty than the one using the activerecord-import gem. It’s also 10x faster!

    user     system      total         real
0.150000   0.020000   0.170000 (  0.643315)

Final thoughts

It’s not surprising that the method closest to the database turns out to be the quickest.

10x faster than activerecord-import (and 248x faster than the simple approach) is very impressive, but if we look at the raw execution times we might find that ~7s is perfectly acceptable in most cases. Because of that I would recommend going with the activerecord-import gem, while keeping in mind that we can do much better if we really need to.
