How to collect aggregated statistics in a Ruby on Rails application?

13 Mar 2015

 

It is important to collect aggregated statistics so that management can analyze the data and make well-informed decisions.  Sphere was retained by a client in the recruiting industry who, among other things, needed to collect the following data:

  • Total shifts posted
  • Total hours posted
  • Total shifts worked
  • Total hours worked
  • Average length of shifts
  • Average shifts per job

In addition, Sphere had to provide the possibility of “spoofing” the statistics to a certain point while the production database was being tested. Up to that point, the statistics should have been based not on the actual values from the database but on some customarily-entered data.

Since our client was using a Ruby on Rails application, we decided to write a statistics module in Ruby as well in order to leverage existing code and to simplify maintenance.  We considered three implementation options:

 

Option Advantages Disadvantages
  • (1) Collect statistics on the fly. If total shifts posted must be calculated, a request is made to the corresponding table along with the constraints.
  • Statistics are always up to date.
  • Model code is polluted, as scopes and calculations for the statistics must be added.
  • Implementing the requirement of “spoofing” the statistics is difficult.
  • In order to calculate the data, SQL conditions like GROUP, JOIN, etc. must be added, which could lead to performance issues.
  • (2) Keep the statistics in separate tables which are updated on the fly. The data aggregated by day/employer is recorded and calculated in a separate table. If the data underlying the statistics are changed, a line recalculation takes place in the statistics table.
  • Statistics are always up to date.
  • Implementing the requirement of “spoofing” the statistics is easy.
  • Model code is polluted, as adding callbacks to call the code of calculation of statistical data is necessary.
  • If one model has changed, a request must be made for all models for this day for a given employer. In addition, the specifics of the application indicate that the models can change quite often during the day.
  • Minimal statistics detailing period is 1 day.
  • (3) Keep the statistics in separate tables which are updated periodically. The data aggregated by day/employer is recorded and calculated in a separate table. Application data vary throughout the day. Then a special background task at the end of the day collects the changes and updates the statistics table.
  • Clean model code.
  • All the logic is encapsulated in the collection of the statistics module.
  • Implementing the requirement of “spoofing” the statistics is easy.
  • Statistics can be irrelevant as they do not change throughout the day.
  • Minimal statistic detailing period is 1 day.

 

After presenting these three options to our client, we agreed to proceed with the third option.

Calculating & Storing Statistics

The Statistics::Employer model is used for calculating and storing statistical data. In its table, we store the date, employer’s foreign key, and all other values needed to calculate the statistics (total hours posted, total hours worked, number of applications, and average number of applications).

class CreateStatisticsEmployers < ActiveRecord::Migration
  def change
    create_table :statistics_employers do |t|
      t.date :date, index: true, null: false
      t.references :employer_profile, index: true, foreign_key: true, null: false

      t.integer :jobs_count, default: 0, null: false
      t.integer :shifts_posted_count, default: 0, null: false
      t.decimal :hours_posted_count, default: 0, null: false
      t.integer :shifts_worked_count, default: 0, null: false
      t.decimal :hours_worked_count, default: 0, null: false
    end
  end
end

All formulas are contained in the model code:

module Statistics
  class Employer < ActiveRecord::Base
    belongs_to :employer_profile

    class << self
      def total_jobs
       sum(:jobs_count)
      end

      def total_shifts_posted
       sum(:shifts_posted_count)
      end

      def total_hours_posted
       sum(:hours_posted_count)
      end

      def average_length_of_shift_posted
       total_shifts_posted.zero? ? 0 : total_hours_posted / total_shifts_posted
      end

      def average_shifts_per_job
       total_jobs.zero? ? 0 : total_shifts_posted.to_f / total_jobs
      end

      def total_shifts_worked
       sum(:shifts_worked_count)
      end

      def total_hours_worked
       sum(:hours_worked_count)
      end

      def average_length_of_shift_worked
       total_shifts_worked.zero? ? 0 : total_hours_worked / total_shifts_worked
      end
    end
  end
end

Methods are using ActiveRecord::Calculations, so they can be called up on any scope, which is useful for filtering by date/employer.

Collecting Statistics

The collection of statistics can be divided into three sub-tasks:

  • What time to start daily statistics collection.
  • What dates to collect statistics.
  • How to collect statistics.

We have already answered the first question by choosing an embodiment (implementation variation). After analyzing the operation in the application, we found that the majority of shifts end before 2 a.m., so the statistics will be collected by schedule at 3 a.m.

Cron can be used to perform this task, but we decided to use clockwork gem:

# clock.rb
require 'clockwork'
require './config/boot'
require './config/environment'

module Clockwork
  every(1.day, 'statistics.collect', at: '3:00') { Statistics::CollectJob.perform_later }
end
Statistics::CollectJob is background job, consistently resolving the remaining two sub-tasks:
# app/jobs/statistics/collect_job.rb
module Statistics
  class CollectJob < ::BaseJob
    def perform
      Statistics::UntrackedDatesService.new.execute
      Statistics::UpdateUntrackedService.new.execute
    end
  end
end

Statistics::UntrackedDatesService – detects which dates are untracked and creates UntrackedDate for them. It always counts yesterday as untracked, as well as dates on models with updated_at after midnight the previous day.

UntrackedDate is a very simple active record model that contains only date attribute with unique index.

As we collect statistics for jobs and shifts, we need to track Job and Shift model updates. Also, as we count jobs and posted shifts on each job’s creation date, and worked shifts at the end time of each shift, we assume Job#created_at‘s and JobShift#end_times dates are untracked if those jobs/shifts changed from the time of the last statistics update.

So the full code of UntrackedDatesService is:

# app/services/statistics/untracked_dates_service.rb
module Statistics
  class UntrackedDatesService
    attr_reader :working_date

    def initialize(current_date = Date.current)
      @working_date = current_date - 1.day
    end

    def execute
      untracked_dates.each do |date|
        Rails.logger.info "Marked #{date} as untracked"
        Statistics::UntrackedDate.find_or_create_by date: date
      end
    end

    private

    def untracked_dates
      [
        working_date,
        *untracked_past_jobs_dates,
        *untracked_past_shifts_posted_dates,
        *untracked_past_shifts_worked_dates
      ].uniq
    end

    def untracked_past_jobs_dates
      Job.where('updated_at >= ?', working_date.beginning_of_day)
         .where('created_at < ?', working_date.beginning_of_day)
         .pluck(:created_at).map(&:to_date)
    end

    def untracked_past_shifts_posted_dates
      # similar logic
    end

    def untracked_past_shifts_worked_dates
       # similar logic
    end
  end
end

 

The last subtask is performed by Statistics::UpdateUntrackedService. It takes each untracked date, deletes all statistics for that day, and calculates new statistics. (Calculation is incapsulated in yet another service, UpdateService.) We need to delete all previous statistics to keep the process simple. UpdateService does not know why we mark this date as untracked. It just does what it is supposed to do.

In UpdateService, we create groupings by employer and calculate aggregated stats. Then we bulk insert all stats into the Statistics::Employer model:

module Statistics
  class UpdateService
    attr_reader :date

    def initialize(date)
      @date = date
    end

    def execute
      return if date < KEEP_LIVE_STATISTICS_FROM
      Rails.logger.info "Updating statistics for #{date}"

      Statistics::Employer.where(date: date).delete_all
      Statistics::Employer.create employers_statistics
    end

    private

    def employers_statistics
      # Here we have a lot of ruby/rails/sql magic
      # and return array of hashes for each statistics entry
      # (i.e. grouped by date/employer_profile_id)

    end

  end
end

This is all we need to collect and calculate statistics, but we have one more step to cover.

Callbacks

Sometimes a model’s time attributes can be changed. In that case, we can only track that statistics were changed in the new date, but not in the old one (because we can’t know what the previous time was). So we have to use callbacks to track previous dates of previous timestamps.

Here is a Tracking module that could be required by any tracked model:

module Statistics
  module Tracking
    extend ActiveSupport::Concern

    included do
      cattr_accessor(:statistics_tracked_attributes) { Set.new }
      after_update :check_statistics_tracked_attributes_have_changed
    end

    class_methods do
      def track_attributes_for_statistics(*attributes)
        statistics_tracked_attributes.merge attributes.map(&:to_s)
      end
    end

    def check_statistics_tracked_attributes_have_changed
      (statistics_tracked_attributes & changed).each do |attr|
        before, after = changes[attr]
        next unless date_changed?(before, after)
        Statistics::UntrackedDate.mark before.to_date
      end
    end

    def date_changed?(before, after)
      before && (!after || before.to_date != after.to_date)
    end
  end
end
and it is included into Job
  include Statistics::Tracking
  track_attributes_for_statistics :created_at
and JobShift
  include Statistics::Tracking
  track_attributes_for_statistics :clocked_out_at

 

Now we have implemented full, easily expandable business logic to collect and output application statistics!

Output

Finally, we need all collected data to output. Since we use ActiveAdmin, I will show ARB code snippets and the screenshots it outputs.

First, we need filter form:

form_for search, url: admin_statistics_employer_path, method: 'post' do |f|
  f.text_field :employer_profile_id
  f.text_field :from , class: 'datepicker', 'data-datepicker-options' => '{"maxDate": "-1d"}'
  span '-'
  f.text_field :to, class: 'datepicker'
  f.submit 'Filter'
end
search here is a form object that takes params[:search] and returns scoped Statistics::Employer.where(date: from..to).
We can output total statistics by the period:
table do
  thead do
    tr do
      th :stat
      th :value, style: 'text-align: right'
    end
  end
  tbody do
    %w(total_jobs total_shifts_posted total_hours_posted
       average_length_of_shift_posted average_shifts_per_job
       total_shifts_worked total_hours_worked
       average_length_of_shift_worked).each do |stat_name|
      tr do
        td stat_name.titleize
        td number_with_delimiter(stats.public_send(stat_name).round(1)), style: 'text-align: right'
      end
    end
  end
end

We can output monthly breakdown of all these stats, using chartkick gem:

h3 'Shifts'
div line_chart(
  [
    { name: 'posted', data: stats.group_by_month(:date).sum(:shifts_posted_count) },
    { name: 'worked', data: stats.group_by_month(:date).sum(:shifts_worked_count) }
  ]
)

h3 'Hours'
div line_chart(
  [
    { name: 'posted', data: stats.group_by_month(:date).sum(:hours_posted_count) },
    { name: 'worked', data: stats.group_by_month(:date).sum(:hours_worked_count) }
  ]
)
Summary

We would like to emphasize the following:

  • The “spoofing” requirement is implemented using a constant Statistics::KEEP_LIVE_STATISTICS_FROM. (Did you noticed it in the code above?) The process of forming and loading made-up statistics prior to this date is beyond the scope of this article.
  • Prepopulating the statistics with the existing data is performed with a straightforward rake task – just take each date application worked and pass it to UpdateService.
  • In the real statistics, there are some more complex metrics, like breakdown of job roles. We used Postgresql hstore columns for storing it, but this topic is also beyond the scope of this article.