March 13, 2015


How to collect aggregated statistics in a Ruby on Rails application

Consulting & Advisory Digital Transformation Ruby on Rails Development Software Development

It is important to collect aggregated statistics so that management can analyze the data and make well-informed decisions.  Sphere was retained by a client in the recruiting industry who, among other things, needed to collect the following data:

 

  • Total shifts posted
  • Total hours posted

  • Total shifts worked
  • Total hours worked

  • Average length of shifts
  • Average shifts per job

 

In addition, Sphere had to make it possible to “spoof” the statistics up to a certain date while the production database was being tested. Up to that date, the statistics had to be based not on the actual values from the database but on manually entered data.

Since our client was using a Ruby on Rails application, we decided to write a statistics module in Ruby as well in order to leverage existing code and to simplify maintenance.  We considered three implementation options:

 

Option (1): Collect statistics on the fly. When a figure such as total shifts posted is needed, a query is run against the corresponding table with the required constraints.

Advantages:
  • Statistics are always up to date.

Disadvantages:
  • Model code is polluted, because scopes and calculations for the statistics must be added.
  • Implementing the “spoofing” requirement is difficult.
  • Calculating the data requires SQL clauses such as GROUP BY and JOIN, which could lead to performance issues.

Option (2): Keep the statistics in separate tables which are updated on the fly. The data, aggregated by day and employer, is calculated and recorded in a separate table. If the data underlying the statistics changes, the corresponding row in the statistics table is recalculated.

Advantages:
  • Statistics are always up to date.
  • Implementing the “spoofing” requirement is easy.

Disadvantages:
  • Model code is polluted, because callbacks must be added to trigger the recalculation of the statistics.
  • When one record changes, all records for that day and employer must be queried again. In addition, in this application the records can change quite often during the day.
  • The finest granularity of the statistics is one day.

Option (3): Keep the statistics in separate tables which are updated periodically. The data, aggregated by day and employer, is calculated and recorded in a separate table. Application data changes throughout the day; a dedicated background task then collects the changes at the end of the day and updates the statistics table.

Advantages:
  • Clean model code.
  • All the logic is encapsulated in the statistics collection module.
  • Implementing the “spoofing” requirement is easy.

Disadvantages:
  • Statistics can be out of date, as they are not updated during the day.
  • The finest granularity of the statistics is one day.

 

After presenting these three options to our client, we agreed to proceed with the third option.

 

Calculating & Storing Statistics

The Statistics::Employer model is used for calculating and storing statistical data. In its table, we store the date, employer’s foreign key, and all other values needed to calculate the statistics (total hours posted, total hours worked, number of applications, and average number of applications).

 

 

All formulas are contained in the model code:

 

The methods use ActiveRecord::Calculations, so they can be called on any scope, which is useful for filtering by date or employer.
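As a minimal sketch of what such formulas can look like, here is a plain-Ruby version that works on any subset of rows, much as the model methods work on any scope. It uses arrays instead of ActiveRecord, and the DailyStat struct, its column names, and the StatFormulas module are hypothetical; it also assumes the average shift length is computed from posted shifts.

```ruby
require 'date'

# Plain-Ruby stand-in for one row of the statistics table.
# The struct and all column names are hypothetical.
DailyStat = Struct.new(:date, :employer_id, :jobs_posted,
                       :shifts_posted, :hours_posted, keyword_init: true)

module StatFormulas
  module_function

  # Sum one column over any subset of rows (the analogue of a scope).
  def total(rows, column)
    rows.sum { |row| row[column] }
  end

  # Average length of a posted shift in hours; 0.0 when nothing was posted.
  def average_shift_length(rows)
    shifts = total(rows, :shifts_posted)
    shifts.zero? ? 0.0 : total(rows, :hours_posted).to_f / shifts
  end

  # Average number of posted shifts per job; 0.0 when no jobs were posted.
  def average_shifts_per_job(rows)
    jobs = total(rows, :jobs_posted)
    jobs.zero? ? 0.0 : total(rows, :shifts_posted).to_f / jobs
  end
end

rows = [
  DailyStat.new(date: Date.new(2015, 3, 1), employer_id: 1,
                jobs_posted: 2, shifts_posted: 4, hours_posted: 30),
  DailyStat.new(date: Date.new(2015, 3, 1), employer_id: 2,
                jobs_posted: 1, shifts_posted: 2, hours_posted: 18)
]

# Filtering first and calculating afterwards mirrors calling a formula on a scope.
employer_one = rows.select { |row| row.employer_id == 1 }
StatFormulas.average_shift_length(rows)          # over all employers
StatFormulas.average_shift_length(employer_one)  # over one employer only
```

Because each formula only sums columns and divides, the same method serves a single day, a month, or one employer without any extra code.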

 

Collecting Statistics

The collection of statistics can be divided into three sub-tasks:

  • When to start the daily statistics collection.
  • Which dates to collect statistics for.
  • How to collect the statistics.

We have already answered the first question by choosing the third implementation option: statistics are collected once a day. After analyzing how the application is used, we found that the majority of shifts end before 2 a.m., so the statistics are collected on a schedule at 3 a.m.

Cron can be used to perform this task, but we decided to use the clockwork gem:

 

The second subtask is handled by Statistics::UntrackedDatesService, which detects which dates are untracked and creates an UntrackedDate for each of them. It always counts yesterday as untracked, as well as the dates of models whose updated_at is later than midnight of the previous day.

UntrackedDate is a very simple ActiveRecord model that contains only a date attribute with a unique index.

As we collect statistics for jobs and shifts, we need to track Job and Shift model updates. Also, as we count jobs and posted shifts on each job’s creation date, and worked shifts at the end time of each shift, we treat the dates of Job#created_at and JobShift#end_time as untracked if those jobs or shifts have changed since the last statistics update.

So the full code of UntrackedDatesService is:
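The detection rule can be sketched in plain Ruby as follows. This is an illustration under stated assumptions, not the service itself: JobRecord and ShiftRecord are hypothetical structs standing in for the models, a Set stands in for the unique index on UntrackedDate, and "midnight of the previous day" is read as the start of yesterday.

```ruby
require 'date'
require 'set'

# Hypothetical stand-ins for the Job and JobShift models.
JobRecord   = Struct.new(:created_at, :updated_at, keyword_init: true)
ShiftRecord = Struct.new(:end_time, :updated_at, keyword_init: true)

# Returns the sorted dates whose statistics must be recalculated.
def untracked_dates(jobs, shifts, today: Date.today)
  yesterday = today - 1
  # Midnight at the start of the previous day (an assumed reading of the rule).
  cutoff = Time.new(yesterday.year, yesterday.month, yesterday.day)

  dates = Set[yesterday] # yesterday always counts as untracked
  # Jobs are counted on their creation date, shifts on their end date.
  jobs.each   { |job|   dates << job.created_at.to_date if job.updated_at > cutoff }
  shifts.each { |shift| dates << shift.end_time.to_date if shift.updated_at > cutoff }
  dates.sort
end

jobs = [
  # Created on 1 March but edited yesterday: 1 March must be recalculated.
  JobRecord.new(created_at: Time.new(2015, 3, 1, 10), updated_at: Time.new(2015, 3, 12, 15)),
  # Not touched since 5 March: nothing to do.
  JobRecord.new(created_at: Time.new(2015, 3, 2, 9), updated_at: Time.new(2015, 3, 5, 9))
]
shifts = [
  ShiftRecord.new(end_time: Time.new(2015, 3, 10, 23), updated_at: Time.new(2015, 3, 13, 1))
]

untracked_dates(jobs, shifts, today: Date.new(2015, 3, 13))
```

Collecting the dates in a set means a date touched by several records is recorded only once, which is the same guarantee the unique index provides.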

 

The last subtask is performed by Statistics::UpdateUntrackedService. It takes each untracked date, deletes all statistics for that day, and calculates new statistics. (The calculation is encapsulated in yet another service, UpdateService.) We delete all previous statistics for the day to keep the process simple: UpdateService does not need to know why the date was marked as untracked. It just does what it is supposed to do.

In UpdateService, we create groupings by employer and calculate aggregated stats. Then we bulk insert all stats into the Statistics::Employer model:
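A rough plain-Ruby sketch of this delete-and-rebuild flow is shown below. An array of hashes stands in for the statistics table, ShiftRow is a hypothetical stand-in for the source data, and clearing the untracked list at the end is an assumption about how processed dates are retired.

```ruby
require 'date'

# Hypothetical stand-in for the shifts that feed the statistics.
ShiftRow = Struct.new(:employer_id, :date, :hours, keyword_init: true)

# Recalculate one day: group that day's shifts by employer and aggregate.
def build_stats_for(date, shifts)
  shifts.select { |shift| shift.date == date }
        .group_by(&:employer_id)
        .map do |employer_id, group|
          { date: date, employer_id: employer_id,
            shifts_posted: group.size, hours_posted: group.sum(&:hours) }
        end
end

# For each untracked date, drop the old rows and insert fresh ones in bulk.
def update_untracked!(stats_table, untracked, shifts)
  untracked.each do |date|
    stats_table.reject! { |row| row[:date] == date }
    stats_table.concat(build_stats_for(date, shifts))
  end
  untracked.clear # assumption: processed dates are removed from the list
  stats_table
end

day = Date.new(2015, 3, 12)
shifts = [
  ShiftRow.new(employer_id: 1, date: day, hours: 8),
  ShiftRow.new(employer_id: 1, date: day, hours: 6),
  ShiftRow.new(employer_id: 2, date: day, hours: 4)
]
# One stale row for the day that is about to be rebuilt.
stats = [{ date: day, employer_id: 1, shifts_posted: 99, hours_posted: 99 }]

update_untracked!(stats, [day], shifts)
```

Rebuilding the whole day avoids having to work out which individual figures a change affected, at the cost of recomputing rows that did not change.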

 

This is all we need to collect and calculate statistics, but we have one more step to cover.

 

Callbacks

Sometimes a model’s time attributes can change. In that case, we can only detect that the statistics changed for the new date, but not for the old one (because we no longer know what the previous time was). So we have to use callbacks to record the dates of the previous timestamps as untracked.

Here is a Tracking module that can be included in any tracked model:
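The idea can be illustrated without Rails. In the sketch below, a writer method plays the role of the callback: before a time attribute is overwritten, the date of its previous value is remembered. The tracked_time macro, the TrackedShift class, and the UNTRACKED_DATES set are all hypothetical; in a Rails model the same step would normally live in a callback that reads the attribute's previous value.

```ruby
require 'date'
require 'set'

# Stand-in for the UntrackedDate table (hypothetical).
UNTRACKED_DATES = Set.new

module Tracking
  # Declares a time attribute whose previous date is remembered on change.
  def tracked_time(name)
    attr_reader name

    define_method("#{name}=") do |new_time|
      old_date = instance_variable_get("@#{name}")&.to_date
      new_date = new_time&.to_date
      # Only the old date needs recording; the new one is picked up
      # later through the record's updated_at.
      UNTRACKED_DATES << old_date if old_date && old_date != new_date
      instance_variable_set("@#{name}", new_time)
    end
  end
end

class TrackedShift
  extend Tracking
  tracked_time :end_time
end

shift = TrackedShift.new
shift.end_time = Time.new(2015, 3, 10, 22) # first assignment: nothing to remember
shift.end_time = Time.new(2015, 3, 11, 1)  # moved past midnight: 10 March is recorded
```

Changes that stay within the same day are ignored, since that day will be recalculated anyway once the record's updated_at moves forward.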

 

Now we have implemented full, easily expandable business logic to collect and output application statistics!

 

Output

Finally, we need to output all the collected data. Since we use ActiveAdmin, I will show ARB code snippets and the screenshots of what they produce.

First, we need a filter form:

 

Here is the form in table view:

[Image: filter form in table view]

We can output a monthly breakdown of all these stats using the chartkick gem:
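The data preparation for such a chart can be sketched in plain Ruby. The monthly_breakdown helper and the row layout are hypothetical; the result is a hash of label to value, which is one of the data shapes chartkick's chart helpers accept.

```ruby
require 'date'

# Collapse daily rows into per-month totals keyed by "YYYY-MM".
def monthly_breakdown(daily_rows, column)
  daily_rows
    .group_by { |row| row[:date].strftime('%Y-%m') }
    .transform_values { |rows| rows.sum { |row| row[column] } }
    .sort
    .to_h
end

daily = [
  { date: Date.new(2015, 2, 27), hours_worked: 7 },
  { date: Date.new(2015, 1, 15), hours_worked: 8 },
  { date: Date.new(2015, 1, 20), hours_worked: 4 }
]

monthly_breakdown(daily, :hours_worked)
```

Sorting by the "YYYY-MM" key keeps the months in chronological order regardless of the order in which the rows were read.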

 

Here is the chart view:

[Image: monthly breakdown chart]

Summary

We would like to emphasize the following:

  • The “spoofing” requirement is implemented using a constant, Statistics::KEEP_LIVE_STATISTICS_FROM. (Did you notice it in the code above?) The process of forming and loading made-up statistics prior to this date is beyond the scope of this article.
  • Prepopulating the statistics with the existing data is performed with a straightforward rake task: just take each date on which the application was in use and pass it to UpdateService.
  • In the real statistics, there are some more complex metrics, like a breakdown by job role. We used PostgreSQL hstore columns to store them, but this topic is also beyond the scope of this article.
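
The first two points can be combined in a short plain-Ruby sketch. It assumes that the constant marks the first date for which real statistics are kept, so the backfill starts there and leaves earlier, manually loaded figures alone. The constant's value, the live_dates helper, and the recalculated array are hypothetical.

```ruby
require 'date'

# Hypothetical value; in the application the constant lives in the Statistics module.
KEEP_LIVE_STATISTICS_FROM = Date.new(2015, 1, 1)

# Dates to recalculate from real data: from the cutoff up to and including
# yesterday. Earlier dates keep whatever figures were loaded by hand.
def live_dates(today, from: KEEP_LIVE_STATISTICS_FROM)
  (from..(today - 1)).to_a
end

# Backfill: hand every live date to the update step (a stand-in here).
recalculated = []
live_dates(Date.new(2015, 1, 4)).each { |date| recalculated << date }
```

With this reading, moving the cutoff forward or backward is the only change needed to switch a period between made-up and real figures.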