At CloudTrucks, we aim to provide best-in-class data science, operations, and technology directly to truck drivers. With our service, single-truck owner-operators and small fleets have the freedom to run their business as they choose while enjoying the back-office benefits of a large carrier - access to a large network of shippers, smooth and predictable cash flow, and safe, efficient routing - all with clear and transparent pricing.
There are 3.5M truck drivers in the U.S., and more than 10% of them own and operate their own trucks. 86% of trucking fleets have fewer than six trucks. Clearly, building a lean product that is broadly useful is not easy. This means that, behind the scenes, we're building systems that allow team members with backgrounds at least as diverse as our customers' to contribute effectively.
Our recent Business Intelligence initiative put this lesson to the test. The challenge was to expose the rich data insights we enjoy internally (via our data science team) directly to our drivers.
We wanted to build a system that is unobtrusive yet powerful enough to serve complex analytics, that can scale with our customer base, and that requires minimal engineering upkeep. This last point is especially important since we have only around 10 engineers and data scientists on our team!
I wanted to write this blog post not only to showcase the excellent work our team has done with relatively few resources, but also to share lessons learned for other companies looking to bootstrap data teams.
In the following sections, I’ll describe the tech stack we ended up implementing, some decisions that sped up production, and a couple of data principles underpinning the design.
Technologies
New data technologies are becoming available at a fantastic rate. For this project, the challenge wasn't in finding services that could perform to spec, it was in choosing which ones to adopt. In the end, we selected services that easily fit together, reducing development time spent on "the seams", and that didn't break the bank.
Our driver-facing business intelligence tech stack is a combination of three services that fit easily together.
One surprise, at least to me, was that this didn't mean using the fewest services possible. Splitting our metric registry from the actual metric logic ended up being a core design decision.
The metric registry catalogues all metrics available in our platform, as well as their granularities. It tells clients what they can query, which dimensions they can segment by, and the time frequencies available.
It's written in Django so that it's deeply embedded in the rest of our backend applications. The registry, plus a custom query language, all behind a common API layer, simplified integration to the point that our front-end team built the driver-facing UI practically autonomously. Django REST Framework merits a shout-out here for turning our serializers into elegant workhorses, and probably deserves a blog post of its own.
Also, since every registry entry is a Django model instance, we got a no-code interface for next to nothing through Django's admin site. The result: zero additional engineering effort is required to create new metrics or edit existing ones.
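To give a flavor, here's a minimal sketch of what a registry entry might look like; the model name and fields are illustrative, not our production schema:

```python
# A hypothetical registry entry. The model name and fields are
# illustrative, not our production schema.
from django.contrib import admin
from django.db import models


class Metric(models.Model):
    """One registry entry: what clients can query, not how it's computed."""

    name = models.CharField(max_length=100, unique=True)
    description = models.TextField(blank=True)
    # Dimensions a client may filter or segment by, e.g. ["fleet", "driver"].
    dimensions = models.JSONField(default=list)
    # Time frequencies the metric is available at, e.g. ["daily", "weekly"].
    frequencies = models.JSONField(default=list)

    def __str__(self):
        return self.name


# Registering with the admin site (normally done in admin.py) is all it
# takes to get a no-code UI for creating and editing metrics.
admin.site.register(Metric)
```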
Next, all metric definition logic - how metrics are actually computed - is housed in Dataform. For the uninitiated, Dataform is a toolkit for managing the "T" in ELT using SQL, JSON-configurable pipelines, and some JavaScript for customization. Why did we choose Dataform over any number of competitors?
A big draw was Dataform's web app, which is a browser IDE plus managed pipeline infrastructure all in one. In one afternoon, I had a pipeline defined and running periodically to our data warehouse, with an isolated dev environment, version control, and data validation. Since then, our small data team has developed on it with minimal friction. The cost is pretty good too (free as of this writing).
Finally, since we are a Google Cloud shop, the decision to use BigQuery for our metric store was straightforward. But here too was an unexpected benefit that sped up dev time: federated queries made extracting transactional data from our (Google Cloud-hosted) Postgres databases almost too easy. Our entire ELT is defined by a handful of SQL queries and a few config files in Dataform.
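For a sense of how little glue this takes, here's a hedged sketch using the google-cloud-bigquery client; the project, connection ID, and table names are placeholders, not our actual resources:

```python
# Hedged sketch: pull transactional rows from Cloud SQL Postgres into
# BigQuery with a federated query. The project, connection ID, and table
# names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT *
FROM EXTERNAL_QUERY(
  'my-project.us.my-postgres-connection',
  'SELECT driver_id, fleet_id, amount_cents, created_at FROM invoices'
)
"""

for row in client.query(sql).result():
    print(row["driver_id"], row["amount_cents"])
```

In Dataform, the same SELECT simply lives in a pipeline file alongside its config, which is why the whole ELT stays so compact.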
Data Principles
The flip side of finding the right technologies is learning how to use them. By aligning ourselves with a couple of data principles, we helped ensure our system will scale well in both complexity and throughput.
First, business metrics are hierarchical. At CloudTrucks, trucking fleets are composed of multiple drivers, who each perform many deliveries for many shippers, and each delivery has an invoice with multiple line items, etc.
This same concept extends to practically any business that manages customer data, and indeed, we were initially inspired by the F1 database design built at Google for AdWords, and its use of table schemas clustered by hierarchy to reduce latency.
From this hierarchy principle, we developed a validation rule for metrics that, when followed, allows us to return the appropriate dimensions for a query via a simple algorithm. The rule removes ambiguity in what aggregation to return for a metric query. Full details are in the Appendix.
A simple example: imagine we have a revenue metric that is aggregated at both the fleet and driver level - weekly revenue for Acme Trucking is the sum of revenue for the fleet's drivers, Joe and Sally.
Our query language should return Joe's revenue when queried with driver=Joe, Acme Trucking's revenue when queried with fleet=Acme Trucking, and Sally's revenue when queried with fleet=Acme Trucking, driver=Sally. This can be solved with a hierarchy of fleet → driver.
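To make those semantics concrete, here's a toy sketch; the names, numbers, and in-memory "tables" are invented for illustration, while the real system resolves queries against pre-aggregated tables in BigQuery:

```python
# Toy illustration of the query semantics for a fleet -> driver hierarchy.
fleet_level = {"Acme Trucking": 5000}   # {fleet} granularity
driver_level = {                        # {fleet, driver} granularity
    ("Acme Trucking", "Joe"): 3000,
    ("Acme Trucking", "Sally"): 2000,
}

def weekly_revenue(fleet=None, driver=None):
    if driver is not None:
        # Any query filtering on driver resolves at the {fleet, driver} level.
        return sum(v for (f, d), v in driver_level.items()
                   if d == driver and (fleet is None or f == fleet))
    # A fleet-only query resolves at the coarser {fleet} level.
    return fleet_level[fleet]

assert weekly_revenue(driver="Joe") == 3000
assert weekly_revenue(fleet="Acme Trucking") == 5000
assert weekly_revenue(fleet="Acme Trucking", driver="Sally") == 2000
```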
Things get more complex when Joe and Sally can drive each other's trucks, deliveries are categorized by region, we want to be able to query for all permutations of filters, and we want to distinguish queries for "revenue for all drivers at Acme Trucking" from "total revenue for Acme Trucking."
A valid hierarchy for querying fleets, drivers, trucks and regions according to our validation rule.
On one hand, our metric validation reduces the code complexity required to support these queries (finding the right aggregation level reduces to finding the root of a subtree); on the other, it allows for heavily pre-aggregated metric tables.
We capitalize on the latter in our metric table schema. We use a star schema, but with full dimensional indices for each level of the hierarchy. The result is a data format that is the same for all metrics and levels and needs no additional joins to dimension tables at query time. Metric queries are fast, simple, and powerful.
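As a hedged illustration of that uniform format (the column names here are invented for the example, not the production schema), rows at every granularity share one shape:

```python
# Every row carries a full dimensional index; coarser granularities simply
# leave the finer dimension columns empty, so no joins to dimension tables
# are needed at query time.
rows = [
    # {fleet} granularity: driver_id is NULL
    {"metric": "revenue", "freq": "weekly", "period": "2021-03-01",
     "fleet_id": "acme", "driver_id": None, "value": 5000.0},
    # {fleet, driver} granularity
    {"metric": "revenue", "freq": "weekly", "period": "2021-03-01",
     "fleet_id": "acme", "driver_id": "joe", "value": 3000.0},
    {"metric": "revenue", "freq": "weekly", "period": "2021-03-01",
     "fleet_id": "acme", "driver_id": "sally", "value": 2000.0},
]
```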
Finally, the natural clustering makes this whole setup sharding-efficient, giving us ample room to grow along with our drivers.
And Beyond
Building out our driver business intelligence service is one way we have leveraged our decades of in-house expertise across our customer base. Any data we can store and transform in our data warehouse - from performance stats to market forecasts and recommendations - is up for grabs for sharing with CloudTrucks' fleets.
We believe emerging technology and analytics are cornerstones of running a successful trucking business, and we are excited to build them as part of the core product at CloudTrucks.
Are you interested in enabling the trucking fleets of today and tomorrow? We're hiring!
A big thanks to Gabriel Akintayo and Sam Tibbs for their help on this project!
Appendix
A metric can be aggregated along different dimensions; for example, revenue can be aggregated at the driver level or at the fleet level (fleets are composed of multiple drivers). We call any such aggregation level a granularity. The following two rules validate that a metric and its dimensions form a hierarchy:
- A granularity can only be a descendant (a child, a child of a child, ...) of another granularity if it contains all of the dimensions of its ancestors, and any granularity that contains the dimensions of another must be a descendant of it.
- For any granularity, the subtrees formed by all other granularities with overlapping dimensions (for example, the {driver, fleet} granularity overlaps with the {driver, vehicle} granularity) must have their roots (granularities with no ancestors) contained in the original granularity.
The following simple algorithm finds the correct granularity to return for a query on a hierarchy (a Python sketch follows the list):
1. Find all granularities that contain the dimension filters of the query.
2. Return a root of the ancestry trees formed by the granularities from (1).
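Here is a minimal sketch of that lookup, modeling each granularity as a set of dimension names; the hierarchy and dimension names below are illustrative:

```python
# Minimal sketch of the lookup on a validated hierarchy. By the first
# validation rule, one granularity is an ancestor of another exactly when
# its dimensions are a proper subset of the other's.
def resolve_granularity(granularities, query_dims):
    # (1) Find all granularities that contain the query's dimension filters.
    candidates = [g for g in granularities if query_dims <= g]
    # (2) Return a root: a candidate with no ancestor among the candidates.
    for g in candidates:
        if not any(other < g for other in candidates):
            return g
    return None  # no granularity can serve this query


hierarchy = [
    frozenset({"fleet"}),
    frozenset({"fleet", "driver"}),
    frozenset({"fleet", "driver", "vehicle"}),
]

# A fleet-only query resolves at the coarsest matching level...
assert resolve_granularity(hierarchy, frozenset({"fleet"})) == frozenset({"fleet"})
# ...while a driver filter resolves at the {fleet, driver} level.
assert resolve_granularity(hierarchy, frozenset({"driver"})) == frozenset({"fleet", "driver"})
```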
The validation rules guarantee that the algorithm terminates and, in all but a few cases, has a unique result.