Signifyd started with a monolith — most tech startups do. Our monolith served us well until the cracks started to show — again a very common experience. In our case the most worrisome crack started because of our machine learning models.
The problem wasn’t with the models themselves but with their memory footprint. We use random forests, and as our sophistication grew, so did the size of our models. As the models got bigger, our deployment times started to increase, which is a big problem for a continuous-delivery-oriented development team. Likewise, the growing models meant that each instance of the monolith was eating up more and more memory on our ECS hosts, causing deploy times to balloon further and making local development harder and harder.
There are a lot of factors that could go into our models getting bigger, but the bottom line was that having them in our main monolith was slowing us down and eating up resources. So we started talking about how to split the models into a service of their own.
And while our discussions were strategic, they were driven by Signifyd’s culture. Historically, the engineering team talked informally about our Do-It-Right culture, but in fact our work on slaying the monolith mapped to a more formal value embraced by the entire company: Design for Scale. It is one of Signifyd’s six core values and it comes with the elaboration, “Build solutions to scale and last. We are here to stay.”
Inspired by that thinking, we didn’t just pick a popular REST framework and start churning out services. Instead, we took a look at the landscape of microservices and asked, “What do we at Signifyd need out of a microservice framework?”
Design for scale
One of the other senior engineers and I explored several options for microservice frameworks. We narrowed our concerns down to a few essentials:
- Service contracts – how does a solution allow us to define and enforce a service contract?
- Limit scope to internal services – we already use REST externally, but that didn’t necessarily need to define our internal services
- Operating environment – would any given solution perform well in our AWS environment?
- Match for our coding style – Signifyd likes Java, asynchronous operations, and type safety
Each of our proposed solutions initially had a question mark or two against these criteria, but the list gave us a place to start the discussion.
I suggested a REST-like JSON/HTTP solution using Vert.x 3. I liked it because it was familiar and ubiquitous and Vert.x was known to be asynchronous. It’s hard to argue against the reigning champ.
The other engineer brought GRPC to the table. I’d never heard of it, but listening to his pitch, I became more and more intrigued. I liked using Protocol Buffers as an API description mechanism. The strongly-typed generated objects also seemed very appealing given how Signifyd works.
Several other frameworks were proposed in the meeting. Our vice president of engineering suggested a framework that none of us had heard of and we discussed it for a bit. Lagom from Lightbend seemed interesting — especially given our Play heritage — but it had only just recently hit 1.0. We homed in on trying two approaches: a standard approach — REST-like — and a newer approach — GRPC. We could have explored dozens of frameworks and approaches, but those two options seemed to resonate best with the group.
We weren’t able to come up with a decision in the initial discussion, but we agreed to hold a “bake-off.” I would code a sample service in GRPC and the other engineer would create the same sample service for REST on Vert.x.
Nearly two years after the fact, the actual sample services we created are lost to time, but the rigorous process remains fresh in my mind. When I explain it to my more technically inclined friends, the very concept of the bake-off intrigues them — and the outcome intrigues them even more.
The rules were simple:
- Implement a service in Java on the assigned technology
- Create a sample client object
- Produce API documentation
I produced my service in GRPC very quickly. Proto3’s syntax was immediately familiar to me — as it should be for anyone familiar with a C- or Java-derived language. I hit a minor stumbling block when I needed to wire GRPC into our build tools, but I quickly found the plugins that I was missing. From there, the development experience was surprisingly fluid. Having the generated code that took care of a lot of the rote boilerplate of setting up a service allowed me to focus on the really important task — implementing the service’s logic. For the way Signifyd works, GRPC felt immediately natural.
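To give a flavor of what made the experience so fluid, here is roughly what a Proto3 service definition looks like. The service and message names below are purely illustrative, not our actual API:

```proto
syntax = "proto3";

package example.scoring;

// Hypothetical scoring service; the generated server stub and client
// take care of transport, serialization, and method dispatch.
service ScoringService {
  // A single unary RPC: send case features, get a score back.
  rpc ScoreCase (ScoreRequest) returns (ScoreResponse);
}

message ScoreRequest {
  string case_id = 1;
}

message ScoreResponse {
  double score = 1;
}
```

From one file like this, `protoc` generates the strongly-typed request and response classes plus the service base class, so the remaining work is implementing the RPC method bodies.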
The Vert.x service came together just as quickly, but the ad-hoc nature of REST services really started to show through. Hand coding the message POJOs was as time-consuming as ever. The API documentation had to be coded separately and put into Apiary, which left us concerned that the API documentation and the actual API could drift if one wasn’t careful. No one seemed to think the experience was necessarily bad, just that it could have been better.
Once we both had completed our services, we got to judging. The full judging document ran several dozen pages and dug into intricate minutiae, but the summary I presented later was brief.
Additionally, during our bake-off judging, we started to question our choice of Vert.x, should we choose REST as our paradigm. Vert.x’s single-threaded, event-loop async model seemed incompatible with our asynchronous codebase, which is built on multiple thread pools.
In our final analysis, GRPC simply offered more features, a better developer experience for the way we code, and a purpose-built services toolkit. On the other hand, the REST-like services only scored highly on being familiar and comfortable. While familiarity and comfort are good aspects for any solution, Signifyd’s culture prioritizes technical merit and code correctness.
A few weeks after we’d started talking about selecting a microservice framework, I signed into our Architecture Council meeting and presented our findings. I also strongly recommended choosing GRPC as our microservices base. Because I was so close to the creation and judging, I removed myself from my normal council duties and instead helped facilitate the discussion.
Once we had decided and committed, it was time to get to the real work. It was also time to learn the things we hadn’t known we needed to learn.
Issue: true, false, or unset
One of the first issues we discovered was Proto3’s removal of the required and optional field modifiers. Further, Proto3 defined default values for all primitives; only message types could be made optional, by leaving them unset. As a Java programmer, I’m used to a Boolean object having three states: true, false, and null. A Proto3 bool gives you only the first two.
Signifyd makes heavy use of Optional in Java to keep us from tripping over null, so both the lack of optionality and the default value were troubling to us. We initially tried the trick of wrapping all primitive values in small messages in our Proto3 definitions, but that felt burdensome and error-prone. Eventually one of our engineers found Google’s protobuf wrappers, which we adopted to provide a reusable toolkit to express optional values.
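To illustrate the wrapper approach, here is a sketch of a Proto3 message using Google’s well-known wrapper types. The message and field names are hypothetical:

```proto
syntax = "proto3";

import "google/protobuf/wrappers.proto";

message CaseFlags {
  // A plain bool cannot distinguish "false" from "never set":
  // an unset field always reads back as the default, false.
  bool is_test_case = 1;

  // Wrapping the value in BoolValue makes the field a message type,
  // so it can be left unset -- three states: true, false, absent.
  google.protobuf.BoolValue manual_review = 2;
}
```

On the Java side, the generated code exposes a has-method (here, `hasManualReview()`) for the wrapped field, which maps naturally onto the `Optional<Boolean>` idiom we already use.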
Issue: linting
Signifyd uses Phabricator for our code reviews. Before any review can be submitted, we run lint against the changes. We like keeping our codebase nice.
If GRPC were to be a first-class citizen of our codebase, it would need at least a base amount of linting to ensure code quality. When we were adopting GRPC, there was no widely available linting solution for GRPC. Eventually, through trial and error, we were able to make a shell script that would use a provided `protoc` binary to confirm that the local protobuf changes parsed correctly. It wasn’t a complete solution, but it worked.
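The script we ended up with was in this spirit: hand it a pinned `protoc` binary and a list of changed `.proto` files, and have `protoc` parse each one, discarding the output. This is a simplified sketch, not our exact script:

```shell
#!/bin/sh
# Minimal protobuf "lint": confirm that each .proto file parses
# with the provided protoc binary. Paths and layout are assumptions.
lint_protos() {
  protoc_bin="$1"; shift
  if [ ! -x "$protoc_bin" ]; then
    echo "usage: lint_protos <protoc-binary> <file.proto>..." >&2
    return 1
  fi
  for f in "$@"; do
    # -o /dev/null: build a descriptor set (forcing a full parse),
    # then throw it away. A parse error fails the lint run.
    "$protoc_bin" --proto_path="$(dirname "$f")" -o /dev/null "$f" || return 1
  done
  echo "protobuf lint passed"
}
```

A parse check like this catches syntax errors before review, though it says nothing about style or backward compatibility — which is why we considered it incomplete.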
Issue: hearts and minds
Even in a disagree-and-commit culture, we met some hesitation. When unexpected wrinkles and issues arose, some wondered if we ought to abandon GRPC and just use REST.
Over time, we listened to concerns, addressed issues, and kept communicating and discussing the reasoning behind the GRPC decision. Today, no one questions GRPC — sometimes we even have product managers using GRPC with the help of tools like omGRPC. The process of getting here also honed our understanding not only of how to choose new technologies, but of how to better communicate those choices to the team.
Issue: load balancing
If you look back at our criteria for selecting a microservice base, you will see that running well in AWS is on the list. Even with all of our due diligence, we missed that neither of AWS’s load balancer options truly supports HTTP/2. Their ELBs simply do not support HTTP/2. Their ALBs nominally support HTTP/2, but that support is limited. For instance, HTTP/2 can be terminated on the load balancer, but then it has to be forwarded as HTTP/1.1.
When we deployed our first service to a staging environment, nothing worked. It took us several days to isolate the problem to the load balancers, in part because we didn’t consider that AWS’s load balancers could be at fault.
Eventually, we were forced to drop the load balancing from Layer 7 to Layer 4 (TCP). Given the long-lived nature of HTTP/2 connections, this created an entirely new problem — highly unbalanced loads occurred when one container took longer to spawn than the others.
We’ve since mitigated the problem by shortening the lives of our connections. Still, the episode was a somewhat embarrassing reminder that due diligence can’t catch everything. On balance, it has also given us impetus to find better solutions for our container orchestration and load balancing needs, and we will apply the same rigor to selecting them. That’s just our way.
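For readers facing the same unbalanced-load problem behind an L4 balancer, grpc-java’s Netty transport can cap connection lifetimes on the server side. This is a sketch using the real `maxConnectionAge` settings, but the port, durations, and class names are illustrative, not our production configuration:

```java
import java.util.concurrent.TimeUnit;
import io.grpc.BindableService;
import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;

// Sketch: cap connection lifetime so a TCP (Layer 4) balancer gets a
// chance to redistribute clients instead of pinning them to whichever
// container happened to be up first.
public final class ShortLivedServer {
    public static Server build(BindableService service) {
        return NettyServerBuilder.forPort(8443)
                .addService(service)
                // Force clients to reconnect periodically...
                .maxConnectionAge(5, TimeUnit.MINUTES)
                // ...but give in-flight RPCs time to finish first.
                .maxConnectionAgeGrace(30, TimeUnit.SECONDS)
                .build();
    }
}
```

When a connection hits its maximum age, the server sends a GOAWAY and the client reconnects, at which point the balancer can route it to a less-loaded container.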
GRPC and Signifyd: one year later
It’s been a solid year since we fielded that first model hosting service — warts and all. All our engineers now accept GRPC as part of our ecosystem. We’ve become so used to Protobuf that we sometimes joke about using it as our default data-encoding scheme.
GRPC also helped us understand how much we were at the mercy of AWS’s infrastructure offerings. So we’ve started branching out our infrastructure practice — bringing on newer, more robust tools.
We have several microservices running on GRPC platforms now, and it has settled in as a mature, understood technology in our toolkit. There are more services we can and should build, but we’re waiting until we get our infrastructure practice up to speed.
What we know right now is that, in the end, GRPC turned out to be the right choice for us.