How one major cybersecurity software company tripped on its own SPF record, and how to avoid ever making the same mistake.
We have written about the challenges of SPF quite a bit in the past. SPF is relied upon as one of the authenticity signals for DMARC compliance and is commonly used as a soft signal for spam filters where a DMARC policy does not exist.
SPF is extremely important for sending organizations to get right, but it is equally easy to get wrong because of the strict specification of the protocol and the increasing complexity of cloud sending infrastructure.
Today, most SPF records are not simple lists of authorized sending IPs, since most sending is now done by multi-tenant third parties (aka ‘the cloud’). To manage this complexity, the SPF protocol provides ‘include’ mechanisms, which allow the owner of a domain to delegate part of the authorization to someone else. An include is essentially an instruction to look up a list of authorized IP addresses maintained by someone else (the cloud provider) so that it stays up to date. For example, a domain owner could add the Google Workspace include to their SPF record, thereby delegating the allowlisting of Google sending IPs to Google itself. Google frequently refreshes these IPs, so it would be cumbersome for every Google sender to update them manually each time.
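As a rough illustration of what such a delegation looks like, the sketch below pulls the include targets out of an SPF record string. The record shown is the minimal form commonly used for Google Workspace; the helper name `include_targets` is our own invention for this example and not part of any library.

```python
def include_targets(spf_record):
    """Return the domains named by include: mechanisms in one SPF record."""
    targets = []
    # The first token is the version tag ("v=spf1"), so skip it.
    for term in spf_record.split()[1:]:
        # Drop an optional qualifier (+, -, ~, ?) before inspecting the term.
        mechanism = term.lstrip("+-~?")
        if mechanism.lower().startswith("include:"):
            targets.append(mechanism[len("include:"):])
    return targets


record = "v=spf1 include:_spf.google.com ~all"
print(include_targets(record))
```

The domain owner only ever publishes the single include; the list of IP addresses behind it lives in the provider's own DNS.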
SPF limitations
These ‘include’ mechanisms, often referred to as ‘SPF lookups’, are an important part of the SPF picture, but they come with an important protocol limitation: evaluating a domain’s SPF record may trigger no more than 10 DNS lookups in total. The limit covers every include, a, mx, ptr and exists mechanism plus the redirect modifier, including any that are nested inside an included record. This is to guard against a form of Denial of Service attack, but we won’t go into too much detail about this here.
Exceeding the 10 lookup limit is one of the most common forms of SPF error that we see in the wild. Receivers work through the record from left to right and return a permanent error as soon as an 11th lookup is needed, so the net effect for senders is near-certain authentication failure for any sending service listed after the 10th lookup. This can be catastrophic for businesses that rely on email to reach customers and other counterparties.
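To make the limit more concrete, here is a minimal sketch that lists which terms in a single record count towards it. It assumes the rule from RFC 7208 that the include, a, mx, ptr and exists mechanisms and the redirect modifier each cost one lookup, while all, ip4, ip6 and exp do not. The function name and the sample record are illustrative only, and the sketch does not follow includes into other records.

```python
# Mechanisms that cost one DNS lookup each during SPF evaluation.
LOOKUP_MECHANISMS = {"include", "a", "mx", "ptr", "exists"}


def lookup_terms(spf_record):
    """Return the terms of one SPF record that count towards the 10 lookup limit."""
    counted = []
    for term in spf_record.split()[1:]:  # skip the "v=spf1" version tag
        bare = term.lstrip("+-~?").lower()  # drop the optional qualifier
        if bare.startswith("redirect="):
            counted.append(term)
            continue
        # "a:host/24" -> "a", "mx/24" -> "mx", "ip4:192.0.2.1" -> "ip4"
        name = bare.partition(":")[0].split("/")[0]
        if name in LOOKUP_MECHANISMS:
            counted.append(term)
    return counted


record = "v=spf1 a mx ip4:192.0.2.10 include:_spf.google.com -all"
terms = lookup_terms(record)
print(len(terms), terms)
```

For this sample record the sketch reports three counted terms (a, mx and the include), before anything nested inside the include is taken into account.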
What could go wrong?
Just stay beneath the 10 lookup limit. Easy, right? Well, not for any modern business leveraging the cloud. Many will be using close to or over 10 services, and each service can consume one or more lookups. It quickly adds up (and that’s why our OnDMARC solution comes with our advanced Dynamic SPF feature that clears up this issue; you can find out more about that here). But this post is not about plugging our product. Instead, it aims to demonstrate just how easy it is to trip over the limit, even in a sophisticated organization.
I mentioned how a service can consume more than one lookup, since lookups can be nested in another lookup. For example, the Google Workspace lookup mentioned earlier is structured in this way:
As you can see from the diagram above, 1 lookup contains 3 others, bringing it to a total of 4.
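The nesting can be sketched as a small recursive count. The dictionary below is a hypothetical in-memory stand-in for DNS, modelled on the structure just described (one include that contains three others). The `_netblocks` names follow what Google has commonly published, but treat them as an assumption, and the IP ranges are documentation placeholders rather than Google's real ranges. `count_lookups` and `example.com` are names made up for this example.

```python
LOOKUP_MECHANISMS = {"include", "a", "mx", "ptr", "exists"}

# Hypothetical snapshot of TXT records; a real tool would query DNS instead.
RECORDS = {
    "example.com": "v=spf1 include:_spf.google.com ~all",
    "_spf.google.com": (
        "v=spf1 include:_netblocks.google.com "
        "include:_netblocks2.google.com include:_netblocks3.google.com ~all"
    ),
    "_netblocks.google.com": "v=spf1 ip4:192.0.2.0/24 ~all",
    "_netblocks2.google.com": "v=spf1 ip6:2001:db8::/32 ~all",
    "_netblocks3.google.com": "v=spf1 ip4:198.51.100.0/24 ~all",
}


def count_lookups(domain, records):
    """Count the DNS lookups needed to walk a domain's whole SPF tree."""
    total = 0
    for term in records[domain].split()[1:]:  # skip "v=spf1"
        bare = term.lstrip("+-~?").lower()
        if bare.startswith("redirect="):
            # redirect costs one lookup plus whatever the target needs
            total += 1 + count_lookups(bare.split("=", 1)[1], records)
            continue
        name, _, arg = bare.partition(":")
        name = name.split("/")[0]
        if name not in LOOKUP_MECHANISMS:
            continue
        total += 1
        if name == "include":
            total += count_lookups(arg, records)  # nested includes count too
    return total


print(count_lookups("example.com", RECORDS))
```

Under these assumptions the single include in the sender's record costs four lookups. Note that this is a worst-case count over the whole tree; a real evaluation stops at the first mechanism that matches the sending IP.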
The potential issue quickly becomes apparent: What if this sender is at exactly 10 lookups and one day one of their sending services changes the composition of their infrastructure to contain an additional lookup, thereby breaking their SPF record? It happens more often than you might think.
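A toy simulation of that scenario, under stated assumptions: a hypothetical sender (`sender.example`) uses eight single-lookup services plus one provider whose record includes one child record, for a total of exactly 10. The provider then splits its child record in two. All domain names and IP addresses here are placeholders invented for the sketch, and the counter only handles include mechanisms to keep it short.

```python
LIMIT = 10


def count_lookups(domain, records):
    """Count include lookups across a domain's SPF tree (includes only)."""
    total = 0
    for term in records[domain].split()[1:]:  # skip "v=spf1"
        bare = term.lstrip("+-~?").lower()
        if bare.startswith("include:"):
            total += 1 + count_lookups(bare[len("include:"):], records)
    return total


def build_records(provider_children):
    """Build a hypothetical DNS snapshot for the sender and its services."""
    records = {}
    # Eight placeholder services, each a flat record costing one lookup.
    flat = [f"svc{i}.example" for i in range(1, 9)]
    for i, name in enumerate(flat, start=1):
        records[name] = f"v=spf1 ip4:192.0.2.{i} -all"
    # The provider delegates to one or more child records.
    for i, child in enumerate(provider_children, start=1):
        records[child] = f"v=spf1 ip4:198.51.100.{i} -all"
    records["provider.example"] = (
        "v=spf1 " + " ".join(f"include:{c}" for c in provider_children) + " -all"
    )
    records["sender.example"] = (
        "v=spf1 "
        + " ".join(f"include:{n}" for n in flat + ["provider.example"])
        + " -all"
    )
    return records


before = build_records(["mail.provider.example"])
after = build_records(["sp1.provider.example", "sp2.provider.example"])

for label, recs in (("before", before), ("after", after)):
    n = count_lookups("sender.example", recs)
    print(label, n, "ok" if n <= LIMIT else "over the limit")
```

The sender's own record is identical in both snapshots. Only the provider's record changed, yet the total moves from 10 to 11.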
Everyone makes mistakes
We use our technology to scan for and help fix this kind of error across the web, but we were surprised to see the issue pop up on the primary domain of a multi-billion dollar cybersecurity giant that offers a product competing directly with Red Sift’s OnDMARC.
What happened to this domain is a prime example of the scenario outlined above:
This company has 6 sending services in their SPF record, including familiar sending tools for anyone in the business of business-to-business SaaS: Marketo, SendGrid, NetSuite, Salesforce, etc. Together, these services came to a grand total of 10 SPF lookups – at least until the morning of April 21st, 2021.
Suddenly, the SPF record broke, calling for 11 lookups. You can see the lookups required by each sending service in light blue next to each service on the tree below:
So what happened here? And how could it happen to this unnamed cybersecurity giant, which presents itself as an email security expert?
Looking back at our logs, we observed that at some point between 3:30am and 3:50am EDT, the NetSuite lookup grew by one additional ‘include’.
Specifically, under the root NetSuite lookup, mailsenders.netsuite.com was split into sp1.mailsenders.netsuite.com and sp2.mailsenders.netsuite.com. In isolation, this is a reasonable action for a service provider to take, but the unintended knock-on effects can be major for its customers and supply chain, and particularly embarrassing for a company in this line of business.
How to fix it
You can check your own SPF record using our Investigate tool here. SPF is an important part of the fabric of email security, but it is hard and cumbersome for humans to maintain. That is why we have built software to do this job, so humans don’t have to.
We give you a record that replaces all your mechanisms with a single include that dynamically combines all your authorized services correctly at the point of query.
In this particular incident, our users, many of whom also rely on NetSuite, would have experienced absolutely no impact because Dynamic SPF, as the name suggests, dynamically compacts the SPF record so that it always complies with the SPF protocol.
We even add more resiliency by monitoring and healing using ‘last known good’ values for a policy when transient errors occur (put simply: Dynamic SPF skips over the broken stuff). This means that the rest of your email can continue to flow. When these errors get fixed, Dynamic SPF will automatically incorporate the updated values with no user intervention.
While we don’t expect our slightly distressed competitor to request an OnDMARC demo anytime soon, there’s no reason why you shouldn’t.