If you have ever dealt with addresses (whether you sent a letter to someone or had to deal with tens of millions of voter records), you might have wondered how this location data was processed. Nat and Mike are here to help you stop wondering with an in-depth discussion of all things address-related. We will talk about what goes on behind the scenes in software you may be using, review common problems that arise, and delve into some low-cost options for resolving these issues on your own. If you want to just skip all of that and unload the problem on someone else, that’s totally cool too; that just happens to be a service we provide (email us at [email protected]). Also, go here to sign up for email updates on larger topics we will be covering, such as this one.
We have a lot of experience with address parsing since we had a lot of it to do for the 2012 election. Instead of buying some fancy CASS software or passing off the address cleaning task to a sad underpaid intern, we took the hard way out. Some might call this ‘reinventing the wheel’ (we call it ‘reinventing the map’). Using stripped-down CASS validation through the AMS API and a variety of parsers we either wrote or ported ourselves, we processed, parsed, fixed, and validated tens of millions of addresses. Given that we were dealing with election data and our tolerance for errors was pretty much zero, we think our solutions worked out pretty well, and we hope you can get some use out of what we worked on for oh so long.
In describing our efforts, we’ll start by explaining some of the terms we use repeatedly throughout this post (all those formalized address terms), then discuss the problem of address parsing, potential solutions, and when to prioritize each solution. If you are only interested in the coding sections of this post, jump on down to the solutions.
Coding Accuracy Support System (CASS) and Address Matching System (AMS)
CASS certification is a service offered by the US Postal Service for evaluating address matching software. Companies dealing with large amounts of address input may write their own software for matching possibly poorly formatted addresses to correct USPS mailing addresses. The CASS program provides test sets on which to evaluate address matching performance in several areas: (1) 5-digit coding, (2) ZIP + 4/delivery point (DP) coding, (3) carrier route coding, (4) DPV®, (5) DSF2®, (6) LACSLink®, (7) eLOT®, and (8) RDI™. (This blob of acronyms represents several USPS address services. Definitions can be found here: https://ribbs.usps.gov/) Both debugging sets, which come with correct processing results, and evaluation sets used for certification are available.
According to the USPS certification programs site (https://www.usps.com/business/certification-programs.htm), “To be CASS Certified, participants must pass with a minimum score of 98.5 percent for ZIP + 4, carrier route, five-digit and LACSLink. 100 percent for delivery point coding, eLOT, DPV, RDI and Perfect Address.” The implication here is that some parsing and smart coding can fix things like ZIP codes and street names, but several of these requirements depend on proprietary USPS databases. For example, “The LACSLink Product is a secure dataset of converted addresses that primarily arise from the implementation of the 911 system, which commonly involves changing rural-style addresses to city-style addresses.” (https://ribbs.usps.gov/index.cfm?page=lacslink)
Though it *may* be possible to produce these databases on your own, some of the requirements are extremely arcane. For example, if you choose to emulate DSF2, you’re going to have to differentiate between door slot addresses and curb box addresses (https://ribbs.usps.gov/index.cfm?page=dsf2). Of course, that’s optional. You can go with DPV instead and just provide data on vacant addresses (https://ribbs.usps.gov/index.cfm?page=dpv). I think we can all agree that sounds like a pretty Trivial Problem™.
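To make the pass/fail thresholds quoted above a bit more concrete, here is a minimal Python sketch of how you might score your own matcher against a test set before worrying about certification. This is not the USPS scoring procedure, which we are not reproducing here. The category labels, the `(category, expected, actual)` tuple format, and the function names are all assumptions made up for illustration; only the percentages come from the USPS quote.

```python
from collections import defaultdict

# Minimum passing scores (percent), taken from the USPS quote above.
# The category keys are hypothetical labels, not official USPS identifiers.
THRESHOLDS = {
    "zip4": 98.5,
    "carrier_route": 98.5,
    "five_digit": 98.5,
    "lacslink": 98.5,
    "delivery_point": 100.0,
    "elot": 100.0,
    "dpv": 100.0,
    "rdi": 100.0,
    "perfect_address": 100.0,
}


def score_by_category(results):
    """Return {category: percent correct}.

    `results` is an iterable of (category, expected, actual) tuples,
    an assumed format for comparing matcher output to a known answer.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, expected, actual in results:
        total[category] += 1
        if expected == actual:
            correct[category] += 1
    return {c: 100.0 * correct[c] / total[c] for c in total}


def failing_categories(scores, thresholds=THRESHOLDS):
    """Return the sorted categories whose score is below its threshold."""
    return sorted(
        c for c, s in scores.items() if c in thresholds and s < thresholds[c]
    )


# Toy data: one wrong ZIP in 200 still passes; one wrong DPV flag in 100 does not.
results = [("five_digit", "60614", "60614")] * 199
results += [("five_digit", "60614", "60641")]
results += [("dpv", "Y", "Y")] * 99 + [("dpv", "Y", "N")]

scores = score_by_category(results)
print(failing_categories(scores))  # prints ['dpv']
```

The point of the toy data is the asymmetry: a 98.5 percent category leaves a little room for mistakes, while a 100 percent category leaves none, which is why the database-backed services are so hard to emulate on your own.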
The upshot is that if you’re going to build a CASS-certified address matcher, you’re probably going to be building on some USPS products. One alternative to bringing all these products together yourself is to start with a USPS product that already bundles them into a CASS-certified matcher: the USPS Address Matching System (AMS). AMS combines a parser and the required databases into a single system and provides an API in C that can be used to run address queries. The big downside to AMS is that it is not free. The license cost varies depending on a few factors, but it can run several thousand dollars for a year of bimonthly updates. Despite the cost, this was our starting point for address standardization. Information on licensing can be found here: https://ribbs.usps.gov/index.cfm?page=amsapi