Validating Email Addresses with a Regex? Do yourself a favor and don’t

Let’s say you have a simple problem: build a form that lets a user sign up for a newsletter. Obviously, you need to keep users from entering junk while still allowing “exotic” addresses.

What does a valid address look like? Intuitively one would say:
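For illustration, an intuitive pattern of this kind might look like the following Java sketch. The pattern itself and the class name NaiveEmailRegex are assumptions made for this example:

```java
import java.util.regex.Pattern;

class NaiveEmailRegex {
    // Hypothetical "intuitive" pattern: word characters, an @,
    // word characters, a dot, and more word characters.
    static final Pattern NAIVE = Pattern.compile("^\\w+@\\w+\\.\\w+$");

    static boolean looksValid(String address) {
        return NAIVE.matcher(address).matches();
    }
}
```

With this pattern, looksValid("john@example.com") is true, while looksValid("john.doe@example.com") and looksValid("john@my-domain.com") are both false.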

Looks good, but is wrong. Among other things, this regex won’t allow anyone with a dot or a hyphen in their address to sign up. Better have a look at the specification to see what’s valid and what isn’t:

Bummer! Proper validation makes the regex considerably more complicated:

Now you have four not so simple problems:

  1. How do you determine the correctness of this monster? Regular expressions you dig up on the interwebs are usually the result of someone posting a problem to a forum and someone else thinking about it for five minutes. If the answer works for a couple of selected test cases, it becomes the accepted solution and gets copied and pasted over and over again. Eventually it fails, gets “fixed” and reposted till you get something so wildly complex that nobody dares to touch it (and yes, I copied it from somewhere without checking).
  2. A lot of what RFC 5322 allows will bounce in real life and should therefore not be accepted by anything but an MTA. However, you can’t really adapt the regex above to your needs without fear of breaking it.
  3. Regular expressions are only mildly efficient at best. The longer they are, the longer it takes to compile the pattern. Matching runs in O(n) on automaton-based engines, but backtracking engines, which include many popular ones, can be much slower on unlucky combinations of pattern and input.
  4. A plain match only tells you that the input matched, not which part of it matched what. If, for example, you wanted to check the domain against a blacklist, you’d have to add capture groups to a pattern that is already unreadable.

Let’s do it the right way

First, we need to simplify things. Since we are validating, we are only interested in a subset of RFC 5322 compliant email addresses: those that look like they won’t bounce right away. Rules:

  1. All addresses must conform to the localpart@domain pattern.
  2. For the local part, we only accept an RFC 5322 dot-atom. However, for practical purposes, we further limit atext to alphanumeric characters, hyphens, underscores and plus signs. The rationale is to keep things simple for the software we are validating for (it must deal with whatever we allow to pass).
  3. The domain must consist of at least two components (TLD and a second level domain), separated by a dot. Additional subdomains are acceptable.
  4. The TLD must consist of at least two letters, no numbers or hyphens.
  5. Domain components may only consist of alphanumeric characters and hyphens. They may, however, neither start nor end with a hyphen and must be at least one character long.
  6. Unicode domains must be encoded in punycode. Rationale: don’t allow things to pass that might break legacy systems.

Regular expressions are just one way of defining finite-state machines. Another, more user-friendly method is to draw a state diagram (ignore for the moment that not all rules are enforced; we’ll come to that later):

[Figure: emailaddrfsm, the state diagram of the email address automaton]

Now, let’s transform that automaton into code, step by step. First stop, the generic DFA skeleton. This piece of code acts as a driver for iterating over the input and inspecting each character (note: safety checks omitted):
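A minimal sketch of such a driver in Java could look like this. The class name DfaSkeleton and the two placeholder states are assumptions; they only stand in for the real states of the diagram, and null checks are left out as mentioned:

```java
class DfaSkeleton {
    static final int ERROR = -1;

    // Placeholder automaton (an assumption, not the email diagram):
    // q0 --'a'--> q1, q1 --'b'--> q1, and q1 is the only accepting state.
    static boolean accepts(String input) {
        int state = 0; // start in q0
        for (int i = 0; i < input.length() && state != ERROR; i++) {
            char ch = input.charAt(i); // the character under inspection
            switch (state) {
                case 0: // q0
                    state = (ch == 'a') ? 1 : ERROR;
                    break;
                case 1: // q1
                    state = (ch == 'b') ? 1 : ERROR;
                    break;
                default:
                    state = ERROR;
            }
        }
        return state == 1; // accept only if we stopped in q1
    }
}
```

The loop stops as soon as the error state is reached, so the rest of the input is never inspected.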

 

A state in the diagram corresponds to a case statement in the switch block. When entered, a state performs the following actions:

  • If the state diagram contains an edge that is labeled with the value of char ch and points to state qX, then set the value of int state to X.
  • If there is no edge for the value of char ch, then set the value of int state to -1 (error state).
  • If there is code associated with the transition, execute that code as well.

For example, consider the following simplified state diagram for an automaton that counts the number of lowercase letters in the input (the code for counting is omitted from the diagram):

[Figure: simplecountingfsm, the state diagram of the lowercase-counting automaton]

Transformed according to the rules above, the code for q0 looks like this:
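As a sketch, and assuming that q0 is the only state and has a single self-loop labeled a-z (the diagram may differ), the case for q0 inside the driver loop might read as follows. The class and method names are hypothetical:

```java
class LowercaseCounter {
    static final int ERROR = -1;

    // Returns the number of lowercase letters, or -1 if the error state is hit.
    static int countLowercase(String input) {
        int state = 0;
        int count = 0;
        for (int i = 0; i < input.length() && state != ERROR; i++) {
            char ch = input.charAt(i);
            switch (state) {
                case 0: // q0
                    if (ch >= 'a' && ch <= 'z') {
                        state = 0; // edge labeled a-z points back to q0
                        count++;   // code associated with the transition
                    } else {
                        state = ERROR; // no edge for this character
                    }
                    break;
                default:
                    state = ERROR;
            }
        }
        return state == ERROR ? -1 : count;
    }
}
```

Here countLowercase("hello") yields 5, while countLowercase("heLlo") yields -1 because no edge is labeled with an uppercase letter.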

Finally, we need to perform a couple of sanity checks. These are the things that blow a regex out of proportion when you try to encode them in the automaton itself:
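Which checks belong here depends on your use case. The sketch below assumes two that follow from the text above: the TLD length from rule 4 and the domain blacklist mentioned earlier. The class name, the method name and the blacklist entry are placeholders:

```java
import java.util.Locale;
import java.util.Set;

class SanityChecks {
    // Hypothetical blacklist; the single entry is a placeholder.
    static final Set<String> BLACKLIST = Set.of("example.invalid");

    // 'domain' is the text after the '@', assumed to have passed the automaton.
    static boolean domainLooksSane(String domain) {
        int lastDot = domain.lastIndexOf('.');
        if (lastDot < 0) {
            return false; // rule 3: at least two components
        }
        String tld = domain.substring(lastDot + 1);
        if (tld.length() < 2) {
            return false; // rule 4: the TLD has at least two letters
        }
        // Domain names are case-insensitive, so compare in lower case.
        return !BLACKLIST.contains(domain.toLowerCase(Locale.ROOT));
    }
}
```

Checks like these are a few lines of ordinary code once the automaton has told you where the domain starts.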

Putting it all together

Now, let’s stitch all the parts together. If all you wanted was to get your problem solved, then the code below is what you came here for. A word of warning, though: you should really read how it was constructed! There is no one-size-fits-all solution for email address validation. You might find that some of my design choices are either too strict or too permissive for your use case. It helps to step through the diagram with a chess pawn to see what the automaton accepts, what it doesn’t accept and why that is the case.
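The following Java sketch shows one way to implement the six rules listed earlier. The class name EmailValidator, the helper methods and the state numbering are assumptions of this sketch and need not match the diagram:

```java
class EmailValidator {
    private static final int ERROR = -1;

    private static boolean isLetter(char ch) {
        return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z');
    }

    private static boolean isAlnum(char ch) {
        return isLetter(ch) || (ch >= '0' && ch <= '9');
    }

    // Restricted atext from rule 2: alphanumerics, hyphen, underscore, plus.
    private static boolean isAtext(char ch) {
        return isAlnum(ch) || ch == '-' || ch == '_' || ch == '+';
    }

    static boolean isValid(String address) {
        if (address == null) {
            return false;
        }
        int state = 0;
        int dots = 0;               // dots seen in the domain
        int labelStart = 0;         // index where the current domain label begins
        boolean lettersOnly = true; // current domain label contains letters only

        for (int i = 0; i < address.length() && state != ERROR; i++) {
            char ch = address.charAt(i);
            switch (state) {
                case 0: // start of a local-part atom
                    state = isAtext(ch) ? 1 : ERROR;
                    break;
                case 1: // inside a local-part atom
                    if (isAtext(ch)) { state = 1; }
                    else if (ch == '.') { state = 0; }
                    else if (ch == '@') { state = 2; labelStart = i + 1; lettersOnly = true; }
                    else { state = ERROR; }
                    break;
                case 2: // start of a domain label
                    if (isAlnum(ch)) { state = 3; lettersOnly &= isLetter(ch); }
                    else { state = ERROR; }
                    break;
                case 3: // inside a label, last character alphanumeric
                    if (isAlnum(ch)) { lettersOnly &= isLetter(ch); }
                    else if (ch == '-') { state = 4; lettersOnly = false; }
                    else if (ch == '.') { state = 2; dots++; labelStart = i + 1; lettersOnly = true; }
                    else { state = ERROR; }
                    break;
                case 4: // inside a label, last character a hyphen
                    if (isAlnum(ch)) { state = 3; lettersOnly &= isLetter(ch); }
                    else if (ch != '-') { state = ERROR; }
                    break;
                default:
                    state = ERROR;
            }
        }
        // Sanity checks: stopped inside a label, at least two domain
        // components, and a TLD of two or more letters.
        return state == 3 && dots >= 1 && lettersOnly
                && address.length() - labelStart >= 2;
    }
}
```

Because only ASCII letters and digits have edges, a raw Unicode domain ends in the error state, while a punycode label such as xn--bcher-kva passes as an ordinary label with hyphens. The sketch only answers yes or no; remembering the index of the '@' in the same loop would also give you the domain for further checks.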

 
