Let’s say you got a simple problem: build a form that allows a user to sign up for a newsletter. Obviously, you need to prevent users from entering junk while still allowing “exotic” addresses.
What does a valid address look like? Intuitively one would say:
1 2 3 4 5 6 7 8 |
// Example address: johndoe@example.com // As parts: ${MAILBOX}@${SUBDOMAIN}.${TLD} // Possible regular expression [a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z0-9]+ |
Looks good, but is wrong. Among other things, this regex won’t allow anyone with a dot or a hyphen in their address to sign up. Better have a look at the specification to see what’s valid and what isn’t:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 |
// Missing domain? This is valid and gets delivered locally! john-doe // A TLD is a domain, but that mbox probably doesn't exist there john-doe@com // Perfectly valid address, but there are no single letter TLDs john-doe@example.c // Oh right, subdomains! We can't possibly exclude .co.uk, right? john@doe.example.com // Fancy! Though we can count on users not typing that in "John Doe" <john.doe@example.com> // Want to cut out the middleman? The IP address of a mailserver is acceptable john@[192.168.1.1] // Perfectly legal as an address, but the domain cannot be registered. john@-doe.com // This is allowed... john.doe@example.com // But curiously, none of these are john..doe@eaxmple.com .john@example.com john.@example.com |
Bummer! Proper validation makes the regex considerably more complicated:
1 |
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\]) |
Now you have four not so simple problems:
- How do you determine the correctness of this monster? Regular expressions you dig up on the interwebs are usually the result of someone posting a problem to a forum and someone else thinking about it for five minutes. If the answer works for a couple of selected test cases, it becomes the accepted solution and gets copied and pasted over and over again. Eventually it fails, gets “fixed” and reposted till you get something so wildly complex that nobody dares to touch it (and yes, I copied it from somewhere without checking).
- A lot of what RFC5322 allows will bounce in real life and should therefore not be accepted by anything but an MTA. However, you can’t really adapt the regex above to your needs without fear of breaking it.
- Regular expressions are only mildly efficient at best. The longer they are, the longer it takes to compile the pattern (matching is always done in O(n) – unless you got a horrible regex engine).
- Regular expressions only match, but don’t tell you what exactly was matched. If, for example, you wanted to check the domain against a blacklist, you’d be out of luck.
Let’s do it the right way
First, we need to simplify things. Since we are validating, we are only interested in a subset of RFC 5322 compliant email addresses: those that look like they won’t bounce right away. Rules:
- All addresses must conform to the localpart@domain pattern.
- For the local part, we only accept an RFC5322 dot-atom. However, for practical purposes, we further limit {atext} to alphanumeric characters, hyphens, underscores and plus signs. The rationale here being to keep it simple for the software we are validating for (it must deal with whatever we allow to pass).
- The domain must consist of at least two components (TLD and a second level domain), separated by a dot. Additional subdomains are acceptable.
- The TLD must consist of at least two letters, no numbers or hyphens.
- Domains components may only consist of alphanumeric characters and hyphens. They may, however, neither start nor end with a hyphen and must be at least one character long.
- Unicode domains must be encoded in punycode. Rationale: don’t allow things to pass that might break legacy systems.
Regular expressions are just one way for defining finite-state machines. Another, more user friendly, method is by drawing a state diagram (ignore for the moment, that not all rules are enforced. We’ll come to that later):
Now, let’s transform that automaton into code, step by step. First stop, the generic DFA skeleton. This piece of code acts as a driver for iterating over the input and inspecting each character (note: safety checks omitted):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
char[] input; // The string to match int index = 0; // Iterator index for the input char[] ch; // Current character (input[index]) int state = 0; // Automaton state (-1 is the error state) while (index <= input.length && state != -1) { if (index == input.length) { ch = '\0'; // Terminator character } else { ch = input[index]; } switch (state) { // case 0: {... } index++; } |
A state in the diagram corresponds to a case statement in the switch block. When entered, a state performs the following actions:
- If the state diagram contains an edge that is labeled with the value of char ch and points to state qX, then set the value of int state to X.
- If there is no edge for the value of of char ch , then set the value of int state to -1 (error state).
- If there is code associated with the transition, execute that code as well.
For example, consider the following, simplified state diagram for an automaton, which counts the number of lowercase letters in the input (the code for counting is omitted from the diagram):
Transformed according to the rules above, the code for q0 looks like this:
1 2 3 4 5 6 7 8 9 10 11 12 |
case 0: { if (ch >= 'a' && ch <= 'z') { count++; break; // State stays the same } if (ch == '\0') { state = 1; // EOL break; } state = -1; // Error break; } |
Finally, we need to perform a couple of sanity checks. These are what blows your regex out of proportion if you try to do them with the DFA:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 |
public class Validator { public static boolean isValid(final char[] input) { int state = 0; char ch; int index = 0; String local = null; ArrayList<String> domain = new ArrayList<String>(); // Code for the DFA ommitted. // [..] // Sanity checks // Input not accepted if (state != 6) return false; // Require at least a second level domain if (domain.size() < 2) return false; // RFC 5321 limits the length of the local part if (local.length() > 64) return false; // RFC 5321 limits the total length of an address if (input.length > 254) return false; // TLD must only consist of letters and be at least two characters long. index = input.length - 1; while (index > 0) { ch = input[index]; if (ch == '.' && input.length - index > 2) { return true; } if (!((ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z'))) { return false; } index--; } return true; } } |
Putting it all together
Now, let’s stitch all the parts together. If all you wanted was to get your problem solved, then the code below is what you came here for. Word of warning, though: you should really read how it was constructed! There is no one size fits all solution for email address validation. You might find that some of my design choices were either too strict or too permissive for your use case. It helps, using a chess pawn to step through my diagram to see what the automaton accepts, what it doesn’t accept and why that is the case.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 |
import java.util.ArrayList; public class EmailValidator { public static boolean isValid(final char[] input) { if (input == null) { return false; } int state = 0; char ch; int index = 0; int mark = 0; String local = null; ArrayList<String> domain = new ArrayList<String>(); while (index <= input.length && state != -1) { // Dealing with a char instead of char[] makes life a lot easier! if (index == input.length) { ch = '\0'; // We need to encode the end by using a terminator } else { ch = input[index]; if (ch == '\0') { // but the terminator may not be part of the input! return false; } } switch (state) { case 0: { // Transition on {atext} if ((ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') || (ch >= '0' && ch <= '9') || ch == '_' || ch == '-' || ch == '+') { state = 1; break; } // Unexpected Character -> Error state state = -1; break; } case 1: { // Consume {atext} if ((ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') || (ch >= '0' && ch <= '9') || ch == '_' || ch == '-' || ch == '+') { break; } if (ch == '.') { state = 2; break; } if (ch == '@') { // Endof local part local = new String(input, 0, index - mark); mark = index + 1; state = 3; break; } // Unexpected Character -> Error state state = -1; break; } case 2: { // Transition on {atext} if ((ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') || (ch >= '0' && ch <= '9') || ch == '_' || ch == '-' || ch == '+') { state = 1; break; } // Unexpected Character -> Error state state = -1; break; } case 3: { // Transition on {alnum} if ((ch >= 'a' && ch <= 'z') || (ch >= '0' && ch <= '9') || (ch >= 'A' && ch <= 'Z')) { state = 4; break; } // Unexpected Character -> Error state state = -1; break; } case 4: { // Consume {alnum} if ((ch >= 'a' && ch <= 'z') || (ch >= '0' && ch <= '9') || (ch >= 'A' && ch <= 'Z')) { break; } if (ch == '-') { state = 5; break; } if (ch == '.') { domain.add(new String(input, mark, index - mark)); mark = index + 1; state = 5; break; } // Match EOL if (ch == '\0') { domain.add(new String(input, mark, index - mark)); state = 6; break; // EOL -> Finish } // Unexpected Character -> Error state state = -1; break; } case 5: { if ((ch >= 'a' && ch <= 'z') || (ch >= '0' && ch <= '9') || (ch >= 'A' && ch <= 'Z')) { state = 4; break; } if (ch == '-') { break; } // Unexpected Character -> Error state state = -1; break; } case 6: { // Success! (we don't really get here, though) break; } } index++; } // Sanity checks // Input not accepted if (state != 6) return false; // Require at least a second level domain if (domain.size() < 2) return false; // RFC 5321 limits the length of the local part if (local.length() > 64) return false; // RFC 5321 limits the total length of an address if (input.length > 254) return false; // TLD must only consist of letters and be at least two characters long. index = input.length - 1; while (index > 0) { ch = input[index]; if (ch == '.' && input.length - index > 2) { return true; } if (!((ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z'))) { return false; } index--; } return true; } } |