Best Industrial Training in C,C++,PHP,Dot Net,Java in Jalandhar

Friday 24 August 2012

Validate email addresses using regular expressions

The local-part (the part before the "@" character) of the e-mail may use any of these ASCII characters [1]:
  • Uppercase and lowercase letters
  • The digits 0 through 9
  • The characters , ! # $ % & ' * + - / = ? ^ _ ` { | } ~
  • The character "." provided that it is not the first or last character in the local-part
The domain part of the address is much easier to handle. The dot-separated domain labels can only include letters, digits and hyphens [1].
There are two regexps in this script. The first one passes "normal looking" addresses such as foo.bar@baz.example.com or foo+bar@example.com. It does not, however, pass every syntactically valid address: foo,!#@example.com, for example, is rejected.
// define a regular expression for "normal" addresses
$normal = "^[a-z0-9_\+-]+(\.[a-z0-9_\+-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*\.([a-z]{2,4})$";

To understand this expression you need to be familiar with regexp syntax. You'll find links to some good tutorials at the end of this article.
The first part, ^[a-z0-9_\+-]+, means that the address has to start with the letters a-z, the digits 0-9 or the characters "_", "+" or "-". The final "+" means there must be 1..n of these characters. A normal username, say jsmith2, matches this expression. So does foo+bar.
The regexp continues with (\.[a-z0-9_\+-]+)*. It means that the first group of characters can be followed by a period "." and then by one or more characters from the same set as before the period. Because "." and "+" have special meanings in regexps, they are escaped with a backslash (inside square brackets "+" is literal even without the backslash). The final * means there can be 0..n of these sequences. This way the regexp matches the strings firstname.lastname, firstname.long-middlename.lastname and foo.bar+baz.
After these characters there must be a single "@" character. It must be followed by a domain label that consists of letters, numbers and hyphens. There can be 1..n domain labels separated by periods. The first label (without the period) is defined by [a-z0-9-]+. After this there can be 0..n similar sequences, each starting with a period. This is defined as (\.[a-z0-9-]+)*. At the time this article was written, most email addresses ended with a period followed by 2..4 letters (for example .fi or .info). The expression \.([a-z]{2,4})$ matches this.
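As a quick check of the walkthrough above, here is a minimal sketch that runs the "normal" pattern against a few sample addresses. It assumes PHP's preg_match() with "/" delimiters and the case-insensitive "i" flag; the names $normalPcre and looksNormal and the sample addresses are made up for illustration.

```php
<?php
// The article's "normal" pattern rewritten for preg_match(): "/" delimiters
// plus the "i" flag for case-insensitive matching. $normalPcre is a made-up name.
$normalPcre = '/^[a-z0-9_\+-]+(\.[a-z0-9_\+-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*\.([a-z]{2,4})$/i';

// Hypothetical helper: true when $email matches the "normal" pattern.
function looksNormal(string $email, string $pattern): bool
{
    return preg_match($pattern, $email) === 1;
}

// Made-up sample addresses exercising each part of the pattern.
$samples = [
    'jsmith2@example.com',
    'foo.bar+baz@baz.example.com',
    'JSmith2@Example.COM',
    '.foo@example.com',
    'foo..bar@example.com',
    'foo@example.museum',
];

foreach ($samples as $email) {
    echo $email, ': ', looksNormal($email, $normalPcre) ? 'match' : 'no match', "\n";
}
```

The first three samples match. The last three do not: one starts with a period, one has two consecutive periods, and one ends in a label longer than four letters.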
The second regexp is supposed to match all syntactically valid addresses, even those that we don't see that often. The idea in this example is that the validator should pass those strange looking addresses but tell the user that it would probably be a good idea to double check the address.
// define a regular expression for "strange looking" but syntactically valid addresses
$validButRare = "^[a-z0-9,!#\$%&'\*\+/=\?\^_`\{\|}~-]+(\.[a-z0-9,!#\$%&'\*\+/=\?\^_`\{\|}~-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*\.([a-z]{2,})$";

This ugly regexp is actually quite similar to the one declared earlier. The period-separated character sequences in the local-part can now include all the special characters defined in the RFC. The characters "$", "*", "+", "?", "^", "{" and "|" all have special meanings in regular expressions, so they are escaped with a backslash. The expression now allows the domain part to end with a period followed by 2..n letters, such as .museum.
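To see how the two patterns differ in practice, the following sketch classifies a handful of addresses with both. It assumes preg_match() with "/" delimiters and the "i" flag, so the literal "/" in the character class is escaped; $normalPcre, $validButRarePcre, classifyAddress and the sample addresses are hypothetical names chosen for this example.

```php
<?php
// Both patterns from the article as PCRE strings ("/" delimiters, "i" flag).
// The variable names are made up; the literal "/" is escaped as "\/".
$normalPcre = '/^[a-z0-9_\+-]+(\.[a-z0-9_\+-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*\.([a-z]{2,4})$/i';
$validButRarePcre = '/^[a-z0-9,!#\$%&\'\*\+\/=\?\^_`\{\|}~-]+(\.[a-z0-9,!#\$%&\'\*\+\/=\?\^_`\{\|}~-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*\.([a-z]{2,})$/i';

// Hypothetical helper: returns 'normal', 'rare' or 'invalid'.
function classifyAddress(string $email, string $normal, string $rare): string
{
    if (preg_match($normal, $email) === 1) {
        return 'normal';
    }
    if (preg_match($rare, $email) === 1) {
        return 'rare';
    }
    return 'invalid';
}

// Made-up sample addresses.
$samples = [
    'foo+bar@example.com',
    'foo,!#@example.com',
    'curator@example.museum',
    'foo bar@example.com',
];

foreach ($samples as $email) {
    echo $email, ': ', classifyAddress($email, $normalPcre, $validButRarePcre), "\n";
}
```

Trying the stricter pattern first keeps the common case quiet; only addresses that fall through to the second pattern need a "please double check" hint.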
You can use these regexps as follows (in PHP):
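// eregi() was removed in PHP 7, so preg_match() with the "i" flag is used instead.
// "<" and ">" serve as pattern delimiters because neither regexp contains them.
if (preg_match("<" . $normal . ">i", $email)) {
  echo("The address $email is valid and looks normal.");
}

else if (preg_match("<" . $validButRare . ">i", $email)) {
  echo("The address $email looks a bit strange but it is syntactically valid. You might want to check it for typos.");
}

else {
  echo("The address $email is not valid.");
}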
These regexps were inspired by and modified from the article "Using Regular Expressions in PHP" by James Ussher-Smith [2]. The article uses email address validation as an example, but the suggested regexp does not accept, for example, foo+bar@example.com.
You can use these regexps in your applications but please give credit to the original authors. Feel free to drop me an email if you liked this howto. :)

Limitations

  • The example here does not check that the length of the local-part is <65
  • The example here does not check that the length of the domain name is <256
  • The example here does not allow quoted strings in the local-part (e.g. "Foo Bar"@example.com). Quoted strings are allowed in the local-part, but RFC 2821 warns that they should be avoided.
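
The first two limitations could be covered by a separate length check run before the regexps. The sketch below is one way to do it, assuming the limits stated above (local-part shorter than 65 characters, domain shorter than 256); the function name hasAcceptableLengths is hypothetical.

```php
<?php
// Hypothetical helper covering the two length limitations listed above.
function hasAcceptableLengths(string $email): bool
{
    // Split at the last "@"; an unquoted local-part cannot contain "@".
    $at = strrpos($email, '@');
    if ($at === false) {
        return false;
    }

    $localLength = $at;                        // characters before the "@"
    $domainLength = strlen($email) - $at - 1;  // characters after the "@"

    return $localLength >= 1 && $localLength <= 64
        && $domainLength >= 1 && $domainLength <= 255;
}

// Made-up example: a 65-character local-part is rejected.
var_dump(hasAcceptableLengths(str_repeat('a', 65) . '@example.com'));
```

This check only looks at lengths, so it is meant to be combined with the regexps rather than to replace them.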
