Normalisation of data


This post used to be called Don't confuse validation with filtering.  This was meant to be about normalising data, but at the time, filtering was at the forefront of my mine due to work that did some of the things discussed here.  Filtering, validation, and normalising are all essential parts of handling data.  Finally, I want to stress that while normalisation is good, it's not easy.  Making assumptions is fine as long as you make educated assumptions.  The reason you don't see normalisation as much as you do is because it's difficult to get right.  I've attempted to be careful in my examples, and have suggested seeking confirmation from the user before continuing.

Brandon Savage posted an interesting article today about how validation blind spots hurt real people.  He gave four succinct rules.

  1. Accept valid data in any form provided by the user.
  2. Where possible, use well-developed validation libraries.
  3. Do not place artificial limits on valid data.
  4. Do place valid limits on specific data.

All good advice.  However, it misses out on filtering.  You see, filtering is different from validation.  Filtering is the art of taking bad data and turning it into good data.  It might sound difficult, but it's actually easier than you think, and will ease the pain in using your software.  Let's see some examples.


Passwords are secret.  Usernames, however, are usually not.  Even if they are private, like an email address, they are usually not the secure part of the application.  Couple this with mobile devices that auto-capitalize, if you're requiring case-sensitive usernames, you are making things more difficult then they should be for your users.  It's a pain having to go back, lower case a single letter at the beginning, and then retype a fairly long password just to log in because your site decided that 'Jasonlotito' and 'jasonlotito' are not the same user.

At the same time, if I do enter my email address, and you ask for a username, then just go with what I give you.  I presume that their is only one email address per account.  If I've forgotten my username for your site, but I know I have an email address tied to an account, just let me use that.  You've seen I've entered my email address, you can then look up by email address, get my username from there, and proceed to validate.

Credit cards

Credit card numbers follow simple rules.  If I enter my credit card number, you know what card I'm using, so asking me is silly.  You don't even need to know the full number on the card to figure out the card type.  Once I've entered enough numbers in for you to know, simply change the pull down to the correct type, and change the icon.  It will give visual feedback that yes, we know what card you are using.  Their is no reason not to do this.

Also, if I put in dashes, or dots, or spaces, or whatever, you can deal with it.  Ignore them.  Simply use the numbers.  You don't need to exclude characters, simply white list numbers.

While we are here, expiration dates are easy, and yet, somehow, people get them wrong.  Maybe I'm missing out on some details here, but every credit card I've ever seen lists expiration dates as MM/YY.  This is a sign that maybe your forms should follow suit.  This means your month should be '03', not March (03).  Just '3' would suffice.  The year should be equally clear: '15', not '2015'.  I promise you, no one is assuming 1915, or even 2115.

Phone numbers

Filtering phone numbers is hard.  I know.  I've had to do it.  The system needed a filtered number because we had to call back in real time from an automated system located in the US.  This mean we needed a properly formatted number that could be dialed.  This isn't easy, because when you ask for a phone number, people are going to start adding in country codes (but they might).  My suggestion is, unless you are having to do some automated system is to simple not filter numbers unless you can be sure, 100%, of what you are doing.  Even then, you probably don't need to do it.

Simply ask the user for their phone number, and store whatever they put.  If you need to call them, you can find their country code, or area code if you need to.

Email addresses

Email addresses can be tricky.  You want to make sure they enter in the right email address. A simple mistake can cause problems.  I've, on occasion, type in jasonlottio instead of jasonlotito, and the problem was painful to resolve.  On the username side, you can't do much.  However, after you hit the ampersand, you can start some simple checks.  If someone types in gmail or hotmail or yahoo and leaves off the .com, you can make assumptions.  You can validate these assumptions with the user on the next page.  Transposing letters (htomail) might be something you should look out for, too.  Compile a list of common email providers, and then check how close they are to the users non-common provider name.  If they are close, verify with the user if they didn't misspell the name.


The key to all these filtering techniques is assumption.  It's not always perfect, but it's better than failing.  After all, if I'm in Canada, and my phone number is a 10 digit phone number, you can make the assumption that to call me long distance, you'll need to add a 1.  You can assume that if my credit card number starts with 4111, I'm using a Visa.  Their are lots of assumptions you can make that will make things easier for the user.  Go ahead.  It's okay.