Using the Normalize Diacritic Characters Activity

May 11, 2009 at 10:13 AMHenrik Nilsson

I got a comment from Joe Stepongzi today and he didn’t like my Normalize Diacritic Characters Activity that is a part of my Cortego ILM 2 Workflow Activity Library:

I am not sure I like the Normalize Diacritic Characters Activity..
As certain values could be changed to multiple characters instead of one..
I think email addresses should be done at the source and not handled in ILM "2"

The use of the Normalize Diacritic Characters Activity is to normalize characters with different kinds of diacritics into pure characters or how I should define it? The main reason I've created this activity is that I'm from Sweden and must handle "ÅÄÖ" but I'm also working for a company that has a lot of employees in the eastern European countries and that is a nightmare when trying to create for example email addresses. This could be hard to understand for Britain’s and Americans since English is a language where diacritics are sparsely used and this wouldn't have been a problem if the Americans would have understood from the beginning there are other languages than English and a need for other standards than ASCII. Here are a couple of examples of what could be accomplished (I do hope your browser supports Unicode otherwise you'll probably see a lot of boxes):

As you see the activity is only normalizing diacritics by removing any Unicode spacing marks and this is how it works code wise using the System.Globalization namespace for normalization of diacritics:


public static string NormalizeDiacriticChars(string input)
{
   string formD = input.Normalize(NormalizationForm.FormD);
   StringBuilder sb = new StringBuilder();
   for (int i = 0; i < formD.Length; i++)
   {
      UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(formD[i]);
      if (uc != UnicodeCategory.NonSpacingMark)
      {
         sb.Append(formD[i]);
      }
   }
   return (sb.ToString().Normalize(NormalizationForm.FormC));
}

First of all the input string is normalized into Form D that decomposes characters in this way:

  • å –> aRing
  • Ё –> E + Umlaut
  • æ –> a + e (Used in Danish, Norwegian and old English more)
  • –>  ++ (Hangul letter used in Korea)

Then all characters defined as Unicode spacing marks are removed and in the example above the ring and the dots (umlaut) are removed. Finally the remaining string is normalized into Form C, composing characters back, for example:

  • a -> a (The ring is already removed)
  • E -> E (The umlaut is already removed)
  • a + e –> æ (Note: if the original input would have been “ae” it would not become “æ”)
  • + + –>

Normalizing a eastern European name like "Lāčkāja Lapiņš" would end up as "Lackaja Lapins" and a typical Swedish name like "Åsa Öberg" would end up as "Asa Oberg", a lot easier to handle for creating different kind of names and also widely accepted in the countries where diacritic characters are used.

As you can see, characters are not as Joe thought changed into multiple characters but he do have a point in that for example email addresses should be handled at the source and not in ILM2/FIM2010... But if you would like accounts and mailboxes to be automatically created from for example an HR system, one of the best practices of Identity Management... You might be forced to create the email addresses and other system names following your naming standards unless you trust your HR personnel having full control over all existing email addresses and names. It’s up to you to make sure input characters are valid but by using this activity you don’t have to worry about macrons, curls, dots, accents and so on but as you can see the  and æ characters is not changed or removed so they would still a be problem when creating email addresses.

A solution to make sure you get valid strings after normalization could be to use my Regex Replace Activity to remove or replace any remaining characters that isn’t valid in the context you’re using it. In order to get unique names or email addresses you could use my Unique Name Activity. Both these activities is contained in the Cortego ILM 2 Workflow Activity Library. The pattern "[^a-zA-Z0-9\s]" could be used in the Regex Replace Activity to find and remove or replace all characters that is not within a-z, A-Z, 0-9 and whitespace characters.  

If you would like to know more about Unicode Normalization this is a great guide: Unicode Normalization Forms. If you would like to know how different characters from different scripts including Cyrillic, Greek, Latin, Thai, Katakana, and so on are composed/decomposed you could have a look at these Normalization Charts. A description of different kinds of diacritics could be found at Diacritic - Wikipedia.

Finally, do you trust your HR personnel or do you have a Catbert at your company? Laughing

Posted in: Workflow | Forefront Identity Manager

Tags: , , , , ,