Regular Expressions, advanced string matching and new Split function SQL Server 2008 R2

Posted: 2011.02.07 in 2008, SQL Server 2008

NOTE: in order to use the functions discussed below you will need to enable Master Data Services on your server and install a version of the MDS database. You *do not* need to configure the Master Data Manager website or anything else, and since the MDS is not a Windows Service it will not create any additional overhead on your server. Just run the .msi and create the database and how you do this is described here: http://thefirstsql.com/2011/02/08/how-to-enable-master-data-services-mds-in-sql-server-2008-r2/

I’ve only recently begun investigating the new Master Data Services in SQL Server 2008 R2 and I have to say that I was completely baffled at some of the new functionality!! MDS is regarded as a Data Management tool and is usually only used in Business Intelligence projects. And it was for this purpose I started investigating, but I soon realised that some of the functions that come with MDS has a far greater potential than just in BI projects. The following functions comes out-of-the-box when you install an instance of Master Data Services:

  • mdq.NGrams
  • mdq.RegexReplace
  • mdq.RegexExtract
  • mdq.RegexSplit
  • mdq.RegexIsMatch
  • mdq.Similarity
  • mdq.RegexIsValid
  • mdq.SimilarityDate
  • mdq.RegexMask
  • mdq.Split

Now, I have to admit that I can’t see the real usage value for all of these but I find mdq.RegexIsMatch, mdq.Similarity and mdq.Split to be reeeeeaally interesting :) They can solve a whole bunch of problems that previously had no solution at all or that required some cumbersome workarounds.

mdq.RegexIsMatch

There is no doubt about it; I suck at RegEx! Always when I want to do some complex matching I have to search the web for hours to find an example that I can copy and that’s kind of embarrassing. Nevertheless, this new function makes it possible to use RegEx in sql server without having to create your own clr assembly and that really helps! Here is how you use the new function (be aware that you’ll have to install the MDS database first):

--> Usage: mdq.RegexIsMatch (input,pattern,mask)
declare @table table (
	email varchar(200))
insert into @table values ('my.sample@mail.com'),('15.sample@mail.com'), ('my.sample@mail.c')

select email, isvalid = mdq.RegexIsMatch(email, N'^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3} \.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)+$', 0)
from @table
--> Result
email	isvalid
my.sample@mail.com	1
15.sample@mail.com	1
my.sample@mail.c	0

A rediculously long regex expression but it works!

mdq.Similarity

The coolest of the new functions BY FAR is the mdq.Similarity function. It’s not something you’ll use every day, but when you need it you *really* need it! What it basically does is to compare two strings and score their similarity from 0.0 to 1.0 and the higher the number is the more alike the two strings are. This is extremely useful when you for example need to merge two tables together or you need to find closely matching records and you don’t have a full-text index available. The function supports 4 different matching algorithms:

  • Levenshtein edit distance algorithm (0)
  • Jaccard similarity coefficient algorithm (1)
  • A form of the Jaro-Winkler distance algorithm (2)
  • Longest common subsequence algorithm (3)

I will not go in to detail about each algorithm, but to my (limited) experience it seems that the Jaro-Winkler distance algorithm (3) will give the best result when comparing name records which is what I have done so far. Here’s how you would use it:

--> Usage: mdq.Similarity (input1,input2,method,containmentBias , [ minScoreHint ] )
SELECT mdq.Similarity(N'The First SQL', N'TheFirstSQL', 2, 0.0, 0.0);
--> Result:
(No column name)
0,964102564102564

mdq.Split

The Split function is widely used in programming and in SQL Server there are numerous implementations that span from custom clr-functions to all kinds of user defined functions. And the mdq.Split works more or less exactly as you expect it to, except it has some additional functionality. You can for example use regular expressions inside the split function:

--> Usage: mdq.Split (input,separators,removeEmpty,tokenPattern,[mask])
select * from mdq.split('the, first, sql, rules', ', ', 1, '', 0)
--> Result:
Sequence	Token	IsValid
1	the	1
2	first	1
3	sql	1
4	rules	1

Pretty cool, eh?? :)

About these ads
Comments
  1. Sachin says:

    Just wondering is it possible to use them in check constraint on columns the way we use user defined functions.

    • Henning says:

      Well, these are perfectly normal UDF’s but they reside in the MDS database, and cross database references are not allowed in check constraints. You can however move the assembly and the UDF into your database and it will work just great.

  2. Paul says:

    Thanx for this out of the box tip!
    I’m definitely going to check this out!

  3. mattpenner says:

    We would love to use these functions in some of our SQL Server Standard license instances. Is it legal to export these functions into a non-MDS licensed instance? From what I see MDS is only available for Enterprise, Data Center or Developer versions.

    • Henning says:

      Well Matt, I usually let others deal with the licensing issues but Master Data Services is indeed an enterprise/data center feature so I suppose you will be in breach of the standard license if you use the MDS features. I would hate to see you get penalized for using a feature you don’t have a license for during an audit…

  4. mattpenner says:

    I agree and I’m playing it safe. No sense in inviting an issue when there are other Regex solutions out there. I tried using http://msdn.microsoft.com/en-us/magazine/cc163473.aspx but when doing a speed comparison between that solution and the MDS using the RegexIsMatch function against 2000 items returning 75 hits (fairly simple regex) the MDS returned in less than a second and the proposed solution in the blog took 15 seconds.

    Very disappointing performance. Now I need to try to find the performance issue. Any suggestions would be great. This is my first venture into CLR Integration but I am a professional C# developer so I should be able to over come the issue, provided I can find the cause.

    Any comments would be appreciated.

    Thanks!

  5. Henning says:

    I just tested the mdq.RegExIsMatch with the regex given in the example above against the Person.EmailAddress-table in the AdvertureWorks database. The table has roughly 20k records in it and the query returns 43 invalid emails in less than a second, so it’s seemingly scaling very well. Unfortunately I don’t know the exact nature of the function and neither do I know very much about .net but if look at the function definition you will see that it refers to Microsoft.MasterDataServices.DataQuality.SqlClr. Hopefully that means more to you than to me :)

  6. Mattias says:

    A quick comment about mdq.split:

    Instead of select * from mdq.Split(‘the, first, sql, rules’, ‘, ‘, 1, ”, 0) use select * from mdq.split(‘the, first, sql, rules’, ‘, ‘, 1, null, 0) for much better performance.

    Looks like the function runs some expensive Regex matching when you pass in an empty Regex string.

  7. Thanks for finally writing about >Regular Expressions, advanced string
    matching and new Split function SQL Server 2008 R2 | TheFirstSQL <Liked it!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s