Wednesday, June 27, 2007

Stripping HTML from text in SQL Server 2005

Today I came across a problem where a field in a SQL server 2005 database contained encoded HTML (for example, it looked like "<div>Blah blah"). However, as this field was required in a SQL server report, this HTML needed to either be removed or formatted. Since SQL reports don't have a built-in function for this, and trying to create a function for it using their Excel-style expressions didn't sound like much fun, the only option left was to trim it in the SQL code.

Normally, I'd turn to a regular expression to handle this, however in SQL server, to use regular expressions you need to mess about with COM to get a VBScript interface etc, and I'm lazy, so I wanted a simpler way. A bit of research turned up an article at http://blog.sqlauthority.com/ which appeared to do exactly what I needed using just a normal user-defined function, and as luck would have it, it had only been written 11 days ago, so my timing was pretty good ;) Of course, it wasn't exactly what I needed, but it was pretty close, and after a few changes (mostly copy-paste code to handle each HTML entity that was likely to appear), I had a lovely new function which did exactly what I needed it to do. The performance of it is probably pretty dire, but that's acceptable in this case.

Here's the code, should anyone need it:


ALTER FUNCTION [dbo].[udf_StripHTML]
(
@HTMLText varchar(MAX)
)
RETURNS varchar(MAX)
AS
BEGIN
DECLARE @Start int
DECLARE @End int
DECLARE @Length int

-- Replace the HTML entity & with the '&' character (this needs to be done first, as
-- '&' might be double encoded as '&')
SET @Start = CHARINDEX('&', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '&')
SET @Start = CHARINDEX('&', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
END

-- Replace the HTML entity &lt; with the '<' character
SET @Start = CHARINDEX('&lt;', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '<')
SET @Start = CHARINDEX('&lt;', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
END

-- Replace the HTML entity &gt; with the '>' character
SET @Start = CHARINDEX('&gt;', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '>')
SET @Start = CHARINDEX('&gt;', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
END

-- Replace the HTML entity &amp; with the '&' character
SET @Start = CHARINDEX('&amp;amp;', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '&')
SET @Start = CHARINDEX('&amp;amp;', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
END

-- Replace the HTML entity &nbsp; with the ' ' character
SET @Start = CHARINDEX('&nbsp;', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, ' ')
SET @Start = CHARINDEX('&nbsp;', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1
END

-- Replace any <br> tags with a newline
SET @Start = CHARINDEX('<br>', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, CHAR(13) + CHAR(10))
SET @Start = CHARINDEX('<br>', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
END

-- Replace any <br/> tags with a newline
SET @Start = CHARINDEX('<br/>', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, 'CHAR(13) + CHAR(10)')
SET @Start = CHARINDEX('<br/>', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
END

-- Replace any <br /> tags with a newline
SET @Start = CHARINDEX('<br />', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, 'CHAR(13) + CHAR(10)')
SET @Start = CHARINDEX('<br />', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1
END

-- Remove anything between <whatever> tags
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))
SET @Length = (@End - @Start) + 1
END

RETURN LTRIM(RTRIM(@HTMLText))

END



Of course, most of the credit for this goes to Pinal Dave of SQLAuthority.com, I just hacked his code around a bit to suit my particular needs.

Note the formatting is going a bit awry, mostly due to HTML entities in the code. I'll fix this when I get a chance

19 comments:

Pinal Dave said...

Hello,

Good improvement to original code. I have posted link to your version on my blog as well. Users can use either version.
Link to this article

Regards,
Pinal Dave (SQLAuthority.com)

Ramdas said...

Excellent code. Thanks a lot for posting it, it really helped me out.

urban said...

Perfect!! THX

matt said...

Very helpful - Thanks.

Only change I had to make was to allow for quotes (&quot).

ronak said...

This code is really helpfull for my reports.

Excellent code.

Thanks a lot for posting it.

Thanks,
MJ

tizydor said...

hey,
thanks for script, I put it into my master project :)

Of course with source-link to you page :)

Best regards.

steveg said...

I understand why & is in a loop, but why not just do a replace for other entities?

select @html = replace(@html, '"', '"')
select @html = replace(@html, '<', '<')

or you could nest those as

select @html = replace(replace(replace(

etc.

Diego Jancic said...

It doesn't work to remove the comments when there're tags inside of it. This happens when the HTML is generated by MS Word.

Here's a new version of it:
http://blog.this.com.ar/2010/11/stripping-html-from-text-in-sql.html

jb said...
This comment has been removed by the author.
jb said...

Is there anyway I can get this function to ignore sup tags? Thanks

Alex said...

I tried using this function and got this error msg:

Msg 102, Level 15, State 1, Line 2
Incorrect syntax near 'font'.
Msg 132, Level 15, State 1, Line 2
The label 'size' has already been declared. Label names must be unique within a query batch or stored procedure.
Msg 132, Level 15, State 1, Line 2
The label 'family' has already been declared. Label names must be unique within a query batch or stored procedure.
Msg 132, Level 15, State 1, Line 2

Ashok Kumar.J said...

it is helpful to my project....but some modification

Gulshan said...

Great work !

Sameh Abdelhamid said...

Nice work.
I've modified this slightly to have tighter tolerances around HTML instead of just stripping the < > tags.
For cases when you have XML inside HTML.
Linked here:

http://mishmashmoo.com.au/blog/?p=193

Vikas Patel said...

Great!, The only issue is that 'CHAR(13) + CHAR(10)' should not have single quote

Blogger said...

I've just downloaded iStripper, and now I can watch the sexiest virtual strippers on my desktop.

Rafael Sales Nascimento said...

Very nice work.

Congratz.

Blogger said...

DreamHost is definitely one of the best web-hosting provider for any hosting services you might require.

joseph taxton said...

Insty web hosting is everything, which usually means you will not ever have to manage numerous services with numerous subscriptions again because it puts everything in one spot.Click here for more interesting information on VPS Malaysia | Customer Guide | FAQ.