Wednesday, June 27, 2007

Stripping HTML from text in SQL Server 2005

Today I came across a problem where a field in a SQL server 2005 database contained encoded HTML (for example, it looked like "<div>Blah blah"). However, as this field was required in a SQL server report, this HTML needed to either be removed or formatted. Since SQL reports don't have a built-in function for this, and trying to create a function for it using their Excel-style expressions didn't sound like much fun, the only option left was to trim it in the SQL code.

Normally, I'd turn to a regular expression to handle this, however in SQL server, to use regular expressions you need to mess about with COM to get a VBScript interface etc, and I'm lazy, so I wanted a simpler way. A bit of research turned up an article at http://blog.sqlauthority.com/ which appeared to do exactly what I needed using just a normal user-defined function, and as luck would have it, it had only been written 11 days ago, so my timing was pretty good ;) Of course, it wasn't exactly what I needed, but it was pretty close, and after a few changes (mostly copy-paste code to handle each HTML entity that was likely to appear), I had a lovely new function which did exactly what I needed it to do. The performance of it is probably pretty dire, but that's acceptable in this case.

Here's the code, should anyone need it:


ALTER FUNCTION [dbo].[udf_StripHTML]
(
@HTMLText varchar(MAX)
)
RETURNS varchar(MAX)
AS
BEGIN
DECLARE @Start int
DECLARE @End int
DECLARE @Length int

-- Replace the HTML entity & with the '&' character (this needs to be done first, as
-- '&' might be double encoded as '&')
SET @Start = CHARINDEX('&', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '&')
SET @Start = CHARINDEX('&', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
END

-- Replace the HTML entity &lt; with the '<' character
SET @Start = CHARINDEX('&lt;', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '<')
SET @Start = CHARINDEX('&lt;', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
END

-- Replace the HTML entity &gt; with the '>' character
SET @Start = CHARINDEX('&gt;', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '>')
SET @Start = CHARINDEX('&gt;', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
END

-- Replace the HTML entity &amp; with the '&' character
SET @Start = CHARINDEX('&amp;amp;', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '&')
SET @Start = CHARINDEX('&amp;amp;', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
END

-- Replace the HTML entity &nbsp; with the ' ' character
SET @Start = CHARINDEX('&nbsp;', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, ' ')
SET @Start = CHARINDEX('&nbsp;', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1
END

-- Replace any <br> tags with a newline
SET @Start = CHARINDEX('<br>', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, CHAR(13) + CHAR(10))
SET @Start = CHARINDEX('<br>', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
END

-- Replace any <br/> tags with a newline
SET @Start = CHARINDEX('<br/>', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, 'CHAR(13) + CHAR(10)')
SET @Start = CHARINDEX('<br/>', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
END

-- Replace any <br /> tags with a newline
SET @Start = CHARINDEX('<br />', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, 'CHAR(13) + CHAR(10)')
SET @Start = CHARINDEX('<br />', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1
END

-- Remove anything between <whatever> tags
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))
SET @Length = (@End - @Start) + 1
END

RETURN LTRIM(RTRIM(@HTMLText))

END



Of course, most of the credit for this goes to Pinal Dave of SQLAuthority.com, I just hacked his code around a bit to suit my particular needs.

Note the formatting is going a bit awry, mostly due to HTML entities in the code. I'll fix this when I get a chance

13 comments:

Pinal Dave said...

Hello,

Good improvement to original code. I have posted link to your version on my blog as well. Users can use either version.
Link to this article

Regards,
Pinal Dave (SQLAuthority.com)

Ramdas said...

Excellent code. Thanks a lot for posting it, it really helped me out.

urban said...

Perfect!! THX

matt said...

Very helpful - Thanks.

Only change I had to make was to allow for quotes (&quot).

ronak said...

This code is really helpfull for my reports.

Excellent code.

Thanks a lot for posting it.

Thanks,
MJ

tizydor said...

hey,
thanks for script, I put it into my master project :)

Of course with source-link to you page :)

Best regards.

steveg said...

I understand why & is in a loop, but why not just do a replace for other entities?

select @html = replace(@html, '"', '"')
select @html = replace(@html, '<', '<')

or you could nest those as

select @html = replace(replace(replace(

etc.

Diego Jancic said...

It doesn't work to remove the comments when there're tags inside of it. This happens when the HTML is generated by MS Word.

Here's a new version of it:
http://blog.this.com.ar/2010/11/stripping-html-from-text-in-sql.html

jb said...
This comment has been removed by the author.
jb said...

Is there anyway I can get this function to ignore sup tags? Thanks

Alex said...

I tried using this function and got this error msg:

Msg 102, Level 15, State 1, Line 2
Incorrect syntax near 'font'.
Msg 132, Level 15, State 1, Line 2
The label 'size' has already been declared. Label names must be unique within a query batch or stored procedure.
Msg 132, Level 15, State 1, Line 2
The label 'family' has already been declared. Label names must be unique within a query batch or stored procedure.
Msg 132, Level 15, State 1, Line 2

Ashok Kumar.J said...

it is helpful to my project....but some modification

Gulshan said...

Great work !