Wednesday, June 27, 2007

Stripping HTML from text in SQL Server 2005

Today I came across a problem where a field in a SQL server 2005 database contained encoded HTML (for example, it looked like "<div>Blah blah"). However, as this field was required in a SQL server report, this HTML needed to either be removed or formatted. Since SQL reports don't have a built-in function for this, and trying to create a function for it using their Excel-style expressions didn't sound like much fun, the only option left was to trim it in the SQL code.

Normally, I'd turn to a regular expression to handle this, however in SQL server, to use regular expressions you need to mess about with COM to get a VBScript interface etc, and I'm lazy, so I wanted a simpler way. A bit of research turned up an article at http://blog.sqlauthority.com/ which appeared to do exactly what I needed using just a normal user-defined function, and as luck would have it, it had only been written 11 days ago, so my timing was pretty good ;) Of course, it wasn't exactly what I needed, but it was pretty close, and after a few changes (mostly copy-paste code to handle each HTML entity that was likely to appear), I had a lovely new function which did exactly what I needed it to do. The performance of it is probably pretty dire, but that's acceptable in this case.

Here's the code, should anyone need it:


ALTER FUNCTION [dbo].[udf_StripHTML]
(
@HTMLText varchar(MAX)
)
RETURNS varchar(MAX)
AS
BEGIN
DECLARE @Start int
DECLARE @End int
DECLARE @Length int

-- Replace the HTML entity & with the '&' character (this needs to be done first, as
-- '&' might be double encoded as '&')
SET @Start = CHARINDEX('&', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '&')
SET @Start = CHARINDEX('&', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
END

-- Replace the HTML entity &lt; with the '<' character
SET @Start = CHARINDEX('&lt;', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '<')
SET @Start = CHARINDEX('&lt;', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
END

-- Replace the HTML entity &gt; with the '>' character
SET @Start = CHARINDEX('&gt;', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '>')
SET @Start = CHARINDEX('&gt;', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
END

-- Replace the HTML entity &amp; with the '&' character
SET @Start = CHARINDEX('&amp;amp;', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '&')
SET @Start = CHARINDEX('&amp;amp;', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
END

-- Replace the HTML entity &nbsp; with the ' ' character
SET @Start = CHARINDEX('&nbsp;', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, ' ')
SET @Start = CHARINDEX('&nbsp;', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1
END

-- Replace any <br> tags with a newline
SET @Start = CHARINDEX('<br>', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, CHAR(13) + CHAR(10))
SET @Start = CHARINDEX('<br>', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
END

-- Replace any <br/> tags with a newline
SET @Start = CHARINDEX('<br/>', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, 'CHAR(13) + CHAR(10)')
SET @Start = CHARINDEX('<br/>', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
END

-- Replace any <br /> tags with a newline
SET @Start = CHARINDEX('<br />', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, 'CHAR(13) + CHAR(10)')
SET @Start = CHARINDEX('<br />', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1
END

-- Remove anything between <whatever> tags
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))
SET @Length = (@End - @Start) + 1
END

RETURN LTRIM(RTRIM(@HTMLText))

END



Of course, most of the credit for this goes to Pinal Dave of SQLAuthority.com, I just hacked his code around a bit to suit my particular needs.

Note the formatting is going a bit awry, mostly due to HTML entities in the code. I'll fix this when I get a chance

26 comments:

Pinal Dave said...

Hello,

Good improvement to original code. I have posted link to your version on my blog as well. Users can use either version.
Link to this article

Regards,
Pinal Dave (SQLAuthority.com)

BI_Buff said...

Excellent code. Thanks a lot for posting it, it really helped me out.

Unknown said...

Perfect!! THX

Unknown said...

Very helpful - Thanks.

Only change I had to make was to allow for quotes (&quot).

Anonymous said...

This code is really helpfull for my reports.

Excellent code.

Thanks a lot for posting it.

Thanks,
MJ

tizydor said...

hey,
thanks for script, I put it into my master project :)

Of course with source-link to you page :)

Best regards.

steveg said...

I understand why & is in a loop, but why not just do a replace for other entities?

select @html = replace(@html, '"', '"')
select @html = replace(@html, '<', '<')

or you could nest those as

select @html = replace(replace(replace(

etc.

Diego Jancic said...

It doesn't work to remove the comments when there're tags inside of it. This happens when the HTML is generated by MS Word.

Here's a new version of it:
http://blog.this.com.ar/2010/11/stripping-html-from-text-in-sql.html

Jeremy said...
This comment has been removed by the author.
Jeremy said...

Is there anyway I can get this function to ignore sup tags? Thanks

Alex said...

I tried using this function and got this error msg:

Msg 102, Level 15, State 1, Line 2
Incorrect syntax near 'font'.
Msg 132, Level 15, State 1, Line 2
The label 'size' has already been declared. Label names must be unique within a query batch or stored procedure.
Msg 132, Level 15, State 1, Line 2
The label 'family' has already been declared. Label names must be unique within a query batch or stored procedure.
Msg 132, Level 15, State 1, Line 2

Ashok SQL Techi said...

it is helpful to my project....but some modification

Gulshan said...

Great work !

Unknown said...

Nice work.
I've modified this slightly to have tighter tolerances around HTML instead of just stripping the < > tags.
For cases when you have XML inside HTML.
Linked here:

http://mishmashmoo.com.au/blog/?p=193

Unknown said...

Great!, The only issue is that 'CHAR(13) + CHAR(10)' should not have single quote

Unknown said...

Very nice work.

Congratz.

joseph taxton said...

Insty web hosting is everything, which usually means you will not ever have to manage numerous services with numerous subscriptions again because it puts everything in one spot.Click here for more interesting information on VPS Malaysia | Customer Guide | FAQ.

Aliza said...

To comprehend what news is unfolding, like every story, naturally you need to begin at the start. If you are more curious about Moderation System then you can learn more about it on listfilter.com.

Unknown said...

Thanks for this post. My one change is that you should move the decodes of < and > to the end in case something like this is in the the text: <Agreement Name>. If you decode the < and > first, they and the text between them are deleted in the last step.

Jessica world smith said...
This comment has been removed by the author.
Jessica world smith said...
This comment has been removed by the author.
Jessica world smith said...
This comment has been removed by the author.
Jessica world smith said...

This makes your choice of provider crucial to your entire gaming experience. Your provider should provide you with all the features you need and should have a good reputation for providing good service. To get more detailed info on server hosting, visit on hyperlinked site.

Unknown said...

Thank you! Saved me a lot of time. I believe you mistakenly had quotes around two instances of the char(13) + char(10). For removal of
and
.

Unknown said...

I meant to say for removal of "
" and "
"

charliesmith said...

Check out the new Sixredsquares website and see how you can start your project today. Sixredsquares, or 6RS, is a system that helps clients with branding, web design, and marketing.