Normally, I'd turn to a regular expression to handle this, however in SQL server, to use regular expressions you need to mess about with COM to get a VBScript interface etc, and I'm lazy, so I wanted a simpler way. A bit of research turned up an article at http://blog.sqlauthority.com/ which appeared to do exactly what I needed using just a normal user-defined function, and as luck would have it, it had only been written 11 days ago, so my timing was pretty good ;) Of course, it wasn't exactly what I needed, but it was pretty close, and after a few changes (mostly copy-paste code to handle each HTML entity that was likely to appear), I had a lovely new function which did exactly what I needed it to do. The performance of it is probably pretty dire, but that's acceptable in this case.
Here's the code, should anyone need it:
ALTER FUNCTION [dbo].[udf_StripHTML]
(
@HTMLText varchar(MAX)
)
RETURNS varchar(MAX)
AS
BEGIN
DECLARE @Start int
DECLARE @End int
DECLARE @Length int
-- Replace the HTML entity & with the '&' character (this needs to be done first, as
-- '&' might be double encoded as '&')
SET @Start = CHARINDEX('&', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '&')
SET @Start = CHARINDEX('&', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
END
-- Replace the HTML entity < with the '<' character
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '<')
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
END
-- Replace the HTML entity > with the '>' character
SET @Start = CHARINDEX('>', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '>')
SET @Start = CHARINDEX('>', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
END
-- Replace the HTML entity & with the '&' character
SET @Start = CHARINDEX('&amp;', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '&')
SET @Start = CHARINDEX('&amp;', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
END
-- Replace the HTML entity with the ' ' character
SET @Start = CHARINDEX(' ', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, ' ')
SET @Start = CHARINDEX(' ', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1
END
-- Replace any <br> tags with a newline
SET @Start = CHARINDEX('<br>', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, CHAR(13) + CHAR(10))
SET @Start = CHARINDEX('<br>', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
END
-- Replace any <br/> tags with a newline
SET @Start = CHARINDEX('<br/>', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, 'CHAR(13) + CHAR(10)')
SET @Start = CHARINDEX('<br/>', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
END
-- Replace any <br /> tags with a newline
SET @Start = CHARINDEX('<br />', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, 'CHAR(13) + CHAR(10)')
SET @Start = CHARINDEX('<br />', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1
END
-- Remove anything between <whatever> tags
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))
SET @Length = (@End - @Start) + 1
END
RETURN LTRIM(RTRIM(@HTMLText))
END
Of course, most of the credit for this goes to Pinal Dave of SQLAuthority.com, I just hacked his code around a bit to suit my particular needs.
Note the formatting is going a bit awry, mostly due to HTML entities in the code. I'll fix this when I get a chance
26 comments:
Hello,
Good improvement to original code. I have posted link to your version on my blog as well. Users can use either version.
Link to this article
Regards,
Pinal Dave (SQLAuthority.com)
Excellent code. Thanks a lot for posting it, it really helped me out.
Perfect!! THX
Very helpful - Thanks.
Only change I had to make was to allow for quotes (").
This code is really helpfull for my reports.
Excellent code.
Thanks a lot for posting it.
Thanks,
MJ
hey,
thanks for script, I put it into my master project :)
Of course with source-link to you page :)
Best regards.
I understand why & is in a loop, but why not just do a replace for other entities?
select @html = replace(@html, '"', '"')
select @html = replace(@html, '<', '<')
or you could nest those as
select @html = replace(replace(replace(
etc.
It doesn't work to remove the comments when there're tags inside of it. This happens when the HTML is generated by MS Word.
Here's a new version of it:
http://blog.this.com.ar/2010/11/stripping-html-from-text-in-sql.html
Is there anyway I can get this function to ignore sup tags? Thanks
I tried using this function and got this error msg:
Msg 102, Level 15, State 1, Line 2
Incorrect syntax near 'font'.
Msg 132, Level 15, State 1, Line 2
The label 'size' has already been declared. Label names must be unique within a query batch or stored procedure.
Msg 132, Level 15, State 1, Line 2
The label 'family' has already been declared. Label names must be unique within a query batch or stored procedure.
Msg 132, Level 15, State 1, Line 2
it is helpful to my project....but some modification
Great work !
Nice work.
I've modified this slightly to have tighter tolerances around HTML instead of just stripping the < > tags.
For cases when you have XML inside HTML.
Linked here:
http://mishmashmoo.com.au/blog/?p=193
Great!, The only issue is that 'CHAR(13) + CHAR(10)' should not have single quote
Very nice work.
Congratz.
Insty web hosting is everything, which usually means you will not ever have to manage numerous services with numerous subscriptions again because it puts everything in one spot.Click here for more interesting information on VPS Malaysia | Customer Guide | FAQ.
To comprehend what news is unfolding, like every story, naturally you need to begin at the start. If you are more curious about Moderation System then you can learn more about it on listfilter.com.
Thanks for this post. My one change is that you should move the decodes of < and > to the end in case something like this is in the the text: <Agreement Name>. If you decode the < and > first, they and the text between them are deleted in the last step.
This makes your choice of provider crucial to your entire gaming experience. Your provider should provide you with all the features you need and should have a good reputation for providing good service. To get more detailed info on server hosting, visit on hyperlinked site.
Thank you! Saved me a lot of time. I believe you mistakenly had quotes around two instances of the char(13) + char(10). For removal of
and
.
I meant to say for removal of "
" and "
"
Check out the new Sixredsquares website and see how you can start your project today. Sixredsquares, or 6RS, is a system that helps clients with branding, web design, and marketing.
Post a Comment