Detecting URLs in a Block of Text
Jan Goyvaerts on <a href="http://www.regex-guru.info/2008/11/detecting-urls-in-a-block-of-text/" target="_blank">Detecting URLs in a Block of Text</a>
In his blog post <a href="http://www.codinghorror.com/blog/archives/001181.html">The Problem with URLs</a>, Jeff Atwood points out some of the issues with trying to detect URLs in a larger body of text using a regular expression.
The short answer is that it can’t be done. Pretty much any character is valid in URLs. The very simplistic <span class="regex">\bhttp://\S+</span> not only fails to differentiate between punctuation that’s part of the URL and punctuation used to quote the URL; it also fails to match URLs with spaces in them. Yes, spaces are valid in URLs, and I’ve encountered quite a few web sites that use them over the years. It also forgets other protocols, such as https.
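The failure modes described above are easy to reproduce. The following is a minimal sketch in Python using the standard `re` module. The helper name `find_simple_urls` and the `example.com` sample strings are made up for illustration and do not come from either blog post.

```python
import re

# The simplistic pattern quoted above: "http://" followed by any run of
# non-whitespace characters.
SIMPLE_URL = re.compile(r"\bhttp://\S+")


def find_simple_urls(text):
    """Return every substring of text matched by the simplistic pattern.

    Hypothetical helper, used only to demonstrate the pattern's limits.
    """
    return SIMPLE_URL.findall(text)


# Illustrative inputs, one per problem mentioned in the text.
samples = [
    "See http://example.com/page.",       # sentence-ending period
    "(http://example.com/a_(b))",         # URL wrapped in parentheses
    "http://example.com/my file.html",    # URL written with a literal space
    "https://example.com/secure",         # a different protocol
]

for sample in samples:
    print(repr(sample), "->", find_simple_urls(sample))
```

With these inputs, the first match keeps the trailing period, the second keeps the closing parenthesis that wraps the URL, the third stops at `http://example.com/my`, and the `https` address is not matched at all. Each fix for one of these cases (trimming trailing punctuation, allowing spaces, listing more schemes) tends to create new false positives or false negatives elsewhere, which is the point of the quoted passage.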