Detecting URLs in a Block of Text

2010-03-31 4 min read bash

Jan Goyvaerts on <a href="http://www.regex-guru.info/2008/11/detecting-urls-in-a-block-of-text/" target="_blank">Detecting URLs in a Block of Text

In his blog post <a href="http://www.codinghorror.com/blog/archives/001181.html">The Problem with URLs points out some of the issues with trying to detect URLs in a larger body of text using a regular expression.

The short answer is that it can’t be done. Pretty much any character is valid in URLs. The very simplistic <span class="regex">\bhttp://\S+ not only fails to differentiate between punctuation that’s part of the URL, and punctuation used to quote the URL. It also fails to match URLs with spaces in them. Yes, spaces are valid in URLs, and I’ve encountered quite a few web sites that use them over the years. It also forgets other protocols, such as https.

In <a href="http://www.regexbuddy.com/library.html">RegexBuddy’s library, you’ll find this regex if you look up “URL: Find in full text”:

<tt class="regex">\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[A-Z0-9+&@#/%=~_|] (case insensitive)

Like every other regex for extracting URLs, it’s not perfect. The key benefit of this regex is that it uses a separate character class for the last character in the URL, which allows less punctuation characters than the character class for the other characters in the URL. It excludes punctuation that is unlikely to occur at the end of the URL, and more likely to be punctuation that’s part of the sentence the URL is quoted in. It does not allow parentheses at all.

In <a href="http://www.editpadpro.com/cscs.html">EditPad Pro’s syntax coloring schemes, which are fully editable and entirely based on regular expressions, you’ll often find this regex:

<tt class="regex">\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]
(case insensitive)

The main difference with the previous regex is that this one matches URLs such as www.regex-guru.info without the http:// protocol. People often type URLs that way in their documents and messages, because most browsers accept them that way too.

EditPad’s built-in “clickable URLs” syntax highlighting uses this regex:

<tt class="regex">\b(?:(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%?=~_|$!:,.;]*[-A-Z0-9+&@#/%=~_|$]
| ((?:mailto:)?[A-Z0-9._%+-]+@[A-Z0-9._%-]+\.[A-Z]{2,4})\b)
|”(?:(?:https?|ftp|file)://|www\.|ftp\.)[^"\r\n]+”?
|’(?:(?:https?|ftp|file)://|www\.|ftp\.)[^'\r\n]+’? (free-spacing, case insensitive)

This log regex adds three alternatives to the previous regex. It adds the ability to match email addresses, with or without mailto:, and it matches URLs between single or double quotes. When the URL is quoted, it allows all characters in the URL, except line breaks and the delimiting quote. This way, any URL with weird punctuation can be highlighted correctly by placing it between a pair of quote characters. Because this regex is used to highlight text as you type, the closing quotes are optional. The highlighting will run until the end of the line until you type the closing quote. Remove the question marks after the quote characters if you will use this regex to extract URLs.

So how about Jeff’s problem?

I couldn’t come up with a way for the regex alone to distinguish between URLs that legitimately end in parens (ala Wikipedia), and URLs that the user has enclosed in parens.

That’s not too hard, if we add the restriction that we only allow unnested pairs of parentheses in URLs. Using the second regex in this article as the starting point, add an alternative for a pair of parentheses to both character classes in that regex:

<tt class="regex">\b(?:(?:https?|ftp|file)://|www\.|ftp\.)
(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*
(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[A-Z0-9+&@#/%=~_|$]) (free-spacing, case insensitive)

This regex allows the same set of characters in the middle of the URL, mixed with zero or more sequences of those characters between parentheses. It allows the URL to end with the same reduced set of characters, or a final run between parentheses. Because we require the opening parenthesis to be in the URL, we don’t have to do anything complicated to check if any closing parentheses we encounter are part of the URL or not.

It’s important that you observe that in order to allow any number of pairs of parentheses in the middle of the regex, I moved the star from the character class to the group it is now in. I did not add another star to the group. A double-star combination like <tt class="regex">(a|b*)* is a sure-fire recipe for <a href="http://www.regular-expressions.info/catastrophic.html">catastrophic backtracking.

All the regexes in this article will be included in RegexBuddy’s library with the next free minor update. Current version is 3.2.0.

comments powered by Disqus