Regular Expression Matching Can Be Simple And Fast
Russ Cox
rsc@swtch.com
January 2007
Introduction
This is a tale of two approaches to regular expression matching. One of them is in widespread use in the standard interpreters for many languages, including Perl. The other is used only in a few places, notably most implementations of awk and grep. The two approaches have wildly different performance characteristics:
Let's use superscripts to denote string repetition, so that a?^3 a^3 is shorthand for a?a?a?aaa. The two graphs plot the time required by each approach to match the regular expression a?^n a^n against the string a^n.

Notice that Perl requires over sixty seconds to match a 29-character string. The other approach, labeled Thompson NFA for reasons that will be explained later, requires twenty microseconds to match the string. That's not a typo. The Perl graph plots time in seconds, while the Thompson NFA graph plots time in microseconds: the Thompson NFA implementation is a million times faster than Perl when running on a minuscule 29-character string. The trends shown in the graph continue: the Thompson NFA handles a 100-character string in under 200 microseconds, while Perl would require over 10^15 years. (Perl is only the most conspicuous example of a large number of popular programs that use the same algorithm; the above graph could have been Python, or PHP, or Ruby, or many other languages. A more detailed graph later in this article presents data for other implementations.)
It may be hard to believe the graphs: perhaps you've used Perl, and it never seemed like regular expression matching was particularly slow. Most of the time, in fact, regular expression matching in Perl is fast enough. As the graph shows, though, it is possible to write so-called “pathological” regular expressions that Perl matches very very slowly. In contrast, there are no regular expressions that are pathological for the Thompson NFA implementation. Seeing the two graphs side by side prompts the question, “why doesn't Perl use the Thompson NFA approach?” It can, it should, and that's what the rest of this article is about.
Historically, regular expressions are one of computer science's shining examples of how using good theory leads to good programs. They were originally developed by theorists as a simple computational model, but Ken Thompson introduced them to programmers in his implementation of the text editor QED for CTSS. Dennis Ritchie followed suit in his own implementation of QED, for GE-TSS. Thompson and Ritchie would go on to create Unix, and they brought regular expressions with them. By the late 1970s, regular expressions were a key feature of the Unix landscape, in tools such as ed, sed, grep, egrep, awk, and lex.
Today, regular expressions have also become a shining example of how ignoring good theory leads to bad programs. The regular expression implementations used by today's popular tools are significantly slower than the ones used in many of those thirty-year-old Unix tools.
This article reviews the good theory: regular expressions, finite automata, and a regular expression search algorithm invented by Ken Thompson in the mid-1960s. It also puts the theory into practice, describing a simple implementation of Thompson's algorithm. That implementation, less than 400 lines of C, is the one that went head to head with Perl above. It outperforms the more complex real-world implementations used by Perl, Python, PCRE, and others. The article concludes with a discussion of how theory might yet be converted into practice in the real-world implementations.
Regular Expressions
Regular expressions are a notation for describing sets of character strings. When a particular string is in the set described by a regular expression, we often say that the regular expression matches the string.
The simplest regular expression is a single literal character. Except for the special metacharacters *+?()|, characters match themselves. To match a metacharacter, escape it with a backslash: \+ matches a literal plus character.

Two regular expressions can be alternated or concatenated to form a new regular expression: if e1 matches s and e2 matches t, then e1|e2 matches s or t, and e1e2 matches st.

The metacharacters *, +, and ? are repetition operators: e1* matches a sequence of zero or more (possibly different) strings, each of which match e1; e1+ matches one or more; e1? matches zero or one.

The operator precedence, from weakest to strongest binding, is first alternation, then concatenation, and finally the repetition operators. Explicit parentheses can be used to force different meanings, just as in arithmetic expressions. Some examples: ab|cd is equivalent to (ab)|(cd); ab* is equivalent to a(b*).
The syntax described so far is a subset of the traditional Unix egrep regular expression syntax. This subset suffices to describe all regular languages: loosely speaking, a regular language is a set of strings that can be matched in a single pass through the text using only a fixed amount of memory. Newer regular expression facilities (notably Perl and those that have copied it) have added many new operators and escape sequences. These additions make the regular expressions more concise, and sometimes more cryptic, but usually not more powerful: these fancy new regular expressions almost always have longer equivalents using the traditional syntax.
One common regular expression extension that does provide additional power is called backreferences. A backreference like \1 or \2 matches the string matched by a previous parenthesized expression, and only that string: (cat|dog)\1 matches catcat and dogdog but not catdog nor dogcat. As far as the theoretical term is concerned, regular expressions with backreferences are not regular expressions. The power that backreferences add comes at great cost: in the worst case, the best known implementations require exponential search algorithms, like the one Perl uses. Perl (and the other languages) could not now remove backreference support, of course, but they could employ much faster algorithms when presented with regular expressions that don't have backreferences, like the ones considered above. This article is about those faster algorithms.
Finite Automata
Another way to describe sets of character strings is with finite automata. Finite automata are also known as state machines, and we will use “automaton” and “machine” interchangeably.
As a simple example, here is a machine recognizing the set of strings matched by the regular expression a(bb)+a:
A finite automaton is always in one of its states, represented in the diagram by circles. (The numbers inside the circles are labels to make this discussion easier; they are not part of the machine's operation.) As it reads the string, it switches from state to state. This machine has two special states: the start state s0 and the matching state s4. Start states are depicted with lone arrowheads pointing at them, and matching states are drawn as a double circle.
The machine reads an input string one character at a time, following arrows corresponding to the input to move from state to state. Suppose the input string is abbbba. When the machine reads the first letter of the string, the a, it is in the start state s0. It follows the a arrow to state s1. This process repeats as the machine reads the rest of the string: b to s2, b to s3, b to s2, b to s3, and finally a to s4.
The machine ends in s4, a matching state, so it matches the string. If the machine ends in a non-matching state, it does not match the string. If, at any point during the machine's execution, there is no arrow for it to follow corresponding to the current input character, the machine stops executing early.
The machine we have been considering is called a deterministic finite automaton (DFA), because in any state, each possible input letter leads to at most one new state. We can also create machines that must choose between multiple possible next states. For example, this machine is equivalent to the previous one but is not deterministic:
The machine is not deterministic because if it reads a b in state s2, it has multiple choices for the next state: it can go back to s1 in hopes of seeing another bb, or it can go on to s3 in hopes of seeing the final a. Since the machine cannot peek ahead to see the rest of the string, it has no way to know which is the correct decision. In this situation, it turns out to be interesting to let the machine always guess correctly. Such machines are called non-deterministic finite automata (NFAs or NDFAs). An NFA matches an input string if there is some way it can read the string and follow arrows to a matching state.
Sometimes it is convenient to let NFAs have arrows with no corresponding input character. We will leave these arrows unlabeled. An NFA can, at any time, choose to follow an unlabeled arrow without reading any input. This NFA is equivalent to the previous two, but the unlabeled arrow makes the correspondence with a(bb)+a clearest:
Converting Regular Expressions to NFAs
Regular expressions and NFAs turn out to be exactly equivalent in power: every regular expression has an equivalent NFA (they match the same strings) and vice versa. (It turns out that DFAs are also equivalent in power to NFAs and regular expressions; we will see this later.) There are multiple ways to translate regular expressions into NFAs. The method described here was first described by Thompson in his 1968 CACM paper.
The NFA for a regular expression is built up from partial NFAs for each subexpression, with a different construction for each operator. The partial NFAs have no matching states: instead they have one or more dangling arrows, pointing to nothing. The construction process will finish by connecting these arrows to a matching state.
The NFAs for matching single characters look like:
The NFA for the concatenation e1e2 connects the final arrow of the e1 machine to the start of the e2 machine:
The NFA for the alternation e1|e2 adds a new start state with a choice of either the e1 machine or the e2 machine:
The NFA for e? alternates the e machine with an empty path:
The NFA for e* uses the same alternation but loops a matching e machine back to the start:
The NFA for e+ also creates a loop, but one that requires passing through e at least once:
Counting the new states in the diagrams above, we can see that this technique creates exactly one state per character or metacharacter in the regular expression, excluding parentheses. Therefore the number of states in the final NFA is at most equal to the length of the original regular expression.
Just as with the example NFA discussed earlier, it is always possible to remove the unlabeled arrows, and it is also always possible to generate the NFA without the unlabeled arrows in the first place. Having the unlabeled arrows makes the NFA easier for us to read and understand, and they also make the C representation simpler, so we will keep them.
Regular Expression Search Algorithms
Now we have a way to test whether a regular expression matches a string: convert the regular expression to an NFA and then run the NFA using the string as input. Remember that NFAs are endowed with the ability to guess perfectly when faced with a choice of next state: to run the NFA using an ordinary computer, we must find a way to simulate this guessing.
One way to simulate perfect guessing is to guess one option, and if that doesn't work, try the other. For example, consider the NFA for abab|abbb run on the string abbb:
At step 0, the NFA must make a choice: try to match abab or try to match abbb? In the diagram, the NFA tries abab, but that fails after step 3. The NFA then tries the other choice, leading to step 4 and eventually a match. This backtracking approach has a simple recursive implementation but can read the input string many times before succeeding. If the string does not match, the machine must try all possible execution paths before giving up. The NFA tried only two different paths in the example, but in the worst case, there can be exponentially many possible execution paths, leading to very slow run times.
A more efficient but more complicated way to simulate perfect guessing is to guess both options simultaneously. In this approach, the simulation allows the machine to be in multiple states at once. To process each letter, it advances all the states along all the arrows that match the letter.
The machine starts in the start state and all the states reachable from the start state by unlabeled arrows. In steps 1 and 2, the NFA is in two states simultaneously. Only at step 3 does the state set narrow down to a single state. This multi-state approach tries both paths at the same time, reading the input only once. In the worst case, the NFA might be in every state at each step, but this results in at worst a constant amount of work independent of the length of the string, so arbitrarily large input strings can be processed in linear time. This is a dramatic improvement over the exponential time required by the backtracking approach. The efficiency comes from tracking the set of reachable states but not which paths were used to reach them. In an NFA with n nodes, there can only be n reachable states at any step, but there might be 2^n paths through the NFA.
Implementation
Thompson introduced the multiple-state simulation approach in his 1968 paper. In his formulation, the states of the NFA were represented by small machine-code sequences, and the list of possible states was just a sequence of function call instructions. In essence, Thompson compiled the regular expression into clever machine code. Forty years later, computers are much faster and the machine code approach is not as necessary. The following sections present an implementation written in portable ANSI C. The full source code (under 400 lines) and the benchmarking scripts are available online. (Readers who are unfamiliar or uncomfortable with C or pointers should feel free to read the descriptions and skip over the actual code.)
Implementation: Compiling to NFA
The first step is to compile the regular expression into an equivalent NFA. In our C program, we will represent an NFA as a linked collection of State structures:
Each State represents one of the following three NFA fragments, depending on the value of c.

(Lastlist is used during execution and is explained in the next section.)
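A sketch of this representation, following the description above (Match and Split are assumed out-of-band values of c, and state is an assumed constructor helper):

```c
#include <stdlib.h>

/* Out-of-band values for State.c: ordinary states hold a literal
 * character; Match marks the matching state; Split is a state with
 * two unlabeled arrows. */
enum { Match = 256, Split = 257 };

typedef struct State State;
struct State {
    int c;          /* character to match, or Match/Split */
    State *out;     /* next state */
    State *out1;    /* second next state, used only by Split */
    int lastlist;   /* bookkeeping for the simulation (next section) */
};

/* Assumed constructor helper for building states. */
State* state(int c, State *out, State *out1) {
    State *s = malloc(sizeof *s);
    s->c = c;
    s->out = out;
    s->out1 = out1;
    s->lastlist = 0;
    return s;
}
```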
Following Thompson's paper, the compiler builds an NFA from a regular expression in postfix notation with dot (.) added as an explicit concatenation operator. A separate function re2post rewrites infix regular expressions like “a(bb)+a” into equivalent postfix expressions like “abb.+.a.”. (A “real” implementation would certainly need to use dot as the “any character” metacharacter rather than as a concatenation operator. A real implementation would also probably build the NFA during parsing rather than build an explicit postfix expression. However, the postfix version is convenient and follows Thompson's paper more closely.)
As the compiler scans the postfix expression, it maintains a stack of computed NFA fragments. Literals push new NFA fragments onto the stack, while operators pop fragments off the stack and then push a new fragment. For example, after compiling the abb in abb.+.a., the stack contains NFA fragments for a, b, and b. The compilation of the . that follows pops the two b NFA fragments from the stack and pushes an NFA fragment for the concatenation bb. Each NFA fragment is defined by its start state and its outgoing arrows:
Start points at the start state for the fragment, and out is a list of pointers to State* pointers that are not yet connected to anything. These are the dangling arrows in the NFA fragment.
Some helper functions manipulate pointer lists:
List1 creates a new pointer list containing the single pointer outp. Append concatenates two pointer lists, returning the result. Patch connects the dangling arrows in the pointer list l to the state s: it sets *outp = s for each pointer outp in l.
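A sketch of these helpers, following the description; the Ptrlist union threads the list through the dangling State* slots themselves, so the lists need no extra allocation (type definitions are repeated so the fragment stands alone):

```c
#include <stdlib.h>

typedef struct State State;
struct State { int c; State *out; State *out1; int lastlist; };

/* A pointer list threaded through the dangling State* slots:
 * each unset slot doubles as the next-pointer. */
typedef union Ptrlist Ptrlist;
union Ptrlist {
    Ptrlist *next;
    State *s;
};

/* A partial NFA: a start state plus its dangling arrows. */
typedef struct { State *start; Ptrlist *out; } Frag;

/* Create a singleton list containing just outp. */
Ptrlist* list1(State **outp) {
    Ptrlist *l = (Ptrlist*)outp;
    l->next = NULL;
    return l;
}

/* Patch the dangling arrows in l to point at state s. */
void patch(Ptrlist *l, State *s) {
    Ptrlist *next;
    for (; l; l = next) {
        next = l->next;
        l->s = s;
    }
}

/* Join the two lists l1 and l2, returning the combined list. */
Ptrlist* append(Ptrlist *l1, Ptrlist *l2) {
    Ptrlist *old = l1;
    while (l1->next)
        l1 = l1->next;
    l1->next = l2;
    return old;
}
```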
Given these primitives and a fragment stack, the compiler is a simple loop over the postfix expression. At the end, there is a single fragment left: patching in a matching state completes the NFA.
The specific compilation cases mimic the translation steps described earlier.
The cases are literal characters, catenation, alternation, zero or one, zero or more, and one or more.
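A sketch of the compiler loop over the postfix expression, covering each of those cases; the fixed stack size of 1000 is an arbitrary assumption, and the type and helper definitions are repeated so the fragment stands alone:

```c
#include <stdlib.h>

enum { Match = 256, Split = 257 };

typedef struct State State;
struct State { int c; State *out; State *out1; int lastlist; };

typedef union Ptrlist Ptrlist;
union Ptrlist { Ptrlist *next; State *s; };

typedef struct { State *start; Ptrlist *out; } Frag;

static State* state(int c, State *out, State *out1) {
    State *s = malloc(sizeof *s);
    s->c = c; s->out = out; s->out1 = out1; s->lastlist = 0;
    return s;
}
static Frag frag(State *start, Ptrlist *out) {
    Frag f = { start, out };
    return f;
}
static Ptrlist* list1(State **outp) {
    Ptrlist *l = (Ptrlist*)outp;
    l->next = NULL;
    return l;
}
static void patch(Ptrlist *l, State *s) {
    Ptrlist *next;
    for (; l; l = next) { next = l->next; l->s = s; }
}
static Ptrlist* append(Ptrlist *l1, Ptrlist *l2) {
    Ptrlist *old = l1;
    while (l1->next) l1 = l1->next;
    l1->next = l2;
    return old;
}

/* Convert a postfix regular expression to an NFA; return the start state. */
State* post2nfa(char *postfix) {
    Frag stack[1000], *stackp = stack, e1, e2, e;
    State *s;
    char *p;

    for (p = postfix; *p; p++) {
        switch (*p) {
        default:   /* literal character */
            s = state(*p, NULL, NULL);
            *stackp++ = frag(s, list1(&s->out));
            break;
        case '.':  /* catenate */
            e2 = *--stackp; e1 = *--stackp;
            patch(e1.out, e2.start);
            *stackp++ = frag(e1.start, e2.out);
            break;
        case '|':  /* alternate */
            e2 = *--stackp; e1 = *--stackp;
            s = state(Split, e1.start, e2.start);
            *stackp++ = frag(s, append(e1.out, e2.out));
            break;
        case '?':  /* zero or one */
            e = *--stackp;
            s = state(Split, e.start, NULL);
            *stackp++ = frag(s, append(e.out, list1(&s->out1)));
            break;
        case '*':  /* zero or more */
            e = *--stackp;
            s = state(Split, e.start, NULL);
            patch(e.out, s);
            *stackp++ = frag(s, list1(&s->out1));
            break;
        case '+':  /* one or more */
            e = *--stackp;
            s = state(Split, e.start, NULL);
            patch(e.out, s);
            *stackp++ = frag(e.start, list1(&s->out1));
            break;
        }
    }

    e = *--stackp;
    patch(e.out, state(Match, NULL, NULL));  /* patch in the matching state */
    return e.start;
}
```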
Implementation: Simulating the NFA
Now that the NFA has been built, we need to simulate it. The simulation requires tracking State sets, which are stored as a simple array list:
The simulation uses two lists: clist is the current set of states that the NFA is in, and nlist is the next set of states that the NFA will be in, after processing the current character. The execution loop initializes clist to contain just the start state and then runs the machine one step at a time.

To avoid allocating on every iteration of the loop, match uses two preallocated lists l1 and l2 as clist and nlist, swapping the two after each step.

If the final state list contains the matching state, then the string matches.
Addstate adds a state to the list, but not if it is already on the list. Scanning the entire list for each add would be inefficient; instead the variable listid acts as a list generation number. When addstate adds s to a list, it records listid in s->lastlist. If the two are already equal, then s is already on the list being built. Addstate also follows unlabeled arrows: if s is a Split state with two unlabeled arrows to new states, addstate adds those states to the list instead of s.
Startlist creates an initial state list by adding just the start state:
Finally, step advances the NFA past a single character, using the current list clist to compute the next list nlist.
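A sketch of the simulation just described, with addstate, startlist, step, and the surrounding match loop (type definitions repeated so the fragment stands alone; the caller supplies two preallocated Lists with room for all NFA states, as described above):

```c
#include <stdlib.h>

enum { Match = 256, Split = 257 };

typedef struct State State;
struct State { int c; State *out; State *out1; int lastlist; };

/* A set of NFA states, stored as a simple array list. */
typedef struct { State **s; int n; } List;

static int listid;  /* list generation number */

/* Add s to l, following unlabeled (Split) arrows and skipping
 * states already on the list being built. */
void addstate(List *l, State *s) {
    if (s == NULL || s->lastlist == listid)
        return;
    s->lastlist = listid;
    if (s->c == Split) {
        /* follow unlabeled arrows instead of adding the Split state */
        addstate(l, s->out);
        addstate(l, s->out1);
        return;
    }
    l->s[l->n++] = s;
}

/* Create the initial state list containing just the start state. */
List* startlist(State *start, List *l) {
    l->n = 0;
    listid++;
    addstate(l, start);
    return l;
}

/* Advance the NFA in clist past the character c, computing nlist. */
void step(List *clist, int c, List *nlist) {
    int i;

    listid++;
    nlist->n = 0;
    for (i = 0; i < clist->n; i++)
        if (clist->s[i]->c == c)
            addstate(nlist, clist->s[i]->out);
}

/* Is the matching state in l? */
int ismatch(List *l) {
    int i;
    for (i = 0; i < l->n; i++)
        if (l->s[i]->c == Match)
            return 1;
    return 0;
}

/* Run the NFA starting at start over the string s,
 * using the preallocated lists l1 and l2. */
int match(State *start, char *s, List *l1, List *l2) {
    List *clist = startlist(start, l1), *nlist = l2, *t;
    for (; *s; s++) {
        step(clist, *s & 0xFF, nlist);
        t = clist; clist = nlist; nlist = t;  /* swap clist and nlist */
    }
    return ismatch(clist);
}
```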
Performance
The C implementation just described was not written with performance in mind. Even so, a slow implementation of a linear-time algorithm can easily outperform a fast implementation of an exponential-time algorithm once the exponent is large enough. Testing a variety of popular regular expression engines on a so-called pathological regular expression demonstrates this nicely.
Consider the regular expression a?^n a^n. It matches the string a^n when the a? are chosen not to match any letters, leaving the entire string to be matched by the a^n. Backtracking regular expression implementations implement the zero-or-one ? by first trying one and then zero. There are n such choices to make, a total of 2^n possibilities. Only the very last possibility—choosing zero for all the ?—will lead to a match. The backtracking approach thus requires O(2^n) time, so it will not scale much beyond n=25.
In contrast, Thompson's algorithm maintains state lists of length approximately n and processes the string, also of length n, for a total of O(n^2) time. (The run time is superlinear, because we are not keeping the regular expression constant as the input grows. For a regular expression of length m run on text of length n, the Thompson NFA requires O(mn) time.)
The following graph plots the time required to check whether a?^n a^n matches a^n:

[Graph: time to match a?^n a^n against a^n, versus regular expression and text size n.]
Notice that the graph's y-axis has a logarithmic scale, in order to be able to see a wide variety of times on a single graph.
From the graph it is clear that Perl, PCRE, Python, and Ruby are all using recursive backtracking. PCRE stops getting the right answer at n=23, because it aborts the recursive backtracking after a maximum number of steps. As of Perl 5.6, Perl's regular expression engine is said to memoize the recursive backtracking search, which should, at some memory cost, keep the search from taking exponential amounts of time unless backreferences are being used. As the performance graph shows, the memoization is not complete: Perl's run time grows exponentially even though there are no backreferences in the expression. Although not benchmarked here, Java uses a backtracking implementation too. In fact, the java.util.regex interface requires a backtracking implementation, because arbitrary Java code can be substituted into the matching path. PHP uses the PCRE library.
The thick blue line is the C implementation of Thompson's algorithm given above. Awk, Tcl, GNU grep, and GNU awk build DFAs, either precomputing them or using the on-the-fly construction described in the next section.
Some might argue that this test is unfair to the backtracking implementations, since it focuses on an uncommon corner case. This argument misses the point: given a choice between an implementation with a predictable, consistent, fast running time on all inputs or one that usually runs quickly but can take years of CPU time (or more) on some inputs, the decision should be easy. Also, while examples as dramatic as this one rarely occur in practice, less dramatic ones do occur. Examples include using (.*)(.*)(.*)(.*)(.*) to split five space-separated fields, or using alternations where the common cases are not listed first. As a result, programmers often learn which constructs are expensive and avoid them, or they turn to so-called optimizers. Using Thompson's NFA simulation does not require such adaptation: there are no expensive regular expressions.
Caching the NFA to build a DFA
Recall that DFAs are more efficient to execute than NFAs, because DFAs are only ever in one state at a time: they never have a choice of multiple next states. Any NFA can be converted into an equivalent DFA in which each DFA state corresponds to a list of NFA states.
For example, here is the NFA we used earlier for abab|abbb, with state numbers added:
The equivalent DFA would be:
Each state in the DFA corresponds to a list of states from the NFA.
In a sense, Thompson's NFA simulation is executing the equivalent DFA: each List corresponds to some DFA state, and the step function is computing, given a list and a next character, the next DFA state to enter. Thompson's algorithm simulates the DFA by reconstructing each DFA state as it is needed. Rather than throw away this work after each step, we could cache the Lists in spare memory, avoiding the cost of repeating the computation in the future and essentially computing the equivalent DFA as it is needed. This section presents the implementation of such an approach. Starting with the NFA implementation from the previous section, we need to add less than 100 lines to build a DFA implementation.
To implement the cache, we first introduce a new data type that represents a DFA state:
A DState is the cached copy of the list l. The array next contains pointers to the next state for each possible input character: if the current state is d and the next input character is c, then d->next[c] is the next state. If d->next[c] is null, then the next state has not been computed yet. Nextstate computes, records, and returns the next state for a given state and character.
The regular expression match follows d->next[c] repeatedly, calling nextstate to compute new states as needed.
All the DStates that have been computed need to be saved in a structure that lets us look up a DState by its List. To do this, we arrange them in a binary tree using the sorted List as the key. The dstate function returns the DState for a given List, allocating one if necessary:
Nextstate runs the NFA step and returns the corresponding DState:
Finally, the DFA's start state is the DState corresponding to the NFA's start list:

(As in the NFA simulation, l1 is a preallocated List.)
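A sketch of the cached-DFA construction just described, layered on the NFA simulation; the tree lookup sorts each list with qsort so that equal state sets compare equal, the buffer sizes are arbitrary assumptions, and the NFA-side definitions are repeated so the fragment stands alone:

```c
#include <stdlib.h>
#include <string.h>

enum { Match = 256, Split = 257 };

typedef struct State State;
struct State { int c; State *out; State *out1; int lastlist; };

typedef struct { State **s; int n; } List;

typedef struct DState DState;
struct DState {
    List l;               /* set of NFA states this DFA state stands for */
    DState *next[256];    /* cached transitions; NULL = not computed yet */
    DState *left, *right; /* binary-tree links, keyed by the sorted l */
};

static int listid;

static void addstate(List *l, State *s) {
    if (s == NULL || s->lastlist == listid) return;
    s->lastlist = listid;
    if (s->c == Split) { addstate(l, s->out); addstate(l, s->out1); return; }
    l->s[l->n++] = s;
}
static void step(List *clist, int c, List *nlist) {
    int i;
    listid++;
    nlist->n = 0;
    for (i = 0; i < clist->n; i++)
        if (clist->s[i]->c == c)
            addstate(nlist, clist->s[i]->out);
}
static int ismatch(List *l) {
    int i;
    for (i = 0; i < l->n; i++)
        if (l->s[i]->c == Match) return 1;
    return 0;
}

/* Compare two sorted state lists. */
static int listcmp(List *a, List *b) {
    if (a->n != b->n) return a->n - b->n;
    return memcmp(a->s, b->s, a->n * sizeof a->s[0]);
}
static int ptrcmp(const void *a, const void *b) {
    State *x = *(State *const *)a, *y = *(State *const *)b;
    return x < y ? -1 : x > y;
}

static DState *alldstates;  /* root of the binary tree */

/* Return the DState for list l, allocating one if necessary. */
DState* dstate(List *l) {
    DState **dp = &alldstates, *d;
    int c;
    qsort(l->s, l->n, sizeof l->s[0], ptrcmp);
    while ((d = *dp) != NULL) {
        c = listcmp(l, &d->l);
        if (c == 0) return d;
        dp = c < 0 ? &d->left : &d->right;
    }
    d = calloc(1, sizeof *d);
    d->l.s = malloc(l->n * sizeof l->s[0]);
    memcpy(d->l.s, l->s, l->n * sizeof l->s[0]);
    d->l.n = l->n;
    *dp = d;
    return d;
}

static State *l1buf[1000];
static List l1 = { l1buf, 0 };  /* preallocated scratch list */

/* Compute, record, and return the next DFA state after reading c in d. */
DState* nextstate(DState *d, int c) {
    step(&d->l, c, &l1);
    return d->next[c] = dstate(&l1);
}

/* The DFA start state corresponds to the NFA's start list. */
DState* startdstate(State *start) {
    l1.n = 0;
    listid++;
    addstate(&l1, start);
    return dstate(&l1);
}

/* Match s against the DFA built on the fly from the NFA at start. */
int matchdfa(State *start, char *s) {
    DState *d = startdstate(start), *next;
    for (; *s; s++) {
        if ((next = d->next[*s & 0xFF]) == NULL)
            next = nextstate(d, *s & 0xFF);
        d = next;
    }
    return ismatch(&d->l);
}
```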
The DStates correspond to DFA states, but the DFA is only built as needed: if a DFA state has not been encountered during the search, it does not yet exist in the cache. An alternative would be to compute the entire DFA at once. Doing so would make match a little faster by removing the conditional branch, but at the cost of increased startup time and memory use.
One might also worry about bounding the amount of memory used by the on-the-fly DFA construction. Since the DStates are only a cache of the step function, the implementation of dstate could choose to throw away the entire DFA so far if the cache grew too large. This cache replacement policy only requires a few extra lines of code in dstate and in nextstate, plus around 50 lines of code for memory management. An implementation is available online. (Awk uses a similar limited-size cache strategy, with a fixed limit of 32 cached states; this explains the discontinuity in its performance at n=28 in the graph above.)
NFAs derived from regular expressions tend to exhibit good locality: they visit the same states and follow the same transition arrows over and over when run on most texts. This makes the caching worthwhile: the first time an arrow is followed, the next state must be computed as in the NFA simulation, but future traversals of the arrow are just a single memory access. Real DFA-based implementations can make use of additional optimizations to run even faster. A companion article (not yet written) will explore DFA-based regular expression implementations in more detail.
Real world regular expressions
Regular expression usage in real programs is somewhat more complicated than what the regular expression implementations described above can handle. This section briefly describes the common complications; full treatment of any of these is beyond the scope of this introductory article.
Character classes. A character class, whether [0-9] or \w or . (dot), is just a concise representation of an alternation. Character classes can be expanded into alternations during compilation, though it is more efficient to add a new kind of NFA node to represent them explicitly. POSIX defines special character classes like [[:upper:]] that change meaning depending on the current locale, but the hard part of accommodating these is determining their meaning, not encoding that meaning into an NFA.
Escape sequences. Real regular expression syntaxes need to handle escape sequences, both as a way to match metacharacters (\(, \), \\, etc.) and to specify otherwise difficult-to-type characters such as \n.
Counted repetition. Many regular expression implementations provide a counted repetition operator {n} to match exactly n strings matching a pattern; {n,m} to match at least n but no more than m; and {n,} to match n or more. A recursive backtracking implementation can implement counted repetition using a loop; an NFA or DFA-based implementation must expand the repetition: e{3} expands to eee; e{3,5} expands to eeee?e?; and e{3,} expands to eee+.
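Those expansion rules can be sketched as a small helper; expand here is purely illustrative (it works on the pattern text of a single unit e, whereas a real implementation would duplicate parsed NFA fragments instead):

```c
#include <string.h>

/* Illustrative sketch of counted-repetition expansion:
 * e{min,max} becomes min copies of e followed by (max - min)
 * optional copies; e{min,} (max < 0) becomes min copies with
 * a + on the last one. Assumes out is large enough. */
void expand(const char *e, int min, int max, char *out) {
    int i;
    out[0] = '\0';
    for (i = 0; i < min; i++)
        strcat(out, e);               /* mandatory copies: e{3} -> eee */
    if (max < 0) {
        strcat(out, "+");             /* e{3,} -> eee+ */
    } else {
        for (i = min; i < max; i++) { /* e{3,5} -> eeee?e? */
            strcat(out, e);
            strcat(out, "?");
        }
    }
}
```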
Submatch extraction. When regular expressions are used for splitting or parsing strings, it is useful to be able to find out which sections of the input string were matched by each subexpression. After a regular expression like ([0-9]+-[0-9]+-[0-9]+) ([0-9]+:[0-9]+) matches a string (say a date and time), many regular expression engines make the text matched by each parenthesized expression available; in Perl, for example, the two submatches would afterward be available as $1 and $2.
The extraction of submatch boundaries has been mostly ignored by computer science theorists, and it is perhaps the most compelling argument for using recursive backtracking. However, Thompson-style algorithms can be adapted to track submatch boundaries without giving up efficient performance. The Eighth Edition Unix regexp(3) library implemented such an algorithm as early as 1985, though as explained below, it was not very widely used or even noticed.
Unanchored matches. This article has assumed that regular expressions are matched against an entire input string. In practice, one often wishes to find a substring of the input that matches the regular expression. Unix tools traditionally return the longest matching substring that starts at the leftmost possible point in the input. An unanchored search for e is a special case of submatch extraction: it is like searching for .*(e).* where the first .* is constrained to match as short a string as possible.
Non-greedy operators. In traditional Unix regular expressions, the repetition operators ?, *, and + are defined to match as much of the string as possible while still allowing the entire regular expression to match: when matching (.+)(.+) against abcd, the first (.+) will match abc, and the second will match d. These operators are now called greedy. Perl introduced ??, *?, and +? as non-greedy versions, which match as little of the string as possible while preserving the overall match: when matching (.+?)(.+?) against abcd, the first (.+?) will match only a, and the second will match bcd.
By definition, whether an operator is greedy cannot affect whether a regular expression matches a particular string as a whole; it only affects the choice of submatch boundaries. The backtracking algorithm admits a simple implementation of non-greedy operators: try the shorter match before the longer one. For example, in a standard backtracking implementation, e? first tries using e and then tries not using it; e?? uses the other order. The submatch-tracking variants of Thompson's algorithm can be adapted to accommodate non-greedy operators.
Assertions. The traditional regular expression metacharacters ^ and $ can be viewed as assertions about the text around them: ^ asserts that the previous character is a newline (or the beginning of the string), while $ asserts that the next character is a newline (or the end of the string). Perl added more assertions, like the word boundary \b, which asserts that the previous character is alphanumeric but the next is not, or vice versa. Perl also generalized the idea to arbitrary conditions called lookahead assertions: (?=re) asserts that the text after the current input position matches re, but does not actually advance the input position; (?!re) is similar but asserts that the text does not match re. The lookbehind assertions (?<=re) and (?<!re) are similar but make assertions about the text before the current input position. Simple assertions like ^, $, and \b are easy to accommodate in an NFA, delaying the match one byte for forward assertions. The generalized assertions are harder to accommodate but in principle could be encoded in the NFA.
Backreferences. As mentioned earlier, no one knows how to implement regular expressions with backreferences efficiently, though no one can prove that it's impossible either. (Specifically, the problem is NP-complete, meaning that if someone did find an efficient implementation, that would be major news to computer scientists and would win a million dollar prize.) The simplest, most effective strategy for backreferences, taken by the original awk and egrep, is not to implement them. This strategy is no longer practical: users have come to rely on backreferences for at least occasional use, and backreferences are part of the POSIX standard for regular expressions. Even so, it would be reasonable to use Thompson's NFA simulation for most regular expressions, and only bring out backtracking when it is needed. A particularly clever implementation could combine the two, resorting to backtracking only to accommodate the backreferences.
Backtracking with memoization. Perl's approach of using memoization to avoid exponential blowup during backtracking when possible is a good one. At least in theory, it should make Perl's regular expressions behave more like an NFA and less like backtracking. Memoization does not completely solve the problem, though: the memoization itself requires a memory footprint roughly equal to the size of the text times the size of the regular expression. Memoization also does not address the issue of the stack space used by backtracking, which is linear in the size of the text: matching long strings typically causes a backtracking implementation to run out of stack space.
Character sets. Modern regular expression implementations must deal with large non-ASCII character sets such as Unicode. The Plan 9 regular expression library incorporates Unicode by running an NFA with a single Unicode character as the input character for each step. That library separates the running of the NFA from decoding the input, so that the same regular expression matching code is used for both UTF-8 and wide-character inputs.
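A minimal sketch of that separation (the transition table and helper names below are invented for illustration, not the Plan 9 library's API): the NFA stepping loop sees only code points, and each input encoding supplies its own decoder in front of it.

```python
def run_nfa(trans, start, accept, codepoints):
    """Step an NFA one Unicode code point at a time; the caller decodes."""
    states = {start}
    for cp in codepoints:
        states = {t for s in states for t in trans.get((s, cp), ())}
    return bool(states & accept)

# Hand-built NFA for the literal string "aé":  0 --a--> 1 --é--> 2
trans = {(0, ord('a')): (1,), (1, ord('é')): (2,)}

# Wide-character input: a Python str is already a sequence of code points.
print(run_nfa(trans, 0, {2}, (ord(c) for c in 'aé')))            # True

# UTF-8 input: decode the bytes first, then feed the same matcher.
data = 'aé'.encode('utf-8')   # three bytes, but only two code points
print(run_nfa(trans, 0, {2}, (ord(c) for c in data.decode('utf-8'))))   # True
```

Because the matcher never sees bytes, one transition table serves every encoding; only the decoder in front of the loop changes.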
History and References
Michael Rabin and Dana Scott introduced non-deterministic finite automata and the concept of non-determinism in 1959 [7], showing that NFAs can be simulated by (potentially much larger) DFAs in which each DFA state corresponds to a set of NFA states. (They won the Turing Award in 1976 for the introduction of the concept of non-determinism in that paper.)
R. McNaughton and H. Yamada [4] and Ken Thompson [9] are commonly credited with giving the first constructions to convert regular expressions into NFAs, even though neither paper mentions the then-nascent concept of an NFA. McNaughton and Yamada's construction creates a DFA, and Thompson's construction creates IBM 7094 machine code, but reading between the lines one can see latent NFA constructions underlying both. Regular expression to NFA constructions differ only in how they encode the choices that the NFA must make. The approach used above, mimicking Thompson, encodes the choices with explicit choice nodes (the Split nodes above) and unlabeled arrows. An alternative approach, the one most commonly credited to McNaughton and Yamada, is to avoid unlabeled arrows, instead allowing NFA states to have multiple outgoing arrows with the same label. McIlroy [3] gives a particularly elegant implementation of this approach in Haskell.
Thompson's regular expression implementation was for his QED editor running on the CTSS [10] operating system on the IBM 7094. A copy of the editor can be found in archived CTSS sources [5]. L. Peter Deutsch and Butler Lampson [1] developed the first QED, but Thompson's reimplementation was the first to use regular expressions. Dennis Ritchie, author of yet another QED implementation, has documented the early history of the QED editor [8]. (Thompson, Ritchie, and Lampson later won Turing Awards for work unrelated to QED or finite automata.)
Thompson's paper marked the beginning of a long line of regular expression implementations. Thompson chose not to use his algorithm when implementing the text editor ed, which appeared in First Edition Unix (1971), or in its descendant grep, which first appeared in the Fourth Edition (1973). Instead, these venerable Unix tools used recursive backtracking! Backtracking was justifiable because the regular expression syntax was quite limited: it omitted grouping parentheses and the |, ?, and + operators. Al Aho's egrep, which first appeared in the Seventh Edition (1979), was the first Unix tool to provide the full regular expression syntax, using a precomputed DFA. By the Eighth Edition (1985), egrep computed the DFA on the fly, like the implementation given above.
While writing the text editor sam [6] in the early 1980s, Rob Pike wrote a new regular expression implementation, which Dave Presotto extracted into a library that appeared in the Eighth Edition. Pike's implementation incorporated submatch tracking into an efficient NFA simulation but, like the rest of the Eighth Edition source, was not widely distributed. Pike himself did not realize that his technique was anything new. Henry Spencer reimplemented the Eighth Edition library interface from scratch, but using backtracking, and released his implementation into the public domain. It became very widely used, eventually serving as the basis for the slow regular expression implementations mentioned earlier: Perl, PCRE, Python, and so on. (In his defense, Spencer knew the routines could be slow, and he didn't know that a more efficient algorithm existed. He even warned in the documentation, “Many users have found the speed perfectly adequate, although replacing the insides of egrep with this code would be a mistake.”) Pike's regular expression implementation, extended to support Unicode, was made freely available with sam in late 1992, but the particularly efficient regular expression search algorithm went unnoticed. The code is now available in many forms: as part of sam, as Plan 9's regular expression library, or packaged separately for Unix. Ville Laurikari independently discovered Pike's algorithm in 1999, developing a theoretical foundation as well [2].
Finally, any discussion of regular expressions would be incomplete without mentioning Jeffrey Friedl's book Mastering Regular Expressions, perhaps the most popular reference among today's programmers. Friedl's book teaches programmers how best to use today's regular expression implementations, but not how best to implement them. What little text it devotes to implementation issues perpetuates the widespread belief that recursive backtracking is the only way to simulate an NFA. Friedl makes it clear that he neither understands nor respects the underlying theory.
Summary
Regular expression matching can be simple and fast, using finite automata-based techniques that have been known for decades. In contrast, Perl, PCRE, Python, Ruby, Java, and many other languages have regular expression implementations based on recursive backtracking that are simple but can be excruciatingly slow. With the exception of backreferences, the features provided by the slow backtracking implementations can be provided by the automata-based implementations at dramatically faster, more consistent speeds.
The next article in this series, “Regular Expression Matching: the Virtual Machine Approach,” discusses NFA-based submatch extraction. The third article, “Regular Expression Matching in the Wild,” examines a production implementation. The fourth article, “Regular Expression Matching with a Trigram Index,” explains how Google Code Search was implemented.
Acknowledgements
Lee Feigenbaum, James Grimmelmann, Alex Healy, William Josephson, and Arnold Robbins read drafts of this article and made many helpful suggestions. Rob Pike clarified some of the history surrounding his regular expression implementation. Thanks to all.
References
[1] L. Peter Deutsch and Butler Lampson, “An online editor,” Communications of the ACM 10(12) (December 1967), pp. 793–799. http://doi.acm.org/10.1145/363848.363863
[2] Ville Laurikari, “NFAs with Tagged Transitions, their Conversion to Deterministic Automata and Application to Regular Expressions,” in Proceedings of the Symposium on String Processing and Information Retrieval, September 2000. http://laurikari.net/ville/spire2000-tnfa.ps
[3] M. Douglas McIlroy, “Enumerating the strings of regular languages,” Journal of Functional Programming 14 (2004), pp. 503–518. http://www.cs.dartmouth.edu/~doug/nfa.ps.gz (preprint)
[4] R. McNaughton and H. Yamada, “Regular expressions and state graphs for automata,” IRE Transactions on Electronic Computers EC-9(1) (March 1960), pp. 39–47.
[5] Paul Pierce, “CTSS source listings.” http://www.piercefuller.com/library/ctss.html (Thompson's QED is in the file com5 in the source listings archive and is marked as 0QED.)
[6] Rob Pike, “The text editor sam,” Software—Practice & Experience 17(11) (November 1987), pp. 813–845. http://plan9.bell-labs.com/sys/doc/sam/sam.html
[7] Michael Rabin and Dana Scott, “Finite automata and their decision problems,” IBM Journal of Research and Development 3 (1959), pp. 114–125. http://www.research.ibm.com/journal/rd/032/ibmrd0302C.pdf
[8] Dennis Ritchie, “An incomplete history of the QED text editor.” http://plan9.bell-labs.com/~dmr/qed.html
[9] Ken Thompson, “Regular expression search algorithm,” Communications of the ACM 11(6) (June 1968), pp. 419–422. http://doi.acm.org/10.1145/363347.363387 (PDF)
[10] Tom Van Vleck, “The IBM 7094 and CTSS.” http://www.multicians.org/thvv/7094.html
Copyright © 2007 Russ Cox. All Rights Reserved.
http://swtch.com/~rsc/regexp/