Patterns

From Legacy Roblox Wiki
Revision as of 20:38, 11 July 2011 by >Merlin11188 (Created the page. In-depth (hopefully) description of patterns)

Note: This tutorial requires some knowledge of string manipulation.

Classes

Character Class:

A character class is used to represent a set of characters. The following are character classes and their representations:

  • x — Where x is not one of the magic characters (^$()%.[]*+-?), x represents itself
  • . — Represents all characters (#32kas321fslk#?@34)
  • %a — Represents all letters (aBcDeFgHiJkLmNoPqRsTuVwXyZ)
  • %c — Represents all control characters (all ascii characters below 32 and ascii character 127)
  • %d — Represents all base-10 digits (0123456789)
  • %l — Represents all lower-case letters (abcdefghijklmnopqrstuvwxyz)
  • %p — Represents all punctuation characters (#^;,.) etc.
  • %s — Represents all space characters
  • %u — Represents all upper-case letters (ABCDEFGHIJKLMNOPQRSTUVWXYZ)
  • %w — Represents all alpha-numeric characters (aBcDeFgHiJkLmNoPqRsTuVwXyZ0123456789)
  • %x — Represents all hexadecimal digits (0123456789abcdefABCDEF)
  • %z — Represents the ascii character with representation 0 (the null terminator)
  • %x — (where x is any non-alphanumeric character) Represents the character x itself. This is the standard way to escape the magic characters. Any punctuation character (even a non-magic one) can be preceded by a '%' when used to represent itself in a pattern. So, a literal percent sign is written "%%" in a pattern

Here's an example:

Example
String="Ha! You'll never find any of these (323414123114452) numbers inside me!"
print(string.match(String, "%d")) -- Find a digit character

Output:
3
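
Escaping is easy to get wrong, so here is a small sketch of it. The sample strings are made up purely for illustration; only string.match and the classes listed above are used.

```lua
-- Hypothetical sample string, chosen only for illustration.
local price = "Sale: 50% off, now $4.99!"

print(price:match("%d%d%%"))      -- two digits, then a literal percent sign: 50%
print(price:match("%$%d%.%d%d"))  -- literal $, digit, literal period, two digits: $4.99

-- An unescaped . matches ANY character; an escaped %. matches only a period.
print(("1a2"):match("%d.%d"))     -- 1a2
print(("1a2"):match("%d%.%d"))    -- nil
print(("1.2"):match("%d%.%d"))    -- 1.2
```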


An upper-case version of any of these classes results in the complement of that class. For instance, %A will represent all non-letter characters. You can even combine them! Here's another example:

Example
Martian="141341432431413415072343E334141241312"
print(Martian:match("%D%d")) -- Find any non-digit character immediately followed by a digit.

Output:
E3
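
A few more complement classes in action, as a quick sketch. The strings are invented for the example.

```lua
print(("ABCdEF"):match("%U"))          -- first character that is NOT upper-case: d
print(("  score: 42"):match("%S"))     -- first character that is NOT a space: s
print(("hello world"):match("%a%A%a")) -- letter, non-letter, letter: o w
```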


Modifiers

In Lua, modifiers are used for repetitions and optional parts. That's where they're useful; you can get more than one character at a time:

  • + — 1 or more repetitions
  • * — 0 or more repetitions
  • - — (minus sign) also 0 or more repetitions, but matches as few as possible
  • ? — optional (0 or 1 occurrence)


I'll start with the simplest one: the ?. This makes the character class optional, and if it's there, matches one character of it. That sounds complex, but is actually really simple, so here's an example:

Example
stringToMatch="Once upon a time, in a land far, far away..."
print(stringToMatch:match("%a?")) -- Find a letter, but it doesn't have to be there.
print(stringToMatch:match("%d?")) -- Find a digit, but it doesn't have to be there.

Output:
O -- O, in Once.
-- (an empty string, not nil) The digit was optional, so the pattern matched zero characters at the start of the string.
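
The ? is most useful in the middle of a longer pattern, where only one piece is optional. A minimal sketch, with made-up strings:

```lua
-- The "u" is optional, so both spellings match.
print(("What color?"):match("colou?r"))  -- color
print(("What colour?"):match("colou?r")) -- colour

-- An optional minus sign in front of a digit (the - is escaped because it is magic).
print(("7"):match("%-?%d"))   -- 7
print(("x-7"):match("%-?%d")) -- -7
```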


The + symbol used after a character class requires at least one instance of that class, and will get the longest string of that class. Here's an example:

Example
stringToMatch="Once upon a time, in a land far, far away..."
print(stringToMatch:match("%a+")) -- Finds the first letter, then matches letters until a non-letter character
print(stringToMatch:match("%d+")) -- Finds the first digit, then matches digits until a non-digit character

Output:
Once
nil -- Nil, because the pattern required the digit to be there, but it wasn't, which returns nil.


The * symbol used after a character class is like a combination of the + and ? modifiers. It matches the longest sequence of the character class, but it doesn't have to be there. Here's an example of it matching a floating-point (decimal) number, without requiring the decimal:

Example
numPattern="%d+%.?%d*"
--[[ Requires at least one digit; then, if there's a decimal point, get it (remember: a period is a magic character, so you have to escape it with the % sign); and if there are digits after the decimal point, grab them. ]]

local num1="21608347 is an integer, a whole number, and a natural number!"
local num2="2034782.014873 is a decimal number!"
print(num1:match(numPattern))
print(num2:match(numPattern))

Output:
21608347 -- Grabbed a whole number, because there wasn't a decimal point or numbers after the decimal point
2034782.014873 -- Grabbed the floating-point number, because it had a decimal and numbers after it
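
One thing to watch for: because the decimal point and the trailing digits are each optional on their own, this pattern also grabs a period that merely ends a sentence. The sketch below shows that, plus one possible workaround. The helper name firstNumber is hypothetical, and it uses string.gsub, which this page doesn't otherwise cover.

```lua
local numPattern = "%d+%.?%d*"

print(("See section 3. Then stop."):match(numPattern)) -- 3.   (the period came along)

-- Hypothetical helper: match as above, then drop a trailing period if there is one.
local function firstNumber(s)
    local n = s:match(numPattern)
    if n then
        n = n:gsub("%.$", "") -- "%.$" is a literal period at the very end of n
    end
    return n
end

print(firstNumber("See section 3. Then stop."))           -- 3
print(firstNumber("2034782.014873 is a decimal number!")) -- 2034782.014873
print(firstNumber("no digits here"))                      -- nil
```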


The - symbol used after a character class is like the * symbol; there's only one difference, actually: It matches the shortest sequence of the character class. Here's an example showing the difference:

Example
String="((3+4)+3+4)+2"
print(String:match("%(.*%)")) -- Find a (, then match any characters (the . represents all characters) up to the LAST ).
print(String:match("%(.-%)")) -- Find a (, then match all characters until the FIRST ).

Output:
((3+4)+3+4) -- Grabbed everything from the first parenthesis to the last closing parenthesis
((3+4) -- Grabbed everything from the first parenthesis to the first closing parenthesis


Sets

  • [set] represents the class which is the union of all characters in the set. You define a set with brackets, like [%a%d]. A range of characters may be specified by separating the end characters of the range with a '-'. All classes described above may also be used as components in set. All other characters in a set represent themselves. For example, [%w_] (or [_%w]) represents all alphanumeric characters plus the underscore, [0-7] represents the octal digits, and [0-7%l%-] represents the octal digits plus the lowercase letters plus the '-' character.

The interaction between ranges and classes is not defined. Therefore, patterns like [%a-z] or [a-%%] have no meaning.

  • [^set] represents the complement of set, where set is interpreted as above.
Example
Vowel="[AEIOUaeiou]" -- Match a vowel, upper-case or lower-case
Consonant="[^AEIOUaeiou]" -- Match any character that isn't a vowel, using the complement of the vowel set (so spaces, digits, and punctuation match too, not just consonants)
OctalDigit="[0-7]" -- Match an octal digit. Octal digits: 0,1,2,3,4,5,6,7
stringToMatch="I have several vowels and consonants, and I'm followed by an octal number: 0231356701"
print(stringToMatch:match(Vowel))
print(stringToMatch:match(Consonant))
print(stringToMatch:match(OctalDigit))

Output:
I -- First vowel
  -- This is a space; it was the first non-vowel character (after the I).
0 -- First octal digit, late in the string.
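
Sets get a lot more useful once you put a modifier after them. Here's a short sketch; the strings and variable names are invented for illustration.

```lua
-- A letter or underscore, followed by any number of alphanumerics or underscores.
local identifier = "[%a_][%w_]*"
print(("2 + my_var2 = 10"):match(identifier))        -- my_var2

-- A # followed by one or more hexadecimal digits, written out as ranges.
print(("color = #FF00aa;"):match("#[0-9a-fA-F]+"))   -- #FF00aa

-- Everything up to the first comma, using a complemented set.
print(("apples,pears,plums"):match("[^,]+"))         -- apples

-- Octal digits, lower-case letters, and an escaped '-'.
print(("OCTAL-ish 17"):match("[0-7%l%-]+"))          -- -ish
```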


Pattern Items

Alright, now it's time to explain what a pattern item is. A pattern item may be:

  • a single character class, which matches any single character in the class;
  • a single character class followed by '*', which matches 0 or more repetitions of characters in the class. These repetition items will always match the longest possible sequence;
  • a single character class followed by '+', which matches 1 or more repetitions of characters in the class. These repetition items will always match the longest possible sequence;
  • a single character class followed by '-', which also matches 0 or more repetitions of characters in the class. Unlike '*', these repetition items will always match the shortest possible sequence;
  • a single character class followed by '?', which matches 0 or 1 occurrence of a character in the class;
  • %n, for n between 1 and 9; such item matches a substring equal to the n-th captured string (see below);
  • %bxy, where x and y are two distinct characters; such item matches strings that start with x, end with y, and where the x and y are balanced. This means that, if one reads the string from left to right, counting +1 for an x and -1 for a y, the ending y is the first y where the count reaches 0. For instance, the item %b() matches expressions with balanced parentheses.

A pattern cannot contain embedded zeros. Use %z instead.
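
The last two items, %n and %bxy, are easier to see than to describe. A minimal sketch with made-up strings (the %n examples rely on captures, which are explained in the Captures section):

```lua
-- %b() matches from the first ( to the ) that balances it.
local expr = "f(a, g(b, c)) + h(d)"
print(expr:match("%b()"))             -- (a, g(b, c))

-- %1 matches whatever the first capture matched: here, the same letter twice in a row.
print(("balloon"):match("(%a)%1"))    -- l

-- Find a quoted piece of text that ends with the same kind of quote it started with.
local text = [[say "it's fine" then 'done']]
local quote, inside = text:match("([\"'])(.-)%1")
print(quote, inside)                  -- "	it's fine
```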

Pattern:

A pattern is a sequence of pattern items. A '^' at the beginning of a pattern anchors the match at the beginning of the string. A '$' at the end of a pattern anchors the match at the end of the string. At other positions, '^' and '$' have no special meaning and represent themselves. Here's an example of a pattern:

Example
local Pattern="[%w%s%p]*" -- Get the longest sequence containing alpha-numeric characters, punctuation marks, and spaces.
local Pattern2="^%a+" -- The string has to start with a sequence of letters.
x="Hello, my name is Merlin!"
print(x:match(Pattern))
print(x:match(Pattern2))

Output:
Hello, my name is Merlin! -- The entire string contained only alpha-numeric characters, punctuation marks, and spaces!
Hello -- Matched only the letters at the start of the string.
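
Here's a quick sketch of the anchors, including what happens when ^ shows up in the middle of a pattern. The strings are just for illustration.

```lua
local s = "abc123"

print(s:match("^%d+"))          -- nil: the string doesn't START with digits
print(s:match("%d+$"))          -- 123: digits that run to the END of the string
print(s:match("^%a+$"))         -- nil: the WHOLE string would have to be letters
print(("abc"):match("^%a+$"))   -- abc

-- In the middle of a pattern, ^ is just an ordinary character.
print(("2^10"):match("%d^%d"))  -- 2^1
```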


Captures

A pattern may contain sub-patterns enclosed in parentheses; they describe captures. When a match succeeds, the substrings that match the captures are stored (captured) for future use. Captures are numbered according to their left parentheses. For instance, in the pattern "(a*(.)%w(%s*))", the part of the string matching "a*(.)%w(%s*)" is stored as the first capture (and therefore has number 1); the character matching "." is captured with number 2, and the part matching "%s*" has number 3. Whaaaaat??? Here:

Example
local number="55"
print(number:find("%d%d")) -- Find returns the location of the match, not the match itself
print(number:find("(%d%d)"))

Output:
1	2          -- The first digit is at number:sub(1,1) and the second digit is at number:sub(2,2)
1	2	55 -- The 55 is captured and returned.

In the second call, the parentheses make a capture out of one digit immediately followed by another. string.find always returns the start and end positions of the match; when the pattern contains captures, it also returns each captured substring after those positions. Here, the %d%d is inside the parentheses, so the substring it matched (55) is returned after the 1 and the 2.
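
With string.match, the captures are all you get back, one return value per capture. The sketch below uses a made-up "key = value" string, then runs the nested pattern from the start of this section so you can see the numbering.

```lua
-- Two captures: the name before the = and the digits after it.
local setting = "volume = 11"
local key, value = setting:match("(%a+)%s*=%s*(%d+)")
print(key, value) -- volume	11

-- Captures are numbered by their opening parenthesis, left to right.
local c1, c2, c3 = ("aab1  z"):match("(a*(.)%w(%s*))")
print(("[%s] [%s] [%s]"):format(c1, c2, c3)) -- [aab1  ] [b] [  ]
```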

As a special case, the empty capture () captures the current string position (a number). For instance, if we apply the pattern "()aa()" on the string "flaaap", there will be two captures: 3 and 5.
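
Here's that position capture as runnable code. The variable names are just for the example.

```lua
local word = "flaaap"
local first, last = word:match("()aa()")
print(first, last)                -- 3	5

-- The second position is one PAST the match, so subtract 1 to slice it back out.
print(word:sub(first, last - 1))  -- aa
```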

See Also

http://www.lua.org/pil/20.2.html
http://www.lua.org/pil/20.3.html