Notes on 正则表达式必知必会 (Regular Expressions: Must-Know Essentials)

Tools

RegExr v2.1 (provides an Explain view)

Matching Single Characters

Plain text: literal characters match themselves
. Match any single character except \n (unless /s)
\ Escape next character, such as \/ or \$
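A minimal sketch of the dot and the escape character, using Python's re module. The sample file names are made up for illustration, and re.DOTALL is Python's counterpart of the /s flag mentioned above.

```python
import re

# Sample text chosen for illustration (not part of the notes).
text = "na1.xls sa1.xls sales.txt"

# An unescaped dot matches any single character, so "sa." finds "sa1" and "sal".
any_char = re.findall(r"sa.", text)

# An escaped dot "\." matches only a literal period.
literal_dot = re.findall(r".a.\.xls", text)

# By default "." does not match a newline; re.DOTALL plays the role of /s.
no_newline = re.fullmatch(r"a.b", "a\nb")
with_dotall = re.fullmatch(r"a.b", "a\nb", re.DOTALL)

print(any_char)
print(literal_dot)
print(no_newline is None, with_dotall is not None)
```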

Matching Sets of Characters

[] [abc] Match one out of a set of characters
- [a-z] Match one character from range, often [a-zA-Z]
^ [^abc] Match one character not in set

^ negates a character set when it is the first character inside the brackets
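The set, range, and negation forms above can be tried out as follows. This is a small Python sketch; the file names are illustrative assumptions.

```python
import re

# Illustrative sample text (an assumption, not from the notes).
text = "na1.xls sa1.xls ca1.xls sam.xls"

# [ns] matches one of n or s; [0-9] matches one digit from the range.
in_set = re.findall(r"[ns]a[0-9]\.xls", text)

# [^0-9] matches one character that is NOT a digit.
negated = re.findall(r"[ns]a[^0-9]\.xls", text)

print(in_set)
print(negated)
```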

Using Metacharacters

\d Digit [0-9]
\D Opposite of \d [^0-9]
\w Word character (alphanumeric, underscore) [a-zA-Z0-9_]
\W Opposite of \w [^a-zA-Z0-9_]
\s Whitespace character (space, tab, etc.) [ \f\n\r\t\v]
\S Opposite of \s [^ \f\n\r\t\v]
[\b] Backspace (any use of \b in a character set)
\n Newline
\c Control character
\f Form feed
\r Carriage return
\t Tab
\v Vertical tab
\x Hexadecimal number; \xf0 matches hex f0
\0 Octal number; \021 matches octal 21
POSIX character classes, e.g. [[:alpha:]] or [[:digit:]]
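A short Python sketch of the shorthand classes listed above. One caveat that applies to Python specifically: for str patterns, \d, \w and \s are Unicode-aware by default, so the bracket equivalents in the table hold exactly only when re.ASCII is set. The sample string is an assumption made for the example.

```python
import re

# Illustrative sample: a word with an underscore and digits, a tab, a newline.
text = "ID_42:\tok\n"

digits = re.findall(r"\d", text)      # each single digit
words = re.findall(r"\w+", text)      # runs of letters, digits, underscore
spaces = re.findall(r"\s", text)      # whitespace characters
non_words = re.findall(r"\W", text)   # everything \w does not match

# \x41 is the character with hexadecimal code 41, the letter A.
hex_a = re.fullmatch(r"\x41", "A")

# Unicode-aware by default; re.ASCII restricts \w to [a-zA-Z0-9_].
unicode_word = re.fullmatch(r"\w", "\u00e9")
ascii_word = re.fullmatch(r"\w", "\u00e9", re.ASCII)

print(digits, words, spaces, non_words)
```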

Repeating Matches

* Match 0 or more of previous char/subexpression
+ Match 1 or more of previous char/subexpression
? Match 0 or 1 of previous char/subexpression
{m,n} Match m to n (inclusive) of previous char/subex.
{n,} Match n or more of previous char/subexpression
{n} Match exactly n of previous char/subexpression
*?, ?? Lazy version of same (works for any quantifier)
*+, ?+ Possessive version (works for any quantifier)
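As a rough illustration of ? and the {m,n} interval form, the Python sketch below matches loosely formatted dates and an optional letter. The pattern and sample strings are assumptions chosen for the example, not a complete date validator.

```python
import re

# One or two digits, a separator, one or two digits, a separator, 2 to 4 digits.
date = re.compile(r"\d{1,2}[-/]\d{1,2}[-/]\d{2,4}")

# "2/2/2" is rejected because the year part needs at least two digits.
dates = date.findall("4/8/03 10-6-2004 2/2/2 01-01-01")

# ? makes the preceding character optional (0 or 1 occurrences).
colors = re.findall(r"colou?r", "color colour colouur")

print(dates)
print(colors)
```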

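Preventing over-matching:

Quantifiers such as * + {n,} are greedy by default: they match as much text as possible rather than stopping at the first point that satisfies the pattern.
When greedy behavior is not wanted, use the lazy versions *? +? {n,}?, which match as few characters as possible.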
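The difference between greedy and lazy quantifiers is easiest to see on text with repeated delimiters. A minimal Python sketch, with a made-up HTML fragment as input:

```python
import re

# Illustrative input with two separate bold spans.
html = "living in <B>AK</B> and <B>HI</B>"

# Greedy: .* runs to the LAST </B>, swallowing both spans in one match.
greedy = re.findall(r"<B>.*</B>", html)

# Lazy: .*? stops at the FIRST </B> it can, giving one match per span.
lazy = re.findall(r"<B>.*?</B>", html)

print(greedy)
print(lazy)
```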

Position Matching

^ Start of string (equivalent: \A unless /m is used)
$ End of string (equivalent: \Z unless /m is used)
\b Word boundary, similar to: (\w\W|\W\w)
\B Anything but a word boundary
(?m) Multiline mode; ^ and $ also match at the start and end of each line
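A small Python sketch of \b and \B. The sentence is an assumption chosen so that "cat" appears both as a whole word and inside a longer word.

```python
import re

text = "The cat scattered his food."

# Without boundaries, "cat" is also found inside "scattered".
anywhere = re.findall(r"cat", text)

# \b on both sides keeps only the standalone word.
whole_word = re.findall(r"\bcat\b", text)

# \B on both sides keeps only the occurrence embedded in another word.
embedded = re.findall(r"\Bcat\B", text)

print(len(anywhere), len(whole_word), len(embedded))
```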

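Multiline mode:

Multiline mode, (?m), makes the regex engine treat line separators as string separators.
In this mode ^ matches not only at the start of the string but also at the position right after each line separator (newline);
$ behaves analogously and also matches right before each line separator.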
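The effect of multiline mode can be sketched in Python, where it is available both as the inline flag (?m) and as re.MULTILINE. The three-line sample text is an assumption for the example.

```python
import re

text = "first line\nsecond line\nthird line"

# Without multiline mode, ^ matches only at the very start of the string.
default_starts = re.findall(r"^\w+", text)

# With (?m), ^ also matches right after every newline.
line_starts = re.findall(r"(?m)^\w+", text)

# The same applies to $, here enabled through the re.MULTILINE flag.
line_ends = re.findall(r"\w+$", text, re.MULTILINE)

print(default_starts)
print(line_starts)
print(line_ends)
```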

Using Subexpressions

() Define a subexpression
| OR; (ab|cd) matches ab or cd

Subexpressions are commonly used to control exactly what a quantifier applies to; they can be nested.
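To show how grouping changes what a quantifier or an alternation applies to, here is a short Python sketch. The IP-address pattern is deliberately loose (it does not check that each part is at most 255), and the sample text is an assumption.

```python
import re

# Without a group, {2} applies only to the single character before it.
char_repeat = re.fullmatch(r"ab{2}", "abb")

# With a group, {2} repeats the whole subexpression.
group_repeat = re.fullmatch(r"(ab){2}", "abab")

# A repeated group: three "digits plus dot" units followed by a final number.
ip = re.compile(r"(\d{1,3}\.){3}\d{1,3}")
found = ip.search("Pinging host [12.159.46.200] with 32 bytes").group()

# Grouping limits the scope of |: the year starts with 19 or 20.
year_ok = re.fullmatch(r"(19|20)\d{2}", "1967")
year_bad = re.fullmatch(r"(19|20)\d{2}", "1867")

print(found)
```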

Using Backreferences

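\1 Backreference to the first subexpression (use inside match, e.g. s/(.)\1/a/)

A backreference lets a regular expression refer to text matched earlier in the same pattern:
\1 refers to the first subexpression in the pattern, \2 to the second, \3 to the third, and so on.

Backreferences can refer only to subexpressions, that is, the parts of the pattern enclosed in ( and ).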
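Two typical uses of backreferences, sketched in Python: finding accidentally doubled words, and requiring a closing tag to match its opening tag. The sample strings are assumptions for the example.

```python
import re

text = "This is a block of of text, several words here are are repeated."

# (\w+) captures a word; \1 demands the very same text again after a space.
doubled = re.findall(r"\b(\w+) \1\b", text)

# In the replacement string, \1 inserts the captured word once.
cleaned = re.sub(r"\b(\w+) \1\b", r"\1", text)

# The closing tag must carry the same heading level as the opening tag.
html = "<h1>Title</h1> <h2>Broken</h3>"
headers = [m.group() for m in re.finditer(r"<h([1-6])>.*?</h\1>", html)]

print(doubled)
print(cleaned)
print(headers)
```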

\l Make next character lowercase
\u Make next character uppercase
\L Make entire string (up to \E) lowercase
\U Make entire string (up to \E) uppercase
\E End \L or \U (so they only apply before \E)
\u\L Capitalize first char, lowercase rest (sentence)
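Python's re module does not accept \l, \u, \L, \U or \E in replacement strings, so the sketch below swaps in a different technique: a replacement function that applies str methods to the captured text. It reproduces the effect of \U...\E and \u\L under that assumption; the sample strings are made up.

```python
import re

html = "<h1>welcome to my homepage</h1>"

# Equivalent in effect to replacing with "\1\U\2\E\3": uppercase group 2 only.
upper = re.sub(
    r"(<h1>)(.*?)(</h1>)",
    lambda m: m.group(1) + m.group(2).upper() + m.group(3),
    html,
)

# Equivalent in effect to "\u\L" per word: first letter upper, rest lower.
sentence = re.sub(r"\w+", lambda m: m.group().capitalize(), "hELLO wORLD")

print(upper)
print(sentence)
```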

Looking Ahead and Behind

Positive lookahead (?=) m/a(?=b)/ requires ab but "eats" only a
Positive lookbehind (?<=) m/(?<=a)b/ requires ab but "eats" only b
Negative lookahead (?!) Succeeds only if the pattern does not follow
Negative lookbehind (?<!) Succeeds only if the pattern does not precede
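The four lookaround forms can be sketched in Python as follows. Each one checks the surrounding text without including it in the match. All sample strings are assumptions for the example.

```python
import re

urls = "http://www.forta.com/ https://mail.forta.com/ ftp://ftp.forta.com/"

# Positive lookahead: word characters that are followed by ":" (colon not included).
schemes = re.findall(r"\w+(?=:)", urls)

order = "I paid $30 for 100 apples, 50 oranges, and 60 pears. I saved $5 on this order."

# Positive lookbehind: digits that are preceded by "$" (dollar sign not included).
prices = re.findall(r"(?<=\$)\d+", order)

# Negative lookbehind: whole numbers that are NOT preceded by "$".
quantities = re.findall(r"\b(?<!\$)\d+\b", order)

# Negative lookahead: words starting with q that is NOT followed by u.
no_u = re.findall(r"\bq(?!u)\w*", "queue qatar quit qi")

print(schemes, prices, quantities, no_u)
```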

Embedding Conditions

(?(a)b) Conditional; if a then b
(?(a)b|c) Conditional; if a then b else c
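A sketch of a conditional in Python, which supports the form where the condition is a group number: (?(1)yes|no) uses the yes branch if group 1 took part in the match. The phone-number format here is an assumption chosen to illustrate the idea.

```python
import re

# Group 1 optionally matches "(". If it matched, a ")" must follow the area
# code; otherwise a "-" must follow.
phone = re.compile(r"(\()?\d{3}(?(1)\)|-)\d{3}-\d{4}")

candidates = ["(123)456-7890", "123-456-7890", "(123-456-7890", "123)456-7890"]

# Keep only the strings that match from start to end.
valid = [c for c in candidates if phone.fullmatch(c)]

print(valid)
```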

RegExr v2.1 Cheat Sheet

Character classes

.           any character except newline
\w \d \s    word, digit, whitespace
\W \D \S    not word, digit, whitespace
[abc]       any of a, b, or c
[^abc]      not a, b, or c
[a-g]       character between a & g

Anchors

^abc$   start / end of the string
\b \B   word, not-word boundary

Escaped Character

\. \* \\    escaped special characters
\t \n \r    tab, linefeed, carriage return
\u00A9      unicode escaped ©

Groups & Lookaround

(abc)   capture group
\1      backreference to group #1
(?:abc) non-capturing group
(?=abc) positive lookahead
(?!abc) negative lookahead
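The practical difference between a capture group and a non-capturing group shows up in Python's re.findall, which returns the group contents when the pattern contains a capture group and the whole match otherwise. A minimal sketch with made-up input:

```python
import re

text = "ababab ab"

# Capture group: findall returns what the group captured last, not the full match.
captured = re.findall(r"(ab)+", text)

# Non-capturing group: same grouping for the quantifier, but nothing is captured,
# so findall returns the full matches.
whole = re.findall(r"(?:ab)+", text)

print(captured)
print(whole)
```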

Quantifiers & Alternation

a* a+ a?    0 or more, 1 or more, 0 or 1
a{5} a{2,}  exactly five, two or more
a{1,3}      between one & three
a+? a{2,}?  match as few as possible
ab|cd       match ab or cd