
Notes on 正则表达式必知必会 (Regular Expressions: Must-Know Essentials)

Tools

RegExr v2.1 provides an Explain feature that breaks a pattern down token by token.

Matching Single Characters

  • Plain text: literal characters match themselves
  • . Match any single character except \n (unless /s)
  • \ Escape next character, such as \/ or \( or \)
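
The three items above can be tried out with a short sketch. It assumes Python's standard re module, which these notes do not name; the sample strings are made up for illustration.

```python
import re

# "." matches any single character except a newline.
dot_matches = re.findall(r"c.t", "cat cot c\nt cut")

# "\." matches a literal period only, not "any character".
literal_dot = re.findall(r"a\.txt", "a.txt aXtxt")

# re.DOTALL (the counterpart of the /s flag) lets "." match a newline too.
dotall_matches = re.findall(r"c.t", "c\nt", flags=re.DOTALL)

print(dot_matches)     # ['cat', 'cot', 'cut']
print(literal_dot)     # ['a.txt']
print(dotall_matches)  # ['c\nt']
```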

Matching Sets of Characters

  • [] [abc] Match one out of a set of characters
  • - [a-z] Match one character from range, often [a-zA-Z]
  • ^ [^abc] Match one character not in set

Inside a character set, a leading ^ negates the whole set.
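
A minimal sketch of sets, ranges and negation, assuming Python's re module; the file names in the sample text are hypothetical.

```python
import re

text = "na1.xls sa2.xls ca3.xls ba_.xls"

# [ns] matches exactly one character: either "n" or "s".
any_of = re.findall(r"[ns]a", text)

# [0-9] matches one character from the range 0 through 9.
digit_range = re.findall(r"a[0-9]", text)

# [^0-9] matches one character that is NOT a digit.
negated = re.findall(r"a[^0-9]", text)

print(any_of)       # ['na', 'sa']
print(digit_range)  # ['a1', 'a2', 'a3']
print(negated)      # ['a_']
```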

Using Metacharacters

  • \d Digit [0-9]
  • \D Opposite of \d [^0-9]
  • \w Word character (alphanumeric, underscore) [a-zA-Z0-9_]
  • \W Opposite of \w [^a-zA-Z0-9_]
  • \s Whitespace character (space, tab, etc.) [ \f\n\r\t\v]
  • \S Opposite of \s [^ \f\n\r\t\v]
  • [\b] Backspace (any use of \b in a character set)
  • \n Newline
  • \c Control character
  • \f Form feed
  • \r Carriage return
  • \t Tab
  • \v Vertical tab
  • \x Hexadecimal number; \xf0 matches hex f0
  • \0 Octal number; \021 matches octal 21
  • POSIX character classes
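
The shorthand classes above can be checked with a small sketch, assuming Python's re module. The re.ASCII flag is used so that \d, \w and \s line up with the bracket equivalents listed; without it, Python also matches non-ASCII digits and letters. The sample string is made up.

```python
import re

sample = "id_42\tok 7"

digits = re.findall(r"\d+", sample, flags=re.ASCII)  # runs of [0-9]
words = re.findall(r"\w+", sample, flags=re.ASCII)   # runs of [a-zA-Z0-9_]
spaces = re.findall(r"\s", sample, flags=re.ASCII)   # single whitespace chars

# \x41 is the character with hex code 41, i.e. "A".
hex_match = re.findall(r"\x41", "ABA")

print(digits)     # ['42', '7']
print(words)      # ['id_42', 'ok', '7']
print(spaces)     # ['\t', ' ']
print(hex_match)  # ['A', 'A']
```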

Repeating Matches

  • * Match 0 or more of previous char/subexpression
  • + Match 1 or more of previous char/subexpression
  • ? Match 0 or 1 of previous char/subexpression
  • {m,n} Match m to n (inclusive) of previous char/subex.
  • {n,} Match n or more of previous char/subexpression
  • {n} Match exactly n of previous char/subexpression
  • *?, ?? Lazy version of same (works for any quantifier)
  • *+, ?+ Possessive version (works for any quantifier)
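
As an illustration of interval quantifiers, the sketch below matches loosely formatted dates. It assumes Python's re module; the pattern and sample dates are illustrative only and do not validate real calendar dates.

```python
import re

dates = "4/8/03 10-6-2004 2/2/2 01-01-01"

# One or two digits, a separator, one or two digits, a separator,
# then two to four digits for the year.
date_pattern = r"\d{1,2}[-/]\d{1,2}[-/]\d{2,4}"
found = re.findall(date_pattern, dates)

print(found)  # ['4/8/03', '10-6-2004', '01-01-01']
```

"2/2/2" is skipped because {2,4} requires at least two digits for the year.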

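Preventing over-matching:

Quantifiers such as *, + and {n,} are greedy by default: they match as much as possible rather than stopping at the first acceptable match.
When greedy behavior is not wanted, use the lazy versions *?, +? and {n,}?, which match as few characters as possible.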
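
The difference is easiest to see on text with repeated delimiters. A minimal sketch, assuming Python's re module and a made-up HTML fragment:

```python
import re

html = "<b>AK</b> and <b>HI</b>"

# Greedy: .* runs to the LAST </b>, swallowing everything in between.
greedy = re.findall(r"<b>.*</b>", html)

# Lazy: .*? stops at the FIRST </b> it can reach.
lazy = re.findall(r"<b>.*?</b>", html)

print(greedy)  # ['<b>AK</b> and <b>HI</b>']
print(lazy)    # ['<b>AK</b>', '<b>HI</b>']
```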

Position Matching

  • ^ Start of string (equivalent: \A unless /m is used)
  • $ End of string (equivalent: \Z unless /m is used)
  • \b Word boundary, similar to: (\w\W|\W\w)
  • \B Anything but a word boundary
  • (?m) Multiline mode; ^ and $ also match at the start and end of each line
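
A short sketch of \b and \B, assuming Python's re module; the sample sentences are invented for illustration.

```python
import re

# \bcat\b matches "cat" only as a whole word, not inside "scattered".
whole_words = re.findall(r"\bcat\b", "The cat scattered his food")

# \B-\B matches a hyphen that is NOT at a word boundary on either side,
# i.e. the free-standing one surrounded by spaces.
loose_hyphens = re.findall(r"\B-\B", "nine-digit color - coded")

print(whole_words)         # ['cat']
print(len(loose_hyphens))  # 1
```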

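Multiline mode

Multiline mode, (?m), makes the regex engine treat each line separator as a string separator.
In this mode, ^ matches not only at the start of the string but also at the position right after a line separator (newline);
likewise, $ also matches right before a line separator.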
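
The effect of (?m) can be seen in a small sketch, assuming Python's re module (where the inline flag (?m) is equivalent to re.MULTILINE):

```python
import re

text = "first line\nsecond line\nthird line"

# Default mode: ^ matches only at the very start of the string.
default_starts = re.findall(r"^\w+", text)

# Multiline mode: ^ also matches right after each newline.
multiline_starts = re.findall(r"(?m)^\w+", text)

# Multiline mode: $ also matches right before each newline.
multiline_ends = re.findall(r"(?m)\w+$", text)

print(default_starts)    # ['first']
print(multiline_starts)  # ['first', 'second', 'third']
print(multiline_ends)    # ['line', 'line', 'line']
```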

Using Subexpressions

  • () Define a subexpression
  • | OR; (ab|cd) matches ab or cd

Subexpressions are commonly used to control precisely what a quantifier applies to. Subexpressions can be nested.
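
A minimal sketch of how grouping changes what a quantifier or an alternation applies to, assuming Python's re module; the inputs are illustrative.

```python
import re

# Without grouping, {2} applies only to the final ";".
ungrouped = re.search(r"&nbsp;{2}", "&nbsp;&nbsp;")

# With grouping, {2} repeats the whole "&nbsp;" sequence.
grouped = re.search(r"(&nbsp;){2}", "&nbsp;&nbsp;")

# Grouping also limits the scope of "|": (19|20) then two digits.
year = re.fullmatch(r"(19|20)\d{2}", "1967")

# Without the group, the alternatives are "19" and "20\d{2}".
loose_year = re.fullmatch(r"19|20\d{2}", "19")

print(ungrouped)            # None
print(grouped.group(0))     # &nbsp;&nbsp;
print(year.group(1))        # 19
print(loose_year.group(0))  # 19
```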

Using Backreferences

  • \1, \2, ... Match the nth subexpression again (use inside the match, e.g. s/(.)\1/a/)

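A backreference lets a regular expression refer to text matched earlier in the same pattern.
\1 stands for the first subexpression in the pattern, \2 for the second, \3 for the third, and so on.

Backreferences can only refer to subexpressions, that is, pattern fragments enclosed in ( and ).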
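
A classic use is finding accidentally doubled words. The sketch below assumes Python's re module and a made-up sentence; \1 must match exactly the text that group 1 captured.

```python
import re

text = "This is is a test of of repeated words"

# (\w+) captures a word; \1 requires that same word to follow.
repeated = re.findall(r"\b(\w+) \1\b", text)

# The same backreference works in the replacement to collapse duplicates.
cleaned = re.sub(r"\b(\w+) \1\b", r"\1", text)

print(repeated)  # ['is', 'of']
print(cleaned)   # This is a test of repeated words
```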

  • \l Make next character lowercase
  • \u Make next character uppercase
  • \L Make entire string (up to \E) lowercase
  • \U Make entire string (up to \E) uppercase
  • \E End \L or \U (so they only apply before \E)
  • \u\L Capitalize first char, lowercase rest (sentence)
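
The case-conversion escapes above belong to engines such as Perl. Python's re module does not support them in replacement strings, so the sketch below only approximates \U...\E and \u\L with replacement functions; the helper name upper_heading and the sample strings are hypothetical.

```python
import re

def upper_heading(match):
    # Roughly "\1\U\2\E\3": keep the tags, uppercase the text between them.
    return match.group(1) + match.group(2).upper() + match.group(3)

html = "<h1>welcome to my homepage</h1>"
result = re.sub(r"(<h1>)(.*?)(</h1>)", upper_heading, html)

# Roughly "\u\L": capitalize the first character, lowercase the rest.
capitalized = re.sub(
    r"\b(\w)(\w*)",
    lambda m: m.group(1).upper() + m.group(2).lower(),
    "hELLO wORLD",
)

print(result)       # <h1>WELCOME TO MY HOMEPAGE</h1>
print(capitalized)  # Hello World
```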

Looking Ahead and Behind

  • Positive lookahead (?=) Look-ahead; m/a(?=b)/ matches the a in ab and “eats” only a
  • Positive lookbehind (?<=) Look-behind; m/(?<=a)b/ matches the b in ab and “eats” only b
  • Negative lookahead (?!) Negative look-ahead
  • Negative lookbehind (?<!) Negative look-behind
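
Lookaround assertions check the surrounding text without including it in the match. A minimal sketch, assuming Python's re module; the URLs and the sentence are made up.

```python
import re

urls = "http://www.example.com https://mail.example.com"

# Lookahead: match word characters only when ":" follows; ":" is not consumed.
protocols = re.findall(r"\w+(?=:)", urls)

sentence = "I paid $30 for 100 apples, 50 oranges"

# Lookbehind: digits preceded by "$"; the "$" is not part of the match.
amounts = re.findall(r"(?<=\$)\d+", sentence)

# Negative lookbehind: whole numbers NOT preceded by "$".
quantities = re.findall(r"\b(?<!\$)\d+\b", sentence)

print(protocols)   # ['http', 'https']
print(amounts)     # ['30']
print(quantities)  # ['100', '50']
```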

Embedding Conditions

  • (?(a)b) Conditional; if a then b
  • (?(a)b|c) Conditional; if a then b else c
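
A sketch of a backreference conditional, assuming Python's re module, where the condition is a group number or name. The phone-number format is illustrative only: a closing parenthesis is required exactly when an opening one was matched, otherwise a hyphen is expected.

```python
import re

# Group 1 optionally captures "(".
# (?(1)\)|-) means: if group 1 matched, require ")", else require "-".
phone = re.compile(r"(\()?\d{3}(?(1)\)|-)\d{3}-\d{4}")

with_parens = phone.fullmatch("(123)456-7890")
with_hyphen = phone.fullmatch("123-456-7890")
unbalanced_open = phone.fullmatch("(123-456-7890")
unbalanced_close = phone.fullmatch("123)456-7890")

print(with_parens is not None)  # True
print(with_hyphen is not None)  # True
print(unbalanced_open)          # None
print(unbalanced_close)         # None
```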

RegExr v2.1 Cheat Sheet


Character classes

.           any character except newline
\w \d \s    word, digit, whitespace
\W \D \S    not word, digit, whitespace
[abc]       any of a, b, or c
[^abc]      not a, b, or c
[a-g]       character between a & g

Anchors

^abc$   start / end of the string
\b \B   word, not-word boundary

Escaped Character

\. \* \\    escaped special characters
\t \n \r    tab, linefeed, carriage return
\u00A9      unicode escaped ©

Groups & Lookaround

(abc)   capture group
\1      backreference to group #1
(?:abc) non-capturing group
(?=abc) positive lookahead
(?!abc) negative lookahead

Quantifiers & Alternation

a* a+ a?    0 or more, 1 or more, 0 or 1
a{5} a{2,}  exactly five, two or more
a{1,3}      between one & three
a+? a{2,}?  match as few as possible
ab|cd       match ab or cd