Regular Expressions or ‘regex’ are extremely powerful and utterly invaluable.
Their primary use is finding patterns in text, and in cyber security text is everywhere: inside logs, artefacts and documents. The supply of artefacts and documents can be endless, and different vendors or products may have their own proprietary formats.
This article is going to cover the use of regular expressions with plenty of examples.
This article will not explain the meaning of each specific character (these are explained in the Glossary), nor will it cover how to open or read particular artefacts or documents to obtain plain text data, as the ways of doing so are endless.
Some artefacts and logs are well structured, such as syslog, Windows Event logs, IIS web server logs, or even CSV and JSON. In these cases, we may want to extract sub-sections of already extracted fields. Further, regex truly comes into its own when we encounter unstructured or inconsistently formatted logs.
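As a minimal sketch of extracting a sub-section from an already extracted field, assume a query-string field has already been parsed out of a web server log. The field value, the parameter names and the `extract_session` helper below are all invented for illustration and do not come from any real log.

```python
import re

# Hypothetical value of an already extracted query-string field (invented for illustration).
uri_query = "user=jsmith&session=8f3a2c&redirect=/admin/login.aspx"

# Match "session=" at the start of the field or after "&", then capture
# everything up to the next "&" (or the end of the string).
SESSION_RE = re.compile(r"(?:^|&)session=([^&]+)")

def extract_session(field):
    """Return the session value from a query-string field, or None if absent."""
    match = SESSION_RE.search(field)
    return match.group(1) if match else None

print(extract_session(uri_query))
```

Anchoring on the start of the field or a preceding `&` is a deliberate choice here: it stops the pattern from matching a different parameter whose name merely ends in `session`.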
When applying regex to a log or blob of text, I would recommend reading through the data first, or at least having a sample log open alongside your regex. As stated, some data is better structured (and/or better documented) than other data, so how easily you can understand the schema, or lack thereof, will vary.
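One way to keep a sample log "open alongside your regex" is to run a draft pattern over a handful of sample lines and list the ones it misses. The sketch below assumes three invented log lines and a hypothetical helper named `split_by_match`; it illustrates the habit and is not a prescribed workflow.

```python
import re

# Invented sample lines standing in for an inconsistently formatted log.
sample_lines = [
    "2023-04-01 12:00:01 LOGIN user=alice src=10.0.0.5",
    "2023-04-01 12:00:07 LOGIN user=bob src=10.0.0.9",
    "Apr  1 12:00:09 login failure for carol from 10.0.0.7",
]

# Draft pattern written after looking at the first line only.
DRAFT_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} LOGIN user=(\w+) src=([\d.]+)$"
)

def split_by_match(pattern, lines):
    """Return (matched, unmatched) lists so format surprises are easy to spot."""
    matched, unmatched = [], []
    for line in lines:
        if pattern.search(line):
            matched.append(line)
        else:
            unmatched.append(line)
    return matched, unmatched

matched, unmatched = split_by_match(DRAFT_RE, sample_lines)
print(f"{len(matched)} matched, {len(unmatched)} unmatched")
for line in unmatched:
    print("no match:", line)
```

The unmatched list is the useful part: each entry is a line format the draft pattern does not yet account for.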
I will explain some of the core concepts to help you get started but I primarily want to focus on combining these characters for specific use cases. There is a glossary which helps explain all the character meanings.
To validate your regex, leverage https://regex101.com/. It explains the construction and extraction as you write the expression and has saved my skin more than once!
Without further ado, let’s get started…
[…] - Character set: any character in the brackets can be matched. A leading ^ negates the content of the character set.
{N} - Quantifier: the number of times to match the preceding character (a minimum and maximum can be given as {N,M})
\ - Escape: match a literal character by preceding it with a backslash
* - Zero or more: match the preceding character 0 or more times
+ - One or more: match the preceding character 1 or more times
(…) - Capture group: identify the content between the parentheses as a ‘group’
(?:...) - Non-capture group
(?<name>...) - Named capture group
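To see the glossary entries working together, here is a minimal sketch in Python against a single invented log line. One assumption to flag loudly: Python's `re` module writes named groups as `(?P<name>...)`, whereas the `(?<name>...)` form listed above is the one used by flavours such as PCRE, .NET and JavaScript. The log line and variable names are hypothetical.

```python
import re

# Invented log line used only to exercise the glossary entries.
line = "[WARN] host=web-01 code=404 path=/index.php"

# \[ and \]  escape: literal square brackets
# [A-Z]+     character set with "+": one or more uppercase letters
# (...)      capture group: keep only the text inside the brackets
level = re.search(r"\[([A-Z]+)\]", line).group(1)

# \s*        "*": zero or more whitespace characters (zero in this line)
# \d{3}      quantifier: exactly three digits
code = re.search(r"code=\s*(\d{3})", line).group(1)

# [^\s]+     negated character set: one or more non-whitespace characters
path = re.search(r"path=([^\s]+)", line).group(1)

# (?:...)    non-capture group: groups "letters plus hyphen" for the "+"
#            without taking a group number, so the digits are group 1
host_number = re.search(r"host=(?:[a-z]+-)+(\d+)", line).group(1)

# (?P<name>...) named capture group in Python's syntax
host_name = re.search(r"host=(?P<host>[\w-]+)", line).group("host")

print(level, code, path, host_number, host_name)
```

Each pattern is kept separate so that a single construct can be changed and retested on its own, for example in regex101 with the Python flavour selected.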