Semgrep can match generic patterns in languages that it does not yet support. Use generic pattern matching for languages that do not have a parser, configuration files, or other structured data such as XML. Generic pattern matching can also be helpful in files containing multiple languages, even if the languages are otherwise supported, such as HTML with embedded JavaScript or PHP code. In those cases, you can also consider Extract mode (experimental), but generic patterns may be more straightforward and still effective.

As an example of generic matching, consider this rule:
rules:
  - id: dynamic-proxy-scheme
    pattern: proxy_pass $$SCHEME:// ...;
    paths:
      include:
        - "*.conf"
        - "*.vhost"
        - sites-available/*
        - sites-enabled/*
    languages:
      - generic
    severity: MEDIUM
    message: >-
      The protocol scheme for this proxy is dynamically determined.
      This can be dangerous if the scheme is injected by an attacker
      because it may forcibly alter the connection scheme.
      Consider hardcoding a scheme for this proxy.
    metadata:
      references:
        - https://github.com/yandex/gixy/blob/master/en/plugins/ssrf.md
      category: security
      technology:
        - nginx
      confidence: MEDIUM
Generic pattern matching has the following properties:
A document is interpreted as a nested sequence of ASCII words, ASCII punctuation, and other bytes.
... (ellipsis operator) allows skipping non-matching elements, up to 10 lines down from the last match.
$X (metavariable) matches any word.
$...X (ellipsis metavariable) matches a sequence of words, up to 10 lines down from the last match.
Indentation determines primary nesting in the document.
Common ASCII braces (), [], and {} introduce secondary nesting but only within single lines. Therefore, misinterpreted or mismatched braces don’t disturb the structure of the rest of the document.
The document must be at least as indented as the pattern: any indentation specified in the pattern must be honored in the document.
Semgrep can reliably understand the syntax of natively supported languages. The generic mode is useful for unsupported languages and consequently brings specific limitations.
CAUTION: The quality of results in the generic mode can vary depending on the language you use it for.
The generic mode works fine with any human-readable text, as long as it is primarily based on ASCII symbols. Since the generic mode does not understand the syntax of the language you are scanning, the quality of the result may differ from language to language or even depend on specific code. As a consequence, the generic mode works well for some languages, but it does not always give consistent results. Generally, it’s possible or even easy to write code in weird ways that prevent generic mode from matching.

Example: In XML, one can write &#x48;&#x65;&#x6C;&#x6C;&#x6F; instead of Hello. If a rule pattern in generic mode is Hello, Semgrep is unable to match &#x48;&#x65;&#x6C;&#x6C;&#x6F;, unlike if it had full XML support.

With respect to Semgrep operators and features:
Metavariable support is limited to capturing a single “word”, which is a token of the form [A-Za-z0-9_]+. Metavariables can’t capture sequences of tokens such as hello, world (in this case, there are three tokens: hello, ,, and world).
The ellipsis operator is supported and spans, at most, 10 lines.
The pattern operators like either/not/inside are supported.
Inline regular expressions for strings ("=~/word.*/") are not supported.
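To make the notion of a “word” concrete, the following Python sketch splits text into tokens along the lines described above: runs of [A-Za-z0-9_] form words, and every other non-whitespace character stands alone. It is only a rough mental model and not Semgrep's actual tokenizer; the tokenize helper and its regular expression are assumptions made for illustration.

```python
import re

# Toy model only: a "word" is [A-Za-z0-9_]+ (the form given above); any other
# non-whitespace character is treated as a one-character token.
TOKEN = re.compile(r"[A-Za-z0-9_]+|[^\sA-Za-z0-9_]")


def tokenize(text):
    """Split text into word tokens and single non-word characters."""
    return TOKEN.findall(text)


if __name__ == "__main__":
    # Three tokens: a metavariable could bind to "hello" or to "world",
    # but never to the whole sequence.
    print(tokenize("hello, world"))  # ['hello', ',', 'world']
```

Under this model, a metavariable such as $X can bind to exactly one of the word tokens, which is why a value containing punctuation cannot be captured by a plain metavariable.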
This section explains how to use Semgrep’s generic mode to match single lines of code using an ellipsis metavariable. Many simple configuration formats are collections of key and value pairs delimited by newlines. For example, consider extracting the password value from the following made-up input:
password = p@$$w0rd
server = example.com
Unfortunately, the following pattern does not match the whole line. In generic mode, metavariables only capture a single word (alphanumeric sequence):
password = $PASSWORD
This pattern matches the input file but assigns only the value p to $PASSWORD instead of the full value p@$$w0rd.

To match an arbitrary sequence of items and capture their value in the example:
Use a named ellipsis by changing the pattern to the following:
password = $...PASSWORD
With this change, Semgrep captures too much information: the value assigned to $...PASSWORD is now p@$$w0rd and server = example.com. In generic mode, an ellipsis extends until the end of the current block or up to 10 lines below, whichever comes first. To prevent this behavior, continue with the next step.
In the Semgrep rule, specify the following key:
generic_ellipsis_max_span: 0
This option forces the ellipsis operator to match patterns within a single line.
Example of the resulting rule:
id: password-in-config-file
pattern: |
  password = $...PASSWORD
options:
  # prevent ellipses from matching multiple lines
  generic_ellipsis_max_span: 0
message: |
  password found in config file: $...PASSWORD
languages:
  - generic
severity: WARNING
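The difference between a plain metavariable and a single-line ellipsis metavariable can be approximated with ordinary regular expressions. The Python sketch below is an analogy under stated assumptions, not Semgrep's matching algorithm: capture_word and capture_line are hypothetical helpers, and the sample input is the made-up configuration discussed in this section.

```python
import re
from typing import Optional

# Made-up input from this section.
CONFIG = "password = p@$$w0rd\nserver = example.com\n"

# Analogy for $PASSWORD: binds to a single word of the form [A-Za-z0-9_]+.
WORD_CAPTURE = re.compile(r"password[ \t]*=[ \t]*([A-Za-z0-9_]+)")

# Analogy for $...PASSWORD with generic_ellipsis_max_span: 0:
# binds to everything up to the end of the same line.
LINE_CAPTURE = re.compile(r"password[ \t]*=[ \t]*([^\n]*)")


def capture_word(text: str) -> Optional[str]:
    """Return the single word after 'password =', or None if absent."""
    m = WORD_CAPTURE.search(text)
    return m.group(1) if m else None


def capture_line(text: str) -> Optional[str]:
    """Return the rest of the line after 'password =', or None if absent."""
    m = LINE_CAPTURE.search(text)
    return m.group(1) if m else None


if __name__ == "__main__":
    print(capture_word(CONFIG))  # p
    print(capture_line(CONFIG))  # p@$$w0rd
```

The word capture stops at the first character outside [A-Za-z0-9_], while the line capture stops at the newline, so the server line is never included.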
By default, the generic mode does not know about comments or other text that can be ignored. The following example scans for CSS code that sets the text color to blue. The target code is the following:
color: /* my fave color */ blue;
Use the options.generic_comment_style option to ignore C-style comments, as is the case in the example.
The Semgrep rule is:
id: css-blue-is-ugly
pattern: |
  color: blue
options:
  # ignore comments of the form /* ... */
  generic_comment_style: c
message: |
  Blue is ugly.
languages:
  - generic
severity: WARNING
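One way to picture why the comment option matters is to compare token sequences with and without the comment removed. The following Python sketch is a simplified model, assuming a pattern matches when its tokens appear contiguously in the target; tokenize and contains are hypothetical helpers and do not reflect how Semgrep implements comment handling.

```python
import re

# Toy model: C-style comments and a word/punctuation tokenizer.
C_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
TOKEN = re.compile(r"[A-Za-z0-9_]+|[^\sA-Za-z0-9_]")

TARGET = "color: /* my fave color */ blue;"
PATTERN = "color: blue"


def tokenize(text, ignore_c_comments=False):
    """Split text into tokens, optionally dropping /* ... */ comments first."""
    if ignore_c_comments:
        text = C_COMMENT.sub(" ", text)
    return TOKEN.findall(text)


def contains(haystack, needle):
    """Return True if needle occurs as a contiguous run inside haystack."""
    n = len(needle)
    return any(haystack[i:i + n] == needle
               for i in range(len(haystack) - n + 1))


if __name__ == "__main__":
    pattern_tokens = tokenize(PATTERN)
    # Comment tokens sit between ':' and 'blue', so there is no match.
    print(contains(tokenize(TARGET), pattern_tokens))  # False
    # With comments ignored, the tokens are contiguous.
    print(contains(tokenize(TARGET, ignore_c_comments=True), pattern_tokens))  # True
```

In this model the comment contributes its own tokens between the colon and blue, which is what prevents the match until comments are ignored.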
In the Semgrep code, the generic pattern matching implementation is called spacegrep because it tokenizes based on whitespace (and because it sounds cool 😎).