Why you need to be careful when writing regex

2024-11-27

Regular expressions are the de facto way to pattern match when programming.

Powerful as they are, regular expressions carry some serious risks when used improperly - let's take a look at some of them and see how we can write better regex.

If you have written code before, you have almost certainly written, or at least come across, regular expressions. I've seen code where people use them for simple substring searches, for which there are faster and more efficient algorithms [1]. Even if regex (over-)usage projects the image of a 10x developer, it's not always done safely.

Quick refresher

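A regex specifies a match pattern. In the textbook construction, the pattern is converted into an NFA and then into a minimized DFA, and the input string is fed through that automaton to find a match. Many popular engines use backtracking instead, which becomes relevant in the ReDoS section.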

The complete syntax can vary from engine to engine, but most of the core syntax remains common.

The popular mnemonic "I before E except after C" can be modelled as the regex ([c]ei|[^c]ie), and a flowchart for it looks like this.

[Figure: Automaton for I/E/C mnemonic]
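To make the mnemonic pattern concrete, here is a small Java sketch that runs it against a few words. The class and method names (IBeforeE, followsRule) are made up for illustration; only the pattern comes from the text.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original post.
class IBeforeE {
    // The pattern from the text: "cei", or "ie" preceded by anything but "c".
    static final Pattern RULE = Pattern.compile("([c]ei|[^c]ie)");

    // find() looks for the pattern anywhere inside the word.
    static boolean followsRule(String word) {
        return RULE.matcher(word).find();
    }
}
```

Calling IBeforeE.followsRule("receive") or followsRule("believe") should report a match, while "science" and "weird" (two well-known exceptions to the mnemonic) should not.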

Threats

Regex is most frequently used for input validation - you want to check whether a certain input string, either from the user or from another system, follows a set of rules you define, after which you can proceed with the rest of the application.

Consider a simple application where a method accepts an IPv4 address as input and checks whether that IP is reachable.

A naive programmer would simply write the method as follows (Java-flavoured pseudocode, where execute stands in for something that runs a shell command):

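import java.util.regex.Pattern;

public boolean isReachable(String ip) {
    // Deliberately naive pattern; its flaws are discussed in the sections below
    Pattern regex = Pattern.compile("(\\d{1,3}.?){4}");
    // find() accepts the input if any part of it matches
    if (!regex.matcher(ip).find())
        throw new IllegalArgumentException("Invalid IP: " + ip);
    // execute() stands in for a helper that runs a shell command
    return execute("ping " + ip);
}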

Let's see how this can cause issues in our system.

Incorrect Regular Expressions

The most common vulnerability is an incorrect regex [2], where your pattern accepts input you never intended to allow. There are a few ways to identify this, and sometimes SAST tools can help and notify you. As a general rule of thumb, always review the syntax carefully and make sure you understand exactly what your regex allows and what it doesn't.

Write extensive unit tests, use tools like RegExr to read through the explanation of your regex, and extend the tests to cover scenarios you don't want to allow. Remember, just because it matched your use case doesn't mean it's correct.

In the above example, even invalid IP addresses fall through, and if your program doesn't handle invalid IPs appropriately further on, it might open the door to bugs and vulnerabilities.

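  • In many engines (for example .NET, or Python 3 on str patterns), \d matches every Unicode decimal digit, not just the Arabic numerals 0-9.
  • An unescaped . doesn't represent a literal ".", it is a wildcard for "any character".
  • We aren't restricting each octet to the valid IPv4 range (0.0.0.0 to 255.255.255.255).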
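The three problems above can be checked with a short Java sketch. NaiveIpCheck and its methods are hypothetical names; only the pattern is taken from the post. One assumption worth calling out: Java's \d is ASCII-only by default, so the sketch turns on Pattern.UNICODE_CHARACTER_CLASS to imitate engines where \d covers all Unicode digits.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern itself comes from the post.
class NaiveIpCheck {
    static final String NAIVE = "(\\d{1,3}.?){4}";

    // Unanchored check, mirroring the naive method above.
    static boolean naiveAccepts(String input) {
        return Pattern.compile(NAIVE).matcher(input).find();
    }

    // Same pattern, but with the digit class widened to all Unicode decimal digits.
    static boolean naiveAcceptsUnicode(String input) {
        return Pattern.compile(NAIVE, Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(input)
                      .find();
    }
}
```

With this sketch, naiveAccepts should return true for "999.999.999.999" (octets out of range), "1a2b3c4" (the wildcard matched letters) and "1234" (no dots at all). naiveAcceptsUnicode should also accept "\u0661.\u0662.\u0663.\u0664", which is 1.2.3.4 written in Arabic-Indic digits.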

A simplified and correct version of this regex can be written as ((25[0-5]|(2[0-4]|1\d|[1-9]|)\d)\.?\b){4} [3], which rejects invalid IPs when it is matched against the whole string.

The regex looks complicated and messy, but a messy pattern that is correct is safer than a tidy one that is wrong.
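As a rough check of the stricter pattern, the Java sketch below applies it with matches(), which only succeeds when the entire input fits the pattern. StrictIpCheck and isValid are illustrative names, not anything from the original post.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper around the pattern quoted in the text.
class StrictIpCheck {
    // Each octet is limited to 0-255 by the alternation.
    static final Pattern IPV4 =
        Pattern.compile("((25[0-5]|(2[0-4]|1\\d|[1-9]|)\\d)\\.?\\b){4}");

    // matches() requires the whole input to match, not just a substring.
    static boolean isValid(String input) {
        return IPV4.matcher(input).matches();
    }
}
```

Under this sketch, "192.168.1.1" and "255.255.255.255" should be accepted, while "256.1.1.1", "1234", "1.2.3" and "1.2.3.4." should all be rejected. These are exactly the kind of cases worth pinning down in unit tests.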

Regex without anchors

Anchors in regex denote positions in the input string rather than characters. The most commonly used ones are ^ for the beginning of the string and $ for the end of the string. Regex patterns without anchors are highly prone to injection attacks [4].

A simple example is the above code block again. Because we aren't explicitly matching the full string, any input that merely contains a valid IPv4 address as a substring will still pass the check.

So if a malicious user were to provide input similar to 0.0.0.0; ls -al;, the check would pass, and because the input is concatenated into a shell command, they would be able to run arbitrary commands on your machine.

Mitigating this is simple - add anchors to all your regex patterns, if the library you're using does not already add them for you.
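Here is a minimal Java sketch of the difference anchors make, using the IPv4 pattern quoted earlier. AnchorDemo and its method names are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing unanchored and anchored checks.
class AnchorDemo {
    static final String OCTETS = "((25[0-5]|(2[0-4]|1\\d|[1-9]|)\\d)\\.?\\b){4}";

    static final Pattern UNANCHORED = Pattern.compile(OCTETS);
    static final Pattern ANCHORED = Pattern.compile("^" + OCTETS + "$");

    // Passes as soon as any substring looks like an IPv4 address.
    static boolean looseCheck(String input) {
        return UNANCHORED.matcher(input).find();
    }

    // ^ and $ pin the match to the start and end of the input.
    static boolean anchoredCheck(String input) {
        return ANCHORED.matcher(input).find();
    }

    // matches() demands that the whole input is consumed by the pattern.
    static boolean fullMatch(String input) {
        return UNANCHORED.matcher(input).matches();
    }
}
```

With the input 0.0.0.0; ls -al; the loose check should pass while the anchored one fails. One Java-specific caveat: by default $ also matches just before a trailing line terminator, so anchoredCheck("0.0.0.0\n") still passes. Using matches() (or \z instead of $) is the stricter choice there.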

In some cases you might specifically want a match in only a part of the string - there are a few ways to get around this.

  • You can restrict the other characters allowed outside your expected match

    e.g. ^[a-zA-Z]*<desired-regex>[a-zA-Z]*$ would prevent code-like semicolons from causing possible injections.

  • You can keep it un-anchored, but you need to be cautious that the code which uses the input string cannot be misused.
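As a sketch of the first option, suppose the desired match is a hypothetical 4-digit code that may be surrounded by letters only. The names below (PartialMatch, extractCode, the "code" group) are assumptions made for the example.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example: pull a 4-digit code out of a letters-only wrapper.
class PartialMatch {
    // Letters may surround the code; nothing else is allowed anywhere.
    static final Pattern CODE =
        Pattern.compile("^[a-zA-Z]*(?<code>\\d{4})[a-zA-Z]*$");

    // Returns the captured digits, or empty if the input has other characters.
    static Optional<String> extractCode(String input) {
        Matcher m = CODE.matcher(input);
        return m.matches() ? Optional.of(m.group("code")) : Optional.empty();
    }
}
```

Here extractCode("order1234abc") should yield 1234, while an input like 1234; ls -al; yields nothing because the semicolons and spaces fall outside the allowed character set.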

ReDoS

A regex pattern is considered "evil" if it gets stuck on some specially crafted input [5]. These patterns usually arise from nested groups with repetition. Evil regex patterns cause ReDoS, or Regular expression Denial of Service: matching time can grow exponentially with the input length, so a single request can hog server resources.

A simple example is the regex ^(x|x)+y$ - the group (x|x) has 2 alternatives that match the same character. So, whenever the rest of the match fails, the engine backtracks and tries the other alternative.

Head over to RegExr, select the JavaScript engine, set the regex to ^(x|x)+y$ and add the text xxxxxxxxxxxxxxz - then incrementally increase the number of x's. The processing time increases exponentially until it times out.

This is because after the engine goes through all the x's in the input string and does not find a matching "y" (we have a "z"), it backtracks one character and tries the next path. This repeats roughly 2^n times, where n is the number of x's.

This problem does not occur in text-directed engines [6]. Most engines today are regex-directed, which allows cool features like backreferences.
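To see where the doubling comes from, here is a toy backtracking matcher for this one pattern, written in Java. It is a hand-rolled model for counting steps, not how any real engine is implemented, and all names in it are invented for the example.

```java
// Toy model of a backtracking engine for the single pattern ^(x|x)+y$.
class BacktrackingToy {
    static long steps;

    // Tries to match the rest of the pattern starting at index i.
    // seenX records whether the group has matched at least once.
    static boolean match(String s, int i, boolean seenX) {
        steps++;
        if (i < s.length() && s.charAt(i) == 'x') {
            // Two identical alternatives: when the first path fails,
            // the engine backs up and walks the second one too.
            if (match(s, i + 1, true)) return true;
            if (match(s, i + 1, true)) return true;
        }
        // Leave the loop: a "y" must be the very last character.
        return seenX && i == s.length() - 1 && s.charAt(i) == 'y';
    }

    // Resets the counter, runs the match, and reports how many steps it took.
    static long countSteps(String s) {
        steps = 0;
        match(s, 0, false);
        return steps;
    }

    // Builds n x's followed by a "z", the failing input from the text.
    static String evilInput(int n) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < n; k++) sb.append('x');
        return sb.append('z').toString();
    }
}
```

In this model, countSteps(evilInput(n)) works out to 2^(n+1) - 1, so every extra x doubles the work, while a matching input such as xxy finishes in a handful of steps.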

One simple solution would be to move to a different engine where this problem isn't present - but you might actually want the extra functionality your current engine offers. In this case, a simple timeout ensures things don't get out of hand.

Always set a timeout on your regex matching, and if a match ever breaches it, investigate. You can also use linters to identify evil patterns preemptively.

Cloudflare had an interesting encounter with ReDoS in 2019, where a single regex exhausted CPU across their network and a large number of sites behind Cloudflare returned errors till they fixed it. Their blog post explaining the scenario is a very interesting read. Cloudflare mitigated this by moving to an engine with run-time guarantees.
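Not every engine ships a timeout option. java.util.regex, for instance, has none built in, so one common workaround is to wrap the input in a CharSequence that checks a deadline on every character read. The sketch below takes that approach; all class names in it are hypothetical, and it is a starting point rather than a hardened implementation.

```java
import java.time.Duration;
import java.util.regex.Pattern;

// Hypothetical exception type signalling that the time budget ran out.
class RegexTimeoutException extends RuntimeException {
    RegexTimeoutException(String message) { super(message); }
}

// Wraps the input so that every character read checks the deadline.
class DeadlineCharSequence implements CharSequence {
    private final CharSequence inner;
    private final long deadlineNanos;

    DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
        this.inner = inner;
        this.deadlineNanos = deadlineNanos;
    }

    @Override public char charAt(int index) {
        // The regex engine reads characters constantly while backtracking,
        // so this check fires soon after the deadline passes.
        if (System.nanoTime() - deadlineNanos >= 0) {
            throw new RegexTimeoutException("regex match exceeded its time budget");
        }
        return inner.charAt(index);
    }

    @Override public int length() { return inner.length(); }

    @Override public CharSequence subSequence(int start, int end) {
        return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
    }

    @Override public String toString() { return inner.toString(); }
}

// Hypothetical entry point: a full match that gives up after the timeout.
class TimedRegex {
    static boolean matches(Pattern pattern, String input, Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        return pattern.matcher(new DeadlineCharSequence(input, deadline)).matches();
    }
}
```

A caller would catch RegexTimeoutException, reject the input, and log the pattern and input for investigation. Engines with native support (for example the matchTimeout argument on .NET's Regex constructor) make this wrapper unnecessary.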

Should I just not use regex?

No, you should keep using it. Regex is very powerful and efficient for pattern matching - you just need to be careful when you write your patterns.

The above list is not exhaustive, and there are more vulnerabilities with regex. Head over to regular-expressions.info to explore more of the world of regex!

Till the next post, farvel!

References