What are Regular Expressions? (Regex)

Sooner or later, anyone who programs on a regular basis will come across regular expressions, either out of necessity or out of interest. At first sight, one might look like a bunch of randomly scattered characters that make no sense whatsoever. Which raises the question: what exactly is a regular expression, and why should I use one if I was doing just fine so far? Well, a regular expression is a sequence of characters put together in a specific order to define a search pattern. It can be used to process large amounts of text to find, or find and replace, certain strings, or even, as you'll see in this post, to find strings and add code to them.
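
To make "find" and "find and replace" concrete, here is a minimal sketch using Python's built-in re module. The sample sentence and the date pattern are made up for illustration and have nothing to do with the highlighting script later in the post.

```python
import re

# Made-up sample text, for illustration only
text = "Order 66 shipped on 2024-01-15, order 67 on 2024-02-03."

# Find: every substring that looks like YYYY-MM-DD
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# Find and replace: swap every such substring for a placeholder
masked = re.sub(r"\d{4}-\d{2}-\d{2}", "<date>", text)

print(dates)
print(masked)
```

The same pattern drives both calls; only the action taken on each match differs.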

In practice, you will encounter regular expressions in search engines that use them to find, filter or replace strings, and on forums and social networks that use them to automatically edit or reject inappropriate words. Text editors use the same logic for syntax highlighting, among other things. Personally, I had to find a way to automatically syntax-highlight the code that goes between <pre> and <code> tags in my articles. On one hand, doing it manually each time I wanted to publish an article would be counterproductive in the long run; on the other, I wanted to avoid 3rd party libraries that would additionally encumber the user experience. So I had no choice but to resort to regular expressions.

How to use regular expressions for syntax highlighting?

Instead of guiding you through the theoretical aspect of regular expressions, which btw you can read about here, I will show you what I use to process the code part of a written article before posting. All three programming languages that I'm familiar with provide regex capabilities, but I have chosen Python for this purpose because of its general simplicity. The example that we'll take a closer look at shows how to highlight SQL.

Before we jump straight to it, note that I already have the styles for the custom tags defined in my CSS file. This is how they look:


pre dr {color: #910000;}
pre g {color:green;}
pre pi {color:#b70058;}

Of course, if you want to replicate it exactly, you'll also need to define CSS for the <pre> and <code> elements themselves, but for now let's stick to the goal. Here is the Python code used for the transformation:


#!/usr/bin/python3
# Importing the regex module
import re

# Lists of the different kinds of strings that will be formatted
commands = ['CREATE', 'TABLE', 'NOT', 'UNIQUE', 'DEFAULT']
sqltype = ['int', 'varchar', 'char']
null = ['NULL']

# The text that needs to be formatted goes here (between """ """)
ins = """CREATE TABLE `sql_users` (
  `id` int(11) NOT NULL PRIMARY KEY AUTO_INCREMENT,
  `first_name` varchar(100) COLLATE utf8_bin NOT NULL,
  `last_name` varchar(100) COLLATE utf8_bin NOT NULL,
  `address` varchar(100) COLLATE utf8_bin DEFAULT NULL,
  `username` varchar(100) COLLATE utf8_bin NOT NULL UNIQUE KEY,
  `password` char(64) COLLATE utf8_bin NOT NULL,
  `email` varchar(250) COLLATE utf8_bin NOT NULL UNIQUE KEY,
  `status` varchar(10) COLLATE utf8_bin NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;"""

# Split the text into a list of tokens according to the pattern.
# Note: characters the pattern does not match (such as , and ;) are left out.
patternlist = re.findall(r"\`\w+\`|\=|\w+|\s|[()]", ins)

# Start with an empty string and loop through the token list:
# wrap the targeted tokens in our custom tags,
# otherwise append the token unchanged.
reformed = ''
for i in patternlist:

	if i in commands:
		reformed += '<dr>'+i+'</dr>'

	elif i in sqltype:
		reformed += '<g>'+i+'</g>'

	elif i in re.findall(r"\`\w+\`|\d+", ins) or i in null:
		reformed += '<pi>'+i+'</pi>'

	else:
		reformed += i

# Uncomment to see what it adds to the list
#print(patternlist)
print(reformed)

First, if you're not familiar with regex, take your time to process it; perhaps run the code yourself, modify something and try again to get a feel for the basics. Second, note that the lists here are kept to a bare minimum, just what is needed for the example; if you plan to use the snippet, you'll probably have to expand them to suit your needs. And finally, let's take a look at the patterns used:

|

This character, in this context, separates the alternative patterns. If we wanted to find a literal |, we would have to enclose it in [], as in [|], or escape it as \|. So whenever you see this sign on its own, it is there as a pattern separator.
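
A small sketch of the difference between the separator and the literal character. The sample string is made up for the purpose.

```python
import re

# Made-up sample, for illustration only
sample = "a|b c=d"

# Unescaped, | separates alternatives: match "a" or "b" or "c"
alternatives = re.findall(r"a|b|c", sample)

# Inside [] or escaped with a backslash, it is a literal pipe character
in_class = re.findall(r"[|]", sample)
escaped = re.findall(r"\|", sample)

print(alternatives)
print(in_class)
print(escaped)
```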


\`\w+\`

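This matches a name enclosed in ` signs. \` finds the opening sign (the backslash is optional here, since ` has no special meaning in a regex). After it, \w+ tells it to match not just a single word character (a Unicode letter, digit or underscore) but as many as follow, until the run is closed by the second \`, which matches the closing ` sign. The same \w+ appears on its own further along in the pattern, where it picks up bare words and numbers such as CREATE, varchar or 100.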
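
Here is a short sketch of that pattern on its own, run against one line of the table definition. The last call uses two made-up names to show where \w+ stops.

```python
import re

line = "`first_name` varchar(100) COLLATE utf8_bin NOT NULL,"

# Backtick, one or more word characters, backtick
plain = re.findall(r"`\w+`", line)

# The escaped form used in the script behaves the same way
escaped = re.findall(r"\`\w+\`", line)

# \w does not match a space or a hyphen, so these made-up names fail
broken = re.findall(r"`\w+`", "`first name` `first-name`")

print(plain)
print(escaped)
print(broken)
```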

\=

This matches a single = character and adds it to the list.

\s

This matches a single whitespace character (space, tab or newline) and adds it to the list, so that it can later be appended to the output string and the original spacing is preserved.
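
To see why \s matters, compare the script's pattern with a variant that leaves it out. This is only a sketch on a shortened, made-up table definition.

```python
import re

# Shortened, made-up table definition, for illustration only
sql = "CREATE TABLE `t` (\n  `id` int(11)\n)"

# The pattern from the script, and the same pattern without \s
with_ws = re.findall(r"\`\w+\`|\=|\w+|\s|[()]", sql)
without_ws = re.findall(r"\`\w+\`|\=|\w+|[()]", sql)

# With \s, joining the tokens gives back the input unchanged
print("".join(with_ws) == sql)

# Without it, spaces and line breaks are lost
print("".join(without_ws))
```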


[()]

This matches and preserves parentheses: each one is added to the list as a separate element and then appended to the output string unmodified.
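
The two single-character patterns can be tried on fragments of the table definition. This is only an illustrative sketch.

```python
import re

column = "`password` char(64)"
options = ") ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;"

# [()] is a character class: it matches one character that is either
# ( or ), so each parenthesis becomes its own list element
parens = re.findall(r"[()]", column)

# \= matches a literal equals sign (the backslash is optional)
equals = re.findall(r"\=", options)

print(parens)
print(equals)
```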

As for the loop, as said in the comments, it checks whether each element of the pattern list is in any of the lists defined at the top and, if so, processes it accordingly. There is another pattern in the condition, which is what turns strings pink:


\`\w+\`|\d+

Again, the first part matches names between ` signs, and the second part, \d+, matches runs of Unicode decimal digits. Now if we run the SQL table through the script, it will return the following:


<dr>CREATE</dr> <dr>TABLE</dr> <pi>`sql_users`</pi> (
  <pi>`id`</pi> <g>int</g>(<pi>11</pi>) <dr>NOT</dr> <pi>NULL</pi> PRIMARY KEY AUTO_INCREMENT
  <pi>`first_name`</pi> <g>varchar</g>(<pi>100</pi>) COLLATE utf8_bin <dr>NOT</dr> <pi>NULL</pi>
  <pi>`last_name`</pi> <g>varchar</g>(<pi>100</pi>) COLLATE utf8_bin <dr>NOT</dr> <pi>NULL</pi>
  <pi>`address`</pi> <g>varchar</g>(<pi>100</pi>) COLLATE utf8_bin <dr>DEFAULT</dr> <pi>NULL</pi>
  <pi>`username`</pi> <g>varchar</g>(<pi>100</pi>) COLLATE utf8_bin <dr>NOT</dr> <pi>NULL</pi> <dr>UNIQUE</dr> KEY
  <pi>`password`</pi> <g>char</g>(<pi>64</pi>) COLLATE utf8_bin <dr>NOT</dr> <pi>NULL</pi>
  <pi>`email`</pi> <g>varchar</g>(<pi>250</pi>) COLLATE utf8_bin <dr>NOT</dr> <pi>NULL</pi> <dr>UNIQUE</dr> KEY
  <pi>`status`</pi> <g>varchar</g>(<pi>10</pi>) COLLATE utf8_bin <dr>NOT</dr> <pi>NULL</pi>
) ENGINE=InnoDB <dr>DEFAULT</dr> CHARSET=utf8 COLLATE=utf8_bin
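
One thing the output above makes visible: the commas at the end of the column lines and the closing semicolon are gone. re.findall only returns what the pattern matches, so any character that none of the alternatives cover is silently dropped. Below is a minimal sketch of one way around it, assuming a catch-all alternative [^\w\s] appended to the pattern. That alternative is my addition and is not part of the script above.

```python
import re

line = "`id` int(11) NOT NULL PRIMARY KEY AUTO_INCREMENT,"

# The pattern exactly as used in the script
original = r"\`\w+\`|\=|\w+|\s|[()]"

# Assumption: one extra alternative, tried last, that matches any single
# character that is neither a word character nor whitespace
with_catch_all = original + r"|[^\w\s]"

lost = "".join(re.findall(original, line))
kept = "".join(re.findall(with_catch_all, line))

print(lost == line)   # the trailing comma is missing
print(kept == line)   # every character survives
```

Because the alternatives are tried from left to right, the earlier ones still win for backticked names, = and parentheses; the catch-all only picks up what would otherwise be lost.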

Note that if I hadn't encoded < and > within the <pre> and <code> tags, the browser would treat them as tags, apply the CSS and color the code, giving the desired syntax highlighting effect.
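
For completeness, here is a sketch of how such encoding could be done with Python's standard html module. This is my choice for illustration; the post doesn't say how the listing above was encoded.

```python
import html

# First line of the generated markup
reformed = "<dr>CREATE</dr> <dr>TABLE</dr> <pi>`sql_users`</pi> ("

# html.escape replaces <, > and & with entities, so the browser prints
# the tags as text instead of interpreting them
escaped = html.escape(reformed, quote=False)

print(escaped)
```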

In case you were wondering, I use a separate script for each language. I did try to process HTML and PHP together but failed badly. Every now and then I still have to correct an article manually or rework a script, but this is only a minor setback compared to what I was trying to accomplish. Anyway, that would be all for now; have fun and feel free to leave a comment on the subject.
