Bash Guide for Beginners Chapter 4. Regular expressions

From LinuxReviews
Jump to navigationJump to search

In this chapter we discuss:

  • Using regular expressions
  • Regular expression metacharacters
  • Finding patterns in files or output
  • Character ranges and classes in Bash

Regular expressions

What are regular expressions?

A regular expression is a pattern that describes a set of strings. Regular expressions are constructed analogously to arithmetic expressions by using various operators to combine smaller expressions.

The fundamental building blocks are the regular expressions that match a single character. Most characters, including all letters and digits, are regular expressions that match themselves. Any metacharacter with special meaning may be quoted by preceding it with a backslash.

Regular expression metacharacters

A regular expression may be followed by one of several repetition operators (metacharacters):

Table 3-1. Reserved Bourne shell variables
Operator Effect
. Matches any single character.
? The preceding item is optional and will be matched, at most, once.
* The preceding item will be matched zero or more times.
+ The preceding item will be matched one or more times.
{N} The preceding item is matched exactly N times.
{N,} The preceding item is matched N or more times.
{N,M} The preceding item is matched at least N times, but not more than M times.
- represents the range if it's not first or last in a list or the ending point of a range in a list.
^ Matches the empty string at the beginning of a line; also represents the characters not in the range of a list.
$ Matches the empty string at the end of a line.
\b Matches the empty string at the edge of a word.
\B Matches the empty string provided it's not at the edge of a word.
\< Match the empty string at the beginning of word.
\> Match the empty string at the end of word.

Two regular expressions may be concatenated; the resulting regular expression matches any string formed by concatenating two substrings that respectively match the concatenated subexpressions.

Two regular expressions may be joined by the infix operator "|"; the resulting regular expression matches any string matching either subexpression.

Repetition takes precedence over concatenation, which in turn takes precedence over alternation. A whole subexpression may be enclosed in parentheses to override these precedence rules.

Basic versus extended regular expressions

In basic regular expressions the metacharacters "?", "+", "{", "|", "(", and ")" lose their special meaning; instead use the backslashed versions "\?", "\+", "\{", "\|", "\(", and "\)".

Check in your system documentation whether commands using regular expressions support extended expressions.

Examples using grep

What is grep?

grep searches the input files for lines containing a match to a given pattern list. When it finds a match in a line, it copies the line to standard output (by default), or whatever other sort of output you have requested with options.

Though grep expects to do the matching on text, it has no limits on input line length other than available memory, and it can match arbitrary characters within a line. If the final byte of an input file is not a newline, grep silently supplies one. Since newline is also a separator for the list of patterns, there is no way to match newline characters in a text.

Some examples:

cathy ~> grep root /etc/passwd
root:x:0:0:root:/root:/bin/bash
operator:x:11:0:operator:/root:/sbin/nologin
cathy ~> grep -n root /etc/passwd
1:root:x:0:0:root:/root:/bin/bash
12:operator:x:11:0:operator:/root:/sbin/nologin
cathy ~> grep -v bash /etc/passwd | grep -v nologin
sync:x:5:0:sync:/sbin:/bin/sync
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
news:x:9:13:news:/var/spool/news:
mailnull:x:47:47::/var/spool/mqueue:/dev/null
xfs:x:43:43:X Font Server:/etc/X11/fs:/bin/false
rpc:x:32:32:Portmapper RPC user:/:/bin/false
nscd:x:28:28:NSCD Daemon:/:/bin/false
named:x:25:25:Named:/var/named:/bin/false
squid:x:23:23::/var/spool/squid:/dev/null
ldap:x:55:55:LDAP User:/var/lib/ldap:/bin/false
apache:x:48:48:Apache:/var/www:/bin/false
cathy ~> grep -c false /etc/passwd
7
cathy ~> grep -i ps ~/.bash* | grep -v history
/home/cathy/.bashrc:PS1="\[\033[1;44m\]$USER is in \w\[\033[0m\] "

With the first command, user cathy displays the lines from /etc/passwd containing the string root.

Then she displays the line numbers containing this search string.

With the third command she checks which users are not using bash, but accounts with the nologin shell are not displayed.

Then she counts the number of accounts that have /bin/false as the shell.

The last command displays the lines from all the files in her home directory starting with ~/.bash, excluding matches containing the string history, so as to exclude matches from ~/.bash_history which might contain the same string, in upper or lower cases. Note that the search is for the string "ps", and not for the command ps.

Now let's see what else we can do with grep, using regular expressions.

Grep and regular expressions

Lovelyz Kei ProTip.jpg
TIP: If you are not on Linux

We use GNU grep in these examples, which supports extended regular expressions. GNU grep is the default on Linux systems. If you are working on proprietary systems, check with the -V option which version you are using. GNU grep can be downloaded from www.gnu.org/software/grep/.

Line and word anchors

From the previous example, we now exclusively want to display lines starting with the string "root":

cathy ~> grep ^root /etc/passwd
root:x:0:0:root:/root:/bin/bash

If we want to see which accounts have no shell assigned whatsoever, we search for lines ending in ":":

cathy ~> grep :$ /etc/passwd
news:x:9:13:news:/var/spool/news:

To check that PATH is exported in ~/.bashrc, first select "export" lines and then search for lines starting with the string "PATH", so as not to display MANPATH and other possible paths:

cathy ~> grep export ~/.bashrc | grep '\<PATH'
  export PATH="/bin:/usr/lib/mh:/lib:/usr/bin:/usr/local/bin:/usr/ucb:/usr/dbin:$PATH"

Similarly, \> matches the end of a word.

If you want to find a string that is a separate word (enclosed by spaces), it is better use the -w, as in this example where we are displaying information for the root partition:

cathy ~> grep -w / /etc/fstab
LABEL=/                 /                       ext3    defaults        1 1

If this option is not used, all the lines from the file system table will be displayed.

Character classes

A bracket expression is a list of characters enclosed by "[" and "]". It matches any single character in that list; if the first character of the list is the caret, "^", then it matches any character NOT in the list. For example, the regular expression "[0123456789]" matches any single digit.

Within a bracket expression, a range expression consists of two characters separated by a hyphen. It matches any single character that sorts between the two characters, inclusive, using the locale's collating sequence and character set. For example, in the default C locale, "[a-d]" is equivalent to "[abcd]". Many locales sort characters in dictionary order, and in these locales "[a-d]" is typically not equivalent to "[abcd]"; it might be equivalent to "[aBbCcDd]", for example. To obtain the traditional interpretation of bracket expressions, you can use the C locale by setting the LC_ALL environment variable to the value "C".

Finally, certain named classes of characters are predefined within bracket expressions. See the grep man or info pages for more information about these predefined expressions.

cathy ~> grep [yf] /etc/group
sys:x:3:root,bin,adm
tty:x:5:
mail:x:12:mail,postfix
ftp:x:50:
nobody:x:99:
floppy:x:19:
xfs:x:43:
nfsnobody:x:65534:
postfix:x:89:

In the example, all the lines containing either a "y" or "f" character are displayed.

Wildcards

Use the "." for a single character match. If you want to get a list of all five-character English dictionary words starting with "c" and ending in "h" (handy for solving crosswords):

cathy ~> grep '\<c...h\>' /usr/share/dict/words
catch
clash
cloth
coach
couch
cough
crash
crush

If you want to display lines containing the literal dot character, use the -F option to grep.

For matching multiple characters, use the asterisk. This example selects all words starting with "c" and ending in "h" from the system's dictionary:

cathy ~> grep '\<c.*h\>' /usr/share/dict/words
caliph
cash
catch
cheesecloth
cheetah
--output omitted--

If you want to find the literal asterisk character in a file or output, use single quotes. Cathy in the example below first tries finding the asterisk character in /etc/profile without using quotes, which does not return any lines. Using quotes, output is generated:

cathy ~> grep * /etc/profile

cathy ~> grep '*' /etc/profile
for i in /etc/profile.d/*.sh ; do

Pattern matching using Bash features

Character ranges

Apart from grep and regular expressions, there's a good deal of pattern matching that you can do directly in the shell, without having to use an external program.

As you already know, the asterisk (*) and the question mark (?) match any string or any single character, respectively. Quote these special characters to match them literally:

cathy ~> touch "*"

cathy ~> ls "*"
*

But you can also use the square braces to match any enclosed character or range of characters, if pairs of characters are separated by a hyphen. An example:

cathy ~> ls -ld [a-cx-z]*
drwxr-xr-x    2 cathy	 cathy		4096 Jul 20  2002 app-defaults/
drwxrwxr-x    4 cathy    cathy          4096 May 25  2002 arabic/
drwxrwxr-x    2 cathy    cathy          4096 Mar  4 18:30 bin/
drwxr-xr-x    7 cathy    cathy          4096 Sep  2  2001 crossover/
drwxrwxr-x    3 cathy    cathy          4096 Mar 22  2002 xml/

This lists all files in cathy's home directory, starting with "a", "b", "c", "x", "y" or "z".

If the first character within the braces is "!" or "^", any character not enclosed will be matched. To match the dash ("-"), include it as the first or last character in the set. The sorting depends on the current locale and of the value of the LC_COLLATE variable, if it is set. Mind that other locales might interpret "[a-cx-z]" as "[aBbCcXxYyZz]" if sorting is done in dictionary order. If you want to be sure to have the traditional interpretation of ranges, force this behavior by setting LC_COLLATE or LC_ALL to "C".

Character classes

Character classes can be specified within the square braces, using the syntax [:CLASS:], where CLASS is defined in the POSIX standard and has one of the values

"alnum", "alpha", "ascii", "blank", "cntrl", "digit", "graph", "lower", "print", "punct", "space", "upper", "word" or "xdigit".

Some examples:

cathy ~> ls -ld [[:digit:]]*
drwxrwxr-x    2 cathy	cathy		4096 Apr 20 13:45 2/
cathy ~> ls -ld [[:upper:]]*
drwxrwxr--    3 cathy   cathy           4096 Sep 30  2001 Nautilus/
drwxrwxr-x    4 cathy   cathy           4096 Jul 11  2002 OpenOffice.org1.0/
-rw-rw-r--    1 cathy   cathy         997376 Apr 18 15:39 Schedule.sdc

When the extglob shell option is enabled (using the shopt built-in), several extended pattern matching operators are recognized. Read more in the Bash info pages, section Basic shell features ▸ Shell Expansions ▸ Filename Expansion ▸ Pattern Matching.

Summary

Regular expressions are powerful tools for selecting particular lines from files or output. A lot of UNIX commands use regular expressions: vim, perl, the PostgreSQL database and so on. They can be made available in any language or application using external libraries, and they even found their way to non-UNIX systems. For instance, regular expressions are used in the Excell spreadsheet that comes with the MicroSoft Windows Office suite. In this chapter we got the feel of the grep command, which is indispensable in any UNIX environment.

Lovelyz Kei ProTip.jpg
TIP: The grep command can do much more than the few tasks we discussed here; we only used it as an example for regular expressions. The GNU grep version comes with plenty of documentation, which you are strongly advised to read!

Bash has built-in features for matching patterns and can recognize character classes and ranges.

Exercises

These exercises will help you master regular expressions.

  1. Display a list of all the users on your system who log in with the Bash shell as a default.
  2. From the /etc/group directory, display all lines starting with the string "daemon".
  3. Print all the lines from the same file that don't contain the string.
  4. Display localhost information from the /etc/hosts file, display the line number(s) matching the search string and count the number of occurrences of the string.
  5. Display a list of /usr/share/doc subdirectories containing information about shells.
  6. How many README files do these subdirectories contain? Don't count anything in the form of "README.a_string".
  7. Make a list of files in your home directory that were changed less that 10 hours ago, using grep, but leave out directories.
  8. Put these commands in a shell script that will generate comprehensible output.
  9. Can you find an alternative for wc -l, using grep?
  10. Using the file system table (/etc/fstab for instance), list local disk devices.
  11. Make a script that checks whether a user exists in /etc/passwd. For now, you can specify the user name in the script, you don't have to work with arguments and conditionals at this stage.
  12. Display configuration files in /etc that contain numbers in their names.