Regular expressions are formed by combining simple building blocks - either characters that represent themselves, or special characters that have a particular meaning.

The simplest regular expression is a character or string that matches itself: 'J' or 'Janice' as in the examples above.

The special character '.' matches any single character: 'Ja.' will match 'Jan' or 'Jab' or 'Jaz' as a word. A regular expression does not have to match a whole word. 'Ja.' will still match against the first three letters of 'Jane', 'Jabber', 'Jazz', or the last three letters of 'AJax'. 'Ja..e' will match 'Ja' followed by any two characters followed by 'e' - e.g. 'Jayne' or (the first five letters of) 'Jabber', but not 'Jane'.

Note that it is necessary to put the pattern between quotes to prevent the shell from interpreting special characters like '*' before it passes the pattern to 'grep'.

To select a list of alternative matches, use '[]': 'J[aeiuo]n' will match (the first three letters of) 'Janice', 'Jenny', 'Jinny', 'Jon', or 'June'. It will not match 'Jean' because only one of the alternative characters can be matched against.

Use the special character '^' to indicate none of the alternatives in []: 'J[^aeuo]n' will only match 'Jinny' from the above list, and also 'Jynny'.

Use the range character '-' to indicate a range in []: 'F[0-9][a-z][A-Z]L' will match F followed by a digit followed by a lower case letter followed by an upper case letter followed by L - 'F0oOL' for example.

The special character '*' repeats the previous pattern 0 or more times: 'Ja*' will match 'J', 'Ja', 'Jaa', 'Jaaa', ..., and the first two letters of 'Janice'. Note that '*' itself does not match anything - it acts on the preceding regular expression.

The special character '+' repeats the previous pattern 1 or more times: 'He+lp' will not match 'Hlp' but will match 'Help', 'Heeeeeeelp', ...

The special character '?' repeats the previous pattern 0 or 1 times: 'He?lp' will match either 'Hlp' or 'Help', no others.
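The character classes and repeaters described above can be tried out with a short sketch like the one below. The sample names are simply the ones used in the text, fed to 'grep' through 'printf' rather than from a file, and the variable names are my own. I am assuming a 'grep' that accepts the '-E' option for extended regular expressions, which '+' and '?' require.

```shell
# Sample data: one name per line (taken from the examples in the text).
names=$(printf '%s\n' Janice Jenny Jinny Jon June Jean Jynny)

# Character class: 'J', then exactly one of a, e, i, u, o, then 'n'.
class_hits=$(printf '%s\n' "$names" | grep 'J[aeiuo]n')

# Negated class: the character after 'J' must NOT be a, e, u or o.
negated_hits=$(printf '%s\n' "$names" | grep 'J[^aeuo]n')

# '+' and '?' are extended regular expression characters, hence 'grep -E'.
plus_hits=$(printf '%s\n' Hlp Help Heeelp | grep -E 'He+lp')
optional_hits=$(printf '%s\n' Hlp Help Heeelp | grep -E 'He?lp')

printf 'class:    %s\n' $class_hits
printf 'negated:  %s\n' $negated_hits
printf 'plus:     %s\n' $plus_hits
printf 'optional: %s\n' $optional_hits
```

Note that every pattern is wrapped in single quotes, so the shell passes it to 'grep' untouched.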
Patterns and repeaters can be combined: 'J.*e' will match 'J' followed by any number of any characters followed by 'e'. For example: 'Je', 'Jayne', or 'Jddhdhsdkjhkjncxznxzdlksdljaslkdjkasjdaskldje'. 'J[aeiou]+n' will match J followed by one or more vowels followed by 'n'. For example: 'Jen', 'Jean' and 'Jeeeeeeeeeeeeeeeeeeaaaaaaaan'.

The special characters '^' and '$' match the start and end of a line respectively: '^Hello' matches 'Hello', but only at the start of a line. 'goodbye.$' matches 'goodbye.', but again only at the end of a line. Effectively '^' matches a start of line, and '$' matches an end of line. The pattern '^A whole line.$' will match only a line containing exactly 'A whole line.' with nothing before it and nothing after it.

Escaping Special Characters

The 'goodbye.$' pattern will match 'goodbye.' and also 'goodbyeZ' because the '.' at the end is interpreted as a special character, not a period. To treat a special character literally, escape it with either '\' or '[]'. To search for the literal text '*.*', use the pattern '\*\.\*' or '[*][.][*]'.

Basic and Extended Regular Expressions

Regular expressions come in two flavours - basic and extended. In the examples I give here the special characters '?' and '+' are only valid in extended regular expressions. Note that 'x+' is equivalent to 'xx*'.
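Here is a small sketch of anchoring and escaping, again feeding made-up sample lines to 'grep' through 'printf'. The variable names are hypothetical, and 'grep -c' is used to count matching lines instead of printing them. The last two commands check the claim that 'x+' (extended) behaves like 'xx*' (basic).

```shell
# Made-up sample lines for illustration.
sample=$(printf '%s\n' 'Hello there' 'Say Hello' 'goodbye.' 'goodbyeZ')

# '^' anchors the match to the start of the line.
start_hits=$(printf '%s\n' "$sample" | grep '^Hello')

# Unescaped '.' matches any character, so both 'goodbye.' and 'goodbyeZ' count.
loose_count=$(printf '%s\n' "$sample" | grep -c 'goodbye.$')

# Escaped with '\' or '[]', the dot only matches a real period.
strict_count=$(printf '%s\n' "$sample" | grep -c 'goodbye\.$')
bracket_count=$(printf '%s\n' "$sample" | grep -c 'goodbye[.]$')

# 'e+' in an extended expression selects the same lines as 'ee*' in a basic one.
plus_count=$(printf '%s\n' Hlp Help Heeelp | grep -E -c 'He+lp')
star_count=$(printf '%s\n' Hlp Help Heeelp | grep -c 'Hee*lp')

echo "$start_hits"
echo "loose=$loose_count strict=$strict_count bracket=$bracket_count"
echo "plus=$plus_count star=$star_count"
```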
Tell Me More...
Shell Wildcards

You will recall from Part 3 that the shell uses a form of pattern matching.

ls *.txt

However, the rules for the shell's 'globbing' are different to the more powerful regular expressions. Always remember this when switching between the two - it can be a source of great confusion.

'*' in shell terms means zero or more of any character.
'*' in regular expression terms means zero or more of the preceding pattern. Zero or more of any character is expressed as '.*'.

Question: What will "j*" match as a regular expression?
Answer: Anything! Zero occurrences of 'j' is a valid match, so every line matches.

Debugging

When you need to check whether your regular expression works as you expect, the easiest way is by typing:

grep 'the-pattern'

with no filename following. In this way 'grep' takes input from the keyboard. If the line you type matches the regular expression it will be echoed back to you. Remember that (e)grep does not necessarily match the whole line; it looks for the pattern somewhere on the line you type.

REMEMBER

You can use regular expressions in Unix commands other than 'grep'. For example 'sed', 'awk', and the 'vi' editor all use regular expressions in their search and replace functionality.
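The difference between the two meanings of '*' can be seen in a short sketch. The three words are made up for illustration. The shell's 'case' statement is used here because it applies the same pattern rules as filename globbing, without needing any real files.

```shell
# Regular expression 'j*': zero or more 'j', which every line satisfies.
regex_any=$(printf '%s\n' apple jam banana | grep -c 'j*')

# 'jj*' insists on at least one 'j'.
regex_j=$(printf '%s\n' apple jam banana | grep -c 'jj*')

# Shell pattern 'j*': a 'j' followed by zero or more of any character.
glob_count=0
for word in apple jam banana; do
  case $word in
    j*) glob_count=$((glob_count + 1)) ;;
  esac
done

echo "regex 'j*' matched $regex_any lines"
echo "regex 'jj*' matched $regex_j lines"
echo "shell pattern j* matched $glob_count words"
```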
And More...
There is more to regular expressions than I have given in the above section, but what I have given should cover your everyday pattern matching needs. Further coverage is reserved for an Advanced Lesson.
Summary
| . | Matches any single character |
| [xyz] | Matches any one character of x, y, or z |
| [^xyz] | Matches any one character that is not x, y, or z |
| [x-z] | Matches any one character in the range x to z |
| * | Matches the preceding pattern 0 or more times |
| ? | Matches the preceding pattern 0 or 1 times |
| + | Matches the preceding pattern 1 or more times |
| ^ | Matches the start of a line |
| $ | Matches the end of a line |
| \x or [x] | Removes any special meaning from x, so that it matches the literal character x |
'sed' stands for Stream EDitor and is used to edit files automatically. It reads a file line by line, edits each line as directed by a list of commands, and spits out the changed line. 'sed' does a lot, much more than I can cover in this tutorial. A fuller tutorial on 'sed' will appear in an Advanced Lesson.
Rather than teaching 'sed' from the ground up I will present some useful tricks one can use. For example, removing all blank lines from a file, or removing all lines containing certain text.
A 'sed' command specifies the lines in the file to edit, the edit function to be applied to those lines, and any arguments to the function. Lines can be specified by a line number, or as those that match a particular regular expression. Each selected line is then edited by applying the function and its arguments.
% sed 'lines command arguments' input-file

Re-direct to a File
'sed' does not change the input file; instead it writes the changed file to the Terminal window. If you wish to write the changes back to a file, the output can be re-directed to a new file:

% sed 'lines command arguments' input-file > new-file

The next tutorial explains re-direction in more detail - it can be used to direct the Terminal output of any command to a file.
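To see that 'sed' really does leave its input alone, here is a minimal sketch. The file contents and the temporary directory are made up for illustration, and the 's' (substitute) function it borrows is one of the tricks explained further on.

```shell
# Work in a throw-away directory so no real files are touched.
workdir=$(mktemp -d)
printf '%s\n' 'the wizzard of oz' 'no change here' > "$workdir/input-file"

# The edited text goes to standard output; '>' sends it to new-file instead.
sed 's/wizzard/wizard/' "$workdir/input-file" > "$workdir/new-file"

original=$(cat "$workdir/input-file")   # untouched by sed
changed=$(cat "$workdir/new-file")      # holds the edited copy
rm -r "$workdir"

echo "input-file: $original"
echo "new-file:   $changed"
```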
Remove all blank lines:

First we need to specify a pattern that will match all blank lines. Such a pattern is '^$' - the end immediately follows the start with no characters between. Next we specify the function to be applied to each matched (i.e. blank) line. In 'sed' the 'd' function deletes a line. Therefore the 'sed' command is:

% sed '/^$/d' file-name

The regular expression is delimited by '/' and the delete function 'd' immediately follows it. The whole command is enclosed in quotes to protect it from interpretation by the shell before it is passed to 'sed'.

It is possible that blank lines actually contain spaces. To remove these blank lines too use:

% sed '/^ *$/d' file-name

which matches start of line, space zero or more times, end of line (there is a space character between '^' and '*').

Remove all lines containing some given text:

To remove all lines containing the word 'remove' use:

% sed '/remove/d' file-name

Of course, you can use any regular expression in place of 'remove'.

Change text:

The 's' function will substitute one string for another. To correct a wide-spread spelling mistake use:

% sed 's/wizzard/wizard/' file-name

Notice in this example that I have not specified a line to match, in which case 'sed' assumes every line of the file. In this example, it would be possible to select only the lines that contain 'wizzard', but that would be pointless. Sometimes it may make sense to do so:

% cat eg
en:color
us:color
% sed '/en/s/color/colour/' eg
en:colour
us:color
%

One important point: the substitute function only replaces the first occurrence of a pattern on each line. If you wish to replace all occurrences on each line use the 'g' (for global) argument.

% sed 's/wizzard/wizard/g' file-name

Convert case:

'sed' is able to translate characters with the 'y' function. In this example I convert a file to upper case.

% sed 'y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/' file-name

No lines are specified so the 'y' function is applied to every line in the file.
'a' is replaced with 'A', 'b' with 'B', and so on. Any character can be replaced by any other character by the 'y' function, but the replacement must be one for one.
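The tricks above can be tried without creating any files by piping text into 'sed'. This is only a sketch: the sample text and variable names are made up, and 'printf' stands in for the file-name argument.

```shell
# Made-up input: text, an empty line, a line of three spaces, more text.
text=$(printf '%s\n' 'keep me' '' '   ' 'remove me' 'last line')

# 'd' with '/^ *$/' deletes empty lines and lines holding only spaces.
no_blanks=$(printf '%s\n' "$text" | sed '/^ *$/d')

# 'd' with '/remove/' deletes every line containing 'remove'.
no_remove=$(printf '%s\n' "$text" | sed '/remove/d')

# 's' replaces only the first match on a line unless 'g' is given.
first_only=$(printf '%s\n' 'wizzard meets wizzard' | sed 's/wizzard/wizard/')
every_one=$(printf '%s\n' 'wizzard meets wizzard' | sed 's/wizzard/wizard/g')

# 'y' translates characters one for one.
upper=$(printf '%s\n' 'hello world' |
  sed 'y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/')

printf '%s\n' "$no_blanks" "$first_only" "$every_one" "$upper"
```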
Tell Me More...
Limitations

'sed' uses regular expressions for its pattern matching. By default it supports only basic regular expressions, not extended regular expressions. 'sed' will therefore not recognise the special characters '?' and '+'.

'grep' with 'sed'

If you call 'sed' with the option '-n' it will not output each line as it usually does. This can be used in conjunction with the 'p' function, which explicitly outputs the current line. A grep-like functionality can be obtained with:

% sed -n '/re/p' file

where 're' is the regular expression to be searched for.

The function '=' prints the line number instead of the line text. A useful variant of the 'grep'-like example is to search a file for a regular expression and print out the line numbers where that expression is found.

% sed -n '/re/=' file

'head' with 'sed'

By specifying a range of line numbers instead of a regular expression, it is possible to display the first n lines of a file. To display the first 15 lines use:

% sed -n '1,15 p' file

The last line of a file is represented by the special character '$'. Thus the address 1,$ represents the whole file.

Line Count with 'sed'

By selecting just the last line and printing its line number we can count the number of lines in a file.

% sed -n '$=' file
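As a sketch of the '-n' tricks above, the commands below run against twenty generated lines ('line 1' to 'line 20') instead of a real file. The generated input and the variable names are assumptions made purely for illustration.

```shell
# Generate 20 lines of made-up input: 'line 1' ... 'line 20'.
lines=$(i=1; while [ "$i" -le 20 ]; do echo "line $i"; i=$((i + 1)); done)

# grep-like: print only lines matching the regular expression.
grep_like=$(printf '%s\n' "$lines" | sed -n '/line 1[0-2]/p')

# Print the line numbers of the matching lines instead of their text.
line_numbers=$(printf '%s\n' "$lines" | sed -n '/line 1[0-2]/=')

# head-like: print lines 1 to 15 only.
head_like=$(printf '%s\n' "$lines" | sed -n '1,15 p')

# Line count: print the line number of the last line.
line_count=$(printf '%s\n' "$lines" | sed -n '$=')

printf '%s\n' "$grep_like"
echo "matches on lines:" $line_numbers
echo "total lines: $line_count"
```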
A matched pattern can be recalled using the special character '&'. To illustrate, take the following example.
% cat hw
This file contains the words hello world
% sed 's/hello/& & &/' hw
This file contains the words hello hello hello world
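A few more uses of '&' are sketched below with made-up input. The last command shows the one thing to watch for: because '&' is special in the replacement, a literal ampersand has to be written as '\&'.

```shell
# '&' stands for whatever the pattern matched.
tripled=$(printf '%s\n' 'hello world' | sed 's/hello/& & &/')

# Handy when the matched text is not known in advance, e.g. a number.
bracketed=$(printf '%s\n' 'version 42 released' | sed 's/[0-9][0-9]*/[&]/')

# Escape it with a backslash to insert a real ampersand.
literal=$(printf '%s\n' 'fish chips' | sed 's/ / \& /')

printf '%s\n' "$tripled" "$bracketed" "$literal"
```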
Suppose I have an html file like:
% cat test.html
<html>
<head>
 <link rel="stylesheet" type="text/css" href="/css/osxfaq.css">
</head>
<body>
<p>
text here
</p>
</body>
</html>
I need to comment out the stylesheet line. I don't want to delete it because I want to un-comment it at a later date. I can do this by searching for a pattern, and replacing it with the original pattern surrounded by comment start and end markers.
Additionally, to avoid commenting out an already commented out line (in case the 'sed' command is applied twice to the same file) I want to anchor that pattern to the start and end of the line.
I needn't search for the whole line verbatim, just enough to make sure I don't accidentally include similar looking lines. Finally, the line may include leading and trailing spaces.
Building up the regular expression we have:
^ *
to match 0 or more spaces at the start of a line

<link rel=.*osxfaq\.css
to pick out the string uniquely. Note that the dot in 'osxfaq.css' has to be escaped so it is taken literally.

.*$
to absorb all characters to the end of the line. We need to absorb all characters so that the end comment marker is placed after the end of the line.
The replace pattern is:
<\!--&-->

'<!--' and '-->' are the HTML comment delimiters we want to add, and '&' recalls the matched line between them. The '!' character must be protected from the shell, which (in csh-style shells such as tcsh) processes it even when surrounded by quotes.
The required command is thus:
% sed 's/^ *<link rel=.*osxfaq\.css.*$/<\!--&-->/' test.html

The working command is illustrated:

% cat test.html
<html>
<head>
 <link rel="stylesheet" type="text/css" href="/css/osxfaq.css">
</head>
<body>
<p>
text here
</p>
</body>
</html>
% sed 's/^ *<link rel=.*osxfaq\.css.*$/<\!--&-->/' test.html >test2.html
% cat test2.html
<html>
<head>
<!-- <link rel="stylesheet" type="text/css" href="/css/osxfaq.css">-->
</head>
<body>
<p>
text here
</p>
</body>
</html>
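The point about anchoring the pattern can be checked with a sketch like the one below, which applies the command twice and confirms the line is only commented once. It is written for a Bourne-style shell (sh, bash), where single quotes already protect '!', so the backslash before '!' is left out; in tcsh keep it as shown in the article. The cut-down 'test.html', the temporary directory and the variable names are all made up for illustration.

```shell
# Build a cut-down test.html in a throw-away directory.
workdir=$(mktemp -d)
cat > "$workdir/test.html" <<'EOF'
<html>
<head>
 <link rel="stylesheet" type="text/css" href="/css/osxfaq.css">
</head>
<body>
</body>
</html>
EOF

# Comment out the stylesheet line; '&' re-inserts the whole matched line.
comment_out='s/^ *<link rel=.*osxfaq\.css.*$/<!--&-->/'
sed "$comment_out" "$workdir/test.html" > "$workdir/test2.html"

# Apply the same command again: the line now starts with the comment
# marker, so '^ *<link' no longer matches and nothing changes.
sed "$comment_out" "$workdir/test2.html" > "$workdir/test3.html"

once=$(grep 'osxfaq' "$workdir/test2.html")
twice=$(grep 'osxfaq' "$workdir/test3.html")
rm -r "$workdir"

echo "after one pass:   $once"
echo "after two passes: $twice"
```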
A Neat Trick
I want to write the processed file back to the original file. I can do this in one command line using ';'. This is a feature of the shell and is not limited to 'sed'.
% sed 's/^ *<link rel=.*osxfaq\.css.*$/<\!--&-->/' test.html > test2.html ; mv test2.html test.html

Next Part
In Part 7 I will explain re-direction and introduce the concept of pipes. A pipe joins two Unix commands enabling one to take the output of another.
I will show how to combine commands, and how to combine 'find' and 'sed' to apply the above working example to a whole web site of files.
Until then, here is one method to remove the comments added by the above working example:
% sed -e 's/<\!--//' -e 's/^ *<link rel=.*osxfaq\.css">/&/' -e 's/-->//' test2.html > test.html

I will explain how this works at the start of Part 7.
Enjoy :-)
Part 6 - 'grep', 'sed', and Regular Expressions (page 2 of 2)
Copyright © 2000-2010 Inside Mac Media, Inc. All rights reserved.