Tuesday, March 25, 2014

Mastering sed

sed
sed is a well-known tool in the UNIX* community, but it is very often misused. Most people either reach for it when it is not the right tool or use it in the wrong way. In this post I try to show the things that are worth knowing when it comes to sed. I therefore skip things like the hold buffer, labels etc., which just make scripts totally unreadable for little benefit. I also talk a bit about the inner workings of sed so that you have a grasp of why things sometimes don't work as expected.

Why even learn sed?


For two reasons:

  1. You can automate any boring mechanical work you would otherwise do in a normal text editor

  2. If you know the syntax of sed, then you are a better vi/vim user


The second point becomes obvious when you realise that the two programs have similar, if not identical, syntax. For example, substituting the word "cow" with the word "horse" on the third line of a document is :3s/cow/horse/ in vim, while in sed it's 3s/cow/horse/. See the magic?!

A bit of history


Sed, awk and grep are the offspring of a line editor called ed. ed pretty much lets the user edit one line at a time. That is also the reason that sed, awk and grep work on lines. All three programs have inherited the syntax of ed to some extent. In ed, to search & replace the word "milk" with the word "beer" in a text document, someone would type s/milk/beer/. That is exactly the same command used in sed - an indication of ancestry.
The tool grep actually takes its name from g/re/p, a command used in ed to show only the lines that match a specific regular expression (the "re"). The 'p' at the end means to print on the screen, while the 'g' in front stands for global, meaning to go through all the lines.

When to use sed


While all three of these tools (grep, sed and awk) work on lines, sed and awk are very similar to each other while grep is more of a loner. Grep is used merely to filter lines out (or in) based on a regular expression. Sed and awk offer much more. What sets sed apart from awk is the kind of data each was built to edit.
Awk should be used when every line in the file has a specific structure. In other words, that includes files where each line has a specific number of fields, with the fields separated by a delimiter (awk splits on whitespace by default, and tabs or commas are common choices). Such files can be CSV files, tables, the output of ls -l on Linux, and more.
For everything else, use sed. Common examples are raw texts, this post, a C file, a script, an HTML file, etc. What all these files have in common is that their lines don't have a specific structure: the first line can have one word, while the second can have 100 words etc. Sed can still edit the files that awk edits, but going the other way is usually much clumsier. If you find yourself doing exactly that, you are most probably using the wrong tool.
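To make the split a bit more concrete, here is a small sketch. The tab-separated table and the sentence are made-up sample data, and the variable names are only for illustration.

```shell
# Hypothetical structured data: two tab-separated fields per line.
table() { printf 'cow\t4\nhen\t2\n'; }

# awk is at home here: print the second field of every line.
legs=$(table | awk -F '\t' '{ print $2 }')

# Free text has no fields to speak of, so sed is the natural fit.
prose=$(printf '%s\n' 'The cow will taste like cow.' | sed 's/cow/horse/g')

printf '%s\n' "$legs" "$prose"
```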

Syntax


Sed in the terminal


The common syntax for sed in the terminal is:
[sourcecode light="true"]sed SCRIPT INPUTFILE[/sourcecode]
The meat of sed is the SCRIPT and that is pretty much what I cover in this post. It's a good convention to put single quotation marks around it: if it contains a space or a character that is special to the shell, the shell would otherwise split it into several arguments or try to expand it.
sed can take multiple script commands, or it can read the commands from a file.

Multiple script lines:
[sourcecode light="true"]sed 'SCRIPT1; SCRIPT2; SCRIPT3;' INPUTFILE[/sourcecode]

Using a script file:
[sourcecode light="true"]sed -f SCRIPTFILE INPUTFILE[/sourcecode]
The SCRIPTFILE should have one command per line: SCRIPT1 on its own line, SCRIPT2 on the next line, SCRIPT3 on the line after that, and so on.
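As a quick sketch of the two forms side by side: the file names list.txt and edit.sed, the temporary directory and the sample text are assumptions made up for this example.

```shell
# Hypothetical sample data and script file, created in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'Today I will drink my milk' 'and afterwards I will eat a cow.' > "$tmpdir/list.txt"

# One command per line in the script file.
printf '%s\n' 's/Today/Tomorrow/' 's/cow/horse/' > "$tmpdir/edit.sed"

# The two commands given inline, separated by a semicolon.
inline=$(sed 's/Today/Tomorrow/; s/cow/horse/' "$tmpdir/list.txt")

# The same two commands read from the script file.
from_file=$(sed -f "$tmpdir/edit.sed" "$tmpdir/list.txt")

printf '%s\n' "$from_file"
rm -r "$tmpdir"
```

Both variables should end up holding the same two edited lines.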

Sed's script syntax


Sed uses three things to accomplish tasks:

  • line specifiers (address)

  • commands

  • flags



The syntax of a single SCRIPT line is:
[sourcecode light="true"]<line><command><flag>[/sourcecode]

A line or line specifier is a way to specify which lines you want the command to affect (parse). If the line specifier is missing, then the command affects all lines, which is the default behaviour.

A command is denoted by a single character. For example, to replace (substitute) a word with another word we use the command 's':
[sourcecode light="true"]s/word1/word2/[/sourcecode]

A flag is used to modify the <command> or the <line>'s behaviour a bit and is placed after the whole command or line. Using the flag g on the example above, we get:
[sourcecode light="true"]s/word1/word2/g[/sourcecode]
g stands for global and is used to replace every occurrence of word1 in the line. Without it, only the first occurrence of word1 in the line is replaced.
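A small sketch of the difference, using one of the sample sentences from this post (the variable names are just for illustration):

```shell
line='The cow will taste like cow.'

# Without a flag only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/cow/horse/')

# With the g flag every match on the line is replaced.
every_match=$(printf '%s\n' "$line" | sed 's/cow/horse/g')

printf '%s\n' "$first_only" "$every_match"
# The horse will taste like cow.
# The horse will taste like horse.
```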

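A flag goes hand in hand with regular expressions, so flags can only be used if the <command> or the <line> has a regular expression in it. If the <line> is specified with a number, for example, a letter placed after it is not a flag:
[sourcecode light="true"]
#Not a flag: sed reads this g as a command of its own
sed -n '6g'
[/sourcecode]

The reason is that flags are made for text strings (patterns), so telling sed to use the flag 'g' on a line number (which is something else than a pattern) makes no sense to it. Instead, sed reads the letter after the number as a command, and g happens to be the name of one of the hold buffer commands that I skip in this post, so the script does something quite different from "global".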

The minimum needed on a SCRIPT line is a command. A line specifier on its own is not enough: sed rejects a script such as 5 or /cow/ because there is no command to run on those lines. Around the command you can add a line specifier and flags. These combinations are allowed:
[sourcecode light="true"]
<command>
<command><flag>
<line><command>
<line><command><flag>
<line><flag><command>
<line><flag><command><flag>
[/sourcecode]
A flag applies to the command or line right before it and assumes that it has a regular expression in it.





How sed works internally


Imagine we have the file list.txt with the lines:
[sourcecode light="true"]
Today I will drink my milk
and afterwards I will eat a cow.
The cow will taste like cow.
[/sourcecode]
Sed works with lines. As stated earlier, we can have multiple script lines separated with semicolons:
[sourcecode light="true"]sed 's/Today/Tomorrow/g; s/Today/Next Friday/g' list.txt[/sourcecode]

sed has a working buffer that holds the line currently being processed (called the pattern space). sed will initially load the first line of list.txt into the buffer. Then it will go through all script lines one by one, altering everything in place (in the buffer). In our example the buffer for the first line initially has:
[sourcecode light="true"]
Today I will drink my milk
[/sourcecode]

After the first script command executes, it becomes:
[sourcecode light="true"]
Tomorrow I will drink my milk
[/sourcecode]

The second script line doesn't alter anything as sed can't find an occurrence of the word 'Today' since it just got altered.
Once the first line is done, sed will load the second line in the buffer and go through the same procedure until all lines in list.txt have been parsed.

Something important to notice is that each SCRIPT line runs regardless of whether the previous SCRIPT line succeeded or not. By success we mean that the command did what it is meant to do. If substitution is used, then we define success as the alteration of the line. If a command has a line specifier, then success means that the current line matched the specifier, and so on.
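The walk through the buffer can be checked with a short sketch. The second script is my own variation, added only to show that a later script line sees what an earlier one left in the buffer.

```shell
line='Today I will drink my milk'

# The second command looks for "Today", which the first one already replaced,
# so it changes nothing.
no_second_hit=$(printf '%s\n' "$line" | sed 's/Today/Tomorrow/; s/Today/Next Friday/')

# Here the second command works on the result of the first one.
chained=$(printf '%s\n' "$line" | sed 's/Today/Tomorrow/; s/Tomorrow/Next Friday/')

printf '%s\n' "$no_second_hit" "$chained"
# Tomorrow I will drink my milk
# Next Friday I will drink my milk
```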






Addressing specific lines


<line><command><flag>
By default, sed goes through all the lines. However, one can address a specific line or a range of lines. That can be done by specifying lines by their number in the file (1st line, 2nd line, 50th line etc.) or by a line's content.

To run a command on the 10th line we do:
[sourcecode light="true"]10<command>[/sourcecode]

To run a command on each line that contains the word "cow" we do:
[sourcecode light="true"]/cow/<command>[/sourcecode]

The latter makes use of regular expressions. When we use regular expressions we need to add slashes to the start and end of the regular expression.

For a range of lines we use a comma (awkward, I know). To specify all lines between the 10th and 25th line (including those) we would write:
[sourcecode light="true"]10,25<command>[/sourcecode]

If we want to specify a range of lines by using regular expressions, we still have to enclose the regular expressions in slashes. To run a command on all the lines from the first line containing the word "cow" up to the next line containing the word "grass" (if "cow" shows up again further down, a new range starts there), we would issue:
[sourcecode light="true"]/cow/,/grass/<command>[/sourcecode]

Below you can see all the ways of specifying lines.

[sourcecode light="true"]<line1>,<line2>[/sourcecode] Lines between <line1> and <line2> (including those)
[sourcecode light="true"]<line1>~N[/sourcecode] Every Nth line, starting from <line1> (GNU sed)
[sourcecode light="true"]<line1>![/sourcecode] All lines except <line1>
[sourcecode light="true"]/<regex>/[/sourcecode] Lines matching the regular expression
[sourcecode light="true"]$[/sourcecode] Last line
[sourcecode light="true"]<line1>,+N[/sourcecode] <line1> and the N lines after it (GNU sed)


Some more practical examples can be seen below. The command p is used to print the line that is specified. (Notice that sed should be run with the parameter -n here: without it sed prints every line anyway, and the lines picked by p show up twice.)
[sourcecode light="true"]
sed -n '2p' -> print line 2
sed -n '2,4p' -> print line 2 to 4
sed -n '$p' -> print last line
sed -n '2!p' -> print all lines but line 2
sed -n '/red/p' -> print every line that contains the word red
sed -n '/red/,/green/p' -> print all lines from the first line containing 'red' to the next line containing 'green'
[/sourcecode]
Of course, for all these examples to work you need to feed sed some input, either from a file or through a pipe.
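Here is a runnable sketch of these line specifiers. The five colour names are made-up input, and the last two commands rely on the GNU sed forms ~ and ,+ from the table, so I assume GNU sed throughout.

```shell
# Hypothetical five-line input.
colors() { printf '%s\n' blue red yellow green black; }

second=$(colors | sed -n '2p')                # red
last=$(colors | sed -n '$p')                  # black
all_but_second=$(colors | sed -n '2!p')       # blue yellow green black
red_to_green=$(colors | sed -n '/red/,/green/p')   # red yellow green

# GNU sed only: every 2nd line starting at line 1, and line 2 plus 1 more line.
every_other=$(colors | sed -n '1~2p')         # blue yellow black
two_from_second=$(colors | sed -n '2,+1p')    # red yellow

printf '%s\n' "$red_to_green"
```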




Using commands


<line><command><flag>
Now we get to the meat of all meats. I have explained the substitution command a bit already, but below you can see all the commands with their syntax (if they have one).








s/<regex>/<subst>/ (substitute): Replaces a match (using a regular expression) with a string.
p (print): Prints a specific line or a range of lines. sed should be run with -n when this command is used, or every line gets printed anyway.
= (line number): Shows the number of the line.
d (delete): Deletes (omits) the matched line.
y/<char1>/<char2>/ (transform): Substitutes char1 with char2. Works even with a sequence of characters. For example y/abc/ABC/ will replace a with A, b with B and c with C.
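A short sketch of the less familiar commands in action. The three-word input is made up for the example.

```shell
# Hypothetical three-line input.
sample() { printf '%s\n' milk cow grass; }

# = prints the number of each addressed line.
number=$(sample | sed -n '/cow/=')

# d drops the addressed line from the output.
without_cow=$(sample | sed '/cow/d')

# y maps characters one to one: a->A, b->B, c->C.
mapped=$(sample | sed 'y/abc/ABC/')

printf '%s\n' "$number" "$without_cow" "$mapped"
```

The mapped output should read milk, Cow and grAss, since y touches single characters and knows nothing about words.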


Substitution


Substitution is the most commonly used command. Its syntax is as follows:
[sourcecode light="true"]s<d><regex><d><subst><d>[/sourcecode]
where <d> is a delimiter, which should be a single character. Most commonly the delimiter is a slash /, but it can be essentially any character (other than a backslash or a newline). All lines below are equivalent.
[sourcecode light="true"]s/cow/horse/[/sourcecode]
[sourcecode light="true"]s_cow_horse_[/sourcecode]
[sourcecode light="true"]sDcowDhorseD[/sourcecode]

In the example above the first occurrence of the word cow on each line will be replaced with the word horse, which is the default behaviour. If you want all occurrences of the word cow in a line to be substituted, the flag g (global) has to be appended to the command:
[sourcecode light="true"]s/cow/horse/g[/sourcecode]
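The freedom to pick a delimiter pays off when the pattern itself contains slashes. A small sketch, with a made-up path as input:

```shell
path='/usr/local/bin'

# With / as the delimiter, every slash in the pattern has to be escaped.
with_slashes=$(printf '%s\n' "$path" | sed 's/\/usr\/local/\/opt/')

# With _ as the delimiter, the same substitution stays readable.
with_underscores=$(printf '%s\n' "$path" | sed 's_/usr/local_/opt_')

printf '%s\n' "$with_slashes" "$with_underscores"
# /opt/bin
# /opt/bin
```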

The substitution field <subst> can take some special variables, like the ampersand symbol "&", which holds the matched string from the regular expression. There are also a few macros to automate the conversion of capitals to lower-case and vice versa (these case macros are GNU sed extensions).
All these are shown in the table below.








& : Holds the matched string.
\1 : Holds a part of the match specified in the regular expression with parentheses.
\U<string> : Converts all letters in <string> to capitals.
\u<string> : Converts the first letter in <string> to capital.
\L<string> : Converts all letters in <string> to lower-case.
\l<string> : Converts the first letter in <string> to lower-case.
<string>\E : Ends the conversion at a specific point. Should be used in conjunction with \L and \U.


The match holders \1 \2 \3 etc. have to be specified in the regular expression with parentheses. The parentheses themselves have to be escaped, or else sed will be looking for parenthesis characters in the line. So to replace the word "cow" with "supercow" we can do:
[sourcecode light="true"]s/\(cow\)/super\1/[/sourcecode]
or
[sourcecode light="true"]s/cow/super&/[/sourcecode]
The latter is more elegant of course. However there are two cases where the specific holder has to be used:

  1. When we want only a portion of the matched string and not the whole string (&).

  2. When there are many different strings you want to grab from a match.



For example this can't be solved by merely using the & symbol:
[sourcecode light="true"]s/\(\w*\) cow \(\w*\)/\2 cow \1/[/sourcecode]
This script will look for an occurrence where "cow" has a word before it and a word after it, and it will swap those two words. The \w matches any word character (a letter, a digit or an underscore; it is a GNU sed extension), while the asterisk * tells the pattern that the word can be arbitrarily long. We use the parentheses around the first word and the second word to denote which matched parts of the regular expression should be given to \1 and \2. In <subst> we just reverse their order by putting the second word first and the first word second.
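A sketch of both holders on a made-up sentence, assuming GNU sed because of \w:

```shell
line='brown cow eats'

# & stands for the whole match, here just "cow".
prefixed=$(printf '%s\n' "$line" | sed 's/cow/super&/')

# \1 and \2 hold the two parenthesised words, which get swapped.
swapped=$(printf '%s\n' "$line" | sed 's/\(\w*\) cow \(\w*\)/\2 cow \1/')

printf '%s\n' "$prefixed" "$swapped"
# brown supercow eats
# eats cow brown
```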







Flags


<line><command><flag>
A flag can be used on a <command> or a <line> or both.





g (global): For all occurrences in the line (default is to stop at the first occurrence)
I (ignore case): Not case-sensitive
p (print): Prints the line when the substitution succeeded. Combined with sed's -n parameter, only these lines are output (not everything as by default).
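The flags can be tried out with the sketch below. The three input lines are made up, and the I flag is a GNU sed extension, so I assume GNU sed here.

```shell
# Hypothetical three-line input.
sample() { printf '%s\n' 'Cow one' 'horse two' 'cow three'; }

# p flag on a substitution, with -n: only the lines that were changed are printed.
changed=$(sample | sed -n 's/cow/horse/p')

# I flag added: the match ignores case, so "Cow" is replaced as well.
any_case=$(sample | sed -n 's/cow/horse/Ip')

# I flag on a line specifier, followed by the p command.
addressed=$(sample | sed -n '/cow/Ip')

printf '%s\n' "$changed" "$any_case" "$addressed"
```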










Find line in lines of lines in lines..


An important concept to comprehend in sed is nesting. I try to leave out all the advanced things sed offers, like the hold buffer (a horror to read), labels etc., but nesting is worth learning as it gives a lot of extra power for very little learning effort. Nesting is similar to the IF..THEN conditional.

We have this text file:
Today I will drink my milk
and afterwards I will eat a cow.
The cow will taste like cow.
Today is not afterwards if I am a cow. Right?


Imagine that we want to check if the last line of the text contains the word cow. Someone might think that putting $p together with /cow/p would work:
[sourcecode light="true"]
sed -n '$p; /cow/p'
[/sourcecode]

Admittedly the output seems strange:
and afterwards I will eat a cow.
The cow will taste like cow.
Today is not afterwards if I am a cow. Right?
Today is not afterwards if I am a cow. Right?


The reason this doesn't work as expected is that the second script runs regardless of whether the first one succeeded or not. Thus on the first line neither of the scripts prints anything. On the second line, the first script fails but the second script finds the word cow, so it prints the line. The same happens on the third line. On the fourth line, the first script succeeds as the line is the last line of the file, so that line gets printed. Then the second script runs and that succeeds too. Thus we get the line printed again (a second time).

A way to solve this is to nest the second script line in the first somehow, so that it runs only if the first script line has succeeded. This is similar to the pseudocode:
[sourcecode light="true"]
if <line1>
then SCRIPT
[/sourcecode]

Where <line1> is a line specified by a number, range or pattern (regular expression).
The syntax for nesting script lines in sed is
[sourcecode light="true"]
<line>{<line><command><flag>}
[/sourcecode]

For our example:
[sourcecode light="true"]
sed -n '${/cow/p}'
[/sourcecode]

This translates to: if this is the last line ($), then do whatever is in the braces. So everything in the braces will be checked only if the current line being parsed is the last one.

We can even nest inside a nested script:
[sourcecode light="true"]
sed -n '2,4{/cow/{/Today/p}}'
[/sourcecode]
This script will go through lines 2 to 4. It will first check if a line contains the word cow. If the line contains the word cow, then it will check if it contains the word Today. If it does, it will print it. Of course there is no practical reason to have the second pair of braces in the above example. I just wrote it to show that it's feasible. In some cases you need braces to accomplish a task, especially when you want to avoid the more advanced commands.
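The three scripts from this section can be compared with the sketch below, which feeds the four sample lines through a pipe. I assume GNU sed, which accepts the closing brace right after p.

```shell
# The four sample lines used in this section.
text() {
  printf '%s\n' \
    'Today I will drink my milk' \
    'and afterwards I will eat a cow.' \
    'The cow will taste like cow.' \
    'Today is not afterwards if I am a cow. Right?'
}

# Two independent script lines: the last line is printed twice.
side_by_side=$(text | sed -n '$p; /cow/p')

# Nested: /cow/p is only tried on the last line.
nested=$(text | sed -n '${/cow/p}')

# Nested twice: lines 2 to 4 that contain both cow and Today.
nested_twice=$(text | sed -n '2,4{/cow/{/Today/p}}')

printf '%s\n' "$nested" "$nested_twice"
```

Both nested scripts should print the last line exactly once.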


Examples


Printing a specific line


[sourcecode light="true"]
sed -n '2p'
[/sourcecode]
The p in this case is the p command (and not flag). Remember that flags apply only to regular expressions. When we use sed's -n parameter, only lines specified with the p command or flag will be printed on screen.

Printing a range of lines


[sourcecode light="true"]
sed -n '1,3p'
[/sourcecode]
The comma is used to specify a range. In this example we specify line 1 to line 3 and then we print each such line.

Hiding a specific line


[sourcecode light="true"]
sed -n '2!p'
[/sourcecode]
This is similar to:
[sourcecode light="true"]
sed '2d'
[/sourcecode]

Show the last line


[sourcecode light="true"]
sed -n '$p'
[/sourcecode]
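In regular expressions the dollar sign $ denotes the end of a string. In sed, when it's used inside a pattern (in a substitution, for example), it denotes the end of a line.
However, when it's used as a line specifier, as in front of the command p here, it denotes the last line of the input. The regex symbol ^ has no such second meaning: inside a pattern it denotes the start of a line, while the first line of the input is simply addressed by its number, 1.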
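The two roles of $ are easy to mix up, so here is a small sketch on two made-up lines:

```shell
# Hypothetical two-line input.
sample() { printf '%s\n' 'cow milk' 'milk cow'; }

# As line specifiers: 1 is the first line, $ is the last line.
first_line=$(sample | sed -n '1p')
last_line=$(sample | sed -n '$p')

# Inside a pattern: ^ anchors to the start of a line, $ to its end.
starts_with_cow=$(sample | sed -n '/^cow/p')
ends_with_cow=$(sample | sed -n '/cow$/p')

printf '%s\n' "$first_line" "$last_line" "$starts_with_cow" "$ends_with_cow"
# cow milk
# milk cow
# cow milk
# milk cow
```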

Converting a specific word to uppercase


[sourcecode light="true"]
sed 's/Today/\U&/' list.txt
[/sourcecode]
This will match the first 'Today' on each line and replace it with 'TODAY'. In the replacement, the escaped U (\U) tells sed to convert to uppercase everything that follows in the replacement string. In this example we use an ampersand, which in sed represents the matched string. If we wanted to stop the conversion, we would just add \E where we want it to end.

Converting all text to uppercase


[sourcecode light="true"]
sed 's/.*/\U&/'
[/sourcecode]
For this we use the substitution command s.
.* is a regular expression which fits any sequence of characters. As sed is working on lines, .* matches a whole line each time.

Converting all text to lowercase


[sourcecode light="true"]
sed 's/.*/\L&/'
[/sourcecode]
Similar to the previous example, with the only difference being that we use \L instead of \U.
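The case macros, including \E, can be tried with the sketch below. They are GNU sed extensions, so I assume GNU sed, and the lower-case sample sentence is made up.

```shell
line='today I will drink my milk'

# \U converts everything that follows in the replacement to uppercase.
upper_word=$(printf '%s\n' "$line" | sed 's/today/\U&/')

# \u converts only the next character.
capitalised=$(printf '%s\n' "$line" | sed 's/today/\u&/')

# \E ends the conversion: the first group is uppercased, the second is not.
partial=$(printf '%s\n' "$line" | sed 's/\(drink\) \(my\)/\U\1\E \2/')

printf '%s\n' "$upper_word" "$capitalised" "$partial"
# TODAY I will drink my milk
# Today I will drink my milk
# today I will DRINK my milk
```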

Grabbing all content between the body tags in HTML


Say we have the ugly HTML code below and we want to grab all the content between the body tags.
[sourcecode light="true" language="html"]
<html><head></head>
<h1>h1 outside of body</h1><body><h1>h1 stuck to body</h1>
<img src="images/soon.png"/>
<p>a paragraph</p></body></html>[/sourcecode]

Most sed experts would go about this using some very advanced commands, and the readability of the result is horrific. As I believe that you can do the same things without knowing all the advanced commands, that is exactly what I am going to do.
Notice that I use pipes instead of advanced commands.

Solution:
[sourcecode light="true"]
sed -n '/<body>/,/<\/body>/p' | sed 's_.*<body>\(.*\)_\1_; s_\(.*\)</body>.*_\1_'
[/sourcecode]

It might seem like a mess but I can assure you that it's much more elegant than a pure single sed SCRIPT solution. I will break it down so you can see how it works.
[sourcecode light="true"]/<body>/,/<\/body>/p[/sourcecode]
matches all lines between the body tags, including the body tags. In this way I have minimized my problem to:
[sourcecode light="true" language="html"]
<h1>h1 outside of body</h1><body><h1>h1 stuck to body</h1>
<img src="images/soon.png"/>
<p>a paragraph</p></body></html>[/sourcecode]

After that, we are sure that the first line has the start of the body tag and the last line has the closing of the body tag. So we start by filtering out things we don't need from the first line: the tag itself and everything preceding it.

[sourcecode light="true"]s_.*<body>\(.*\)_\1_[/sourcecode]

I use _ as a delimiter instead of slashes to make the substitution code a bit more readable. The regular expression .* matches zero or more arbitrary characters. So I use it around the body tag in case there is something before and after it. I put the second .* in parentheses to grab the text that might be there, as I want to keep that. By using \1 in the substitution field I replace the whole line with the text after the body tag.

[sourcecode light="true"]s_\(.*\)</body>.*_\1_[/sourcecode]
We do something similar to the line with </body>. The only difference is that now the portion of the match that we are interested in is the one before the body tag so we move the parentheses there.

Keep in mind that this solution will not work if the body tag includes some attributes like style. Someone might think that just using the regular expression below in the substitution would work.

[sourcecode light="true"].*<body.*>\(.*\)[/sourcecode]

Notice that the only difference is that we added the regular expression .* between body and its closing arrow > to point out that there can be nothing in between or there could be some arbitrary things (in our case attributes).

That regular expression doesn't work, however, as it matches up to the last > in the line. The reason is that in sed .* is greedy and there is no way to make it non-greedy. By greedy we mean that the pattern will try to match as much of the line as it can. So if you want .* to stop at the first >, it's not possible. Or.. actually it is possible, but the code, as you will see in the next example, starts looking like a monster.
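The greedy behaviour is easy to see on a single made-up line. The second command is not the solution used in this post but a common workaround that I add for comparison: the negated bracket expression [^>]* matches anything except >, so it cannot run past the end of the tag.

```shell
line='<body class="main"><h1>title</h1>'

# Greedy: the .* after <body runs up to the last > on the line,
# so nothing is left for the parentheses to capture.
greedy=$(printf '%s\n' "$line" | sed 's_.*<body.*>\(.*\)_\1_')

# Bounded: [^>]* stops at the first >, so the rest of the line is captured.
bounded=$(printf '%s\n' "$line" | sed 's_.*<body[^>]*>\(.*\)_\1_')

printf '[%s]\n' "$greedy" "$bounded"
# []
# [<h1>title</h1>]
```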


Grabbing all content between tags in HTML


You are probably better off learning Perl or Python if you need to do these kinds of "advanced things". I will however show that you can achieve things like this without using the more advanced commands or other programs/languages. This solution is a continuation of the previous example on catching the content between the body tags. The only difference is that in this solution we also allow a tag to have attributes and are a bit more permissive towards whitespace. This makes it a more general solution that can be used for tags other than the body tag.

Solution:
[sourcecode light="true"]sed -n '/<body/,/<\/body>/p' | sed '1s#.*<body[a-zA-Z0-9="\x27_ ]*> *<\(.*\)#<\1#; 1{/<body/d}; s#\(.*\)</body>.*#\1#'[/sourcecode]

Essentially two things changed from the previous example. The patterns that look for the opening tag are now /<body/ instead of /<body>/, so that a tag with attributes is still found. And the command for the first line:
[sourcecode light="true"]'s_.*<body>\(.*\)_\1_;'[/sourcecode]
became two commands:
[sourcecode light="true"]'1s#.*<body[a-zA-Z0-9="\x27_ ]*> *<\(.*\)#<\1#; 1{/<body/d};'[/sourcecode]
I also switched the delimiter from _ to #, because _ is now one of the characters inside the brackets and could be mistaken for the delimiter.

The first script line (everything before the first ;) looks for a second tag after body. We don't really need to specify line 1, but it is a good convention as it makes the code easier to understand and the processing faster. I am looking for the body tag with or without attributes. For the attributes I only allow the characters in the brackets: letters, digits, the equals sign, both kinds of quote marks, underscores and spaces. An attribute value that contains anything else will not match. \x27 is the code for a single quote mark '. The reason I use its code instead of the mark itself (I use the double quote as it is, after all) is that I have single quote marks around the whole command, so inserting one in the expression would break it. After that I use " *<" (notice the space) to denote that there might be an arbitrary number of spaces, or none, before the opening of the new tag. I replace everything up to and including the opening < of the tag after body with that < followed by the rest of the line.

If the first command didn't succeed, then it means that there's not a second tag after the body tag. Thus it's safe to delete the whole line in that case. To find out, we check if there is still a body tag on the first line: the first command removes the tag when it succeeds, so if the tag is still there (/<body/), we delete the line with the command d.
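To see the pipeline run end to end, here is a sketch on a made-up page whose body tag has an attribute. It is a variant of the solution above written with # as the delimiter and /<body/ as the opening pattern, and it assumes GNU sed for \x27.

```shell
# Hypothetical page: the body tag carries an attribute and is followed by a space.
page() {
  printf '%s\n' \
    '<html><head></head>' \
    '<body class="main"> <h1>h1 stuck to body</h1>' \
    '<img src="images/soon.png"/>' \
    '<p>a paragraph</p></body></html>'
}

# Step 1 keeps the lines from the opening to the closing body tag.
# Step 2 trims the opening tag from the first line and the closing tag from the last.
content=$(page \
  | sed -n '/<body/,/<\/body>/p' \
  | sed '1s#.*<body[a-zA-Z0-9="\x27_ ]*> *<\(.*\)#<\1#; 1{/<body/d}; s#\(.*\)</body>.*#\1#')

printf '%s\n' "$content"
# <h1>h1 stuck to body</h1>
# <img src="images/soon.png"/>
# <p>a paragraph</p>
```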
