August 20, 2011

Regular Expressions Tutorial - eBook Formatting



The full tutorials for the eBook formatting series include a basic XHTML tutorial, a tutorial for converting your manuscript into XHTML, and a Calibre tutorial for converting XHTML into eBooks. For those looking for something more advanced, you can also peruse the Regular Expressions tutorial, as well as the EPUB and KindleGen tutorial. Templates for XHTML and EPUB are also available for your formatting arsenal. Additionally, there are some helpful hints for formatting for Smashwords in this tutorial.

Table of Contents for Regular Expressions Tutorial
Introduction to Regular Expressions
The Find Function
Multiplying Operators ?+*
Matching Non-standard Characters
Using Anchors
Using Substitutions
Useful Regular Expressions for eBook Formatting

Introduction to Regular Expressions
When working with large amounts of text, such as the XHTML code for an eBook, it is advantageous to use regular expressions. Perhaps you are familiar with wildcards, such as searching for "*.jpg" to find all JPEGs in one directory. The "*" indicates any characters and the ".jpg" refers specifically to the extension ".jpg". Essentially, regular expressions are a more sophisticated system of wildcards that are utilized for locating and manipulating text strings.

A regular expression, or a "regex", can be utilized by programmers and web administrators when working with large amounts of complex data. Regular expressions can become very complicated, but this guide will cover the basic ways to have regular expressions assist you in formatting your eBook.

Important Note: For this tutorial, quotes are wrapped around what you should type in the Find or Replace window. As an example, if the guide asks you to type "something" in the Find window, it would look as follows in your text editor:
[Figure: The Find Window in Notepad++]

The Find Function
Normally when you use the Find function (Ctrl-F), you type in a character or word in the Find window and click Find Next. The text editor then goes through the document looking for exactly what you typed into the Find window.

For example, you have the following text document on one line in your text editor:
Poppa took the apple to Paul. The Pope placed it in the box labeled 4.
[Figure: Typical Text Editor Layout]
Hypothetically, you want to search for the letter "p". Type "p", without the quotes, in the Find window. Click Find Next, and the editor cycles through all the p's in the line below, selecting them one by one:
Poppa took the apple to Paul. The Pope placed it in the box labeled 4.
While useful, this simple kind of search cannot locate or manipulate more complicated strings of text. That is where regular expressions come in.

Multiplying Operators ?+*
Multiplying operators are special characters reserved in regular expressions that provide instructions to the Find function as it searches for the text. You can utilize multiplying operators to find zero, one, or more of any character.

If you want to search for "pp" or "p"-character-"p", you can use the multiplying operator "?". The "character?" tells the Find function to look for character one or zero times. Therefore, if you typed "po?p" in the Find window, you would find "pp" or "pop".

Type "po?p" in the Find window (make sure you click the button for regular expressions in Notepad++) and it will find matches in the following text:
Poppa took the apple to Paul. The Pope placed it in the box labeled 4.
If you want to search for "p"-character-"p", where character has to be in text at least once, you can use the multiplying operator "+". The "character+" tells the Find function to look for character one or more times.

Therefore, if you typed "po+p" in the Find window, it would highlight "pop", "poop", "pooop", etc. but not "pp":
Poppa took the apple to Paul. The Pope placed it in the box labeled 4.
The multiplying operator "*" is used in a similar fashion as "?" and "+". If you want to search for "p"-character-"p", where character can be in the text zero to any number of times, you would use "*".

Therefore, if you typed "po*p" in the Find window, it would highlight "pp", "pop", "poop", "pooop", etc.:
Poppa took the apple to Paul. The Pope placed it in the box labeled 4.
Important Note: Some text editors have a different syntax for regular expressions. However, the principle is essentially the same. Consult your help menu for syntax details if you are not using Notepad++.
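The three multiplying operators can also be compared side by side outside the editor. The following is a minimal sketch in Python, whose `re` module uses the same "?", "+", and "*" syntax. It assumes the editor's "Match case" option is off, which is mirrored here with `re.IGNORECASE`; the variable names are illustrative only.

```python
import re

text = "Poppa took the apple to Paul. The Pope placed it in the box labeled 4."

# Assumption: the editor search ignores case, so "p" also matches "P".
zero_or_one = re.findall(r"po?p", text, flags=re.IGNORECASE)   # "pp" or "pop"
one_or_more = re.findall(r"po+p", text, flags=re.IGNORECASE)   # "pop", "poop", ...
zero_or_more = re.findall(r"po*p", text, flags=re.IGNORECASE)  # "pp", "pop", "poop", ...

print(zero_or_one)   # ['Pop', 'pp', 'Pop']
print(one_or_more)   # ['Pop', 'Pop']
print(zero_or_more)  # ['Pop', 'pp', 'Pop']
```

Only "po+p" skips the "pp" in "apple", because "+" requires at least one "o".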

Matching Non-standard Characters
Rather than typing in specific combinations of characters, it can be advantageous to search for different types of characters, or any character. This is similar to wildcards.

For regular expressions, to search for any character (a letter, digit, space, and everything else) you can type "." in the Find window. If you just type a ".", it will go one by one through every character in your text.
Poppa took the apple to Paul. The Pope placed it in the box labeled 4.
Typing "p.p" in the Find window would select "p1p", "pxp", "p p", etc.:
Poppa took the apple to Paul. The Pope placed it in the box labeled 4.
It may be necessary to find special characters reserved for regular expression functionality (e.g. "*", "?", "."). If you want to actually find a "." (i.e. the period), you need to place a backslash in front of the "." like "\.".

Type "\." in the Find window, and you will locate the two periods:
Poppa took the apple to Paul. The Pope placed it in the box labeled 4.
You may want to locate any digits in your text. Type "\d" in the Find window to find a digit character:
Poppa took the apple to Paul. The Pope placed it in the box labeled 4.
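As a rough check of the ".", "\." and "\d" expressions above, here is a small Python sketch. It assumes a case-insensitive search (the `re.IGNORECASE` flag) for the "p.p" example, and the variable names are made up for illustration.

```python
import re

text = "Poppa took the apple to Paul. The Pope placed it in the box labeled 4."

any_char = re.findall(r".", text)                        # every character, one at a time
p_any_p = re.findall(r"p.p", text, flags=re.IGNORECASE)  # "p", any one character, "p"
periods = re.findall(r"\.", text)                        # literal periods only
digits = re.findall(r"\d", text)                         # digit characters

print(len(any_char) == len(text))  # True
print(p_any_p)                     # ['Pop', 'Pop']
print(periods)                     # ['.', '.']
print(digits)                      # ['4']
```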
If you want to highlight any number of characters between a "p" and an "a", you can try typing the expression "p.*a". However, this will give you surprising results, because the "*" operator is greedy by default. This means the ".*" part of the regular expression grabs as much of the line as it can, so the match runs from the first "p" all the way to the last "a" on the line rather than stopping at the next "a".

Typing "p.*a" in the Find window matches the following:
Poppa took the apple to Paul. The Pope placed it in the box labeled 4.
To limit how far the "*" multiplying operator will go, you can add the "?" multiplying operator on the end of the ".*". This limits the behavior of the "*" so that after finding a "p", the ".*" will stop at the first "a" that it sees.

Type "p.*?a" in your text editor, and the following will be selected:
Poppa took the apple to Paul. The Pope placed it in the box labeled 4.
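The difference between the greedy ".*" and the limited ".*?" is easiest to see with both run on the same line. This is a sketch in Python, again assuming a case-insensitive search; it is not a reproduction of what any particular editor highlights.

```python
import re

text = "Poppa took the apple to Paul. The Pope placed it in the box labeled 4."

# Greedy: ".*" runs to the LAST "a" on the line.
greedy = re.findall(r"p.*a", text, flags=re.IGNORECASE)

# Non-greedy: ".*?" stops at the FIRST "a" after each "p".
lazy = re.findall(r"p.*?a", text, flags=re.IGNORECASE)

print(greedy)  # ['Poppa took the apple to Paul. The Pope placed it in the box la']
print(lazy)    # ['Poppa', 'pple to Pa', 'Pope pla']
```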
Sets [ ]
Say you wanted to search for the letter "p" or "i". You would have to run the Find function twice. However, if you use a regular expression, you can make the text editor look for "p" or "i" on the same pass.

The brackets [ ] enclose what is called a set, and you can put any combination of letters, digits, or other characters in there. The set matches any single character that it contains.

Type "[ip]" in the Find window (make sure you click the button for regular expressions). The text editor will go through the document and find:
Poppa took the apple to Paul. The Pope placed it in the box labeled 4.
You can also use sets to provide a range of numbers or letters. For example, the set "[a-c]" will find the letters "a", "b", and "c":
Poppa took the apple to Paul. The Pope placed it in the box labeled 4.
If you want to search for characters not in a set, you add the "^" symbol before the characters within the [ ] brackets.

Type "[^a-z]" in the Find window. The text editor will go through the document and match:
Poppa took the apple to Paul. The Pope placed it in the box labeled 4.
If you do not want to highlight the spaces, you can add the special code "\s", which matches spaces, tabs, and line breaks, to the set.

Type "[^a-z\s]" in the Find window. The text editor will go through the document and match:
Poppa took the apple to Paul. The Pope placed it in the box labeled 4.
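The set examples can be tried in Python as well. In this sketch the "[ip]" search is assumed to ignore case, while the range and negated sets are run case-sensitively so that capital letters count as "not a-z"; an editor with "Match case" switched off may behave differently.

```python
import re

text = "Poppa took the apple to Paul. The Pope placed it in the box labeled 4."

# Each findall returns single characters; join them to see the matches in order.
i_or_p = "".join(re.findall(r"[ip]", text, flags=re.IGNORECASE))
a_to_c = "".join(re.findall(r"[a-c]", text))
not_lower_or_space = "".join(re.findall(r"[^a-z\s]", text))

print(i_or_p)              # PppppPPppii
print(a_to_c)              # aaaacbab
print(not_lower_or_space)  # PP.TP4.
```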
Using Anchors
If you want to search for words that start with the letter "p", you can use special codes called anchors. "\<character" will look for a word that starts with character. Likewise, "character\>" will look for a word that ends with character.

Type "\<p" in the Find window. The text editor will go through the document and find:
Poppa took the apple to Paul. The Pope placed it in the box labeled 4.
To find any words that end with "e", type "e\>" in the Find window. The text editor will go through the document and find:

Poppa took the apple to Paul. The Pope placed it in the box labeled 4.
To find any words that start with "p" and end with "e", you may be inclined to type "\<pe\>". This will only find the standalone word "pe". Instead, type "\<p[a-z]*e\>" in the Find window:
Poppa took the apple to Paul. The Pope placed it in the box labeled 4.
Let's analyze this regular expression. The "\<p" term is looking for any word that starts with "p". Then, the "[a-z]*" is looking for zero or more letters. Finally, the "e\>" means the word must end with an "e".

You may have noticed that when you search for words starting with "p" by using the regular expression "\<p", it only selected the "p" by itself. However, if you want to select the entire word that starts with the letter "p", type in "\<p[a-z]*".
Poppa took the apple to Paul. The Pope placed it in the box labeled 4.
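The word anchors can be approximated in Python, with one caveat: Python's `re` module does not support "\<" and "\>", so this sketch swaps in "\b", which matches at either edge of a word. Placed before a letter it behaves like "\<", and placed after a letter it behaves like "\>". A case-insensitive search is assumed where the pattern starts with "p".

```python
import re

text = "Poppa took the apple to Paul. The Pope placed it in the box labeled 4."

starts_with_p = re.findall(r"\bp", text, flags=re.IGNORECASE)      # like "\<p"
ends_with_e = re.findall(r"e\b", text)                             # like "e\>"
p_to_e = re.findall(r"\bp[a-z]*e\b", text, flags=re.IGNORECASE)    # like "\<p[a-z]*e\>"
whole_p_words = re.findall(r"\bp[a-z]*", text, flags=re.IGNORECASE)

print(starts_with_p)  # ['P', 'P', 'P', 'p']
print(ends_with_e)    # ['e', 'e', 'e', 'e', 'e']
print(p_to_e)         # ['Pope']
print(whole_p_words)  # ['Poppa', 'Paul', 'Pope', 'placed']
```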
Say you want to get tricky and find entire sentences. You could type the regular expression "[a-z].*?[\.]". This will look for any letter, and select until it hits ".", a period. Don't forget that the "?" after the "*" turns the behavior of the "*" to non-greedy. If the "?" were not included, it would select the entire text on the line.
Poppa took the apple to Paul. The Pope placed it in the box labeled 4.
However, not all sentences end in a period. Sometimes they end in a question mark or a closing quotation mark. You can add additional characters to the final set (such as "\?" or """), and the Find function will look for either a "?" or """ at the end.

Type "[a-z].*?[\.\?"]" in the Find window to select the following text in your text editor:
Poppa took the apple to Paul? The Pope placed it in the box labeled 4.
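Here is the sentence-matching expression as a Python sketch. It assumes a case-insensitive search, which is what lets "[a-z]" begin each match on the capital letter that opens a sentence.

```python
import re

text = "Poppa took the apple to Paul? The Pope placed it in the box labeled 4."

# A letter, then as little as possible, then a period, question mark, or quote.
sentences = re.findall(r'[a-z].*?[\.\?"]', text, flags=re.IGNORECASE)

print(sentences)
# ['Poppa took the apple to Paul?', 'The Pope placed it in the box labeled 4.']
```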
For eBook formatting, you want to select entire lines of text when you wrap <p> and </p> tags around your paragraphs in the XHTML code. To select an entire line of text, type the following "^(.+)$".
Poppa took the apple to Paul? The Pope placed it in the box labeled 4.
The "^" is an anchor that tells the Find function to start at the front of the line in the text editor. Likewise, the "$" is an anchor that tells the Find function to end at the end of the line in the text editor. The ".+" means one or more characters (i.e. any line that is not blank). The "^" should not be confused with the "^" inside of a set (e.g. "[^a-z]"), because within a set it means not "a-z".
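To see the line anchors at work on more than one line, the sketch below runs "^(.+)$" over a short made-up manuscript in Python. One assumption to note: Python only treats "^" and "$" as per-line anchors when the `re.MULTILINE` flag is given, whereas a text editor typically works line by line already.

```python
import re

# Hypothetical manuscript: two paragraphs separated by a blank line.
manuscript = "It was a dark night.\n\nThe rain fell.\n"

# re.MULTILINE makes "^" and "$" anchor to each line instead of the whole string.
lines = re.findall(r"^(.+)$", manuscript, flags=re.MULTILINE)

print(lines)  # ['It was a dark night.', 'The rain fell.']
```

The blank line is skipped because ".+" needs at least one character.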

Using Substitutions
Substitutions are used when you are conducting find and replace operations. In this guide, you will need to use regular expression substitutions to automatically convert the placeholder tags for italics/bold/underlined text to proper XHTML.


You have wrapped some italicized text with the "QQQ" placeholder tags as follows:
QQQitalics textQQQ not italics text QQQmore italics textQQQ
To select the entire italics text and the "QQQ" placeholder tags wrapped around it, you could have a regular expression like "QQQ.*?QQQ". This would select the following:
QQQitalics textQQQ not italics text QQQmore italics textQQQ
To convert the placeholder tags to proper XHTML, you want to replace the opening "QQQ" with "<span class="i">" and the closing "QQQ" with "</span>", but you don't want to alter the text within. The way your regular expression is now, this would not be possible.

To be able to replace the "QQQ" and leave the italics text as is, you need to build parentheses into your regular expression for the Find function.

Typing "(QQQ)(.*?)(QQQ)" in the Find window will yield the same selection results previously mentioned. However, in the Replace window, you want to type "<span class="i">\2</span>". After you replace both matches, the text reads as follows:
<span class="i">italics text</span> not italics text <span class="i">more italics text</span>
This is exactly the type of XHTML code you want. To analyze the expression, look at the three parenthetical elements in the find expression. The 1st was "(QQQ)", the 2nd was "(.*?)", and the 3rd was "(QQQ)". In the Replace window, you made no reference to the "(QQQ)" expression. To do so, you would have had to type "\1" or "\3".
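The same find and replace can be sketched in Python with `re.sub`, which also uses "\2" to refer to the second parenthetical element. The variable names are illustrative.

```python
import re

text = "QQQitalics textQQQ not italics text QQQmore italics textQQQ"

# Group 2 is the text between the placeholders; groups 1 and 3 are dropped.
converted = re.sub(r"(QQQ)(.*?)(QQQ)", r'<span class="i">\2</span>', text)

print(converted)
# <span class="i">italics text</span> not italics text <span class="i">more italics text</span>
```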

Useful Regular Expressions for eBook Formatting
Finds Special Characters

FIND "[^<>A-Za-z0-9\.,'"?\\\^\|\-\[\]:!;()/$#@&%*_+{}=~\s]"
Explanation: The [ ] marks a set of characters for the Find function to look for. The "^" inside a set means NOT the characters in the set.

Finds Special Characters (not including Curled Quotes and Em Dashes)

FIND "[^<>A-Za-z0-9\.,'"?\\\^\|\-\[\]:!;()/$#@&%*_+{}=~\s…“”‘’–—]"
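As a hedged illustration of how the two special-character searches differ, the Python sketch below builds both sets from the expressions above and runs them on a made-up line of XHTML. The sample text and the names `ASCII_SET`, `strict`, and `lenient` are assumptions for the example.

```python
import re

# The characters allowed by the first expression, copied from the FIND string.
ASCII_SET = r"""<>A-Za-z0-9\.,'"?\\\^\|\-\[\]:!;()/$#@&%*_+{}=~\s"""

strict = re.compile("[^" + ASCII_SET + "]")
# The second expression also allows the ellipsis, curled quotes, and dashes.
lenient = re.compile("[^" + ASCII_SET + "…“”‘’–—]")

sample = "<p>“Café” costs £5 — really…</p>"

print(strict.findall(sample))   # ['“', 'é', '”', '£', '—', '…']
print(lenient.findall(sample))  # ['é', '£']
```

The second search is useful once curled quotes and dashes are intentional and only the remaining unusual characters need attention.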
Wrap <p> and </p> Tags on Each Line
FIND "^(.+)$"
REPLACE "<p>\1</p>"
Explanation: The "^" starts at the beginning of the line. The "(.+)" finds one or more of any character (i.e. not a blank line). The "$" ends at the end of a line. The "\1" in "<p>\1</p>" stands for the text captured by the parentheses, so the paragraph tags are wrapped around the text of the line.
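A minimal Python sketch of this recipe follows, using a closing tag written as "</p>". The sample manuscript is invented, and `re.MULTILINE` is needed in Python so that "^" and "$" apply to every line.

```python
import re

# Hypothetical manuscript with a blank line between paragraphs.
manuscript = "Chapter 1\n\nIt was a dark night.\n"

wrapped = re.sub(r"^(.+)$", r"<p>\1</p>", manuscript, flags=re.MULTILINE)

print(wrapped)
# <p>Chapter 1</p>
#
# <p>It was a dark night.</p>
```

Blank lines are left untouched, so they can be removed in a separate pass.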

Add the Style "centered" to Section Breaks
FIND "(<p>)(\*\*\*)(</p>)"
REPLACE "<p class="centered">\2\3"
Explanation: The find string looks for "***" wrapped in <p> tags. The replace string replaces the leading "<p>" tag with "<p class="centered">".

Delete Blank Lines (except for the last line)
FIND "\n\r" (use in Extended Search Mode only)
REPLACE ""
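Explanation: "\n" matches a line feed and "\r" matches a carriage return. In a file with Windows line endings, every line ends with "\r\n", so the sequence "\n\r" only appears where one line ending is immediately followed by another, which is a blank line.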
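Because Extended Search Mode does plain text matching rather than regular expressions, the closest Python equivalent is an ordinary string replace. This sketch assumes Windows line endings ("\r\n") and uses a made-up sample.

```python
# Hypothetical text with one blank line, then two blank lines in a row.
text = "one\r\n\r\ntwo\r\n\r\n\r\nthree\r\n"

# Removing every "\n\r" pair joins each line ending to the next non-blank line.
cleaned = text.replace("\n\r", "")

print(repr(cleaned))  # 'one\r\ntwo\r\nthree\r\n'
```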

Add the "Chapter" Style to Chapters
FIND "(<p>)(Chapter.*?)(</p>)"
REPLACE "<p class="chapter">\2\3"
Explanation: The regular expression in the Find window matches any paragraph that begins with the word "Chapter" and is wrapped in <p> tags. The regular expression in the Replace window adds class="chapter" to the leading <p> tag.
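This recipe and the section-break recipe above share the same shape: three parenthetical elements, with only the leading "<p>" rewritten. The Python sketch below captures that shape in a hypothetical helper named `add_class`; the helper and the sample page are assumptions for illustration, not part of any tool.

```python
import re

def add_class(xhtml, class_name, inner_pattern):
    """Hypothetical helper: add class_name to <p> tags whose text matches inner_pattern."""
    find = r"(<p>)(" + inner_pattern + r")(</p>)"
    replace = r'<p class="' + class_name + r'">\2\3'
    return re.sub(find, replace, xhtml)

page = "<p>Chapter 1</p>\n<p>It began.</p>\n<p>***</p>\n<p>It ended.</p>"
page = add_class(page, "chapter", r"Chapter.*?")
page = add_class(page, "centered", r"\*\*\*")

print(page)
# <p class="chapter">Chapter 1</p>
# <p>It began.</p>
# <p class="centered">***</p>
# <p>It ended.</p>
```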

Replace Placeholder Tags from Word Processor with XHTML
FIND "(QQQ)(.*?)(QQQ)"
REPLACE "<span class="i">\2</span>"
Explanation: The regular expression in the Find window matches any text wrapped in the placeholder tags "QQQ", including the actual placeholder tags. The regular expression in the Replace window wraps in-line XHTML code around the text and deletes the placeholder tags.

3 comments:

IPCreaper said...

hello
i am a newbie to all of this but you and your site has give light in this adventure i am traveling. Question is there a syntax for the copyright trademark?

Paul Salvette said...

IPCreaper,

The entity for the little copyright symbol is © I hope that helps.

Paul Salvette said...

er &copy;