Paul Salvette in Bangkok: Regular Expressions Tutorial

The full tutorials for the eBook formatting series include a basic XHTML tutorial, a tutorial for converting your manuscript into XHTML, and a Calibre tutorial for converting XHTML into eBooks. For those looking for something more advanced, you can also peruse the Regular Expressions tutorial, as well as the EPUB and KindleGen tutorial. Templates for XHTML and EPUB are also available for your formatting arsenal. Additionally, there are some helpful hints for formatting for Smashwords in this tutorial.

Table of Contents for Regular Expressions Tutorial
Introduction to Regular Expressions
The Find Function
Multiplying Operators ?+*
Matching Non-standard Characters
Using Anchors
Using Substitutions
Useful Regular Expressions for eBook Formatting

Introduction to Regular Expressions

When working with large amounts of text, such as the XHTML code for an eBook, it is advantageous to use regular expressions. Perhaps you are familiar with wildcards, such as searching for "*.jpg" to find all JPEGs in one directory. The "*" indicates any characters and the ".jpg" refers specifically to the extension ".jpg". Essentially, regular expressions are a more sophisticated system of wildcards that are utilized for locating and manipulating text strings.

A regular expression, or a "regex", can be utilized by programmers and web administrators when working with large amounts of complex data. Regular expressions can become very complicated, but this guide will cover the basic ways to have regular expressions assist you in formatting your eBook.

Important Note: For this tutorial, quotes are wrapped around what you should type in the Find or Replace window. As an example, if the guide asks you to type "something" in the Find window, it would look as follows in your text editor:

[Image: The Find Window in Notepad++]

The Find Function

Normally when you use the Find function (Ctrl-F), you type in a character or word in the Find window and click Find Next. The text editor then goes through the document looking for exactly what you typed into the Find window.

For example, you have the following text document on one line in your text editor:

[Image: Typical Text Editor Layout]

Hypothetically, you want to search for the letter "p". You type "p", without the quotes, in the find window. Click Find Next, and it cycles through all the p's, selecting them one-by-one as indicated with the underlined text below:

While useful, you want to go beyond this simple type of finding and to manipulate complicated strings of text.

Multiplying Operators ?+*

Multiplying operators are special characters reserved in regular expressions that provide instructions to the Find function as it searches for the text. You can utilize multiplying operators to find zero, one, or more of any character.

If you want to search for "pp" or "p"-character-"p", you can use the multiplying operator "?". The "character?" tells the Find function to look for character one or zero times. Therefore, if you typed "po?p" in the Find window, you would find "pp" or "pop".

Type "po?p" in the Find window (make sure you click the button for regular expressions in Notepad++) and it will match the following underlined text:

If you want to search for "p"-character-"p", where character has to be in text at least once, you can use the multiplying operator "+". The "character+" tells the Find function to look for character one or more times.

Therefore, if you typed "po+p" in the Find window, it would highlight "pop", "poop", "pooop", etc. but not "pp":

The multiplying operator "*" is used in a similar fashion as "?" and "+". If you want to search for "p"-character-"p", where character can be in the text zero to any number of times, you would use "*".

Therefore, if you typed "po*p" in the Find window, it would highlight "pp", "pop", "poop", "pooop", etc.:

Important Note: Some text editors have a different syntax for regular expressions. However, the principle is essentially the same. Consult your help menu for syntax details if you are not using Notepad++.
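If you would like to check the three multiplying operators outside of a text editor, the short Python sketch below runs them with the standard re module. The sample string is made up for illustration (it is not the text from the screenshots), and for these three operators Python uses the same syntax described above.

```python
import re

# Hypothetical sample text, chosen only to exercise the three operators.
sample = "pp pop poop pooop"

# "?" : zero or one "o" between the two "p" characters
zero_or_one = re.findall(r"po?p", sample)

# "+" : one or more "o" between the two "p" characters
one_or_more = re.findall(r"po+p", sample)

# "*" : zero or more "o" between the two "p" characters
zero_or_more = re.findall(r"po*p", sample)

print(zero_or_one)   # ['pp', 'pop']
print(one_or_more)   # ['pop', 'poop', 'pooop']
print(zero_or_more)  # ['pp', 'pop', 'poop', 'pooop']
```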

Matching Non-standard Characters

Rather than typing in specific combinations of characters, it can be advantageous to search for different types of characters, or any character. This is similar to wildcards.

For regular expressions, to search for any character (a letter, digit, space, and everything else) you can type "." in the Find window. If you just type a ".", it will go one by one through every character in your text.

Typing "p.p" in the Find window would select "p1p", "pxp", "p p", etc.:

It may be necessary to find special characters reserved for regular expression functionality (i.e. "*", "?", ".", etc.). If you want to actually find a "." (i.e. the period), you need to place a backslash in front of the "." like "\.".

Type "\." in the Find window, and you will locate the two periods:

You may want to locate any digits in your text. Type "\d" in the Find window to find a digit character:

If you want to highlight any number of characters between a "p" and an "a", you can try typing the expression "p.*a". However, this will give you strange results, because most text editors exhibit what is called greedy behavior. This means the ".*" part of the regular expression will go as far as it can along the line of your text editor, so after finding the "p" the match runs to the last "a" on the line rather than stopping at the next one.

Typing "p.*a" in the Find window matches the following:

To limit how far the "*" multiplying operator will go, you can add the "?" multiplying operator on the end of the ".*". This limits the behavior of the "*" so that after finding a "p", the ".*" will stop at the first "a" that it sees.

Type "p.*?a" in your text editor, and the following will be selected:
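The difference between greedy and non-greedy matching is easier to see side by side. The sketch below uses Python's re module and a made-up sample string; both are assumptions for illustration only.

```python
import re

# Hypothetical sample text with several "a" characters after the first "p".
sample = "pizza pasta"

# Greedy: ".*" runs as far as possible, so the match ends at the LAST "a".
greedy = re.search(r"p.*a", sample).group()

# Non-greedy: ".*?" stops at the FIRST "a" after each "p".
lazy = re.findall(r"p.*?a", sample)

print(greedy)  # pizza pasta
print(lazy)    # ['pizza', 'pa']
```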

Sets [ ]
Say you wanted to search for the letter "p" or "i". You would have to run the Find function twice. However, if you use a regular expression, you can make the text editor look for "p" or "i" on the same pass.

The brackets [ ] enclose what is called a set, and you can put any combination of letters, digits, or other characters in there. The set matches any one of them.

Type "[ip]" in the Find window (make sure you click the button for regular expressions). The text editor will go through the document and find:

You can also use sets to provide a range of numbers or letters. For example the set "[a-c]" will find the letters "a", "b", and "c":

If you want to search for characters not in a set, you add the "^" symbol before the characters within the [ ] brackets.

Type "[^a-z]" in the Find window. The text editor will go through the document and match:

If you do not want to highlight the spaces, you can add the special code "\s", which matches spaces, tabs, and line breaks, to the set.

Type "[^a-z\s]" in the Find window. The text editor will go through the document and match:
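The four set expressions from this section can be tried out with the following Python sketch. It uses the standard re module and a hypothetical sample string, so the matches will differ from the ones in the original screenshots.

```python
import re

# Hypothetical sample text: lowercase letters, spaces, digits, punctuation.
sample = "pip 42 abc!"

either = re.findall(r"[ip]", sample)          # "i" or "p"
letter_range = re.findall(r"[a-c]", sample)   # "a", "b", or "c"
not_lower = re.findall(r"[^a-z]", sample)     # anything except a-z
not_lower_or_space = re.findall(r"[^a-z\s]", sample)  # also skip whitespace

print(either)              # ['p', 'i', 'p']
print(letter_range)        # ['a', 'b', 'c']
print(not_lower)           # [' ', '4', '2', ' ', '!']
print(not_lower_or_space)  # ['4', '2', '!']
```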

Using Anchors

To search for words that start with the letter "p", you can use special codes called anchors. "\<character" will look for a word that starts with character. Likewise, "character\>" will look for a word that ends with character.

Type "\<p" in the Find window. The text editor will go through the document and find:

To find any words that end with "e", type "e\>" in the Find window. The text editor will go through the document and find:

To find any words that start with "p" and end with "e", you may be inclined to type "\<pe\>". This will only find the string "pe". Instead, type "\<p[a-z]*e\>" in the Find window:

Let's analyze this regular expression. The "\<p" term is looking for any word that starts with "p". Then, the [a-z]* is looking for zero or more letters. Finally, the "e\>" means the word must end with an "e".

You may have noticed that when you search for words starting with "p" by using the regular expression "\<p", it only selected the "p" by itself. However, if you want to select the entire word that starts with the letter "p", type in "\<p[a-z]*".
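Here is a sketch of the same word searches in Python. Note the swap: Python's re module does not support the "\<" and "\>" anchors, so this example uses "\b" (a word boundary at either end of a word) in their place. The sample sentence is hypothetical.

```python
import re

# Hypothetical sample sentence.
sample = "please pass the pie to pete"

# Whole words starting with "p" (the tutorial's "\<p[a-z]*")
starts_with_p = re.findall(r"\bp[a-z]*", sample)

# Whole words starting with "p" and ending with "e" ("\<p[a-z]*e\>")
p_to_e = re.findall(r"\bp[a-z]*e\b", sample)

# Just the final "e" of each word that ends with "e" ("e\>")
final_e_count = len(re.findall(r"e\b", sample))

print(starts_with_p)  # ['please', 'pass', 'pie', 'pete']
print(p_to_e)         # ['please', 'pie', 'pete']
print(final_e_count)  # 4
```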

Say you want to get tricky and find entire sentences. You could type the regular expression "[a-z].*?[\.]". This will look for any lowercase letter, and select until it hits ".", a period. Don't forget that the "?" after the "*" turns the behavior of the "*" to non-greedy. If the "?" were not included, it would select everything up to the last period on the line.

However, not all sentences end in a period. Sometimes they end in a question mark or a closing quotation mark. You can add additional characters to the final set (such as "\?" or """), and the Find function will look for a ".", a "?", or a """ at the end.

Type "[a-z].*?[\.\?"]" in the Find window to select the following text in your text editor:
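The sentence-finding expression can be tested with the Python sketch below. The sample text is an assumption and is deliberately all lowercase, because "[a-z]" as written only matches lowercase letters (a sentence beginning with a capital would be matched from its second letter onward).

```python
import re

# Hypothetical sample text, all lowercase so "[a-z]" matches each first letter.
sample = "it rained. was it cold? yes."

# A lowercase letter, then as little as possible, then ".", "?" or a double quote.
sentences = re.findall(r'[a-z].*?[\.\?"]', sample)

print(sentences)  # ['it rained.', 'was it cold?', 'yes.']
```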

For eBook formatting, you want to select entire lines of text when you wrap <p> and </p> tags around your paragraphs in the XHTML code. To select an entire line of text, type "^(.+)$" in the Find window.

The "^" is an anchor that tells the Find function to start at the front of the line in the text editor. Likewise, the "$" is an anchor that tells the Find function to end at the end of the line in the text editor. The ".+" means one or more characters (i.e. any line that is not blank). The "^" should not be confused with the "^" inside of a set (e.g. "[^a-z]"), because within a set it means not "a-z".
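A text editor applies "^" and "$" to each line automatically. In Python's re module that line-by-line behavior has to be switched on with the re.MULTILINE flag, as this small sketch shows. The sample text is hypothetical.

```python
import re

# Hypothetical three-line text; the middle line is blank.
text = "Line one\n\nLine three"

# With re.MULTILINE, "^" and "$" anchor at the start and end of every line.
non_blank_lines = re.findall(r"^(.+)$", text, flags=re.MULTILINE)

# Without the flag, "^" and "$" only anchor to the whole string, so nothing matches.
without_flag = re.findall(r"^(.+)$", text)

print(non_blank_lines)  # ['Line one', 'Line three']
print(without_flag)     # []
```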

Using Substitutions

Substitutions are used when you are conducting find and replace operations. In this guide, you will need to use regular expression substitutions to automatically convert the placeholder tags for italics/bold/underlined text to proper XHTML.

You have wrapped some italicized text with the "QQQ" placeholder tags as follows:

To select the entire italics text and the "QQQ" placeholder tags wrapped around it, you could have a regular expression like "QQQ.*?QQQ". This would select the following:

To convert the placeholder tags to proper XHTML, you want to replace the opening "QQQ" with an opening italics tag (e.g. "<i>") and the closing "QQQ" with the matching closing tag (e.g. "</i>"), but you don't want to alter the text within. The way your regular expression is now, this would not be possible.

To be able to replace the "QQQ" and leave the italics text as is, you need to build parentheses into your regular expression for the Find function.

Typing "(QQQ)(.*?)(QQQ)" in the Find window will yield the same selection results previously mentioned. However, in the Replace window, you want to type "\2" wrapped in the opening and closing italics tags (e.g. "<i>\2</i>"). After you click Replace, the text will be altered as follows:

This is exactly the type of XHTML code you want. To analyze the expression, look at the three parenthetical elements in the find expression. The 1st was "(QQQ)", the 2nd was "(.*?)", and the 3rd was "(QQQ)". The "\2" in the Replace window inserts whatever the 2nd element matched, which is the italics text. In the Replace window, you made no reference to the "(QQQ)" expressions, so the placeholder tags are dropped. To keep them, you would have had to type "\1" or "\3".
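The same find and replace can be sketched in Python with re.sub. The sample sentence is made up, and the "<i>" and "</i>" tags are an assumption about which italics markup you want, so swap in whatever tags your XHTML uses.

```python
import re

# Hypothetical sentence with italic text wrapped in "QQQ" placeholder tags.
text = "He said it was QQQreallyQQQ important."

# Three groups: \1 = opening QQQ, \2 = the text in between, \3 = closing QQQ.
pattern = r"(QQQ)(.*?)(QQQ)"

# Keep only group 2 and wrap it in italics tags (assumed here to be <i>...</i>).
converted = re.sub(pattern, r"<i>\2</i>", text)

# Referencing all three groups puts the placeholders back unchanged.
unchanged = re.sub(pattern, r"\1\2\3", text)

print(converted)  # He said it was <i>really</i> important.
```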

Useful Regular Expressions for eBook Formatting

Finds Special Characters

Explanation: The [ ] marks a set of characters for the Find function to look for. The "^" inside a set means NOT the characters in the set.
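As a sketch of the negated-set idea, the Python example below treats a "special character" as anything outside plain ASCII. That definition, the patterns, and the sample text are all assumptions for illustration and may differ from the Find strings originally shown here.

```python
import re

# Hypothetical sample: curled quotes, an em dash, an accented letter, a euro sign.
text = "\u201cHello\u201d \u2014 caf\u00e9 costs 5\u20ac"

# Assumed pattern: any character that is NOT plain ASCII.
special = re.findall(r"[^\x00-\x7F]", text)

# Same idea, with curled quotes and the em dash added to the excluded set.
special_strict = re.findall(r"[^\x00-\x7F\u2018\u2019\u201c\u201d\u2014]", text)

print(special)         # the two curled quotes, the em dash, the accented e, the euro sign
print(special_strict)  # only the accented e and the euro sign
```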

Finds Special Characters (not including Curled Quotes and Em Dashes)

Wrap <p> and </p> Tags on Each Line

Explanation: The "^" starts at the beginning of the line. The "(.+)" finds one or more of any character (i.e. not a blank line). The "$" ends at the end of a line. In the replace string, "<p>\1</p>" wraps the paragraph tags around the text of the line.
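In Python, the paragraph-wrapping step might look like the sketch below. The sample text is hypothetical, and the re.MULTILINE flag is needed so that "^" and "$" apply to each line the way they do in a text editor.

```python
import re

# Hypothetical manuscript text: two paragraphs separated by a blank line.
text = "First paragraph.\n\nSecond paragraph.\n"

# Wrap every non-blank line in paragraph tags; blank lines are left untouched.
wrapped = re.sub(r"^(.+)$", r"<p>\1</p>", text, flags=re.MULTILINE)

print(wrapped)
```

The blank line does not match because ".+" requires at least one character on the line.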

Add the Style "centered" to Section Breaks

Explanation: The find string looks for "***" wrapped in paragraph tags. The replace string replaces the leading "<p>" tag with "<p class="centered">".
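A minimal Python sketch of this replacement follows. It assumes the section break sits alone inside a pair of paragraph tags; both the pattern and the sample text are illustrative guesses rather than the exact Find and Replace strings.

```python
import re

# Hypothetical XHTML fragment with a "***" section break between two scenes.
text = "<p>End of scene.</p>\n<p>***</p>\n<p>Next scene.</p>"

# Each "*" is escaped because it is otherwise a multiplying operator.
centered = re.sub(r"<p>(\*\*\*)</p>", r'<p class="centered">\1</p>', text)

print(centered)
```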

Delete Blank Lines (except for the last line):

Explanation: "\n" matches a line feed and "\r" matches a carriage return. Putting them together matches a blank line.
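One possible way to do this in Python is sketched below. The pattern is an assumption, not necessarily the Find string intended here: it treats a run of two or more line endings as blank lines and keeps only the first one. The "(?:...)" is a group that is not numbered, which this tutorial does not cover.

```python
import re

# Hypothetical text with Windows line endings (\r\n) and several blank lines.
text = "one\r\n\r\ntwo\r\n\r\n\r\nthree\r\n"

# One line ending (kept as \1) followed by one or more extra line endings (dropped).
cleaned = re.sub(r"(\r?\n)(?:\r?\n)+", r"\1", text)

print(repr(cleaned))  # 'one\r\ntwo\r\nthree\r\n'
```

The line ending after the last line is left alone because nothing follows it.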

Add the "Chapter" Style to Chapters

Explanation: The regular expression in the Find window matches any line with the word "Chapter" that is wrapped in tags. The regular expression in the Replace window adds class="chapter" to the leading tag.

Replace Placeholder Tags from Word Processor with XHTML

Explanation: The regular expression in the Find window matches any text wrapped in the placeholder tags "QQQ", including the actual placeholder tags. The regular expression in the Replace window wraps in-line XHTML code around the text and deletes the placeholder tags.

3 comments:

IPCreaper said...: hello
i am a newbie to all of this but you and your site has give light in this adventure i am traveling. Question is there a syntax for the copyright trademark?; August 14, 2012 at 12:17 AM
Paul Salvette said...: IPCreaper,

The entity for the little copyright symbol is "&copy;". I hope that helps.; August 14, 2012 at 9:11 AM

August 20, 2011