How to extract characters from a string in Python

A piece of data may contain letters, numbers, and special characters. If we are interested in extracting only the letters from such a string, Python offers several options.

With isalpha

The isalpha string method checks whether a character is alphabetic. We use it inside a for loop that visits each character of the given string and tests whether it is a letter. The join method then appends only the valid characters to the result.

Example


stringA = "Qwer34^&t%y"

# Given string
print("Given string : ", stringA)

# Find characters
res = ""
for i in stringA:
    if i.isalpha():
        res = "".join([res, i])

# Result
print("Result: ", res)

Output

Running the above code gives us the following result −

Given string :  Qwer34^&t%y
Result:  Qwerty
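The same filtering can be written more compactly. The following is a minimal sketch, not part of the original example, that passes a generator expression to join; the function name letters_only is chosen here purely for illustration.

```python
def letters_only(text):
    """Return only the alphabetic characters of text, in their original order."""
    # str.isalpha() is True for any Unicode letter, not only A-Z and a-z
    return "".join(ch for ch in text if ch.isalpha())


result = letters_only("Qwer34^&t%y")
print("Result:", result)
```

Building the string with a single join call avoids creating a new intermediate string on every loop iteration.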

With regular expressions

We can also use the regular expression module, re. Its findall function takes a pattern that matches only letters and returns every matching character in the string.
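The original example for this approach is not reproduced here, so the following is only a sketch of what it might look like. It assumes that "letters" means the ASCII range A-Z and a-z; unlike isalpha, this character class does not match accented or non-Latin letters.

```python
import re

string_a = "Qwer34^&t%y"

# findall returns a list with one entry per match of the pattern;
# [A-Za-z] matches a single ASCII letter (an assumption made for this sketch)
letters = re.findall(r"[A-Za-z]", string_a)

# Join the individual letters back into one string
result = "".join(letters)
print("Result:", result)
```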

One place where the Python language really shines is in the manipulation of strings. This section will cover some of Python's built-in string methods and formatting operations, before moving on to a quick guide to the extremely useful subject of regular expressions. Such string manipulation patterns come up often in the context of data science work, and are one big perk of Python in this context.

Strings in Python can be defined using either single or double quotes (they are functionally equivalent):

In [1]:

x = 'a string'
y = "a string"
x == y

Out[1]:

True

In addition, it is possible to define multi-line strings using a triple-quote syntax:

In [2]:

multiline = """
one
two
three
"""

With this, let's take a quick tour of some of Python's string manipulation tools.

Simple String Manipulation in Python

For basic manipulation of strings, Python's built-in string methods can be extremely convenient. If you have a background working in C or another low-level language, you will likely find the simplicity of Python's methods extremely refreshing. We introduced Python's string type and a few of these methods earlier; here we'll dive a bit deeper.

Formatting strings: Adjusting case

Python makes it quite easy to adjust the case of a string. Here we'll look at the upper(), lower(), capitalize(), title(), and swapcase() methods, using the following messy string as an example:

In [3]:

fox = "tHe qUICk bROWn fOx."

To convert the entire string into upper-case or lower-case, you can use the upper() or lower() methods respectively:

In [4]:

fox.upper()

Out[4]:

'THE QUICK BROWN FOX.'

In [5]:

fox.lower()

Out[5]:

'the quick brown fox.'

A common formatting need is to capitalize just the first letter of each word, or perhaps the first letter of each sentence. This can be done with the title() and capitalize() methods:

In [6]:

fox.title()

Out[6]:

'The Quick Brown Fox.'

In [7]:

fox.capitalize()

Out[7]:

'The quick brown fox.'

The cases can be swapped using the swapcase() method:

In [8]:

fox.swapcase()

Out[8]:

'ThE QuicK BrowN FoX.'

Formatting strings: Adding and removing spaces

Another common need is to remove spaces (or other characters) from the beginning or end of the string. The basic method of removing characters is the strip() method, which strips whitespace from the beginning and end of the line:

In [9]:

line = '         this is the content         '
line.strip()

Out[9]:

'this is the content'

To remove just the space to the right or left, use rstrip() or lstrip() respectively:

In [10]:

line.rstrip()

Out[10]:

'         this is the content'

In [11]:

line.lstrip()

Out[11]:

'this is the content         '

To remove characters other than spaces, you can pass the desired character to the strip() method:

In [12]:

num = "000000000000435"
num.strip('0')

Out[12]:

'435'

The opposite of this operation, adding spaces or other characters, can be accomplished using the center(), ljust(), and rjust() methods.

For example, we can use the center() method to center a given string within a given number of spaces:

In [13]:

line = "this is the content"
line.center(30)

Out[13]:

'     this is the content      '

Similarly, ljust() and rjust() will left-justify or right-justify the string within spaces of a given length:

In [14]:

line.ljust(30)

Out[14]:

'this is the content           '

In [15]:

line.rjust(30)

Out[15]:

'           this is the content'

All these methods additionally accept any character which will be used to fill the space. For example:

In [16]:

'435'.rjust(10, '0')

Out[16]:

'0000000435'

Because zero-filling is such a common need, Python also provides zfill(), which is a special method to left-pad a string with zeros:

In [17]:

'435'.zfill(10)

Out[17]:

'0000000435'

Finding and replacing substrings

If you want to find occurrences of a certain character in a string, the find()/rfind(), index()/rindex(), and replace() methods are the best built-in methods.

find() and index() are very similar, in that they search for the first occurrence of a character or substring within a string, and return the index of the substring:

In [18]:

line = 'the quick brown fox jumped over a lazy dog'
line.find('fox')

Out[18]:

16

In [19]:

line.index('fox')

Out[19]:

16

The only difference between find() and index() is their behavior when the search string is not found; find() returns -1, while index() raises a ValueError:

In [20]:

line.find('bear')

Out[20]:

-1

In [21]:

line.index('bear')

ValueError: substring not found
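The difference matters mostly in how a miss is handled. The following sketch is an illustration rather than part of the original text; the variable names are chosen here. It shows the two handling styles side by side, along with the in operator, which is usually the simplest choice when the position itself is not needed.

```python
line = 'the quick brown fox jumped over a lazy dog'

# find() signals a miss with -1, so the return value must be tested
position = line.find('bear')
found_with_find = position != -1

# index() signals a miss by raising ValueError, so the call must be guarded
try:
    line.index('bear')
    found_with_index = True
except ValueError:
    found_with_index = False

# When only membership matters, the `in` operator avoids both patterns
found_with_in = 'bear' in line

print(found_with_find, found_with_index, found_with_in)
```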

The related rfind() and rindex() work similarly, except they search for the first occurrence from the end rather than the beginning of the string:

In [22]:

line.rfind('a')

Out[22]:

35

For the special case of checking for a substring at the beginning or end of a string, Python provides the startswith() and endswith() methods:

In [23]:

line.endswith('dog')

Out[23]:

True

In [24]:

line.startswith('fox')

Out[24]:

False

To go one step further and replace a given substring with a new string, you can use the replace() method. Here, let's replace 'brown' with 'red':

In [25]:

line.replace('brown', 'red')

Out[25]:

'the quick red fox jumped over a lazy dog'

The replace() function returns a new string, and will replace all occurrences of the input:

In [26]:

line.replace('o', '--')

Out[26]:

'the quick br--wn f--x jumped --ver a lazy d--g'

For a more flexible approach to this replace() functionality, see the discussion of regular expressions in Flexible Pattern Matching with Regular Expressions.

Splitting and partitioning strings

If you would like to find a substring and then split the string based on its location, the partition() and/or split() methods are what you're looking for. Both will return a sequence of substrings.

The partition() method returns a tuple with three elements: the substring before the first instance of the split-point, the split-point itself, and the substring after:

In [27]:

line.partition('fox')

Out[27]:

('the quick brown ', 'fox', ' jumped over a lazy dog')

The rpartition() method is similar, but searches from the right of the string.
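To make the left/right distinction concrete, here is a small sketch with a string that contains the split-point more than once. The file name is a made-up example, not taken from the original text.

```python
filename = 'archive.tar.gz'  # hypothetical file name used only for illustration

# partition() splits at the FIRST '.' found from the left
head, sep, tail = filename.partition('.')

# rpartition() splits at the LAST '.', that is, the first one found from the right
stem, dot, extension = filename.rpartition('.')

# When the split-point is absent, rpartition() puts the whole string last
missing = filename.rpartition('/')

print((head, sep, tail))
print((stem, dot, extension))
print(missing)
```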

The split() method is perhaps more useful; it finds all instances of the split-point and returns the substrings in between. The default is to split on any whitespace, returning a list of the individual words in a string:

In [28]:

line.split()

Out[28]:

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog']

A related method is splitlines(), which splits on newline characters. Let's do this with a Haiku, popularly attributed to the 17th-century poet Matsuo Bashō:

In [29]:

haiku = """matsushima-ya
aah matsushima-ya
matsushima-ya"""

haiku.splitlines()

Out[29]:

['matsushima-ya', 'aah matsushima-ya', 'matsushima-ya']

Note that if you would like to undo a split(), you can use the join() method, which returns a string built from a splitpoint and an iterable:

In [30]:

'--'.join(['1', '2', '3'])

Out[30]:

'1--2--3'

A common pattern is to use the special character "\n" (newline) to join together lines that have been previously split, and recover the input:

In [31]:

print("\n".join(['matsushima-ya', 'aah matsushima-ya', 'matsushima-ya']))

matsushima-ya
aah matsushima-ya
matsushima-ya

Format Strings

In the preceding methods, we have learned how to extract values from strings, and to manipulate strings themselves into desired formats. Another use of string methods is to manipulate string representations of values of other types. Of course, string representations can always be found using the str() function; for example:

In [32]:

pi = 3.14159
str(pi)

Out[32]:

'3.14159'

For more complicated formats, you might be tempted to use string arithmetic as outlined in Basic Python Semantics: Operators:

In [33]:

"The value of pi is " + str(pi)

Out[33]:

'The value of pi is 3.14159'

A more flexible way to do this is to use format strings, which are strings with special markers (noted by curly braces) into which string-formatted values will be inserted. Here is a basic example:

In [34]:

"The value of pi is {}".format(pi)

Out[34]:

'The value of pi is 3.14159'

Inside the {} marker you can also include information on exactly what you would like to appear there. If you include a number, it will refer to the index of the argument to insert:

In [35]:

"""First letter: {0}. Last letter: {1}.""".format('A', 'Z')

Out[35]:

'First letter: A. Last letter: Z.'

If you include a string, it will refer to the key of any keyword argument:

In [36]:

"""First letter: {first}. Last letter: {last}.""".format(last='Z', first='A')

Out[36]:

'First letter: A. Last letter: Z.'

Finally, for numerical inputs, you can include format codes which control how the value is converted to a string. For example, to print a number as a floating point with three digits after the decimal point, you can use the following:

In [37]:

"pi = {0:.3f}".format(pi)

Out[37]:

'pi = 3.142'

As before, here the "0" refers to the index of the value to be inserted. The ":" marks that format codes will follow. The ".3f" encodes the desired precision: three digits beyond the decimal point, floating-point format.

This style of format specification is very flexible, and the examples here barely scratch the surface of the formatting options available. For more information on the syntax of these format strings, see the Format Specification section of Python's online documentation.
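As a small taste of those options, the sketch below shows a few other commonly used format codes. These examples are additions for illustration, not taken from the original text; the last line uses an f-string, which accepts the same format codes and is available in Python 3.6 and later.

```python
pi = 3.14159

# Minimum field width of 10, two digits after the decimal point (right-aligned)
padded = "{:10.2f}".format(pi)

# Zero-padded to a total width of 8 characters
zero_padded = "{:08.3f}".format(pi)

# "<" left-aligns the value within a field of 8 characters
left_aligned = "{:<8}|".format("ab")

# "," inserts a thousands separator
grouped = "{:,}".format(1234567)

# f-strings embed the expression directly and reuse the same format codes
inline = f"pi = {pi:.3f}"

print(repr(padded), repr(zero_padded), repr(left_aligned), grouped, inline)
```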

Flexible Pattern Matching with Regular Expressions

The methods of Python's str type give you a powerful set of tools for formatting, splitting, and manipulating string data. But even more powerful tools are available in Python's built-in regular expression module. Regular expressions are a huge topic; there are entire books written on the topic (including Jeffrey E.F. Friedl’s Mastering Regular Expressions, 3rd Edition), so it will be hard to do justice within just a single subsection.

My goal here is to give you an idea of the types of problems that might be addressed using regular expressions, as well as a basic idea of how to use them in Python. I'll suggest some references for learning more in Further Resources on Regular Expressions.

Fundamentally, regular expressions are a means of flexible pattern matching in strings. If you frequently use the command-line, you are probably familiar with this type of flexible matching with the "*" character, which acts as a wildcard. For example, we can list all the IPython notebooks (i.e., files with extension .ipynb) with "Python" in their filename by using the "*" wildcard to match any characters in between:

In [38]:

!ls *Python*.ipynb

01-How-to-Run-Python-Code.ipynb 02-Basic-Python-Syntax.ipynb

Regular expressions generalize this "wildcard" idea to a wide range of flexible string-matching syntaxes. The Python interface to regular expressions is contained in the built-in re module; as a simple example, let's use it to duplicate the functionality of the string split() method:

In [39]:

import re
regex = re.compile(r'\s+')
regex.split(line)

Out[39]:

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog']

Here we've first compiled a regular expression, then used it to split a string. Just as Python's split() method returns a list of all substrings between whitespace, the regular expression split() method returns a list of all substrings between matches to the input pattern.

In this case, the input is "\s+": "\s" is a special character that matches any whitespace (space, tab, newline, etc.), and the "+" is a character that indicates one or more of the entity preceding it. Thus, the regular expression matches any substring consisting of one or more whitespace characters.

The split() method here is basically a convenience routine built upon this pattern matching behavior; more fundamental is the match() method, which will tell you whether the beginning of a string matches the pattern:

In [40]:

for s in ["     ", "abc  ", "  abc"]:
    if regex.match(s):
        print(repr(s), "matches")
    else:
        print(repr(s), "does not match")

'     ' matches
'abc  ' does not match
'  abc' matches

Like split(), there are similar convenience routines to find the first match (like str.index() or str.find()) or to find and replace (like str.replace()). We'll again use the line from before:

In [41]:

line = 'the quick brown fox jumped over a lazy dog'

With this, we can see that the regex.search() method operates a lot like str.index() or str.find():

In [42]:

line.index('fox')

Out[42]:

16

In [43]:

regex = re.compile('fox')
match = regex.search(line)
match.start()

Out[43]:

16

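Similarly, the regex.sub() method operates much like str.replace():

In [44]:

line.replace('fox', 'BEAR')

Out[44]:

'the quick brown BEAR jumped over a lazy dog'

In [45]:

regex.sub('BEAR', line)

Out[45]:

'the quick brown BEAR jumped over a lazy dog'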

With a bit of thought, other native string operations can also be cast as regular expressions.
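As one possible illustration (a sketch added here, not taken from the original text), startswith(), endswith(), and strip() can each be mirrored with a pattern. The anchors "\A" and "\Z" match only at the very start and very end of the string.

```python
import re

line = 'the quick brown fox jumped over a lazy dog'

# startswith(): re.match only ever matches at the beginning of the string
starts_same = line.startswith('the') == bool(re.match(r'the', line))

# endswith(): anchor the pattern to the end of the string with \Z
ends_same = line.endswith('dog') == bool(re.search(r'dog\Z', line))

# strip(): delete whitespace runs that touch either end of the string
padded = '   this is the content   '
stripped = re.sub(r'\A\s+|\s+\Z', '', padded)
strip_same = stripped == padded.strip()

print(starts_same, ends_same, strip_same)
```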

A more sophisticated example

But, you might ask, why would you want to use the more complicated and verbose syntax of regular expressions rather than the more intuitive and simple string methods? The advantage is that regular expressions offer far more flexibility.

Here we'll consider a more complicated example: the common task of matching email addresses. I'll start by simply writing a (somewhat indecipherable) regular expression, and then walk through what is going on. Here it goes:

In [46]:

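email = re.compile(r'\w+@\w+\.[a-z]{3}')

Using this, if we're given a line from a document, we can quickly extract things that look like email addresses:

In [47]:

text = "To email Guido, try guido@python.org or the older address guido@google.com."
email.findall(text)

Out[47]:

['guido@python.org', 'guido@google.com']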

(Note that these addresses are entirely made up; there are probably better ways to get in touch with Guido).

We can do further operations, like replacing these email addresses with another string, perhaps to hide addresses in the output:

In [48]:

email.sub('--@--.--', text)

Out[48]:

'To email Guido, try --@--.-- or the older address --@--.--.'

Finally, note that if you really want to match any email address, the preceding regular expression is far too simple. For example, it only allows addresses made of alphanumeric characters that end in a three-letter lower-case suffix. So, for example, the period used here means that we only find part of the address:

In [49]:

email.findall('barack.obama@whitehouse.gov')

Out[49]:

['obama@whitehouse.gov']

This goes to show how unforgiving regular expressions can be if you're not careful! If you search around online, you can find some suggestions for regular expressions that will match all valid emails, but beware: they are much more involved than the simple expression used here!

Basics of regular expression syntax

The syntax of regular expressions is much too large a topic for this short section. Still, a bit of familiarity can go a long way: I will walk through some of the basic constructs here, and then list some more complete resources from which you can learn more. My hope is that the following quick primer will enable you to use these resources effectively.

Simple strings are matched directly

If you build a regular expression on a simple string of characters or digits, it will match that exact string:

In [50]:

regex = re.compile('ion')
regex.findall('Great Expectations')

Out[50]:

['ion']

Some characters have special meanings

While simple letters or numbers are direct matches, there are a handful of characters that have special meanings within regular expressions. They are:

. ^ $ * + ? { } [ ] \ | ( )

We will discuss the meaning of some of these momentarily. In the meantime, you should know that if you'd like to match any of these characters directly, you can escape them with a backslash:

In [51]:

regex = re.compile(r'\$')
regex.findall("the cost is $20")

Out[51]:

['$']

The r preface in r'\$' indicates a raw string; in standard Python strings, the backslash is used to indicate special characters. For example, a tab is indicated by "\t":

In [52]:

print('a\tb\tc')

a    b    c

Such substitutions are not made in a raw string:

In [53]:

print(r'a\tb\tc')

a\tb\tc

For this reason, whenever you use backslashes in a regular expression, it is good practice to use a raw string.

Special characters can match character groups

Just as the "\" character within regular expressions can escape special characters, turning them into normal characters, it can also be used to give normal characters special meaning. These special characters match specified groups of characters, and we've seen them before. In the email address regexp from before, we used the character "\w", which is a special marker matching any alphanumeric character. Similarly, in the simple split() example, we also saw "\s", a special marker indicating any whitespace character.

Putting these together, we can create a regular expression that will match any two letters/digits with whitespace between them:

In [54]:

regex = re.compile(r'\w\s\w')
regex.findall('the fox is 9 years old')

Out[54]:

['e f', 'x i', 's 9', 's o']

This example begins to hint at the power and flexibility of regular expressions.

The following table lists a few of these characters that are commonly useful:

Character   Description                    Character   Description
"\d"        Match any digit                "\D"        Match any non-digit
"\s"        Match any whitespace           "\S"        Match any non-whitespace
"\w"        Match any alphanumeric char    "\W"        Match any non-alphanumeric char

This is not a comprehensive list or description; for more details, see Python's regular expression syntax documentation.
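The short sketch below exercises a few of these character groups. The input strings are invented for illustration and do not come from the original text.

```python
import re

text = "Order 66 shipped on 2024-01-15"  # made-up sample text

# \d+ finds each run of one or more digits
digit_runs = re.findall(r'\d+', text)

# \S+ finds each run of non-whitespace characters, much like str.split()
tokens = re.findall(r'\S+', text)

# \W matches any single non-alphanumeric character
cleaned = re.sub(r'\W', '_', 'a-b c')

print(digit_runs)
print(tokens)
print(cleaned)
```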

Square brackets match custom character groups

If the built-in character groups aren't specific enough for you, you can use square brackets to specify any set of characters you're interested in. For example, the following will match any lower-case vowel:

In [55]:

regex = re.compile('[aeiou]')
regex.split('consequential')

Out[55]:

['c', 'ns', 'q', '', 'nt', '', 'l']

Similarly, you can use a dash to specify a range: for example, "[a-z]" will match any lower-case letter, and "[1-3]" will match any of "1", "2", or "3". For instance, you may need to extract from a document specific numerical codes that consist of a capital letter followed by a digit. You could do this as follows:

In [56]:

regex = re.compile('[A-Z][0-9]')
regex.findall('1043879, G2, H6')

Out[56]:

['G2', 'H6']

Wildcards match repeated characters

If you would like to match a string with, say, three alphanumeric characters in a row, it is possible to write, for example, "\w\w\w". Because this is such a common need, there is a specific syntax to match repetitions: curly braces with a number:

In [57]:

regex = re.compile(r'\w{3}')
regex.findall('The quick brown fox')

Out[57]:

['The', 'qui', 'bro', 'fox']

There are also markers available to match any number of repetitions; for example, the "+" character will match one or more repetitions of what precedes it:

In [58]:

regex = re.compile(r'\w+')
regex.findall('The quick brown fox')

Out[58]:

['The', 'quick', 'brown', 'fox']

The following is a table of the repetition markers available for use in regular expressions:

Character   Description                                      Example
?           Match zero or one repetitions of preceding       "ab?" matches "a" or "ab"
*           Match zero or more repetitions of preceding      "ab*" matches "a", "ab", "abb", "abbb"...
+           Match one or more repetitions of preceding       "ab+" matches "ab", "abb", "abbb"... but not "a"
{n}         Match n repetitions of preceding                 "ab{2}" matches "abb"
{m,n}       Match between m and n repetitions of preceding   "ab{2,3}" matches "abb" or "abbb"
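The example column of this table can be checked directly. The sketch below is an addition for illustration; it uses re.fullmatch, which succeeds only when the entire string matches the pattern, to test each pattern against the four candidate strings.

```python
import re

candidates = ['a', 'ab', 'abb', 'abbb']
patterns = ['ab?', 'ab*', 'ab+', 'ab{2}', 'ab{2,3}']

# For each pattern, collect the candidates that it matches in full
results = {
    pattern: [s for s in candidates if re.fullmatch(pattern, s)]
    for pattern in patterns
}

for pattern, matched in results.items():
    print(pattern, '->', matched)
```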

With these basics in mind, let's return to our email address matcher:

In [59]:

email = re.compile(r'\w+@\w+\.[a-z]{3}')

We can now understand what this means: we want one or more alphanumeric characters ("\w+") followed by the at sign ("@"), followed by one or more alphanumeric characters ("\w+"), followed by a period ("\."; note the need for a backslash escape), followed by exactly three lower-case letters.

If we want to now modify this so that the Obama email address matches, we can do so using the square-bracket notation:

In [60]:

email2 = re.compile(r'[\w.]+@\w+\.[a-z]{3}')
email2.findall('barack.obama@whitehouse.gov')

Out[60]:

['barack.obama@whitehouse.gov']

We have changed "\w+" to "[\w.]+", so we will match any alphanumeric character or a period. With this more flexible expression, we can match a wider range of email addresses (though still not all; can you identify other shortcomings of this expression?).

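Parentheses indicate groups to extract

For compound regular expressions like our email matcher, we often want to extract their components rather than the full match. This can be done using parentheses to group the results:

In [61]:

email3 = re.compile(r'([\w.]+)@(\w+)\.([a-z]{3})')

In [62]:

text = "To email Guido, try guido@python.org or the older address guido@google.com."
email3.findall(text)

Out[62]:

[('guido', 'python', 'org'), ('guido', 'google', 'com')]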

As we see, this grouping actually extracts a list of the sub-components of the email address.

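We can go a bit further and name the extracted components using the "(?P<name> )" syntax, in which case the groups can be extracted as a Python dictionary:

In [63]:

email4 = re.compile(r'(?P<user>[\w.]+)@(?P<domain>\w+)\.(?P<suffix>[a-z]{3})')
match = email4.match('guido@python.org')
match.groupdict()

Out[63]:

{'user': 'guido', 'domain': 'python', 'suffix': 'org'}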

Combining these ideas (as well as some of the powerful regexp syntax that we have not covered here) allows you to flexibly and quickly extract information from strings in Python.
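As one way of combining them, the sketch below applies a named-group pattern to a whole line of text with finditer, which yields one match object per address. The variable names and the sample sentence are chosen here for illustration, and the addresses in it are made up.

```python
import re

# Named groups: user, domain, and a three-letter lower-case suffix
email_pattern = re.compile(r'(?P<user>[\w.]+)@(?P<domain>\w+)\.(?P<suffix>[a-z]{3})')

text = "To email Guido, try guido@python.org or the older address guido@google.com."

# finditer yields a match object for every non-overlapping match in the text;
# groupdict() turns each match into a dictionary keyed by group name
records = [match.groupdict() for match in email_pattern.finditer(text)]

for record in records:
    print(record['user'], 'at', record['domain'], 'with suffix', record['suffix'])
```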

Further Resources on Regular Expressions

The above discussion is just a quick (and far from complete) treatment of this large topic. If you'd like to learn more, I recommend the following resources:

  • Python's re package Documentation: I find that I promptly forget how to use regular expressions just about every time I use them. Now that I have the basics down, I have found this page to be an incredibly valuable resource to recall what each specific character or sequence means within a regular expression.
  • Python's official regular expression HOWTO: a more narrative approach to regular expressions in Python.
  • Mastering Regular Expressions (O'Reilly, 2006) is a 500+ page book on the subject. If you want a really complete treatment of this topic, this is the resource for you.

For some examples of string manipulation and regular expressions in action at a larger scale, see Pandas: Labeled Column-oriented Data, where we look at applying these sorts of expressions across tables of string data within the Pandas package.
