BlogsDope

Intro to BeautifulSoup

June 17, 2018 | PYTHON, INTERNET, WEB SCRAPING

BeautifulSoup is a class in the bs4 library for Python. Developed by Leonard Richardson, it was made for parsing HTML and XML documents, that is, for extracting data from them.

To install bs4, run the following command (the package is published on PyPI as beautifulsoup4):

pip install beautifulsoup4

(If you run into any errors, feel free to drop a question in the discussion forum).
Suppose we have an HTML document and we want to collect all the reference links in the document. First of all, we will store the document as a string.

html_doc = '''<a href='www.example.com'></a>
<a href='www.codesdope.com'></a>
<a href='www.google.com'></a>
<a href='www.facebook.com'></a>
<a href='www.wikipedia.org'></a>
<a href='www.twitter.com'></a>
<a href='www.microsoft.com'></a>
<a href='www.github.com'></a>
<a href='www.nytimes.com'></a>
<a href='www.youtube.com'></a>
<a href='www.reddit.com'></a>
<a href='www.python.org'></a>
<a href='www.stackoverflow.com'></a>
<a href='www.amazon.com'></a>
<a href='www.linkedin.com'></a>'''

Now, pass the variable html_doc to the BeautifulSoup constructor, along with the name of the parser to use. That way, we create a soup object.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')  # An instance of the BeautifulSoup class

Now that we have this soup object, we can call the methods of the BeautifulSoup class on it. One such method is find, which takes the name of an HTML tag as a string. So, to locate the <a> tag, we can write:

soup.find('a')
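
As a rough illustration of what find hands back, the sketch below uses a small two-link document (the sample string and the variable names first and missing are made up for this example). It assumes the standard bs4 behavior that find returns a Tag object for the first match and None when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical two-link document, used only for this illustration.
sample = "<a href='www.example.com'>Example</a><a href='www.python.org'>Python</a>"
soup = BeautifulSoup(sample, 'html.parser')

first = soup.find('a')          # a Tag object for the first <a> in the document
print(first.name)               # a
print(first.get_text())         # Example

missing = soup.find('table')    # there is no <table> in the sample
print(missing)                  # None
```

Because find can return None, it is usually worth checking the result before calling further methods on it.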

We can access the attributes of a tag and the values of those attributes with some aptly named methods. Have a look at the following piece of code, and try to understand it:

print(soup.find('a').get('href'))

The .get() method fetches an attribute of the given tag. In this case, it fetches the href attribute of the <a> tag. We get this as output:

www.example.com
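
A tag can also be indexed like a dictionary. The sketch below compares the two styles on a one-link document; the sample string, the title attribute, and the 'n/a' fallback are assumptions chosen for this example, not part of the article's html_doc.

```python
from bs4 import BeautifulSoup

# Hypothetical one-link document, used only for this illustration.
sample = "<a href='www.example.com'>Example</a>"
soup = BeautifulSoup(sample, 'html.parser')
tag = soup.find('a')

print(tag.get('href'))            # www.example.com
print(tag['href'])                # same value, dictionary-style access
print(tag.get('title'))           # None, because this <a> has no title attribute
print(tag.get('title', 'n/a'))    # n/a, the fallback passed as second argument

try:
    tag['title']                  # indexing a missing attribute raises KeyError
    raised = False
except KeyError:
    raised = True
print(raised)                     # True
```

In short, .get() is the safer choice when an attribute may be absent, while indexing fails loudly.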

But the find method returns only the first matching tag in the whole document, which is why we got www.example.com in this case. You can’t reach the second or third tag using find. What could we do in order to get the href attributes of all the <a> tags?

Well, there’s a nice method called find_all.

for tag in soup.find_all('a'):
    print(tag.get('href'))

This loop instructs Python to locate every <a> tag in the document and print its href attribute. With this, you will end up getting all the links in the html_doc string.
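
If you would rather keep the links than print them, the same loop can be written as a list comprehension. This is a minimal sketch on a made-up three-tag document (one <a> deliberately has no href); it assumes the documented find_all options href=True, which keeps only tags that carry that attribute, and limit, which caps the number of results.

```python
from bs4 import BeautifulSoup

# Hypothetical document: the middle <a> has no href on purpose.
sample = """<a href='www.example.com'>one</a>
<a>no link here</a>
<a href='www.python.org'>two</a>"""
soup = BeautifulSoup(sample, 'html.parser')

# href=True keeps only the <a> tags that actually have an href attribute.
links = [tag['href'] for tag in soup.find_all('a', href=True)]
print(links)             # ['www.example.com', 'www.python.org']

# limit stops the search after the given number of matches.
first_only = soup.find_all('a', limit=1)
print(len(first_only))   # 1
```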

Full Code:

from bs4 import BeautifulSoup

html_doc = '''<a href='www.example.com'></a>
<a href='www.codesdope.com'></a>
<a href='www.google.com'></a>
<a href='www.facebook.com'></a>
<a href='www.wikipedia.org'></a>
<a href='www.twitter.com'></a>
<a href='www.microsoft.com'></a>
<a href='www.github.com'></a>
<a href='www.nytimes.com'></a>
<a href='www.youtube.com'></a>
<a href='www.reddit.com'></a>
<a href='www.python.org'></a>
<a href='www.stackoverflow.com'></a>
<a href='www.amazon.com'></a>
<a href='www.linkedin.com'></a>'''

soup = BeautifulSoup(html_doc, 'html.parser')

for tag in soup.find_all('a'):
    print(tag.get('href'))

This was just a simple illustration of how you can use BeautifulSoup. The same approach can extract thousands of phone numbers, e-mail addresses, etc. from a webpage.
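
As one possible way to pull e-mail addresses out of a page, the sketch below combines BeautifulSoup's get_text with a regular expression. The sample HTML, the addresses in it, and the EMAIL_RE pattern are all assumptions made for illustration; the pattern is deliberately simple and will not match every valid address.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page fragment with two made-up addresses.
sample = """<p>Contact: alice@example.com</p>
<p>Support: <a href='mailto:help@example.org'>help@example.org</a></p>"""
soup = BeautifulSoup(sample, 'html.parser')

# Simplified pattern (an assumption, not a full e-mail validator).
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')

text = soup.get_text()                          # visible text only, tags stripped
emails = sorted(set(EMAIL_RE.findall(text)))    # de-duplicate and sort
print(emails)   # ['alice@example.com', 'help@example.org']
```

Searching the extracted text rather than the raw HTML keeps markup such as the mailto: prefix out of the results.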
You might consider reading the bs4 Documentation for more information.

