Intro to BeautifulSoup

BeautifulSoup is a class in the bs4 library for Python. Developed by Leonard Richardson, it was made for parsing HTML and XML documents so that you can navigate them and collect data from them.

To install bs4, run the following command (the package is published on PyPI under the name beautifulsoup4):

pip install beautifulsoup4

(If you run into any errors, feel free to drop a question in the discussion forum.)

Suppose we have an HTML document and we want to collect all the reference links in the document. First of all, we will store the document as a string.

html_doc = '''<a href='www.example.com'></a>
<a href='www.codesdope.com'></a>
<a href='www.google.com'></a>
<a href='www.facebook.com'></a>
<a href='www.wikipedia.org'></a>
<a href='www.twitter.com'></a>
<a href='www.microsoft.com'></a>
<a href='www.github.com'></a>
<a href='www.nytimes.com'></a>
<a href='www.youtube.com'></a>
<a href='www.reddit.com'></a>
<a href='www.python.org'></a>
<a href='www.stackoverflow.com'></a>
<a href='www.amazon.com'></a>
<a href='www.linkedin.com'></a>'''

Now, pass the variable html_doc to the BeautifulSoup constructor, along with the name of the parser to use. That way, we create a soup object.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')  # An instance of the BeautifulSoup class

Now that we have this soup object, we can call the methods of the BeautifulSoup class on it. One such method is find, which takes the name of an HTML tag as a string. So, to locate the <a> tag, we can write:

soup.find('a')
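To make the return value of find concrete, here is a minimal sketch. The sample string and the variable names first_link and missing are made up for this illustration; the point is that find hands back a Tag object for the first match, and None when nothing matches.

```python
from bs4 import BeautifulSoup

# A tiny sample document, invented for this illustration.
sample = "<p>Intro</p><a href='www.example.com'>Example</a>"
soup = BeautifulSoup(sample, 'html.parser')

first_link = soup.find('a')       # a Tag object for the first <a>
print(first_link.name)            # the tag's name: a
print(first_link.get_text())      # the text inside the tag: Example

missing = soup.find('table')      # the sample has no <table>
print(missing)                    # None
```

Because find can return None, it is usually worth checking the result before calling further methods on it.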

We can access the attributes of a tag and the values in the attributes with some aptly named methods. Have a look at the following piece of code, and try to understand it:

print(soup.find('a').get('href'))

This method called .get() helps you fetch an attribute of the given tag. In this case, it fetches the href attribute of the <a> tag. We get this as output:

www.example.com
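As a small sketch of how .get() behaves when an attribute is missing, compare it with dictionary-style access on the same tag. The one-line sample document and the 'untitled' fallback value are assumptions made for this illustration.

```python
from bs4 import BeautifulSoup

# One-link sample document, invented for this illustration.
soup = BeautifulSoup("<a href='www.example.com'>Example</a>", 'html.parser')
link = soup.find('a')

href = link.get('href')                           # value of an existing attribute
title = link.get('title')                         # attribute is absent, so None
title_or_default = link.get('title', 'untitled')  # fallback value of our choosing
subscript_href = link['href']                     # dictionary-style access also works

# Dictionary-style access raises KeyError for a missing attribute.
try:
    link['title']
    raised = False
except KeyError:
    raised = True

all_attrs = link.attrs                            # every attribute as a dict
print(href, title, title_or_default, raised, all_attrs)
```

Using .get() is the safer choice when you are not sure every tag carries the attribute you are after.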

But the find method returns only the very first matching tag in the whole document. You can’t reach the second or third tag with it; here it will always give you the tag whose href is www.example.com. What could we do in order to get the href attributes of all the <a> tags?

Well, there’s a nice method called find_all.

for tag in soup.find_all('a'):
    print(tag.get('href'))

This loop instructs Python to locate every <a> tag in the document and print its href attribute. With this, you will end up getting all the links in the html_doc string.
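A few variations on find_all may be handy, shown here as a rough sketch on a made-up three-tag document. The result behaves like a list, so it can be indexed; passing href=True keeps only tags that actually have an href attribute; and limit caps the number of results.

```python
from bs4 import BeautifulSoup

# Sample document invented for this illustration; the last <a> has no href.
html = ("<a href='www.example.com'></a>"
        "<a href='www.python.org'></a>"
        "<a id='top'></a>")
soup = BeautifulSoup(html, 'html.parser')

links = soup.find_all('a')               # list-like collection of Tag objects
second = links[1].get('href')            # indexing reaches the second tag

# href=True matches only the <a> tags that carry an href attribute.
hrefs = [tag['href'] for tag in soup.find_all('a', href=True)]

first_only = soup.find_all('a', limit=1) # stop after the first match

print(len(links), second, hrefs, len(first_only))
```

Filtering with href=True avoids collecting None values from anchors that are used only as named targets.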

Full Code:

from bs4 import BeautifulSoup

html_doc = '''<a href='www.example.com'></a>
<a href='www.codesdope.com'></a>
<a href='www.google.com'></a>
<a href='www.facebook.com'></a>
<a href='www.wikipedia.org'></a>
<a href='www.twitter.com'></a>
<a href='www.microsoft.com'></a>
<a href='www.github.com'></a>
<a href='www.nytimes.com'></a>
<a href='www.youtube.com'></a>
<a href='www.reddit.com'></a>
<a href='www.python.org'></a>
<a href='www.stackoverflow.com'></a>
<a href='www.amazon.com'></a>
<a href='www.linkedin.com'></a>'''

soup = BeautifulSoup(html_doc, 'html.parser')

for tag in soup.find_all('a'):
    print(tag.get('href'))

This was just a simple illustration of how you can use BeautifulSoup. You can use it to extract thousands of phone numbers, e-mail addresses, etc. from a webpage.
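As one possible sketch of that idea, the snippet below pulls e-mail addresses out of a page by combining get_text() with Python's re module. The page content, the addresses, and the EMAIL_PATTERN regular expression are all assumptions made for this illustration; the pattern is deliberately simple and will not match every valid address.

```python
import re
from bs4 import BeautifulSoup

# Made-up page content for this illustration.
page = """<ul>
<li>Sales: sales@example.com</li>
<li>Support: <a href='mailto:support@example.com'>support@example.com</a></li>
</ul>"""

soup = BeautifulSoup(page, 'html.parser')
text = soup.get_text()                    # all visible text, with the tags stripped

# A simplistic e-mail pattern, good enough for a demonstration only.
EMAIL_PATTERN = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')

emails = sorted(set(EMAIL_PATTERN.findall(text)))  # de-duplicate and sort
print(emails)
```

BeautifulSoup handles the HTML structure here, while the regular expression does the actual matching on the extracted text.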
You might consider reading the bs4 Documentation for more information.
