How to scrape data from the web using Python

Data Mining | Tutorials
Published July 31, 2018

Can you guess a simple way you can get data from a web page? It’s through a technique called web scraping.

In case you are not familiar with web scraping, here is an explanation:

“Web scraping is a computer software technique of extracting information from websites”

“Web scraping focuses on the transformation of unstructured data on the web, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet.”

Some web pages make your life easier: they offer something called an API, an interface that you can use to download data directly. Websites like Rotten Tomatoes and Twitter provide APIs to access their data. But if a web page doesn't provide an API, you can use Python to scrape data from it.

I will be using two Python modules for scraping data.

  • urllib2
  • BeautifulSoup (bs4)

So, are you ready to scrape a webpage? All you have to do to get started is follow the steps given below:

Understanding HTML Basics

Scraping is all about HTML tags, so you need to understand HTML in order to scrape data.

This is an example of a minimal webpage defined in HTML tags. The root tag is <html>, and inside it you have the <head> tag. The <head> includes the title of the page and might also have other meta information like keywords. The <body> tag includes the actual content of the page. <h1>, <h2>, <h3>, <h4>, <h5> and <h6> are different header levels.
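
As a rough sketch of the structure just described (this markup is my own illustration, not the article's original figure), a minimal page could look like this, written as a Python string so it can later be fed to BeautifulSoup:

minimal_page = """
<html>
  <head>
    <title>My first page</title>
    <meta name="keywords" content="data science, web scraping">
  </head>
  <body>
    <h1>Top-level heading</h1>
    <h2>A sub-heading</h2>
    <p>The actual content of the page goes inside the body tag.</p>
  </body>
</html>
"""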


Some other useful HTML tags you should know include <p> for paragraphs, <a> for links, <div> for generic sections, <img> for images, <ul> and <li> for lists, and <table> for tabular data.

I encourage you to inspect a web page and view its source code to understand more about HTML.

Scraping A Web Page Using Beautiful Soup

I will be scraping data from bigdataexaminer.com. I am importing urllib2, Beautiful Soup (bs4), pandas and NumPy.

import urllib2
import bs4
import pandas as pd
import numpy as np


What beautiful = urllib2.urlopen(url).read() does is go to bigdataexaminer.com, download the whole HTML text, and store it in a variable called beautiful.
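
In code, that step looks something like this (the exact URL string is my reconstruction, since the original screenshot is gone):

url = "http://www.bigdataexaminer.com"
beautiful = urllib2.urlopen(url).read()   # the raw HTML of the page, as one big string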

Now I have to parse and clean the HTML code. BeautifulSoup is a really useful Python module for parsing HTML and XML files. Beautiful Soup gives a BeautifulSoup object, which represents the document as a nested data structure.
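
A minimal sketch of creating that object for the rest of this article (passing "html.parser" explicitly is my choice; bs4 can also use other parsers such as lxml):

soup = bs4.BeautifulSoup(beautiful, "html.parser")   # parse the downloaded HTML into a nested tree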

Prettify

You can use the prettify() function to print the HTML with indentation that shows the different nesting levels of the code.
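
For example:

print(soup.prettify())   # prints the parse tree, one tag per line, indented by nesting depth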


The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the <h1> tag, just say soup.h1.prettify():
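
A quick sketch (note that soup.h1 is None if the page has no <h1> tag at all):

print(soup.h1.prettify())   # the first <h1> tag on the page, nicely indented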


Contents

soup.tag.contents will return the contents of a tag as a list.

In [18]: soup.head.contents


The following returns the <title> tag inside the <head> tag.

In [45]: x = soup.head.title

Out[45]: <title></title>

.string will return the string inside the title tag of Big Data Examiner. Since bigdataexaminer.com doesn't have a title, the value returned is None.
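
A short sketch of that:

x = soup.head.title
print(x.string)   # prints None here, because the <title> tag on this page is empty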


Descendants

.descendants lets you iterate over all of a tag's children, recursively.
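
A minimal sketch:

for child in soup.head.descendants:
    print(child)   # every tag and string nested anywhere under <head>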


You can also look at the strings using the .strings generator.
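
For example:

for s in soup.strings:
    print(repr(s))   # every piece of text in the document, including whitespace-only strings

bs4 also provides .stripped_strings, which skips the whitespace-only entries.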


In [56]: soup.get_text()

extracts all the text from bigdataexaminer.com.

find_all

You can use find_all() to find all the <a> tags on the page.
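
For instance:

links = soup.find_all("a")   # a list of every <a> tag on the page
print(len(links))            # how many links the page contains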


To get just the first four <a> tags, you can use the limit argument.
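
Roughly:

soup.find_all("a", limit=4)   # returns only the first four <a> tags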


To find particular text on a web page, you can use the text argument along with find_all(). Here I am searching for the term 'data' on Big Data Examiner.
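
Note that text= matches whole strings exactly; to match 'data' anywhere inside a string, you can pass a compiled regular expression instead:

soup.find_all(text="data")                # strings that are exactly "data"

import re
soup.find_all(text=re.compile("data"))    # strings that merely contain "data"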


Get me the attributes of the second <a> tag on Big Data Examiner.
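
One way to write that (the original screenshot is gone; .attrs returns a tag's attributes as a dictionary):

soup.find_all("a")[1].attrs   # attributes of the second <a> tag, e.g. {'href': '...'}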


You can also use a list comprehension to get the attributes of the first four <a> tags on Big Data Examiner.
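
Something like:

[a.attrs for a in soup.find_all("a", limit=4)]   # attribute dictionaries of the first four <a> tags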


Conclusion

A data scientist should know how to scrape data from websites, and I hope you have found this article useful as an introduction to web scraping with Python. Apart from Beautiful Soup, there is another useful Python library called Pattern for web scraping. I also found a good tutorial on web scraping using Python.

Instead of taking the difficult path of web scraping with an in-house setup built by you from scratch, you could always safely trust PromptCloud's web scraping service to take end-to-end ownership of your project.

Web scraping is not all about "coding" per se; you need to be adept at coding, internet protocols, data warehousing, service requests, data cleansing, converting unstructured data to structured data, and even some machine learning nowadays.