Beginner's Guide to Web Scraping in Python (using BeautifulSoup)
The need and importance of extracting data from the web is becoming increasingly loud and clear. Every few weeks, I find myself in a situation where we need to extract data from the web. For example, last week we were thinking of creating an index of hotness and sentiment about various data science courses available on the internet. This would not only require finding new courses, but also scraping the web for their reviews and then summarizing them in a few metrics! This is one of those problems / products whose efficacy depends more on web scraping and information extraction (data collection) than on the techniques used to summarize the data.
Ways to extract information from the web
There are several ways to extract information from the web. Using APIs is probably the best way to extract data from a website. Almost all large websites like Twitter, Facebook, Google, and StackOverflow provide APIs to access their data in a more structured manner. If you can get what you need through an API, it is almost always the preferred approach over web scraping. This is because if you are getting access to structured data from the provider, why would you want to create an engine to extract the same information?
Sadly, not all websites provide an API. Some withhold one because they do not want readers to extract large amounts of information in a structured way, while others don't provide APIs due to lack of technical knowledge. What do you do in these cases? Well, we need to scrape the website to fetch the information.
There might be a few other ways, like RSS feeds, but they are limited in their use and hence I am not including them in the discussion here.
What is Web Scraping?
Web scraping is a software technique for extracting information from websites. This technique mostly focuses on the transformation of unstructured data (HTML format) on the web into structured data (a database or spreadsheet).
You can perform web scraping in various ways, from using Google Docs to almost any programming language. I resort to Python because of its ease of use and rich ecosystem. It has a library known as 'BeautifulSoup' which assists with this task. In this article, I'll show you the easiest way to learn web scraping using Python programming.
For those of you who need a non-programming way to extract information out of web pages, you can also look at import.io. It provides a GUI-driven interface to perform all basic web scraping operations. The hackers can proceed to read this article!
Libraries required for web scraping
As we know, Python is an open source programming language, and you may find many libraries that perform the same function. Hence, it is necessary to find the best library to use. I prefer BeautifulSoup (a Python library), since it is easy and intuitive to work with. Precisely, I'll use two Python modules for scraping data:
- Urllib2: a Python module which can be used for fetching URLs. It defines functions and classes to help with URL actions (basic and digest authentication, redirections, cookies, etc.). For more detail, refer to the documentation page.
- BeautifulSoup: an incredible tool for pulling information out of a webpage. You can use it to extract tables, lists, and paragraphs, and you can also put filters in place to extract information from web pages. In this article, we will use the latest version, BeautifulSoup 4. You can look at the installation instructions in its documentation page.
BeautifulSoup does not fetch the web page for us. That's why I use urllib2 in combination with the BeautifulSoup library.
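To make this division of labor concrete, here is a minimal sketch. The fetching step is shown in a comment (urllib2 is Python 2; its Python 3 counterpart is urllib.request), and a tiny made-up inline page stands in for the fetched document so the sketch runs without network access:

```python
from bs4 import BeautifulSoup

# In this article's Python 2 setup, the page would be fetched with:
#   import urllib2
#   page = urllib2.urlopen("http://www.test.com").read()
# (In Python 3, the same function lives in urllib.request.)

# A tiny inline page stands in for a fetched document here.
page = "<html><body><p>Hello, scraping!</p></body></html>"

# BeautifulSoup turns the raw HTML into a navigable parse tree.
soup = BeautifulSoup(page, "html.parser")
print(soup.p.string)  # -> Hello, scraping!
```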
Python has several other options for HTML scraping in addition to BeautifulSoup.
Basics – Get familiar with HTML (Tags)
While performing web scraping, we deal with HTML tags, so we must have a good understanding of them. If you already know the basics of HTML, you can skip this section.
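As a stand-in for the HTML example this section refers to, a minimal page skeleton looks like this (standard HTML, with placeholder content):

```html
<!DOCTYPE html>
<html>
  <head>
    <title>Page title shown in the browser tab</title>
  </head>
  <body>
    <h1>A top-level heading</h1>
    <p>A paragraph of visible text.</p>
  </body>
</html>
```

The <html> tag wraps the whole document, <head> holds metadata such as the <title>, and <body> holds the visible content, such as headings (<h1>) and paragraphs (<p>).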
Other useful HTML tags are:
- HTML links are defined with the <a> tag: <a href="http://www.test.com">This is a link for test.com</a>
- HTML tables are defined with the <table> tag, with rows as <tr>, and rows are divided into data cells with <td>
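To make these two bullets concrete, here is how a link and a small table look in markup (the URL and the cell values are placeholders):

```html
<a href="http://www.test.com">This is a link for test.com</a>

<table>
  <tr>                               <!-- a table row -->
    <th>State</th> <th>Capital</th>  <!-- header cells -->
  </tr>
  <tr>
    <td>Goa</td> <td>Panaji</td>     <!-- data cells -->
  </tr>
</table>
```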
If you are new to these HTML tags, I would also recommend referring to the HTML tutorial from W3Schools. This will give you a clear understanding of HTML tags.
Scraping a web page using BeautifulSoup
Here, I am scraping data from a Wikipedia page. Our final objective is to extract the list of state and union territory capitals in India, along with some basic details like establishment and former capital, from this Wikipedia page. Let's learn by doing this project step by step:
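As a first step, the page has to be fetched and parsed. The sketch below assumes the Wikipedia page "List of state and union territory capitals in India"; since the live page may change or be unreachable, the code falls back to a miniature made-up sample of the same shape when the fetch fails:

```python
from bs4 import BeautifulSoup
from urllib.request import urlopen  # urllib2's urlopen in Python 3

URL = ("https://en.wikipedia.org/wiki/"
       "List_of_state_and_union_territory_capitals_in_India")

# A made-up miniature stand-in for the page, used when the fetch fails.
FALLBACK = """
<html><body>
<table class="wikitable sortable plainrowheaders">
<tr><th>No.</th><th>State</th><th>Capital</th></tr>
<tr><th>1</th><td>Goa</td><td>Panaji</td></tr>
</table>
</body></html>
"""

try:
    html = urlopen(URL, timeout=10).read()
except Exception:
    html = FALLBACK

# Parse the raw HTML into a navigable tree.
soup = BeautifulSoup(html, "html.parser")
print(len(soup.find_all("table")))  # at least one table found
```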
- Work with HTML tags
- soup.<tag>: returns the content between the opening and closing tag, including the tag itself.
- soup.<tag>.string: returns the string within the given tag
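A quick illustration of these two accessors on a made-up snippet:

```python
from bs4 import BeautifulSoup

html = ("<html><head><title>Capitals of India</title></head>"
        "<body><p>Intro text</p></body></html>")
soup = BeautifulSoup(html, "html.parser")

# soup.<tag> returns the first matching tag, markup included.
print(soup.title)         # <title>Capitals of India</title>
# soup.<tag>.string returns only the text inside that tag.
print(soup.title.string)  # Capitals of India
```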
- Find all the links within the page's <a> tags: We know that we can mark a link using the tag <a>. So, we should go with the option soup.a, and it should return the links available in the web page. Let's do it.
Above, you can see that we have only one output. Now to extract all the links within <a> tags, we will use find_all().
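The difference between soup.a and find_all("a") can be seen on a made-up snippet with two links:

```python
from bs4 import BeautifulSoup

html = '<body><a href="/wiki/Goa">Goa</a> <a href="/wiki/Kerala">Kerala</a></body>'
soup = BeautifulSoup(html, "html.parser")

# soup.a returns only the first <a> tag on the page.
print(soup.a)

# find_all() returns every <a> tag; .get("href") reads each link target.
links = [a.get("href") for a in soup.find_all("a")]
print(links)  # ['/wiki/Goa', '/wiki/Kerala']
```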
- Find the right table: As we are seeking a table to extract information about state capitals, we should identify the right table first. Let's write the command to extract the information within all table tags.
Now to identify the right table, we will use the "class" attribute of the table and use it to filter for the right table. In Chrome, you can check the class name by right-clicking on the required table of the web page –> Inspect element –> Copy the class name, OR go through the output of the above command to find the class name of the right table.
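With the class name in hand, find() can be pointed at the right table directly. In the sketch below, the class "wikitable sortable plainrowheaders" is an assumption about what the Wikipedia page uses (verify it yourself via Inspect element), and both tables are made up:

```python
from bs4 import BeautifulSoup

html = """
<table class="navbox"><tr><td>navigation links</td></tr></table>
<table class="wikitable sortable plainrowheaders">
  <tr><th>State</th><th>Capital</th></tr>
  <tr><td>Goa</td><td>Panaji</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all("table") would return both tables; filtering on the class
# attribute picks out the one that holds the capitals.
right_table = soup.find("table", class_="wikitable sortable plainrowheaders")
print(right_table.td.string)  # Goa
```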
Above, you can notice that the 2nd element of <tr> is within the tag <th>, not <td>, so we need to take care of this. Now to access the value of each element, we will use the "find(text=True)" option with each element. Let's look at the code:
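A minimal sketch of that extraction loop, on a cut-down made-up table (the real Wikipedia rows have more columns, so the length check there would differ):

```python
from bs4 import BeautifulSoup

html = """
<table class="wikitable sortable plainrowheaders">
  <tr><th>No.</th><th>State</th><th>Capital</th></tr>
  <tr><th>1</th><td>Goa</td><td>Panaji</td></tr>
  <tr><th>2</th><td>Kerala</td><td>Thiruvananthapuram</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
right_table = soup.find("table", class_="wikitable sortable plainrowheaders")

states, capitals = [], []
for row in right_table.find_all("tr"):
    cells = row.find_all("td")
    # The header row has no <td> cells, so it is skipped; in data rows
    # the serial number sits in <th> and the rest in <td>.
    if len(cells) == 2:
        # find(text=True) returns the first text node inside the cell.
        states.append(cells[0].find(text=True))
        capitals.append(cells[1].find(text=True))

print(states)    # ['Goa', 'Kerala']
print(capitals)  # ['Panaji', 'Thiruvananthapuram']
```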
Finally, we have the data in a dataframe:
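The scraped lists can be combined into a pandas DataFrame (pandas is a separate dependency, and the column names here are my own choice; the two rows are a made-up sample, not the full page):

```python
import pandas as pd

# Sample lists of the kind produced by scraping the capitals table.
states = ["Goa", "Kerala"]
capitals = ["Panaji", "Thiruvananthapuram"]

df = pd.DataFrame({"State": states, "Capital": capitals})
print(df)
```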
Similarly, you can perform various other types of web scraping using BeautifulSoup. This will reduce your manual effort to collect data from web pages. You can also look at other attributes like .parent, .contents, .descendants, .next_sibling, and .previous_sibling, and at the various attributes for navigating by tag name. These will help you scrape web pages effectively.
But, why can’t I just use Regular Expressions?
Now, if you know regular expressions, you might be thinking that you could write code using regular expressions to do the same thing for you. I definitely had this question. In my experience with using BeautifulSoup and regular expressions to do the same thing, I found:
- Code written with BeautifulSoup is usually more robust than code written using regular expressions. Code written with regular expressions needs to be altered with any change in the pages. Even BeautifulSoup needs that in some cases; it is just that BeautifulSoup handles it relatively better.
- Regular expressions are much faster than BeautifulSoup, usually by a factor of 100, in producing the same outcome.
So, it boils down to speed vs. robustness of the code, and there is no universal winner here. If the information you are looking for can be extracted with simple regex statements, you should go ahead and use them. For almost any complex work, I usually recommend BeautifulSoup over regex.
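To make the trade-off concrete, here is the same link extraction done both ways on a made-up snippet. The regex is deliberately simple; it happens to work here, but it is tied to the exact attribute layout of the markup, which is exactly the fragility described above:

```python
import re
from bs4 import BeautifulSoup

html = '<a href="/wiki/Goa">Goa</a> <a class="ext" href="/wiki/Kerala">Kerala</a>'

# Regex: fast, but coupled to the markup's exact attribute layout.
regex_links = re.findall(r'href="([^"]+)"', html)

# BeautifulSoup: slower, but indifferent to attribute order and whitespace.
soup = BeautifulSoup(html, "html.parser")
soup_links = [a["href"] for a in soup.find_all("a")]

print(regex_links)  # ['/wiki/Goa', '/wiki/Kerala']
print(soup_links)   # ['/wiki/Goa', '/wiki/Kerala']
```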
In this article, we looked at web scraping methods using "BeautifulSoup" and "urllib2" in Python. We also looked at the basics of HTML and performed the web scraping step by step while solving a challenge. I'd recommend you to practice this and use it for collecting data from web pages.
Did you find this article helpful? Please share your opinions / thoughts in the comments section below.