This is the second article of my web scraping guide. In the first article, I showed you how you can find, extract, and clean the data from one single web page on IMDb.
In this article, you’ll learn how to scrape multiple web pages — a list that’s 20 pages and 1,000 movies total — with a Python web scraper.
Where We Left Off
In the previous article, we scraped and cleaned the data of the title, year of release, imdb_ratings, metascore, length of movie, number of votes, and the us_gross earnings of all movies on the first page of IMDb’s Top 1,000 movies.
This was the code we used:
And our results looked like this:
What We’ll Cover
I’ll be guiding you through these steps:
- You’ll request the unique URLs for every page on this IMDb list.
- You’ll iterate through each page using a for loop, and you’ll scrape each movie one by one.
- You’ll control the loop’s rate to avoid flooding the server with requests.
- You’ll extract, clean, and download this final data.
- You’ll use basic data-quality best practices.
Introducing New Tools
These are the additional tools we’ll use in our scraper:
Time to Code
As mentioned in the first article, I recommend following along in a Repl.it environment if you don’t already have an IDE.
I’ll also be writing out this guide as if we were starting fresh, minus all the first guide’s explanations, so you aren’t required to copy and paste the first article’s code beforehand.
You can compare the first article’s code with this article’s final code to see how it all worked — you’ll notice a few slight changes.
Alternatively, you can go straight to the code here.
Now, let’s begin!
Import tools
Let’s import our previous tools and our new tools: time and random.
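If you’re starting fresh, a minimal sketch of those imports could look like this (the pd and np aliases are conventional choices; your first-article code may differ slightly):

import requests                  # makes the HTTP requests (previous tool)
from bs4 import BeautifulSoup    # parses the HTML we get back (previous tool)
import pandas as pd              # stores and cleans the scraped data (previous tool)
import numpy as np               # gives us np.arange for the page numbers
from time import sleep           # new tool: pauses the loop between requests
from random import randint       # new tool: picks a random pause length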
Initialize your storage
Like previously, we’re going to continue to use our empty lists as storage for all the data we scrape:
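Something along these lines, with one empty list per data point from the first article (the list names here are illustrative, so keep whatever names you used before):

titles = []          # movie names
years = []           # release years
time_min = []        # runtimes
imdb_ratings = []    # IMDb ratings
metascores = []      # Metascores
votes = []           # vote counts
us_gross = []        # US gross earnings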
English movie titles
After we initialize our storage, we should have our code that makes sure we get English-translated titles from all the movies we scrape:
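That’s a single headers dictionary passed along with every request. The exact header value below is an assumption, but an Accept-Language header is the usual way to do it:

headers = {'Accept-Language': 'en-US, en;q=0.5'}   # ask IMDb for English-language titles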
Analyzing our URL
Let’s go to the URL of the page we’re scraping.
Now, let’s click on the next page and see what page 2’s URL looks like:
And then page 3’s URL:
What do we notice about the URL from page 2 to page 3?
We notice &start=51 is added to the URL when we go to page 2, and that number 51 turns into 101 on page 3.
This makes sense because there are 50 movies on each page: page 1 is 1-50, page 2 is 51-100, page 3 is 101-150, and so on.
Why is this important? This information will help us tell our loop how to go to the next page to scrape.
Refresher on ‘for’ loops
Just like the loop we used to loop through each movie on the first page, we’ll use a for loop to iterate through each page on the list.
To refresh, this is how a for loop works:
for <variable> in <iterable>:
    <statement(s)>
<iterable> is a collection of objects, e.g. a list or tuple. The <statement(s)> are executed once for each item in <iterable>. The loop <variable> takes on the value of the next element in <iterable> each time through the loop.
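A quick concrete example:

for number in [1, 2, 3]:
    print(number)   # the statement runs once per item: 1, then 2, then 3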
Changing the URL Parameter
As I mentioned earlier, each page’s URL follows a certain logic as the web pages change. To make the URL requests we’d have to vary the value of the page parameter, like this:
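A one-line sketch of that (note the NumPy function is np.arange):

pages = np.arange(1, 1001, 50)   # array of page-start numbers: 1, 51, 101, ... 951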
Breaking down the URL parameters:
- pages is the variable we create to store the page-parameter values for our loop to iterate through
- np.arange(1,1001,50) is a function in the NumPy Python library, and it takes four arguments, but we’re only using the first three: start, stop, and step. step is the number that defines the spacing between values. So: start at 1, stop at 1001, and step by 50.
Start at 1: This will be our first page’s URL.
Stop at 1001: Why stop at 1001? The number in the stop parameter defines the end of the array, but it isn’t included in the array. The last page of movies starts at URL number 951; that page has movies 951-1,000. If we used 951, that page wouldn’t be included in our scraper, so we have to go one page further to make sure we get the last page.
Step by 50: We want the URL number to change by 50 each time the loop comes around, and this parameter tells it to do that.
Looping Through Each Page
Now we need to create another for loop that’ll run our scraper through the pages array we created above, which gives us each different URL we need. We can do this simply like this:
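In its simplest form (the body gets filled in over the next few steps):

for page in pages:
    # on each pass, page takes the next start number: 1, 51, 101, ... 951
    ...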
Breaking this loop down:
- page is the variable that’ll iterate through our pages array
- pages is the array we created with np.arange(1,1001,50)
Requesting the URL + ‘html_soup’ + ‘movie_div’
Inside this new loop is where we’ll request our new URLs, add our html_soup
(helps us parse the HTML files), and add our movie_div
(stores each div container we’re scraping). This is what it’ll look like:
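Here’s a sketch consistent with the breakdown below. Note that it reuses the loop variable page to hold the response, which is how the breakdown describes it (you could use a separate name such as response instead):

for page in pages:
    page = requests.get('https://www.imdb.com/search/title/?groups=top_1000&start='
                        + str(page) + '&ref_=adv_nxt', headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')
    movie_div = soup.find_all('div', class_='lister-item mode-advanced')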
Breaking page down:
- page is the variable we’re using, and here it stores the response for each of our new URLs
- requests.get() is the method we use to grab the contents of each URL
- "https://www.imdb.com/search/title/?groups=top_1000&start=" is the part of the URL that stays the same when we change each page
- + str(page) tells the request to add each iteration of page (the number from our pages array that changes the page number of the URL) into the URL request. It also makes sure we’re using a string, not an integer or a float, because it’s a URL we’re building.
- + "&ref_=adv_nxt" is added to the end of every URL because this part also doesn’t change when we go to the next page
- headers=headers tells our scraper to bring us English-translated content from the URLs we’re requesting
Breaking soup down:
- soup is the variable we create to assign the BeautifulSoup method to
- BeautifulSoup is the method we’re using that specifies a desired format of results
- (page.text, "html.parser") grabs the text contents of page and uses the HTML parser; this allows Python to read the components of the page rather than treating it as one long string
Breaking movie_div down:
- movie_div is the variable we use to store all of the div containers with a class of lister-item mode-advanced
- the find_all() method extracts all the div containers that have a class attribute of lister-item mode-advanced from what we’ve stored in our variable soup
Controlling the Crawl Rate
Controlling the crawl rate is beneficial for the scraper and for the website we’re scraping. If we avoid hammering the server with a lot of requests all at once, then we’re much less likely to get our IP address banned — and we also avoid disrupting the activity of the website we scrape by allowing the server to respond to other user requests as well.
We’ll be adding this code to our new for loop:
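It’s one line, placed inside the page loop (shown here without the surrounding loop):

sleep(randint(2, 10))   # inside the page loop: pause for a random 2 to 10 seconds before the next request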
Breaking crawl rate down:
- The sleep() function will control the loop’s rate by pausing the execution of the loop for a specified amount of time
- The randint(2,10) function will vary the waiting time between requests, picking a random number between 2 and 10 seconds. You can change these parameters to any values you like.
Please note that this will delay the time it takes to grab all the data we need from every page, so be patient. There are 20 pages with a max of 10 seconds of sleep per loop, so it could take up to roughly 3.5 minutes to get all of the data with this code.
It’s very important to practice good scraping and to scrape responsibly!
Our code should now look like this:
Scraping Code
We can add our scraping for loop code into our new for loop:
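Here’s a sketch of what that inner loop can look like. The tag and class names used below (h3 and a for the title, span.lister-item-year, span.runtime, the strong tag for the rating, span.metascore, and the name='nv' spans for votes and gross) are assumptions about IMDb’s markup rather than the first article’s exact code, so adjust them to match your own scraper:

# this goes inside the for page in pages: loop, after movie_div is collected
for container in movie_div:
    titles.append(container.h3.a.text)                                       # movie name
    years.append(container.h3.find('span', class_='lister-item-year').text)  # release year
    runtime = container.find('span', class_='runtime')
    time_min.append(runtime.text if runtime else '-')
    imdb_ratings.append(float(container.strong.text))                        # assumes a rating is always present
    m_score = container.find('span', class_='metascore')
    metascores.append(m_score.text if m_score else '-')                      # dash when missing
    nv = container.find_all('span', attrs={'name': 'nv'})
    votes.append(nv[0].text)
    us_gross.append(nv[1].text if len(nv) > 1 else '-')                      # dash when missing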
Pointing Out Previous Errors
I’d like to point out a slight error I made in the previous article regarding the cleaning of the metascore data.
I received this DM from an awesome dev who was running through my article and coding along but with a different IMDb URL than the one I used to teach in the guide.
In the code that extracts the metascore data, we wrote this:
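It was along these lines (a sketch; the exact variable names in the first article may differ):

    m_score = container.find('span', class_='metascore')
    metascores.append(m_score.text if m_score else '-')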
This extraction code says if there is Metascore data there, grab it — but if the data is missing, then put a dash there and continue.
In the code that cleans the metascore data, we wrote this:
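Roughly this (a sketch):

movies['metascore'] = movies['metascore'].astype(int)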
This cleaning code says to turn this pandas object into an integer data type, which worked for the URL I scraped because it didn’t have any missing Metascore data, i.e., no dashes in place of missing data.
What I failed to notice is that if someone scraped a different IMDb page than I did, they’d possibly have missing metascore data there, and once we scrape multiple pages in this guide, we’ll have missing metascore data as well.
What does this mean?
It means that when we do get those dashes in place of missing data, we can’t use .astype(int) to convert the entire metascore column into integers like I did previously; this would produce an error. We’d need to turn our metascore data into a float data type (decimal) instead.
Fixing the Cleaning of the Metascore Data Code
Instead of this metascore data cleaning code:
We’ll use this:
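Reconstructed from the breakdown that follows, the replacement looks like this:

movies['metascore'] = movies['metascore'].str.extract('(\d+)')
movies['metascore'] = pd.to_numeric(movies['metascore'], errors='coerce')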
Breaking down the new cleaning of the Metascore data:
Top-cleaning code:
- movies['metascore'] is our Metascore data in our movies DataFrame. We’ll be assigning our new cleaned-up data to our metascore column.
- movies['metascore'] tells pandas to go to the column metascore in our DataFrame
- .str.extract('(\d+)') is the method that says to extract all the digits in the string
Bottom-conversion code:
- movies['metascore'] is now stripped of the elements we don’t need, and we’ll assign the conversion code to it to finish it up
- pd.to_numeric is a method we use to change this column to a float. The reason we use it is that we have a lot of dashes in this column, and we can’t just convert them to a float using .astype(float); that would produce an error.
- errors='coerce' will transform the nonnumeric values (our dashes) into not-a-number (NaN) values, because we have dashes in place of the data that’s missing.
Add the DataFrame and Cleaning Code
Let’s add our DataFrame
and cleaning code to our new scraper, which will go below our loops. If you have any questions regarding how this code works, go to the first article to see what each line executes.
The code should look like this:
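Here’s a sketch of that block. Apart from the metascore lines, the column names (movie, year, timeMin, imdb, votes, us_grossMillions) and the cleaning steps are assumptions based on what the first article describes, so adjust them to match your own code:

movies = pd.DataFrame({
    'movie': titles,
    'year': years,
    'timeMin': time_min,
    'imdb': imdb_ratings,
    'metascore': metascores,
    'votes': votes,
    'us_grossMillions': us_gross,
})

# cleaning
movies['year'] = movies['year'].str.extract('(\d+)').astype(int)   # assumes every title has a year
movies['timeMin'] = movies['timeMin'].str.extract('(\d+)')
movies['timeMin'] = pd.to_numeric(movies['timeMin'], errors='coerce')
movies['metascore'] = movies['metascore'].str.extract('(\d+)')
movies['metascore'] = pd.to_numeric(movies['metascore'], errors='coerce')
movies['votes'] = movies['votes'].str.replace(',', '').astype(int)
movies['us_grossMillions'] = movies['us_grossMillions'].map(lambda x: x.lstrip('$').rstrip('M'))
movies['us_grossMillions'] = pd.to_numeric(movies['us_grossMillions'], errors='coerce')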
Save to CSV
We have all the elements of our scraper ready — now it’s time to save all the data we’re about to scrape into our CSV.
Below is the code you can add to the bottom of your program to save your data to a CSV file:
In case you need a refresher, if you’re in Repl.it, you can create an empty CSV file by hovering near “Files” and clicking the “Add file” option. Name it, and save it with a .csv
extension. Then, add the code to the end of your program:
movies.to_csv('the_name_of_your_csv_here.csv')
If we run and save our .csv
, we should get a file with a list of movies and all the data from 0-999:
Basic Data-Quality Best Practices (Optional)
Here, I’ll discuss some basic data-quality tricks you can use when cleaning your data. You don’t need to apply any of this to our final scraper.
Usually, a dataset with a lot of missing data isn’t a good dataset at all. Below are ways we can look up, manipulate, and change our data — for future reference.
Missing data
One of the most common problems in a dataset is missing data. In our case, the data wasn’t available. There are a couple of ways to check and deal with missing data:
- Check where we’re missing data and how much is missing
- Add in a default value for the missing data
- Delete the rows that have missing data
- Delete the columns that have a high incidence of missing data
We’ll go through each of these in turn.
Check missing data:
We can easily check for missing data like this:
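A standard pandas one-liner for this (a sketch; the original line may have differed slightly):

movies.isnull().sum()   # count of missing values in each column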
The output:
This shows us where the data is missing and how much data is missing. We have 165 missing values in metascore and 161 missing in us_grossMillions, for a total of 326 missing values in our dataset.
Add default value for missing data:
If you wanted to change your NaN values to something else specific, you can do so like this:
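For instance, with pandas’ fillna (a sketch using the column names from this guide):

movies['metascore'] = movies['metascore'].fillna("None Given")   # label missing Metascores
movies['us_grossMillions'] = movies['us_grossMillions'].fillna("")   # blank out missing gross figures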
For this example, I want the words "None Given" in place of metascore NaN values and empty quotes (nothing) in place of us_grossMillions NaN values.
If you print those columns, you can see our NaN values have been changed as specified:
Beware: Our metascore and us_grossMillions columns were numeric (floats) prior to this change, and you can see how they’re both objects now because of the change. Be careful when changing your data, and always check to see what your data types are when making any alterations.
Delete rows with missing data:
Sometimes the best route to take when you have a lot of missing data is to remove those rows altogether. We can do this a couple of different ways:
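Two common options with pandas’ dropna (a sketch; these return a new DataFrame unless you assign the result or pass inplace=True):

movies.dropna()            # drop any row that has at least one missing value
movies.dropna(how='all')   # drop only rows where every value is missing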
Delete columns with missing data:
Sometimes when we have too many missing values in a column, it’s best to get rid of that column entirely. We can do so like this:
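Along these lines:

movies.dropna(axis=1, how='any')   # drop any column that contains at least one missing value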
- axis=1 is the parameter we use; it means to operate on columns, not rows (axis=0 means rows). We could’ve used this parameter in the delete-rows section above, but the default is already 0, so I didn’t.
- how='any' means that if any NA values are present, drop that column.
The Final Code
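Putting the pieces above together, here’s a consolidated sketch of the whole scraper. The HTML tag and class selectors, some column names (movie, year, timeMin, imdb, votes, us_grossMillions), and the cleaning steps for columns other than metascore are assumptions based on the first article’s description, so treat this as a sketch rather than the exact original code:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from time import sleep
from random import randint

headers = {'Accept-Language': 'en-US, en;q=0.5'}   # ask IMDb for English-language titles

# storage for every data point we scrape
titles, years, time_min, imdb_ratings = [], [], [], []
metascores, votes, us_gross = [], [], []

pages = np.arange(1, 1001, 50)   # 1, 51, 101, ... 951

for page in pages:
    # request each page of the list, reusing the loop variable for the response
    page = requests.get('https://www.imdb.com/search/title/?groups=top_1000&start='
                        + str(page) + '&ref_=adv_nxt', headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')
    movie_div = soup.find_all('div', class_='lister-item mode-advanced')
    sleep(randint(2, 10))   # control the crawl rate

    for container in movie_div:
        # selectors below are assumptions about IMDb's markup
        titles.append(container.h3.a.text)
        years.append(container.h3.find('span', class_='lister-item-year').text)
        runtime = container.find('span', class_='runtime')
        time_min.append(runtime.text if runtime else '-')
        imdb_ratings.append(float(container.strong.text))   # assumes a rating is always present
        m_score = container.find('span', class_='metascore')
        metascores.append(m_score.text if m_score else '-')
        nv = container.find_all('span', attrs={'name': 'nv'})
        votes.append(nv[0].text)
        us_gross.append(nv[1].text if len(nv) > 1 else '-')

movies = pd.DataFrame({
    'movie': titles, 'year': years, 'timeMin': time_min, 'imdb': imdb_ratings,
    'metascore': metascores, 'votes': votes, 'us_grossMillions': us_gross,
})

# cleaning
movies['year'] = movies['year'].str.extract('(\d+)').astype(int)   # assumes every title has a year
movies['timeMin'] = movies['timeMin'].str.extract('(\d+)')
movies['timeMin'] = pd.to_numeric(movies['timeMin'], errors='coerce')
movies['metascore'] = movies['metascore'].str.extract('(\d+)')
movies['metascore'] = pd.to_numeric(movies['metascore'], errors='coerce')
movies['votes'] = movies['votes'].str.replace(',', '').astype(int)
movies['us_grossMillions'] = movies['us_grossMillions'].map(lambda x: x.lstrip('$').rstrip('M'))
movies['us_grossMillions'] = pd.to_numeric(movies['us_grossMillions'], errors='coerce')

movies.to_csv('movies.csv')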
Conclusion
There you have it! We’ve successfully extracted data of the top 1,000 best movies of all time on IMDb, which included multiple pages, and saved it into a CSV file.
I hope you enjoyed building a Python scraper. If you followed along, let me know how it went.
Happy coding!