Scrape Multiple Pages of a Website Using a Python Web Scraper IMDb’s Top-泓源视野

This is the second article of my web scraping guide. In the first article, I showed you how you can find, extract, and clean the data from one single web page on IMDb.

In this article, you’ll learn how to scrape multiple web pages — a list that’s 20 pages and 1,000 movies total — with a Python web scraper.

Where We Left Off

In the previous article, we scraped and cleaned the data for the title, year of release, imdb_ratings, metascore, length of movie, number of votes, and the us_gross earnings of all movies on the first page of IMDb’s Top 1,000 movies.

This was the code we used:

And our results looked like this:

[Image: table of the scraped and cleaned first-page results]

What We’ll Cover

I’ll be guiding you through these steps:

  1. You’ll request the unique URLs for every page on this IMDb list.
  2. You’ll iterate through each page using a for loop, and you’ll scrape each movie one by one.
  3. You’ll control the loop’s rate to avoid flooding the server with requests.
  4. You’ll extract, clean, and download this final data.
  5. You’ll use basic data-quality best practices.

Introducing New Tools

These are the additional tools we’ll use in our scraper:

  • The sleep() function from Python’s time module will control the loop’s rate by pausing the execution of the loop for a specified number of seconds.
  • The randint() function from Python’s random module will vary the amount of waiting time between requests, within your specified interval.
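As a minimal sketch of how these two functions fit together: the helper name polite_pause and its sleeper parameter are hypothetical, added only so the pause can be swapped for a stand-in when you don't want to actually wait.

```python
from random import randint
from time import sleep


def polite_pause(low=2, high=10, sleeper=sleep):
    """Pause for a random whole number of seconds between low and high."""
    delay = randint(low, high)  # randint includes both endpoints
    sleeper(delay)              # by default this is time.sleep
    return delay


# Demo with a stand-in sleeper so the example finishes instantly.
waited = []
delay = polite_pause(2, 10, sleeper=waited.append)
print(f"Would have slept for {delay} seconds")
```

Calling polite_pause() with no arguments would really pause the program for 2 to 10 seconds.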

Time to Code

As mentioned in the first article, I recommend following along in a Repl.it environment if you don’t already have an IDE.

I’ll also be writing out this guide as if we were starting fresh, minus all the first guide’s explanations, so you aren’t required to copy and paste the first article’s code beforehand.

You can compare the first article’s code with this article’s final code to see how it all worked — you’ll notice a few slight changes.

Alternatively, you can go straight to the code here.

Now, let’s begin!

Import tools

Let’s import our previous tools and our new tools — time and random.

Initialize your storage

Like previously, we’re going to continue to use our empty lists as storage for all the data we scrape:

English movie titles

After we initialize our storage, we should have our code that makes sure we get English-translated titles from all the movies we scrape:
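The two setup steps above might look roughly like the sketch below. The list names are illustrative and simply mirror the fields listed earlier; Accept-Language is a standard HTTP request header, but the exact value used here is an assumption rather than the original listing.

```python
# Empty lists that will collect one value per movie as we scrape.
# (Illustrative names, one list per field.)
titles = []
years = []
lengths = []
imdb_ratings = []
metascores = []
votes = []
us_gross = []

# Ask the server for English-language content.
# The header name is standard HTTP; the value is an assumed example.
headers = {"Accept-Language": "en-US, en;q=0.5"}
```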

Analyzing our URL

Let’s go to the URL of the page we‘re scraping.

Now, let’s click on the next page and see what page 2’s URL looks like:

https://www.imdb.com/search/title/?groups=top_1000&start=51&ref_=adv_nxt

And then page 3’s URL:

https://www.imdb.com/search/title/?groups=top_1000&start=101&ref_=adv_nxt

What do we notice about the URL from page 2 to page 3?

We notice &start=51 is added into the URL when we go to page 2, and the number 51 turns to the number 101 on page 3.

This makes sense because there are 50 movies on each page. Page 1 is 1-50, page 2 is 51-100, page 3 is 101-150, and so on.

Why is this important? This information will help us tell our loop how to go to the next page to scrape.
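The pattern can be written as a small formula. This is only an illustration of the arithmetic; the function name start_for_page is made up for the example.

```python
MOVIES_PER_PAGE = 50


def start_for_page(page_number):
    """Return the value of the start parameter for a 1-based page number."""
    return 1 + (page_number - 1) * MOVIES_PER_PAGE


for n in (1, 2, 3, 20):
    print(f"page {n} starts at movie {start_for_page(n)}")
```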

Refresher on ‘for’ loops

Just like the loop we used to loop through each movie on the first page, we’ll use a for loop to iterate through each page on the list.

To refresh, this is how a for loop works:

for <variable> in <iterable>:
    <statement(s)>

<iterable> is a collection of objects—e.g. a list or tuple. The <statement(s)> are executed once for each item in <iterable>. The loop <variable> takes on the value of the next element in <iterable> each time through the loop.

Changing the URL Parameter

As I mentioned earlier, each page’s URL follows a certain logic as the web pages change. To make the URL requests we’d have to vary the value of the start parameter, like this:

Breaking down the URL parameters:

  • pages is the variable we create to store the array of start values our loop will iterate through
  • np.arange(1,1001,50) is a function in the NumPy Python library. It accepts several arguments, but we’re only using the first three: start, stop, and step. step is the number that defines the spacing between values. So: start at 1, stop at 1001, and step by 50.

Start at 1: This will be our first page’s URL.

Stop at 1001: Why stop at 1001? The number in the stop parameter is the number that defines the end of the array, but it isn’t included in the array. The last page for movies would be at the URL number of 951. This page has movies 951-1000. If we used 951 as the stop value, the array would end at 901 and our scraper would skip this page, so we have to go past 951 to make sure we get the last page.

Step at 50: We want the URL number to change by 50 each time the loop comes around, and this parameter tells it to do that.
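A quick way to convince yourself of the stop behaviour is to build both arrays and compare them. The variable name too_short is just for this illustration.

```python
import numpy as np

# Start at 1, stop before 1001, step by 50.
pages = np.arange(1, 1001, 50)

# Same call with 951 as the stop value: 951 itself is excluded.
too_short = np.arange(1, 951, 50)

print(len(pages), pages[0], pages[-1])
print(len(too_short), too_short[-1])
```

The first array has one entry per page of the list, while the second one ends a page early.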

Looping Through Each Page

Now we need to create another for loop that’ll run our scraper once for each value in the pages array we created above, which gives us each different URL we need. We can do this simply like this:

Breaking this loop down:

  • page is the variable that’ll take each value in our pages array in turn
  • pages is the array we created: np.arange(1,1001,50)

Requesting the URL + ‘html_soup’ + ‘movie_div’

Inside this new loop is where we’ll request our new URLs, add our html_soup (the parsed HTML, stored in the variable soup in the breakdown below), and add our movie_div (stores each div container we’re scraping). This is what it’ll look like:

Breaking page down:

  • page begins each iteration as the current start number from pages; we then reuse the same name to store the response that requests.get() returns for that URL
  • requests.get() is the method we use to grab the contents of each URL
  • "https://www.imdb.com/search/title/?groups=top_1000&start=" is the part of the URL that stays the same when we change each page
  • str(page) adds the current start number to the URL. It also converts that number to a string, because we’re building the URL by joining strings together and Python can’t join a string and an integer.
  • + "&ref_=adv_nxt" is added to the end of every URL because this part also doesn’t change when we go to the next page
  • headers=headers tells our scraper to bring us English-translated content from the URLs we’re requesting
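The URL-building part of that breakdown can be tried on its own, without sending any request. In this sketch the names BASE_URL, SUFFIX, and build_url are illustrative; the two string pieces are the ones quoted above.

```python
BASE_URL = "https://www.imdb.com/search/title/?groups=top_1000&start="
SUFFIX = "&ref_=adv_nxt"


def build_url(start):
    """Join the fixed parts of the URL around the changing start number."""
    return BASE_URL + str(start) + SUFFIX


print(build_url(51))

# Without str(), joining a string and an integer fails.
try:
    BASE_URL + 51 + SUFFIX
    concat_error = None
except TypeError as exc:
    concat_error = type(exc).__name__
print(concat_error)
```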

Breaking soup down:

  • soup is the variable we assign the BeautifulSoup object to
  • BeautifulSoup is the class that parses the page into a structure we can search
  • (page.text, "html.parser") passes in the text contents of page and tells BeautifulSoup to use Python’s built-in HTML parser, which lets Python read the components of the page rather than treating it as one long string

Breaking movie_div down:

  • movie_div is the variable we use to store all of the div containers with a class of lister-item mode-advanced
  • The find_all() method extracts all the div containers that have a class attribute of lister-item mode-advanced from what we’ve stored in our variable soup
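To see what find_all() returns without hitting IMDb, here is a small sketch that parses a hand-written HTML snippet. The snippet is made up for the example and is far simpler than a real IMDb page; only the class name lister-item mode-advanced comes from the text above.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a downloaded page (not real IMDb markup).
sample_html = """
<div class="lister-item mode-advanced"><h3>Movie A</h3></div>
<div class="lister-item mode-advanced"><h3>Movie B</h3></div>
<div class="sidebar">Not a movie</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Keep only the div containers whose class attribute matches.
movie_div = soup.find_all("div", class_="lister-item mode-advanced")
print(len(movie_div))
print([container.h3.text for container in movie_div])
```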

Controlling the Crawl Rate

Controlling the crawl rate is beneficial for the scraper and for the website we’re scraping. If we avoid hammering the server with a lot of requests all at once, then we’re much less likely to get our IP address banned — and we also avoid disrupting the activity of the website we scrape by allowing the server to respond to other user requests as well.

We’ll be adding this code to our new for loop:

Breaking crawl rate down:

  • The sleep() function will control the loop’s rate by pausing the execution of the loop for a specified amount of time
  • The randint(2,10) function will vary the waiting time between requests, picking a whole number of seconds from 2 to 10 (both included). You can change these parameters to any that you like.

Please note that this will increase the time it takes to grab all the data we need from every page, so be patient. There are 20 pages with a pause of up to 10 seconds after each one, so the pauses alone can add up to 200 seconds (about 3.3 minutes), plus the time the requests themselves take.

It’s very important to practice good scraping and to scrape responsibly!

Our code should now look like this:
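The original listing isn't reproduced here, so what follows is a rough sketch of the loop so far, not the exact code. The function names fetch_html and crawl, the fetch and pause parameters, and the Accept-Language value are assumptions made so the loop can be dry-run without sending requests or waiting.

```python
from random import randint
from time import sleep

BASE_URL = "https://www.imdb.com/search/title/?groups=top_1000&start="
SUFFIX = "&ref_=adv_nxt"
HEADERS = {"Accept-Language": "en-US, en;q=0.5"}  # assumed value


def fetch_html(url):
    """Download one page and return its HTML text.

    requests is imported here so the rest of the sketch can be
    exercised without the network.
    """
    import requests

    return requests.get(url, headers=HEADERS).text


def crawl(starts, fetch=fetch_html, pause=sleep, low=2, high=10):
    """Fetch every page in turn, pausing a random low-high seconds after each."""
    pages_html = []
    for start in starts:
        url = BASE_URL + str(start) + SUFFIX
        pages_html.append(fetch(url))
        pause(randint(low, high))  # control the crawl rate
    return pages_html


# Dry run with stand-ins: no request is sent and nothing actually sleeps.
requested, delays = [], []
html_pages = crawl(
    range(1, 1001, 50),
    fetch=lambda url: requested.append(url) or "<html></html>",
    pause=delays.append,
)
print(len(html_pages), requested[0], requested[-1])
```

For a real run you would call crawl() with only the start values, for example crawl(np.arange(1, 1001, 50)), and then parse each returned page with BeautifulSoup.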

Scraping Code

We can add our scraping for loop code into our new for loop:

Pointing Out Previous Errors

I’d like to point out a slight error I made in the previous article — a mistake I made regarding the cleaning of the metascore data.

I received this DM from an awesome dev who was running through my article and coding along but with a different IMDb URL than the one I used to teach in the guide.

[Image: screenshot of the DM]

In the extracting metascore data code, we wrote this:

This extraction code says if there is Metascore data there, grab it — but if the data is missing, then put a dash there and continue.
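A sketch of that "grab it, or put a dash" logic, run against a hand-written snippet. The span with class metascore is an assumption about the page markup, and the HTML here is invented for the example.

```python
from bs4 import BeautifulSoup

# Two made-up movie containers: one with a Metascore, one without.
sample_html = """
<div class="lister-item mode-advanced"><span class="metascore">85        </span></div>
<div class="lister-item mode-advanced"><p>No score here</p></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
metascores = []

for container in soup.find_all("div", class_="lister-item mode-advanced"):
    span = container.find("span", class_="metascore")  # None when missing
    # Keep the raw text if the span exists, otherwise store a dash.
    metascores.append(span.text if span else "-")

print(metascores)
```

The stored text still carries stray whitespace, which is one reason the column needs cleaning afterwards.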

In the cleaning of the metascore data code, we wrote this:

This cleaning code says to turn this pandas object into an integer data type, which worked for my URL I scraped because I didn’t have any missing Metascore data — e.g., no dashes in place of missing data.

What I failed to notice is that if someone scraped a different IMDb page than I did, they’d possibly have missing metascore data there, and once we scrape multiple pages in this guide, we’ll have missing metascore data as well.

What does this mean?

It means when we do get those dashes in place of missing data, we can’t use the code .astype(int) to convert the entire metascore column into an integer like I previously did; this would produce an error. We’d need to turn our metascore data into a float data type (decimal), because NaN, the marker pandas uses for missing values, is itself a float.
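The failure is easy to reproduce on a tiny made-up column. The values below are sample data, not scraped results.

```python
import pandas as pd

# A small stand-in for the scraped column, with a dash for a missing score.
raw = pd.Series(["85", "-", "100"], dtype=object)

try:
    raw.astype(int)
    failed = False
except ValueError as exc:
    failed = True
    print("Conversion failed:", exc)
```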

Fixing the Cleaning of the Metascore Data Code

Instead of this metascore data cleaning code:

We’ll use this:

Breaking down the new cleaning of the Metascore data:

Top-cleaning code:

  • movies['metascore'] on the left-hand side is the metascore column in our movies DataFrame. We’ll be assigning our newly cleaned data back to it.
  • movies['metascore'] on the right-hand side tells pandas to read from the metascore column
  • .str.extract('(\d+)') extracts the first run of digits from each string; rows with no digits, such as our dashes, become NaN

Bottom-conversion code:

  • movies['metascore'] now holds only the digits we extracted, and we assign the converted values back to it to finish up
  • pd.to_numeric is the function we use to change this column to a float. We use it because we have a lot of dashes in this column, and a dash can’t be converted with .astype(float); that would raise an error.
  • errors='coerce' turns any value that can’t be parsed as a number, such as our dashes, into not-a-number (NaN) instead of raising an error.
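Here is a small sketch of the two cleaning steps on made-up values. The sample DataFrame is hypothetical, and expand=False is an addition in this sketch so that str.extract returns a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Hypothetical raw values: padded text, a dash for missing data, a clean number.
movies = pd.DataFrame({"metascore": ["82        ", "-", "100"]}, dtype=object)

# Step 1: keep only the first run of digits; the dash has none and becomes NaN.
movies["metascore"] = movies["metascore"].str.extract(r"(\d+)", expand=False)

# Step 2: convert to numbers; anything unparseable is coerced to NaN.
movies["metascore"] = pd.to_numeric(movies["metascore"], errors="coerce")

print(movies["metascore"])
print(movies["metascore"].dtype)
```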

Add the DataFrame and Cleaning Code

Let’s add our DataFrame and cleaning code to our new scraper, which will go below our loops. If you have any questions regarding how this code works, go to the first article to see what each line executes.

The code should look like this:

Save to CSV

We have all the elements of our scraper ready — now it’s time to save all the data we’re about to scrape into our CSV.

Below is the code you can add to the bottom of your program to save your data to a CSV file:

In case you need a refresher, if you’re in Repl.it, you can create an empty CSV file by hovering near “Files” and clicking the “Add file” option. Name it, and save it with a .csv extension. Then, add the code to the end of your program:

movies.to_csv('the_name_of_your_csv_here.csv')
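If you'd like to try the save step before running the full scraper, this sketch writes a tiny made-up DataFrame to a temporary folder and reads it back. The file name, folder, and sample rows are all placeholders.

```python
import os
import tempfile

import pandas as pd

# Placeholder rows standing in for the scraped data.
movies = pd.DataFrame({"movie": ["Movie A", "Movie B"], "year": [1994, 1972]})

# Write to a throwaway folder so nothing in your project is overwritten.
csv_path = os.path.join(tempfile.mkdtemp(), "movies.csv")
movies.to_csv(csv_path)  # the row index is written as the first column

# Read it back, treating that first column as the index again.
reloaded = pd.read_csv(csv_path, index_col=0)
print(reloaded)
```

By default to_csv() also writes the row index, which is why the saved file has a leading column of row numbers.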

If we run the program and open our .csv, we should get a file with a list of movies and all the data, in rows numbered 0-999:

[Image: the saved CSV file listing the movies]

Basic Data-Quality Best Practices (Optional)

Here, I’ll discuss some basic data-quality tricks you can use when cleaning your data. You don’t need to apply any of this to our final scraper.

Usually, a dataset with a lot of missing data isn’t a good dataset at all. Below are ways we can look up, manipulate, and change our data — for future reference.

Missing data

One of the most common problems in a dataset is missing data. In our case, the data wasn’t available. There are a couple of ways to check and deal with missing data:

  • Check where we’re missing data and how much is missing
  • Add in a default value for the missing data
  • Delete the rows that have missing data
  • Delete the columns that have a high incidence of missing data

We’ll go through each of these in turn.

Check missing data:

We can easily check for missing data like this:
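One common way to do this in pandas is isnull().sum(), sketched below on a tiny made-up DataFrame. The rows are sample data; only the column names metascore and us_grossMillions come from this guide.

```python
import numpy as np
import pandas as pd

# Made-up rows with a few gaps.
movies = pd.DataFrame({
    "movie": ["Movie A", "Movie B", "Movie C"],
    "metascore": [85.0, np.nan, 90.0],
    "us_grossMillions": [np.nan, np.nan, 12.5],
})

# Count the missing values in each column.
missing = movies.isnull().sum()
print(missing)

# Total number of missing values in the whole DataFrame.
total = int(missing.sum())
print(total)
```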

The output:

[Image: output of the missing-data check]

This shows us where the data is missing and how much data is missing. We have 165 missing values in metascore and 161 missing in us_grossMillions, a total of 326 missing values in our dataset.

Add default value for missing data:

If you wanted to change your NaN values to something else specific, you can do so like this:

For this example, I want the words “None Given” in place of metascore NaN values and empty quotes (nothing) in place of us_grossMillions NaN values.
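A sketch of one way to do that with fillna(), using a small made-up DataFrame in place of the scraped one. It also prints the column types before and after.

```python
import numpy as np
import pandas as pd

# Made-up rows: one missing value in each column.
movies = pd.DataFrame({
    "metascore": [85.0, np.nan],
    "us_grossMillions": [np.nan, 12.5],
})
print(movies.dtypes)  # both columns start out as floats

# Replace NaN with the chosen defaults, column by column.
movies["metascore"] = movies["metascore"].fillna("None Given")
movies["us_grossMillions"] = movies["us_grossMillions"].fillna("")

print(movies)
print(movies.dtypes)  # mixing text into a numeric column changes its type
```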

If you print those columns, you can see our NaN values have been changed as specified:

[Image: the metascore and us_grossMillions columns after replacing NaN values]

Beware: Both columns held numbers before this change (us_grossMillions was a float, and metascore became a float once we switched to pd.to_numeric), and you can see how they’re both objects now because of the change. Be careful when changing your data, and always check to see what your data types are when making any alterations.

Delete rows with missing data:

Sometimes the best route to take when having a lot of missing data is to just remove the affected rows altogether. We can do this a couple of different ways:
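Two possible ways are sketched below with dropna(); they aren't necessarily the same two the original listing showed. The sample rows are made up.

```python
import numpy as np
import pandas as pd

movies = pd.DataFrame({
    "movie": ["Movie A", "Movie B", "Movie C"],
    "metascore": [85.0, np.nan, 90.0],
    "us_grossMillions": [np.nan, 4.2, 12.5],
})

# Way 1: drop every row that has a missing value in any column.
no_missing = movies.dropna()

# Way 2: drop rows only when one specific column is missing.
has_score = movies.dropna(subset=["metascore"])

print(len(movies), len(no_missing), len(has_score))
```

Both calls return a new DataFrame and leave movies itself unchanged.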

Delete columns with missing data:

Sometimes when we have too many missing values in a column, it’s best to get rid of the column. We can do so like this:

  • axis=1 is the parameter we use. It means operate on columns, not rows; axis=0 means rows. We could’ve used this parameter in our delete-rows section, but the default is already 0, so I didn’t use it.
  • how='any' means a column is dropped if it contains any NA values.
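Putting those two parameters together on a small made-up DataFrame gives something like the following sketch.

```python
import numpy as np
import pandas as pd

movies = pd.DataFrame({
    "movie": ["Movie A", "Movie B", "Movie C"],
    "metascore": [85.0, np.nan, 90.0],
    "us_grossMillions": [np.nan, 4.2, 12.5],
})

# Drop every column that contains at least one missing value.
trimmed = movies.dropna(axis=1, how="any")
print(list(trimmed.columns))
```

Here both numeric columns have a gap, so only the movie column survives.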

The Final Code

Conclusion

There you have it! We’ve successfully extracted data of the top 1,000 best movies of all time on IMDb, which included multiple pages, and saved it into a CSV file.

I hope you enjoyed building a Python scraper. If you followed along, let me know how it went.

Happy coding!

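This article was published by 泓源视野 (author: admin), and its copyright belongs to 泓源视野. The content reflects the author’s personal views and does not mean that 泓源视野 agrees with or supports them. If you repost it, please credit the source.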