Time to Code
As mentioned in the first article, I recommend following along in a Repl.it environment if you don’t already have an IDE.
I’ll also be writing out this guide as if we were starting fresh, minus all the first guide’s explanations, so you aren’t required to copy and paste the first article’s code beforehand.
You can compare the first article’s code with this article’s final code to see how it all worked — you’ll notice a few slight changes.
Alternatively, you can go straight to the code here.
Now, let’s begin!
Let’s import our previous tools and our new tools:
Initialize your storage
As before, we’ll keep using empty lists as storage for all the data we scrape:
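As a minimal sketch of that storage, assuming one list per field: the list names below are hypothetical placeholders, not necessarily the ones used in the first article.

```python
# Hypothetical names: one empty list per field we plan to scrape.
# Each list will later receive one value per movie, in page order.
titles = []
years = []
ratings = []

print(len(titles), len(years), len(ratings))
```

Keeping one list per field means the lists stay aligned by position, so the first entry in each list describes the same movie.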
English movie titles
After initializing our storage, we add the code that makes sure we get English-translated titles for all the movies we scrape:
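One common way to ask a site for English content is the standard Accept-Language HTTP header. The sketch below assumes that approach and a hypothetical variable name, headers; it only builds the dictionary and sends no request.

```python
# Assumption: the site honours the standard Accept-Language request header.
# "en-US" is preferred; any other English variant is accepted at lower priority.
headers = {"Accept-Language": "en-US, en;q=0.5"}

# With the requests library, this dict would be passed along with each request,
# e.g. requests.get(url, headers=headers).
print(headers)
```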
Analyzing our URL
Let’s go to the URL of the page we’re scraping.
Now, let’s click on the next page and see what page 2’s URL looks like:
And then page 3’s URL:
What do we notice about the URL from page 2 to page 3?
&start=51 is added to the URL when we go to page 2, and the number 51 turns into 101 on page 3.
This makes sense because there are 50 movies on each page. Page 1 is 1-50, page 2 is 51-100, page 3 is 101-150, and so on.
Why is this important? This information will help us tell our loop how to go to the next page to scrape.
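The arithmetic behind that pattern can be sketched as follows. The function names and BASE_URL are hypothetical placeholders for illustration; swap in the real URL of the list you are scraping.

```python
MOVIES_PER_PAGE = 50  # from the observation above: 50 movies per page

# Hypothetical placeholder URL, not the real page address.
BASE_URL = "https://example.com/list?sort=rating"


def start_for_page(page):
    """Return the value of the start parameter for a 1-based page number."""
    return (page - 1) * MOVIES_PER_PAGE + 1


def url_for_page(page):
    """Build the URL for a page; page 1 carries no start parameter."""
    if page == 1:
        return BASE_URL
    return f"{BASE_URL}&start={start_for_page(page)}"


for page in (1, 2, 3):
    print(page, start_for_page(page), url_for_page(page))
```

The same sequence of start values can also be generated directly with range(1, 151, 50), which yields 1, 51 and 101.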
Refresher on ‘for’ loops
Just like the loop we used to loop through each movie on the first page, we’ll use a for loop to iterate through each page on the list.
To refresh, this is how a for loop works:
for <variable> in <iterable>:
    <statement(s)>
<iterable> is a collection of objects, e.g. a list or tuple. The <statement(s)> are executed once for each item in <iterable>. The loop <variable> takes on the value of the next element in <iterable> each time through the loop.
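Here is a small illustration of that pattern, using a made-up list rather than scraped data. The names movies, movie and seen are placeholders chosen for this example.

```python
# Made-up sample data standing in for scraped items.
movies = ["first", "second", "third"]

seen = []
for movie in movies:            # movie takes each element of movies in turn
    seen.append(movie.upper())  # the loop body runs once per element
    print(movie)
```

After the loop finishes, seen holds one transformed entry per item in movies, in the same order.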