﻿{"id":849,"date":"2021-01-25T22:44:12","date_gmt":"2021-01-25T14:44:12","guid":{"rendered":"https:\/\/byy3.com\/?p=849"},"modified":"2021-01-25T22:44:12","modified_gmt":"2021-01-25T14:44:12","slug":"scrape-multiple-pages-of-a-website-using-a-python-web-scraper-imdbs-top","status":"publish","type":"post","link":"https:\/\/byy3.com\/?p=849","title":{"rendered":"Scrape Multiple Pages of a Website Using a Python Web Scraper IMDb\u2019s Top"},"content":{"rendered":"<section class=\"df dg dh di dj\">\n<div class=\"hf\">\n<div class=\"n p\">\n<div class=\"hg hh hi hj hk hl af hm ag hn ai aj\">\n<figure class=\"hp hq hr hs ht hf hu hv paragraph-image\">\n<div class=\"hw hx fh hy aj hz\" tabindex=\"0\" role=\"button\">\n<div class=\"cx cy ho\">\n<div class=\"if s fh ig\">\n<div class=\"ih ii s\">\n<section class=\"df dg dh di dj\">\n<div class=\"hf\">\n<div class=\"n p\">\n<div class=\"hg hh hi hj hk hl af hm ag hn ai aj\">\n<figure class=\"hp hq hr hs ht hf hu hv paragraph-image\">\n<div class=\"hw hx fh hy aj hz\" tabindex=\"0\" role=\"button\">\n<div class=\"cx cy ho\">\n<div class=\"if s fh ig\">\n<div class=\"ih ii s\"><img loading=\"lazy\" decoding=\"async\" class=\"xd xe t u v ic aj c\" data-original=\"https:\/\/miro.medium.com\/max\/2036\/1*brkY7M89o4x8pDW9chkPEg.png\" src=\"https:\/\/byy3.com\/wp-content\/themes\/MNews%20V2.4\/images\/post-loading.gif\" sizes=\"auto, 1000px\" srcset=\"https:\/\/miro.medium.com\/max\/414\/1*brkY7M89o4x8pDW9chkPEg.png 276w, https:\/\/miro.medium.com\/max\/828\/1*brkY7M89o4x8pDW9chkPEg.png 552w, https:\/\/miro.medium.com\/max\/960\/1*brkY7M89o4x8pDW9chkPEg.png 640w, https:\/\/miro.medium.com\/max\/1092\/1*brkY7M89o4x8pDW9chkPEg.png 728w, https:\/\/miro.medium.com\/max\/1224\/1*brkY7M89o4x8pDW9chkPEg.png 816w, https:\/\/miro.medium.com\/max\/1356\/1*brkY7M89o4x8pDW9chkPEg.png 904w, https:\/\/miro.medium.com\/max\/1488\/1*brkY7M89o4x8pDW9chkPEg.png 992w, https:\/\/miro.medium.com\/max\/1500\/1*brkY7M89o4x8pDW9chkPEg.png 1000w\" 
width=\"1357\" height=\"644\" title=\"Scrape Multiple Pages of a Website Using a Python Web Scraper IMDb\u2019s Top\u63d2\u56fe\" alt=\"Scrape Multiple Pages of a Website Using a Python Web Scraper IMDb\u2019s Top\u63d2\u56fe\" \/><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/figure>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"n p\">\n<div class=\"ab ac ae af ag dk ai aj\">\n<p id=\"c05c\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">This is the second article of my web scraping guide.\u00a0<a class=\"ck ji\" href=\"https:\/\/byy3.com\/go\/?url=https:\/\/medium.com\/better-programming\/the-only-step-by-step-guide-youll-need-to-build-a-web-scraper-with-python-e79066bd895a\" target=\"_blank\" rel=\"noopener\" rel=\"nofollow\" >In the first article<\/a>, I showed you how you can find, extract, and clean the data from one single web page on IMDb.<\/p>\n<p id=\"6246\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">In this article, you\u2019ll learn how to scrape multiple web pages \u2014 a list that\u2019s 20 pages and 1,000 movies total \u2014<strong class=\"io jj\">\u00a0<\/strong>with a Python web scraper.<\/p>\n<\/div>\n<\/div>\n<\/section>\n<div class=\"n p fc jk jl jm\" role=\"separator\"><\/div>\n<section class=\"df dg dh di dj\">\n<div class=\"n p\">\n<div class=\"ab ac ae af ag dk ai aj\">\n<h1 id=\"8fb0\" class=\"jr js dn ce jt ju jv iq jw jx jy it jz ka kb kc kd ke kf kg kh ki kj kk kl km ek\" data-selectable-paragraph=\"\">Where We Left Off<\/h1>\n<p id=\"c308\" class=\"im in dn io b em kn iq ir ep ko it iu iv kp ix iy iz kq jb jc jd kr jf jg jh df ek\" data-selectable-paragraph=\"\">In the previous article, we scraped and cleaned the data of the\u00a0<code class=\"ig ks kt ku kv b\">title<\/code>,\u00a0<code class=\"ig ks kt ku kv b\">year<\/code>\u00a0of release,\u00a0<code class=\"ig ks kt ku kv 
b\">imdb_ratings<\/code>,\u00a0<code class=\"ig ks kt ku kv b\">metascore<\/code>,\u00a0<code class=\"ig ks kt ku kv b\">length<\/code>\u00a0of movie, number of\u00a0<code class=\"ig ks kt ku kv b\">votes<\/code>, and the\u00a0<code class=\"ig ks kt ku kv b\">us_gross<\/code>\u00a0earnings of all movies on the first page of\u00a0<a class=\"ck ji\" href=\"https:\/\/byy3.com\/go\/?url=https:\/\/www.imdb.com\/search\/title\/?groups=top_1000&amp;ref_=adv_prv\" target=\"_blank\" rel=\"noopener nofollow\" rel=\"nofollow\" >IMDb\u2019s Top 1,000 movies<\/a>.<\/p>\n<h2 id=\"6d8c\" class=\"kw js dn ce jt kx ky eo jw kz la er jz es lb eu kd ev lc ex kh ey ld fa kl le ek\" data-selectable-paragraph=\"\">This was the code we used:<\/h2>\n<figure class=\"hp hq hr hs ht hf\">\n<div class=\"if s fh\">\n<div class=\"abr ii s\"><iframe loading=\"lazy\" class=\"t u v ic aj\" title=\"page.py\" src=\"https:\/\/medium.com\/media\/232a55f8a2521b617e3b89006dc18b2b\" width=\"680\" height=\"1720\" frameborder=\"0\" scrolling=\"auto\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/div>\n<\/div>\n<\/figure>\n<h2 id=\"627a\" class=\"kw js dn ce jt kx ky eo jw kz la er jz es lb eu kd ev lc ex kh ey ld fa kl le ek\" data-selectable-paragraph=\"\">And our results looked like this:<\/h2>\n<figure class=\"hp hq hr hs ht hf cx cy paragraph-image\">\n<div class=\"cx cy lg\">\n<div class=\"if s fh ig\">\n<div class=\"lh ii s\">\n<div class=\"ia ib t u v ic aj bk id ie\"><img loading=\"lazy\" decoding=\"async\" class=\"t u v ic aj ij ik aq xk\" data-original=\"https:\/\/miro.medium.com\/max\/60\/1*5_OURkiwl-hc1J459m9riA.png?q=20\" src=\"https:\/\/byy3.com\/wp-content\/themes\/MNews%20V2.4\/images\/post-loading.gif\" width=\"583\" height=\"587\" title=\"Scrape Multiple Pages of a Website Using a Python Web Scraper IMDb\u2019s Top\u63d2\u56fe1\" alt=\"Scrape Multiple Pages of a Website Using a Python Web Scraper IMDb\u2019s Top\u63d2\u56fe1\" \/><\/div>\n<p><img loading=\"lazy\" decoding=\"async\" 
class=\"xd xe t u v ic aj c\" data-original=\"https:\/\/miro.medium.com\/max\/875\/1*5_OURkiwl-hc1J459m9riA.png\" src=\"https:\/\/byy3.com\/wp-content\/themes\/MNews%20V2.4\/images\/post-loading.gif\" sizes=\"auto, 583px\" srcset=\"https:\/\/miro.medium.com\/max\/414\/1*5_OURkiwl-hc1J459m9riA.png 276w, https:\/\/miro.medium.com\/max\/828\/1*5_OURkiwl-hc1J459m9riA.png 552w, https:\/\/miro.medium.com\/max\/875\/1*5_OURkiwl-hc1J459m9riA.png 583w\" width=\"583\" height=\"587\" title=\"Scrape Multiple Pages of a Website Using a Python Web Scraper IMDb\u2019s Top\u63d2\u56fe2\" alt=\"Scrape Multiple Pages of a Website Using a Python Web Scraper IMDb\u2019s Top\u63d2\u56fe2\" \/><\/div>\n<\/div>\n<\/div>\n<\/figure>\n<\/div>\n<\/div>\n<\/section>\n<div class=\"n p fc jk jl jm\" role=\"separator\"><\/div>\n<section class=\"df dg dh di dj\">\n<div class=\"n p\">\n<div class=\"ab ac ae af ag dk ai aj\">\n<h1 id=\"67d3\" class=\"jr js dn ce jt ju jv iq jw jx jy it jz ka kb kc kd ke kf kg kh ki kj kk kl km ek\" data-selectable-paragraph=\"\">What We\u2019ll Cover<\/h1>\n<p id=\"2032\" class=\"im in dn io b em kn iq ir ep ko it iu iv kp ix iy iz kq jb jc jd kr jf jg jh df ek\" data-selectable-paragraph=\"\">I\u2019ll be guiding you through these steps:<\/p>\n<ol class=\"\">\n<li id=\"8d52\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh li lj lk ek\" data-selectable-paragraph=\"\">You\u2019ll request the unique URLs for every page on\u00a0<a class=\"ck ji\" href=\"https:\/\/byy3.com\/go\/?url=https:\/\/www.imdb.com\/search\/title\/?groups=top_1000&amp;ref_=adv_prv\" target=\"_blank\" rel=\"noopener nofollow\" rel=\"nofollow\" >this IMDb list<\/a>.<\/li>\n<li id=\"a6d3\" class=\"im in dn io b em ll iq ir ep lm it iu iv ln ix iy iz lo jb jc jd lp jf jg jh li lj lk ek\" data-selectable-paragraph=\"\">You\u2019ll iterate through each page using a\u00a0<code class=\"ig ks kt ku kv b\">for<\/code>\u00a0loop, and you\u2019ll scrape each movie one 
by one.<\/li>\n<li id=\"2bb0\" class=\"im in dn io b em ll iq ir ep lm it iu iv ln ix iy iz lo jb jc jd lp jf jg jh li lj lk ek\" data-selectable-paragraph=\"\">You\u2019ll control the loop\u2019s rate to avoid flooding the server with requests.<\/li>\n<li id=\"ea02\" class=\"im in dn io b em ll iq ir ep lm it iu iv ln ix iy iz lo jb jc jd lp jf jg jh li lj lk ek\" data-selectable-paragraph=\"\">You\u2019ll extract, clean, and download this final data.<\/li>\n<li id=\"33d2\" class=\"im in dn io b em ll iq ir ep lm it iu iv ln ix iy iz lo jb jc jd lp jf jg jh li lj lk ek\" data-selectable-paragraph=\"\">You\u2019ll use basic data-quality best practices.<\/li>\n<\/ol>\n<\/div>\n<\/div>\n<\/section>\n<div class=\"n p fc jk jl jm\" role=\"separator\"><\/div>\n<section class=\"df dg dh di dj\">\n<div class=\"n p\">\n<div class=\"ab ac ae af ag dk ai aj\">\n<h1 id=\"4bc0\" class=\"jr js dn ce jt ju jv iq jw jx jy it jz ka kb kc kd ke kf kg kh ki kj kk kl km ek\" data-selectable-paragraph=\"\">Introducing New Tools<\/h1>\n<p id=\"dec4\" class=\"im in dn io b em kn iq ir ep ko it iu iv kp ix iy iz kq jb jc jd kr jf jg jh df ek\" data-selectable-paragraph=\"\">These are the additional tools we\u2019ll use in our scraper:<\/p>\n<ul class=\"\">\n<li id=\"1234\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh lq lj lk ek\" data-selectable-paragraph=\"\">The\u00a0<code class=\"ig ks kt ku kv b\"><strong class=\"io jj\">sleep()<\/strong><\/code>\u00a0<a class=\"ck ji\" href=\"https:\/\/byy3.com\/go\/?url=https:\/\/docs.python.org\/3\/library\/time.html?highlight=time%20module#time.sleep\" target=\"_blank\" rel=\"noopener nofollow\" rel=\"nofollow\" >function<\/a>\u00a0from Python\u2019s\u00a0<code class=\"ig ks kt ku kv b\">time<\/code>\u00a0<a class=\"ck ji\" href=\"https:\/\/byy3.com\/go\/?url=https:\/\/docs.python.org\/3\/library\/time.html?highlight=time%20module#module-time\" target=\"_blank\" rel=\"noopener nofollow\" rel=\"nofollow\" 
>module<\/a>\u00a0will control the loop\u2019s rate by pausing the execution of the loop for a specified amount of seconds.<\/li>\n<li id=\"6425\" class=\"im in dn io b em ll iq ir ep lm it iu iv ln ix iy iz lo jb jc jd lp jf jg jh lq lj lk ek\" data-selectable-paragraph=\"\">The\u00a0<code class=\"ig ks kt ku kv b\"><strong class=\"io jj\">randint()<\/strong><\/code>\u00a0<a class=\"ck ji\" href=\"https:\/\/byy3.com\/go\/?url=https:\/\/docs.python.org\/3\/library\/random.html?highlight=random%20module#random.randint\" target=\"_blank\" rel=\"noopener nofollow\" rel=\"nofollow\" >function<\/a>\u00a0from Python\u2019s\u00a0<code class=\"ig ks kt ku kv b\">random<\/code>\u00a0<a class=\"ck ji\" href=\"https:\/\/byy3.com\/go\/?url=https:\/\/docs.python.org\/3\/library\/random.html?highlight=random%20module#module-random\" target=\"_blank\" rel=\"noopener nofollow\" rel=\"nofollow\" >module<\/a>\u00a0will vary the amount of waiting time between requests \u2014 within your specified interval<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/section>\n<div class=\"n p fc jk jl jm\" role=\"separator\"><\/div>\n<section class=\"df dg dh di dj\">\n<div class=\"n p\">\n<div class=\"ab ac ae af ag dk ai aj\">\n<h1 id=\"e7a3\" class=\"jr js dn ce jt ju jv iq jw jx jy it jz ka kb kc kd ke kf kg kh ki kj kk kl km ek\" data-selectable-paragraph=\"\">Time to Code<\/h1>\n<p id=\"ffd6\" class=\"im in dn io b em kn iq ir ep ko it iu iv kp ix iy iz kq jb jc jd kr jf jg jh df ek\" data-selectable-paragraph=\"\">As mentioned in the first article, I recommend following along in a\u00a0<a class=\"ck ji\" href=\"https:\/\/byy3.com\/go\/?url=https:\/\/repl.it\/~\" target=\"_blank\" rel=\"noopener nofollow\" rel=\"nofollow\" >Repl.it<\/a>\u00a0environment if you don\u2019t already have an IDE.<\/p>\n<p id=\"f4ed\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">I\u2019ll also be writing out this guide as if we were starting fresh, 
minus all the first guide\u2019s explanations, so you aren\u2019t required to copy and paste the first article\u2019s code beforehand.<\/p>\n<p id=\"5242\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">You can compare the first article\u2019s code with this article\u2019s final code to see how it all worked \u2014 you\u2019ll notice a few slight changes.<\/p>\n<p id=\"dd91\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">Alternatively, you can go straight to the code\u00a0<a class=\"ck ji\" href=\"https:\/\/byy3.com\/go\/?url=https:\/\/github.com\/angelicadietzel\/data-projects\/tree\/master\/multi-page-imdb-scraper\" target=\"_blank\" rel=\"noopener nofollow\" rel=\"nofollow\" >here<\/a>.<\/p>\n<p id=\"3bb0\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">Now, let\u2019s begin!<\/p>\n<h2 id=\"6904\" class=\"kw js dn ce jt kx ky eo jw kz la er jz es lb eu kd ev lc ex kh ey ld fa kl le ek\" data-selectable-paragraph=\"\">Import tools<\/h2>\n<p id=\"f026\" class=\"im in dn io b em kn iq ir ep ko it iu iv kp ix iy iz kq jb jc jd kr jf jg jh df ek\" data-selectable-paragraph=\"\">Let\u2019s import our previous tools and our new tools \u2014\u00a0<code class=\"ig ks kt ku kv b\">time<\/code>\u00a0and\u00a0<code class=\"ig ks kt ku kv b\">random<\/code>.<\/p>\n<figure class=\"hp hq hr hs ht hf\">\n<div class=\"if s fh\">\n<div class=\"abq ii s\"><iframe loading=\"lazy\" class=\"t u v ic aj\" title=\"final_import_tools.py\" src=\"https:\/\/medium.com\/media\/b0304bd8f1be2dd101379693d2900204\" width=\"680\" height=\"210\" frameborder=\"0\" scrolling=\"auto\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/div>\n<\/div>\n<\/figure>\n<h2 id=\"66e3\" class=\"kw js dn ce jt kx ky eo jw kz la er jz es lb eu kd ev lc ex kh ey ld fa kl le ek\" 
data-selectable-paragraph=\"\">Initialize your storage<\/h2>\n<p id=\"ea9a\" class=\"im in dn io b em kn iq ir ep ko it iu iv kp ix iy iz kq jb jc jd kr jf jg jh df ek\" data-selectable-paragraph=\"\">As before, we\u2019ll use empty lists as storage for all the data we scrape:<\/p>\n<figure class=\"hp hq hr hs ht hf\">\n<div class=\"if s fh\">\n<div class=\"abq ii s\"><iframe loading=\"lazy\" class=\"t u v ic aj\" title=\"empty_lists.py\" src=\"https:\/\/medium.com\/media\/e88d73e593b3c8f31d1f8311f1fe1c14\" width=\"680\" height=\"210\" frameborder=\"0\" scrolling=\"auto\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/div>\n<\/div>\n<\/figure>\n<h2 id=\"5a1f\" class=\"kw js dn ce jt kx ky eo jw kz la er jz es lb eu kd ev lc ex kh ey ld fa kl le ek\" data-selectable-paragraph=\"\">English movie titles<\/h2>\n<p id=\"18fa\" class=\"im in dn io b em kn iq ir ep ko it iu iv kp ix iy iz kq jb jc jd kr jf jg jh df ek\" data-selectable-paragraph=\"\">After we initialize our storage, we add the code that makes sure we get English-translated titles for all the movies we scrape:<\/p>\n<figure class=\"hp hq hr hs ht hf\">\n<div class=\"if s fh\">\n<div class=\"abh ii s\"><iframe loading=\"lazy\" class=\"t u v ic aj\" title=\"headers.py\" src=\"https:\/\/medium.com\/media\/cb1ec83cf48b5d1d13c8950f027eb006\" width=\"680\" height=\"61\" frameborder=\"0\" scrolling=\"auto\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/div>\n<\/div>\n<\/figure>\n<h2 id=\"f874\" class=\"kw js dn ce jt kx ky eo jw kz la er jz es lb eu kd ev lc ex kh ey ld fa kl le ek\" data-selectable-paragraph=\"\">Analyzing our URL<\/h2>\n<p id=\"fed3\" class=\"im in dn io b em kn iq ir ep ko it iu iv kp ix iy iz kq jb jc jd kr jf jg jh df ek\" data-selectable-paragraph=\"\">Let\u2019s go to the\u00a0<a class=\"ck ji\" href=\"https:\/\/byy3.com\/go\/?url=https:\/\/www.imdb.com\/search\/title\/?groups=top_1000&amp;ref_=adv_prv\" target=\"_blank\" rel=\"noopener 
nofollow\" rel=\"nofollow\" >URL of the page we\u2018re scraping<\/a>.<\/p>\n<p id=\"e54b\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">Now, let\u2019s click on the next page and see what<strong class=\"io jj\">\u00a0<\/strong>page 2\u2019s<strong class=\"io jj\">\u00a0<\/strong>URL looks like:<\/p>\n<figure class=\"hp hq hr hs ht hf cx cy paragraph-image\">\n<div class=\"hw hx fh hy aj hz\" tabindex=\"0\" role=\"button\">\n<div class=\"cx cy lr\">\n<div class=\"if s fh ig\">\n<div class=\"ls ii s\">\n<div class=\"ia ib t u v ic aj bk id ie\"><img loading=\"lazy\" decoding=\"async\" class=\"t u v ic aj ij ik aq xk\" data-original=\"https:\/\/miro.medium.com\/max\/60\/1*5Sne9VN5k526B95rjsw29g.png?q=20\" src=\"https:\/\/byy3.com\/wp-content\/themes\/MNews%20V2.4\/images\/post-loading.gif\" width=\"8000\" height=\"907\" title=\"Scrape Multiple Pages of a Website Using a Python Web Scraper IMDb\u2019s Top\u63d2\u56fe3\" alt=\"Scrape Multiple Pages of a Website Using a Python Web Scraper IMDb\u2019s Top\u63d2\u56fe3\" \/><\/div>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"xd xe t u v ic aj c\" data-original=\"https:\/\/miro.medium.com\/max\/12000\/1*5Sne9VN5k526B95rjsw29g.png\" src=\"https:\/\/byy3.com\/wp-content\/themes\/MNews%20V2.4\/images\/post-loading.gif\" sizes=\"auto, 700px\" srcset=\"https:\/\/miro.medium.com\/max\/414\/1*5Sne9VN5k526B95rjsw29g.png 276w, https:\/\/miro.medium.com\/max\/828\/1*5Sne9VN5k526B95rjsw29g.png 552w, https:\/\/miro.medium.com\/max\/960\/1*5Sne9VN5k526B95rjsw29g.png 640w, https:\/\/miro.medium.com\/max\/1050\/1*5Sne9VN5k526B95rjsw29g.png 700w\" width=\"8000\" height=\"907\" title=\"Scrape Multiple Pages of a Website Using a Python Web Scraper IMDb\u2019s Top\u63d2\u56fe4\" alt=\"Scrape Multiple Pages of a Website Using a Python Web Scraper IMDb\u2019s Top\u63d2\u56fe4\" \/><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/figure>\n<p id=\"ff2a\" class=\"im in 
dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">And then page 3\u2019s URL:<\/p>\n<figure class=\"hp hq hr hs ht hf cx cy paragraph-image\">\n<div class=\"hw hx fh hy aj hz\" tabindex=\"0\" role=\"button\">\n<div class=\"cx cy lt\">\n<div class=\"if s fh ig\">\n<div class=\"lu ii s\">\n<div class=\"ia ib t u v ic aj bk id ie\"><img loading=\"lazy\" decoding=\"async\" class=\"t u v ic aj ij ik aq xk\" data-original=\"https:\/\/miro.medium.com\/max\/60\/1*QQICBI4Qfox4FwHPIDp9aQ.png?q=20\" src=\"https:\/\/byy3.com\/wp-content\/themes\/MNews%20V2.4\/images\/post-loading.gif\" width=\"7945\" height=\"948\" title=\"Scrape Multiple Pages of a Website Using a Python Web Scraper IMDb\u2019s Top\u63d2\u56fe5\" alt=\"Scrape Multiple Pages of a Website Using a Python Web Scraper IMDb\u2019s Top\u63d2\u56fe5\" \/><\/div>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"xd xe t u v ic aj c\" data-original=\"https:\/\/miro.medium.com\/max\/11918\/1*QQICBI4Qfox4FwHPIDp9aQ.png\" src=\"https:\/\/byy3.com\/wp-content\/themes\/MNews%20V2.4\/images\/post-loading.gif\" sizes=\"auto, 700px\" srcset=\"https:\/\/miro.medium.com\/max\/414\/1*QQICBI4Qfox4FwHPIDp9aQ.png 276w, https:\/\/miro.medium.com\/max\/828\/1*QQICBI4Qfox4FwHPIDp9aQ.png 552w, https:\/\/miro.medium.com\/max\/960\/1*QQICBI4Qfox4FwHPIDp9aQ.png 640w, https:\/\/miro.medium.com\/max\/1050\/1*QQICBI4Qfox4FwHPIDp9aQ.png 700w\" width=\"7945\" height=\"948\" title=\"Scrape Multiple Pages of a Website Using a Python Web Scraper IMDb\u2019s Top\u63d2\u56fe6\" alt=\"Scrape Multiple Pages of a Website Using a Python Web Scraper IMDb\u2019s Top\u63d2\u56fe6\" \/><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/figure>\n<p id=\"2106\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">What do we notice about the URL from page 2 to page 3?<\/p>\n<p id=\"d457\" class=\"im in dn io b em ip iq ir ep is it iu iv iw 
ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">We notice\u00a0<code class=\"ig ks kt ku kv b\">&amp;start=51<\/code>\u00a0is added to the URL when we go to page 2, and the number\u00a0<code class=\"ig ks kt ku kv b\">51<\/code>\u00a0changes to\u00a0<code class=\"ig ks kt ku kv b\">101<\/code>\u00a0on page 3.<\/p>\n<p id=\"e12c\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">This makes sense because there are 50 movies on each page. Page 1 is 1-50, page 2 is 51-100, page 3 is 101-150, and so on.<\/p>\n<p id=\"1548\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">Why is this important? This information will help us tell our loop\u00a0<em class=\"lv\">how<\/em>\u00a0to go to the next page to scrape.<\/p>\n<h2 id=\"f081\" class=\"kw js dn ce jt kx ky eo jw kz la er jz es lb eu kd ev lc ex kh ey ld fa kl le ek\" data-selectable-paragraph=\"\">Refresher on \u2018<code class=\"ig ks kt ku kv b\">for<\/code>\u2019\u00a0loops<\/h2>\n<p id=\"030e\" class=\"im in dn io b em kn iq ir ep ko it iu iv kp ix iy iz kq jb jc jd kr jf jg jh df ek\" data-selectable-paragraph=\"\">Just as we used a loop to go through each movie on the first page, we\u2019ll use a\u00a0<code class=\"ig ks kt ku kv b\">for<\/code>\u00a0loop to iterate through each page on the list.<\/p>\n<p id=\"ee16\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">To refresh, this is how a\u00a0<code class=\"ig ks kt ku kv b\">for<\/code>\u00a0loop works:<\/p>\n<pre class=\"hp hq hr hs ht lw lx bt\"><span id=\"4aae\" class=\"ek kw js dn kv b ly lz ma s mb\" data-selectable-paragraph=\"\">for &lt;variable&gt; in &lt;iterable&gt;:\r\n    &lt;statement(s)&gt;<\/span><\/pre>\n<p id=\"529b\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd 
je jf jg jh df ek\" data-selectable-paragraph=\"\"><code class=\"ig ks kt ku kv b\">&lt;iterable&gt;<\/code>\u00a0is a collection of objects\u2014e.g. a list or tuple. The\u00a0<code class=\"ig ks kt ku kv b\">&lt;statement(s)&gt;\u00a0<\/code>are executed once for each item in\u00a0<code class=\"ig ks kt ku kv b\">&lt;iterable&gt;<\/code>. The loop\u00a0<code class=\"ig ks kt ku kv b\">&lt;variable&gt;<\/code>\u00a0takes on the value of the next element in\u00a0<code class=\"ig ks kt ku kv b\">&lt;iterable&gt;<\/code>\u00a0each time through the loop.<\/p>\n<\/div>\n<\/div>\n<\/section>\n<div class=\"n p fc jk jl jm\" role=\"separator\"><\/div>\n<section class=\"df dg dh di dj\">\n<div class=\"n p\">\n<div class=\"ab ac ae af ag dk ai aj\">\n<h1 id=\"1c7e\" class=\"jr js dn ce jt ju jv iq jw jx jy it jz ka kb kc kd ke kf kg kh ki kj kk kl km ek\" data-selectable-paragraph=\"\">Changing the URL Parameter<\/h1>\n<p id=\"62b8\" class=\"im in dn io b em kn iq ir ep ko it iu iv kp ix iy iz kq jb jc jd kr jf jg jh df ek\" data-selectable-paragraph=\"\">As I mentioned earlier, each page\u2019s URL follows a certain logic as the web pages change. 
To request every page, we have to vary the value of the\u00a0<code class=\"ig ks kt ku kv b\">start<\/code>\u00a0parameter, like this:<\/p>\n<figure class=\"hp hq hr hs ht hf\">\n<div class=\"if s fh\">\n<div class=\"abh ii s\"><iframe loading=\"lazy\" class=\"t u v ic aj\" title=\"mult_page_loop.py\" src=\"https:\/\/medium.com\/media\/85b0e1f657ff62296b166537b1923d13\" width=\"680\" height=\"61\" frameborder=\"0\" scrolling=\"auto\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/div>\n<\/div>\n<\/figure>\n<p id=\"5c58\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\"><strong class=\"io jj\">Breaking down the URL parameters:<\/strong><\/p>\n<ul class=\"\">\n<li id=\"86ae\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh lq lj lk ek\" data-selectable-paragraph=\"\"><code class=\"ig ks kt ku kv b\"><strong class=\"io jj\">pages<\/strong><\/code>\u00a0is the variable that stores the array of\u00a0<code class=\"ig ks kt ku kv b\">start<\/code>\u00a0values our loop will iterate through<\/li>\n<li id=\"6bc9\" class=\"im in dn io b em ll iq ir ep lm it iu iv ln ix iy iz lo jb jc jd lp jf jg jh lq lj lk ek\" data-selectable-paragraph=\"\"><code class=\"ig ks kt ku kv b\"><strong class=\"io jj\">np.arange(1,1001,50)<\/strong><\/code>\u00a0calls a\u00a0<a class=\"ck ji\" href=\"https:\/\/byy3.com\/go\/?url=https:\/\/realpython.com\/how-to-use-numpy-arange\/\" target=\"_blank\" rel=\"noopener nofollow\" rel=\"nofollow\" >function<\/a>\u00a0from the\u00a0<a class=\"ck ji\" href=\"https:\/\/byy3.com\/go\/?url=https:\/\/numpy.org\/\" target=\"_blank\" rel=\"noopener nofollow\" rel=\"nofollow\" >NumPy<\/a>\u00a0Python library. It takes four arguments, but we\u2019re only using the first three:\u00a0<code class=\"ig ks kt ku kv b\">start<\/code>,\u00a0<code class=\"ig ks kt ku kv b\">stop<\/code>, and\u00a0<code class=\"ig ks kt ku kv b\">step<\/code>.<strong class=\"io jj\">\u00a0<\/strong><code class=\"ig ks kt ku kv 
b\">step<\/code>\u00a0is the number that defines the spacing between each. So: Start at\u00a0<code class=\"ig ks kt ku kv b\">1<\/code>, stop at\u00a0<code class=\"ig ks kt ku kv b\">1001<\/code>, and step by\u00a0<code class=\"ig ks kt ku kv b\">50<\/code>.<\/li>\n<\/ul>\n<p id=\"08f2\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\"><strong class=\"io jj\">Start at\u00a0<\/strong><code class=\"ig ks kt ku kv b\"><strong class=\"io jj\">1<\/strong><\/code><strong class=\"io jj\">:\u00a0<\/strong>This will be our first page\u2019s URL.<\/p>\n<p id=\"88a3\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\"><strong class=\"io jj\">Stop at\u00a0<\/strong><code class=\"ig ks kt ku kv b\"><strong class=\"io jj\">1001<\/strong><\/code><strong class=\"io jj\">:\u00a0<\/strong>Why stop at 1001? The number in the stop parameter is the number that defines the end of the array, but it isn\u2019t included in the array. The last page for movies would be at the URL number of 951. This page has movies 951-1000. 
If we used 951 as the stop value, the array would end at 901 and our scraper would never reach this page, so we have to go one step further to make sure we get the last page.<\/p>\n<p id=\"0468\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\"><strong class=\"io jj\">Step by\u00a0<\/strong><code class=\"ig ks kt ku kv b\"><strong class=\"io jj\">50<\/strong><\/code><strong class=\"io jj\">:\u00a0<\/strong>We want the URL number to change by 50 each time the loop comes around, and this parameter tells it to do that.<\/p>\n<\/div>\n<\/div>\n<\/section>\n<div class=\"n p fc jk jl jm\" role=\"separator\"><\/div>\n<section class=\"df dg dh di dj\">\n<div class=\"n p\">\n<div class=\"ab ac ae af ag dk ai aj\">\n<h1 id=\"bee6\" class=\"jr js dn ce jt ju jv iq jw jx jy it jz ka kb kc kd ke kf kg kh ki kj kk kl km ek\" data-selectable-paragraph=\"\">Looping Through Each Page<\/h1>\n<p id=\"9f97\" class=\"im in dn io b em kn iq ir ep ko it iu iv kp ix iy iz kq jb jc jd kr jf jg jh df ek\" data-selectable-paragraph=\"\">Now we need to create another\u00a0<code class=\"ig ks kt ku kv b\">for<\/code>\u00a0loop that\u2019ll run our scraper over the\u00a0<code class=\"ig ks kt ku kv b\">pages<\/code>\u00a0array we created above, once for each URL we need. 
We can do it like this:<\/p>\n<figure class=\"hp hq hr hs ht hf\">\n<div class=\"if s fh\">\n<div class=\"abh ii s\"><iframe loading=\"lazy\" class=\"t u v ic aj\" title=\"loop_two.py\" src=\"https:\/\/medium.com\/media\/15944599034d32b8852ddb75850f131f\" width=\"680\" height=\"61\" frameborder=\"0\" scrolling=\"auto\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/div>\n<\/div>\n<\/figure>\n<p id=\"5559\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\"><strong class=\"io jj\">Breaking this loop down:<\/strong><\/p>\n<ul class=\"\">\n<li id=\"aa4d\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh lq lj lk ek\" data-selectable-paragraph=\"\"><code class=\"ig ks kt ku kv b\"><strong class=\"io jj\">page<\/strong><\/code>\u00a0is the variable that takes each value in our\u00a0<code class=\"ig ks kt ku kv b\">pages<\/code>\u00a0array in turn<\/li>\n<li id=\"f734\" class=\"im in dn io b em ll iq ir ep lm it iu iv ln ix iy iz lo jb jc jd lp jf jg jh lq lj lk ek\" data-selectable-paragraph=\"\"><code class=\"ig ks kt ku kv b\"><strong class=\"io jj\">pages<\/strong><\/code>\u00a0is the array we created with\u00a0<code class=\"ig ks kt ku kv b\">np.arange(1,1001,50)<\/code><\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/section>\n<div class=\"n p fc jk jl jm\" role=\"separator\"><\/div>\n<section class=\"df dg dh di dj\">\n<div class=\"n p\">\n<div class=\"ab ac ae af ag dk ai aj\">\n<h1 id=\"e7ac\" class=\"jr js dn ce jt ju jv iq jw jx jy it jz ka kb kc kd ke kf kg kh ki kj kk kl km ek\" data-selectable-paragraph=\"\">Requesting the URL + \u2018html_soup\u2019 + \u2018movie_div\u2019<\/h1>\n<p id=\"4cba\" class=\"im in dn io b em kn iq ir ep ko it iu iv kp ix iy iz kq jb jc jd kr jf jg jh df ek\" data-selectable-paragraph=\"\">Inside this new loop is where we\u2019ll request our new URLs, add our\u00a0<code class=\"ig ks kt ku kv 
b\">html_soup<\/code>\u00a0(helps us parse the HTML files), and add our\u00a0<code class=\"ig ks kt ku kv b\">movie_div<\/code>\u00a0(stores each div container we\u2019re scraping). This is what it\u2019ll look like:<\/p>\n<figure class=\"hp hq hr hs ht hf\">\n<div class=\"if s fh\">\n<div class=\"abp ii s\"><iframe loading=\"lazy\" class=\"t u v ic aj\" title=\"newloop.py\" src=\"https:\/\/medium.com\/media\/a329ec4f0654e1a5a2195fef7bda8cdb\" width=\"680\" height=\"163\" frameborder=\"0\" scrolling=\"auto\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/div>\n<\/div>\n<\/figure>\n<p id=\"36e8\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\"><strong class=\"io jj\">Breaking\u00a0<\/strong><code class=\"ig ks kt ku kv b\"><strong class=\"io jj\">page<\/strong><\/code><strong class=\"io jj\">\u00a0down:<\/strong><\/p>\n<ul class=\"\">\n<li id=\"3f3e\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh lq lj lk ek\" data-selectable-paragraph=\"\"><code class=\"ig ks kt ku kv b\"><strong class=\"io jj\">page<\/strong><\/code>\u00a0is the variable we\u2019re using which stores each of our new URLs<\/li>\n<li id=\"421c\" class=\"im in dn io b em ll iq ir ep lm it iu iv ln ix iy iz lo jb jc jd lp jf jg jh lq lj lk ek\" data-selectable-paragraph=\"\"><code class=\"ig ks kt ku kv b\"><strong class=\"io jj\">requests.get()<\/strong><\/code>\u00a0is the method we use to grab the contents of each URL<\/li>\n<li id=\"ed8a\" class=\"im in dn io b em ll iq ir ep lm it iu iv ln ix iy iz lo jb jc jd lp jf jg jh lq lj lk ek\" data-selectable-paragraph=\"\"><code class=\"ig ks kt ku kv b\"><strong class=\"io jj\">\u201chttps:\/\/www.imdb.com\/search\/title\/?groups=top_1000&amp;start=\"<\/strong><\/code>\u00a0is the part of the URL that stays the same when we change each page<\/li>\n<li id=\"1e03\" class=\"im in dn io b em ll iq ir ep lm it iu iv ln ix iy iz lo jb jc jd lp 
jf jg jh lq lj lk ek\" data-selectable-paragraph=\"\"><code class=\"ig ks kt ku kv b\">+\u00a0<strong class=\"io jj\">str(page)<\/strong><\/code>\u00a0adds the current value of\u00a0<code class=\"ig ks kt ku kv b\">page<\/code>\u00a0(the loop variable that moves the URL from one page to the next) to the URL we request. Wrapping it in\u00a0<code class=\"ig ks kt ku kv b\">str()<\/code>\u00a0converts the number to a string, because the URL we\u2019re building has to be made of strings only.<\/li>\n<li id=\"b09e\" class=\"im in dn io b em ll iq ir ep lm it iu iv ln ix iy iz lo jb jc jd lp jf jg jh lq lj lk ek\" data-selectable-paragraph=\"\"><code class=\"ig ks kt ku kv b\">+<strong class=\"io jj\">\u00a0\"&amp;ref_=adv_nxt\"<\/strong><\/code>\u00a0is added to the end of every URL because this part also does not change when we go to the next page<\/li>\n<li id=\"cb9d\" class=\"im in dn io b em ll iq ir ep lm it iu iv ln ix iy iz lo jb jc jd lp jf jg jh lq lj lk ek\" data-selectable-paragraph=\"\"><code class=\"ig ks kt ku kv b\"><strong class=\"io jj\">headers=headers<\/strong><\/code>\u00a0tells our scraper to bring us English-translated content from the URLs we\u2019re requesting<\/li>\n<\/ul>\n<p id=\"85c0\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\"><strong class=\"io jj\">Breaking\u00a0<\/strong><code class=\"ig ks kt ku kv b\"><strong class=\"io jj\">soup<\/strong><\/code><strong class=\"io jj\">\u00a0down:<\/strong><\/p>\n<ul class=\"\">\n<li id=\"82e7\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh lq lj lk ek\" data-selectable-paragraph=\"\"><code class=\"ig ks kt ku kv b\"><strong class=\"io jj\">soup<\/strong><\/code><strong class=\"io jj\">\u00a0<\/strong>is the variable that stores the parsed page we get back from\u00a0<code class=\"ig ks kt ku kv b\">BeautifulSoup<\/code><\/li>\n<li id=\"b2a8\" class=\"im in dn io b em ll iq ir ep 
lm it iu iv ln ix iy iz lo jb jc jd lp jf jg jh lq lj lk ek\" data-selectable-paragraph=\"\"><code class=\"ig ks kt ku kv b\"><strong class=\"io jj\">BeautifulSoup<\/strong><\/code>\u00a0is the class we\u2019re using to parse the page into a structure we can search<\/li>\n<li id=\"742b\" class=\"im in dn io b em ll iq ir ep lm it iu iv ln ix iy iz lo jb jc jd lp jf jg jh lq lj lk ek\" data-selectable-paragraph=\"\"><code class=\"ig ks kt ku kv b\"><strong class=\"io jj\">(page.text, \"html.parser\")<\/strong><\/code>\u00a0grabs the text contents of\u00a0<code class=\"ig ks kt ku kv b\">page<\/code>\u00a0and uses the HTML parser, which allows Python to read the components of the page rather than treating it as one long string<\/li>\n<\/ul>\n<p id=\"007c\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\"><strong class=\"io jj\">Breaking\u00a0<\/strong><code class=\"ig ks kt ku kv b\"><strong class=\"io jj\">movie_div<\/strong><\/code><strong class=\"io jj\">\u00a0down:<\/strong><\/p>\n<ul class=\"\">\n<li id=\"1a81\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh lq lj lk ek\" data-selectable-paragraph=\"\"><code class=\"ig ks kt ku kv b\"><strong class=\"io jj\">movie_div<\/strong><\/code>\u00a0is the variable we use to store all of the\u00a0<code class=\"ig ks kt ku kv b\">div<\/code>\u00a0containers with a class of\u00a0<code class=\"ig ks kt ku kv b\">lister-item mode-advanced<\/code><\/li>\n<li id=\"0217\" class=\"im in dn io b em ll iq ir ep lm it iu iv ln ix iy iz lo jb jc jd lp jf jg jh lq lj lk ek\" data-selectable-paragraph=\"\">The\u00a0<code class=\"ig ks kt ku kv b\"><strong class=\"io jj\">find_all()<\/strong><\/code>\u00a0<a class=\"ck ji\" href=\"https:\/\/byy3.com\/go\/?url=https:\/\/www.crummy.com\/software\/BeautifulSoup\/bs4\/doc\/#find-all\" target=\"_blank\" rel=\"noopener nofollow\" rel=\"nofollow\" >method<\/a>\u00a0extracts all 
the\u00a0<code class=\"ig ks kt ku kv b\">div<\/code>\u00a0containers that have a\u00a0<code class=\"ig ks kt ku kv b\">class<\/code>\u00a0attribute of\u00a0<code class=\"ig ks kt ku kv b\">lister-item mode-advanced<\/code>\u00a0from what we\u2019ve stored in our variable\u00a0<code class=\"ig ks kt ku kv b\">soup<\/code><\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/section>\n<div class=\"n p fc jk jl jm\" role=\"separator\"><\/div>\n<section class=\"df dg dh di dj\">\n<div class=\"n p\">\n<div class=\"ab ac ae af ag dk ai aj\">\n<h1 id=\"28a9\" class=\"jr js dn ce jt ju jv iq jw jx jy it jz ka kb kc kd ke kf kg kh ki kj kk kl km ek\" data-selectable-paragraph=\"\">Controlling the Crawl Rate<\/h1>\n<p id=\"7ed2\" class=\"im in dn io b em kn iq ir ep ko it iu iv kp ix iy iz kq jb jc jd kr jf jg jh df ek\" data-selectable-paragraph=\"\">Controlling the crawl rate is beneficial for the scraper and for the website we\u2019re scraping. If we avoid hammering the server with a lot of requests all at once, then we\u2019re much less likely to get our IP address banned \u2014 and we also avoid disrupting the activity of the website we scrape by allowing the server to respond to other user requests as well.<\/p>\n<p id=\"b4a0\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">We\u2019ll be adding this code to our new\u00a0<code class=\"ig ks kt ku kv b\">for<\/code>\u00a0loop:<\/p>\n<figure class=\"hp hq hr hs ht hf\">\n<div class=\"if s fh\">\n<div class=\"abh ii s\"><iframe loading=\"lazy\" class=\"t u v ic aj\" title=\"sleep.py\" src=\"https:\/\/medium.com\/media\/64299fbbe98351f6519b78ada91ed255\" width=\"680\" height=\"61\" frameborder=\"0\" scrolling=\"auto\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/div>\n<\/div>\n<\/figure>\n<p id=\"c753\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\"><strong class=\"io jj\">Breaking 
crawl rate down:<\/strong><\/p>\n<ul class=\"\">\n<li id=\"ed47\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh lq lj lk ek\" data-selectable-paragraph=\"\">The\u00a0<code class=\"ig ks kt ku kv b\"><strong class=\"io jj\">sleep()<\/strong><\/code>\u00a0<a class=\"ck ji\" href=\"https:\/\/byy3.com\/go\/?url=https:\/\/docs.python.org\/3\/library\/time.html?highlight=time%20module#time.sleep\" target=\"_blank\" rel=\"noopener nofollow\" rel=\"nofollow\" >function<\/a>\u00a0controls the loop\u2019s rate by pausing the execution of the loop for a specified number of seconds<\/li>\n<li id=\"f6dc\" class=\"im in dn io b em ll iq ir ep lm it iu iv ln ix iy iz lo jb jc jd lp jf jg jh lq lj lk ek\" data-selectable-paragraph=\"\">The\u00a0<code class=\"ig ks kt ku kv b\"><strong class=\"io jj\">randint(2,10)<\/strong><\/code>\u00a0<a class=\"ck ji\" href=\"https:\/\/byy3.com\/go\/?url=https:\/\/docs.python.org\/3\/library\/random.html?highlight=random%20module#random.randint\" target=\"_blank\" rel=\"noopener nofollow\" rel=\"nofollow\" >function<\/a>\u00a0varies the waiting time between requests by picking a random whole number of seconds from 2 to 10, inclusive. You can change these bounds to any values you like.<\/li>\n<\/ul>\n<p id=\"508b\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">Please note that this will increase the time it takes to grab all the data we need from every page, so be patient. 
There are 20 pages with a maximum pause of 10 seconds per loop, so the pauses alone add up to at most 200 seconds (a little over three minutes), plus the time the requests themselves take.<\/p>\n<p id=\"a09f\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">It\u2019s very important to practice good scraping and to scrape responsibly!<\/p>\n<\/div>\n<\/div>\n<\/section>\n<div class=\"n p fc jk jl jm\" role=\"separator\"><\/div>\n<section class=\"df dg dh di dj\">\n<div class=\"n p\">\n<div class=\"ab ac ae af ag dk ai aj\">\n<h2 id=\"8989\" class=\"kw js dn ce jt kx ky eo jw kz la er jz es lb eu kd ev lc ex kh ey ld fa kl le ek\" data-selectable-paragraph=\"\">Our code should now look like this:<\/h2>\n<figure class=\"hp hq hr hs ht hf\">\n<div class=\"if s fh\">\n<div class=\"abk ii s\"><iframe loading=\"lazy\" class=\"t u v ic aj\" title=\"check.py\" src=\"https:\/\/medium.com\/media\/7752baeb40731a6737f2d534aa1a3186\" width=\"680\" height=\"696\" frameborder=\"0\" scrolling=\"auto\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/div>\n<\/div>\n<\/figure>\n<\/div>\n<\/div>\n<\/section>\n<div class=\"n p fc jk jl jm\" role=\"separator\"><\/div>\n<section class=\"df dg dh di dj\">\n<div class=\"n p\">\n<div class=\"ab ac ae af ag dk ai aj\">\n<h1 id=\"b79a\" class=\"jr js dn ce jt ju jv iq jw jx jy it jz ka kb kc kd ke kf kg kh ki kj kk kl km ek\" data-selectable-paragraph=\"\">Scraping Code<\/h1>\n<p id=\"5620\" class=\"im in dn io b em kn iq ir ep ko it iu iv kp ix iy iz kq jb jc jd kr jf jg jh df ek\" data-selectable-paragraph=\"\">We can add our scraping\u00a0<code class=\"ig ks kt ku kv b\">for<\/code>\u00a0loop code into our new\u00a0<code class=\"ig ks kt ku kv b\">for<\/code>\u00a0loop:<\/p>\n<figure class=\"hp hq hr hs ht hf\">\n<div class=\"if s fh\">\n<div class=\"abj ii s\"><iframe loading=\"lazy\" class=\"t u v ic aj\" title=\"scrapeloop.py\" src=\"https:\/\/medium.com\/media\/9c48782fbf2cb6ff1d6dca3cf3febdf2\" width=\"680\" 
height=\"589\" frameborder=\"0\" scrolling=\"auto\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/div>\n<\/div>\n<\/figure>\n<\/div>\n<\/div>\n<\/section>\n<div class=\"n p fc jk jl jm\" role=\"separator\"><\/div>\n<section class=\"df dg dh di dj\">\n<div class=\"n p\">\n<div class=\"ab ac ae af ag dk ai aj\">\n<h1 id=\"9ca3\" class=\"jr js dn ce jt ju jv iq jw jx jy it jz ka kb kc kd ke kf kg kh ki kj kk kl km ek\" data-selectable-paragraph=\"\">Pointing Out Previous Errors<\/h1>\n<p id=\"9f37\" class=\"im in dn io b em kn iq ir ep ko it iu iv kp ix iy iz kq jb jc jd kr jf jg jh df ek\" data-selectable-paragraph=\"\">I\u2019d like to point out a slight error I made in the\u00a0<a class=\"ck ji\" href=\"https:\/\/byy3.com\/go\/?url=https:\/\/medium.com\/better-programming\/the-only-step-by-step-guide-youll-need-to-build-a-web-scraper-with-python-e79066bd895a\" target=\"_blank\" rel=\"noopener\" rel=\"nofollow\" >previous article<\/a>\u00a0\u2014 a mistake I made regarding the cleaning of the\u00a0<code class=\"ig ks kt ku kv b\">metascore<\/code>\u00a0data.<\/p>\n<p id=\"228c\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">I received this DM from an awesome dev who was running through my article and coding along but with a different IMDb URL than the one I used to teach in the guide.<\/p>\n<figure class=\"hp hq hr hs ht hf cx cy paragraph-image\">\n<div class=\"cx cy mc\">\n<div class=\"if s fh ig\">\n<div class=\"md ii s\">\n<div class=\"ia ib t u v ic aj bk id ie\"><img loading=\"lazy\" decoding=\"async\" class=\"t u v ic aj ij ik aq xk\" data-original=\"https:\/\/miro.medium.com\/max\/60\/1*y7Eiu9x7-UCDHYobE8YDLw.png?q=20\" src=\"https:\/\/byy3.com\/wp-content\/themes\/MNews%20V2.4\/images\/post-loading.gif\" width=\"526\" height=\"96\" title=\"Scrape Multiple Pages of a Website Using a Python Web Scraper IMDb\u2019s Top\u63d2\u56fe7\" alt=\"Scrape Multiple Pages of a Website Using 
a Python Web Scraper IMDb\u2019s Top\u63d2\u56fe7\" \/><\/div>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"xd xe t u v ic aj c\" data-original=\"https:\/\/miro.medium.com\/max\/789\/1*y7Eiu9x7-UCDHYobE8YDLw.png\" src=\"https:\/\/byy3.com\/wp-content\/themes\/MNews%20V2.4\/images\/post-loading.gif\" sizes=\"auto, 526px\" srcset=\"https:\/\/miro.medium.com\/max\/414\/1*y7Eiu9x7-UCDHYobE8YDLw.png 276w, https:\/\/miro.medium.com\/max\/789\/1*y7Eiu9x7-UCDHYobE8YDLw.png 526w\" width=\"526\" height=\"96\" title=\"Scrape Multiple Pages of a Website Using a Python Web Scraper IMDb\u2019s Top\u63d2\u56fe8\" alt=\"Scrape Multiple Pages of a Website Using a Python Web Scraper IMDb\u2019s Top\u63d2\u56fe8\" \/><\/div>\n<\/div>\n<\/div>\n<\/figure>\n<p id=\"88e6\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">In the extracting\u00a0<code class=\"ig ks kt ku kv b\">metascore<\/code>\u00a0data code, we wrote this:<\/p>\n<figure class=\"hp hq hr hs ht hf\">\n<div class=\"if s fh\">\n<div class=\"abi ii s\"><iframe loading=\"lazy\" class=\"t u v ic aj\" title=\"metascore.py\" src=\"https:\/\/medium.com\/media\/defc8924344e8187caf4badfced6c00a\" width=\"680\" height=\"99\" frameborder=\"0\" scrolling=\"auto\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/div>\n<\/div>\n<\/figure>\n<p id=\"98a6\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">This extraction code says if there is Metascore data there, grab it \u2014 but if the data is missing, then put a dash there and continue.<\/p>\n<p id=\"7ec0\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">In the cleaning of the\u00a0<code class=\"ig ks kt ku kv b\">metascore<\/code>\u00a0data code, we wrote this:<\/p>\n<figure class=\"hp hq hr hs ht hf\">\n<div class=\"if s fh\">\n<div class=\"abh ii 
s\"><iframe loading=\"lazy\" class=\"t u v ic aj\" title=\"cleaningmetascore.py\" src=\"https:\/\/medium.com\/media\/fc48d369d7a367c16b9defe1e9a7bf9d\" width=\"680\" height=\"61\" frameborder=\"0\" scrolling=\"auto\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/div>\n<\/div>\n<\/figure>\n<p id=\"e484\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">This cleaning code converts the column to an integer data type, which worked for the URL I scraped because I didn\u2019t have any missing Metascore data (that is, no dashes in place of missing data).<\/p>\n<p id=\"803b\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">What I failed to notice is that someone scraping a different IMDb page than I did might have missing\u00a0<code class=\"ig ks kt ku kv b\">metascore<\/code>\u00a0data there, and once we scrape multiple pages in this guide, we\u2019ll have missing\u00a0<code class=\"ig ks kt ku kv b\">metascore<\/code>\u00a0data as well.<\/p>\n<h2 id=\"fdab\" class=\"kw js dn ce jt kx ky eo jw kz la er jz es lb eu kd ev lc ex kh ey ld fa kl le ek\" data-selectable-paragraph=\"\">What does this mean?<\/h2>\n<p id=\"7759\" class=\"im in dn io b em kn iq ir ep ko it iu iv kp ix iy iz kq jb jc jd kr jf jg jh df ek\" data-selectable-paragraph=\"\">It means that when we do get those dashes in place of missing data, we can\u2019t use\u00a0<code class=\"ig ks kt ku kv b\">.astype(int)<\/code>\u00a0to convert the entire\u00a0<code class=\"ig ks kt ku kv b\">metascore<\/code>\u00a0column into integers as I did previously, because this would produce an error. 
We\u2019d need to turn our\u00a0<code class=\"ig ks kt ku kv b\">metascore<\/code>\u00a0data into a float data type (decimal).<\/p>\n<\/div>\n<\/div>\n<\/section>\n<div class=\"n p fc jk jl jm\" role=\"separator\"><\/div>\n<section class=\"df dg dh di dj\">\n<div class=\"n p\">\n<div class=\"ab ac ae af ag dk ai aj\">\n<h1 id=\"22c5\" class=\"jr js dn ce jt ju jv iq jw jx jy it jz ka kb kc kd ke kf kg kh ki kj kk kl km ek\" data-selectable-paragraph=\"\">Fixing the Cleaning of the\u00a0<code class=\"ig ks kt ku kv b\">Metascore<\/code>\u00a0Data Code<\/h1>\n<p id=\"4986\" class=\"im in dn io b em kn iq ir ep ko it iu iv kp ix iy iz kq jb jc jd kr jf jg jh df ek\" data-selectable-paragraph=\"\">Instead of this\u00a0<code class=\"ig ks kt ku kv b\">metascore<\/code>\u00a0data cleaning code:<\/p>\n<figure class=\"hp hq hr hs ht hf\">\n<div class=\"if s fh\">\n<div class=\"abh ii s\"><iframe loading=\"lazy\" class=\"t u v ic aj\" title=\"cleaningmetascore.py\" src=\"https:\/\/medium.com\/media\/fc48d369d7a367c16b9defe1e9a7bf9d\" width=\"680\" height=\"61\" frameborder=\"0\" scrolling=\"auto\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/div>\n<\/div>\n<\/figure>\n<p id=\"0f89\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">We\u2019ll use this:<\/p>\n<figure class=\"hp hq hr hs ht hf\">\n<div class=\"if s fh\">\n<div class=\"abl ii s\"><iframe loading=\"lazy\" class=\"t u v ic aj\" title=\"cleanmetascore.py\" src=\"https:\/\/medium.com\/media\/d557d108cab93d987ec9b248fd8d23db\" width=\"680\" height=\"103\" frameborder=\"0\" scrolling=\"auto\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/div>\n<\/div>\n<\/figure>\n<p id=\"2d04\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\"><strong class=\"io jj\">Breaking down the new cleaning of the Metascore data:<\/strong><\/p>\n<p id=\"3283\" class=\"im in dn io b em ip 
iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\"><em class=\"lv\">Top-cleaning code:<\/em><\/p>\n<ul class=\"\">\n<li id=\"5e0b\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh lq lj lk ek\" data-selectable-paragraph=\"\"><code class=\"ig ks kt ku kv b\"><strong class=\"io jj\">movies['metascore']<\/strong><\/code>\u00a0is our Metascore data in our movies\u00a0<code class=\"ig ks kt ku kv b\">DataFrame<\/code>. We\u2019ll be assigning our new cleaned-up data to our\u00a0<code class=\"ig ks kt ku kv b\">metascore<\/code>\u00a0column.<\/li>\n<li id=\"219c\" class=\"im in dn io b em ll iq ir ep lm it iu iv ln ix iy iz lo jb jc jd lp jf jg jh lq lj lk ek\" data-selectable-paragraph=\"\"><code class=\"ig ks kt ku kv b\"><strong class=\"io jj\">movies['metascore']<\/strong><\/code>\u00a0tells pandas to go to the column\u00a0<code class=\"ig ks kt ku kv b\">metascore<\/code>\u00a0in our\u00a0<code class=\"ig ks kt ku kv b\">DataFrame<\/code><\/li>\n<li id=\"690b\" class=\"im in dn io b em ll iq ir ep lm it iu iv ln ix iy iz lo jb jc jd lp jf jg jh lq lj lk ek\" data-selectable-paragraph=\"\"><code class=\"ig ks kt ku kv b\"><strong class=\"io jj\">.str.extract('(\\d+)')<\/strong><\/code>\u00a0extracts the first run of digits from each string. The pattern\u00a0<code class=\"ig ks kt ku kv b\">'(\\d+)'<\/code>\u00a0matches one or more digits, so a value with no digits, such as a dash, becomes NaN.<\/li>\n<\/ul>\n<p id=\"680b\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\"><em class=\"lv\">Bottom-conversion code:<\/em><\/p>\n<ul class=\"\">\n<li id=\"466c\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh lq lj lk ek\" data-selectable-paragraph=\"\"><code class=\"ig ks kt ku kv b\"><strong class=\"io jj\">movies['metascore']<\/strong><\/code>\u00a0is stripped of the 
elements we don\u2019t need, and now we\u2019ll assign the converted data back to it to finish up<\/li>\n<li id=\"91cf\" class=\"im in dn io b em ll iq ir ep lm it iu iv ln ix iy iz lo jb jc jd lp jf jg jh lq lj lk ek\" data-selectable-paragraph=\"\"><code class=\"ig ks kt ku kv b\"><strong class=\"io jj\">pd.to_numeric<\/strong><\/code>\u00a0is a\u00a0<a class=\"ck ji\" href=\"https:\/\/byy3.com\/go\/?url=https:\/\/pandas.pydata.org\/pandas-docs\/stable\/reference\/api\/pandas.to_numeric.html\" target=\"_blank\" rel=\"noopener nofollow\" rel=\"nofollow\" >function<\/a>\u00a0we use to change this column to a float. We use it because we have a lot of dashes in this column, and we can\u2019t just convert it to a float using\u00a0<code class=\"ig ks kt ku kv b\">.astype(float)<\/code>, which would raise an error.<\/li>\n<li id=\"0e2c\" class=\"im in dn io b em ll iq ir ep lm it iu iv ln ix iy iz lo jb jc jd lp jf jg jh lq lj lk ek\" data-selectable-paragraph=\"\"><code class=\"ig ks kt ku kv b\"><strong class=\"io jj\">errors='coerce'<\/strong><\/code>\u00a0transforms any nonnumeric values, such as the dashes standing in for missing data, into not-a-number (NaN) values instead of raising an error.<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/section>\n<div class=\"n p fc jk jl jm\" role=\"separator\"><\/div>\n<section class=\"df dg dh di dj\">\n<div class=\"n p\">\n<div class=\"ab ac ae af ag dk ai aj\">\n<h1 id=\"feea\" class=\"jr js dn ce jt ju jv iq jw jx jy it jz ka kb kc kd ke kf kg kh ki kj kk kl km ek\" data-selectable-paragraph=\"\">Add the DataFrame and Cleaning Code<\/h1>\n<p id=\"7db5\" class=\"im in dn io b em kn iq ir ep ko it iu iv kp ix iy iz kq jb jc jd kr jf jg jh df ek\" data-selectable-paragraph=\"\">Let\u2019s add our\u00a0<code class=\"ig ks kt ku kv b\">DataFrame<\/code>\u00a0and cleaning code to our new scraper, which will go below our loops. 
If you have any questions regarding how this code works, go to the\u00a0<a class=\"ck ji\" href=\"https:\/\/byy3.com\/go\/?url=https:\/\/medium.com\/better-programming\/the-only-step-by-step-guide-youll-need-to-build-a-web-scraper-with-python-e79066bd895a\" target=\"_blank\" rel=\"noopener\" rel=\"nofollow\" >first article<\/a>\u00a0to see what each line executes.<\/p>\n<p id=\"d176\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">The code should look like this:<\/p>\n<figure class=\"hp hq hr hs ht hf\">\n<div class=\"if s fh\">\n<div class=\"abm ii s\"><iframe loading=\"lazy\" class=\"t u v ic aj\" title=\"finalclean.py\" src=\"https:\/\/medium.com\/media\/e0b41400b1d260454785d84180f0f8d2\" width=\"680\" height=\"504\" frameborder=\"0\" scrolling=\"auto\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/div>\n<\/div>\n<\/figure>\n<\/div>\n<\/div>\n<\/section>\n<div class=\"n p fc jk jl jm\" role=\"separator\"><\/div>\n<section class=\"df dg dh di dj\">\n<div class=\"n p\">\n<div class=\"ab ac ae af ag dk ai aj\">\n<h1 id=\"a79e\" class=\"jr js dn ce jt ju jv iq jw jx jy it jz ka kb kc kd ke kf kg kh ki kj kk kl km ek\" data-selectable-paragraph=\"\">Save to CSV<\/h1>\n<p id=\"120e\" class=\"im in dn io b em kn iq ir ep ko it iu iv kp ix iy iz kq jb jc jd kr jf jg jh df ek\" data-selectable-paragraph=\"\">We have all the elements of our scraper ready \u2014 now it\u2019s time to save all the data we\u2019re about to scrape into our CSV.<\/p>\n<p id=\"9512\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">Below is the code you can add to the bottom of your program to save your data to a CSV file:<\/p>\n<figure class=\"hp hq hr hs ht hf\">\n<div class=\"if s fh\">\n<div class=\"abh ii s\"><iframe loading=\"lazy\" class=\"t u v ic aj\" title=\"to_csv.py\" 
src=\"https:\/\/medium.com\/media\/fc8b8c9ce40a78b33342536d47700e64\" width=\"680\" height=\"61\" frameborder=\"0\" scrolling=\"auto\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/div>\n<\/div>\n<\/figure>\n<p id=\"4560\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">In case you need a refresher, if you\u2019re in Repl.it, you can create an empty CSV file by hovering near \u201cFiles\u201d and clicking the \u201cAdd file\u201d option. Name it, and save it with a\u00a0<code class=\"ig ks kt ku kv b\">.csv<\/code>\u00a0extension. Then, add the code to the end of your program:<\/p>\n<p id=\"db22\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\"><code class=\"ig ks kt ku kv b\">movies.to_csv(\u2018the_name_of_your_csv_here.csv\u2019)<\/code><\/p>\n<p id=\"d356\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">If we run and save our\u00a0<code class=\"ig ks kt ku kv b\">.csv<\/code>, we should get a file with a list of movies and all the data from 0-999:<\/p>\n<figure class=\"hp hq hr hs ht hf cx cy paragraph-image\">\n<div class=\"cx cy me\">\n<div class=\"if s fh ig\">\n<div class=\"mf ii s\">\n<div class=\"ia ib t u v ic aj bk id ie\"><img loading=\"lazy\" decoding=\"async\" class=\"t u v ic aj ij ik aq xk\" data-original=\"https:\/\/miro.medium.com\/max\/56\/1*PJ6-gXGMdHJmbi9jUChn3g.png?q=20\" src=\"https:\/\/byy3.com\/wp-content\/themes\/MNews%20V2.4\/images\/post-loading.gif\" width=\"627\" height=\"679\" title=\"Scrape Multiple Pages of a Website Using a Python Web Scraper IMDb\u2019s Top\u63d2\u56fe9\" alt=\"Scrape Multiple Pages of a Website Using a Python Web Scraper IMDb\u2019s Top\u63d2\u56fe9\" \/><\/div>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"xd xe t u v ic aj c\" 
data-original=\"https:\/\/miro.medium.com\/max\/941\/1*PJ6-gXGMdHJmbi9jUChn3g.png\" src=\"https:\/\/byy3.com\/wp-content\/themes\/MNews%20V2.4\/images\/post-loading.gif\" sizes=\"auto, 627px\" srcset=\"https:\/\/miro.medium.com\/max\/414\/1*PJ6-gXGMdHJmbi9jUChn3g.png 276w, https:\/\/miro.medium.com\/max\/828\/1*PJ6-gXGMdHJmbi9jUChn3g.png 552w, https:\/\/miro.medium.com\/max\/941\/1*PJ6-gXGMdHJmbi9jUChn3g.png 627w\" width=\"627\" height=\"679\" title=\"Scrape Multiple Pages of a Website Using a Python Web Scraper IMDb\u2019s Top\u63d2\u56fe10\" alt=\"Scrape Multiple Pages of a Website Using a Python Web Scraper IMDb\u2019s Top\u63d2\u56fe10\" \/><\/div>\n<\/div>\n<\/div>\n<\/figure>\n<\/div>\n<\/div>\n<\/section>\n<div class=\"n p fc jk jl jm\" role=\"separator\"><\/div>\n<section class=\"df dg dh di dj\">\n<div class=\"n p\">\n<div class=\"ab ac ae af ag dk ai aj\">\n<h1 id=\"e18f\" class=\"jr js dn ce jt ju jv iq jw jx jy it jz ka kb kc kd ke kf kg kh ki kj kk kl km ek\" data-selectable-paragraph=\"\">Basic Data-Quality Best Practices (Optional)<\/h1>\n<p id=\"1c27\" class=\"im in dn io b em kn iq ir ep ko it iu iv kp ix iy iz kq jb jc jd kr jf jg jh df ek\" data-selectable-paragraph=\"\">Here, I\u2019ll discuss some basic data-quality tricks you can use when cleaning your data. You don\u2019t need to apply any of this to our final scraper.<\/p>\n<p id=\"ae87\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">Usually, a dataset with a lot of missing data isn\u2019t a good dataset at all. 
Below are ways we can look up, manipulate, and change our data \u2014 for future reference.<\/p>\n<h2 id=\"dafb\" class=\"kw js dn ce jt kx ky eo jw kz la er jz es lb eu kd ev lc ex kh ey ld fa kl le ek\" data-selectable-paragraph=\"\">Missing data<\/h2>\n<p id=\"5fb4\" class=\"im in dn io b em kn iq ir ep ko it iu iv kp ix iy iz kq jb jc jd kr jf jg jh df ek\" data-selectable-paragraph=\"\">One of the most common problems in a dataset is missing data. In our case, the data wasn\u2019t available. There are a couple of ways to check and deal with missing data:<\/p>\n<ul class=\"\">\n<li id=\"665a\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh lq lj lk ek\" data-selectable-paragraph=\"\">Check where we\u2019re missing data and how much is missing<\/li>\n<li id=\"88d3\" class=\"im in dn io b em ll iq ir ep lm it iu iv ln ix iy iz lo jb jc jd lp jf jg jh lq lj lk ek\" data-selectable-paragraph=\"\">Add in a default value for the missing data<\/li>\n<li id=\"ef55\" class=\"im in dn io b em ll iq ir ep lm it iu iv ln ix iy iz lo jb jc jd lp jf jg jh lq lj lk ek\" data-selectable-paragraph=\"\">Delete the rows that have missing data<\/li>\n<li id=\"3594\" class=\"im in dn io b em ll iq ir ep lm it iu iv ln ix iy iz lo jb jc jd lp jf jg jh lq lj lk ek\" data-selectable-paragraph=\"\">Delete the columns that have a high incidence of missing data<\/li>\n<\/ul>\n<p id=\"e68e\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">We\u2019ll go through each of these in turn.<\/p>\n<p id=\"899d\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\"><strong class=\"io jj\">Check missing data:<\/strong><\/p>\n<p id=\"8edb\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">We can easily check for missing data like 
this:<\/p>\n<figure class=\"hp hq hr hs ht hf\">\n<div class=\"if s fh\">\n<div class=\"abh ii s\"><iframe loading=\"lazy\" class=\"t u v ic aj\" title=\"is_null.py\" src=\"https:\/\/medium.com\/media\/17adc46875e25998ea724a6a18778e4d\" width=\"680\" height=\"61\" frameborder=\"0\" scrolling=\"auto\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/div>\n<\/div>\n<\/figure>\n<p id=\"0ff0\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">The output:<\/p>\n<figure class=\"hp hq hr hs ht hf cx cy paragraph-image\">\n<div class=\"cx cy mg\">\n<div class=\"if s fh ig\">\n<div class=\"mh ii s\">\n<div class=\"ia ib t u v ic aj bk id ie\"><img loading=\"lazy\" decoding=\"async\" class=\"t u v ic aj ij ik aq xk\" data-original=\"https:\/\/miro.medium.com\/max\/60\/1*CRIIqciwjnFtZCN9xdehaQ.png?q=20\" src=\"https:\/\/byy3.com\/wp-content\/themes\/MNews%20V2.4\/images\/post-loading.gif\" width=\"325\" height=\"204\" title=\"Scrape Multiple Pages of a Website Using a Python Web Scraper IMDb\u2019s Top\u63d2\u56fe11\" alt=\"Scrape Multiple Pages of a Website Using a Python Web Scraper IMDb\u2019s Top\u63d2\u56fe11\" \/><\/div>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"xd xe t u v ic aj c\" data-original=\"https:\/\/miro.medium.com\/max\/488\/1*CRIIqciwjnFtZCN9xdehaQ.png\" src=\"https:\/\/byy3.com\/wp-content\/themes\/MNews%20V2.4\/images\/post-loading.gif\" sizes=\"auto, 325px\" srcset=\"https:\/\/miro.medium.com\/max\/414\/1*CRIIqciwjnFtZCN9xdehaQ.png 276w, https:\/\/miro.medium.com\/max\/488\/1*CRIIqciwjnFtZCN9xdehaQ.png 325w\" width=\"325\" height=\"204\" title=\"Scrape Multiple Pages of a Website Using a Python Web Scraper IMDb\u2019s Top\u63d2\u56fe12\" alt=\"Scrape Multiple Pages of a Website Using a Python Web Scraper IMDb\u2019s Top\u63d2\u56fe12\" \/><\/div>\n<\/div>\n<\/div>\n<\/figure>\n<p id=\"8ed2\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg 
jh df ek\" data-selectable-paragraph=\"\">This shows us\u00a0<em class=\"lv\">where<\/em>\u00a0the data is missing and\u00a0<em class=\"lv\">how much<\/em>\u00a0data is missing. We have 165 missing values in\u00a0<code class=\"ig ks kt ku kv b\">metascore<\/code>\u00a0and 161 in\u00a0<code class=\"ig ks kt ku kv b\">us_grossMillions<\/code>, for a total of 326 missing values in our dataset.<\/p>\n<p id=\"3e52\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\"><strong class=\"io jj\">Add default value for missing data:<\/strong><\/p>\n<p id=\"fd1e\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">If you want to change your NaN values to something specific, you can do so like this:<\/p>\n<figure class=\"hp hq hr hs ht hf\">\n<div class=\"if s fh\">\n<div class=\"abn ii s\"><iframe loading=\"lazy\" class=\"t u v ic aj\" title=\"fillna.py\" src=\"https:\/\/medium.com\/media\/c97d4a72974f23019eec0dec1eae5d01\" width=\"680\" height=\"146\" frameborder=\"0\" scrolling=\"auto\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/div>\n<\/div>\n<\/figure>\n<p id=\"3be7\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">For this example, I want the words\u00a0<code class=\"ig ks kt ku kv b\">\"None Given\"<\/code>\u00a0in place of\u00a0<code class=\"ig ks kt ku kv b\">metascore<\/code>\u00a0NaN values and an empty string in place of\u00a0<code class=\"ig ks kt ku kv b\">us_grossMillions<\/code>\u00a0NaN values.<\/p>\n<p id=\"9f6d\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">If you print those columns, you can see our NaN values have been changed as specified:<\/p>\n<figure class=\"hp hq hr hs ht hf cx cy paragraph-image\">\n<div class=\"cx cy 
mi\">\n<div class=\"if s fh ig\">\n<div class=\"mj ii s\">\n<div class=\"ia ib t u v ic aj bk id ie\"><img loading=\"lazy\" decoding=\"async\" class=\"t u v ic aj ij ik aq xk\" data-original=\"https:\/\/miro.medium.com\/max\/60\/1*AxIfSR69nEIkNfeziiqIKA.png?q=20\" src=\"https:\/\/byy3.com\/wp-content\/themes\/MNews%20V2.4\/images\/post-loading.gif\" width=\"506\" height=\"511\" title=\"Scrape Multiple Pages of a Website Using a Python Web Scraper IMDb\u2019s Top\u63d2\u56fe13\" alt=\"Scrape Multiple Pages of a Website Using a Python Web Scraper IMDb\u2019s Top\u63d2\u56fe13\" \/><\/div>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"xd xe t u v ic aj c\" data-original=\"https:\/\/miro.medium.com\/max\/759\/1*AxIfSR69nEIkNfeziiqIKA.png\" src=\"https:\/\/byy3.com\/wp-content\/themes\/MNews%20V2.4\/images\/post-loading.gif\" sizes=\"auto, 506px\" srcset=\"https:\/\/miro.medium.com\/max\/414\/1*AxIfSR69nEIkNfeziiqIKA.png 276w, https:\/\/miro.medium.com\/max\/759\/1*AxIfSR69nEIkNfeziiqIKA.png 506w\" width=\"506\" height=\"511\" title=\"Scrape Multiple Pages of a Website Using a Python Web Scraper IMDb\u2019s Top\u63d2\u56fe14\" alt=\"Scrape Multiple Pages of a Website Using a Python Web Scraper IMDb\u2019s Top\u63d2\u56fe14\" \/><\/div>\n<\/div>\n<\/div>\n<\/figure>\n<p id=\"a46b\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\"><strong class=\"io jj\">Beware:<\/strong>\u00a0Our\u00a0<code class=\"ig ks kt ku kv b\">metascore<\/code>\u00a0and\u00a0<code class=\"ig ks kt ku kv b\">us_grossMillions<\/code>\u00a0columns were both\u00a0<code class=\"ig ks kt ku kv b\">float<\/code>\u00a0columns prior to this change, and you can see how they\u2019re both\u00a0<code class=\"ig ks kt ku kv b\">object<\/code>\u00a0columns now because of the change. 
Be careful when changing your data, and always check your data types after making any alterations.<\/p>\n<p id=\"c8f9\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\"><strong class=\"io jj\">Delete rows with missing data:<\/strong><\/p>\n<p id=\"59ed\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">Sometimes, when a lot of data is missing, the best route is to remove the affected rows altogether. We can do this in a couple of different ways:<\/p>\n<figure class=\"hp hq hr hs ht hf\">\n<div class=\"if s fh\">\n<div class=\"abn ii s\"><iframe loading=\"lazy\" class=\"t u v ic aj\" title=\"remove.py\" src=\"https:\/\/medium.com\/media\/8069ef58d435d734e7522e7743809ef3\" width=\"680\" height=\"146\" frameborder=\"0\" scrolling=\"auto\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/div>\n<\/div>\n<\/figure>\n<p id=\"da46\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\"><strong class=\"io jj\">Delete columns with missing data:<\/strong><\/p>\n<p id=\"aec3\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">Sometimes, when a column has too many missing values, it\u2019s best to drop the whole column. 
We can do so like this:<\/p>\n<figure class=\"hp hq hr hs ht hf\">\n<div class=\"if s fh\">\n<div class=\"abl ii s\"><iframe loading=\"lazy\" class=\"t u v ic aj\" title=\"dropcolumns.py\" src=\"https:\/\/medium.com\/media\/5874ee32ba58cdf715cfba98b5050e5a\" width=\"680\" height=\"103\" frameborder=\"0\" scrolling=\"auto\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/div>\n<\/div>\n<\/figure>\n<ul class=\"\">\n<li id=\"3049\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh lq lj lk ek\" data-selectable-paragraph=\"\"><code class=\"ig ks kt ku kv b\"><strong class=\"io jj\">axis=1<\/strong><\/code>\u00a0is the parameter we use; it means we operate on columns, not rows.\u00a0<code class=\"ig ks kt ku kv b\">axis=0<\/code>\u00a0means rows. We could\u2019ve used this parameter in our delete-rows section, but the default is already\u00a0<code class=\"ig ks kt ku kv b\">0<\/code>, so I didn\u2019t use it.<\/li>\n<li id=\"92e3\" class=\"im in dn io b em ll iq ir ep lm it iu iv ln ix iy iz lo jb jc jd lp jf jg jh lq lj lk ek\" data-selectable-paragraph=\"\"><code class=\"ig ks kt ku kv b\"><strong class=\"io jj\">how=\u2018any\u2019<\/strong><\/code><strong class=\"io jj\">\u00a0<\/strong>means<strong class=\"io jj\">\u00a0<\/strong>that a column is dropped if it contains any\u00a0<code class=\"ig ks kt ku kv b\">NA<\/code>\u00a0values.<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/section>\n<div class=\"n p fc jk jl jm\" role=\"separator\"><\/div>\n<section class=\"df dg dh di dj\">\n<div class=\"n p\">\n<div class=\"ab ac ae af ag dk ai aj\">\n<h1 id=\"3354\" class=\"jr js dn ce jt ju jv iq jw jx jy it jz ka kb kc kd ke kf kg kh ki kj kk kl km ek\" data-selectable-paragraph=\"\">The Final Code<\/h1>\n<figure class=\"hp hq hr hs ht hf\">\n<div class=\"if s fh\">\n<div class=\"abo ii s\"><iframe loading=\"lazy\" class=\"t u v ic aj\" title=\"finalscraper.py\" src=\"https:\/\/medium.com\/media\/b0173513a78c9836ac474ba1403012bf\" width=\"680\" 
height=\"1976\" frameborder=\"0\" scrolling=\"auto\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/div>\n<\/div>\n<\/figure>\n<\/div>\n<\/div>\n<\/section>\n<div class=\"n p fc jk jl jm\" role=\"separator\"><\/div>\n<section class=\"df dg dh di dj\">\n<div class=\"n p\">\n<div class=\"ab ac ae af ag dk ai aj\">\n<h1 id=\"da28\" class=\"jr js dn ce jt ju jv iq jw jx jy it jz ka kb kc kd ke kf kg kh ki kj kk kl km ek\" data-selectable-paragraph=\"\">Conclusion<\/h1>\n<p id=\"0d21\" class=\"im in dn io b em kn iq ir ep ko it iu iv kp ix iy iz kq jb jc jd kr jf jg jh df ek\" data-selectable-paragraph=\"\">There you have it! We\u2019ve successfully extracted data for IMDb\u2019s top 1,000 movies of all time, which spanned multiple pages, and saved it to a CSV file.<\/p>\n<p id=\"1c69\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">I hope you enjoyed building a Python scraper. If you followed along, let me know how it went.<\/p>\n<p id=\"5df4\" class=\"im in dn io b em ip iq ir ep is it iu iv iw ix iy iz ja jb jc jd je jf jg jh df ek\" data-selectable-paragraph=\"\">Happy coding!<\/p>\n<\/div>\n<\/div>\n<\/section>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/figure>\n<\/div>\n<\/div>\n<\/div>\n<\/section>\n","protected":false},"excerpt":{"rendered":"<p>This is the second article of my web scraping guide.\u00a0In 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[33,402,352,61,51,400,401],"class_list":["post-849","post","type-post","status-publish","format-standard","hentry","category-python","tag-python","tag-python-scrapy","tag-scrape","tag-scrapy"],"_links":{"self":[{"href":"https:\/\/byy3.com\/index.php?rest_route=\/wp\/v2\/posts\/849","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/byy3.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/byy3.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/byy3.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/byy3.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=849"}],"version-history":[{"count":0,"href":"https:\/\/byy3.com\/index.php?rest_route=\/wp\/v2\/posts\/849\/revisions"}],"wp:attachment":[{"href":"https:\/\/byy3.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=849"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/byy3.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=849"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/byy3.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=849"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}