Product Analysis Using Web Scraping Techniques in Python

admin python January 28, 2021


Web scraping is a data scraping technique in which data is extracted from websites for analysis. In this project we will learn how to analyze a product in an online shop such as Flipkart. As an example, we will analyze the various brands of tablets sold on the Flipkart website and suggest a mid-range product by price.

Using web scraping techniques, we can get the prices, specifications, reviews, highlights and ratings for any product on the website.

We will use the urlopen function (from the urllib.request module) and the BeautifulSoup class (from the bs4 library).

import bs4
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

First, we read the contents of the web page that displays the search results for the product we are going to analyze (tablets), and parse the HTML content with the BeautifulSoup module.

myurl = "https://www.flipkart.com/tablets/pr?sid=tyy,hry&marketplace=FLIPKART"
uclient = uReq(myurl)
page_html = uclient.read()
uclient.close()
psoup = soup(page_html, "html.parser")
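
Some sites may reject requests that carry urllib's default User-Agent, and a request without a timeout can hang indefinitely. The following is a minimal sketch of a fetch helper that addresses both. The names build_request and fetch_html, the User-Agent string and the 10-second timeout are illustrative assumptions, not part of the original script.

```python
from urllib.request import Request, urlopen

# Hypothetical User-Agent string; any descriptive value can be used.
USER_AGENT = "Mozilla/5.0 (compatible; product-analysis-tutorial)"


def build_request(url):
    """Return a Request object that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes (performs a network call)."""
    # The context manager closes the connection even if read() raises.
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


# Building the request does not touch the network, so it is safe to inspect.
req = build_request(
    "https://www.flipkart.com/tablets/pr?sid=tyy,hry&marketplace=FLIPKART"
)
print(req.full_url)
# urllib stores header names with only the first letter capitalized.
print(req.get_header("User-agent"))
```

With such a helper, the four fetch lines above would reduce to page_html = fetch_html(myurl), and the result can be passed to BeautifulSoup unchanged.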

Below is a screenshot of the web page from which we will get the data for analysis. Right-click on the page and choose Inspect; the browser will show the HTML tags for each element on the page.

Right-click on the page index (pagination) bar, select the Inspect option, and find the class name of that element. We use this class name to get the href of the links to the search result pages for all the listed products, as shown in the picture below.
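
To illustrate how a class name found through Inspect maps to a BeautifulSoup lookup, here is a small sketch that runs on a made-up HTML fragment instead of the live site. The class names "pager" and "page-link" and the hrefs are invented for illustration; the real Flipkart class names are generated and may change over time.

```python
from bs4 import BeautifulSoup

# A made-up fragment that mimics a pagination bar.
html = """
<div class="pager">
  <a class="page-link" href="/tablets/pr?page=1">1</a>
  <a class="page-link" href="/tablets/pr?page=2">2</a>
  <a class="other" href="/help">Help</a>
</div>
"""

doc = BeautifulSoup(html, "html.parser")

hrefs = []
# First find the container by its class, then the links inside it by theirs.
for container in doc.find_all("div", {"class": "pager"}):
    for a in container.find_all("a", {"class": "page-link"}):
        hrefs.append(a["href"])

print(hrefs)
```

Only the two links with the "page-link" class are collected; the "Help" link is skipped because its class does not match.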

Now find all the page references and store them in a list, as in the code below.

page_urls = list()
for containers in psoup.findAll('div', {'class': '_2MImiq'}):
    a_list = containers.findAll('a', {'class': 'ge-49M'})
    # uncomment and use the below 2 lines to get all 10 search pages
    # for a in a_list:
    #     page_urls.append('https://www.flipkart.com' + a['href'])
    # use the below 2 lines to get a single search page
    a = a_list[0]
    page_urls.append('https://www.flipkart.com' + a['href'])
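
String concatenation works when every href is a site-relative path. As a hedged alternative, the standard library function urljoin also handles hrefs that are already absolute, and duplicates can be dropped at the same time. The helper name absolute_urls and the sample hrefs are assumptions made for this sketch.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.flipkart.com"


def absolute_urls(hrefs, base=BASE_URL):
    """Resolve hrefs against the base URL, dropping duplicates but keeping order."""
    seen = set()
    result = []
    for href in hrefs:
        url = urljoin(base, href)
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


# Made-up hrefs, including one repeated link.
sample_hrefs = ["/tablets/pr?page=1", "/tablets/pr?page=2", "/tablets/pr?page=1"]
sample_page_urls = absolute_urls(sample_hrefs)
print(sample_page_urls)
```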

Once we have the page URLs, we collect the link to each product listed on a page, open each product page, and extract its name, price and rating, as below.

# lists that collect the extracted values, one entry per product
products_list = list()
prices_list = list()
ratings_list = list()
brandnames_list = list()

prod_urls = list()
for url in page_urls:
    print(url)
    uclient = uReq(url)
    page_html = uclient.read()
    uclient.close()
    psoup = soup(page_html, "html.parser")
    for containers in psoup.findAll('div', {'class': '_13oc-S'}):
        # collect the link of every product card on the search page
        for a in containers.findAll('a', {'class': '_1fQZEK'}):
            prod_urls.append('https://www.flipkart.com' + a['href'])

for p_url in prod_urls:
    uclient = uReq(p_url)
    page_html = uclient.read()
    uclient.close()
    psoup = soup(page_html, "html.parser")
    # rows of the specification table (not used further in this snippet)
    all_procuct_items = psoup.find('div', attrs={'class': '_3k-BhJ'})
    all_procuct_items = all_procuct_items.findAll('tr', attrs={'class': '_1s_Smc row'})
    # container holds the html of the product title, which is stored in
    # div tags with class "_1AtVbE col-12-12"
    container = psoup.findAll("div", {"class": "_1AtVbE col-12-12"})
    for product_item in container:
        product_dict = dict()
        rating_dict = dict()
        price_dict = dict()
        brandname_dict = dict()
        n = product_item.findAll("span", {"class": "B_NuCI"})
        p = product_item.findAll("div", {"class": "_30jeq3 _16Jk6d"})
        r = product_item.findAll("div", {"class": "_2d4LTz"})
        for i in n:
            product_dict['name'] = i.text
            # the first word of the product name is taken as the brand
            strtmp = i.text.split(" ")
            brandname_dict['brandname'] = strtmp[0]
            brandnames_list.append(strtmp[0])
            products_list.append(i.text)
        for j in p:
            # strip the currency symbol and thousands separators
            jStr = j.text
            jStr = jStr.replace("₹", "")
            jStr = jStr.replace(",", "")
            price_dict['price'] = int(jStr)
            prices_list.append(int(jStr))
        for k in r:
            rating_dict['rating'] = k.text
            ratings_list.append(k.text)
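
Appending to separate lists works only while every product has a name, a price and a rating; if one value is missing, the lists end up with different lengths. One possible way around this, sketched below, is to build one dictionary per product. The helper names parse_price and make_record and the sample product values are hypothetical and not taken from the scraped data.

```python
def parse_price(text):
    """Convert a displayed price such as '₹12,999' to an int, or None if it is not a whole number."""
    cleaned = text.replace("₹", "").replace(",", "").strip()
    return int(cleaned) if cleaned.isdecimal() else None


def make_record(name, price_text, rating_text):
    """Build one row per product; missing values become None instead of shifting rows."""
    name = name.strip()
    return {
        # As in the original code, the first word of the name is taken as the brand.
        "Brand": name.split(" ")[0] if name else None,
        "ProductName": name,
        "Price": parse_price(price_text) if price_text else None,
        "Ratings": rating_text,
    }


# Made-up sample values; the second product has no rating.
records = [
    make_record("BrandA Tab 10 (64 GB)", "₹12,999", "4.3"),
    make_record("BrandB Pad Mini", "₹8,499", None),
]
print(records)
```

A list of such dictionaries can be passed directly to pd.DataFrame(records), which keeps each product's values on the same row even when a field is missing.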

We can now export the extracted product details to a CSV file using the code below.

df = pd.DataFrame({'Brand': brandnames_list, 'Price': prices_list, 'Ratings': ratings_list, 'ProductName': products_list})
df.to_csv('export_products_dataframe.csv')

We can use the box plot, cat plot and bar plot available in the seaborn module to visualize the price range of the products, as shown in the pictures below, using the following code.

# np.float was removed in NumPy 1.24; the built-in float is equivalent
df['Price'] = df['Price'].astype(float)
sns.boxplot(x=df['Price'])
sns.catplot(x="Price",      # x variable name
            y="Brand",      # y variable name
            hue="Ratings",  # group variable name
            data=df,        # dataframe to plot
            kind="bar")
df.groupby('Brand').plot(x='Brand', y='Price')
plt.figure()  # start a new figure so the bar plot gets its own axes
sns.barplot(x='Brand',
            y='Price',
            data=df)
plt.show()
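
The plots show the price spread, but the stated goal of suggesting a mid-range product still needs a selection rule. One possible rule, sketched here under the assumption that "mid-range" means a price between the first and third quartiles, uses the standard library's statistics.quantiles. The function name mid_range and the sample products are invented for illustration.

```python
from statistics import quantiles

# Made-up sample data standing in for the scraped products.
products = [
    {"ProductName": "Tablet A", "Price": 5000},
    {"ProductName": "Tablet B", "Price": 8000},
    {"ProductName": "Tablet C", "Price": 12000},
    {"ProductName": "Tablet D", "Price": 15000},
    {"ProductName": "Tablet E", "Price": 30000},
    {"ProductName": "Tablet F", "Price": 60000},
]


def mid_range(items):
    """Return the items whose price lies between the first and third quartiles."""
    prices = sorted(item["Price"] for item in items)
    # quantiles() needs at least two data points; n=4 yields the three quartile cut points.
    q1, _median, q3 = quantiles(prices, n=4)
    return [item for item in items if q1 <= item["Price"] <= q3]


suggested = mid_range(products)
print([item["ProductName"] for item in suggested])
```

The same rule could be applied to the scraped data by passing df.to_dict('records') to mid_range; other definitions of mid-range, such as a fixed band around the median, are equally valid.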


The complete script is given below.

import bs4
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
import plotly_express as px
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

myurl = "https://www.flipkart.com/tablets/pr?sid=tyy,hry&marketplace=FLIPKART"
uclient = uReq(myurl)
page_html = uclient.read()
uclient.close()
psoup = soup(page_html, "html.parser")

products_list = list()
prices_list = list()
ratings_list = list()
modelnames_list = list()
brandnames_list = list()
rams_list = list()
colors_list = list()
displays_list = list()
connectivities_list = list()
page_urls = list()

for containers in psoup.findAll('div', {'class': '_2MImiq'}):
    page_href_list = containers.findAll('a', {'class': 'ge-49M'})
    # uncomment the below 2 lines to get all 10 search pages
    # for a in page_href_list:
    #     page_urls.append('https://www.flipkart.com' + a['href'])
    # use the below 2 lines to get a single search page
    page_href = page_href_list[0]
    page_urls.append('https://www.flipkart.com' + page_href['href'])

prod_urls = list()
for url in page_urls:
    print(url)
    uclient = uReq(url)
    page_html = uclient.read()
    uclient.close()
    psoup = soup(page_html, "html.parser")
    for containers in psoup.findAll('div', {'class': '_13oc-S'}):
        # collect the link of every product card on the search page
        for a in containers.findAll('a', {'class': '_1fQZEK'}):
            prod_urls.append('https://www.flipkart.com' + a['href'])

for p_url in prod_urls:
    uclient = uReq(p_url)
    page_html = uclient.read()
    uclient.close()
    psoup = soup(page_html, "html.parser")
    all_procuct_items = psoup.find('div', attrs={'class': '_3k-BhJ'})
    all_procuct_items = all_procuct_items.findAll('tr', attrs={'class': '_1s_Smc row'})
    for procuct_item in all_procuct_items:
        dataname = procuct_item.findAll("td", {"class": "_1hKmbr col col-3-12"})
        datadetails = procuct_item.findAll("td", {"class": "URwL2w col col-9-12"})
        """
        color = dict()
        display = dict()
        modelname = dict()
        connectivity = dict()
        for names in dataname:
            if( 'Name' in names.text):
                for details in datadetails:
                    modelname['modelname'] = details.text
                    modelnames.append(details.text)
            elif( 'Color' in names.text):
                for details in datadetails:
                    color['color'] = details.text
                    colors.append(details.text)
            elif("Display" in names.text):
                for details in datadetails:
                    display['display'] = details.text
                    displays.append(details.text)
            elif("Connectivity" in names.text):
                for details in datadetails:
                    connectivity['connectivity'] = details.text
                    connectivities.append(details.text)
        """
    # container variable contains the html of product title which is stored
    # in div tag and class is "_1AtVbE col-12-12"
    container = psoup.findAll("div", {"class": "_1AtVbE col-12-12"})
    for product_item in container:
        product_dict = dict()
        rating_dict = dict()
        price_dict = dict()
        brandname_dict = dict()
        n = product_item.findAll("span", {"class": "B_NuCI"})
        p = product_item.findAll("div", {"class": "_30jeq3 _16Jk6d"})
        r = product_item.findAll("div", {"class": "_2d4LTz"})
        for i in n:
            product_dict['name'] = i.text
            strtmp = i.text.split(" ")
            brandname_dict['brandname'] = strtmp[0]
            brandnames_list.append(strtmp[0])
            products_list.append(i.text)
        for j in p:
            jStr = j.text
            jStr = jStr.replace("₹", "")
            jStr = jStr.replace(",", "")
            price_dict['price'] = int(jStr)
            prices_list.append(int(jStr))
        for k in r:
            rating_dict['rating'] = k.text
            ratings_list.append(k.text)

print("Number of items in the products list = ", len(products_list))
print("Number of items in the prices list = ", len(prices_list))
print("Number of items in the ratings list = ", len(ratings_list))

df = pd.DataFrame({'Brand': brandnames_list, 'Price': prices_list, 'Ratings': ratings_list, 'ProductName': products_list})
df.to_csv('export_products_dataframe.csv')

# np.float was removed in NumPy 1.24; the built-in float is equivalent
df['Price'] = df['Price'].astype(float)
sns.boxplot(x=df['Price'])
sns.catplot(x="Price",      # x variable name
            y="Brand",      # y variable name
            hue="Ratings",  # group variable name
            data=df,        # dataframe to plot
            kind="bar")
df.groupby('Brand').plot(x='Brand', y='Price')
plt.figure()  # start a new figure so the bar plot gets its own axes
sns.barplot(x='Brand', y='Price', data=df)
plt.show()