Friday, February 14, 2020

Python with Selenium Webdriver

## Python with Selenium Helper file
-----------------------------------------

##Check version / is it already installed? via CMD
-----------------------------------------------
python --version
pip --version

CTRL + '/' => Comment or uncomment the current line
CTRL + ALT + L => Format code automatically (PyCharm)


##Install Selenium in Python via CMD
---------------------------------------
pip install -U selenium


##Upgrade PIP version via CMD
------------------------------------
python -m pip install --upgrade pip


##Learn Selenium Python : Step by Step
------------------------------------------------------
Step 1 : download python - https://www.python.org/downloads/
Step 2 : Install and check python and pip is installed successfully
              python --version
              pip --version
Step 3 : install selenium libraries
              pip install -U selenium
Step 4 : Download PyCharm - community edition
      https://www.jetbrains.com/pycharm/dow...
Step 5 : Create new project in PyCharm
Step 6 : Adding selenium scripts to the project
   note: if clicking the "btnK" element gives trouble, locate the search box by its name "q" instead.
Step 7 : Run from IDE
         Run from Command Line


For more details about the sample project, kindly refer to the URL below:

https://github.com/autom99/PythonSampleProject/tree/master/PythonWithSelenium


Saturday, February 8, 2020

Python Web Scraping Using BeautifulSoup

Web scraping is the process of building an agent that can extract, parse, download and organize useful information from the web automatically. It is also called web data mining or web harvesting.
In simple terms, web scraping is used to collect large amounts of information from websites.
Applications of web scraping:
  • Price Comparison: Services such as ParseHub use web scraping to collect data from online shopping websites and use it to compare product prices.
  • Email address gathering: Many companies that use email as a marketing medium use web scraping to collect email IDs and then send bulk emails.
  • Social Media Scraping: Web scraping is used to collect data from social media websites such as Twitter to find out what’s trending.
  • Research and Development: Web scraping is used to collect large data sets (statistics, general information, temperature, etc.) from websites, which are analyzed and used to carry out surveys or for R&D.
  • Job listings: Details regarding job openings and interviews are collected from different websites and listed in one place so that they are easily accessible to users.



Implementing Web Scraping in Python with BeautifulSoup


There are mainly two ways to extract data from a website:
  • Use the API of the website (if it exists). For example, Facebook has the Facebook Graph API which allows retrieval of data posted on Facebook.
  • Access the HTML of the webpage and extract useful information/data from it. This technique is called web scraping or web harvesting or web data extraction.
This article discusses the steps involved in web scraping and their implementation in Python with Beautiful Soup.
Steps involved in web scraping:


  1. Send an HTTP request to the URL of the webpage you want to access. The server responds to the request by returning the HTML content of the webpage. For this task, we will use the third-party requests library.
  2. Once we have accessed the HTML content, we are left with the task of parsing the data. Since most HTML data is nested, we cannot extract it through simple string processing; we need a parser that builds a nested/tree structure of the HTML.
    There are many HTML parser libraries available; html5lib is one of the most robust, as it parses pages the way a browser does.
  3. Now, all we need to do is navigate and search the parse tree that we created, i.e. tree traversal. For this task, we will use another third-party Python library, Beautiful Soup. It is a Python library for pulling data out of HTML and XML files.
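
The steps above can be sketched offline with an inline HTML snippet (the class names and contents here are illustrative, and step 1's HTTP request is replaced by an inline string so the sketch runs without a network):

```python
from bs4 import BeautifulSoup

# a tiny stand-in for the HTML that requests.get() would return
html_doc = """<html><body>
<div class="posts"><a href="/post-1">First post</a></div>
<div class="posts"><a href="/post-2">Second post</a></div>
</body></html>"""

# Step 2: parse the HTML into a tree ("html5lib" also works here if installed)
soup = BeautifulSoup(html_doc, "html.parser")

# Step 3: navigate/search the tree
titles = [a.get_text() for a in soup.select("div.posts a")]
print(titles)  # ['First post', 'Second post']
```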
In the example below, I am using the PyCharm IDE:

source code: TestWebScrap.py

-----------------------------------------------------------------
import requests
from bs4 import BeautifulSoup

# Step 1: fetch the page HTML
page = requests.get("https://autom99.blogspot.com/")

# Step 2: parse the HTML into a tree
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

# Step 3: search the tree by class and id
posts = soup.find_all(class_='posts')
print(posts)

postHierarchy = soup.find_all(class_='hierarchy')
print(postHierarchy)

print(soup.find_all(id='ArchiveList'))
---------------------------------------------------------------------

Also, we can export the results to Excel, as in the code below:

import pandas as pd

# build a DataFrame from the scraped results
df = pd.DataFrame({
    'Posts': posts,
    'PostHierarchy': postHierarchy,
})
print(df)

# write to an Excel file (requires the openpyxl package)
df.to_excel('posts.xlsx', index=False)
------------------------
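
As a standalone sketch with illustrative stand-in data (not real scraped results), the DataFrame can also be written to disk; to_csv needs no extra dependency, while to_excel requires the openpyxl package:

```python
import pandas as pd

# illustrative data standing in for the scraped posts
df = pd.DataFrame({
    "Posts": ["First post", "Second post"],
    "PostHierarchy": ["2020 > Feb", "2020 > Feb"],
})
df.to_csv("posts.csv", index=False)  # no extra dependency needed
print(df.shape)  # (2, 2)
```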


For more details, refer to my GitHub: