# Lecture 8. Web Page and Crawler

### Instructor: Luping Yu

### April 23, 2024

***
## Web Page

Before we start writing code, we need to understand a little bit about the structure of a web page.

When we visit a web page, our web browser makes a request to a web server. This request is called a <code>GET</code> request, since we're getting files from the server. The server then sends back files that tell our browser how to render the page for us. These files will typically include:

* HTML — the main content of the page.
* CSS — used to add styling to make the page look nicer.
* JS — JavaScript files add interactivity to web pages.

After our browser receives all the files, it **renders** the page and displays it to us.

There's a lot that happens behind the scenes to render a page nicely, but we don't need to worry about most of it when we're web scraping. <u>When we perform web scraping, we’re interested in the main content of the web page</u>, so we look primarily at the <code>HTML</code>.

***
### What is HTML?
* HTML stands for <u>Hyper Text Markup Language</u>
* HTML is the standard **markup language** for creating Web pages
* HTML describes the structure of a Web page
* HTML consists of a series of **elements**
* HTML elements tell the browser how to display the content
* HTML elements label pieces of content such as "this is a heading", "this is a paragraph", "this is a link", etc.

***
### How to View HTML Source?

View HTML Source Code:
* Right-click in an HTML page and select "View Page Source" (in Chrome) or "View Source" (in Edge), or similar in other browsers. This will open a window containing the HTML source code of the page.

Chrome Developer Tools (<code>F12</code>):
* Open the Chrome menu and select More Tools -> Developer Tools, or press <code>F12</code>.
* Chrome DevTools is a set of web developer tools built directly into the Google Chrome browser.

![avatar](https://raw.githubusercontent.com/lazydingding/gallery/main/20220509_f0.png)

***
### A Simple HTML Document (a.k.a. "Page")

In [1]:
%%HTML

<!DOCTYPE html>
<html>

    <head>
        <title>My page title</title>
    </head>

    <body>
        <h1>Hello!</h1>
        <p>My second paragraph.</p>
    </body>

</html>

### Example Explained

* The <code>&lt;!DOCTYPE html&gt;</code> declaration defines that this document is an HTML document
* The <code>&lt;html&gt;</code> element is the root element of an HTML page
* The <code>&lt;head&gt;</code> element contains meta information about the HTML page
* The <code>&lt;title&gt;</code> element specifies a title for the HTML page (which is shown in the browser's title bar or in the page's tab)
* The <code>&lt;body&gt;</code> element defines the document's body, and is a container for all the **visible** contents, such as headings, paragraphs, images, hyperlinks, tables, lists, etc.
* The <code>&lt;h1&gt;</code> element defines a large heading
* The <code>&lt;p&gt;</code> element defines a paragraph
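The nesting described above can also be explored from Python. The following is a minimal sketch, assuming the <code>BeautifulSoup</code> library (introduced later in this lecture) is installed; it parses a document with the same structure as the one above and reads a few elements back out.

```python
from bs4 import BeautifulSoup

# A document with the same structure as the example page, held in a string.
html_doc = """<!DOCTYPE html>
<html>
    <head>
        <title>My page title</title>
    </head>
    <body>
        <h1>Hello!</h1>
        <p>My second paragraph.</p>
    </body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")

page_title = soup.title.get_text()  # text of <title>, which lives inside <head>
heading = soup.h1.get_text()        # text of the first <h1>
paragraph = soup.p.get_text()       # text of the first <p>

# Every element knows its parent, which mirrors the nesting in the source.
title_parent = soup.title.parent.name
heading_parent = soup.h1.parent.name

print(page_title, heading, paragraph, sep=" | ")
print(title_parent, heading_parent)
```

The parent names confirm that <code>&lt;title&gt;</code> sits in the head while the visible content sits in the body.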

***
### HTML Element

An HTML <code>element</code> is defined by a **start tag**, some **content**, and an **end tag**:

![avatar](https://raw.githubusercontent.com/lazydingding/gallery/main/20220509_f1.png)


The HTML **element** is everything from the start tag to the end tag:

In [2]:
%%HTML

<h1>My First Heading</h1>
<p>My first paragraph.</p>

***
#### 1. Empty HTML Elements
* Empty elements (also called self-closing or void elements) have no content and no end tag, so they cannot contain other elements.
* A typical example of an empty element is the <code>&lt;br&gt;</code> element, which represents a line break. Other common empty elements include <code>&lt;img&gt;</code> and <code>&lt;input&gt;</code>.

In [3]:
%%HTML

<p>This paragraph contains <br> a line break.</p>
<img src="https://www.xmu.edu.cn/images/logo2.png" alt="xmu">
<input type="text" name="username">

***
#### 2. Nesting HTML Elements
* Most HTML elements can contain any number of further elements, which are, in turn, made up of tags, attributes, and content or other elements.
* The following example shows some elements nested inside the <code>&lt;p&gt;</code> element.

In [4]:
%%HTML

<p>Here is some <b>bold</b> text.</p>
<p>Here is some <em>emphasized</em> text.</p>
<p>Here is some <mark>highlighted</mark> text.</p>

***
#### 3. HTML Links

HTML links are defined with the <code>&lt;a&gt;</code> tag:
* The link's destination is specified in the <code>href</code> attribute. 
* <code>Attributes</code> are used to provide **additional information** about HTML elements.

In [5]:
%%HTML

<a href="https://sm.xmu.edu.cn/">This is a link</a>
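When scraping, the link's destination is usually what we want. As a rough sketch (assuming <code>BeautifulSoup</code>, which is covered later in this lecture), an attribute can be read from a parsed tag much like a key in a dictionary:

```python
from bs4 import BeautifulSoup

html_doc = '<a href="https://sm.xmu.edu.cn/">This is a link</a>'
soup = BeautifulSoup(html_doc, "html.parser")

link = soup.find("a")

link_text = link.get_text()     # the content between the start and end tags
link_url = link["href"]         # square brackets raise KeyError if the attribute is absent
link_title = link.get("title")  # .get() returns None when the attribute is absent

print(link_text, link_url, link_title)
```

Using <code>.get()</code> is the safer choice when an attribute may be missing on some elements.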

***
#### 4. Writing Comments in HTML
* Comments are usually added with the purpose of making the source code easier to understand.
* You can also comment out part of your HTML code for debugging purpose.
* An HTML comment begins with <code>&lt;!--</code> and ends with <code>--&gt;</code>, as shown in the example below:

In [6]:
%%HTML

<!-- This is an HTML comment -->
<!-- This is a multi-line HTML comment 
     that spans across more than one line -->
<p>This is a normal piece of text.</p>

***
#### 5. HTML Element Types

The basic elements of an HTML page are:

* A text header, denoted using the <code>&lt;h1&gt;, &lt;h2&gt;, &lt;h3&gt;, &lt;h4&gt;, &lt;h5&gt;, &lt;h6&gt;</code> tags.
* A paragraph, denoted using the <code>&lt;p&gt;</code> tag.
* A link, denoted using the <code>&lt;a&gt;</code> (anchor) tag.
* A list, denoted using the <code>&lt;ul&gt;</code> (unordered list), <code>&lt;ol&gt;</code> (ordered list) and <code>&lt;li&gt;</code> (list element) tags.
* An image, denoted using the <code>&lt;img&gt;</code> tag
* A divider, denoted using the <code>&lt;div&gt;</code> tag
* A text span, denoted using the <code>&lt;span&gt;</code> tag


Elements can be placed in two distinct groups: **block level** and **inline level** elements. The former make up the document's structure, while the latter dress up the contents of a block.

* A block-level element occupies 100% of the available width and is rendered with a line break before and after it, whereas an inline-level element takes up only as much space as it needs.
    * The most commonly used block-level elements are <code>&lt;div&gt;, &lt;p&gt;, &lt;h1&gt;</code> through <code>&lt;h6&gt;, &lt;form&gt;, &lt;ol&gt;, &lt;ul&gt;, &lt;li&gt;</code>, and so on. Commonly used inline-level elements are <code>&lt;img&gt;, &lt;a&gt;, &lt;span&gt;, &lt;strong&gt;, &lt;b&gt;, &lt;em&gt;, &lt;i&gt;, &lt;code&gt;, &lt;input&gt;, &lt;button&gt;</code>, etc.
* Block-level elements should not be placed within inline-level elements. For example, the <code>&lt;p&gt;</code> element should not be placed inside the <code>&lt;b&gt;</code> element.



***
### HTML Attributes

<code>Attributes</code> define **additional characteristics or properties** of an element, such as the width and height of an image.
* Attributes are always specified in the start tag (or opening tag) and usually consist of name/value pairs like <code>name="value"</code>.
    * Some attributes are required for certain elements. For instance, an <code>&lt;img&gt;</code> tag must contain <code>src</code> and <code>alt</code> attributes.

Let's take a look at some examples of the attributes usages:

In [7]:
%%HTML

<img src="https://www.xmu.edu.cn/images/logo2.png" width="200" height="100" alt="xmu">
<a href="https://www.google.com/" title="Search Engine">Google</a>
<input type="text" value="Guido van Rossum">

In the above example, <code>src</code> inside the <code>&lt;img&gt;</code> tag is an attribute and the image path provided is its value. Similarly, <code>href</code> inside the <code>&lt;a&gt;</code> tag is an attribute and the link provided is its value, and so on.
      
There are some **general purpose** attributes, such as <code>id, title, class, style</code>, etc. that you can use on the majority of HTML elements.
    
***
#### 1. The <code>id</code> Attribute
    
The <code>id</code> attribute is used to give a unique identifier to an element within a document. This makes it easier to select the element using CSS or JavaScript.
*  The <code>id</code> of an element must be **unique** within a single document.
    * No two elements in the same document can be named with the same id.
    * Each element can have only one id.

In [8]:
%%HTML

<input type="text" id="firstName">
<div id="container">Some content</div>
<p id="infoText">This is a paragraph.</p>

***
#### 2. The <code>class</code> Attribute

Like the <code>id</code> attribute, the <code>class</code> attribute is also used to identify elements. But unlike <code>id</code>, the <code>class</code> attribute **does not have to be unique** in the document.
* This means you can apply the same class to multiple elements in a document.
* Any style rules that are written to that class will be applied to all the elements having that class.

In [9]:
%%HTML

<input type="text" class="highlight">
<p class="highlight">This is a paragraph.</p>

<p>This is a paragraph.</p>
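One detail matters for scraping: an element may carry several classes at once, separated by spaces. The sketch below assumes <code>BeautifulSoup</code> (covered later in this lecture) and adds a hypothetical second class, <code>note</code>, purely for illustration.

```python
from bs4 import BeautifulSoup

# The class "note" is a made-up addition to show a multi-class element.
html_doc = """
<input type="text" class="highlight">
<p class="highlight note">This is a paragraph.</p>
<p>This is a paragraph.</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# class_ has a trailing underscore because "class" is a reserved word in Python.
highlighted = soup.find_all(class_="highlight")
tag_names = [tag.name for tag in highlighted]

# BeautifulSoup returns the class attribute as a list of class names.
classes_of_p = soup.find("p")["class"]

print(tag_names)
print(classes_of_p)
```

Searching for one class matches any element that has it, regardless of which other classes the element carries.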

***
#### 3. The <code>style</code> Attribute

HTML is quite limited when it comes to the presentation of a web page. It was originally designed as a simple way of presenting information.

**CSS (Cascading Style Sheets)** was introduced in December 1996 by the World Wide Web Consortium (W3C) to provide a better way to style HTML elements.

![avatar](https://github.com/lazydingding/gallery/blob/main/Screen%20Shot%202022-05-09%20at%2023.55.26.png?raw=true)

The <code>style</code> attribute allows you to specify CSS styling rules such as color, font, border, etc. directly within the element. Let's check out an example to see how it works:

In [10]:
%%HTML

<p style="color: blue;">This is a paragraph.</p>
<div style="border: 1px solid red;">Some content</div>

***
#### 4. HTML Attribute Types

Each element can also have attributes, and each element has its own set of attributes relevant to it.

There are a few global attributes, which can be used on any element. The most common are:
* <code>id</code> - Denotes the unique ID of an element in a page. Used for locating elements by using links, JavaScript, and more.
* <code>class</code> - Denotes the CSS class of an element.
* <code>style</code> - Denotes the CSS styles to apply to an element.

***
## Crawler

Some websites offer data sets that are downloadable in CSV format, or accessible via an **Application Programming Interface (API)**. But many websites with useful data don't offer these convenient options.

Web scraping is the process of **gathering information** from the Internet. Even copying and pasting the lyrics of your favorite song is a form of web scraping!

The words "web scraping" usually refer to a process that involves automation. Some websites don't like it when automatic scrapers gather their data, while others don't mind.

Common tools:
* <code>requests</code>
* <code>BeautifulSoup</code>
* <code>Selenium</code>
* <code>pandas.read_html()</code>

***
### How Does Web Scraping Work?

When we scrape the web, we write code that sends a request to the server that's hosting the page we specified. The server will return the source code — HTML, mostly — for the page (or pages) we requested.

So far, we're essentially doing the same thing a web browser does — sending a server request with a specific URL and asking the server to return the code for that page.

But unlike a web browser, our web scraping code won't interpret the page's source code and display the page visually. Instead, we'll write some custom code that filters through the page's source code looking for specific elements we've specified, and extracting whatever content we've instructed it to extract.

For example, if we wanted to get all of the data from inside a table that was displayed on a web page, our code would be written to go through these steps in sequence:

* Request the content (source code) of a specific URL from the server.
* Download the content that is returned.
* Identify the elements of the page that are part of the table we want.
* Extract and (if necessary) reformat those elements into a dataset we can analyze or use in whatever way we require.

Python and <code>BeautifulSoup</code> have built-in features designed to make this relatively straightforward.
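The four steps above can be sketched as follows. This is only an illustration under stated assumptions: the page content is a hard-coded string standing in for a real download, and the table, its <code>id</code> of <code>prices</code>, and its values are invented for the example.

```python
from bs4 import BeautifulSoup

# Steps 1-2: in a real scraper this string would come from
# requests.get(url).content; a hard-coded page stands in for it here.
html_doc = """
<html><body>
  <table id="prices">
    <tr><th>Item</th><th>Price</th></tr>
    <tr><td>Tea</td><td>3.50</td></tr>
    <tr><td>Coffee</td><td>4.25</td></tr>
  </table>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Step 3: identify the elements that make up the table we want.
table = soup.find("table", id="prices")
rows = table.find_all("tr")

# Step 4: extract the cell text and reformat it into a list of dictionaries.
header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
records = [
    dict(zip(header, (td.get_text(strip=True) for td in row.find_all("td"))))
    for row in rows[1:]
]

print(records)
```

Each dictionary corresponds to one table row, which is a convenient shape for building a DataFrame later.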

***
### The requests library

Now that we understand the structure of a web page, it's time to get into the fun part: scraping the content we want!

The first thing we'll need to do to scrape a web page is to download the page. We can download pages using the Python <code>requests</code> library.

The requests library will make a <code>GET</code> request to a web server, which will download the HTML contents of a given web page for us.

Let's try downloading a simple [sample website](https://dataquestio.github.io/web-scraping-pages/simple.html):

In [11]:
import requests

page = requests.get("https://dataquestio.github.io/web-scraping-pages/simple.html")

page

<Response [200]>

A <code>status_code</code> of 200 means that the page downloaded successfully.

We won't fully dive into status codes here, but a status code starting with a 2 generally indicates success, and a code starting with a 4 or a 5 indicates an error.
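The first-digit rule can be written down as a small helper. This is a sketch using only the standard library; the function name <code>classify_status</code> is made up for this example.

```python
from http import HTTPStatus


def classify_status(code: int) -> str:
    """Group an HTTP status code by its first digit (hypothetical helper)."""
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"


# HTTPStatus supplies the standard phrase that goes with each code.
for code in (200, 404, 503):
    print(code, HTTPStatus(code).phrase, "->", classify_status(code))
```

With <code>requests</code>, the number is available as <code>page.status_code</code>, and calling <code>page.raise_for_status()</code> raises an exception for 4xx and 5xx responses, which is a convenient way to stop a scraper early when a download fails.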

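We can print out the HTML content of the page using the <code>content</code> attribute, which holds the raw bytes of the response (note the <code>b</code> prefix in the output):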

In [12]:
page.content

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'

***
### Parsing a page with BeautifulSoup

As you can see above, we have now downloaded an HTML document.

We can use the <code>BeautifulSoup</code> library to parse this document and extract the text from the <code>p</code> tag. The <code>find_all</code> method finds all the instances of a tag on a page; when we only want the first match, the <code>find</code> method (used later in this lecture) returns a single tag instead.

We first have to import the library, and create an instance of the <code>BeautifulSoup</code> class to parse our document:

In [13]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(page.content, 'html.parser')

soup

<!DOCTYPE html>

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>

In [14]:
soup.find_all('p')

[<p>Here is some simple content for this page.</p>]

Note that <code>find_all</code> returns a list, so we have to loop through it or use list indexing to extract text:

In [15]:
soup.find_all('p')[0].get_text()

'Here is some simple content for this page.'
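When only one element is needed, <code>find</code> avoids the indexing step. The sketch below uses a shortened copy of the sample page as a string so that it runs without a network connection.

```python
from bs4 import BeautifulSoup

html_doc = "<html><body><p>Here is some simple content for this page.</p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# find returns the first matching tag directly, not a list.
first_p = soup.find("p")
text = first_p.get_text()

# When nothing matches, find returns None and find_all returns an empty list.
missing = soup.find("table")
all_tables = soup.find_all("table")

print(text)
print(missing, len(all_tables))
```

Because <code>find</code> can return <code>None</code>, it is worth checking the result before calling <code>get_text</code> on pages whose structure may vary.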

***
### Searching for tags by class and id

We introduced classes and ids earlier, but it probably wasn't clear why they were useful.

<code>Classes</code> and <code>ids</code> are used by CSS to determine which HTML elements to apply certain styles to. But when we're scraping, we can also use them to specify the elements we want to scrape.

To illustrate this principle, we'll work with another [sample website](https://dataquestio.github.io/web-scraping-pages/ids_and_classes.html):

In [16]:
page = requests.get("https://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
soup

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>

Now, we can use the <code>find_all</code> method to search for items by class or by id. Let's look for any tag that has the class <code>outer-text</code>:

In [17]:
soup.find_all(class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

We can also search for elements by <code>id</code>:

In [18]:
soup.find_all(id="first")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]

***
### Using CSS Selectors

We can also search for items using <code>CSS</code> selectors. These selectors are how the CSS language allows developers to specify HTML tags to style. Here are some examples:

* <code>p a</code> — finds all a tags inside of a p tag.
* <code>body p a</code> — finds all a tags inside of a p tag inside of a body tag.
* <code>html body</code> — finds all body tags inside of an html tag.
* <code>p.outer-text</code> — finds all p tags with a class of outer-text.
* <code>p#first</code> — finds all p tags with an id of first.
* <code>body p.outer-text</code> — finds any p tags with a class of outer-text inside of a body tag.

<code>BeautifulSoup</code> objects support searching a page via CSS selectors using the <code>select</code> method. We can use CSS selectors to find all the <code>p</code> tags in our page that are inside of a <code>div</code> like this:

In [19]:
soup.select("div p")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 <p class="inner-text">
                 Second paragraph.
             </p>]
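A few of the other selectors listed above can be tried the same way. This sketch works on a condensed copy of the sample page held in a string, so the whitespace differs slightly from the downloaded version.

```python
from bs4 import BeautifulSoup

# Condensed copy of the ids_and_classes sample page.
html_doc = """
<html><body>
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# p.outer-text: every <p> that has the class outer-text.
outer = [p.get_text(strip=True) for p in soup.select("p.outer-text")]

# p#first: the <p> with id "first"; select_one returns a single tag or None.
first = soup.select_one("p#first").get_text(strip=True)

# .first-item: any tag with the class first-item, whatever its tag name.
first_item_ids = [tag["id"] for tag in soup.select(".first-item")]

print(outer)
print(first)
print(first_item_ids)
```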

***
### Example 1. Downloading weather data

We now know enough to proceed with extracting information about the local weather from the National Weather Service website.

The first step is to find the page we want to scrape. We'll extract weather information about downtown San Francisco from [this page](https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168#.YnnblBNBzpY). The page has information about the extended forecast for the next week, including time of day, temperature, and a brief description of the conditions.

Specifically, let's extract data about the extended forecast.

***
#### 1. Exploring page structure with Chrome DevTools

The first thing we'll need to do is inspect the page using <mark>Chrome DevTools</mark>. If you’re using another browser, Firefox and Safari have equivalents.

You can start the developer tools in Chrome by clicking <mark>View -> Developer -> Developer Tools</mark>. Make sure the Elements panel is highlighted.

The elements panel will show you all the HTML tags on the page, and let you navigate through them. It's a really handy feature!

We can then scroll up in the elements panel to find the "outermost" element that contains all of the text that corresponds to the extended forecasts. In this case, it's a <code>div</code> tag with the id <code>seven-day-forecast</code>.

If we click around in the Elements panel and explore the div, we'll discover that each forecast item (like "Tonight", "Thursday", and "Thursday Night") is contained in a <code>div</code> with the class <code>tombstone-container</code>.

***
#### 2. Time to Start Scraping!

We now know enough to download the page and start parsing it. In the below code, we will:

* Download the web page containing the forecast.
* Create a <code>BeautifulSoup</code> object to parse the page.
* Find the <code>div</code> with id <code>seven-day-forecast</code>, and assign it to <code>seven_day</code>.
* Inside seven_day, find each individual forecast item.
* Extract and print the first forecast item.

In [20]:
page = requests.get("https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')

soup

<!DOCTYPE html>

<html class="no-js">
<head>
<!-- Meta -->
<meta content="width=device-width" name="viewport"/>
<link href="http://purl.org/dc/elements/1.1/" rel="schema.DC"/><title>National Weather Service</title><meta content="National Weather Service" name="DC.title"><meta content="NOAA National Weather Service National Weather Service" name="DC.description"/><meta content="US Department of Commerce, NOAA, National Weather Service" name="DC.creator"/><meta content="" name="DC.date.created" scheme="ISO8601"/><meta content="EN-US" name="DC.language" scheme="DCTERMS.RFC1766"/><meta content="weather, National Weather Service" name="DC.keywords"/><meta content="NOAA's National Weather Service" name="DC.publisher"/><meta content="National Weather Service" name="DC.contributor"/><meta content="//www.weather.gov/disclaimer.php" name="DC.rights"/><meta content="General" name="rating"/><meta content="index,follow" name="robots"/>
<!-- Icons -->
<link href="./images/favicon.ico" rel="shortcut 

In [21]:
seven_day = soup.find(id="seven-day-forecast")

seven_day

<div class="panel panel-default" id="seven-day-forecast">
<div class="panel-heading">
<b>Extended Forecast for</b>
<h2 class="panel-title">
                San Francisco CA    </h2>
</div>
<div class="panel-body" id="seven-day-forecast-body">
<div id="seven-day-forecast-container"><ul class="list-unstyled" id="seven-day-forecast-list"><li class="forecast-tombstone">
<div class="tombstone-container">
<p class="period-name">Tonight<br/><br/></p>
<p><img alt="Tonight: Mostly cloudy, with a low around 53. South southwest wind around 10 mph. " class="forecast-icon" src="newimages/medium/nbkn.png" title="Tonight: Mostly cloudy, with a low around 53. South southwest wind around 10 mph. "/></p><p class="short-desc">Mostly Cloudy</p><p class="temp temp-low">Low: 53 °F</p></div></li><li class="forecast-tombstone">
<div class="tombstone-container">
<p class="period-name">Tuesday<br/><br/></p>
<p><img alt="Tuesday: Mostly cloudy, then gradually becoming sunny, with a high near 64. South southwest 

In [22]:
forecast_items = seven_day.find_all(class_="tombstone-container")

forecast_items

[<div class="tombstone-container">
 <p class="period-name">Tonight<br/><br/></p>
 <p><img alt="Tonight: Mostly cloudy, with a low around 53. South southwest wind around 10 mph. " class="forecast-icon" src="newimages/medium/nbkn.png" title="Tonight: Mostly cloudy, with a low around 53. South southwest wind around 10 mph. "/></p><p class="short-desc">Mostly Cloudy</p><p class="temp temp-low">Low: 53 °F</p></div>,
 <div class="tombstone-container">
 <p class="period-name">Tuesday<br/><br/></p>
 <p><img alt="Tuesday: Mostly cloudy, then gradually becoming sunny, with a high near 64. South southwest wind 8 to 14 mph, with gusts as high as 20 mph. " class="forecast-icon" src="newimages/medium/bkn.png" title="Tuesday: Mostly cloudy, then gradually becoming sunny, with a high near 64. South southwest wind 8 to 14 mph, with gusts as high as 20 mph. "/></p><p class="short-desc">Decreasing<br/>Clouds</p><p class="temp temp-high">High: 64 °F</p></div>,
 <div class="tombstone-container">
 <p class=

In [23]:
tonight = forecast_items[0]

tonight

<div class="tombstone-container">
<p class="period-name">Tonight<br/><br/></p>
<p><img alt="Tonight: Mostly cloudy, with a low around 53. South southwest wind around 10 mph. " class="forecast-icon" src="newimages/medium/nbkn.png" title="Tonight: Mostly cloudy, with a low around 53. South southwest wind around 10 mph. "/></p><p class="short-desc">Mostly Cloudy</p><p class="temp temp-low">Low: 53 °F</p></div>

***
#### 3. Extracting information from the page

As we can see, inside the forecast item <code>tonight</code> is all the information we want. There are four pieces of information we can extract:

* The name of the forecast item — in this case, <mark>Tonight</mark>.
* The description of the conditions — this is stored in the <code>title</code> attribute of the <code>img</code> tag.
* A short description of the conditions — in this case, <mark>Mostly Cloudy</mark>.
* The temperature low — in this case, <mark>53 degrees</mark>.


We'll extract the name of the forecast item, the short description, and the temperature first, since they're all similar:

In [24]:
tonight

<div class="tombstone-container">
<p class="period-name">Tonight<br/><br/></p>
<p><img alt="Tonight: Mostly cloudy, with a low around 53. South southwest wind around 10 mph. " class="forecast-icon" src="newimages/medium/nbkn.png" title="Tonight: Mostly cloudy, with a low around 53. South southwest wind around 10 mph. "/></p><p class="short-desc">Mostly Cloudy</p><p class="temp temp-low">Low: 53 °F</p></div>

In [25]:
period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()
print(period)
print(short_desc)
print(temp)

Tonight
Mostly Cloudy
Low: 53 °F


Now, we can extract the <code>title</code> attribute from the <code>img</code> tag. To do this, we just treat the tag object returned by <code>find</code> like a dictionary, and pass in the attribute we want as a key:

In [26]:
tonight

<div class="tombstone-container">
<p class="period-name">Tonight<br/><br/></p>
<p><img alt="Tonight: Mostly cloudy, with a low around 53. South southwest wind around 10 mph. " class="forecast-icon" src="newimages/medium/nbkn.png" title="Tonight: Mostly cloudy, with a low around 53. South southwest wind around 10 mph. "/></p><p class="short-desc">Mostly Cloudy</p><p class="temp temp-low">Low: 53 °F</p></div>

In [27]:
img = tonight.find("img")
desc = img['title']
print(desc)

Tonight: Mostly cloudy, with a low around 53. South southwest wind around 10 mph. 


***
#### 4. Extracting all the information from the page
Now that we know how to extract each individual piece of information, we can combine our knowledge with CSS selectors and list comprehensions to **extract everything at once**.

In the below code, we will:

* Select all items with the class <code>period-name</code> inside an item with the class <code>tombstone-container</code> in <code>seven_day</code>.
* Use a list comprehension to call the <code>get_text</code> method on each <code>BeautifulSoup</code> object.

In [28]:
period_tags = seven_day.select(".tombstone-container .period-name")

period_tags

[<p class="period-name">Tonight<br/><br/></p>,
 <p class="period-name">Tuesday<br/><br/></p>,
 <p class="period-name">Tuesday<br/>Night</p>,
 <p class="period-name">Wednesday<br/><br/></p>,
 <p class="period-name">Wednesday<br/>Night</p>,
 <p class="period-name">Thursday<br/><br/></p>,
 <p class="period-name">Thursday<br/>Night</p>,
 <p class="period-name">Friday<br/><br/></p>,
 <p class="period-name">Friday<br/>Night</p>]

In [29]:
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods

['Tonight',
 'Tuesday',
 'TuesdayNight',
 'Wednesday',
 'WednesdayNight',
 'Thursday',
 'ThursdayNight',
 'Friday',
 'FridayNight']

As we can see above, our technique gets us each of the period names, in order. We can apply the same technique to get the other three fields:

In [30]:
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]
print(short_descs)
print(temps)
print(descs)

['Mostly Cloudy', 'DecreasingClouds', 'Mostly Cloudy', 'Partly Sunny', 'Partly Cloudy', 'Partly Sunny', 'Mostly Cloudy', 'Mostly Sunny', 'Mostly Clear']
['Low: 53 °F', 'High: 64 °F', 'Low: 54 °F', 'High: 64 °F', 'Low: 52 °F', 'High: 63 °F', 'Low: 54 °F', 'High: 64 °F', 'Low: 51 °F']
['Tonight: Mostly cloudy, with a low around 53. South southwest wind around 10 mph. ', 'Tuesday: Mostly cloudy, then gradually becoming sunny, with a high near 64. South southwest wind 8 to 14 mph, with gusts as high as 20 mph. ', 'Tuesday Night: Mostly cloudy, with a low around 54. South southwest wind 6 to 10 mph. ', 'Wednesday: Partly sunny, with a high near 64. South wind 6 to 10 mph. ', 'Wednesday Night: Partly cloudy, with a low around 52. West wind 7 to 10 mph. ', 'Thursday: Partly sunny, with a high near 63.', 'Thursday Night: Mostly cloudy, with a low around 54.', 'Friday: Mostly sunny, with a high near 64.', 'Friday Night: Mostly clear, with a low around 51.']
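Notice that values such as 'TuesdayNight' and 'DecreasingClouds' have lost the space between their words, because the two words are separated by a <code>&lt;br/&gt;</code> tag rather than by text. One possible fix, sketched below on two tags copied from the output above, is to pass a separator to <code>get_text</code>.

```python
from bs4 import BeautifulSoup

# Two tags copied from the forecast output above.
html_doc = """
<p class="period-name">Tuesday<br/>Night</p>
<p class="period-name">Tonight<br/><br/></p>
"""
soup = BeautifulSoup(html_doc, "html.parser")
tags = soup.find_all(class_="period-name")

# Default behaviour: text pieces are joined with nothing in between.
joined = tags[0].get_text()

# separator=" " puts a space between the pieces; strip=True trims each piece
# and drops empty ones, so a trailing <br/> leaves no stray whitespace.
spaced = [tag.get_text(separator=" ", strip=True) for tag in tags]

print(joined)
print(spaced)
```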


***
#### 5. Combining our data into a pandas DataFrame

We can now combine the data into a pandas <code>DataFrame</code> and analyze it.

In [31]:
import pandas as pd

weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc": descs,
})

weather

| | period | short_desc | temp | desc |
|---|---|---|---|---|
| 0 | Tonight | Mostly Cloudy | Low: 53 °F | Tonight: Mostly cloudy, with a low around 53. ... |
| 1 | Tuesday | DecreasingClouds | High: 64 °F | Tuesday: Mostly cloudy, then gradually becomin... |
| 2 | TuesdayNight | Mostly Cloudy | Low: 54 °F | Tuesday Night: Mostly cloudy, with a low aroun... |
| 3 | Wednesday | Partly Sunny | High: 64 °F | Wednesday: Partly sunny, with a high near 64. ... |
| 4 | WednesdayNight | Partly Cloudy | Low: 52 °F | Wednesday Night: Partly cloudy, with a low aro... |
| 5 | Thursday | Partly Sunny | High: 63 °F | Thursday: Partly sunny, with a high near 63. |
| 6 | ThursdayNight | Mostly Cloudy | Low: 54 °F | Thursday Night: Mostly cloudy, with a low arou... |
| 7 | Friday | Mostly Sunny | High: 64 °F | Friday: Mostly sunny, with a high near 64. |
| 8 | FridayNight | Mostly Clear | Low: 51 °F | Friday Night: Mostly clear, with a low around 51. |
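As one example of the kind of analysis this enables, the temperature strings can be turned into numbers. The sketch below rebuilds a three-row subset of the table by hand so that it runs on its own; the column names <code>temp_num</code> and <code>is_night</code> are made up for this example.

```python
import pandas as pd

# A hand-copied subset of the scraped table.
weather = pd.DataFrame({
    "period": ["Tonight", "Tuesday", "TuesdayNight"],
    "temp": ["Low: 53 °F", "High: 64 °F", "Low: 54 °F"],
})

# Pull the digits out of each temperature string and convert them to integers.
weather["temp_num"] = weather["temp"].str.extract(r"(\d+)", expand=False).astype(int)

# Night-time periods report a low rather than a high.
weather["is_night"] = weather["temp"].str.contains("Low")

# Average of the night-time lows.
mean_low = weather.loc[weather["is_night"], "temp_num"].mean()

print(weather)
print(mean_low)
```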


***
### Example 2. Pandas Web Scraping

pandas makes it easy to scrape a <code>table</code> tag on a web page. The function <code>read_html(url)</code> downloads the page and returns a list of DataFrames, one for each table it finds.

It is only suitable for fetching data stored in <code>table</code> elements, so let's see what kind of pages meet this condition.

***
#### HTML Tables

HTML tables allow web developers to arrange data into rows and columns:

In [32]:
%%HTML

<table>
  <tr>
    <th>Company</th>
    <th>Contact</th>
    <th>Country</th>
  </tr>
  <tr>
    <td>Apple</td>
    <td>Tim Cook</td>
    <td>United States</td>
  </tr>
  <tr>
    <td>Tencent</td>
    <td>Pony Ma</td>
    <td>China</td>
  </tr>
</table>

| Company | Contact | Country |
|---|---|---|
| Apple | Tim Cook | United States |
| Tencent | Pony Ma | China |


Open the page in a browser and press <code>F12</code> to view its HTML structure. Pages that qualify share a common feature: their data sits inside <code>table</code> tags.

If you find data in <code>table</code> format, you can use <code>pd.read_html()</code>.

***
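#### Sina Finance - Insider trading

Take the insider trading data on Sina Finance, which lists share purchases and sales by directors, supervisors, and senior executives (董监高), as an example. The columns include the stock code (股票代码), the stock name (股票名称), the person trading (变动人), the number of shares changed (变动 股数), and the date of the change (变动日期).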



In [33]:
url = 'http://vip.stock.finance.sina.com.cn/q/go.php/vInvestConsult/kind/nbjy/index.phtml'

df = pd.read_html(url)[0]

df

| | 股票代码 | 股票名称 | 变动人 | 变动类型 | 变动 股数 | 成交均价 | 变动金额(万元) | 变动后 持股数 | 变动原因 | 变动日期 | 持股种类 | 与董监高 关系 | 董监高 职务 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2145 | 中核钛白 | 沈鑫 | 购买 | 3350000 | 4.02 | 1346.7 | 14150000 | 竞价交易 | 2024-04-19 | A股 | 本人 | 非独立董事 |
| 1 | 688660 | 电气风电 | 吴改 | 购买 | 10200 | 3.25 | 3.32 | 40255 | 二级市场买卖 | 2024-04-19 | A股 | 本人 | 非独立董事 |
| 2 | 688660 | 电气风电 | 王勇 | 购买 | 10000 | 3.26 | 3.26 | 33248 | 二级市场买卖 | 2024-04-19 | A股 | 本人 | 审计委员会委员 |
| 3 | 688660 | 电气风电 | 王红春 | 购买 | 8000 | 3.24 | 2.59 | 33153 | 二级市场买卖 | 2024-04-19 | A股 | 本人 | 职工代表监事 |
| 4 | 688660 | 电气风电 | 乔银平 | 购买 | 12240 | 3.27 | 4.0 | 42540 | 二级市场买卖 | 2024-04-19 | A股 | 本人 | 董事 |
| 5 | 688660 | 电气风电 | 刘向楠 | 购买 | 9088 | 3.25 | 2.95 | 29376 | 二级市场买卖 | 2024-04-19 | A股 | 本人 | 职工监事 |
| 6 | 688660 | 电气风电 | 黄锋锋 | 购买 | 11000 | 3.25 | 3.58 | 40658 | 二级市场买卖 | 2024-04-19 | A股 | 本人 | 财务总监 |
| 7 | 688560 | 明冠新材 | 闫勇 | 购买 | 30000 | 11.32 | 33.96 | 830000 | 二级市场买卖 | 2024-04-19 | A股 | 本人 | 董事 |
| 8 | 603938 | 三孚股份 | 孙任靖 | 购买 | 5000 | 12.5 | 6.25 | 150409520 | 二级市场买卖 | 2024-04-19 | A股 | 本人 | 董事长 |
| 9 | 688176 | 亚虹医药 | PAN KE | 购买 | 155000 | 5.65 | 87.58 | 130103577 | 二级市场买卖 | 2024-04-19 | A股 | 本人 | 董事长 |
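Notice that the first stock code is shown as 2145: <code>read_html</code> parsed the column as numbers, so any leading zeros were dropped. One way to keep codes as text is the <code>converters</code> argument, sketched below on a small hand-written table. The English column names are stand-ins invented for this example, and like <code>read_html</code> itself the code assumes an HTML parser such as <code>lxml</code> is installed.

```python
from io import StringIO

import pandas as pd

# A tiny hand-written table; "code" and "price" are hypothetical column names.
html_doc = """
<table>
  <tr><th>code</th><th>price</th></tr>
  <tr><td>002145</td><td>4.02</td></tr>
  <tr><td>688660</td><td>3.25</td></tr>
</table>
"""

# Default parsing turns the codes into integers and loses the leading zeros.
default_df = pd.read_html(StringIO(html_doc))[0]

# converters maps a column label to a function applied to each cell's text.
text_df = pd.read_html(StringIO(html_doc), converters={"code": str})[0]

print(default_df["code"].tolist())
print(text_df["code"].tolist())
```

The same <code>converters</code> argument can be passed when reading from a URL.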


In [34]:
df = pd.DataFrame()

for i in range(1, 3):
    url = 'http://vip.stock.finance.sina.com.cn/q/go.php/vInvestConsult/kind/nbjy/index.phtml?p=%s' % i
    df = pd.concat([df, pd.read_html(url)[0]])
    print("Page %s completed!" % i)
    
df

Page 1 completed!
Page 2 completed!


| | 股票代码 | 股票名称 | 变动人 | 变动类型 | 变动 股数 | 成交均价 | 变动金额(万元) | 变动后 持股数 | 变动原因 | 变动日期 | 持股种类 | 与董监高 关系 | 董监高 职务 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2145 | 中核钛白 | 沈鑫 | 购买 | 3350000 | 4.02 | 1346.70 | 14150000 | 竞价交易 | 2024-04-19 | A股 | 本人 | 非独立董事 |
| 1 | 688660 | 电气风电 | 吴改 | 购买 | 10200 | 3.25 | 3.32 | 40255 | 二级市场买卖 | 2024-04-19 | A股 | 本人 | 非独立董事 |
| 2 | 688660 | 电气风电 | 王勇 | 购买 | 10000 | 3.26 | 3.26 | 33248 | 二级市场买卖 | 2024-04-19 | A股 | 本人 | 审计委员会委员 |
| 3 | 688660 | 电气风电 | 王红春 | 购买 | 8000 | 3.24 | 2.59 | 33153 | 二级市场买卖 | 2024-04-19 | A股 | 本人 | 职工代表监事 |
| 4 | 688660 | 电气风电 | 乔银平 | 购买 | 12240 | 3.27 | 4.00 | 42540 | 二级市场买卖 | 2024-04-19 | A股 | 本人 | 董事 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 35 | 603456 | 九洲药业 | 沙裕杰 | 购买 | 10000 | 16.05 | 16.05 | 90000 | 二级市场买卖 | 2024-04-11 | A股 | 本人 | 执行董事 |
| 36 | 688691 | 灿芯股份 | 庄志青 | -- | -- | -- | 0.00 | 3092850 | | 2024-04-11 | A股 | 本人 | 董事 |
| 37 | 688257 | 新锐股份 | 杨汉民 | 出售 | -80000 | 25 | 200.00 | 309608 | 二级市场买卖 | 2024-04-11 | A股 | 本人 | 事业部技术副总经理 |
| 38 | 872190 | 雷神科技 | 路凯林 | 购买 | 110000 | 16.36 | 179.96 | 643471 | 竞价交易 | 2024-04-11 | A股 | 本人 | 董事 |
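When many pages are combined, a common variation is to collect the per-page tables in a list, pause briefly between requests, and concatenate once at the end with a fresh index. The sketch below illustrates the pattern offline: <code>fetch_page</code> is a hypothetical stand-in for <code>pd.read_html(url)[0]</code> and simply returns a tiny fake table.

```python
import time

import pandas as pd


def fetch_page(page_number: int) -> pd.DataFrame:
    """Hypothetical stand-in for pd.read_html(url)[0]; returns a tiny fake page."""
    return pd.DataFrame({"page": [page_number] * 2, "row": [0, 1]})


pages = []
for i in range(1, 3):
    pages.append(fetch_page(i))
    # A short pause between requests is considerate toward the server;
    # a real scraper would typically wait a second or more.
    time.sleep(0.1)

# ignore_index=True renumbers the rows 0..n-1 instead of repeating
# each page's own index.
combined = pd.concat(pages, ignore_index=True)

print(combined)
```

Concatenating once at the end also avoids copying the growing DataFrame on every iteration.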
