{ "cells": [ { "cell_type": "markdown", "id": "4511df5d", "metadata": {}, "source": [ "# Lecture 8. Web Page and Crawler\n", "\n", "### Instructor: Luping Yu\n", "\n", "### April 23, 2024\n", "\n", "***\n", "## Web Page\n", "\n", "Before we start writing code, we need to understand a little bit about the structure of a web page.\n", "\n", "When we visit a web page, our web browser makes a request to a web server. This request is called a GET request, since we're getting files from the server. The server then sends back files that tell our browser how to render the page for us. These files will typically include:\n", "\n", "* HTML — the main content of the page.\n", "* CSS — used to add styling to make the page look nicer.\n", "* JS — Javascript files add interactivity to web pages.\n", "\n", "After our browser receives all the files, it **renders** the page and displays it to us.\n", "\n", "There's a lot that happens behind the scenes to render a page nicely, but we don't need to worry about most of it when we're web scraping. When we perform web scraping, we’re interested in the main content of the web page, so we look primarily at the HTML.\n", "\n", "***\n", "### What is HTML?\n", "* HTML stands for Hyper Text Markup Language\n", "* HTML is the standard **markup language** for creating Web pages\n", "* HTML describes the structure of a Web page\n", "* HTML consists of a series of **elements**\n", "* HTML elements tell the browser how to display the content\n", "* HTML elements label pieces of content such as \"this is a heading\", \"this is a paragraph\", \"this is a link\", etc.\n", "\n", "***\n", "### How to View HTML Source?\n", "\n", "View HTML Source Code:\n", "* Right-click in an HTML page and select \"View Page Source\" (in Chrome) or \"View Source\" (in Edge), or similar in other browsers. This will open a window containing the HTML source code of the page.\n", "\n", "Chrome Developer Tools F12:\n", "* More Tools ---> Developer Tools\n", "* Chrome DevTools is a set of web developer tools built directly into the Google Chrome browser.\n", "\n", "![avatar](https://raw.githubusercontent.com/lazydingding/gallery/main/20220509_f0.png)\n", "\n", "***\n", "### A Simple HTML Document (a.k.a. \"Page\")" ] }, { "cell_type": "code", "execution_count": 1, "id": "093ac56d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", " \n", " My page title\n", " \n", "\n", " \n", "

    <h1>Hello!</h1>\n",
"    <p>My second paragraph.</p>\n",
"  </body>\n",
"</html>\n"
], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [
"%%HTML\n",
"\n",
"<!DOCTYPE html>\n",
"<html>\n",
"  <head>\n",
"    <title>My page title</title>\n",
"  </head>\n",
"  <body>\n",
"    <h1>Hello!</h1>\n",
"    <p>My second paragraph.</p>\n",
"  </body>\n",
"</html>
\n", " \n", "\n", "" ] }, { "cell_type": "markdown", "id": "1318f27f", "metadata": {}, "source": [ "### Example Explained\n", "\n", "* The <!DOCTYPE html> declaration defines that this document is an HTML document\n", "* The <html> element is the root element of an HTML page\n", "* The <head> element contains meta information about the HTML page\n", "* The <title> element specifies a title for the HTML page (which is shown in the browser's title bar or in the page's tab)\n", "* The <body> element defines the document's body, and is a container for all the **visible** contents, such as headings, paragraphs, images, hyperlinks, tables, lists, etc.\n", "* The <h1> element defines a large heading\n", "* The <p> element defines a paragraph\n", "\n", "***\n", "### HTML Element\n", "\n", "An HTML element is defined by a **start tag**, some **content**, and an **end tag**:\n", "\n", "![avatar](https://raw.githubusercontent.com/lazydingding/gallery/main/20220509_f1.png)\n" ] }, { "cell_type": "markdown", "id": "ad62f207", "metadata": {}, "source": [ "The HTML **element** is everything from the start tag to the end tag:" ] }, { "cell_type": "code", "execution_count": 2, "id": "6e9ebfa8", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "

<h1>My First Heading</h1>\n",
"<p>My first paragraph.</p>\n"
], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [
"%%HTML\n",
"\n",
"<h1>My First Heading</h1>\n",
"<p>My first paragraph.</p>
" ] }, { "cell_type": "markdown", "id": "ad057de2", "metadata": {}, "source": [ "***\n", "#### 1. Empty HTML Elements\n", "* Empty elements (also called self-closing or void elements) are not container tags.\n", "* A typical example of an empty element, is the <br> element, which represents a line break. Some other common empty elements are <img>, <input>, etc." ] }, { "cell_type": "code", "execution_count": 3, "id": "26a78326", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "\n", "

This paragraph contains
a line break.

\n", "\"xmu\"\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%HTML\n", "\n", "

This paragraph contains
a line break.

\n", "\"xmu\"\n", "" ] }, { "cell_type": "markdown", "id": "382c336a", "metadata": {}, "source": [ "***\n", "#### 2. Nesting HTML Elements\n", "* Most HTML elements can contain any number of further elements, which are, in turn, made up of tags, attributes, and content or other elements.\n", "* The following example shows some elements nested inside the <p> element." ] }, { "cell_type": "code", "execution_count": 4, "id": "71be0b0b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "

<p><b>Here is some bold text.</b></p>\n",
"<p><em>Here is some emphasized text.</em></p>\n",
"<p><mark>Here is some highlighted text.</mark></p>\n"
], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [
"%%HTML\n",
"\n",
"<p><b>Here is some bold text.</b></p>\n",
"<p><em>Here is some emphasized text.</em></p>\n",
"<p><mark>Here is some highlighted text.</mark></p>
" ] }, { "cell_type": "markdown", "id": "c96320d9", "metadata": {}, "source": [ "***\n", "#### 3. HTML Links\n", "\n", "HTML links are defined with the <a> tag:\n", "* The link's destination is specified in the href attribute. \n", "* Attributes are used to provide **additional information** about HTML elements." ] }, { "cell_type": "code", "execution_count": 5, "id": "051866eb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "This is a link\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%HTML\n", "\n", "This is a link" ] }, { "cell_type": "markdown", "id": "4821c0c2", "metadata": {}, "source": [ "***\n", "#### 4. Writing Comments in HTML\n", "* Comments are usually added with the purpose of making the source code easier to understand.\n", "* You can also comment out part of your HTML code for debugging purpose.\n", "* An HTML comment begins with <!--, and ends with -->, as shown in the example below:" ] }, { "cell_type": "code", "execution_count": 6, "id": "6c5a2953", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "

"\n",
"<!-- This is a comment -->\n",
"<!-- <p>This paragraph is commented out for debugging.</p> -->\n",
"<p>This is a normal piece of text.</p>\n"
], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [
"%%HTML\n",
"\n",
"<!-- This is a comment -->\n",
"<!-- <p>This paragraph is commented out for debugging.</p> -->\n",
"<p>This is a normal piece of text.</p>
" ] }, { "cell_type": "markdown", "id": "9df922d6", "metadata": {}, "source": [ "***\n", "#### 5. HTML Elements Types\n", "\n", "The basic elements of an HTML page are:\n", "\n", "* A text header, denoted using the <h1>, <h2>, <h3>, <h4>, <h5>, <h6> tags.\n", "* A paragraph, denoted using the <p> tag.\n", "* A link, denoted using the <a> (anchor) tag.\n", "* A list, denoted using the <ul> (unordered list), <ol> (ordered list) and <li> (list element) tags.\n", "* An image, denoted using the <img> tag\n", "* A divider, denoted using the <div> tag\n", "* A text span, denoted using the <span> tag\n", "\n", "\n", "Elements can be placed in two distinct groups: **block level** and **inline level** elements. The former make up the document's structure, while the latter dress up the contents of a block.\n", "\n", "* A block element occupies 100% of the available width and it is rendered with a line break before and after. Whereas, an inline element will take up only as much space as it needs.\n", " * The most commonly used block-level elements are <div>, <p>, <h1> through <h6>, <form>, <ol>, <ul>, <li>, and so on. Whereas, the commonly used inline-level elements are <img>, <a>, <span>, <strong>, <b>, <em>, <i>, <code>, <input>, <button>, etc.\n", "* The block-level elements should not be placed within inline-level elements. For example, the <p> element should not be placed inside the <b> element.\n", "\n", "\n", "\n", "***\n", "### HTML Attributes\n", "\n", "Attributes define **additional characteristics or properties** of the element such as width and height of an image. \n", "* Attributes are always specified in the start tag (or opening tag) and usually consists of name/value pairs like name=\"value\".\n", " * Some attributes are required for certain elements. For instance, an <img> tag must contain a src and alt attributes.\n", "\n", "Let's take a look at some examples of the attributes usages:" ] }, { "cell_type": "code", "execution_count": 7, "id": "5c599515", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\"xmu\"\n", "Google\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%HTML\n", "\n", "\"xmu\"\n", "Google\n", "" ] }, { "cell_type": "markdown", "id": "e1c069bb", "metadata": {}, "source": [ "In the above example src inside the <img> tag is an attribute and image path provided is its value. Similarly href inside the <a> tag is an attribute and the link provided is its value, and so on.\n", " \n", "There are some **general purpose** attributes, such as id, title, class, style, etc. that you can use on the majority of HTML elements.\n", " \n", "***\n", "#### 1.The id Attribute\n", " \n", "The id attribute is used to give a unique identifier to an element within a document. This makes it easier to select the element using CSS or JavaScript.\n", "* The id of an element must be **unique** within a single document.\n", " * No two elements in the same document can be named with the same id.\n", " * Each element can have only one id." ] }, { "cell_type": "code", "execution_count": 8, "id": "cce551bb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
<div id=\"...\">Some content</div>\n",
"<p id=\"...\">This is a paragraph.</p>\n"
], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [
"%%HTML\n",
"\n",
"<div id=\"...\">Some content</div>\n",
"<p id=\"...\">This is a paragraph.</p>
" ] }, { "cell_type": "markdown", "id": "e86b141c", "metadata": {}, "source": [ "***\n", "#### 2. The class Attribute\n", "\n", "Like id attribute, the class attribute is also used to identify elements. But unlike id, the class attribute **does not have to be unique** in the document.\n", "* This means you can apply the same class to multiple elements in a document.\n", "* Any style rules that are written to that class will be applied to all the elements having that class." ] }, { "cell_type": "code", "execution_count": 9, "id": "21196d70", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "

<p class=\"...\">This is a paragraph.</p>\n",
"<p class=\"...\">This is a paragraph.</p>\n"
], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [
"%%HTML\n",
"\n",
"<p class=\"...\">This is a paragraph.</p>\n",
"<p class=\"...\">This is a paragraph.</p>
" ] }, { "cell_type": "markdown", "id": "78e2876f", "metadata": {}, "source": [ "***\n", "#### 3. The style Attribute\n", "\n", "HTML is quite limited when it comes to the presentation of a web page. It was originally designed as a simple way of presenting information.\n", "\n", "**CSS (Cascading Style Sheets)** was introduced in December 1996 by the World Wide Web Consortium (W3C) to provide a better way to style HTML elements.\n", "\n", "![avatar](https://github.com/lazydingding/gallery/blob/main/Screen%20Shot%202022-05-09%20at%2023.55.26.png?raw=true)\n", "\n", "The style attribute allows you to specify CSS styling rules such as color, font, border, etc. directly within the element. Let's check out an example to see how it works:" ] }, { "cell_type": "code", "execution_count": 10, "id": "bd4ca72d", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "\n", "

<p style=\"...\">This is a paragraph.</p>\n",
"<div style=\"...\">Some content</div>\n"
], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [
"%%HTML\n",
"\n",
"<p style=\"...\">This is a paragraph.</p>\n",
"<div style=\"...\">Some content</div>
" ] }, { "cell_type": "markdown", "id": "0850937a", "metadata": {}, "source": [ "***\n", "#### 4. HTML Attributes Types\n", "\n", "Each element can also have attributes - each element has a different set of attributes relevant to the element.\n", "\n", "There are a few global elements, the most common of them are:\n", "* id - Denotes the unique ID of an element in a page. Used for locating elements by using links, JavaScript, and more.\n", "* class - Denotes the CSS class of an element.\n", "* style - Denotes the CSS styles to apply to an element." ] }, { "cell_type": "markdown", "id": "6aed93df", "metadata": {}, "source": [ "***\n", "## Crawler\n", "\n", "Some websites offer data sets that are downloadable in CSV format, or accessible via an **Application Programming Interface (API)**. But many websites with useful data don't offer these convenient options.\n", "\n", "Web scraping is the process of **gathering information** from the Internet. Even copying and pasting the lyrics of your favorite song is a form of web scraping!\n", "\n", "The words \"web scraping\" usually refer to a process that involves automation. Some websites don't like it when automatic scrapers gather their data, while others don't mind.\n", "\n", "Common tools:\n", "* requests\n", "* BeautifulSoup\n", "* Selenium\n", "* Pandas.read_html()\n", "\n", "***\n", "### How Does Web Scraping Work?\n", "\n", "When we scrape the web, we write code that sends a request to the server that's hosting the page we specified. The server will return the source code — HTML, mostly — for the page (or pages) we requested.\n", "\n", "So far, we're essentially doing the same thing a web browser does — sending a server request with a specific URL and asking the server to return the code for that page.\n", "\n", "But unlike a web browser, our web scraping code won't interpret the page's source code and display the page visually. Instead, we'll write some custom code that filters through the page's source code looking for specific elements we've specified, and extracting whatever content we've instructed it to extract.\n", "\n", "For example, if we wanted to get all of the data from inside a table that was displayed on a web page, our code would be written to go through these steps in sequence:\n", "\n", "* Request the content (source code) of a specific URL from the server.\n", "* Download the content that is returned.\n", "* Identify the elements of the page that are part of the table we want.\n", "* Extract and (if necessary) reformat those elements into a dataset we can analyze or use in whatever way we require.\n", "\n", "Python and BeautifulSoup have built-in features designed to make this relatively straightforward.\n", "\n", "***\n", "### The requests library\n", "\n", "Now that we understand the structure of a web page, it's time to get into the fun part: scraping the content we want!\n", "\n", "The first thing we'll need to do to scrape a web page is to download the page. 
We can download pages using the Python requests library.\n", "\n", "The requests library will make a GET request to a web server, which will download the HTML contents of a given web page for us.\n", "\n", "Let's try downloading a simple [sample website](https://dataquestio.github.io/web-scraping-pages/simple.html):" ] }, { "cell_type": "code", "execution_count": 11, "id": "a4e32f5d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<Response [200]>" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import requests\n", "\n", "page = requests.get(\"https://dataquestio.github.io/web-scraping-pages/simple.html\")\n", "\n", "page" ] }, { "cell_type": "markdown", "id": "8ee39f1a", "metadata": {}, "source": [ "A status_code of 200 means that the page downloaded successfully.\n", "\n", "We won't fully dive into status codes here, but a status code starting with a 2 generally indicates success, and a code starting with a 4 or a 5 indicates an error.\n", "\n", "We can print out the HTML content of the page using the content property:" ] }, { "cell_type": "code", "execution_count": 12, "id": "7d853cfa", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "b'<!DOCTYPE html>\\n<html>\\n    <head>\\n        <title>A simple example page</title>\\n    </head>\\n    <body>\\n        <p>Here is some simple content for this page.</p>
\\n    </body>\\n</html>'" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "page.content" ] }, { "cell_type": "markdown", "id": "d8ff3c98", "metadata": {}, "source": [ "***\n", "### Parsing a page with BeautifulSoup\n", "\n", "As you can see above, we now have downloaded an HTML document.\n", "\n", "We can use the BeautifulSoup library to parse this document and extract the text from the p tag. To extract a single type of tag, we can use the find_all method, which will find all the instances of that tag on a page.\n", "\n", "We first have to import the library, and create an instance of the BeautifulSoup class to parse our document:" ] }, { "cell_type": "code", "execution_count": 13, "id": "9f1f4166", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [
"<!DOCTYPE html>\n",
"\n",
"<html>\n",
"<head>\n",
"<title>A simple example page</title>\n",
"</head>\n",
"<body>\n",
"<p>Here is some simple content for this page.</p>
\n", "\n", "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from bs4 import BeautifulSoup\n", "\n", "soup = BeautifulSoup(page.content, 'html.parser')\n", "\n", "soup" ] }, { "cell_type": "code", "execution_count": 14, "id": "14f3f661", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[

<p>Here is some simple content for this page.</p>
]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.find_all('p')" ] }, { "cell_type": "markdown", "id": "f4145e4a", "metadata": {}, "source": [ "Note that find_all returns a list, so we'll have to loop through, or use list indexing, it to extract text:" ] }, { "cell_type": "code", "execution_count": 15, "id": "5b6d61f7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Here is some simple content for this page.'" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.find_all('p')[0].get_text()" ] }, { "cell_type": "markdown", "id": "56cc9635", "metadata": {}, "source": [ "***\n", "### Searching for tags by class and id\n", "\n", "We introduced classes and ids earlier, but it probably wasn't clear why they were useful.\n", "\n", "Classes and ids are used by CSS to determine which HTML elements to apply certain styles to. But when we're scraping, we can also use them to specify the elements we want to scrape.\n", "\n", "To illustrate this principle, we'll work with another [sample website](https://dataquestio.github.io/web-scraping-pages/ids_and_classes.html):" ] }, { "cell_type": "code", "execution_count": 16, "id": "a92c5cbd", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "\n", "\n", "A simple example page\n", "\n", "\n", "
\n", "

\n", " First paragraph.\n", "

\n", "

\n", " Second paragraph.\n", "

\n", "
\n", "

\n", "\n", " First outer paragraph.\n", " \n", "

\n", "

\n", "\n", " Second outer paragraph.\n", " \n", "

\n", "\n", "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "page = requests.get(\"https://dataquestio.github.io/web-scraping-pages/ids_and_classes.html\")\n", "soup = BeautifulSoup(page.content, 'html.parser')\n", "soup" ] }, { "cell_type": "markdown", "id": "16138d44", "metadata": {}, "source": [ "Now, we can use the find_all method to search for items by class or by id. Let's look for any tag that has the class outer-text:" ] }, { "cell_type": "code", "execution_count": 17, "id": "68938e0d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[

\n", " \n", " First outer paragraph.\n", " \n", "

,\n", "

\n", " \n", " Second outer paragraph.\n", " \n", "

]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.find_all(class_='outer-text')" ] }, { "cell_type": "markdown", "id": "8006b16f", "metadata": {}, "source": [ "We can also search for elements by id:" ] }, { "cell_type": "code", "execution_count": 18, "id": "1c3cbb33", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[

\n", " First paragraph.\n", "

]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.find_all(id=\"first\")" ] }, { "cell_type": "markdown", "id": "bbfe68c4", "metadata": {}, "source": [ "***\n", "### Using CSS Selectors\n", "\n", "We can also search for items using CSS selectors. These selectors are how the CSS language allows developers to specify HTML tags to style. Here are some examples:\n", "\n", "* p a — finds all a tags inside of a p tag.\n", "* body p a — finds all a tags inside of a p tag inside of a body tag.\n", "* html body — finds all body tags inside of an html tag.\n", "* p.outer-text — finds all p tags with a class of outer-text.\n", "* p#first — finds all p tags with an id of first.\n", "* body p.outer-text — finds any p tags with a class of outer-text inside of a body tag.\n", "\n", "BeautifulSoup objects support searching a page via CSS selectors using the select method. We can use CSS selectors to find all the p tags in our page that are inside of a div like this:" ] }, { "cell_type": "code", "execution_count": 19, "id": "ecc27ce5", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[

\n", " First paragraph.\n", "

,\n", "

\n", " Second paragraph.\n", "

]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.select(\"div p\")" ] }, { "cell_type": "markdown", "id": "861bf0e0", "metadata": {}, "source": [ "***\n", "### Example 1. Downloading weather data\n", "\n", "We now know enough to proceed with extracting information about the local weather from the National Weather Service website.\n", "\n", "The first step is to find the page we want to scrape. We'll extract weather information about downtown San Francisco from [this page](https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168#.YnnblBNBzpY). The page has information about the extended forecast for the next week, including time of day, temperature, and a brief description of the conditions.\n", "\n", "Specifically, let's extract data about the extended forecast.\n", "\n", "***\n", "#### 1. Exploring page structure with Chrome DevTools\n", "\n", "The first thing we'll need to do is inspect the page using Chrome Devtools. If you’re using another browser, Firefox and Safari have equivalents.\n", "\n", "You can start the developer tools in Chrome by clicking View -> Developer -> Developer Tools. Make sure the Elements panel is highlighted.\n", "\n", "The elements panel will show you all the HTML tags on the page, and let you navigate through them. It's a really handy feature!\n", "\n", "We can then scroll up in the elements panel to find the \"outermost\" element that contains all of the text that corresponds to the extended forecasts. In this case, it's a div tag with the id seven-day-forecast.\n", "\n", "If we click around on the console, and explore the div, we'll discover that each forecast item (like \"Tonight\", \"Thursday\", and \"Thursday Night\") is contained in a div with the class tombstone-container.\n", "\n", "***\n", "#### 2. Time to Start Scraping!\n", "\n", "We now know enough to download the page and start parsing it. In the below code, we will:\n", "\n", "* Download the web page containing the forecast.\n", "* Create a BeautifulSoup class to parse the page.\n", "* Find the div with id seven-day-forecast, and assign to seven_day\n", "* Inside seven_day, find each individual forecast item.\n", "* Extract and print the first forecast item." ] }, { "cell_type": "code", "execution_count": 20, "id": "2784ed76", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", "\n", "\n", "\n", "\n", "\n", "National Weather Service\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "
\n", "\"National\n", "\"National\n", "\"United\n", "
\n", "\n", "
\n", "\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "
\n", "
\n", "
Sorry, the location you searched for was not found. Please try another search.
\n", "
Multiple locations were found. Please select one of the following:
\n", "
\n", "\n", "
\n", "
\n", "

Your local forecast office is

\n", "

\n", "
\n", "
\n", "
\n", "\n", "\n", "
\n", "
\n", "
\n", "
\n", "

Critical Fire Weather Conditions Continue Across the Plains

\n", "

\n", " Locally critical fire weather conditions continue for portions of the Northern/Central Plains and much of northern lower Michigan Monday evening. Red Flag Warnings are currently in effect. Dry conditions and gusty winds will persist across southern Colorado Tuesday. \n", "\n", " Read More >\n", "

\n", "
\n", "
\n", "
\n", "\n", "\n", "
\n", "En Español\n", "
\n", "\n", "
\n", "\n", "
\n", "
\n", "Current conditions at\n", "

SAN FRANCISCO DOWNTOWN (SFOC1)

\n", "Lat: 37.77056°NLon: 122.42694°WElev: 150.0ft.\n", "
\n", "
\n", "
\n", "\n", "
\n", "

NA

\n", "

55°F

\n", "

13°C

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Humidity76%
Wind SpeedNA NA MPH
BarometerNA
Dewpoint47°F (8°C)
VisibilityNA
Last update\n", " 22 Apr 08:43 PM PDT
\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "
\n", "
\n", "Extended Forecast for\n", "

\n", " San Francisco CA

\n", "
\n", "
\n", "
  • \n", "
    \n", "

    Tonight

    \n", "

    \"Tonight:

    Mostly Cloudy

    Low: 53 °F

  • \n", "
    \n", "

    Tuesday

    \n", "

    \"Tuesday:

    Decreasing
    Clouds

    High: 64 °F

  • \n", "
    \n", "

    Tuesday
    Night

    \n", "

    \"Tuesday

    Mostly Cloudy

    Low: 54 °F

  • \n", "
    \n", "

    Wednesday

    \n", "

    \"Wednesday:

    Partly Sunny

    High: 64 °F

  • \n", "
    \n", "

    Wednesday
    Night

    \n", "

    \"Wednesday

    Partly Cloudy

    Low: 52 °F

  • \n", "
    \n", "

    Thursday

    \n", "

    \"Thursday:

    Partly Sunny

    High: 63 °F

  • \n", "
    \n", "

    Thursday
    Night

    \n", "

    \"Thursday

    Mostly Cloudy

    Low: 54 °F

  • \n", "
    \n", "

    Friday

    \n", "

    \"Friday:

    Mostly Sunny

    High: 64 °F

  • \n", "
    \n", "

    Friday
    Night

    \n", "

    \"Friday

    Mostly Clear

    Low: 51 °F

\n", "
\n", "
\n", "\n", "
\n", "\n", "
\n", "\n", "
\n", "
\n", "

Detailed Forecast

\n", "
\n", "
\n", "
Tonight
Mostly cloudy, with a low around 53. South southwest wind around 10 mph.
Tuesday
Mostly cloudy, then gradually becoming sunny, with a high near 64. South southwest wind 8 to 14 mph, with gusts as high as 20 mph.
Tuesday Night
Mostly cloudy, with a low around 54. South southwest wind 6 to 10 mph.
Wednesday
Partly sunny, with a high near 64. South wind 6 to 10 mph.
Wednesday Night
Partly cloudy, with a low around 52. West wind 7 to 10 mph.
Thursday
Partly sunny, with a high near 63.
Thursday Night
Mostly cloudy, with a low around 54.
Friday
Mostly sunny, with a high near 64.
Friday Night
Mostly clear, with a low around 51.
Saturday
Sunny, with a high near 64.
Saturday Night
Mostly clear, with a low around 51.
Sunday
Sunny, with a high near 65.
Sunday Night
Mostly clear, with a low around 52.
Monday
Sunny, with a high near 65.
\n", "
\n", "\n", "\n", "
\n", "
\n", "

Additional Forecasts and Information

\n", "
\n", "\n", "
\n", "
\n", "\n", "
\n", "
\n", "\n", "\n", "\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\n", "
\n", "
\n", "
\n", " Click Map For Forecast\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n", "
\n", "
\n", "\"Map\n", "
\n", "
\n", "\n", "\n", "\n", "
\n", "
\n", "
Point Forecast:
\n", "
San Francisco CA
 37.77°N 122.41°W (Elev. 131 ft)
\n", "
\n", "
\n", "\n", "
5:53 pm PDT Apr 22, 2024
\n", "
\n", "
\n", "\n", "
9pm PDT Apr 22, 2024-6pm PDT Apr 29, 2024
\n", "
\n", "
\n", "
 
\n", "\n", "
\n", "
\n", "
 
\n", "
\n", "\"Get\n", "\"Get\n", "
\n", "
\n", "
\n", "\n", "
\n", "\n", "
\n", "
\n", "

Additional Resources

\n", "
\n", "
\n", "\n", "
\n", "

Radar & Satellite Image

\n", "\"Link \"Link
\n", "\n", "\n", "
\n", "

Hourly Weather Forecast

\n", "\n", "
\n", "\n", "\n", "
\n", "

National Digital Forecast Database

\n", "
\n", "\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "page = requests.get(\"https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168\")\n", "soup = BeautifulSoup(page.content, 'html.parser')\n", "\n", "soup" ] }, { "cell_type": "code", "execution_count": 21, "id": "89a10ca9", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "
\n", "
\n", "Extended Forecast for\n", "

\n", " San Francisco CA

\n", "
\n", "
\n", "
  • \n", "
    \n", "

    Tonight

    \n", "

    \"Tonight:

    Mostly Cloudy

    Low: 53 °F

  • \n", "
    \n", "

    Tuesday

    \n", "

    \"Tuesday:

    Decreasing
    Clouds

    High: 64 °F

  • \n", "
    \n", "

    Tuesday
    Night

    \n", "

    \"Tuesday

    Mostly Cloudy

    Low: 54 °F

  • \n", "
    \n", "

    Wednesday

    \n", "

    \"Wednesday:

    Partly Sunny

    High: 64 °F

  • \n", "
    \n", "

    Wednesday
    Night

    \n", "

    \"Wednesday

    Partly Cloudy

    Low: 52 °F

  • \n", "
    \n", "

    Thursday

    \n", "

    \"Thursday:

    Partly Sunny

    High: 63 °F

  • \n", "
    \n", "

    Thursday
    Night

    \n", "

    \"Thursday

    Mostly Cloudy

    Low: 54 °F

  • \n", "
    \n", "

    Friday

    \n", "

    \"Friday:

    Mostly Sunny

    High: 64 °F

  • \n", "
    \n", "

    Friday
    Night

    \n", "

    \"Friday

    Mostly Clear

    Low: 51 °F

\n", "
\n", "
" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "seven_day = soup.find(id=\"seven-day-forecast\")\n", "\n", "seven_day" ] }, { "cell_type": "code", "execution_count": 22, "id": "8e52269d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[
\n", "

Tonight

\n", "

\"Tonight:

Mostly Cloudy

Low: 53 °F

,\n", "
\n", "

Tuesday

\n", "

\"Tuesday:

Decreasing
Clouds

High: 64 °F

,\n", "
\n", "

Tuesday
Night

\n", "

\"Tuesday

Mostly Cloudy

Low: 54 °F

,\n", "
\n", "

Wednesday

\n", "

\"Wednesday:

Partly Sunny

High: 64 °F

,\n", "
\n", "

Wednesday
Night

\n", "

\"Wednesday

Partly Cloudy

Low: 52 °F

,\n", "
\n", "

Thursday

\n", "

\"Thursday:

Partly Sunny

High: 63 °F

,\n", "
\n", "

Thursday
Night

\n", "

\"Thursday

Mostly Cloudy

Low: 54 °F

,\n", "
\n", "

Friday

\n", "

\"Friday:

Mostly Sunny

High: 64 °F

,\n", "
\n", "

Friday
Night

\n", "

\"Friday

Mostly Clear

Low: 51 °F

]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "forecast_items = seven_day.find_all(class_=\"tombstone-container\")\n", "\n", "forecast_items" ] }, { "cell_type": "code", "execution_count": 23, "id": "4ea1336c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "
\n", "

Tonight

\n", "

\"Tonight:

Mostly Cloudy

Low: 53 °F

" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tonight = forecast_items[0]\n", "\n", "tonight" ] }, { "cell_type": "markdown", "id": "f7371f36", "metadata": {}, "source": [ "***\n", "#### 3. Extracting information from the page\n", "\n", "As we can see, inside the forecast item tonight is all the information we want. There are four pieces of information we can extract:\n", "\n", "* The name of the forecast item — in this case, Tonight.\n", "* The description of the conditions — this is stored in the title property of img.\n", "* A short description of the conditions — in this case, Mostly Cloudy.\n", "* The temperature low — in this case, 53 degrees.\n", "\n", "\n", "We'll extract the name of the forecast item, the short description, and the temperature first, since they're all similar:" ] }, { "cell_type": "code", "execution_count": 24, "id": "8480048d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "
\n", "

Tonight

\n", "

\"Tonight:

Mostly Cloudy

Low: 53 °F

" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tonight" ] }, { "cell_type": "code", "execution_count": 25, "id": "f1711be2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tonight\n", "Mostly Cloudy\n", "Low: 53 °F\n" ] } ], "source": [ "period = tonight.find(class_=\"period-name\").get_text()\n", "short_desc = tonight.find(class_=\"short-desc\").get_text()\n", "temp = tonight.find(class_=\"temp\").get_text()\n", "print(period)\n", "print(short_desc)\n", "print(temp)" ] }, { "cell_type": "markdown", "id": "89162747", "metadata": {}, "source": [ "Now, we can extract the title attribute from the img tag. To do this, we just treat the BeautifulSoup object like a dictionary, and pass in the attribute we want as a key:" ] }, { "cell_type": "code", "execution_count": 26, "id": "b8c0dd2a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "
\n", "

Tonight

\n", "

\"Tonight:

Mostly Cloudy

Low: 53 °F

" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tonight" ] }, { "cell_type": "code", "execution_count": 27, "id": "7fe928b5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tonight: Mostly cloudy, with a low around 53. South southwest wind around 10 mph. \n" ] } ], "source": [ "img = tonight.find(\"img\")\n", "desc = img['alt']\n", "print(desc)" ] }, { "cell_type": "markdown", "id": "fb6e2e43", "metadata": {}, "source": [ "***\n", "#### 4. Extracting all the information from the page\n", "Now that we know how to extract each individual piece of information, we can combine our knowledge with CSS selectors and list comprehensions to **extract everything at once**.\n", "\n", "In the below code, we will:\n", "\n", "* Select all items with the class period-name inside an item with the class tombstone-container in seven_day.\n", "* Use a list comprehension to call the get_text method on each BeautifulSoup object." ] }, { "cell_type": "code", "execution_count": 28, "id": "dd0f3bc0", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[

Tonight

,\n", "

Tuesday

,\n", "

Tuesday
Night

,\n", "

Wednesday

,\n", "

Wednesday
Night

,\n", "

Thursday

,\n", "

Thursday
Night

,\n", "

Friday

,\n", "

Friday
Night

]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "period_tags = seven_day.select(\".tombstone-container .period-name\")\n", "\n", "period_tags" ] }, { "cell_type": "code", "execution_count": 29, "id": "dca45725", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Tonight',\n", " 'Tuesday',\n", " 'TuesdayNight',\n", " 'Wednesday',\n", " 'WednesdayNight',\n", " 'Thursday',\n", " 'ThursdayNight',\n", " 'Friday',\n", " 'FridayNight']" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "period_tags = seven_day.select(\".tombstone-container .period-name\")\n", "periods = [pt.get_text() for pt in period_tags]\n", "periods" ] }, { "cell_type": "markdown", "id": "10d3d6b0", "metadata": {}, "source": [ "As we can see above, our technique gets us each of the period names, in order. We can apply the same technique to get the other three fields:" ] }, { "cell_type": "code", "execution_count": 30, "id": "a2212799", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Mostly Cloudy', 'DecreasingClouds', 'Mostly Cloudy', 'Partly Sunny', 'Partly Cloudy', 'Partly Sunny', 'Mostly Cloudy', 'Mostly Sunny', 'Mostly Clear']\n", "['Low: 53 °F', 'High: 64 °F', 'Low: 54 °F', 'High: 64 °F', 'Low: 52 °F', 'High: 63 °F', 'Low: 54 °F', 'High: 64 °F', 'Low: 51 °F']\n", "['Tonight: Mostly cloudy, with a low around 53. South southwest wind around 10 mph. ', 'Tuesday: Mostly cloudy, then gradually becoming sunny, with a high near 64. South southwest wind 8 to 14 mph, with gusts as high as 20 mph. ', 'Tuesday Night: Mostly cloudy, with a low around 54. South southwest wind 6 to 10 mph. ', 'Wednesday: Partly sunny, with a high near 64. South wind 6 to 10 mph. ', 'Wednesday Night: Partly cloudy, with a low around 52. West wind 7 to 10 mph. ', 'Thursday: Partly sunny, with a high near 63.', 'Thursday Night: Mostly cloudy, with a low around 54.', 'Friday: Mostly sunny, with a high near 64.', 'Friday Night: Mostly clear, with a low around 51.']\n" ] } ], "source": [ "short_descs = [sd.get_text() for sd in seven_day.select(\".tombstone-container .short-desc\")]\n", "temps = [t.get_text() for t in seven_day.select(\".tombstone-container .temp\")]\n", "descs = [d[\"title\"] for d in seven_day.select(\".tombstone-container img\")]\n", "print(short_descs)\n", "print(temps)\n", "print(descs)" ] }, { "cell_type": "markdown", "id": "de5fecf2", "metadata": {}, "source": [ "***\n", "#### 5. Combining our data into a Pandas Dataframe\n", "\n", "We can now combine the data into a Pandas DataFrame and analyze it." ] }, { "cell_type": "code", "execution_count": 31, "id": "10d411e1", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
periodshort_desctempdesc
0TonightMostly CloudyLow: 53 °FTonight: Mostly cloudy, with a low around 53. ...
1TuesdayDecreasingCloudsHigh: 64 °FTuesday: Mostly cloudy, then gradually becomin...
2TuesdayNightMostly CloudyLow: 54 °FTuesday Night: Mostly cloudy, with a low aroun...
3WednesdayPartly SunnyHigh: 64 °FWednesday: Partly sunny, with a high near 64. ...
4WednesdayNightPartly CloudyLow: 52 °FWednesday Night: Partly cloudy, with a low aro...
5ThursdayPartly SunnyHigh: 63 °FThursday: Partly sunny, with a high near 63.
6ThursdayNightMostly CloudyLow: 54 °FThursday Night: Mostly cloudy, with a low arou...
7FridayMostly SunnyHigh: 64 °FFriday: Mostly sunny, with a high near 64.
8FridayNightMostly ClearLow: 51 °FFriday Night: Mostly clear, with a low around 51.
\n", "
" ], "text/plain": [ " period short_desc temp \\\n", "0 Tonight Mostly Cloudy Low: 53 °F \n", "1 Tuesday DecreasingClouds High: 64 °F \n", "2 TuesdayNight Mostly Cloudy Low: 54 °F \n", "3 Wednesday Partly Sunny High: 64 °F \n", "4 WednesdayNight Partly Cloudy Low: 52 °F \n", "5 Thursday Partly Sunny High: 63 °F \n", "6 ThursdayNight Mostly Cloudy Low: 54 °F \n", "7 Friday Mostly Sunny High: 64 °F \n", "8 FridayNight Mostly Clear Low: 51 °F \n", "\n", " desc \n", "0 Tonight: Mostly cloudy, with a low around 53. ... \n", "1 Tuesday: Mostly cloudy, then gradually becomin... \n", "2 Tuesday Night: Mostly cloudy, with a low aroun... \n", "3 Wednesday: Partly sunny, with a high near 64. ... \n", "4 Wednesday Night: Partly cloudy, with a low aro... \n", "5 Thursday: Partly sunny, with a high near 63. \n", "6 Thursday Night: Mostly cloudy, with a low arou... \n", "7 Friday: Mostly sunny, with a high near 64. \n", "8 Friday Night: Mostly clear, with a low around 51. " ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "weather = pd.DataFrame({\"period\": periods,\n", " \"short_desc\": short_descs,\n", " \"temp\": temps,\n", " \"desc\":descs\n", "})\n", "\n", "weather" ] }, { "cell_type": "markdown", "id": "6dfbf7f7", "metadata": {}, "source": [ "***\n", "### Example 2. Pandas Web Scraping\n", "\n", "Pandas makes it easy to scrape a table tag on a web page. You can use the function read_html(url) to get webpage contents.\n", "\n", "It's only suitable for fetching Table type data, then let's see what kind of pages meet the conditions?\n", "\n", "***\n", "#### HTML Tables\n", "\n", "HTML tables allow web developers to arrange data into rows and columns:" ] }, { "cell_type": "code", "execution_count": 32, "id": "870f580a", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
<table>\n",
"  <tr><th>Company</th><th>Contact</th><th>Country</th></tr>\n",
"  <tr><td>Apple</td><td>Tim Cook</td><td>United States</td></tr>\n",
"  <tr><td>Tencent</td><td>Pony Ma</td><td>China</td></tr>\n",
"</table>\n"
], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [
"%%HTML\n",
"\n",
"<table>\n",
"  <tr><th>Company</th><th>Contact</th><th>Country</th></tr>\n",
"  <tr><td>Apple</td><td>Tim Cook</td><td>United States</td></tr>\n",
"  <tr><td>Tencent</td><td>Pony Ma</td><td>China</td></tr>\n",
"</table>"
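Besides URLs, read_html also accepts file-like objects, so a small table like the one above can be parsed directly from a string. A minimal sketch (the HTML string simply repeats the table shown above):

```python
import pandas as pd
from io import StringIO

html = """
<table>
  <tr><th>Company</th><th>Contact</th><th>Country</th></tr>
  <tr><td>Apple</td><td>Tim Cook</td><td>United States</td></tr>
  <tr><td>Tencent</td><td>Pony Ma</td><td>China</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> found
tables = pd.read_html(StringIO(html))
print(tables[0])
```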
" ] }, { "cell_type": "markdown", "id": "b2c0afe6", "metadata": {}, "source": [ "Open the web version with a browser, F12 view the HTML structure of the web page, you will find that the structure of the eligible web page has a common feature.\n", "\n", "If you find Table format, you can use pd.read_html( )\n", "\n", "***\n", "#### Sina finance - Institutional ownership\n", "\n", "Take the aggregate shareholding data of Sina Financial institutions as an example:\n", "\n" ] }, { "cell_type": "code", "execution_count": 33, "id": "a4419283", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
股票代码股票名称变动人变动类型变动 股数成交均价变动金额(万元)变动后 持股数变动原因变动日期持股种类与董监高 关系董监高 职务
02145中核钛白沈鑫购买33500004.021346.7014150000竞价交易2024-04-19A股本人非独立董事
1688660电气风电吴改购买102003.253.3240255二级市场买卖2024-04-19A股本人非独立董事
2688660电气风电王勇购买100003.263.2633248二级市场买卖2024-04-19A股本人审计委员会委员
3688660电气风电王红春购买80003.242.5933153二级市场买卖2024-04-19A股本人职工代表监事
4688660电气风电乔银平购买122403.274.0042540二级市场买卖2024-04-19A股本人董事
5688660电气风电刘向楠购买90883.252.9529376二级市场买卖2024-04-19A股本人职工监事
6688660电气风电黄锋锋购买110003.253.5840658二级市场买卖2024-04-19A股本人财务总监
7688560明冠新材闫勇购买3000011.3233.96830000二级市场买卖2024-04-19A股本人董事
8603938三孚股份孙任靖购买500012.56.25150409520二级市场买卖2024-04-19A股本人董事长
9688176亚虹医药PAN KE购买1550005.6587.58130103577二级市场买卖2024-04-19A股本人董事长
102145中核钛白沈鑫购买108000003.984298.4010800000竞价交易2024-04-18A股本人非独立董事
112286保龄宝张国刚购买89005.544.9355900竞价交易2024-04-18A股本人副总经理
12688560明冠新材闫勇购买20000011.58231.60800000二级市场买卖2024-04-18A股本人董事
13603031安孚科技梁红颖送、转945000.0030450分红送转2024-04-18A股本人副总经理
14603031安孚科技刘荣海送、转3393000.00109330分红送转2024-04-18A股本人非独立董事
15603031安孚科技王晓飞送、转553500.0017835分红送转2024-04-18A股本人副总经理
16688311盟升电子陈英回购注销-1050016.3717.1910500股权激励回购注销实施2024-04-18A股本人副总经理
17688311盟升电子毛钢烈回购注销-1344016.3722.0013440股权激励回购注销实施2024-04-18A股本人非独立董事
18688311盟升电子覃光全回购注销-1470016.3724.0614700股权激励回购注销实施2024-04-18A股本人非独立董事
19688311盟升电子向荣回购注销-2730016.3744.694467288股权激励回购注销实施2024-04-18A股本人董事长
20688176亚虹医药PAN KE购买3000005.79173.70129948577二级市场买卖2024-04-18A股本人董事长
21688176亚虹医药杨明远购买185005.810.7318500二级市场买卖2024-04-18A股本人非独立董事
22688169石头科技张磊出售-50038519.253768二级市场买卖2024-04-18A股本人核心技术人员
23600576祥源文旅陈亚文购买60005.63.36148500二级市场买卖2024-04-17A股本人监事
242640跨境通杨建新出售-17567002.24393.50150859148竞价交易2024-04-17A股本人董事长
25688628优利德孙乔------0.00不详二级市场买卖2024-04-17A股本人核心技术人员
26688560明冠新材闫勇购买5707011.7667.11600000二级市场买卖2024-04-17A股本人董事
27600576祥源文旅陈云钊购买90005.555.0030300二级市场买卖2024-04-16A股本人职工代表监事
28600576祥源文旅王琦购买186005.379.9988500二级市场买卖2024-04-16A股本人职工监事
29600576祥源文旅徐中平购买203005.4211.00306300二级市场买卖2024-04-16A股本人非独立董事
30600576祥源文旅詹纯伟购买46005.352.46147000二级市场买卖2024-04-16A股本人监事
31688183生益电子王小平购买380008.8633.6738000二级市场买卖2024-04-16A股本人核心技术人员
322286保龄宝王强购买66005.143.39107400竞价交易2024-04-16A股本人非独立董事
332286保龄宝张国刚购买100005.155.1547000竞价交易2024-04-16A股本人副总经理
342640跨境通杨建新出售-25388002.17550.92152615848竞价交易2024-04-16A股本人董事长
35688560明冠新材闫勇购买14293011.16159.51542930二级市场买卖2024-04-16A股本人董事
363013地铁设计廖景购买7800014.1109.9878000竞价交易2024-04-16A股本人审计委员会委员
37605222起帆电缆周桂幸其它-2100000014.7330933.0063991400其他2024-04-16A股本人副董事长
38301021英诺激光张勇购买630015.029.4612200竞价交易2024-04-16A股本人副总经理
39688538和辉光电李凤玲购买1150002.124.15735000二级市场买卖2024-04-16A股本人董事会秘书
\n", "
" ], "text/plain": [ " 股票代码 股票名称 变动人 变动类型 变动 股数 成交均价 变动金额(万元) 变动后 持股数 \\\n", "0 2145 中核钛白 沈鑫 购买 3350000 4.02 1346.70 14150000 \n", "1 688660 电气风电 吴改 购买 10200 3.25 3.32 40255 \n", "2 688660 电气风电 王勇 购买 10000 3.26 3.26 33248 \n", "3 688660 电气风电 王红春 购买 8000 3.24 2.59 33153 \n", "4 688660 电气风电 乔银平 购买 12240 3.27 4.00 42540 \n", "5 688660 电气风电 刘向楠 购买 9088 3.25 2.95 29376 \n", "6 688660 电气风电 黄锋锋 购买 11000 3.25 3.58 40658 \n", "7 688560 明冠新材 闫勇 购买 30000 11.32 33.96 830000 \n", "8 603938 三孚股份 孙任靖 购买 5000 12.5 6.25 150409520 \n", "9 688176 亚虹医药 PAN KE 购买 155000 5.65 87.58 130103577 \n", "10 2145 中核钛白 沈鑫 购买 10800000 3.98 4298.40 10800000 \n", "11 2286 保龄宝 张国刚 购买 8900 5.54 4.93 55900 \n", "12 688560 明冠新材 闫勇 购买 200000 11.58 231.60 800000 \n", "13 603031 安孚科技 梁红颖 送、转 9450 0 0.00 30450 \n", "14 603031 安孚科技 刘荣海 送、转 33930 0 0.00 109330 \n", "15 603031 安孚科技 王晓飞 送、转 5535 0 0.00 17835 \n", "16 688311 盟升电子 陈英 回购注销 -10500 16.37 17.19 10500 \n", "17 688311 盟升电子 毛钢烈 回购注销 -13440 16.37 22.00 13440 \n", "18 688311 盟升电子 覃光全 回购注销 -14700 16.37 24.06 14700 \n", "19 688311 盟升电子 向荣 回购注销 -27300 16.37 44.69 4467288 \n", "20 688176 亚虹医药 PAN KE 购买 300000 5.79 173.70 129948577 \n", "21 688176 亚虹医药 杨明远 购买 18500 5.8 10.73 18500 \n", "22 688169 石头科技 张磊 出售 -500 385 19.25 3768 \n", "23 600576 祥源文旅 陈亚文 购买 6000 5.6 3.36 148500 \n", "24 2640 跨境通 杨建新 出售 -1756700 2.24 393.50 150859148 \n", "25 688628 优利德 孙乔 -- -- -- 0.00 不详 \n", "26 688560 明冠新材 闫勇 购买 57070 11.76 67.11 600000 \n", "27 600576 祥源文旅 陈云钊 购买 9000 5.55 5.00 30300 \n", "28 600576 祥源文旅 王琦 购买 18600 5.37 9.99 88500 \n", "29 600576 祥源文旅 徐中平 购买 20300 5.42 11.00 306300 \n", "30 600576 祥源文旅 詹纯伟 购买 4600 5.35 2.46 147000 \n", "31 688183 生益电子 王小平 购买 38000 8.86 33.67 38000 \n", "32 2286 保龄宝 王强 购买 6600 5.14 3.39 107400 \n", "33 2286 保龄宝 张国刚 购买 10000 5.15 5.15 47000 \n", "34 2640 跨境通 杨建新 出售 -2538800 2.17 550.92 152615848 \n", "35 688560 明冠新材 闫勇 购买 142930 11.16 159.51 542930 \n", "36 3013 地铁设计 廖景 购买 78000 14.1 109.98 78000 \n", "37 605222 起帆电缆 周桂幸 其它 -21000000 14.73 30933.00 63991400 \n", "38 301021 英诺激光 张勇 购买 6300 15.02 9.46 12200 \n", "39 688538 和辉光电 李凤玲 购买 115000 2.1 24.15 735000 \n", "\n", " 变动原因 变动日期 持股种类 与董监高 关系 董监高 职务 \n", "0 竞价交易 2024-04-19 A股 本人 非独立董事 \n", "1 二级市场买卖 2024-04-19 A股 本人 非独立董事 \n", "2 二级市场买卖 2024-04-19 A股 本人 审计委员会委员 \n", "3 二级市场买卖 2024-04-19 A股 本人 职工代表监事 \n", "4 二级市场买卖 2024-04-19 A股 本人 董事 \n", "5 二级市场买卖 2024-04-19 A股 本人 职工监事 \n", "6 二级市场买卖 2024-04-19 A股 本人 财务总监 \n", "7 二级市场买卖 2024-04-19 A股 本人 董事 \n", "8 二级市场买卖 2024-04-19 A股 本人 董事长 \n", "9 二级市场买卖 2024-04-19 A股 本人 董事长 \n", "10 竞价交易 2024-04-18 A股 本人 非独立董事 \n", "11 竞价交易 2024-04-18 A股 本人 副总经理 \n", "12 二级市场买卖 2024-04-18 A股 本人 董事 \n", "13 分红送转 2024-04-18 A股 本人 副总经理 \n", "14 分红送转 2024-04-18 A股 本人 非独立董事 \n", "15 分红送转 2024-04-18 A股 本人 副总经理 \n", "16 股权激励回购注销实施 2024-04-18 A股 本人 副总经理 \n", "17 股权激励回购注销实施 2024-04-18 A股 本人 非独立董事 \n", "18 股权激励回购注销实施 2024-04-18 A股 本人 非独立董事 \n", "19 股权激励回购注销实施 2024-04-18 A股 本人 董事长 \n", "20 二级市场买卖 2024-04-18 A股 本人 董事长 \n", "21 二级市场买卖 2024-04-18 A股 本人 非独立董事 \n", "22 二级市场买卖 2024-04-18 A股 本人 核心技术人员 \n", "23 二级市场买卖 2024-04-17 A股 本人 监事 \n", "24 竞价交易 2024-04-17 A股 本人 董事长 \n", "25 二级市场买卖 2024-04-17 A股 本人 核心技术人员 \n", "26 二级市场买卖 2024-04-17 A股 本人 董事 \n", "27 二级市场买卖 2024-04-16 A股 本人 职工代表监事 \n", "28 二级市场买卖 2024-04-16 A股 本人 职工监事 \n", "29 二级市场买卖 2024-04-16 A股 本人 非独立董事 \n", "30 二级市场买卖 2024-04-16 A股 本人 监事 \n", "31 二级市场买卖 2024-04-16 A股 本人 核心技术人员 \n", "32 竞价交易 2024-04-16 A股 本人 非独立董事 \n", "33 竞价交易 2024-04-16 A股 本人 副总经理 \n", "34 竞价交易 2024-04-16 A股 本人 董事长 \n", "35 二级市场买卖 2024-04-16 A股 本人 董事 \n", "36 竞价交易 2024-04-16 A股 本人 审计委员会委员 
\n", "37 其他 2024-04-16 A股 本人 副董事长 \n", "38 竞价交易 2024-04-16 A股 本人 副总经理 \n", "39 二级市场买卖 2024-04-16 A股 本人 董事会秘书 " ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "url = 'http://vip.stock.finance.sina.com.cn/q/go.php/vInvestConsult/kind/nbjy/index.phtml'\n", "\n", "df = pd.read_html(url)[0]\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": 34, "id": "d105a9a1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Page 1 completed!\n", "Page 2 completed!\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
股票代码股票名称变动人变动类型变动 股数成交均价变动金额(万元)变动后 持股数变动原因变动日期持股种类与董监高 关系董监高 职务
02145中核钛白沈鑫购买33500004.021346.7014150000竞价交易2024-04-19A股本人非独立董事
1688660电气风电吴改购买102003.253.3240255二级市场买卖2024-04-19A股本人非独立董事
2688660电气风电王勇购买100003.263.2633248二级市场买卖2024-04-19A股本人审计委员会委员
3688660电气风电王红春购买80003.242.5933153二级市场买卖2024-04-19A股本人职工代表监事
4688660电气风电乔银平购买122403.274.0042540二级市场买卖2024-04-19A股本人董事
..........................................
35603456九洲药业沙裕杰购买1000016.0516.0590000二级市场买卖2024-04-11A股本人执行董事
36688691灿芯股份庄志青------0.003092850NaN2024-04-11A股本人董事
37688257新锐股份杨汉民出售-8000025200.00309608二级市场买卖2024-04-11A股本人事业部技术副总经理
38872190雷神科技路凯林购买11000016.36179.96643471竞价交易2024-04-11A股本人董事
39688037芯源微王绍勇购买298096.5128.7648980二级市场买卖2024-04-11A股本人核心技术人员
\n", "

80 rows × 13 columns

\n", "
" ], "text/plain": [ " 股票代码 股票名称 变动人 变动类型 变动 股数 成交均价 变动金额(万元) 变动后 持股数 变动原因 \\\n", "0 2145 中核钛白 沈鑫 购买 3350000 4.02 1346.70 14150000 竞价交易 \n", "1 688660 电气风电 吴改 购买 10200 3.25 3.32 40255 二级市场买卖 \n", "2 688660 电气风电 王勇 购买 10000 3.26 3.26 33248 二级市场买卖 \n", "3 688660 电气风电 王红春 购买 8000 3.24 2.59 33153 二级市场买卖 \n", "4 688660 电气风电 乔银平 购买 12240 3.27 4.00 42540 二级市场买卖 \n", ".. ... ... ... ... ... ... ... ... ... \n", "35 603456 九洲药业 沙裕杰 购买 10000 16.05 16.05 90000 二级市场买卖 \n", "36 688691 灿芯股份 庄志青 -- -- -- 0.00 3092850 NaN \n", "37 688257 新锐股份 杨汉民 出售 -80000 25 200.00 309608 二级市场买卖 \n", "38 872190 雷神科技 路凯林 购买 110000 16.36 179.96 643471 竞价交易 \n", "39 688037 芯源微 王绍勇 购买 2980 96.51 28.76 48980 二级市场买卖 \n", "\n", " 变动日期 持股种类 与董监高 关系 董监高 职务 \n", "0 2024-04-19 A股 本人 非独立董事 \n", "1 2024-04-19 A股 本人 非独立董事 \n", "2 2024-04-19 A股 本人 审计委员会委员 \n", "3 2024-04-19 A股 本人 职工代表监事 \n", "4 2024-04-19 A股 本人 董事 \n", ".. ... ... ... ... \n", "35 2024-04-11 A股 本人 执行董事 \n", "36 2024-04-11 A股 本人 董事 \n", "37 2024-04-11 A股 本人 事业部技术副总经理 \n", "38 2024-04-11 A股 本人 董事 \n", "39 2024-04-11 A股 本人 核心技术人员 \n", "\n", "[80 rows x 13 columns]" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame()\n", "\n", "for i in range(1, 3):\n", " url = 'http://vip.stock.finance.sina.com.cn/q/go.php/vInvestConsult/kind/nbjy/index.phtml?p=%s' % i\n", " df = pd.concat([df, pd.read_html(url)[0]])\n", " print(\"Page %s completed!\" % i)\n", " \n", "df" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 5 }