last Update 10.3.19
This commit is contained in:
parent
43fb42f8f4
commit
0156b6b36b
16629
Python-Syntax.html
Normal file
File diff suppressed because one or more lines are too long
13305
Web Crawling.html
Normal file
File diff suppressed because it is too large
154
Web Crawling.ipynb
Normal file
@@ -0,0 +1,154 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Web Crawling\n",
"Crawling describes the process of extracting data from a web page. The procedure is usually as follows:<br>\n",
"\n",
"1. Download the page as HTML\n",
"2. Parse the HTML and search for elements\n",
"3. Extract information from the found element(s)\n",
"\n",
"\n",
"## 1. Download the HTML\n",
"using the **requests** package"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"import requests  # import the requests package\n",
"url = \"http://www.beispiel.de\"\n",
"r = requests.get(url)  # 'r' is short for 'response'; any other variable name would work as well"
]
},
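{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of what was just downloaded: the raw HTML is available as a string in `r.text`, so printing its first characters gives a quick look at the markup."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(r.text[:200])  # show the first 200 characters of the downloaded HTML"
]
},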
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The request has been sent to the server. The status code of the response can now be inspected. Important HTTP status codes are:\n",
"- 500: Internal Server Error\n",
"- 404: Page not found\n",
"- 200: OK"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"200\n"
]
}
],
"source": [
"print(r.status_code)  # 200 means the request was successful"
]
},
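{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of how a script might react to the status code instead of only printing it: continue when the request succeeded, otherwise raise an error. `raise_for_status()` raises a `requests.HTTPError` for 4xx and 5xx responses."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"if r.status_code == 200:\n",
"    print(\"Page downloaded successfully\")  # safe to continue with parsing\n",
"else:\n",
"    r.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses"
]
},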
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Parse the HTML and search for elements"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"from bs4 import BeautifulSoup  # import BeautifulSoup\n",
"doc = BeautifulSoup(r.text, \"html.parser\")  # parse the HTML with BeautifulSoup"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Elements can now be extracted from the downloaded and parsed HTML. CSS selectors are well suited for this, as sketched in the cells below.\n",
"Examples:\n",
"\n",
"- `<div class=\"test\">Hallo</div>`: found with `.test`; the dot stands for a class\n",
"- `<div id=\"test\">Hallo</div>`: found with `#test`; the hash stands for an ID attribute"
]
},
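{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of how such selectors are used with BeautifulSoup's `select()` method. The class and ID names come from the hypothetical examples above, so on this page the first two calls will most likely return empty lists."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(doc.select(\".test\"))  # all elements with class=\"test\"\n",
"print(doc.select(\"#test\"))  # the element with id=\"test\"\n",
"print(doc.select(\"p a\"))    # selectors can be combined, e.g. links inside <p> elements"
]
},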
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[<p>Diese Domain ist für illustrative Beispiele in Dokumenten eingerichtet. <br/>Sie können diese Domain ohne Erlaubnis als Beispiel verwenden und verlinken.</p>, <p><a href=\"/mehr.html\">Mehr Infos</a>\n",
"</p>]\n"
]
}
],
"source": [
"print(doc.find_all(\"p\"))  # find and print all <p> elements in the document"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Extract information from the found element(s)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0. Diese Domain ist für illustrative Beispiele in Dokumenten eingerichtet. Sie können diese Domain ohne Erlaubnis als Beispiel verwenden und verlinken.\n",
"1. Mehr Infos\n",
"\n"
]
}
],
"source": [
"i = 0\n",
"for p in doc.find_all(\"p\"):  # iterate over all <p> elements in the document\n",
"    print(str(i) + \". \" + p.text)  # print each element's text, prefixed with a running index\n",
"    i = i + 1"
]
},
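{
"cell_type": "markdown",
"metadata": {},
"source": [
"Besides the text, attributes of the found elements can be extracted as well. A minimal sketch: collect all links (`<a>` elements) and print their `href` attribute together with the link text."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for a in doc.find_all(\"a\"):  # find all links in the document\n",
"    print(a.get(\"href\"), \"->\", a.text)  # the link target (href attribute) and the visible link text"
]
}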
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}