Python | Web Scraping Sample

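Web scraping has been getting a lot of attention lately,
and there are many different ways to do it.

There are also many languages and libraries to choose from,
and many readers are probably unsure which language to pick.

This article walks through a web scraping sample written in Python.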

Web scraping libraries
Installing BeautifulSoup4
Installing requests
Fetching data with Requests
Parsing data with BeautifulSoup
Summary

Web scraping libraries

Requests, Beautiful Soup, and Selenium are well-known libraries commonly used for scraping in Python.
The typical flow combines these libraries to scrape a website, cleanse the data, and then save it.
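
The scrape, cleanse, save flow can be sketched roughly as follows. This is only a minimal illustration: the HTML string, the function names (scrape, cleanse, save), and the CSV columns are assumptions made for this example, and the data is written to an in-memory buffer instead of a real file.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a page fetched with Requests.
SAMPLE_HTML = """
<html><body>
  <a href="https://example.com/a">  First  </a>
  <a href="https://example.com/b">Second</a>
  <a>No link here</a>
</body></html>
"""


def scrape(html):
    """Parse the HTML and return (text, href) pairs for every <a> tag."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text(), a.get("href")) for a in soup.find_all("a")]


def cleanse(rows):
    """Drop rows without an href and trim surrounding whitespace."""
    return [(text.strip(), href) for text, href in rows if href]


def save(rows, fileobj):
    """Write the cleaned rows as CSV to any file-like object."""
    writer = csv.writer(fileobj)
    writer.writerow(["text", "href"])
    writer.writerows(rows)


buffer = io.StringIO()
save(cleanse(scrape(SAMPLE_HTML)), buffer)
print(buffer.getvalue())
```

Keeping the three steps in separate functions makes it easy to replace the in-memory buffer with a real file, or the sample HTML with a fetched page.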

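Requests is used to retrieve HTML data.
With Requests, Python can easily and automatically fetch data from a website.

The retrieved data is then parsed and analyzed with a library such as Beautiful Soup.

Selenium is used to retrieve data from sites that rely on JavaScript, and for tasks such as logging in to a site.
Selenium can extract data as well as fetch it, but because it works by driving a browser, it runs more slowly.

For that reason, I recommend using Requests and Beautiful Soup wherever possible and limiting Selenium to the places where it is strictly necessary.

This article focuses on how to use Requests and Beautiful Soup.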

Installing BeautifulSoup4

Beautiful Soup is not included with a standard Python installation.
It must be installed separately with pip or conda. The latest major version is Beautiful Soup 4.

To install it with pip, enter the following command.

pip install beautifulsoup4

Installing requests

To install requests with pip, enter the following command.

pip install requests

That completes the installation of the required libraries.
In an actual program, these libraries need to be imported.

import requests
from bs4 import BeautifulSoup

That completes the preparation for using Beautiful Soup.
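
If you want to confirm that both packages were installed, one possible check is sketched below. It assumes Python 3.8 or later (for importlib.metadata); the helper name installed_version is made up for this example.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# The names below are the pip distribution names used in the install commands.
for name in ("requests", "beautifulsoup4"):
    print(name, installed_version(name) or "not installed")
```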

Fetching data with Requests

First, use requests to fetch the content of a website.

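import requests

# Placeholder URL; the sample output below was taken from https://prettytabby.com/
url = "https://hogehoge.com/"

# Send a GET request and print the response body (the page's HTML)
res = requests.get(url)
print(res.text)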

You can inspect the retrieved content through res.text. Save the code as sample.py and run it as follows.

python3 sample.py

<!DOCTYPE html>
<html lang="ja" prefix="og: http://ogp.me/ns#">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="プログラム開発の情報や、雑記を投稿しているサイトです。">
<meta name="author" content="修ちゃんの技術資料">
<link rel="start" href="https://prettytabby.com" title="TOP">
<!-- OGP -->
<meta property="og:site_name" content="修ちゃんの技術資料">
<meta property="og:description" content="プログラム開発の情報や、雑記を投稿しているサイトです。">
<meta property="og:title" content="修ちゃんの技術資料">
<meta property="og:url" content="https://prettytabby.com/">
<meta property="og:type" content="website">
<!-- twitter:card -->
<meta name="twitter:card" content="summary_large_image">
<meta name="twitter:site" content="@t_prettytabby">
<title>修ちゃんの技術資料 &#8211; プログラム開発の情報や、雑記を投稿しているサイトです。</title>
<link rel='dns-prefetch' href='//secure.gravatar.com' />
<link rel='dns-prefetch' href='//s.w.org' />
<link rel='dns-prefetch' href='//v0.wordpress.com' />
<link rel='dns-prefetch' href='//i0.wp.com' />
<link rel='dns-prefetch' href='//i1.wp.com' />
<link rel='dns-prefetch' href='//i2.wp.com' />
<link rel='dns-prefetch' href='//c0.wp.com' />
<link rel="alternate" type="application/rss+xml" title="修ちゃんの技術資料 &raquo; フィード" href="https://prettytabby.com/feed/" />
<link rel="alternate" type="application/rss+xml" title="修ちゃんの技術資料 &raquo; コメントフィード" href="https://prettytabby.com/comments/feed/" />
                <script>
                        window._wpemojiSettings = {"baseUrl":"https:\/\/s.w.org\/images\/core\/emoji\/13.0.0\/72x72\/","ext":".png","svgUrl":"https:\/\/s.w.org\/images\/core\/emoji\/13.0.0\/svg\/","svgExt":".svg","source":{"concatemoji":"https:\/\/prettytabby.com\/wp-includes\/js\/wp-emoji-release.min.js?ver=5.5.1"}};
                        !function(e,a,t){var r,n,o,i,p=a.createElement("canvas"),s=p.getContext&&p.getContext("2d");function c(e,t){var a=String.fromCharCode;s.clearRect(0,0,p.width,p.height),s.fillText(a.apply(this,e),0,0);var r=p.toDataURL();return s.clearRect(0,0,p.width,p.height),s.fillText(a.apply(this,t),0,0),r===p.toDataURL()}function l(e){if(!s||!s.fillText)return!1;switch(s.textBaseline="top",s.font="600 32px Arial",e){case"flag":return!c([127987,65039,8205,9895,65039],[127987,65039,8203,9895,65039])&&(!c([55356,56826,55356,56819],[55356,56826,8203,55356,56819])&&!c([55356,57332,56128,56423,56128,56418,56128,56421,56128,56430,56128,56423,56128,56447],[55356,57332,8203,56128,56423,8203,56128,56418,8203,56128,56421,8203,56128,56430,8203,56128,56423,8203,56128,56447]));case"emoji":return!c([55357,56424,8205,55356,57212],[55357,56424,8203,55356,57212])}return!1}function d(e){var t=a.createElement("script");t.src=e,t.defer=t.type="text/javascript",a.getElementsByTagName("head")[0].appendChild(t)}for(i=Array("flag","emoji"),t.supports={everything:!0,everythingExceptFlag:!0},o=0;o<i.length;o++)t.supports[i[o]]=l(i[o]),t.supports.everything=t.supports.everything&&t.supports[i[o]],"flag"!==i[o]&&(t.supports.everythingExceptFlag=t.supports.everythingExceptFlag&&t.supports[i[o]]);t.supports.everythingExceptFlag=t.supports.everythingExceptFlag&&!t.supports.flag,t.DOMReady=!1,t.readyCallback=function(){t.DOMReady=!0},t.supports.everything||(n=function(){t.readyCallback()},a.addEventListener?(a.addEventListener("DOMContentLoaded",n,!1),e.addEventListener("load",n,!1)):(e.attachEvent("onload",n),a.attachEvent("onreadystatechange",function(){"complete"===a.readyState&&t.readyCallback()})),(r=t.source||{}).concatemoji?d(r.concatemoji):r.wpemoji&&r.twemoji&&(d(r.twemoji),d(r.wpemoji)))}(window,document,window._wpemojiSettings);
                </script>
                <style>
img.wp-smiley,
img.emoji {
        display: inline !important;
        border: none !important;
        box-shadow: none !important;
        height: 1em !important;
        width: 1em !important;
        margin: 0 .07em !important;
        vertical-align: -0.1em !important;
        background: none !important;
        padding: 0 !important;
}
</style>
        <link rel='stylesheet' id='crayon-css'  href='https://prettytabby.com/wp-content/plugins/crayon-syntax-highlighter/css/min/crayon.min.css?ver=_2.7.2_beta' media='all' />
<link rel='stylesheet' id='crayon-theme-classic-css'  href='https://prettytabby.com/wp-content/plugins/crayon-syntax-highlighter/themes/classic/classic.css?ver=_2.7.2_beta' media='all' />
<link rel='stylesheet' id='crayon-font-monaco-css'  href='https://prettytabby.com/wp-content/plugins/crayon-syntax-highlighter/fonts/monaco.css?ver=_2.7.2_beta' media='all' />
<link rel='stylesheet' id='wp-block-library-css'  href='https://c0.wp.com/c/5.5.1/wp-includes/css/dist/block-library/style.min.css' media='all' />
<style id='wp-block-library-inline-css'>
.has-text-align-justify{text-align:justify;}
</style>
<link rel='stylesheet' id='font-awesome-css'  href='https://prettytabby.com/wp-content/plugins/arconix-shortcodes/includes/css/font-awesome.min.css?ver=4.6.3' media='all' />
<link rel='stylesheet' id='arconix-shortcodes-css'  href='https://prettytabby.com/wp-content/plugins/arconix-shortcodes/includes/css/arconix-shortcodes.min.css?ver=2.1.7' media='all' />
<link rel='stylesheet' id='contact-form-7-css'  href='https://prettytabby.com/wp-content/plugins/contact-form-7/includes/css/styles.css?ver=5.2.2' media='all' />
<link rel='stylesheet' id='responsive-lightbox-fancybox-css'  href='https://prettytabby.com/wp-content/plugins/responsive-lightbox/assets/fancybox/jquery.fancybox.min.css?ver=2.2.3' media='all' />
<link rel='stylesheet' id='jetpack_css-css'  href='https://c0.wp.com/p/jetpack/9.0.2/css/jetpack.css' media='all' />
<script src='https://c0.wp.com/c/5.5.1/wp-includes/js/jquery/jquery.js' id='jquery-core-js'></script>
<script id='crayon_js-js-extra'>
var CrayonSyntaxSettings = {"version":"_2.7.2_beta","is_admin":"0","ajaxurl":"https:\/\/prettytabby.com\/wp-admin\/admin-ajax.php","prefix":"crayon-","setting":"crayon-setting","selected":"crayon-setting-selected","changed":"crayon-setting-changed","special":"crayon-setting-special","orig_value":"data-orig-value","debug":""};
var CrayonSyntaxStrings = {"copy":"Press %s to Copy, %s to Paste","minimize":"Click To Expand Code"};
</script>
<link rel='shortlink' href='https://wp.me/cl4oh' />
<style type='text/css'>
</style>
<style type='text/css'>img#wpstats{display:none}</style>                        <style type="text/css">
                                /* If html does not have either class, do not show lazy loaded images. */
                                html:not( .jetpack-lazy-images-js-enabled ):not( .js ) .jetpack-lazy-image {
                                        display: none;
                                }
                        </style>
                        <script>
                                document.documentElement.classList.add(
                                        'jetpack-lazy-images-js-enabled'
                                );
                        </script>

<!-- Jetpack Open Graph Tags -->
<meta property="og:type" content="website" />
<meta property="og:title" content="修ちゃんの技術資料" />
<meta property="og:description" content="プログラム開発の情報や、雑記を投稿しているサイトです。" />
<meta property="og:url" content="https://prettytabby.com/" />
<meta property="og:site_name" content="修ちゃんの技術資料" />
<meta property="og:image" content="https://i2.wp.com/prettytabby.com/wp-content/uploads/2018/05/cropped-af5c551016a4bcd429f352583986ceb3-1.png?fit=512%2C512&amp;ssl=1" />
<meta property="og:image:width" content="512" />
<meta property="og:image:height" content="512" />
<meta property="og:locale" content="ja_JP" />
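
The sample above calls requests.get() with no error handling. In practice it may be worth adding a timeout and checking the HTTP status, roughly as in the sketch below. The function name fetch_html and the 10-second default are assumptions for this example; timeout, raise_for_status(), and requests.RequestException are part of Requests.

```python
import requests


def fetch_html(url, timeout=10):
    """Fetch a page and return its HTML, raising on network or HTTP errors."""
    # timeout keeps the call from hanging forever on an unresponsive server.
    res = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
    res.raise_for_status()
    return res.text


def main():
    # Placeholder URL; replace it with a site you are allowed to scrape.
    url = "https://hogehoge.com/"
    try:
        html = fetch_html(url)
    except requests.RequestException as exc:
        print(f"Request failed: {exc}")
    else:
        print(html[:200])  # show only the first 200 characters


# main() is not called here, so running this file sends no request.
# Call main() yourself to try the sketch.
```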

Parsing data with BeautifulSoup

Next, the retrieved information needs to be parsed using BeautifulSoup().

import requests
from bs4 import BeautifulSoup

url = "https://prettytabby.com/"

res = requests.get(url)

# Parse the web page with BeautifulSoup
soup = BeautifulSoup(res.text, "html.parser")

# Collect every <a> tag and print its href attribute
elems = soup.find_all("a")

for elem in elems:
    print(elem.get("href"))

https://prettytabby.com
https://www.facebook.com/tabby.pretty.39
https://twitter.com/t_prettytabby
https://www.instagram.com/prettytabby2006/
https://fun-learning.jp
https://prettytabby.com/related-sites/
https://prettytabby.com/programming-best/
https://prettytabby.com/fixed-top-2/
https://prettytabby.com/related-sites/pbsite/
https://prettytabby.com/profile/
https://prettytabby.com/inquiry/
https://prettytabby.com/ipa-information-2/
https://prettytabby.com/laravel-install/
https://prettytabby.com/diary-20201013/
https://prettytabby.com/program-design-1/
https://prettytabby.com/myapp-released-20200922/
https://prettytabby.com/myapp-ipa-st-released/
https://prettytabby.com/cakephp3-phpspreadsheet/
https://prettytabby.com/app-tsukanshi/
https://prettytabby.com/boujitsu-info-4/
https://prettytabby.com/cakephp-csvview-sample/
https://prettytabby.com/business/
https://prettytabby.com/cakephp-tips-1/
https://prettytabby.com/page/2/
https://prettytabby.com/page/3/
https://prettytabby.com/page/4/
https://prettytabby.com/page/5/
https://prettytabby.com/page/25/
https://prettytabby.com/page/2/
https://prettytabby.com/my-app-ipa-au-released/
https://prettytabby.com/how-to-trello/
https://prettytabby.com/vb-inifile-sample/
https://prettytabby.com/ipa-information-2/
https://prettytabby.com/myapp-gyosei-released-3/
https://prettytabby.com/my-app-ipa-au-released-2/
https://prettytabby.com/python-program-sample-list/
https://prettytabby.com/myapp-gyosei-released/
https://prettytabby.com/vb-datatable-sample/
https://prettytabby.com/vb-program-sample-list/
https://www.facebook.com/tabby.pretty.39
https://twitter.com/t_prettytabby
https://www.instagram.com/prettytabby2006/
https://prettytabby.com
None
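
The output above contains repeated URLs and a final None, which comes from an <a> tag with no href attribute. One way to tidy this up is sketched below: it keeps only tags that have an href, resolves relative paths, and drops duplicates. The function name extract_links and the sample HTML are assumptions for this example.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return unique absolute URLs from <a> tags, in first-seen order."""
    soup = BeautifulSoup(html, "html.parser")
    seen = set()
    links = []
    # href=True matches only <a> tags that actually have an href attribute,
    # so no None values are produced.
    for a in soup.find_all("a", href=True):
        absolute = urljoin(base_url, a["href"])  # resolve relative paths
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


# Hypothetical HTML used only to demonstrate the function.
sample = """
<a href="https://prettytabby.com/page/2/">2</a>
<a href="/profile/">Profile</a>
<a href="https://prettytabby.com/page/2/">Next</a>
<a name="top">no href</a>
"""
for link in extract_links(sample, "https://prettytabby.com/"):
    print(link)
```

With the sample HTML, this prints https://prettytabby.com/page/2/ once, followed by https://prettytabby.com/profile/.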

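Summary

This article covered only a simple web scraping sample,
but by building on this basic flow you can scrape the web efficiently.

When you actually scrape a website,
take great care not to burden the target site, for example by obtaining permission in advance.

I hope you will keep these points in mind and put web scraping to good use.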
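
As one way to avoid burdening a site, a crawler can honor the site's robots.txt and pause between requests. The sketch below uses the standard library's urllib.robotparser. The robots.txt content, the URLs, the function names, and the 1-second fallback delay are all assumptions for this example, and following robots.txt is not a substitute for getting permission.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from
# the target site before requesting any other page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Crawl-delay: 2
"""


def build_parser(robots_txt):
    """Create a parser from robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def crawl(parser, urls, fetch, user_agent="*", sleep=time.sleep):
    """Call fetch(url) for each permitted URL, pausing between requests."""
    delay = parser.crawl_delay(user_agent) or 1  # assumed 1-second fallback
    results = []
    for url in urls:
        if not parser.can_fetch(user_agent, url):
            continue  # robots.txt disallows this path
        results.append(fetch(url))
        sleep(delay)
    return results


urls = [
    "https://example.com/profile/",
    "https://example.com/wp-admin/options.php",
]
# The demo passes a stand-in fetch function and a no-op sleep so that it
# runs instantly and sends no requests.
fetched = crawl(build_parser(ROBOTS_TXT), urls,
                fetch=lambda u: u, sleep=lambda seconds: None)
print(fetched)
```

In real use, fetch would be a function that calls requests.get(), and the default sleep=time.sleep would be left in place.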
