18 lines
17 KiB
JavaScript
18 lines
17 KiB
JavaScript
import{_ as t,c as a,o as s,ah as i}from"./chunks/framework.j4Nev8bF.js";const u=JSON.parse('{"title":"Python","description":"","frontmatter":{},"headers":[],"relativePath":"rebel_coding/step3.md","filePath":"rebel_coding/step3.md"}'),o={name:"rebel_coding/step3.md"};function n(r,e,l,p,h,d){return s(),a("div",null,[...e[0]||(e[0]=[i(`<h1 id="python" tabindex="-1">Python <a class="header-anchor" href="#python" aria-label="Permalink to “Python”"></a></h1><hr><p>What if you could automate processes in your building … but not just simple processes - intensive analysis and creation ?</p><p>Enter Python!</p><p>Python has access to systems level functionality allowing it to interact with the hardware in a larger variety of ways!</p><p>And though this is a bit more of a detailed tutorial, compared to what we've already done; we'll still stay in the shallow end of the pool for writing our first scraper.</p><p>There are few tools that one can use to bypass the method outlined belows; Beautiful Soup is one such tool. Though we're gonna opt for directly using the toolset around which Beautiful Soup is built.</p><blockquote><p>QN: A quick note about Package Managers</p></blockquote><p>Package managers ... manage the packages, libraries and software our systems use.</p><p>If you are using Ubuntu, you will likely be using a package manager called <code>apt</code>, Mac users may be using one called <code>homebrew</code>, while Windows users can use one called <code>chocolatey</code>.</p><p>Individual languages can also have their own package managers.</p><p>NodeJS has <code>npm</code>, Ruby uses a tool called RubyGems, and Python uses <code>pip</code>.</p><p>Depending on your chose operating system, you will need to learn how to use your operating systems package manager, and subsequently acquire <code>pip</code> for Python.</p><p>For those wanting to dive in head first, you can check out the official <code>pip</code> documentation: <a href="https://pip.pypa.io/en/stable/installing/" target="_blank" rel="noreferrer">https://pip.pypa.io/en/stable/installing/</a>.</p><p>Make sure to read the warnings.</p><p>Alright, let's dig in!!!</p><h2 id="your-first-python-scraper" tabindex="-1">#Your First Python Scraper <a class="header-anchor" href="#your-first-python-scraper" aria-label="Permalink to “#Your First Python Scraper”"></a></h2><h3 id="prep-virtual-environment" tabindex="-1">Prep Virtual Environment <a class="header-anchor" href="#prep-virtual-environment" aria-label="Permalink to “Prep Virtual Environment”"></a></h3><p>Now that we've got <code>pip</code>, we want to the ability to make Python <em>virtual environments</em>.</p><p>The reason we want to use a <em>virtual environment</em> is to keep our systems <em>clean</em>.</p><p>We will be downloading quite a few Python packages that may not be necessary afterwards; or that may interfere with packages we want to use later.</p><p>In short, uing virtual environments allows such keep the packages utilized for each project we pursue compartmentalized.</p><p>The command to run is <code>pip install virtualenv</code>; bells will ring, whistles will be blown, and when it's all done; you'll have Python Virtual Environments accessible on your computer!</p><h3 id="start-venv-install-lxml" tabindex="-1">Start VEnv & Install LXML <a class="header-anchor" href="#start-venv-install-lxml" aria-label="Permalink to “Start VEnv & Install LXML”"></a></h3><p>Next week need to create a virtual environment.</p><p>In order to spin-up our virtual environment we run the following command:</p><div class="language-"><button title="Copy Code" class="copy"></button><span class="lang"></span><pre class="shiki shiki-themes github-light github-dark" style="--shiki-light:#24292e;--shiki-dark:#e1e4e8;--shiki-light-bg:#fff;--shiki-dark-bg:#24292e;" tabindex="0" dir="ltr"><code><span class="line"><span> virtualenv pickYourOwnName</span></span></code></pre></div><p>More bells and whistles will sound, and when it's all down we'll have a new folder into which we will <code>cd</code>.</p><p>There are three folders that are created within our new folder; though for the sake of introductions and brevity, I will only highlight the following two:</p><p><code>bin</code> - this is where the commands for our virtual environment reside. <code>lib</code> - here is where all of our environments packages will reside.</p><p>In order to activate our virtual environment, from within our newly created folder run the command <code>source bin/activate</code>.</p><p>And next we will acquire the base packages we need to begin scraping:</p><div class="language-"><button title="Copy Code" class="copy"></button><span class="lang"></span><pre class="shiki shiki-themes github-light github-dark" style="--shiki-light:#24292e;--shiki-dark:#e1e4e8;--shiki-light-bg:#fff;--shiki-dark-bg:#24292e;" tabindex="0" dir="ltr"><code><span class="line"><span> pip install requests</span></span>
|
||
<span class="line"><span> pip install lxml</span></span></code></pre></div><p>Error: If you get an error when trying to install LXML, that is totally natural and reasonable. Sometimes, some may say, that's the benefit of using a tool such as Beautiful Soup, it manages many dependencies, so that new users don't have to.</p><p>Though in truth, the effort required to supply LXML's dependencies are relatively minimal.</p><p>The package depends on a series of <code>c</code> files; for Mac users, admittedly, this may require acquiring and updating XCode to include their Command Line Tools package.</p><p>For Windows users may have their own issues, regarding Visual C++ components; notice that LXML is dependent on C-language packages.</p><p>If you run into any issues, this is your chance to check out what solutions others have found using your favorite search engine.</p><p>And if still unable to resolve the errors you receive, please reach out to <a href="mailto:canin@dreamfreely.org" target="_blank" rel="noreferrer">canin@dreamfreely.org</a>!</p><h3 id="create-new-python-file" tabindex="-1">Create New Python File <a class="header-anchor" href="#create-new-python-file" aria-label="Permalink to “Create New Python File”"></a></h3><p>Phew! We got through that entire process.</p><p>Congratuluation!!!</p><p>You've done some great work so far; we're navigating the command-line to build a custom toolset.</p><p>That is no small accomplishment!</p><p>Next up, we start building.</p><p>Open up Notepad, or your favorite text editor, and create new file; naming it however you like, though with the <code>.py</code> extention at the end.</p><h3 id="import-libraries" tabindex="-1">Import Libraries <a class="header-anchor" href="#import-libraries" aria-label="Permalink to “Import Libraries”"></a></h3><p>Our process for building our scraper file is very similar to the steps we took when building our webpage.</p><p>First we need to gather our necessary tools.</p><p>On the first line of our file we will import our first package by typing the command <code>import requests</code>.</p><p>Yup, it is that easy; so next we will import the tools we need from LXML with the following command:</p><div class="language-"><button title="Copy Code" class="copy"></button><span class="lang"></span><pre class="shiki shiki-themes github-light github-dark" style="--shiki-light:#24292e;--shiki-dark:#e1e4e8;--shiki-light-bg:#fff;--shiki-dark-bg:#24292e;" tabindex="0" dir="ltr"><code><span class="line"><span> from lxml import html</span></span></code></pre></div><p>Feels almost magically simple doesn't it ?</p><p>Lastly, lets grab one more toolset by adding the line</p><div class="language-"><button title="Copy Code" class="copy"></button><span class="lang"></span><pre class="shiki shiki-themes github-light github-dark" style="--shiki-light:#24292e;--shiki-dark:#e1e4e8;--shiki-light-bg:#fff;--shiki-dark-bg:#24292e;" tabindex="0" dir="ltr"><code><span class="line"><span> from pprint import pprint as ppr</span></span></code></pre></div><p>This is a tool that will allow us to print our data in a more readable format.</p><p>So let's get to scraping!!</p><h3 id="get-site-requests" tabindex="-1">Get Site (requests) <a class="header-anchor" href="#get-site-requests" aria-label="Permalink to “Get Site (requests)”"></a></h3><p>What website do you want to scrape?</p><p>Mind you, some websites load their data using JavaScript (many websites do, in fact.)</p><p>And these websites will require additional tools to scrape.</p><p>Nonetheless, the command to <em>scrape</em> a website is as follows:</p><div class="language-"><button title="Copy Code" class="copy"></button><span class="lang"></span><pre class="shiki shiki-themes github-light github-dark" style="--shiki-light:#24292e;--shiki-dark:#e1e4e8;--shiki-light-bg:#fff;--shiki-dark-bg:#24292e;" tabindex="0" dir="ltr"><code><span class="line"><span> root = requests.get('https://www.linux.org')</span></span></code></pre></div><p>Operations will happen in the background, and when all is said and done, we will have a variable called <code>root</code> which contains our webpage.</p><p>But it's a Python object, and there's a bunch of other info attached to the variable that we don't need right now ...</p><h3 id="extract-code-lxml-html" tabindex="-1">Extract Code (lxml.html) <a class="header-anchor" href="#extract-code-lxml-html" aria-label="Permalink to “Extract Code (lxml.html)”"></a></h3><p>Huzzah, this is where we will use the <code>html</code> tool we brought in from <code>lxml</code> by running the following command:</p><div class="language-"><button title="Copy Code" class="copy"></button><span class="lang"></span><pre class="shiki shiki-themes github-light github-dark" style="--shiki-light:#24292e;--shiki-dark:#e1e4e8;--shiki-light-bg:#fff;--shiki-dark-bg:#24292e;" tabindex="0" dir="ltr"><code><span class="line"><span> base = html.fromstring(root.text)</span></span></code></pre></div><p>What we are doing is using the <code>html</code> tool to transform the <code>text</code> of the website's code into elements we can parse using another LMXL tool set.</p><h3 id="parse-code-xpath" tabindex="-1">Parse Code (xpath) <a class="header-anchor" href="#parse-code-xpath" aria-label="Permalink to “Parse Code (xpath)”"></a></h3><p>Enter XPATH!</p><blockquote><p>QN: Notice how HTML & LXML both have the same two letters at the end of them; they stand for <em>markup language</em>. And yes they are related.</p></blockquote><p>XML stands for <em>Extensible Markup Language</em>; and XPATH is a tool we can use to traverse and parse code written in this language.</p><p>Our previous command <code>html.fromstring</code> transformed the text of our code into XML elements, with which we can use their <code>xpath</code> property to navigate and extract specific data.</p><p>A fun command to run is <code>base.xpath('.//*')</code> as this will show us all of the root elements of the code we transformed using <code>html.fromstring</code>; any of it look familiar?</p><p>Now let's dig a bit deeper.</p><p>In the example available in the <a href="https://github.com/RebelCoding/startScraping/blob/master/startScraping.py" target="_blank" rel="noreferrer">Rebel Coding startScraping repository</a> the example code runs the following command:</p><div class="language-"><button title="Copy Code" class="copy"></button><span class="lang"></span><pre class="shiki shiki-themes github-light github-dark" style="--shiki-light:#24292e;--shiki-dark:#e1e4e8;--shiki-light-bg:#fff;--shiki-dark-bg:#24292e;" tabindex="0" dir="ltr"><code><span class="line"><span> items = base.xpath('.//*[@class="rgMasterTable"]/tbody/tr')</span></span></code></pre></div><p>What we are doing here is traversing our <code>base</code> element to find <em>any</em> object with the <code>class</code> of <code>rgMasterTable</code>.</p><p>Within that element we want to dig a bit further to our <code>tbody</code> element, and finally, we want to grab <em>all</em> of the table rows contained!</p><p>We put all of these rows into our variable called <code>items</code>; and now we have a list of row elements we can cycle through to extract more specific data.</p><h3 id="organize-display-code" tabindex="-1">Organize & Display Code <a class="header-anchor" href="#organize-display-code" aria-label="Permalink to “Organize & Display Code”"></a></h3><p>In our example script from RebelCoding/startScraping we use the following code to process our newly acquired items:</p><div class="language-"><button title="Copy Code" class="copy"></button><span class="lang"></span><pre class="shiki shiki-themes github-light github-dark" style="--shiki-light:#24292e;--shiki-dark:#e1e4e8;--shiki-light-bg:#fff;--shiki-dark-bg:#24292e;" tabindex="0" dir="ltr"><code><span class="line"><span> for i in items[:5]:</span></span>
|
||
<span class="line"><span> d = {}</span></span>
|
||
<span class="line"><span> title = i.xpath('.//td[1]/*/a/font/text()')</span></span>
|
||
<span class="line"><span> d['title'] = title[0].strip()</span></span>
|
||
<span class="line"><span> d['link'] = i.xpath('.//td[1]/*/a/@href')[0]</span></span>
|
||
<span class="line"><span> ppr(d)</span></span></code></pre></div><p>We use a <em>for-loop</em> to run through the first 5 items in our list of items; and the first thing we do is create an empty dictionary in which to store our desired information.</p><p>We do this so that we can more easily access this information later.</p><p>Next, we use XPATH to specify the information we're after.</p><p>XPATH returns a list of elements by default; and if there are not items, it will return an empty list.</p><p>If there is one item, it will return a list with one item; and so in our next line, we extract that singular item and apply the <code>strip()</code> method to remove any excess empty space on either side of our news acquired <code>title</code>.</p><p>On the next line we shorten this process a bit, by simply adding the index position <code>[0]</code> to the end of our <code>xpath</code> command.</p><p>Lastly we use the Python tool <em>pretty print</em> to display our newly acquired data.</p><p>In order to run our code, we navigate to our file's location; hopefully you've saved it in our Virtual Environment's folder for ease of use.</p><p>And with our virtual environment activated we will run the command <code>python myFirstScrape.py</code>.</p><p>Though using whatever name you save your file as; having remembered the <code>.py</code> extension at the end.</p><p>Y'all just wrote your first web scraper!!!</p><p>Pour your a delicious glass of your favorite beverage or commence any other suitably celebrative action ~ cause y'all just did that!</p><h1 id="more-about-python" tabindex="-1">More About Python <a class="header-anchor" href="#more-about-python" aria-label="Permalink to “More About Python”"></a></h1><hr><p>We've glossed over quite a bit just to get ourselves up and running.</p><p>The example script provided in the Rebel Coding startScraping repository goes a bit deeper; so definitely check that out.</p><p>Though now you know how to use the Python <code>requests</code> package to mechanically grab websites; and you know how to use LXML to read the code from those sites!</p><p>Now let's wrap up by learning about the <em>full stack</em>, by which many of these sites are built and run.</p><h1 id="python-classes-js-objects" tabindex="-1">Python Classes & JS Objects <a class="header-anchor" href="#python-classes-js-objects" aria-label="Permalink to “Python Classes & JS Objects”"></a></h1><hr><h1 id="python-datetime-structure" tabindex="-1">Python DateTime Structure <a class="header-anchor" href="#python-datetime-structure" aria-label="Permalink to “Python DateTime Structure”"></a></h1><hr><div class="language-"><button title="Copy Code" class="copy"></button><span class="lang"></span><pre class="shiki shiki-themes github-light github-dark" style="--shiki-light:#24292e;--shiki-dark:#e1e4e8;--shiki-light-bg:#fff;--shiki-dark-bg:#24292e;" tabindex="0" dir="ltr"><code><span class="line"><span> for i in items[:5]:</span></span>
|
||
<span class="line"><span> d = {}</span></span>
|
||
<span class="line"><span> title = i.xpath('.//td[1]/*/a/font/text()')</span></span>
|
||
<span class="line"><span> d['title'] = title[0].strip()</span></span>
|
||
<span class="line"><span> d['link'] = i.xpath('.//td[1]/*/a/@href')[0]</span></span>
|
||
<span class="line"><span> date = i.xpath('.//td[2]/font/text()')</span></span>
|
||
<span class="line"><span> time = i.xpath('.//td[4]/font/span/font/text()')</span></span>
|
||
<span class="line"><span> time_complete = " ".join(date + time)</span></span>
|
||
<span class="line"><span> format_date = '%m/%d/%Y %I:%M %p'</span></span>
|
||
<span class="line"><span> d['real_date'] = datetime.strptime(time_complete, format_date)</span></span>
|
||
<span class="line"><span> ppr(d)</span></span></code></pre></div><h1 id="reading-writing-csv-json" tabindex="-1">Reading & Writing CSV/JSON <a class="header-anchor" href="#reading-writing-csv-json" aria-label="Permalink to “Reading & Writing CSV/JSON”"></a></h1><hr>`,109)])])}const m=t(o,[["render",n]]);export{u as __pageData,m as default};
|