Files
Library/docs/.vitepress/dist/assets/rebel_coding_step3.md.CFJMlHnF.js
2026-01-10 00:23:33 -05:00

18 lines
17 KiB
JavaScript
Raw Permalink Blame History

This file contains invisible Unicode characters
This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
import{_ as t,c as a,o as s,ah as i}from"./chunks/framework.j4Nev8bF.js";const u=JSON.parse('{"title":"Python","description":"","frontmatter":{},"headers":[],"relativePath":"rebel_coding/step3.md","filePath":"rebel_coding/step3.md"}'),o={name:"rebel_coding/step3.md"};function n(r,e,l,p,h,d){return s(),a("div",null,[...e[0]||(e[0]=[i(`<h1 id="python" tabindex="-1">Python <a class="header-anchor" href="#python" aria-label="Permalink to “Python”"></a></h1><hr><p>What if you could automate processes in your building … but not just simple processes - intensive analysis and creation ?</p><p>Enter Python!</p><p>Python has access to systems level functionality allowing it to interact with the hardware in a larger variety of ways!</p><p>And though this is a bit more of a detailed tutorial, compared to what we&#39;ve already done; we&#39;ll still stay in the shallow end of the pool for writing our first scraper.</p><p>There are few tools that one can use to bypass the method outlined belows; Beautiful Soup is one such tool. Though we&#39;re gonna opt for directly using the toolset around which Beautiful Soup is built.</p><blockquote><p>QN: A quick note about Package Managers</p></blockquote><p>Package managers ... manage the packages, libraries and software our systems use.</p><p>If you are using Ubuntu, you will likely be using a package manager called <code>apt</code>, Mac users may be using one called <code>homebrew</code>, while Windows users can use one called <code>chocolatey</code>.</p><p>Individual languages can also have their own package managers.</p><p>NodeJS has <code>npm</code>, Ruby uses a tool called RubyGems, and Python uses <code>pip</code>.</p><p>Depending on your chose operating system, you will need to learn how to use your operating systems package manager, and subsequently acquire <code>pip</code> for Python.</p><p>For those wanting to dive in head first, you can check out the official <code>pip</code> documentation: <a href="https://pip.pypa.io/en/stable/installing/" target="_blank" rel="noreferrer">https://pip.pypa.io/en/stable/installing/</a>.</p><p>Make sure to read the warnings.</p><p>Alright, let&#39;s dig in!!!</p><h2 id="your-first-python-scraper" tabindex="-1">#Your First Python Scraper <a class="header-anchor" href="#your-first-python-scraper" aria-label="Permalink to “#Your First Python Scraper”"></a></h2><h3 id="prep-virtual-environment" tabindex="-1">Prep Virtual Environment <a class="header-anchor" href="#prep-virtual-environment" aria-label="Permalink to “Prep Virtual Environment”"></a></h3><p>Now that we&#39;ve got <code>pip</code>, we want to the ability to make Python <em>virtual environments</em>.</p><p>The reason we want to use a <em>virtual environment</em> is to keep our systems <em>clean</em>.</p><p>We will be downloading quite a few Python packages that may not be necessary afterwards; or that may interfere with packages we want to use later.</p><p>In short, uing virtual environments allows such keep the packages utilized for each project we pursue compartmentalized.</p><p>The command to run is <code>pip install virtualenv</code>; bells will ring, whistles will be blown, and when it&#39;s all done; you&#39;ll have Python Virtual Environments accessible on your computer!</p><h3 id="start-venv-install-lxml" tabindex="-1">Start VEnv &amp; Install LXML <a class="header-anchor" href="#start-venv-install-lxml" aria-label="Permalink to “Start VEnv &amp; Install LXML”"></a></h3><p>Next week need to create a virtual environment.</p><p>In order to spin-up our virtual environment we run the following command:</p><div class="language-"><button title="Copy Code" class="copy"></button><span class="lang"></span><pre class="shiki shiki-themes github-light github-dark" style="--shiki-light:#24292e;--shiki-dark:#e1e4e8;--shiki-light-bg:#fff;--shiki-dark-bg:#24292e;" tabindex="0" dir="ltr"><code><span class="line"><span> virtualenv pickYourOwnName</span></span></code></pre></div><p>More bells and whistles will sound, and when it&#39;s all down we&#39;ll have a new folder into which we will <code>cd</code>.</p><p>There are three folders that are created within our new folder; though for the sake of introductions and brevity, I will only highlight the following two:</p><p><code>bin</code> - this is where the commands for our virtual environment reside. <code>lib</code> - here is where all of our environments packages will reside.</p><p>In order to activate our virtual environment, from within our newly created folder run the command <code>source bin/activate</code>.</p><p>And next we will acquire the base packages we need to begin scraping:</p><div class="language-"><button title="Copy Code" class="copy"></button><span class="lang"></span><pre class="shiki shiki-themes github-light github-dark" style="--shiki-light:#24292e;--shiki-dark:#e1e4e8;--shiki-light-bg:#fff;--shiki-dark-bg:#24292e;" tabindex="0" dir="ltr"><code><span class="line"><span> pip install requests</span></span>
<span class="line"><span> pip install lxml</span></span></code></pre></div><p>Error: If you get an error when trying to install LXML, that is totally natural and reasonable. Sometimes, some may say, that&#39;s the benefit of using a tool such as Beautiful Soup, it manages many dependencies, so that new users don&#39;t have to.</p><p>Though in truth, the effort required to supply LXML&#39;s dependencies are relatively minimal.</p><p>The package depends on a series of <code>c</code> files; for Mac users, admittedly, this may require acquiring and updating XCode to include their Command Line Tools package.</p><p>For Windows users may have their own issues, regarding Visual C++ components; notice that LXML is dependent on C-language packages.</p><p>If you run into any issues, this is your chance to check out what solutions others have found using your favorite search engine.</p><p>And if still unable to resolve the errors you receive, please reach out to <a href="mailto:canin@dreamfreely.org" target="_blank" rel="noreferrer">canin@dreamfreely.org</a>!</p><h3 id="create-new-python-file" tabindex="-1">Create New Python File <a class="header-anchor" href="#create-new-python-file" aria-label="Permalink to “Create New Python File”"></a></h3><p>Phew! We got through that entire process.</p><p>Congratuluation!!!</p><p>You&#39;ve done some great work so far; we&#39;re navigating the command-line to build a custom toolset.</p><p>That is no small accomplishment!</p><p>Next up, we start building.</p><p>Open up Notepad, or your favorite text editor, and create new file; naming it however you like, though with the <code>.py</code> extention at the end.</p><h3 id="import-libraries" tabindex="-1">Import Libraries <a class="header-anchor" href="#import-libraries" aria-label="Permalink to “Import Libraries”"></a></h3><p>Our process for building our scraper file is very similar to the steps we took when building our webpage.</p><p>First we need to gather our necessary tools.</p><p>On the first line of our file we will import our first package by typing the command <code>import requests</code>.</p><p>Yup, it is that easy; so next we will import the tools we need from LXML with the following command:</p><div class="language-"><button title="Copy Code" class="copy"></button><span class="lang"></span><pre class="shiki shiki-themes github-light github-dark" style="--shiki-light:#24292e;--shiki-dark:#e1e4e8;--shiki-light-bg:#fff;--shiki-dark-bg:#24292e;" tabindex="0" dir="ltr"><code><span class="line"><span> from lxml import html</span></span></code></pre></div><p>Feels almost magically simple doesn&#39;t it ?</p><p>Lastly, lets grab one more toolset by adding the line</p><div class="language-"><button title="Copy Code" class="copy"></button><span class="lang"></span><pre class="shiki shiki-themes github-light github-dark" style="--shiki-light:#24292e;--shiki-dark:#e1e4e8;--shiki-light-bg:#fff;--shiki-dark-bg:#24292e;" tabindex="0" dir="ltr"><code><span class="line"><span> from pprint import pprint as ppr</span></span></code></pre></div><p>This is a tool that will allow us to print our data in a more readable format.</p><p>So let&#39;s get to scraping!!</p><h3 id="get-site-requests" tabindex="-1">Get Site (requests) <a class="header-anchor" href="#get-site-requests" aria-label="Permalink to “Get Site (requests)”"></a></h3><p>What website do you want to scrape?</p><p>Mind you, some websites load their data using JavaScript (many websites do, in fact.)</p><p>And these websites will require additional tools to scrape.</p><p>Nonetheless, the command to <em>scrape</em> a website is as follows:</p><div class="language-"><button title="Copy Code" class="copy"></button><span class="lang"></span><pre class="shiki shiki-themes github-light github-dark" style="--shiki-light:#24292e;--shiki-dark:#e1e4e8;--shiki-light-bg:#fff;--shiki-dark-bg:#24292e;" tabindex="0" dir="ltr"><code><span class="line"><span> root = requests.get(&#39;https://www.linux.org&#39;)</span></span></code></pre></div><p>Operations will happen in the background, and when all is said and done, we will have a variable called <code>root</code> which contains our webpage.</p><p>But it&#39;s a Python object, and there&#39;s a bunch of other info attached to the variable that we don&#39;t need right now ...</p><h3 id="extract-code-lxml-html" tabindex="-1">Extract Code (lxml.html) <a class="header-anchor" href="#extract-code-lxml-html" aria-label="Permalink to “Extract Code (lxml.html)”"></a></h3><p>Huzzah, this is where we will use the <code>html</code> tool we brought in from <code>lxml</code> by running the following command:</p><div class="language-"><button title="Copy Code" class="copy"></button><span class="lang"></span><pre class="shiki shiki-themes github-light github-dark" style="--shiki-light:#24292e;--shiki-dark:#e1e4e8;--shiki-light-bg:#fff;--shiki-dark-bg:#24292e;" tabindex="0" dir="ltr"><code><span class="line"><span> base = html.fromstring(root.text)</span></span></code></pre></div><p>What we are doing is using the <code>html</code> tool to transform the <code>text</code> of the website&#39;s code into elements we can parse using another LMXL tool set.</p><h3 id="parse-code-xpath" tabindex="-1">Parse Code (xpath) <a class="header-anchor" href="#parse-code-xpath" aria-label="Permalink to “Parse Code (xpath)”"></a></h3><p>Enter XPATH!</p><blockquote><p>QN: Notice how HTML &amp; LXML both have the same two letters at the end of them; they stand for <em>markup language</em>. And yes they are related.</p></blockquote><p>XML stands for <em>Extensible Markup Language</em>; and XPATH is a tool we can use to traverse and parse code written in this language.</p><p>Our previous command <code>html.fromstring</code> transformed the text of our code into XML elements, with which we can use their <code>xpath</code> property to navigate and extract specific data.</p><p>A fun command to run is <code>base.xpath(&#39;.//*&#39;)</code> as this will show us all of the root elements of the code we transformed using <code>html.fromstring</code>; any of it look familiar?</p><p>Now let&#39;s dig a bit deeper.</p><p>In the example available in the <a href="https://github.com/RebelCoding/startScraping/blob/master/startScraping.py" target="_blank" rel="noreferrer">Rebel Coding startScraping repository</a> the example code runs the following command:</p><div class="language-"><button title="Copy Code" class="copy"></button><span class="lang"></span><pre class="shiki shiki-themes github-light github-dark" style="--shiki-light:#24292e;--shiki-dark:#e1e4e8;--shiki-light-bg:#fff;--shiki-dark-bg:#24292e;" tabindex="0" dir="ltr"><code><span class="line"><span> items = base.xpath(&#39;.//*[@class=&quot;rgMasterTable&quot;]/tbody/tr&#39;)</span></span></code></pre></div><p>What we are doing here is traversing our <code>base</code> element to find <em>any</em> object with the <code>class</code> of <code>rgMasterTable</code>.</p><p>Within that element we want to dig a bit further to our <code>tbody</code> element, and finally, we want to grab <em>all</em> of the table rows contained!</p><p>We put all of these rows into our variable called <code>items</code>; and now we have a list of row elements we can cycle through to extract more specific data.</p><h3 id="organize-display-code" tabindex="-1">Organize &amp; Display Code <a class="header-anchor" href="#organize-display-code" aria-label="Permalink to “Organize &amp; Display Code”"></a></h3><p>In our example script from RebelCoding/startScraping we use the following code to process our newly acquired items:</p><div class="language-"><button title="Copy Code" class="copy"></button><span class="lang"></span><pre class="shiki shiki-themes github-light github-dark" style="--shiki-light:#24292e;--shiki-dark:#e1e4e8;--shiki-light-bg:#fff;--shiki-dark-bg:#24292e;" tabindex="0" dir="ltr"><code><span class="line"><span> for i in items[:5]:</span></span>
<span class="line"><span> d = {}</span></span>
<span class="line"><span> title = i.xpath(&#39;.//td[1]/*/a/font/text()&#39;)</span></span>
<span class="line"><span> d[&#39;title&#39;] = title[0].strip()</span></span>
<span class="line"><span> d[&#39;link&#39;] = i.xpath(&#39;.//td[1]/*/a/@href&#39;)[0]</span></span>
<span class="line"><span> ppr(d)</span></span></code></pre></div><p>We use a <em>for-loop</em> to run through the first 5 items in our list of items; and the first thing we do is create an empty dictionary in which to store our desired information.</p><p>We do this so that we can more easily access this information later.</p><p>Next, we use XPATH to specify the information we&#39;re after.</p><p>XPATH returns a list of elements by default; and if there are not items, it will return an empty list.</p><p>If there is one item, it will return a list with one item; and so in our next line, we extract that singular item and apply the <code>strip()</code> method to remove any excess empty space on either side of our news acquired <code>title</code>.</p><p>On the next line we shorten this process a bit, by simply adding the index position <code>[0]</code> to the end of our <code>xpath</code> command.</p><p>Lastly we use the Python tool <em>pretty print</em> to display our newly acquired data.</p><p>In order to run our code, we navigate to our file&#39;s location; hopefully you&#39;ve saved it in our Virtual Environment&#39;s folder for ease of use.</p><p>And with our virtual environment activated we will run the command <code>python myFirstScrape.py</code>.</p><p>Though using whatever name you save your file as; having remembered the <code>.py</code> extension at the end.</p><p>Y&#39;all just wrote your first web scraper!!!</p><p>Pour your a delicious glass of your favorite beverage or commence any other suitably celebrative action ~ cause y&#39;all just did that!</p><h1 id="more-about-python" tabindex="-1">More About Python <a class="header-anchor" href="#more-about-python" aria-label="Permalink to “More About Python”"></a></h1><hr><p>We&#39;ve glossed over quite a bit just to get ourselves up and running.</p><p>The example script provided in the Rebel Coding startScraping repository goes a bit deeper; so definitely check that out.</p><p>Though now you know how to use the Python <code>requests</code> package to mechanically grab websites; and you know how to use LXML to read the code from those sites!</p><p>Now let&#39;s wrap up by learning about the <em>full stack</em>, by which many of these sites are built and run.</p><h1 id="python-classes-js-objects" tabindex="-1">Python Classes &amp; JS Objects <a class="header-anchor" href="#python-classes-js-objects" aria-label="Permalink to “Python Classes &amp; JS Objects”"></a></h1><hr><h1 id="python-datetime-structure" tabindex="-1">Python DateTime Structure <a class="header-anchor" href="#python-datetime-structure" aria-label="Permalink to “Python DateTime Structure”"></a></h1><hr><div class="language-"><button title="Copy Code" class="copy"></button><span class="lang"></span><pre class="shiki shiki-themes github-light github-dark" style="--shiki-light:#24292e;--shiki-dark:#e1e4e8;--shiki-light-bg:#fff;--shiki-dark-bg:#24292e;" tabindex="0" dir="ltr"><code><span class="line"><span> for i in items[:5]:</span></span>
<span class="line"><span> d = {}</span></span>
<span class="line"><span> title = i.xpath(&#39;.//td[1]/*/a/font/text()&#39;)</span></span>
<span class="line"><span> d[&#39;title&#39;] = title[0].strip()</span></span>
<span class="line"><span> d[&#39;link&#39;] = i.xpath(&#39;.//td[1]/*/a/@href&#39;)[0]</span></span>
<span class="line"><span> date = i.xpath(&#39;.//td[2]/font/text()&#39;)</span></span>
<span class="line"><span> time = i.xpath(&#39;.//td[4]/font/span/font/text()&#39;)</span></span>
<span class="line"><span> time_complete = &quot; &quot;.join(date + time)</span></span>
<span class="line"><span> format_date = &#39;%m/%d/%Y %I:%M %p&#39;</span></span>
<span class="line"><span> d[&#39;real_date&#39;] = datetime.strptime(time_complete, format_date)</span></span>
<span class="line"><span> ppr(d)</span></span></code></pre></div><h1 id="reading-writing-csv-json" tabindex="-1">Reading &amp; Writing CSV/JSON <a class="header-anchor" href="#reading-writing-csv-json" aria-label="Permalink to “Reading &amp; Writing CSV/JSON”"></a></h1><hr>`,109)])])}const m=t(o,[["render",n]]);export{u as __pageData,m as default};