<spanclass="line"><span> pip install lxml</span></span></code></pre></div><p>Error: If you get an error when trying to install LXML, that is totally natural and reasonable. Sometimes, some may say, that's the benefit of using a tool such as Beautiful Soup, it manages many dependencies, so that new users don't have to.</p><p>Though in truth, the effort required to supply LXML's dependencies are relatively minimal.</p><p>The package depends on a series of <code>c</code> files; for Mac users, admittedly, this may require acquiring and updating XCode to include their Command Line Tools package.</p><p>For Windows users may have their own issues, regarding Visual C++ components; notice that LXML is dependent on C-language packages.</p><p>If you run into any issues, this is your chance to check out what solutions others have found using your favorite search engine.</p><p>And if still unable to resolve the errors you receive, please reach out to <ahref="mailto:canin@dreamfreely.org"target="_blank"rel="noreferrer">canin@dreamfreely.org</a>!</p><h3id="create-new-python-file"tabindex="-1">Create New Python File <aclass="header-anchor"href="#create-new-python-file"aria-label="Permalink to “Create New Python File”"></a></h3><p>Phew! We got through that entire process.</p><p>Congratuluation!!!</p><p>You've done some great work so far; we're navigating the command-line to build a custom toolset.</p><p>That is no small accomplishment!</p><p>Next up, we start building.</p><p>Open up Notepad, or your favorite text editor, and create new file; naming it however you like, though with the <code>.py</code> extention at the end.</p><h3id="import-libraries"tabindex="-1">Import Libraries <aclass="header-anchor"href="#import-libraries"aria-label="Permalink to “Import Libraries”"></a></h3><p>Our process for building our scraper file is very similar to the steps we took when building our webpage.</p><p>First we need to gather our necessary tools.</p><p>On the first line of our file we will import our first package by typing the command <code>import requests</code>.</p><p>Yup, it is that easy; so next we will import the tools we need from LXML with the following command:</p><divclass="language-"><buttontitle="Copy Code"class="copy"></button><spanclass="lang"></span><preclass="shiki shiki-themes github-light github-dark"style="--shiki-light:#24292e;--shiki-dark:#e1e4e8;--shiki-light-bg:#fff;--shiki-dark-bg:#24292e;"tabindex="0"dir="ltr"><code><spanclass="line"><span> from lxml import html</span></span></code></pre></div><p>Feels almost magically simple doesn't it ?</p><p>Lastly, lets grab one more toolset by adding the line</p><divclass="language-"><buttontitle="Copy Code"class="copy"></button><spanclass="lang"></span><preclass="shiki shiki-themes github-light github-dark"style="--shiki-light:#24292e;--shiki-dark:#e1e4e8;--shiki-light-bg:#fff;--shiki-dark-bg:#24292e;"tabindex="0"dir="ltr"><code><spanclass="line"><span> from pprint import pprint as ppr</span></span></code></pre></div><p>This is a tool that will allow us to print our data in a more readable format.</p><p>So let's get to scraping!!</p><h3id="get-site-requests"tabindex="-1">Get Site (requests) <aclass="header-anchor"href="#get-site-requests"aria-label="Permalink to “Get Site (requests)”"></a></h3><p>What website do you want to scrape?</p><p>Mind you, some websites load their data using JavaScript (many websites do, in fact.)</p><p>And these websites will require additional tools to scrape.</p><p>Nonetheless, the command to <em>scrape</em> a website is as 
follows:</p><divclass="language-"><buttontitle="Copy Code"class="copy"></button><spanclass="lang"></span><preclass="shiki shiki-themes github-light github-dark"style="--shiki-light:#24292e;--shiki-dark:#e1e4e8;--shiki-light-bg:#fff;--shiki-dark-bg:#24292e;"tabindex="0"dir="ltr"><code><spanclass="line"><span> root = requests.get('https://www.linux.org')</span></span></code></pre></div><p>Operations will happen in the backgro
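Before we can loop over anything, the raw HTML that `requests` fetched needs to be handed to LXML and turned into a list of elements. A minimal sketch of that step follows; the variable names (`tree`, `items`) and the table XPath here are illustrative assumptions, and the example script in the Rebel Coding startScraping repository shows the exact expressions:

```
# hand the raw HTML to LXML, producing a searchable element tree
tree = html.fromstring(root.content)

# gather the elements to loop over; this XPath is a stand-in for the
# real one, which targets the news table on the page
items = tree.xpath('//table//tr')
```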
<spanclass="line"><span> d = {}</span></span>
<spanclass="line"><span> title = i.xpath('.//td[1]/*/a/font/text()')</span></span>
<spanclass="line"><span> ppr(d)</span></span></code></pre></div><p>We use a <em>for-loop</em> to run through the first 5 items in our list of items; and the first thing we do is create an empty dictionary in which to store our desired information.</p><p>We do this so that we can more easily access this information later.</p><p>Next, we use XPATH to specify the information we're after.</p><p>XPATH returns a list of elements by default; and if there are not items, it will return an empty list.</p><p>If there is one item, it will return a list with one item; and so in our next line, we extract that singular item and apply the <code>strip()</code> method to remove any excess empty space on either side of our news acquired <code>title</code>.</p><p>On the next line we shorten this process a bit, by simply adding the index position <code>[0]</code> to the end of our <code>xpath</code> command.</p><p>Lastly we use the Python tool <em>pretty print</em> to display our newly acquired data.</p><p>In order to run our code, we navigate to our file's location; hopefully you've saved it in our Virtual Environment's folder for ease of use.</p><p>And with our virtual environment activated we will run the command <code>python myFirstScrape.py</code>.</p><p>Though using whatever name you save your file as; having remembered the <code>.py</code> extension at the end.</p><p>Y'all just wrote your first web scraper!!!</p><p>Pour your a delicious glass of your favorite beverage or commence any other suitably celebrative action ~ cause y'all just did that!</p><h1id="more-about-python"tabindex="-1">More About Python <aclass="header-anchor"href="#more-about-python"aria-label="Permalink to “More About Python”"></a></h1><hr><p>We've glossed over quite a bit just to get ourselves up and running.</p><p>The example script provided in the Rebel Coding startScraping repository goes a bit deeper; so definitely check that out.</p><p>Though now you know how to use the Python <code>requests</code> package to mechanically grab websites; and you know how to use LXML to read the code from those sites!</p><p>Now let's wrap up by learning about the <em>full stack</em>, by which many of these sites are built and run.</p><h1id="python-classes-js-objects"tabindex="-1">Python Classes & JS Objects <aclass="header-anchor"href="#python-classes-js-objects"aria-label="Permalink to “Python Classes & JS Objects”"></a></h1><hr><h1id="python-datetime-structure"tabindex="-1">Python DateTime Structure <aclass="header-anchor"href="#python-datetime-structure"aria-label="Permalink to “Python DateTime Structure”"></a></h1><hr><divclass="language-"><buttontitle="Copy Code"class="copy"></button><spanclass="lang"></span><preclass="shiki shiki-themes github-light github-dark"style="--shiki-light:#24292e;--shiki-dark:#e1e4e8;--shiki-light-bg:#fff;--shiki-dark-bg:#24292e;"tabindex="0"dir="ltr"><code><spanclass="line"><span> for i in items[:5]:</span></span>
<spanclass="line"><span> d = {}</span></span>
<spanclass="line"><span> title = i.xpath('.//td[1]/*/a/font/text()')</span></span>
<spanclass="line"><span> ppr(d)</span></span></code></pre></div><h1id="reading-writing-csv-json"tabindex="-1">Reading & Writing CSV/JSON <aclass="header-anchor"href="#reading-writing-csv-json"aria-label="Permalink to “Reading & Writing CSV/JSON”"></a></h1><hr></div></div></main><footerclass="VPDocFooter"data-v-7011f0d8data-v-e257564d><!--[--><!--]--><!----><navclass="prev-next"aria-labelledby="doc-footer-aria-label"data-v-e257564d><spanclass="visually-hidden"id="doc-footer-aria-label"data-v-e257564d>Pager</span><divclass="pager"data-v-e257564d><aclass="VPLink link pager-link prev"href="/rebel_coding/step2.html"data-v-e257564d><!--[--><spanclass="desc"data-v-e257564d>Previous page</span><spanclass="title"data-v-e257564d>Step 2: JavaScript</span><!--]--></a></div><divclass="pager"data-v-e257564d><aclass="VPLink link pager-link next"href="/rebel_coding/step4.html"data-v-e257564d><!--[--><spanclass="desc"data-v-e257564d>Next page</span><spanclass="title"data-v-e257564d>Step 4: The Full Stack</span><!--]--></a></div></nav></footer><!--[--><!--]--></div></div></div><!--[--><!--]--></div></div><!----><!--[--><!--]--></div></div>