{"id":574,"date":"2021-07-16T07:00:11","date_gmt":"2021-07-16T12:00:11","guid":{"rendered":"https:\/\/wollen.org\/blog\/?p=574"},"modified":"2024-08-17T07:49:33","modified_gmt":"2024-08-17T12:49:33","slug":"working-with-sql-databases","status":"publish","type":"post","link":"https:\/\/wollen.org\/blog\/2021\/07\/working-with-sql-databases\/","title":{"rendered":"SQL databases aren&#8217;t actually that scary"},"content":{"rendered":"<p>Today let&#8217;s dip our toes into the SQL world. We&#8217;ve spent a lot of time working with <em>pandas<\/em> and CSV files but data won&#8217;t always be packaged that way. It&#8217;s a valuable skill to be comfortable with data in all its forms.<\/p>\n<p>We won&#8217;t spend too much time studying database queries\u2014in fact there&#8217;s only one basic query in the code below. Instead we&#8217;ll walk through a quick example of retrieving and visualizing a table.<\/p>\n<hr \/>\n<h4>1. The data.<\/h4>\n<p>This post uses the <em>chinook.db<\/em> example database that&#8217;s found all over the internet. You can learn more about it <a href=\"https:\/\/github.com\/lerocha\/chinook-database\" target=\"_blank\" rel=\"noopener\">here<\/a>. The database contains 11 separate tables but we&#8217;ll just focus on one: <em>invoices<\/em>. In a more complex SQL query you might bring multiple tables together with a <em>UNION<\/em> or <em>JOIN<\/em> operation.<\/p>\n<p>The <em>invoices<\/em> table contains 9 columns:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"raw\">InvoiceId\r\nCustomerId\r\nInvoiceDate\r\nBillingAddress\r\nBillingCity\r\nBillingState\r\nBillingCountry\r\nBillingPostalCode\r\nTotal<\/pre>\n<p>Let&#8217;s analyze how invoices vary between countries. We can find the number of invoices sent to each country and their average bill.<\/p>\n<p>Begin with the imports. <em>sqlite3<\/em> is built into the Python standard library so there&#8217;s no excuse not to take advantage! <em>contextlib<\/em> provides a convenient way to manage context when querying databases. The concept is similar to opening a text file by writing <code>with open(\"file.txt\", \"r\") as f<\/code>. Whatever you place within the block only exists while the block is executed. The idea is to place a database connection in a context, then close it when you&#8217;re done.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">import sqlite3\r\nfrom contextlib import closing\r\nimport matplotlib.pyplot as plt\r\nfrom matplotlib import ticker\r\nimport seaborn as sns<\/pre>\n<p>Before digging in let&#8217;s take a closer look at the table&#8217;s structure. There are a few different ways to retrieve column info but I find a <em>PRAGMA<\/em> operation to be the most readable. It also works independently from the cursor so you don&#8217;t have to retrieve a view before calling <code>cursor.description<\/code>.<\/p>\n<p>Notice how <code>contextlib.closing<\/code> is used below. Both the cursor and the database connection only exist in the context of this code block.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">with closing(sqlite3.connect(\"chinook.db\")) as connection:\r\n    with closing(connection.cursor()) as cursor:\r\n        column_info = cursor.execute(\"PRAGMA table_info(invoices)\").fetchall()\r\n        for col in column_info:\r\n            print(col[1])<\/pre>\n<p>At this point we&#8217;ve printed the names of all columns within the table. But notice we only printed the 1-indexed element of each item within <code>column_info<\/code>. That&#8217;s because elements include more information than just the column&#8217;s name. Below you can see all the information available:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"raw\">(0, 'InvoiceId', 'INTEGER', 1, None, 1)\r\n(1, 'CustomerId', 'INTEGER', 1, None, 0)\r\n(2, 'InvoiceDate', 'DATETIME', 1, None, 0)\r\n(3, 'BillingAddress', 'NVARCHAR(70)', 0, None, 0)\r\n(4, 'BillingCity', 'NVARCHAR(40)', 0, None, 0)\r\n(5, 'BillingState', 'NVARCHAR(40)', 0, None, 0)\r\n(6, 'BillingCountry', 'NVARCHAR(40)', 0, None, 0)\r\n(7, 'BillingPostalCode', 'NVARCHAR(10)', 0, None, 0)\r\n(8, 'Total', 'NUMERIC(10,2)', 1, None, 0)<\/pre>\n<p>It&#8217;s often helpful to know a column&#8217;s datatype, for example.<\/p>\n<hr \/>\n<h4>2. Get the data from the database.<\/h4>\n<p>For this exercise we&#8217;ll focus on <em>BillingCountry<\/em> and <em>Total<\/em> columns. That&#8217;s all we need to calculate the number of invoices from each country and each country&#8217;s average bill. The query begins by <strong>selecting<\/strong> three columns <strong>from<\/strong> a particular table. For two of the columns it does an operation on them and assigns an alias.<\/p>\n<p>The query <strong>groups by<\/strong> country, which works very similarly to <code>pandas.groupby<\/code>. All rows that represent a German buyer, for example, are combined into a single row. We do this because we&#8217;re interested in describing each country&#8217;s buyers as a group.<\/p>\n<p>There are 24 unique countries within the table but we&#8217;ll limit the query to 10 for visualization purposes. That&#8217;s what <code>LIMIT 10<\/code> means. The query specifies <strong>descending order<\/strong> to return the top-10 countries ranked by invoice <strong>count<\/strong>.<\/p>\n<p>The <code>fetchall<\/code> method returns a list of tuples with as many elements as the query returns. Since we&#8217;re grouping by country and limiting the response to 10, <code>rows<\/code> will be a list of ten 3-tuples.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">rows = cursor.execute(\r\n    \"SELECT BillingCountry, COUNT(BillingCountry) AS count, AVG(Total) AS avg \"\r\n    \"FROM invoices \"\r\n    \"GROUP BY BillingCountry \"\r\n    \"ORDER BY count DESC \"\r\n    \"LIMIT 10\"\r\n).fetchall()<\/pre>\n<p>The response:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"raw\">[\r\n ('USA', 91, 5.747912087912091),\r\n ('Canada', 56, 5.427857142857142),\r\n ('France', 35, 5.574285714285712),\r\n ...\r\n]<\/pre>\n<p>As you can see, USA is the most common customer country with Canada coming in second. USA customers spend slightly more per invoice than Canada and France.<\/p>\n<p>Now that we&#8217;ve moved data from an SQL table to a Python list, the world is our oyster. We could turn it into a DataFrame, plot it right now, or anything else. Let&#8217;s plot it with <em>Seaborn<\/em>.<\/p>\n<hr \/>\n<h4>3. Prepare the data for Seaborn.<\/h4>\n<p>The cleanest way to unpack <code>rows<\/code> into 3 variables is to use <code>zip<\/code> along with the <code>*<\/code>\u00a0 operator.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">country, count, average = list(zip(*rows))<\/pre>\n<p>The only caveat is that these variables will be tuples, not lists, which <em>Seaborn<\/em> expects. However that&#8217;s easily addressed by using <code>list()<\/code> in the plotting methods later.<\/p>\n<p>As an aside, if you aren&#8217;t familiar with the <code>*<\/code> operator, an asterisk is used to &#8220;unpack&#8221; iterables. Below is a more obvious demonstration:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">&gt;&gt;&gt; rows = [(\"John\", 1, 10), (\"Paul\", 2, 20), (\"George\", 3, 30)]\r\n&gt;&gt;&gt; print(*rows)\r\n('John', 1, 10) ('Paul', 2, 20) ('George', 3, 30)\r\n&gt;&gt;&gt; list(zip(*rows))\r\n[('John', 'Paul', 'George'), (1, 2, 3), (10, 20, 30)]<\/pre>\n<p>So just like when you iterate through a zipped pair of lists&#8230;<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">for x, y in zip(x_values, y_values):\r\n    print(x, y)<\/pre>\n<p>&#8230; The first elements go together, then the second elements, and so on. We use the same principle to separate <code>rows<\/code> into 3 iterables that we can plot.<\/p>\n<hr \/>\n<h4>4. Plot the data.<\/h4>\n<p>Now we&#8217;ll create 2 <em>barplots<\/em> to visualize the data. We&#8217;re going to place the plots side-by-side so create a 1&#215;2 subplot grid and make the figure size extra wide. Specify an axis for each plot, tweak the style and formatting as desired, and we&#8217;re ready to go.<\/p>\n<p>A few quick notes about the code:<\/p>\n<ul>\n<li><code>wspace<\/code> is useful when you have multiple plots on the same figure. It adjusts horizontal space between subplots.<\/li>\n<li>Remember we&#8217;re converting the data to lists. It&#8217;s currently in tuple form.<\/li>\n<li>You can iterate through subplots with a <code>for<\/code> loop. This might save you from repeating code.<\/li>\n<li>I use <code>matplotlib.ticker<\/code> to format y-tick labels. You could manually create a list of strings and use <code>set_yticklabels<\/code>, but I think this is a cleaner approach.<\/li>\n<\/ul>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">sns.set(style=\"darkgrid\", font=\"Ubuntu Condensed\")\r\nfig, axs = plt.subplots(1, 2, figsize=(16, 6))\r\nfig.subplots_adjust(left=0.034, right=0.987, bottom=0.183, top=0.941, wspace=0.12)\r\n\r\nsns.barplot(ax=axs[0], x=list(country), y=list(count), color=\"#9885bf\", alpha=0.85)\r\nsns.barplot(ax=axs[1], x=list(country), y=list(average), color=\"#76d94c\", alpha=0.85)\r\n\r\naxs[0].set_yticks(range(0, 120, 20))\r\naxs[0].set_ylim(0, 102)\r\naxs[1].set_yticks(range(8))\r\naxs[1].set_ylim(0, 7.1)\r\n\r\nfor ax in axs:\r\n    plt.setp(ax.xaxis.get_majorticklabels(), rotation=60, ha=\"right\", rotation_mode=\"anchor\", size=11)\r\n    plt.setp(ax.yaxis.get_majorticklabels(), size=14)\r\n\r\naxs[1].yaxis.set_major_formatter(ticker.StrMethodFormatter(\"${x:.2f}\"))\r\n\r\naxs[0].set_title(\"Number of Invoices\", size=15)\r\naxs[1].set_title(\"Average Invoice\", size=15)\r\n\r\nplt.show()<\/pre>\n<p><strong>The output:<\/strong><\/p>\n<p><a href=\"https:\/\/wollen.org\/blog\/wp-content\/uploads\/2021\/07\/chinook_db_output-1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-1758 size-full\" src=\"https:\/\/wollen.org\/blog\/wp-content\/uploads\/2021\/07\/chinook_db_output-1.png\" alt=\"\" width=\"1600\" height=\"600\" srcset=\"https:\/\/wollen.org\/blog\/wp-content\/uploads\/2021\/07\/chinook_db_output-1.png 1600w, https:\/\/wollen.org\/blog\/wp-content\/uploads\/2021\/07\/chinook_db_output-1-300x113.png 300w, https:\/\/wollen.org\/blog\/wp-content\/uploads\/2021\/07\/chinook_db_output-1-1024x384.png 1024w, https:\/\/wollen.org\/blog\/wp-content\/uploads\/2021\/07\/chinook_db_output-1-768x288.png 768w, https:\/\/wollen.org\/blog\/wp-content\/uploads\/2021\/07\/chinook_db_output-1-1536x576.png 1536w\" sizes=\"auto, (max-width: 1600px) 100vw, 1600px\" \/><\/a>As you can see, USA customers are most common but they don&#8217;t spend as much, on average, as customers from India or the Czech Republic.<\/p>\n<hr style=\"width: 50%;\" \/>\n<p>Although <em>pandas<\/em> has a convenient <code>read_sql<\/code> method, certain operations are simply easier when working with <em>sqlite3<\/em> directly. I hope this post gives you a little more confidence the next time you&#8217;re face-to-face with a database.<\/p>\n<hr \/>\n<p><strong>Full Code:<\/strong><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\">import sqlite3\r\nfrom contextlib import closing\r\nimport matplotlib.pyplot as plt\r\nfrom matplotlib import ticker\r\nimport seaborn as sns\r\n\r\n\r\nwith closing(sqlite3.connect(\"chinook.db\")) as connection:\r\n    with closing(connection.cursor()) as cursor:\r\n        rows = cursor.execute(\r\n            \"SELECT BillingCountry, COUNT(BillingCountry) AS count, AVG(Total) AS avg \"\r\n            \"FROM invoices \"\r\n            \"GROUP BY BillingCountry \"\r\n            \"ORDER BY count DESC \"\r\n            \"LIMIT 10\"\r\n        ).fetchall()\r\n\r\ncountry, count, average = list(zip(*rows))\r\n\r\nsns.set(style=\"darkgrid\", font=\"Ubuntu Condensed\")\r\nfig, axs = plt.subplots(1, 2, figsize=(16, 6))\r\nfig.subplots_adjust(left=0.034, right=0.987, bottom=0.183, top=0.941, wspace=0.12)\r\n\r\nsns.barplot(ax=axs[0], x=list(country), y=list(count), color=\"#9885bf\", alpha=0.85)\r\nsns.barplot(ax=axs[1], x=list(country), y=list(average), color=\"#76d94c\", alpha=0.85)\r\n\r\naxs[0].set_yticks(range(0, 120, 20))\r\naxs[0].set_ylim(0, 102)\r\naxs[1].set_yticks(range(8))\r\naxs[1].set_ylim(0, 7.1)\r\n\r\nfor ax in axs:\r\n    plt.setp(ax.xaxis.get_majorticklabels(), rotation=60, ha=\"right\", rotation_mode=\"anchor\", size=11)\r\n    plt.setp(ax.yaxis.get_majorticklabels(), size=14)\r\n\r\naxs[1].yaxis.set_major_formatter(ticker.StrMethodFormatter(\"${x:.2f}\"))\r\n\r\naxs[0].set_title(\"Number of Invoices\", size=15)\r\naxs[1].set_title(\"Average Invoice\", size=15)\r\n\r\nplt.show()<\/pre>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Today let&#8217;s dip our toes into the SQL world. We&#8217;ve spent a lot of time working with pandas and CSV files but data won&#8217;t always be packaged that way. It&#8217;s a valuable skill to be comfortable with data in all<\/p>\n","protected":false},"author":1,"featured_media":584,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[97],"tags":[100,101,22,24,25,103,36,98,99,102],"class_list":["post-574","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-sql","tag-chinook","tag-chinook-db","tag-data","tag-matplotlib","tag-python","tag-query","tag-seaborn","tag-sql","tag-sqlite","tag-sqlite3"],"_links":{"self":[{"href":"https:\/\/wollen.org\/blog\/wp-json\/wp\/v2\/posts\/574","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/wollen.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wollen.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wollen.org\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/wollen.org\/blog\/wp-json\/wp\/v2\/comments?post=574"}],"version-history":[{"count":32,"href":"https:\/\/wollen.org\/blog\/wp-json\/wp\/v2\/posts\/574\/revisions"}],"predecessor-version":[{"id":1759,"href":"https:\/\/wollen.org\/blog\/wp-json\/wp\/v2\/posts\/574\/revisions\/1759"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wollen.org\/blog\/wp-json\/wp\/v2\/media\/584"}],"wp:attachment":[{"href":"https:\/\/wollen.org\/blog\/wp-json\/wp\/v2\/media?parent=574"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wollen.org\/blog\/wp-json\/wp\/v2\/categories?post=574"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wollen.org\/blog\/wp-json\/wp\/v2\/tags?post=574"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}