Using a database | large data
Posted on Tue 05 November 2019 in programming
This is part of a series of articles on how to work with large data.
The first method I want to explore when working with large data, since we cannot fit it all in RAM, is to use a database for our dataset. Databases are a staple of data processing and analysis, and are extremely memory and CPU efficient.
Where do we begin?
Most popular traditional databases support inserting rows from a CSV file. In order to insert our data into the database, we must first create a database and table. I'm going to use PostgreSQL for this task, though other options such as MySQL and SQLite would work as well.
Let's take one more peek at our data to figure out what columns we need for the table.
$ head -5 Checkouts_by_Title.csv | column -s, -t
UsageClass CheckoutType MaterialType CheckoutYear CheckoutMonth Checkouts Title Creator Subjects Publisher PublicationYear
Physical Horizon BOOK 2016 3 1 "The Aleutian islands: their people and natural history (with keys for the identification of the birds and plants) by Henry B. Collins jr. Austin H. Clark [and] Egbert H. Walker." "Collins Henry B. (Henry Bascom) 1899-1987" "Aleuts Natural history Alaska Aleutian Islands Aleutian Islands Alaska" "Smithsonian institution " 1945.
Physical Horizon SOUNDDISC 2016 3 9 Reality [sound recording] / David Bowie. "Bowie David" Rock music 2001 2010 "ISO/Columbia " p2003.
Physical Horizon BOOK 2016 3 1 Thresholds / Nina Kiriki Hoffman. "Hoffman Nina Kiriki" "Moving Household Juvenile fiction Friendship Fiction Extraterrestrial beings Fiction Science fiction Juvenile fiction Fantasy Fiction Oregon Juvenile fiction" "Viking " 2010.
Digital Freegal SONG 2016 3 1 On Green Dolphin Street Nancy Wilson
Let's create a database
$ createdb checkouts
and then a table.
CREATE TABLE checkouts
(
usage_class TEXT,
checkout_type TEXT,
material_type TEXT,
checkout_year INTEGER,
checkout_month INTEGER,
checkouts INTEGER,
title TEXT,
creator TEXT,
subjects TEXT,
publisher TEXT,
publication_year TEXT
)
CREATE INDEX year_idx ON checkouts (checkout_year)
I intentionally created an index for the checkout_year column because one of the questions we want to know the answer to is "how many items were checked out last year?" Creating an index on a column that you know will be filtered on will make your query drastically more performant.
Now we can use the \copy command to insert our CSV file into the newly created table. This will take some time, depending on your computer. On mine, it took slightly over five minutes to complete.
$ time psql checkouts -c "\copy checkouts FROM './Checkouts_by_Title.csv' DELIMITER ',' CSV HEADER"
COPY 34677950
real 5m13.032s
user 0m14.329s
sys 1m23.068s
Alternatively, if you aren't comfortable writing SQL or prefer to stay in Python, the same thing can be achieved using sqlalchemy and pandas' to_sql method, though we'd need to iterate over the CSV file in chunks in order to do so.
import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("postgresql://brooks:@/checkouts")
# stream the CSV in 100k-row chunks so it never has to fit in memory at once
for partial_df in pd.read_csv("./Checkouts_by_Title.csv", chunksize=100_000):
    partial_df.to_sql("checkouts", engine, if_exists="append", index=False)
That's amazing! The incredible amount of flexibility that pandas and the Python ecosystem provide is why I love working with them.
Regardless, now that the data is all inserted, let's move on to the analysis.
How many items were checked out last year?
This can be answered with the following query
WITH checkouts_last_year AS (
SELECT checkouts
FROM checkouts
WHERE checkout_year = 2018
)
SELECT sum(checkouts) AS total_checkouts
FROM checkouts_last_year
Answer
total_checkouts
-----------------
9149176
(1 row)
With the checkout_year index, this query took less than 2 seconds. Without the index we created on the table, it took 36.
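If you want to confirm that the planner is actually using the index, EXPLAIN will tell you. Here's a quick sketch (the exact plan will vary by machine and PostgreSQL version):
-- Look for an index or bitmap scan on year_idx in the plan output;
-- a plain "Seq Scan on checkouts" means the index isn't being used.
EXPLAIN
SELECT SUM(checkouts)
FROM checkouts
WHERE checkout_year = 2018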
How popular is digital material compared to physical material, and is that increasing or decreasing?
SELECT
checkout_year, usage_class, SUM(checkouts)
FROM checkouts
GROUP BY usage_class, checkout_year
ORDER BY usage_class, checkout_year
Answer
checkout_year | usage_class | sum
---------------+-------------+---------
2005 | Digital | 3959
2006 | Digital | 16839
2007 | Digital | 34620
2008 | Digital | 63945
2009 | Digital | 99329
2010 | Digital | 183920
2011 | Digital | 536937
2012 | Digital | 863922
2013 | Digital | 1192836
2014 | Digital | 1719202
2015 | Digital | 2212553
2016 | Digital | 2452733
2017 | Digital | 2752214
2018 | Digital | 3046521
2019 | Digital | 2538416
2005 | Physical | 3794726
2006 | Physical | 6582479
2007 | Physical | 7092007
2008 | Physical | 8374541
2009 | Physical | 9035838
2010 | Physical | 8425046
2011 | Physical | 7784795
2012 | Physical | 7299124
2013 | Physical | 7864260
2014 | Physical | 7416879
2015 | Physical | 6871626
2016 | Physical | 6568318
2017 | Physical | 6479434
2018 | Physical | 6102655
2019 | Physical | 4385334
(30 rows)
Currently, physical material is around twice as popular, but you can see that over time digital material is rapidly gaining ground. If we were to plot these on a graph, we could forecast the year that digital material becomes more popular.
Also note that 2019 has fewer total checkouts than the past several years, which at first glance would seem worrying, but it is October right now, so we're only seeing partial data for 2019.
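If you'd rather stay in SQL for that forecast, Postgres ships with linear-regression aggregates (regr_slope and regr_intercept). Here's a rough sketch of my own, not part of the analysis above, that fits a yearly trend per usage class while leaving out the incomplete 2019 data; extrapolating the two lines gives a ballpark crossover year.
WITH yearly AS (
    SELECT checkout_year, usage_class, SUM(checkouts) AS total
    FROM checkouts
    WHERE checkout_year < 2019  -- 2019 is partial data, so exclude it from the fit
    GROUP BY checkout_year, usage_class
)
SELECT
    usage_class,
    regr_slope(total, checkout_year)     AS change_per_year,
    regr_intercept(total, checkout_year) AS intercept
FROM yearly
GROUP BY usage_class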
What subjects are most often checked out?
Ah, this is a challenge. Can you see why? Our data currently resides in one large table.
Table "public.checkouts"
Column | Type | Collation | Nullable | Default
------------------+---------+-----------+----------+---------
usage_class | text | | |
checkout_type | text | | |
material_type | text | | |
checkout_year | integer | | |
checkout_month | integer | | |
checkouts | integer | | |
title | text | | |
creator | text | | |
subjects | text | | |
publisher | text | | |
publication_year | text | | |
Indexes:
"year_idx" btree (checkout_year)
The previous questions were easily answered because they involved summing up whole columns from rows, based on another column. But take a look at the subjects column.
SELECT subjects FROM checkouts LIMIT 1
subjects
---------------------------------------------------
Music for meditation, New Age music, Zither music
(1 row)
Each row contains a comma-separated list of subjects. In a normalized database, there would be a second table with a list of subjects which we could easily query for the answer, but not so in this case. Luckily, Postgres has a bunch of nifty functions built in that allow us to accomplish the same thing.
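For comparison, a normalized layout might look something like the sketch below (the table and column names are my own invention, not part of the dataset); with it, this question would be a straightforward join and GROUP BY.
-- Hypothetical normalized schema, for illustration only
CREATE TABLE subjects (
    subject_id SERIAL PRIMARY KEY,
    name       TEXT UNIQUE NOT NULL
);

CREATE TABLE checkout_subjects (
    checkout_id INTEGER NOT NULL,  -- would reference a primary key on checkouts
    subject_id  INTEGER NOT NULL REFERENCES subjects (subject_id)
);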
Starting off, we have the string_to_array function, which allows us to access each subject individually.
SELECT string_to_array(subjects, ',') FROM checkouts LIMIT 1
string_to_array
--------------------------------------------------------------------------------
{Aleuts," Natural history Alaska Aleutian Islands"," Aleutian Islands Alaska"}
(1 row)
This is useful, but even more so when combined with the unnest function, which breaks up an array into multiple rows.
SELECT unnest(string_to_array(subjects, ',')) FROM checkouts LIMIT 3
unnest
------------------------------------------
Aleuts
Natural history Alaska Aleutian Islands
Aleutian Islands Alaska
(3 rows)
Note I had to change from LIMIT 1 to LIMIT 3 to accommodate the fact that there are now more rows.
Even better, we can simply add other columns, and they take on the values of the original row.
SELECT
checkout_year,
checkout_month,
checkouts,
unnest(string_to_array(subjects, ','))
FROM checkouts
LIMIT 3
checkout_year | checkout_month | checkouts | unnest
---------------+----------------+-----------+------------------------------------------
2016 | 3 | 1 | Aleuts
2016 | 3 | 1 | Natural history Alaska Aleutian Islands
2016 | 3 | 1 | Aleutian Islands Alaska
(3 rows)
Now we're cooking with fire! Finally, we can trim any leading or trailing whitespace, group the subject count by year (rather than by book, as is the case without a group by), and clean up the column names for ease of use.
WITH subject_popularity AS (
SELECT
checkout_year AS year,
checkout_month AS month,
checkouts AS checkouts,
TRIM(unnest(string_to_array(subjects, ','))) AS subject
FROM checkouts
)
SELECT
MAX(year) AS year,
SUM(checkouts) AS checkouts,
MAX(subject) AS subject
FROM subject_popularity
GROUP BY subject, year
ORDER BY year DESC, checkouts DESC
LIMIT 10
The only problem? This is a computationally expensive question. On my computer, the query took about 15 minutes to complete. The database has to take each row and expand it out by the number of subjects contained within in order to group them appropriately. How many rows would that be? We can estimate it.
WITH top_checkouts AS (
SELECT * from checkouts limit 10000
)
SELECT
AVG(array_length(string_to_array(subjects, ','), 1))
FROM top_checkouts
avg
--------------------
3.8237366619689954
(1 row)
This, multiplied by the 35 million rows in the table, is about 133 million. 133 million rows are needed to calculate the number of checkouts per subject per year, as specified by GROUP BY subject, year in the previous query. No wonder it takes so long!
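If you'd rather let the database do that arithmetic, the same estimate can be computed in one query; here's a quick sketch that combines the row count with the sample average above.
SELECT
    (SELECT COUNT(*) FROM checkouts)
    * (SELECT AVG(array_length(string_to_array(subjects, ','), 1))
       FROM (SELECT subjects FROM checkouts LIMIT 10000) AS sample)
    AS estimated_expanded_rows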
Side note: if you want to visualize the bottlenecks of a query, there are various sites that take query plans and turn them into charts. I've found this site to be extremely useful in visualizing what part of my query took the longest.
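Those visualizers generally want the plan as JSON, which Postgres can produce directly. As a quick example using the earlier checkout_year query (note that ANALYZE actually executes the query, so think twice before pointing it at the 15-minute one):
EXPLAIN (ANALYZE, FORMAT JSON)
SELECT SUM(checkouts) AS total_checkouts
FROM checkouts
WHERE checkout_year = 2018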
There's one more trick up our sleeve to make this data faster to access, and that is putting it in a materialized view. In summary, materialized views allow you to cache the results of a query and to periodically refresh the data. Creating the view will initially take as long as running the original query, but future queries against it will be much faster.
CREATE MATERIALIZED VIEW yearly_subject_popularity AS
<original query>
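One thing to keep in mind: the view won't pick up changes to the underlying table on its own. Whenever new data is loaded, the cached results can be rebuilt with a refresh, which re-runs the original query and therefore takes roughly as long.
REFRESH MATERIALIZED VIEW yearly_subject_popularity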
Finally, we can answer the original question: what subjects are most popular?
SELECT * FROM yearly_subject_popularity ORDER BY year DESC, checkouts DESC
LIMIT 10;
year | checkouts | subject
------+-----------+-------------------------------------------
2019 | 1348448 | Fiction
2019 | 854448 | Video recordings for the hearing impaired
2019 | 807154 | Nonfiction
2019 | 684857 | Feature films
2019 | 603138 | Fiction films
2019 | 524205 | Literature
2019 | 334981 | Thriller
2019 | 323856 | Mystery
2019 | 319389 | Romance
2019 | 318219 | Fantasy
(10 rows)
Conclusion
Databases are a great place to store lots of data, because that's what they're meant to do. They have an incredible amount of power, if you know how to use them correctly. For this exercise they held up well: ingesting the CSV file into a database table was a breeze, as was answering the more straightforward questions about the data.
Answering the popular-subjects question was much more difficult. Because the data sits in one big table with no normalization, the solution I came up with was inefficient and took longer than it would have, had the data been properly normalized.
But then that's part of the challenge, isn't it? We'll rarely have clean data handed to us in the real world; making the best of what we're given is an essential skill to have.
I leveled up my database-fu in the process of answering this question. For someone who is not a DBA, the largest obstacle I had to overcome was understanding the best way to approach the problem. Knowing beforehand what questions I needed to answer helped immensely, as each question presents a new challenge - but that's true for any problem.
If I had to re-do this exercise, I would definitely reach for a database as part of my solution toolbox again. In the future, I want to explore a better way of normalizing the data as it gets ingested, so that the trickier analysis questions are easier to answer.