In an effort to ease data access at Mozilla we started providing SQL access to our Parquet datasets through Presto . The benefit of SQL is that it’s a commonly understood language among engineers and non-engineers alike.
Even though we use Redshift for some of our data, there are datasets that store complex hierarchical data which would require unnatural transformations to fit in a typical SQL store that doesn’t support the flexibility of a nested data model (think of structs, arrays & maps) such as the one provided by Parquet.
Meet Presto, which provides the best of both worlds: SQL over Parquet. Presto is an open-source distributed SQL query engine for running interactive analytic queries against various data sources. It allows querying different sources such as Hive and Cassandra, relational databases or even proprietary data stores and a single query can combine data from multiple sources. Our data is stored on S3 in Parquet files indeterminately and the very same datasets can be accessed both by Spark and Presto.
Apache Drill offers similar features without requiring up-front schema knowledge, which is a big advantage over Presto given how painful a schema migration can be at times. Overall Drill feels less mature than Presto though and is not supported yet by Amazon EMR unlike Presto , which makes deployment & maintance more involved. For those reasons we picked Presto as our SQL engine.
Presto supports a rich set of data types which map to Parquet ones:
The  operator is used to access an element of an array and is indexed starting from one:
SELECT my_array AS first_element
The  operator is used to retrieve the value corresponding to a given key from a map :
SELECT name_to_age_map['Bob'] AS bob_age
A structure made up of named fields. The fields may be of any SQL type, and are accessed with field reference operator .
Unnesting maps and structs
The unnest function is used to expand an array or a map into a relation. Arrays are expanded into a single column, and maps are expanded into two columns (key, value). UNNEST can also be used with multiple arguments, in which case they are expanded into multiple columns, with as many rows as the highest cardinality argument:
SELECT numbers, animals, n, a FROM ( VALUES (ARRAY[2, 5], ARRAY['dog', 'cat', 'bird']), (ARRAY[7, 8, 9], ARRAY['cow', 'pig']) ) AS x (numbers, animals) CROSS JOIN UNNEST(numbers, animals) AS t (n, a);
numbers | animals | n | a -----------+------------------+------+------ [2, 5] | [dog, cat, bird] | 2 | dog [2, 5] | [dog, cat, bird] | 5 | cat [2, 5] | [dog, cat, bird] | NULL | bird [7, 8, 9] | [cow, pig] | 7 | cow [7, 8, 9] | [cow, pig] | 8 | pig [7, 8, 9] | [cow, pig] | 9 | NULL
Thelongitudinal view is currently accessible through Presto. As a concrete example, this is how one would count the number of Telemetry fragments over a time range:
SELECT d, count(*) FROM ( SELECT substr(ssd, 1, 11) AS d FROM longitudinal CROSS JOIN UNNEST(subsession_start_date) AS s(ssd) ) WHERE d >= '2015-11-15' AND d < '2016-03-15' GROUP BY d ORDER BY d ASC
But where are those queries run?
Re:dash allows to query Presto directly from Firefox. After a query is run a table is displayed with the result. Queries can be saved and optionally scheduled to run periodically at a given time; the query will run at said time and update the table with the latest data.
Different kinds of plots (e.g. bar charts, line charts, boxplots, …) can be built over a table. The plots will be updated every time the query is re-run.
Custom dashboards can be built that link to tables and plots. Users can visualize the dashboards and optionally access the SQL code behind it, fork it and add changes to it.
Mozilla’s re:dash instance can be accessed at sql.telemetry.mozilla.org .