神刀安全网

Telemetry meets SQL

In an effort to ease data access at Mozilla we started providing SQL access to our Parquet datasets through Presto . The benefit of SQL is that it’s a commonly understood language among engineers and non-engineers alike.

Even though we use Redshift for some of our data,  there are datasets that store complex hierarchical data which would require unnatural transformations to fit in a typical SQL store that doesn’t support the flexibility of a nested data model (think of structs, arrays & maps) such as the one provided by Parquet.

Meet Presto, which provides the best of both worlds: SQL over Parquet. Presto is an open-source distributed SQL query engine for running interactive analytic queries against various data sources. It allows querying different sources such as Hive and Cassandra, relational databases or even proprietary data stores and a single query can combine data from multiple sources. Our data is stored on S3 in Parquet files indeterminately and the very same datasets can be accessed both by Spark and Presto.

Apache Drill offers similar features without requiring up-front schema knowledge, which is a big advantage over Presto given how painful a schema migration can be at times. Overall Drill feels less mature than Presto though and is not supported yet by Amazon EMR unlike Presto , which makes deployment & maintance more involved. For those reasons we picked Presto as our SQL engine.

Presto supports a rich set of data types which map to Parquet ones:

Arrays

The [] operator is used to access an element of an array and is indexed starting from one:

SELECT my_array[1] AS first_element

Maps

The [] operator is used to retrieve the value corresponding to a given key from a map :

SELECT name_to_age_map['Bob'] AS bob_age

Structs

A structure made up of named fields. The fields may be of any SQL type, and are accessed with field reference operator .

SELECT my_column.my_field

Unnesting maps and structs

The unnest function is used to expand an array or a map into a relation. Arrays are expanded into a single column, and maps are expanded into two columns (key, value). UNNEST can also be used with multiple arguments, in which case they are expanded into multiple columns, with as many rows as the highest cardinality argument:

SELECT numbers, animals, n, a FROM (   VALUES     (ARRAY[2, 5], ARRAY['dog', 'cat', 'bird']),     (ARRAY[7, 8, 9], ARRAY['cow', 'pig']) ) AS x (numbers, animals) CROSS JOIN UNNEST(numbers, animals) AS t (n, a);

which yields:

numbers  |     animals      |  n   |  a -----------+------------------+------+------  [2, 5]    | [dog, cat, bird] |    2 | dog  [2, 5]    | [dog, cat, bird] |    5 | cat  [2, 5]    | [dog, cat, bird] | NULL | bird  [7, 8, 9] | [cow, pig]       |    7 | cow  [7, 8, 9] | [cow, pig]       |    8 | pig  [7, 8, 9] | [cow, pig]       |    9 | NULL

Thelongitudinal view is currently accessible through Presto.  As a concrete example, this is how one would count the number of Telemetry fragments over a time range:

SELECT d, count(*) FROM (   SELECT substr(ssd, 1, 11) AS d   FROM longitudinal   CROSS JOIN UNNEST(subsession_start_date) AS s(ssd) ) WHERE d >= '2015-11-15' AND d < '2016-03-15' GROUP BY d ORDER BY d ASC

But where are those queries run?

Meet re:dash

Re:dash allows to query Presto directly from Firefox. After a query is run a table is displayed with the result. Queries can be saved and optionally scheduled to run periodically at a given time; the query will run at said time and update the table with the latest data.

Different kinds of plots (e.g. bar charts, line charts, boxplots, …) can be built over a table. The plots will be updated every time the query is re-run.

Telemetry meets SQL
Number of Telemetry fragments over time

Custom dashboards can be built that link to tables and plots. Users can visualize the dashboards and optionally access the SQL code behind it, fork it and add changes to it.

Telemetry meets SQL
A simple dashboard

Mozilla’s re:dash instance can be accessed at sql.telemetry.mozilla.org .

转载本站任何文章请注明:转载至神刀安全网,谢谢神刀安全网 » Telemetry meets SQL

分享到:更多 ()

评论 抢沙发

  • 昵称 (必填)
  • 邮箱 (必填)
  • 网址
分享按钮