<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>mozilla on Jeff Klukas</title>
    <link>https://jeff.klukas.net/tags/mozilla/</link>
    <description>Recent content in mozilla on Jeff Klukas</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <managingEditor>jeff@klukas.net (Jeff Klukas)</managingEditor>
    <webMaster>jeff@klukas.net (Jeff Klukas)</webMaster>
    <lastBuildDate>Wed, 04 Aug 2021 16:00:00 -0400</lastBuildDate>
    <atom:link href="https://jeff.klukas.net/tags/mozilla/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Deduplication: Where Apache Beam Fits In</title>
      <link>https://jeff.klukas.net/writing/2021-08-04-deduplication-where-apache-beam-fits-in/</link>
      <pubDate>Wed, 04 Aug 2021 16:00:00 -0400</pubDate>
      <author>jeff@klukas.net (Jeff Klukas)</author>
      <guid>https://jeff.klukas.net/writing/2021-08-04-deduplication-where-apache-beam-fits-in/</guid>
      <description>&lt;p&gt;&lt;em&gt;Summary of a talk delivered at Apache Beam Digital Summit on August 4, 2021.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://jeff.klukas.net/img/deduplication-beam/title-slide.png&#34; alt=&#34;Title slide&#34;&gt;&lt;/p&gt;
&lt;p&gt;This session will start with a brief overview of the problem of duplicate records and the different options available for handling them. We&amp;rsquo;ll then explore two concrete approaches to deduplication within a Beam streaming pipeline implemented in Mozilla’s &lt;a href=&#34;https://github.com/mozilla/gcp-ingestion/tree/main/ingestion-beam&#34;&gt;open source codebase for ingesting telemetry data&lt;/a&gt; from Firefox clients.&lt;/p&gt;
&lt;p&gt;We&amp;rsquo;ll compare the robustness, performance, and operational experience of using the deduplication built into &lt;code&gt;PubsubIO&lt;/code&gt; vs. storing IDs in an external Redis cluster, and explain why Mozilla switched from one approach to the other.&lt;/p&gt;
&lt;p&gt;Finally, we&amp;rsquo;ll compare streaming deduplication to a &lt;a href=&#34;https://github.com/mozilla/bigquery-etl/blob/main/bigquery_etl/copy_deduplicate.py&#34;&gt;much stronger end-to-end guarantee that Mozilla achieves via nightly scheduled queries&lt;/a&gt; to serve historical analysis use cases.&lt;/p&gt;
&lt;h2 id=&#34;links&#34;&gt;Links&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://www.youtube.com/watch?v=9OfJKDs3h40&#34;&gt;Recording of the talk on YouTube&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://2021.beamsummit.org/sessions/deduplication/&#34;&gt;Session page on beamsummit.org&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.google.com/presentation/d/1bNOKHw9Gssx_aDI4JilSiqtBaRrEM7SGtygsP9-MgLc/edit?usp=sharing&#34;&gt;Slides from the session in Google Docs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
    </item>
    
    <item>
      <title>The Nitty-Gritty of Moving Data with Apache Beam</title>
      <link>https://jeff.klukas.net/writing/2020-09-25-the-nitty-gritty-of-moving-data-with-apache-beam/</link>
      <pubDate>Fri, 25 Sep 2020 11:33:53 -0400</pubDate>
      <author>jeff@klukas.net (Jeff Klukas)</author>
      <guid>https://jeff.klukas.net/writing/2020-09-25-the-nitty-gritty-of-moving-data-with-apache-beam/</guid>
      <description>&lt;p&gt;&lt;em&gt;Summary of a talk delivered at Apache Beam Digital Summit on August 24, 2020.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://jeff.klukas.net/img/nitty-gritty-beam/title-slide.png&#34; alt=&#34;Title slide&#34;&gt;&lt;/p&gt;
&lt;p&gt;In this session, you won&amp;rsquo;t learn about joins or windows or timers or any other advanced features of Beam. Instead, we will focus on the real-world complexity that comes from simply moving data from one system to another safely. How do we model data as it passes from one transform to another? How do we handle errors? How do we test the system? How do we organize the code to make the pipeline configurable for different source and destination systems?&lt;/p&gt;
&lt;p&gt;We will explore how each of these questions is addressed in &lt;a href=&#34;https://github.com/mozilla/gcp-ingestion&#34;&gt;Mozilla&amp;rsquo;s open source codebase for ingesting telemetry data from Firefox clients&lt;/a&gt;. By the end of the session, you&amp;rsquo;ll be equipped to explore the codebase and documentation on your own to see how these concepts are composed.&lt;/p&gt;
&lt;h2 id=&#34;links&#34;&gt;Links&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://youtu.be/ABWKnl_N550&#34;&gt;Recording of the talk on YouTube&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://2020.beamsummit.org/sessions/nitty-gritty-moving-data-with-beam/&#34;&gt;Session page on beamsummit.org&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.google.com/presentation/d/19bEXp-OpJ0C0GcqnEuef2ytOX4OnekUFSgArWmOP1Ws/edit?usp=sharing&#34;&gt;Slides from the session in Google Docs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
    </item>
    
    <item>
      <title>Encoding Usage History in Bit Patterns</title>
      <link>https://jeff.klukas.net/writing/2020-05-29-encoding-usage-history-in-bit-patterns/</link>
      <pubDate>Fri, 29 May 2020 13:20:53 -0400</pubDate>
      <author>jeff@klukas.net (Jeff Klukas)</author>
      <guid>https://jeff.klukas.net/writing/2020-05-29-encoding-usage-history-in-bit-patterns/</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published as a &lt;a href=&#34;https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html&#34;&gt;cookbook on docs.telemetry.mozilla.org&lt;/a&gt; to instruct data users within Mozilla how to take advantage of the usage history stored in our BigQuery tables.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://jeff.klukas.net/img/bit-patterns/mau-bit-breakdown.png&#34; alt=&#34;DAU windows in a bit pattern&#34;&gt;&lt;/p&gt;
&lt;p&gt;Monthly active users (MAU) is a windowed metric that requires joining data
per client across 28 days. Calculating this from individual pings or daily
aggregations can be computationally expensive, which motivated creation of the
&lt;a href=&#34;https://docs.telemetry.mozilla.org/datasets/bigquery/clients_last_seen/reference.html&#34;&gt;&lt;code&gt;clients_last_seen&lt;/code&gt; dataset&lt;/a&gt;
for desktop Firefox and similar datasets for other applications.&lt;/p&gt;
&lt;p&gt;A powerful feature of the &lt;code&gt;clients_last_seen&lt;/code&gt; methodology is that it doesn&amp;rsquo;t
record specific metrics like MAU and WAU directly, but rather each row stores
a history of the discrete days on which a client was active in the past 28 days.
We could calculate active users in a 10-day or 25-day window just as efficiently
as a 7-day (WAU) or 28-day (MAU) window. But we can also define completely new
metrics based on these usage histories, such as various retention definitions.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>The Dashboard Problem and Data Shapes</title>
      <link>https://jeff.klukas.net/writing/2020-03-13-the-dashboard-problem-and-data-shapes/</link>
      <pubDate>Fri, 13 Mar 2020 09:33:53 -0400</pubDate>
      <author>jeff@klukas.net (Jeff Klukas)</author>
      <guid>https://jeff.klukas.net/writing/2020-03-13-the-dashboard-problem-and-data-shapes/</guid>
      <description>&lt;p&gt;&lt;em&gt;Cross-posted on the &lt;a href=&#34;https://blog.mozilla.org/data/2020/03/20/the-dashboard-problem-and-data-shapes/&#34;&gt;Data@Mozilla blog&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://jeff.klukas.net/img/data-shapes/memphis-shapes-4276605_1280.png&#34; alt=&#34;Memphis Shapes by AnnaliseArt&#34;&gt;&lt;/p&gt;
&lt;p&gt;The data teams at Mozilla have put a great deal of effort into building a robust data ingestion pipeline and reliable data warehouse that can serve a wide variety of needs. Yet, we keep coming back to conversations about &lt;em&gt;the dashboard problem&lt;/em&gt; or about how we’re missing &lt;em&gt;last mile tooling&lt;/em&gt; that makes data accessible for use in data products that we can release to different customers within Mozilla.&lt;/p&gt;</description>
    </item>
    
    <item>
      <title>Guiding Principles for Data Infrastructure</title>
      <link>https://jeff.klukas.net/writing/2019-12-28-guiding-principles-for-data-infrastructure/</link>
      <pubDate>Sat, 28 Dec 2019 13:33:53 -0400</pubDate>
      <author>jeff@klukas.net (Jeff Klukas)</author>
      <guid>https://jeff.klukas.net/writing/2019-12-28-guiding-principles-for-data-infrastructure/</guid>
      <description>Originally published as an article on docs.telemetry.mozilla.org.
So you want to build a data lake&amp;hellip; Where do you start? What building blocks are available? How can you integrate your data with the rest of the organization?
This document is intended for a few different audiences. Data consumers within Mozilla will gain a better understanding of the data they interact with by learning how the Firefox telemetry pipeline functions. Mozilla teams outside of Firefox will get some concrete guidance about how to provision and lay out data in a way that will let them integrate with the rest of Mozilla.</description>
    </item>
    
  </channel>
</rss>
