Live Track Sessions

Our full program of speakers for the live track

Data Engineers: Privacy Is Your Problem

5 October, 12:00 am

Data has the power to transform and automate business processes. But data also comes with a responsibility to keep people’s personal information protected and to ensure it is used ethically. It falls to data engineers to instinctively identify and separate dangerous data from the benign. In this session, Stephen Bailey, PhD, director of data and analytics at Immuta, will discuss the need for data engineers to take on the responsibility of data privacy. While all organizations working with data understand the seriousness of data privacy, many aren’t sure who is responsible for protecting it. Stephen will explain why managing privacy loss is something only data engineers can solve, as data engineers are the ones who built the systems. He will outline a suggested new set of engineering best practices that go beyond the domains of security and system design, including understanding the strengths and weaknesses of data masking and learning anonymization techniques like k-anonymization and differential privacy. Ultimately, data engineers should know the practice of privacy by design as intuitively as they do the principle of least privilege. Attendees of this educational session will walk away with a better understanding of how they can improve data privacy without compromising data access control. Threats to data aren’t slowing down, so it’s time to fight back.
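As a taste of one technique the abstract mentions, here is a minimal sketch of a k-anonymity check: a record set is k-anonymous if every combination of quasi-identifier values appears at least k times. The records and column names below are invented for illustration.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in combos.values())

# Hypothetical generalized records: zip codes truncated, ages bucketed.
records = [
    {"zip": "902**", "age_band": "30-39", "diagnosis": "A"},
    {"zip": "902**", "age_band": "30-39", "diagnosis": "B"},
    {"zip": "100**", "age_band": "40-49", "diagnosis": "C"},
]

# False: the ("100**", "40-49") group contains only one record.
print(is_k_anonymous(records, ["zip", "age_band"], 2))
```

Real anonymization work is far subtler (suppression, generalization hierarchies, and differential privacy's noise calibration), but the check above is the basic invariant k-anonymization aims for.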

From experiment to production - the journey of a machine learning model

5 October, 12:50 am

Machine learning is seeing ever wider adoption in the industry as a technology for building intelligent, data-enriched systems. In practice, however, machine learning work often stops at the end of rapid experiments, before its value can be harvested in the real world, because the process and culture of taking ML models from experiment to production are lacking. A machine learning project typically consists of three main phases: experiment, development, and deployment. Correspondingly, a machine learning system consists of processes and components that facilitate the operations and data flows in those three phases, providing a framework and best practices for practitioners to take models from experiment to production. This talk explores the process pattern of getting a machine learning model into production and the system that supports this operation. An example of productionising a natural language processing model will be used to illustrate how a model travels through experiment tracking, data artifact management, and data/machine learning pipelines on its way into production. The talk will also provide examples of tooling and patterns to use at each stage.
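The three phases can be pictured as a hand-off chain, where each stage records artifacts the next stage consumes. The following toy sketch (all names, metrics, and paths are hypothetical, not the speaker's actual system) just illustrates that shape:

```python
# Toy sketch of the experiment -> development -> deployment hand-off.
# Every value here is illustrative; a real system would use an experiment
# tracker and artifact store rather than in-memory dicts.

def experiment(_data):
    # Try candidate models and track each run's metrics.
    runs = [{"model": "baseline", "accuracy": 0.71},
            {"model": "tuned", "accuracy": 0.83}]
    return max(runs, key=lambda r: r["accuracy"])

def develop(best_run):
    # Harden the winning experiment: pin the data artifact, package the model.
    return {"model": best_run["model"],
            "artifact": f"s3://models/{best_run['model']}.pkl"}

def deploy(package):
    # Promote the packaged model into the serving environment.
    return {"endpoint": f"/predict/{package['model']}", "status": "live"}

release = deploy(develop(experiment(_data=None)))
print(release)  # {'endpoint': '/predict/tuned', 'status': 'live'}
```

The point is less the code than the contract: each phase produces a durable, named artifact so the journey to production is reproducible rather than ad hoc.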

Shift-left testing: Building reliable data pipelines

5 October, 3:30 am

Unreliable data pipelines can result in data downtime. “Data downtime” refers to periods when your data is partial, erroneous, missing, or otherwise inaccurate. Data-driven organisations can pay a heavy price for low-trust data. One of the challenges of building reliable data pipelines is unexpected data arriving from different sources. In this talk we will share our experiences of building reliable data pipelines by (1) “shifting left” testing to the early stages of the development life cycle, and (2) “shifting left” validations using data contracts. We will then conclude by discussing how we can make sure the system catches unexpected data and is able to recover from it.
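To make the data-contract idea concrete, here is a minimal hypothetical contract check (the schema and field names are invented): the producer declares the shape it promises to emit, and the pipeline rejects bad records at ingestion rather than discovering them downstream.

```python
# Minimal, hypothetical data contract: the producer promises these fields
# and types; records are validated as early as possible ("shift left").

CONTRACT = {"order_id": int, "amount": float, "currency": str}

def validate(record, contract=CONTRACT):
    """Return a list of contract violations (empty means the record passes)."""
    errors = []
    for field, expected_type in contract.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    return errors

good = {"order_id": 1, "amount": 9.99, "currency": "GBP"}
bad = {"order_id": "1", "amount": 9.99}

print(validate(good))  # []
print(validate(bad))   # ['order_id: expected int', 'missing field: currency']
```

Running checks like this in CI against sample payloads is the "shift-left" part: the contract breaks the build, not the dashboard.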

Reliable data engineering made easy

6 October, 12:40 am

Organisations have a wealth of information siloed in various data sources, ranging from databases (Oracle, MySQL, Postgres, etc.) to product applications (Salesforce, Marketo, HubSpot, etc.). A significant number of use cases need data from these diverse sources to produce meaningful reports and predictions. For many years, organisations tried to centrally collect all their data in the data warehouse, but warehouses were ill-suited to, or too expensive for, unstructured data, semi-structured data, and data with high variety, velocity, and volume. They also limited the types of analytics data teams could run, leaving them unable to do machine learning or anything beyond basic SQL. Delta Lake, released and open-sourced in 2019, is helping thousands of organisations build central data repositories in an open format far more reliably and efficiently than before. Delta Lake provides ACID transactions and efficient indexing, which are critical for exposing data to various access patterns, from ad-hoc SQL queries in BI tools to scheduled ML jobs. This session will show how Delta Live Tables makes the one-two step of data ingestion and data quality easy.
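Delta Live Tables expresses data quality as declarative expectations attached to a table definition. The real API is Databricks-specific, so the following is only a rough pure-Python sketch of that idea (rule names and rows are invented): each rule names a condition, and failing rows are separated out before the table is published.

```python
# Rough pure-Python sketch of declarative data quality "expectations":
# each named rule must hold for a row to be kept. (Illustrative only;
# not the Delta Live Tables API itself.)

expectations = {
    "valid_id": lambda row: row.get("id") is not None,
    "positive_amount": lambda row: row.get("amount", 0) > 0,
}

def apply_expectations(rows, rules):
    """Split rows into those passing every rule and those failing any."""
    kept, dropped = [], []
    for row in rows:
        (kept if all(rule(row) for rule in rules.values()) else dropped).append(row)
    return kept, dropped

rows = [
    {"id": 1, "amount": 5.0},
    {"id": None, "amount": 2.0},   # fails valid_id
    {"id": 2, "amount": -1.0},     # fails positive_amount
]
kept, dropped = apply_expectations(rows, expectations)
print(len(kept), len(dropped))  # 1 2
```

The declarative framing is the point: quality rules live next to the table definition instead of being scattered through imperative pipeline code.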

Situational Intelligence with Analytics in Motion

7 October, 2:40 am

The growth of streaming data is a massive industry trend in enterprise tech. Connecting applications to streaming data can unlock insights about what's happening in your business in real time. However, being able to consume data in real time only gets you halfway: you need analytics to drive intelligence, not only to tell you what is happening but, more importantly, why it is happening. Harnessing this potential for intelligence from streaming data requires a new approach to analytics: Analytics in Motion. In this talk, we will share how this new data architecture can help organizations unlock the potential of their data by enabling rapid, interactive conversations with it and the ability to process massive data in real time across all of their data sources.

Albero - Decision Tree: “When to Use What” for the Full Azure Data Estate

7 October, 3:10 am

This year a lot of attention is focused on data and data services. Microsoft is doing its absolute best to help customers benefit from the wide portfolio of Azure Data Services. But what if that portfolio is a bit too much for many of us, not to mention our customers? There are more than 20 main data services, plus SKUs. Navigating such a wide range of technologies and answering the simple question “When should I choose which data technology?” has become quite a complex task, even for experts. Andrei, Elizabeth and Eleni joined forces to create a universal, easy-to-consume bird’s-eye view of the entire Azure data estate. This decision tree helps customers and partners shape their thinking around selecting particular data technologies and using them together to build data solutions.

Scaling Proximity Targeting via Delta Lakehouse based Data Platforms Ecosystem

7 October, 3:40 am

Proximity Targeting is a marketing technique that uses mobile location services to reach consumers in real time when they are near a store location or point of interest. This is done by defining a radius around a specific location: if a consumer has opted into location services on their mobile phone and enters that radius, proximity targeting triggers an advertisement or message in an effort to influence their behaviour. This can be combined with the ability to purchase impressions through programmatic ad platforms powered by real-time bidding, helping businesses formulate the right strategy for influencing users in a particular geographical area. They can build user groups based on certain characteristics (such as neighbourhood, demographics, interests, and other data), and subsequently launch another campaign that targets anyone with those characteristics. The growth of mobile devices has led to enormous data generation, which offers tremendous potential when used effectively, so an efficient platform is needed to process such huge data volumes with minimum latency and cost. This talk describes MIQ's journey in building a fast, scalable and cost-effective processing platform using Spark, MLlib, Kafka-based event-driven microservices and a Delta Lakehouse architecture, delivering faster, actionable insights for proximity targeting and powering a product that generates roughly 30 million dollars in revenue year on year.
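The radius check at the heart of proximity targeting is a geofence test. A minimal sketch, with hypothetical coordinates, using the haversine great-circle distance:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in metres."""
    r = 6_371_000  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def in_geofence(device, store, radius_m):
    """True if the device's (lat, lon) lies within radius_m of the store."""
    return haversine_m(*device, *store) <= radius_m

store = (51.5074, -0.1278)  # hypothetical store location
print(in_geofence((51.5080, -0.1270), store, 200))  # True: under 100 m away
```

At scale, every opted-in location ping is a candidate for this test against many geofences, which is exactly why the abstract's emphasis on low-latency, high-volume stream processing matters.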

On-demand Sessions

Our on-demand speakers will be available directly after the conclusion of the conference

Better, Faster, Stronger Streaming: Your First Dive into Flink SQL

For the most flexible, powerful stream processing engines, it can seem like the barrier to entry has never been higher. If you’ve tried, or have been interested in, leveraging the strengths of real-time data processing - maybe for machine learning, IoT, anomaly detection or data analysis - but you’ve been held back: I’ve been there, and it’s frustrating. That’s why this talk is for you. It’s also for you if you ARE experienced with stream processing but want an easy (and, if I say so myself, pretty fun) way to add some of the newest, bleeding-edge features to your toolbelt. This session is about getting started with Flink SQL. Apache Flink’s high-level SQL language has the familiarity of the SQL you know and love (or at least, know…), but with some powerful new functionality, and of course the benefit of working with both Flink and PyFlink. More specifically, this will be a pragmatic entry into creating data pipelines with Flink SQL, as well as a sneak peek at some of its newest and most interesting features.
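One of the things Flink SQL gives you over plain SQL is windowed aggregation over an unbounded stream, e.g. tumbling windows that bucket events into fixed, non-overlapping intervals. As a conceptual illustration only (not Flink code; the events are made up), here is the tumbling-window semantics in pure Python:

```python
from collections import defaultdict

# Conceptual illustration of tumbling-window counting, the semantics behind
# a Flink SQL query that groups a stream by fixed-size event-time windows.

def tumble_count(events, window_size_s):
    """Count events per fixed, non-overlapping window of window_size_s seconds."""
    counts = defaultdict(int)
    for ts, _payload in events:
        window_start = (ts // window_size_s) * window_size_s
        counts[window_start] += 1
    return dict(counts)

events = [(0, "a"), (12, "b"), (59, "c"), (61, "d")]  # (epoch seconds, payload)
print(tumble_count(events, 60))  # {0: 3, 60: 1}
```

Flink does this continuously and incrementally over live data, with event-time and watermark handling that a batch sketch like this ignores entirely.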

Metrics-driven Data Architectures

The gravitational pull of cloud data warehouses is a powerful force on data platform architecture, as evidenced by the growing use of data warehouses to build and serve data products and the well-established shift from ETL to ELT transformation patterns. More recently, the trend of pushing data transformations inside the data warehouse is playing out on the consumption side, with business metric definition and calculation shifting from fragmented locations across multiple BI tools and data science notebooks into the data warehouse, where they can be published for use across the organisation by a range of consumers. This trend has recently coalesced around the notion of a metrics layer within the modern data stack. In this talk I will unpack the challenges motivating this change in data architecture and identify the core features a metrics layer needs in order to meet them. I will look at different flavours of data architecture that enable these features by surveying existing OSS tools and vendor offerings. In doing so, I will address how this notion of a metrics layer differs from existing approaches such as OLAP databases, and whether data warehouses are the appropriate place to build metrics layers.
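The core idea of a metrics layer can be shown in miniature: metrics are defined once, centrally, and every consumer computes them from the same definition instead of re-implementing them per BI tool or notebook. A toy sketch with invented metric names and data:

```python
# Toy metrics layer: one central registry of metric definitions,
# consumed by every downstream tool. (Names and data are illustrative.)

METRICS = {
    "revenue": lambda rows: sum(r["amount"] for r in rows),
    "order_count": lambda rows: len(rows),
    "avg_order_value": lambda rows: sum(r["amount"] for r in rows) / len(rows),
}

def compute(metric_name, rows):
    """Every consumer (dashboard, notebook, API) calls this one definition."""
    return METRICS[metric_name](rows)

orders = [{"amount": 10.0}, {"amount": 30.0}]
print(compute("revenue", orders))          # 40.0
print(compute("avg_order_value", orders))  # 20.0
```

Real metrics layers add the hard parts this sketch omits: dimensional slicing, time grains, caching, and compiling definitions down to warehouse SQL.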

Object Compaction in Cloud for High Yield

In file systems, large sequential writes are more beneficial than small random writes, which is why many storage systems implement a log-structured file system. In the same way, the cloud favors large objects over small objects. Cloud providers place throttling limits on PUTs and GETs, so it takes significantly longer to upload a batch of small objects than a single large object of the aggregate size. Moreover, there are per-PUT costs associated with uploading smaller objects. At Netflix, a lot of media assets and their relevant metadata are generated and pushed to the cloud. Most of these files are between tens of bytes and tens of kilobytes in size and are saved as small objects. In this talk, we propose a strategy to compact these small objects into larger blobs before uploading them to the cloud. We will discuss the policies for selecting relevant smaller objects, and how to manage the indexing of those objects within the blob. We will also discuss how different cloud storage operations, such as reads and deletes, would be implemented for such objects, including recycling blobs that contain dead small objects (due to overwrites, etc.). Finally, we will showcase the potential impact of such a strategy on Netflix assets in terms of cost and performance.
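The essence of the proposed compaction is packing many small objects into one blob while keeping an index of (offset, length) entries so individual objects remain addressable. A minimal sketch of that layout (keys and payloads are hypothetical, and a real design would persist the index and handle overwrites):

```python
# Sketch of small-object compaction: one blob, plus an index mapping each
# original key to its (offset, length) within the blob.

def compact(objects):
    """objects: dict of key -> bytes. Returns (blob, index)."""
    blob, index, offset = bytearray(), {}, 0
    for key, data in objects.items():
        index[key] = (offset, len(data))
        blob.extend(data)
        offset += len(data)
    return bytes(blob), index

def read(blob, index, key):
    """Serve a point read without touching the other packed objects."""
    offset, length = index[key]
    return blob[offset:offset + length]

objects = {"meta/a.json": b'{"x":1}', "meta/b.json": b'{"y":2}'}
blob, index = compact(objects)
print(read(blob, index, "meta/b.json"))  # b'{"y":2}'
```

With object stores that support ranged GETs, a read like this becomes a single range request into the blob, so one PUT replaces thousands while point reads stay cheap; deletes just mark index entries dead until the blob is recycled.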

Snowpipe Integration via Yaml

If a project uses AWS and Snowflake, Snowpipe is by far the best way to set up continuous integration. While working with one of our clients (who happens to be a fan of AWS and Snowflake), we came up with a way to set up Snowpipe integration for any project using just 10 to 15 lines of YAML. Yes, you heard correctly: YAML to deploy Snowpipe, not SQL. In this presentation I’ll share techniques to overcome several problems: running SQL on every pipeline run, and hence unnecessary SQL execution; SQL queries that are hard to maintain when a project has no SQL platform; and, specific to this project, the fact that every new change or branch created a number of artifacts, and merging the changes to master destroyed and recreated everything, causing downtime on customer-heavy projects. Everyone likes less code, whether it’s SQL or anything else. What did it solve? Beautiful is better than ugly: YAML reads more cleanly than SQL. The flags we used saved a lot of time and execution: Create skips work if the pipe already exists, Update runs only the necessary SQL, and Delete handles the housekeeping. The happiest of all were the DevOps folks, who love YAML more than anything. And there is no outage if no changes are made to the Snowpipe artifacts.
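The speaker's actual tooling isn't shown, but the general shape of "YAML in, Snowpipe DDL out" can be sketched. Assume a short YAML file deserializes to a spec like the dict below (all field names invented for illustration); a generator then renders the Snowflake `CREATE PIPE` statement so nobody hand-maintains the SQL:

```python
# Hypothetical sketch: a declarative spec (what ~10 lines of YAML might
# parse into) is rendered into Snowpipe DDL. Field names are invented.

spec = {
    "pipe": "orders_pipe",
    "table": "raw.orders",
    "stage": "@orders_stage",
    "file_format": "json_format",
    "auto_ingest": True,
}

def render_create_pipe(s):
    """Render a CREATE PIPE statement from the declarative spec."""
    return (
        f"CREATE PIPE IF NOT EXISTS {s['pipe']} "
        f"AUTO_INGEST = {str(s['auto_ingest']).upper()} AS "
        f"COPY INTO {s['table']} FROM {s['stage']} "
        f"FILE_FORMAT = (FORMAT_NAME = '{s['file_format']}')"
    )

print(render_create_pipe(spec))
```

The `IF NOT EXISTS` guard mirrors the abstract's point about flags: rerunning the pipeline with an unchanged spec issues no destructive SQL, so there is no outage when nothing changed.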