Live Track Sessions
Our full program of speakers for the live track
Day 1 - Welcome
4 October, 11:00 pm
The DataEngBytes team will welcome you to the conference and take you through what to expect on day 1.
The Rise & Downfall of the Data Engineer REVISITED
4 October, 11:10 pm
In 2017, I wrote two blog posts about data engineering: "The Rise of the Data Engineer" was an attempt at defining the emerging role, and "The Downfall of the Data Engineer" exposed some of the challenges [and opportunities] around the role. Four years later, it's a good time to revisit all of this and explore what has changed, from the tool landscape to the role & responsibilities.
Data Engineers: Privacy Is Your Problem
5 October, 12:00 am
Data has the ability to transform and automate business processes. But data also comes with a responsibility to keep people’s personal information protected and ensure it is used in an ethical manner. It is the responsibility of data engineers to instinctively identify and separate dangerous data from the benign. In this session, Stephen Bailey PhD, director of data and analytics for Immuta, will discuss the need for data engineers to take on the responsibility of data privacy. While all organizations working with data understand the seriousness of data privacy, many aren’t sure who is responsible for protecting it. Stephen will explain why managing privacy loss is something only data engineers can solve, as data engineers are the ones who created the systems. He will outline a suggested new set of engineering best practices that go beyond the domains of security and system design. These best practices include understanding the strengths and weaknesses of data masking, and learning anonymization techniques like k-anonymization and differential privacy. Ultimately, data engineers should know the practice of privacy by design as intuitively as they do the principle of least privilege. Attendees of this educational session will walk away with a better understanding of how they can improve data privacy without compromising data access control. Threats to data aren’t slowing down, so it’s time to fight against them.
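One of the anonymization techniques the abstract names, k-anonymization, is easy to illustrate: a dataset is k-anonymous when every combination of quasi-identifiers appears at least k times. A minimal pure-Python check (the field names and data below are illustrative, not from the talk):

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    combos = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in combos.values())

records = [
    {"age_band": "30-39", "postcode": "2000", "diagnosis": "flu"},
    {"age_band": "30-39", "postcode": "2000", "diagnosis": "cold"},
    {"age_band": "40-49", "postcode": "3000", "diagnosis": "flu"},
]

# The (40-49, 3000) group has only one member, so 2-anonymity fails.
print(is_k_anonymous(records, ["age_band", "postcode"], 2))  # False
```

A real pipeline would generalise or suppress values (widening age bands, truncating postcodes) until the check passes.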
From experiment to production - a journey of a machine learning model
5 October, 12:50 am
Machine learning is seeing wider and wider adoption in the industry as a technology for building intelligent, data-enriched systems. However, the reality is that machine learning practice often stops at the end of rapid experiments before its value can be harvested in the real world. This is because the process and the culture of taking ML models from experiment to production is lacking. A machine learning project typically consists of three main phases: experiment, development and deployment. Correspondingly, a machine learning system consists of processes and components that facilitate the operations and data flows in those three phases. This system provides a framework and best practices for machine learning practitioners to develop models from experiment to production. This talk explores the process pattern of getting a machine learning model into production and the system to support this operation. An example of productionising a natural language processing model will be used to illustrate how a model travels through experiment tracking, data artifacts management, data / machine learning pipelines and goes into production. The talk will also provide examples of tooling and patterns to be used at each stage.
Chain-speed inference for Computer Vision Pipelines
5 October, 1:20 am
I will go through some best practices for minimizing real-time prediction latency for devices in processing plants. I will cover data collection, processing, data pipelines, model deployment, and monitoring. I would also like to discuss common challenges that arise in factories: a poor network connection, space limitations, and keeping up with the chain-speed.
5 October, 1:50 am
Time to fuel that body of yours after fuelling your mind with all this great DataEng knowledge!
The Startup Panel
5 October, 2:20 am
Whether you're doing it all by yourself or trying to rapidly grow your data team faster than you can take on VC capital... this session is a discussion of experts who've been there and survived to tell the tale!
Trust, Knowledge and your Data. Our approach at KADA to building a great data product
5 October, 3:00 am
Have you spent time building a great data product that failed to gain traction? Or found users using a legacy report despite a better report being available? It's a common occurrence. Through our journey at KADA, we have identified 5 factors that make a great, trusted data product. In this talk, I will share how you can improve your data products and show you how we built these features into K, our platform for making trust & knowledge a key part of the modern data stack.
Shift-left testing: Building reliable Data Pipelines
5 October, 3:30 am
Unreliable data pipelines can result in data downtime. “Data downtime” refers to periods of time when your data is partial, erroneous, missing or otherwise inaccurate. Data-driven organisations may pay a heavy cost for low-trust data. One of the challenges of building reliable data pipelines is unexpected data coming from different sources. In this talk we will share our experiences of how we can build reliable data pipelines by 1) “shifting left” testing to the early stages of the development life cycle, and 2) “shifting left” validations using data contracts. We will then conclude by discussing how we can make sure that the system catches unexpected data and is able to recover from it.
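The "shift validation left with data contracts" idea can be sketched in a few lines: declare the schema the producer has agreed to, and reject records before they enter the pipeline. A minimal pure-Python illustration (the contract fields below are hypothetical, not from the talk):

```python
# A data contract: the fields and types agreed with the upstream producer.
ORDER_CONTRACT = {"order_id": str, "amount": float, "currency": str}

def validate(record, contract):
    """Return a list of contract violations for one record (empty list = valid)."""
    errors = []
    for field, expected_type in contract.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}, "
                          f"got {type(record[field]).__name__}")
    return errors

good = {"order_id": "A1", "amount": 9.99, "currency": "AUD"}
bad = {"order_id": "A2", "amount": "9.99"}  # wrong type, and missing currency

print(validate(good, ORDER_CONTRACT))  # []
print(validate(bad, ORDER_CONTRACT))   # two violations
```

Running this check at ingestion time means a producer-side change surfaces as an explicit contract violation rather than a silent downstream failure.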
Beginners guide to Azure MLOps
5 October, 4:00 am
A commercial scenario and example of work we have recently done for a large engineering client, and lessons learned around implementing Azure MLOps. With a technical how-to covering:
• Creating reproducible ML pipelines
• Creating reusable software environments
• Registering, packaging, and deploying models from anywhere
• Capturing governance data for the end-to-end ML lifecycle
• Notifying and alerting on events in the ML lifecycle
• Monitoring ML applications for operational and ML-related issues
• Automating the end-to-end ML lifecycle with Azure Machine Learning and Azure Pipelines
Welcome to Day 2
5 October, 11:00 pm
The DataEngBytes team return to welcome you to an action-packed day 2... strap yourselves in for some awesome talks!
What is a Data Mesh - And How Not To Mesh it Up
5 October, 11:10 pm
Nowadays, it seems like every data person falls into two camps: those who understand the data mesh and those who don’t. Rarely in recent memory has a topic taken the data world by storm, spawning hundreds of blog articles, lively discussion on Twitter, and a thriving Slack community. But with this new adoption comes new opportunities for misunderstanding around the data mesh – and how to build one with data trust and reliability in mind. In this talk, Barr Moses, CEO and co-founder of Monte Carlo, will explain what a data mesh is (and isn't) and how teams can get started.
Data Quality with Great Expectations and Airflow in a Reverse-ETL World
6 October, 12:00 am
Data-driven companies are asking their analytics teams to expose information in the data warehouse to third-party applications used by others in the organization. With analytics workflows having increasing downstream dependencies, data quality testing becomes of utmost importance. In this talk, we’ll walk through leveraging the Great Expectations library within a data processing workflow in Airflow. With this architecture, we will establish a gateway to ensure bad data is not exposed downstream while also notifying the team when data quality tests fail.
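The talk uses the Great Expectations library inside Airflow, but the gating pattern itself can be sketched without either tool: run a set of checks and raise on failure, so the task fails and nothing downstream ever sees bad data. A pure-Python sketch of that pattern (check names and data are illustrative, not from the talk):

```python
def expect_no_nulls(rows, column):
    """Check that no row has a null in the given column."""
    return all(row.get(column) is not None for row in rows)

def expect_values_between(rows, column, low, high):
    """Check that all values in the column fall within [low, high]."""
    return all(low <= row[column] <= high for row in rows)

def quality_gate(rows):
    """Raise if any expectation fails, so downstream tasks never run on bad data."""
    checks = {
        "user_id not null": expect_no_nulls(rows, "user_id"),
        "score in [0, 1]": expect_values_between(rows, "score", 0.0, 1.0),
    }
    failed = [name for name, passed in checks.items() if not passed]
    if failed:
        # In an orchestrator like Airflow, an unhandled exception fails the
        # task, halts downstream dependencies, and can trigger an alert.
        raise ValueError(f"data quality checks failed: {failed}")
    return rows

rows = [{"user_id": 1, "score": 0.4}, {"user_id": 2, "score": 0.9}]
quality_gate(rows)  # passes silently; bad data would raise instead
```

Great Expectations generalises this with declarative expectation suites, data docs, and ready-made Airflow operators, but the control flow is the same: fail loudly before exposure.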
Reliable data engineering made easy
6 October, 12:40 am
Organisations have a wealth of information siloed in various data sources. These could vary from databases (Oracle, MySQL, Postgres, etc) to product applications (Salesforce, Marketo, HubSpot, etc). A significant number of use-cases need data from these diverse data sources to produce meaningful reports and predictions. For many years, organisations tried to centrally collect all their data in the data warehouse, but these were not suited or were too expensive for handling unstructured data, semi-structured data, and data with high variety, velocity, and volume. It also limited the types of analytics data teams could use, leaving them unable to do machine learning or anything beyond basic SQL. Delta Lake, released and open-sourced in 2019, is helping thousands of organisations build central data repositories in an open format much more reliably and efficiently than before. Delta Lake provides ACID transactions and efficient indexing that is critical for exposing the data for various access patterns, ranging from ad-hoc SQL queries in BI tools to scheduled ML jobs. This session will provide the one-two step of data ingestion & quality made easy with Delta Live Tables.
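The ACID guarantees mentioned above come from Delta Lake's ordered transaction log: each commit appends atomically to the log, and readers reconstruct a consistent table snapshot (at any version) from it. A toy in-memory illustration of the idea, not the real Delta protocol:

```python
class ToyDeltaLog:
    """Toy append-only commit log: readers always see a complete snapshot."""

    def __init__(self):
        self.commits = []  # each commit is a list of 'add file' actions

    def commit(self, added_files):
        # Atomic: a commit is appended as one unit, never partially visible.
        self.commits.append(list(added_files))

    def snapshot(self, version=None):
        # Rebuild the table state as of a given log version ("time travel").
        upto = self.commits if version is None else self.commits[: version + 1]
        return [f for commit in upto for f in commit]

log = ToyDeltaLog()
log.commit(["part-000.parquet"])
log.commit(["part-001.parquet", "part-002.parquet"])

print(log.snapshot(version=0))  # ['part-000.parquet']
print(log.snapshot())           # all three files
```

The real protocol also records remove actions, schema, and statistics, and periodically checkpoints the log, but the core idea is the same: the log, not the files, is the source of truth.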
Data quality: the key to long term happiness
6 October, 1:10 am
The advent of modern data warehousing has levelled the playing field and shifted the focus from data volume to data quality instead. This talk aims to practically explore approaches to quantify data quality at various stages in the collection and processing lifecycle, and to present tools that can be implemented to help in the fight against erroneous data.
6 October, 1:40 am
Time to fuel that body of yours after fuelling your mind with all this great DataEng knowledge!
Building a Data Platform at Assembly Payments
6 October, 2:10 am
- The data lake journey @ Assembly Payments
- A description of the architecture using RDS, DynamoDB, DMS, Firehose, Lambda, S3, Glue, EventBridge, Step Functions, Fargate, DBT, Snowflake and Looker
- The benefits we saw: event-driven design, serverless, performance, low cost and lower operational overhead
- Some of our learnings from using Step Functions, Glue, incremental loads and dependency management
Streaming data analytics with Apache Flink
6 October, 2:40 am
Real-time analytics are on the rise. Apache Flink is a popular purpose-built framework and distributed processing engine for large scale low latency data processing in real-time. In this session, we will give you a brief overview of this popular framework, and a demo to build your first stream processing application with Apache Flink on Amazon Kinesis Data Analytics Studio.
Teleport Data: The future of data
6 October, 3:10 am
This session will discuss:
- the new data stack
- challenges & problems still faced
- teleport data
The Enterprise Panel
6 October, 3:40 am
Join us for a discussion of all those wacky enterprise things like data governance, lineage and meshing with things!
Welcome to Day 3
6 October, 11:00 pm
The DataEngBytes team return to welcome you to an action-packed day 3... strap yourselves in for some awesome talks!
From telescope to data centre: adventures in astronomical data pipelines
6 October, 11:10 pm
I will describe the data flow we have developed to move astronomical data from Siding Spring Observatory in central west NSW to data centres in Sydney and Canberra, and the subsequent processing stages to produce science-ready data. This is a story that involves the coupling of Apache Nifi, MongoDB and Docker with astronomical data reduction software to allow researchers to explore some of the most violent phenomena in the Universe: colliding black holes and exploding stars. I’ll also look at some of the related future data engineering challenges involved in the Square Kilometre Array and the European Southern Observatory.
Gone Streaming: dbt+Materialize
7 October, 12:00 am
dbt is great for batch, but it can only approximate transforming streaming data. Together with the dbt community, we’ve worked on an adapter that allows you to transform your streaming data in real-time using Materialize as your data warehouse. What does this mean in practice? The first time you run a dbt model on top of Materialize…well, you never have to run it again! No matter how much or how frequently your data arrives, your model will stay up to date. No matter when you query your view, it will always return a fresh answer. Excited? Skeptical? Cautiously optimistic? Join us to see it for yourself as we walk you through a demo!
A Single Data Platform for All of Your Workloads
7 October, 12:40 am
Organisations struggle to balance the competing requirements of the business, data scientists, data engineers, data analysts, risk and security experts, and the finance department. Maintaining multiple data platforms to try to keep everyone happy often leaves many people unhappy. Learn how customers are using Snowflake to simplify their data at rest, data in motion, and data science workloads on a single and secure data platform.
Putting your data warehouse to work - Reverse ETL & Operational Analytics
7 October, 1:10 am
Today, the center of gravity for data has shifted into data warehouses, mostly for BI & analytics purposes. But why stop there? Through a new approach called Reverse ETL, you can activate your modeled data by moving it from your warehouse to your SaaS tools to power key business workflows. Learn how top companies like Autotrader, CircleCI & CompareClub are using Reverse ETL to power their business.
7 October, 1:40 am
Time to fuel that body of yours after fuelling your mind with all this great DataEng knowledge!
Snowflake and dbt -- Our Journey to the Cloud
7 October, 2:10 am
nib's journey of using Snowflake and dbt
Situational Intelligence with Analytics in Motion
7 October, 2:40 am
The growth of streaming data is a massive industry trend in enterprise tech. This movement towards streaming data provides organizations new intelligence: connecting applications with streaming data can unlock insights about what's happening in your business in real-time. However, being able to consume data in real-time only gets you halfway. You've got to have analytics to help you drive intelligence, not only to tell you what is happening but, more importantly, why it is happening. Harnessing this potential for intelligence from streaming data requires a new approach to analytics, i.e., Analytics in Motion. In this talk, we will share how this new data architecture can help organizations unlock the potential of data by enabling them to have rapid interactive conversations and the ability to process massive data in real-time across all of their data.
Albero - a decision tree for “when to use what” across the full Azure data estate
7 October, 3:10 am
This year a lot of attention is focused on data and data services. Microsoft is doing its absolute best to help customers benefit from the wide portfolio of Azure Data Services. But what if this number of services is a bit overwhelming for many of us, not to mention our customers? We have more than 20 main data services, plus SKUs. Navigating such a wide range of technologies and answering the simple question “When to choose which data technology?” has suddenly become quite a complex task, even for experts. Andrei, Elizabeth and Eleni joined hands in creating a universal, easy-to-consume bird's-eye view of the entire Azure data services portfolio. This decision tree helps customers and partners define and shape their thought process around selecting particular data technologies and using them together when building data solutions.
Scaling Proximity Targeting via Delta Lakehouse based Data Platforms Ecosystem
7 October, 3:40 am
Proximity Targeting is a marketing technique that uses mobile location services to reach consumers in real-time when they are around a store location or point of interest. This is done by defining a radius around a specific location. If a consumer has opted into location services on their mobile phone and enters within this radius, proximity targeting helps in triggering an advertisement or message to consumers in an effort to influence their behaviour. This can be combined with the ability to purchase impressions through programmatic ad platforms that are powered by real-time bidding, which can help businesses formulate the right strategy for influencing their users in a particular geographical area. They can build user groups based on certain characteristics (such as neighbourhoods, demographics, interests, and other data), and subsequently launch another campaign that targets anyone with those characteristics. The growth of mobile devices has led to enormous data generation which offers tremendous potential when used effectively for business. Thus we need an efficient platform where we can process such huge data efficiently and with minimum latency and cost. This talk describes MIQ's journey into building a fast, scalable & cost-effective processing platform using Spark, MLlib, Kafka's event-driven microservices and a Delta Lakehouse architecture, delivering faster, actionable insights for proximity targeting, which has empowered the creation of a product generating ~30 million dollars in revenue year on year.
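The core geofencing check behind proximity targeting is simple: compute the great-circle distance between the device and the point of interest and compare it to the campaign radius. A self-contained sketch using the haversine formula (the coordinates below are illustrative):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def in_geofence(device, store, radius_km):
    """True if the device is within radius_km of the store location."""
    return haversine_km(*device, *store) <= radius_km

store = (-33.8688, 151.2093)   # Sydney CBD (illustrative)
nearby = (-33.8700, 151.2100)  # a few hundred metres away
far = (-37.8136, 144.9631)     # Melbourne

print(in_geofence(nearby, store, 1.0))  # True
print(in_geofence(far, store, 1.0))     # False
```

At platform scale the same predicate runs inside a streaming job against millions of opted-in location events per second, which is where the Spark/Kafka architecture described above comes in.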
Our on-demand speakers will be available directly after the conclusion of the conference
Intelligent Serverless and Scalable Real-Time Data Pipeline using Kinesis, Fargate and CFN
This session is about a real case study of an intelligent serverless real-time data pipeline, implemented for a big digital media client. The session will cover the business problem, the approach taken to solve it, the architecture, the implemented solution, and the value created from it. The solution is based on the AWS serverless approach, in a highly scalable manner, following Well-Architected principles. Presentation: https://www.slideshare.net/YogeshSharma208/intelligent-serverlessstreamingpipelineusingkinesisfargatecfn
Better, Faster, Stronger Streaming: Your First Dive into Flink SQL
For the most flexible, powerful stream processing engines, it seems like the barrier to entry has never been higher than it is now. If you’ve tried, or have been interested in leveraging the strengths of real-time data processing - maybe for machine learning, IoT, anomaly detection or data analysis - but you’ve been held back: I’ve been there, and it’s frustrating. And that’s why this talk is for you. That being said, this talk is also for you if you ARE experienced with stream processing but you want an easy (and if I say so myself, pretty fun) way to add some of the newest, bleeding edge features to your toolbelt. This session will be about getting started with Flink SQL. Apache Flink’s high level SQL language has the familiarity of the SQL you know and love (or at least, know…), but with some powerful new functionality, and of course, the benefit of being able to be used with Flink and PyFlink. More specifically, this will be a pragmatic entry into creating data pipelines with Flink SQL, as well as a sneak peek into some of its newest and most interesting features.
Metrics-driven Data Architectures
The gravitational pull of cloud data warehouses is a powerful force on data platform architecture. This is evidenced by the growing use of data warehouses to build and serve data products and the well-established shift from ETL to ELT transformation patterns. More recently the trend of pushing data transformations to inside the data warehouse is playing out on the consumption side, with business metric definition and calculation being shifted from fragmented locations across multiple BI tools and data science notebooks into the data warehouse, where they can be published for use across the organisation by a range of consumers. This trend has recently coalesced around the notion of a metrics layer within the modern data stack. In this talk I will unpack the challenges motivating this change in data architecture and identify the core features of the metrics layer that meets these needs. I will look at different flavours of data architecture that enable these features by surveying existing OSS tools and vendor offerings. In doing so, I will address the questions of how this notion of a metrics layer differs from existing approaches to OLAP databases, and whether data warehouses are the appropriate place to be building metric layers.
Modern Data Warehouse for Small and Medium Business
With many organizations in the SMB sector looking at the opportunity to benefit from using modern big data technologies and tools within their budget and skill set, this session gives an opinionated overview of an appropriate architecture for this use case.
Logging Apache Spark - How we made it easy
Looking at our metrics on Graphite is pretty nice, but what about our logs? How do you improve the visibility of your logs while running Spark on EMR? If you're tired of ssh-ing into your servers and searching log files, this architecture design is exactly for you.
Object Compaction in Cloud for High Yield
In file systems, large sequential writes are more beneficial than small random writes, and hence many storage systems implement a log-structured file system. In the same way, the cloud favors large objects over small objects. Cloud providers place throttling limits on PUTs and GETs, so it takes significantly longer to upload a bunch of small objects than a large object of the aggregate size. Moreover, there are per-PUT costs associated with uploading smaller objects. At Netflix, a lot of media assets and their relevant metadata are generated and pushed to the cloud. Most of these files are between 10s of bytes to 10s of kilobytes and are saved as small objects on Cloud. In this talk, we would like to propose a strategy to compact these small objects into larger blobs before uploading them to Cloud. We will discuss the policies to select relevant smaller objects, and how to manage the indexing of these objects within the blob. We will also discuss how different cloud storage operations such as reads and deletes would be implemented for such objects. This includes recycling blobs that have dead small objects - due to overwrites, etc. Finally, we will showcase the potential impact of such a strategy on Netflix assets in terms of cost and performance.
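The compaction strategy described above can be sketched in miniature: pack many small objects into one blob, keep an index of (offset, length) per key so each object stays individually addressable via ranged reads, and recycle a blob once too much of it is dead space. A toy in-memory version (all names are illustrative, not Netflix's design):

```python
class BlobStore:
    """Toy compactor: small objects packed into one blob, indexed by offset."""

    def __init__(self):
        self.blob = b""
        self.index = {}  # key -> (offset, length); deleted keys are removed

    def put_many(self, objects):
        # One large upload instead of many throttled, per-PUT-charged small ones.
        for key, data in objects.items():
            self.index[key] = (len(self.blob), len(data))
            self.blob += data

    def get(self, key):
        # Equivalent to a ranged GET against the large blob.
        offset, length = self.index[key]
        return self.blob[offset : offset + length]

    def delete(self, key):
        del self.index[key]  # the bytes become dead space until recycling

    def live_ratio(self):
        # When this drops below a threshold, rewrite the live objects
        # into a fresh blob and drop the old one (recycling).
        live = sum(length for _, length in self.index.values())
        return live / len(self.blob) if self.blob else 1.0

store = BlobStore()
store.put_many({"meta/a": b"hello", "meta/b": b"world!"})
print(store.get("meta/b"))  # b'world!'
store.delete("meta/a")
print(store.live_ratio())   # 6/11 of the blob is still live
```

A production version would persist the index alongside the blob and batch deletes, but the offset-indexed layout is the core of the approach.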
Kickstarting a greenfield data project
My takeaways building a data platform from the ground up
Snowpipe Integration via Yaml
Elevator pitch (short description): If a project uses AWS and Snowflake, Snowpipe is by far the best way to do continuous ingestion. While working with one of our clients (who happens to be a fan of AWS and Snowflake) we came up with a revolutionary way to set up Snowpipe integration for any project using just 10 to 15 lines of YAML. Yes, you heard me correctly: YAML to deploy Snowpipe, not SQL.
Abstract (long description): Leverage Snowpipe’s continuous ingestion capability. In this presentation I’ll share techniques to overcome:
- Running SQL every time you run a pipeline, and hence unnecessary SQL execution.
- SQL queries that are hard to maintain when a project doesn't have a SQL platform.
- (Specific to this project) every new change/branch created a number of artifacts, and when we merged the changes to master we would literally destroy everything and recreate the same stuff again, which caused downtime for customer-heavy projects.
- Everyone likes fewer bits of code, whether it's SQL or anything else.
What problems it solved:
- Beautiful is better than ugly: YAML really reads more nicely than SQL.
- The flags we used saved us a lot of time and execution: Create skips if the pipe already exists, Update runs only the necessary SQL, and Delete handles housekeeping.
- The happiest of all were the DevOps folks; they love YAML more than anything.
- No outage if no changes are made to the Snowpipe artifacts.
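The talk doesn't show its implementation, but the general pattern can be sketched: a small declarative config is rendered into the DDL that would otherwise be hand-written and hand-maintained. The YAML shape and field names below are hypothetical (a real tool would load the file with PyYAML; a dict stands in here for brevity):

```python
# What ~10 lines of YAML might declare, parsed here as a dict for brevity:
pipe_config = {
    "name": "orders_pipe",
    "database": "analytics",
    "schema": "raw",
    "table": "orders",
    "stage": "@orders_stage",
    "file_format": "JSON",
}

def render_create_pipe(cfg):
    """Render a Snowflake CREATE PIPE statement from the declarative config."""
    return (
        f"CREATE PIPE IF NOT EXISTS {cfg['database']}.{cfg['schema']}.{cfg['name']} "
        f"AUTO_INGEST = TRUE AS "
        f"COPY INTO {cfg['database']}.{cfg['schema']}.{cfg['table']} "
        f"FROM {cfg['stage']} FILE_FORMAT = (TYPE = '{cfg['file_format']}')"
    )

print(render_create_pipe(pipe_config))
```

The deployment tool would then diff this rendered DDL against what already exists and run Create, Update, or Delete accordingly, which is where the no-op-on-no-change behaviour described above comes from.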