Main Track
Registration
Please collect your lanyards from the front counter.
Introduction
Lars Klint will welcome us to the conference.
Data Modeling is Dead! Long Live Data Modeling!
Data modeling is on life support. Some say it’s dead. The traditional practices are increasingly ignored and forgotten. The result is often a loss of structure and a shared understanding of business rules and vocabulary. At the same time, data modeling is more critical than ever. With AI's rising popularity, many organizations rush to incorporate it into their infrastructure. Without consideration of the underlying data framework, the result will be unpleasant for many organizations. In this talk, I argue that data modeling is a key enabler for success with AI. We must return to basics and revamp data modeling to work with modern business workflows and technologies. Long live data modeling!
Why AI has to be built on robust data foundations and what that means for data engineering
Robust, useful and engaging AI-powered products only grow on a solid foundation of coherent, understood and loved data. And that takes engaged and articulate participation from all of the actors in the fabulous (melo)drama that is our beloved data and AI ecosystem. We’ll look at who needs to participate, what they know, where their blind spots are and how to build great things faster and with less friction.
Morning Tea
Enjoy some delicious refreshments to keep you going towards lunch.
Accelerating the data mesh journey at ANZ
Say goodbye to clunky data platforms and hello to autonomy over your own data. We will explore ANZ's data mesh journey that empowers engineers and analysts to accelerate data onboarding and unlock insights that power the ANZ Plus app. Built in the cloud with a focus on user experience, we'll touch on how principles like self-service and domain ownership are delivered. We will discover the challenges and lessons learnt along the way, including the often-neglected social aspect of data products!
Reading Data from APIs
In this talk, we will delve into the nuances of extracting data from APIs, illustrate the common challenges faced during the data extraction process, and show how to solve those challenges with a schema-on-read approach. We will also see how the theory translates into practice using the Ascend.io data pipeline automation platform.
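To make the schema-on-read idea concrete, here is a minimal, hedged Python sketch (plain requests and the standard library, not the Ascend.io platform): raw API responses are landed untouched, and a schema is applied only when the data is read. The endpoint, pagination scheme and field names are illustrative assumptions.

```python
import json
import requests

# Hypothetical paginated API endpoint (illustrative only).
BASE_URL = "https://api.example.com/orders"

def extract_raw(path="orders_raw.jsonl"):
    """Land every record as raw JSON, deferring schema decisions (schema-on-read)."""
    page = 1
    with open(path, "w") as out:
        while True:
            resp = requests.get(BASE_URL, params={"page": page}, timeout=30)
            resp.raise_for_status()
            records = resp.json().get("data", [])
            if not records:
                break
            for record in records:
                out.write(json.dumps(record) + "\n")
            page += 1
    return path

def read_with_schema(path="orders_raw.jsonl"):
    """Apply a schema only at read time: pick fields, coerce types, tolerate extras."""
    rows = []
    with open(path) as f:
        for line in f:
            raw = json.loads(line)
            rows.append({
                "order_id": str(raw.get("id", "")),
                "amount": float(raw.get("amount") or 0.0),
                "created_at": raw.get("created_at"),  # left as a string; parse downstream
            })
    return rows
```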
Data Observability and Governance: Building the Pillars of Data Excellence
This session will guide you through the essential elements of data observability and governance. Join me as we explore strategies, best practices, and tools to ensure impeccable data quality, integrity, and compliance. Together, we'll lay the groundwork for data-driven decision-making and pave the way for organizational success.
Lunch
Relax and network with your peers
Build your own electric vehicle charging map with PostGIS
An educational live demo of a map server showing electric vehicle charging stations so you'll never be stuck without a charge again. Your map will show the available charging stations, which ones you can reach with your current range and the most efficient route to each one. Come along and see what you can build with PostGIS and pgRouting in the context of a topical real world use case. We will also take a look at some ETL and geocoding examples using PL/Python running natively in Postgres. Finally, we will look at why PostGIS is so powerful for both performance and integration - including functions to work with GeoJSON, KML and MVT, built-in 3D and topology support, and advanced spatial indexing enhancements. This is a beginner to intermediate session for PostGIS and assumes a rudimentary knowledge of SQL and spatial data.
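As a taste of what the session describes, here is a hedged Python sketch using psycopg2 against a PostGIS/pgRouting database. The charging_stations and ways tables, their columns, and the connection string are illustrative assumptions, not the presenter's actual schema.

```python
import psycopg2

# Assumed schema: charging_stations(id, name, geom) plus a road network table `ways`
# prepared for pgRouting (source/target/cost columns). Names are illustrative only.
conn = psycopg2.connect("dbname=ev_map")

def stations_in_range(lon, lat, range_metres):
    """Charging stations within the vehicle's current range (straight-line distance)."""
    sql = """
        SELECT id,
               name,
               ST_AsGeoJSON(geom) AS location_geojson,
               ST_Distance(geom::geography,
                           ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography) AS metres
        FROM charging_stations
        WHERE ST_DWithin(geom::geography,
                         ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography,
                         %s)
        ORDER BY metres;
    """
    with conn.cursor() as cur:
        cur.execute(sql, (lon, lat, lon, lat, range_metres))
        return cur.fetchall()

def route_to_station(start_vertex, station_vertex):
    """Cheapest path over the road network using pgRouting's Dijkstra."""
    sql = """
        SELECT seq, edge, cost
        FROM pgr_dijkstra(
            'SELECT gid AS id, source, target, cost FROM ways',
            %s, %s, directed := false);
    """
    with conn.cursor() as cur:
        cur.execute(sql, (start_vertex, station_vertex))
        return cur.fetchall()
```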
Duck, duck, Go! Hybrid SQL Execution with DuckDB and MotherDuck
It's time to re-think the mainframe model of centralized cloud warehouses which require complicated horizontal scaling on shared infrastructure to achieve performance. DuckDB is a lightweight SQL OLAP engine that can run in a browser on your laptop via WASM, in a local python app, and in the cloud. This talk will describe the origin and properties of DuckDB which caused it to take the internet by storm and then dive into the ways these can be used to create a powerful hybrid local+cloud data development environment. This hybrid development model often has your data warehouse sitting in the cloud, but enables it to be joined against local sources in parquet, CSV or pandas DataFrames. Results can be cached and further filtered inside the web browser, using the same SQL dialect and engine. After exploratory analysis, you can create a new cloud branch of your data shared with your colleagues, allowing them to reproduce and expand upon your work.
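A minimal sketch of the hybrid local+cloud idea using the DuckDB Python API is shown below; the file, table and column names are made up, and the MotherDuck connection string is only indicated in a comment (it requires an account and token).

```python
import duckdb
import pandas as pd

# Local, in-process DuckDB session. Swapping the connection string for "md:<database>"
# would attach a MotherDuck cloud database alongside local data (account/token required).
con = duckdb.connect()  # e.g. duckdb.connect("md:my_warehouse")

# A local pandas DataFrame and a local Parquet file, queried with the same SQL dialect.
recent_orders = pd.DataFrame({"customer_id": [1, 2, 3], "amount": [120.0, 35.5, 99.0]})
con.register("recent_orders", recent_orders)

con.execute("""
    CREATE OR REPLACE VIEW customers AS
    SELECT * FROM read_parquet('customers.parquet')
""")

# Join the Parquet-backed view against the in-memory DataFrame directly.
result = con.execute("""
    SELECT c.customer_id, o.amount
    FROM customers AS c
    JOIN recent_orders AS o USING (customer_id)
    ORDER BY o.amount DESC
""").df()

print(result)
```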
Anomaly Detection using Apache Flink
Apache Flink is a framework for stateful computations over unbounded and bounded streams. Come see how you can use this new technology at Aiven to build an application that can detect when a data stream starts producing values that vary from a known mean and how to fire alerts to other systems. Fraud detection, predictive maintenance and high value transaction alerts are all use cases of this type of technology.
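For a flavour of what such a job can look like, here is a hedged PyFlink SQL sketch (not the Aiven demo itself) that flags readings deviating from a known mean and writes alerts back to Kafka. Topic names, fields and the fixed mean/standard deviation are illustrative assumptions.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Topic names, fields and the fixed mean/standard deviation below are assumptions.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: a Kafka topic of sensor readings (requires the Flink Kafka connector jar).
t_env.execute_sql("""
    CREATE TABLE sensor_readings (
        device_id STRING,
        reading   DOUBLE,
        ts        TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'sensor-readings',
        'properties.bootstrap.servers' = 'kafka:9092',
        'format' = 'json',
        'scan.startup.mode' = 'latest-offset'
    )
""")

# Sink: a Kafka topic that downstream systems consume to fire alerts.
t_env.execute_sql("""
    CREATE TABLE alerts (
        device_id STRING,
        reading   DOUBLE,
        ts        TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'anomaly-alerts',
        'properties.bootstrap.servers' = 'kafka:9092',
        'format' = 'json'
    )
""")

# Known mean 42.0 and standard deviation 5.0: anything outside +/- 3 sigma is an anomaly.
t_env.execute_sql("""
    INSERT INTO alerts
    SELECT device_id, reading, ts
    FROM sensor_readings
    WHERE ABS(reading - 42.0) > 3 * 5.0
""")
```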
Afternoon Tea
Enjoy some delicious refreshments to keep you going into the afternoon.
Melbourne Data Engineering Panel
Hear from the greatest minds locally and internationally in the field of data engineering!
Why your Data Mesh project will fail
Since Zhamak Dehghani first wrote about Data Mesh, the industry has been buzzing with curiosity and interest. But implementing data mesh remains a mystery to many. Should you do it, and if so, how? This talk examines potential reasons why your data mesh journey may not be a successful one. The talk will be of interest to anyone interested in data mesh, perhaps confused about what it is and how it works, and/or thinking of implementing it in their organisation. The talk will include:
- What data mesh is (and isn’t!)
- Whether you have the problems that data mesh solves
- Key failure modes
What do cows, rockets and developers all have in common?
What do cows, rockets, and developers all have in common? Lessons from building data pipelines at 3 very different companies. Emily has worked as a data engineer across multiple startups and shares the lessons she's learned along the way. These companies include Rocket Lab (NZ's SpaceX), Halter (FitBit for cows), and Multitudes (engineering metrics that aren’t creepy for software development teams). She's found that, regardless of the industry, companies working with big data all have similar problems to solve. She'll share the 5 key lessons she's learned from her time building data pipelines from scratch, discussing the varying needs, tech and data sources used at each company.
Closing Remarks
Lars Klint will close out the conference.
After Party
Join us for some late refreshments and further networking to close out the conference.
ML Track
Featurization & Feature Stores: A Crash Course In The ML Lifecycle & MLOps
DataOps, MLOps, Data Engineering... what's the big difference? Squint at the job descriptions and they'd seem to be the same person, especially around featurization. Can't DataOps tools be used for MLOps? Why is 80% of a data scientist's time stuck with data? Isn't a feature store just an expensive, overly specialized database where machine learning features get parked (only to be forgotten until a pipeline breaks)? Much like how humans share 70% of their DNA with slugs (and 50% with bananas)* the differences, while minute, are significant. My goal in this session is to help illuminate the challenges and vagaries of developing ML models from scratch (for production) and, in the process, answer the following questions:
- What are the main problems MLOps tries to solve?
- What does the process look like for developing a model from scratch? And why is feature engineering tricky to automate?
- What is a feature store? What are the pain points a feature store is meant to solve?
- What are the different types of feature stores or platforms that exist, and which archetypes are seeing the most adoption? And why?
Data Engineering in the Age of AI: Opportunities and Challenges for Career Growth
It's the age of AI, and a day now moves at the pace of a week. The word "constant" seems to apply less and less - be it to data pipelines, tech, or careers! So, what might this mean for data engineers?
Bring Context and Explainability to Generative AI by Knowledge Graph
Since its release 6 months ago, ChatGPT has garnered immense popularity due to its ability to engage users in meaningful and human-like conversations, revolutionizing the way we interact with AI. The technology behind it, i.e. the so-called Large Language Model / Generative AI for language, has huge potential in enterprises and governments too, if the concerns of being up-to-date, reliable and controllable can be addressed. In this talk, a solution framework that combines an LLM with a knowledge graph stored in the Neo4j graph database will be presented, together with a quick demo.
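As a rough illustration of the pattern (not the presenter's framework), the Python sketch below retrieves facts from Neo4j with Cypher and embeds them in a prompt so the LLM answers from controlled, up-to-date context. The graph schema and credentials are made-up examples.

```python
from neo4j import GraphDatabase

# The graph schema (Product/Supplier nodes, SUPPLIED_BY relationship) and the
# credentials below are made-up examples.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def fetch_context(product_name):
    """Pull up-to-date, verifiable facts from the knowledge graph."""
    query = """
        MATCH (p:Product {name: $name})-[:SUPPLIED_BY]->(s:Supplier)
        RETURN p.name AS product, p.price AS price, s.name AS supplier
    """
    with driver.session() as session:
        return [record.data() for record in session.run(query, name=product_name)]

def build_prompt(question, facts):
    """Embed graph facts in the prompt so the LLM answers from controlled context."""
    context = "\n".join(str(fact) for fact in facts)
    return (
        "Answer using ONLY the facts below. If the facts are insufficient, say so.\n"
        f"Facts:\n{context}\n\nQuestion: {question}"
    )

facts = fetch_context("Widget X")
prompt = build_prompt("Who supplies Widget X and at what price?", facts)
# `prompt` would then be sent to the LLM of your choice (call not shown here).
```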
Lunch
Relax and network with your peers
Responsible AI: Building trustworthy solutions
In the era of new AI advancements every week (it feels like!), ensuring that the implementation of these into your solutions is responsible and ethical is more important than ever. In this session we explore the concept of Responsible AI and discuss the key areas of ethical implications - fairness, transparency, accountability and privacy. From data collection and model training to deployment and ongoing monitoring, we cover the considerations needed to foster a responsible AI culture and embed ethical principles throughout the AI lifecycle. Gain an understanding of the importance of diverse and inclusive datasets, explainable AI techniques, and ongoing model evaluation - all things that can not only mitigate risks but also increase customer trust and long-term sustainability.
Spinning Your Drones With Cadence Workflows, Apache Kafka, and TensorFlow
In the 1st part of this talk, we'll build a real-time Drone Delivery demonstration application using a combination of two open-source technologies: Uber’s Cadence (for stateful, scheduled, long-running workflows), and Apache Kafka (for fast streaming). With up to 2,000 (simulated) drones and deliveries in progress at once this application generates a vast flow of spatio-temporal data. In the 2nd part of the talk, we'll use this platform to explore Machine Learning (ML) over streaming and drifting Kafka data with TensorFlow to try and predict which shops will be busy in advance.
Melbourne MLOps Panel
Hear from the brightest minds locally and internationally in the field of MLOps!
Afternoon Tea
Enjoy some delicious refreshments to keep you going into the afternoon.
DE Track
Designing better data models: simple solutions for maximum impact
The evolution of artificial intelligence has highlighted that advanced statistics have progressed into intelligence that can gather insights from large data lakes and make decisions to achieve a specific goal. In the near future, AI is also anticipated to grow more complex and hit an intelligence explosion that could surpass human intelligence and decision-making. This has sparked a greater need among data scientists to build complex decision-making models to drive business; as a result, there is an increased focus on piping/cascading machine learning models to meet complex business needs, along with the demand for large-scale computation. Working for the biggest energy distributor in Victoria to ensure customer safety and optimize asset management has pushed me to build advanced use cases to establish an efficient and secure electricity network. Over the last 5 years, I have shaped my data science skills to build highly efficient, conservative, automated data science models that are well regarded within the business. My role has enabled me to operate efficiently, with low risk and high impact, when influencing business decisions. The talk emphasizes best practices for a data scientist to plan data flow and compute resources effectively. Using a basic ML architecture with a strategic data flow structure and model monitoring, I will give a brief overview of the approach to solving the issues listed below:
• How to foster test-and-learn techniques in statistical model development that empower new strategies or paths without impacting the ongoing business workflow.
• How to scale and score the model over a large data set, with a resilient architectural solution to optimize the use of DB/compute resources.
• How to enable the reusability of processed data using a feature store implementation in the architecture.
I will talk about the best practices I have incorporated, and how I have tailored my implementation when architecting data science models, to demonstrate continuous improvement using ML architecture. I plan to emphasize my learning on "data for machine learning" and talk about data artefacts, model monitoring, and feature stores to enable reusability and efficient use of compute resources.
Logging and Tracing Data Pipelines with Snowflake
This session will demonstrate the capability of logging and tracing using Event Tables in Snowflake. This logging and tracing feature can be used in a couple of contexts - alleviation of single-row inserts, or alleviation of table concurrency locking. This session will talk through each of these, and a demonstration of the capability in action will be provided.
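A hedged sketch of the moving parts, using the Snowflake Python connector, is below; database, table and credential values are placeholders, and the exact setup steps should be checked against the Snowflake documentation.

```python
import snowflake.connector

# Placeholders throughout -- account, credentials, database and table names are assumptions.
conn = snowflake.connector.connect(
    account="my_account", user="me", password="...",
    warehouse="DEMO_WH", database="DEMO_DB", schema="PUBLIC",
)
cur = conn.cursor()

# One-off setup: create an event table and make it the account's active event table,
# so logs/traces emitted by procedures and UDFs land there instead of needing
# single-row INSERTs into your own logging tables.
cur.execute("CREATE EVENT TABLE IF NOT EXISTS demo_db.public.pipeline_events")
cur.execute("ALTER ACCOUNT SET EVENT_TABLE = demo_db.public.pipeline_events")

# Inside a stored procedure or UDF you would then log via SYSTEM$LOG('INFO', '...')
# or the handler language's standard logger (Python logging, for example).

# Later, read the captured log records back out of the event table.
cur.execute("""
    SELECT timestamp,
           record:severity_text::string AS level,
           value::string                AS message
    FROM demo_db.public.pipeline_events
    WHERE record_type = 'LOG'
    ORDER BY timestamp DESC
""")
for row in cur.fetchall():
    print(row)
```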
Harnessing Metadata for Enhanced Observability and Discoverability
We will provide an update on the key learnings and advancements made in our implementation of Amundsen, as well as updates to, and the formalisation of, our metadata model. We will focus on the utilisation of metadata for improved observability and discoverability, addressing the conference themes (data we trust and data we understand). High-level outline:
- Share compelling statistics and highlight new features to illustrate the remarkable journey we have undertaken
- Delve into our metadata model by presenting a high-level conceptual model of our core entities, such as Users, Datasets, Queries, and Reports, highlighting the importance of modelling to build a robust and scalable solution to operationalise metadata
- Discuss how different user roles, including Data Managers, Data Analysts, Data Scientists, Data Engineers, and Owners, leverage metadata
Lunch
Relax and network with your peers
Arbitrary code execution, I choose you!
Did you hear about the arbitrary code execution hardware vulnerability in the Nintendo Switch discovered a few years back? In this talk we’re going to delve into this vulnerability in more detail and look at some other notorious home console security issues over the years from Nintendo, Sega, et al., the fallout from them and how they were fixed. This is not a talk about breaking things but about how companies got their act together, and how home consoles improved (or didn't) at tackling security issues over the years. This isn't a talk about homebrew stuff either. Key takeaways:
- Understand that security mistakes are a combination of human and technical failures
- Discover that we have been making the same security mistakes since time immemorial, and become more conscious of this in everyday work
- Learn about some fun security stories that show that security issues happen to everyone, no matter who you are.
Leveraging Apache Iceberg for Effective Data Governance in a Data Lakehouse
Preamble: Companies have been, and are continuously, collecting and storing a range of consumer data for further analysis in data lakehouses. In many cases, the collected data contains personally identifiable information. Recent data breaches have highlighted the need for better data governance guidelines and regulation of how companies handle consumer information. The Right To Be Forgotten presents some challenges for companies in how to remove personal information they can no longer legally hold, without compromising data quality, security and atomicity. Apache Iceberg is a data table format that enables data lakehouses to easily comply with this legislation. Talk: I will discuss Apache Iceberg and the benefits it provides in managing large amounts of data, and how Apache Iceberg can be part of your modern data architecture by bringing a range of capabilities to your data lake, including regulatory compliance, ACID transactions, schema evolution, time travel, and incremental processing. By the end of the talk, attendees will have gained a better understanding of how Apache Iceberg can be successfully used in data lakehouses.
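To illustrate the "right to be forgotten" workflow the talk alludes to, here is a hedged PySpark sketch against an Iceberg table; the catalog, table, column and timestamp values are assumptions, and snapshot expiry is what ultimately removes the underlying files that still hold the deleted rows.

```python
from pyspark.sql import SparkSession

# Catalog, table, column and timestamp values below are illustrative assumptions.
spark = (
    SparkSession.builder
    .appName("iceberg-right-to-be-forgotten")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

# ACID delete of one customer's personally identifiable information.
spark.sql("DELETE FROM lake.crm.customers WHERE customer_id = '12345'")

# Deleted rows can still exist in older snapshots (that is what enables time travel),
# so expire those snapshots to physically drop the data files and complete the erasure.
spark.sql("""
    CALL lake.system.expire_snapshots(
        table => 'crm.customers',
        older_than => TIMESTAMP '2023-07-01 00:00:00'
    )
""")
```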
When Pipelines Become Sewers - 7 Wastes of Data Production
Even with the most modern tooling, it’s likely that you’re generating waste in the production of data in your organisation. Waste manifests as business misalignment, slow response to opportunities, poor quality of outputs, and employee disengagement. Waste can be any activity that doesn’t deliver value, but where the cause may be hidden. We’ll review the manufacturing roots of the 7 forms of waste known as Muda in Lean, and how they have been reinterpreted for knowledge work like software engineering. Identifying and managing these wastes is core to modern software delivery and all spheres of business operations. We’ll then consider the data organisation as a factory that produces data, a factory that is constantly reconfigured by engineering as business needs change. This will allow us to identify and characterise the impact of the wastes of data production that emerge in building and running a data organisation and data platform. Initiatives like the DataOps Manifesto and Cookbook also embed this Lean philosophy. With wastes understood, we’ll identify potential interventions to improve alignment, responsiveness, quality and engagement in data engineering. We will also introduce the Improvement Kata approach that provides a framework that any team can use for continuous improvement. You’ll leave with a good understanding of how to reduce waste in data production, in order to restore pristine pipelines.
Afternoon Tea
Enjoy some delicious refreshments to keep you going into the afternoon.
Abdi Farah
Block - Data Engineer
Adric Streatfeild
Data Engineering Manager, ANZ
Akanksha Malik
Data Consultant and Microsoft AI MVP
Akira Takihara Wang
(Meta)Data Engineer, Afterpay
Alexei de Lauw
Consulting Manager
David Colls
Director, Data & AI at Thoughtworks
Emily Melhuish
Lead Engineer at Multitudes
Fabio Ramos
Data Engineer @ Cevo Australia
Izzudin Hafiz
Director of Engineering - Decube
Josh Devlin
Senior Analytics Engineer
Joshua Yu
Director of Presales & Services, APAC
Lizzie Macneill
Solutions Engineer at EDB
Michael Hyatt
Solutions Engineer, Ascend.io
Paul Brebner
Instaclustr Technology Evangelist
Rajath Akshay Vanikul
Data Scientist at United Energy
Sarah Young
Senior Cloud Security Advocate, Microsoft
Troy Sellers
Staff Solution Architect, Aiven
Vikas Rajput
Tinkerer at heart. Principal (Data/AI) @ Microsoft.