Quarto is an innovative, open-source scientific and technical publishing system compatible with Jupyter Notebooks and other popular formats. Quarto provides data scientists with a seamless way to publish their work in a high-quality format that is easily accessible and shareable. In this workshop, we will demonstrate how Quarto enables data scientists to turn their Jupyter notebooks into a variety of high-quality documents and combine them into well-organized, striking websites.
Large Language Models (LLMs) offer a lot of value for modern NLP and can typically achieve surprisingly good accuracy on predictive NLP tasks with a reasonably structured prompt and pretty much no labelled examples. But can we do even better than that? It’s much more effective to use LLMs to create classifiers, instead of using them as classifiers. By using LLMs to assist with annotation, we can quickly create labelled data and systems that are much faster and much more accurate than using LLM prompts alone. In this workshop, we'll show you how to use LLMs at development time to create high-quality datasets and train specific, smaller, private and more accurate fine-tuned models for your business problems.
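The core idea can be sketched in a few lines (the task, data, and the `llm_label` helper below are invented for illustration; in practice the helper would call your LLM of choice with a structured prompt):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def llm_label(text: str) -> str:
    # Hypothetical stand-in for an LLM call such as:
    # "Classify this support ticket as 'billing' or 'technical': {text}"
    return "billing" if "invoice" in text.lower() else "technical"

raw_texts = [
    "I was charged twice on my last invoice",
    "The app crashes when I upload a file",
    "Please resend my invoice for March",
    "Login fails with a 500 error",
]

# 1) Use the LLM once, at development time, to create labelled data.
labels = [llm_label(t) for t in raw_texts]

# 2) Train a small, fast, private model on that data for production use.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(raw_texts, labels)
print(clf.predict(["Why was my card billed again?"]))
```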
We’re generating data faster than ever before, with approximately 328.77 million terabytes created each day. For those managing these databases, a secure and sustainable process for building and releasing elements around the core database is more important than ever. In this tutorial, we’ll cover not only the importance and best practices of thinking about this from an architectural perspective, but also how to get our hands dirty building feature flags and a process for enabling continuous iteration and deployment with stealth, security, and smooth moves in mind.
In this tutorial, we will review some of the parts of the Python programming language we don't use every day… but should! Using the motivating example of a portfolio construction backtester, we will build the rough outline of a library that allows users to design strategies to be automatically executed and evaluated for constructing a portfolio.
We will design this tool such that:
- it has moderate performance for low-frequency strategies (e.g., no intraday trading, no real-time, more mutual fund than hedge fund)
- it supports high generality in strategy construction
- it supports “foreign” data (i.e., it minimizes assumptions about pandas.Series or pandas.DataFrame representing market information or trades)
- it is reasonably easy for a pandas user to construct a strategy without knowing too many esoteric details of Python or pandas (except, of course, the rules of index alignment)
Along the way, we will look to answer the following questions:
- what are generators, generator coroutines, and decorators, and where do they actually show up in analytical code?
- why are generators and generator coroutines so well-suited to the design of simulators, backtesters, model training, &c.? (see the sketch after this list)
- how do we write libraries that accept and return pandas.Series that do not lose generality?
- why is pandas often considered a “tail-end” analytical tool, and how might we solve the problem of writing libraries that may grow a pandas.Series?
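As a taste of the coroutine idea (this is a toy sketch, not the tutorial's actual library), a strategy can be written as a generator that keeps its own state between periods while the backtest driver simply sends it each new slice of market data:

```python
import pandas as pd

def momentum_strategy(lookback: int = 3):
    """Coroutine: receives the latest prices, yields target portfolio weights."""
    weights, history = None, []
    while True:
        prices = yield weights                     # driver sends a pd.Series of prices
        history.append(prices)
        window = pd.concat(history[-lookback:], axis=1)
        returns = window.iloc[:, -1] / window.iloc[:, 0] - 1
        weights = pd.Series(0.0, index=prices.index)
        weights[returns.nlargest(1).index] = 1.0   # naive: hold the recent winner

def run_backtest(prices: pd.DataFrame, strategy) -> None:
    strategy.send(None)                            # prime the coroutine
    for date, row in prices.iterrows():
        print(date.date(), dict(strategy.send(row)))

prices = pd.DataFrame(
    {"AAA": [10, 11, 12, 11], "BBB": [20, 19, 21, 24]},
    index=pd.to_datetime(["2023-01-31", "2023-02-28", "2023-03-31", "2023-04-28"]),
)
run_backtest(prices, momentum_strategy())
```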
As the technological landscape evolves from being data-centric to compute-intensive, the challenges in resource allocation, scalability, cost, and workflow complexity have become more pronounced. Traditional cloud tools often fall short in efficiently managing resources like GPUs, and the transition from local to cloud-based environments often involves cumbersome code changes and configurations. Additionally, the high costs and complex workflows associated with compute-intensive tasks are exacerbated by the scarcity and high demand for specialized computing resources. Covalent emerges as a Pythonic framework that addresses these multifaceted challenges. It simplifies the development of compute-intensive products, making them feel like a direct extension of one's local laptop rather than a complex cloud architectural exercise. Moreover, Covalent aids in cost reduction by efficiently managing and allocating resources, thereby optimizing the overall operational expenses. This tutorial will explore how Covalent is uniquely positioned to meet the computational and operational demands of a broad range of high-compute developments, including but not limited to Large Language Models and Generative AI, offering a more efficient and streamlined approach to cloud-based tasks.
Learn how to parallelize or distribute your Python code. We will start with parallelizing it on your local laptop before we move on to a distributed cluster.
This tutorial shows how to use Dask, a popular open source framework for parallel computing, to scale Python code. We start with parallelizing simple for loops, and move on to scaling out pandas code.
Along the way we will learn about concepts like partitioning data and why a good chunksize matters, parallel performance tracking and general metrics that are built into Dask, and managing exceptions and debugging on remote machines.
This will be a hands-on tutorial with Jupyter notebooks.
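For a flavour of the two patterns covered (the CSV paths below are placeholders), the same `dask` package handles both the loop and the DataFrame case:

```python
import dask
import dask.dataframe as dd

@dask.delayed
def process(x):
    return x ** 2

# 1) Parallelize a plain for loop: build a lazy task graph, then compute in parallel.
results = dask.compute(*[process(i) for i in range(10)])

# 2) Scale out pandas: a Dask DataFrame is many pandas DataFrames (partitions),
#    and the blocksize/chunking you choose drives parallel performance.
df = dd.read_csv("data-*.csv", blocksize="64MB")
daily = df.groupby("day")["amount"].sum().compute()
```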
Navigate the intersection of neuroscience and artificial intelligence as we explore the application of PyTorch-based deep learning models to time series neural data. This illuminating talk will equip researchers, data scientists, and machine learning engineers with the tools and techniques needed to tackle the unique challenges presented by neural data. From preprocessing to model architecture selection, gain actionable insights and hands-on experience that will elevate your data analysis capabilities. Don't miss this opportunity to advance your knowledge in one of the most promising fields of interdisciplinary research.
Ibis provides a common dataframe-like interface to many popular databases and analytics tools (BigQuery, Snowflake, Spark, DuckDB, …). This lets users analyze data using the same consistent API, regardless of which backend they’re using, and without ever having to learn SQL. No more pain rewriting pandas code to something else when you run into performance issues; write your code once using Ibis and run it on any supported backend. In this tutorial users will get experience writing queries using Ibis on a number of local and remote database engines.
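As a hedged sketch of what that looks like (the file, table, and column names are invented), the same expression runs unchanged on DuckDB locally or on a remote warehouse backend:

```python
import ibis

con = ibis.duckdb.connect()          # local, in-memory DuckDB backend
t = con.read_csv("events.csv")       # on a warehouse: con.table("events")

expr = (
    t.filter(t.status == "complete")
     .group_by("country")
     .aggregate(revenue=t.amount.sum())
     .order_by(ibis.desc("revenue"))
)
print(expr.execute())                # executed by the active backend, returned as pandas
```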
There is a wide array of tools available to simplify the process for data scientists to package their models and deploy them in production, ranging from serverless functions to Docker containers. However, deploying models in production remains a challenge, particularly when it comes to data access.
Real-time ML systems typically require low-latency access to precomputed features containing history or context data. The code used to create those features should be consistent with the code used to create features during model training. Similarly, batch ML systems should use the same logic to compute features for training and batch inference.
The FTI (Feature, Training, Inference) pipeline architecture is a unified pattern for building batch and real-time ML systems. It enables the independent development and operation of feature pipelines (that transform raw data into features/labels), training pipelines (that take features/labels as input and produce models as output), and inference pipelines (that take model(s) and features as input and produce predictions as output). The pipelines have clear inputs and outputs, and can even be implemented using different technologies (e.g., Spark for feature pipelines, and Python for training and inference pipelines).
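The shape of the pattern, independent of any particular framework, looks roughly like this (column names and the model choice are placeholders for illustration):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

def feature_pipeline(raw: pd.DataFrame) -> pd.DataFrame:
    """Raw data in, features/labels out (written to a feature store in practice)."""
    return raw.assign(amount_7d_avg=raw["amount"].rolling(7, min_periods=1).mean())

def training_pipeline(features: pd.DataFrame, label_col: str) -> LogisticRegression:
    """Features/labels in, trained model out (registered in a model registry in practice)."""
    X, y = features.drop(columns=[label_col]), features[label_col]
    return LogisticRegression().fit(X, y)

def inference_pipeline(model, features: pd.DataFrame) -> pd.Series:
    """Model + features in, predictions out (served online or written in batch)."""
    return pd.Series(model.predict(features), index=features.index)
```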
In this workshop, we'll explore how LangChain, Chroma, Gradio, and Label Studio can be employed as tools for continuous improvement, specifically in building a Question-Answering (QA) system trained to answer questions about an open-source project using domain-specific knowledge from GitHub documentation.
While most scientists aren't at the scale of black hole imaging research teams that analyze Petabytes of data every day, you can easily fall into a situation where your laptop doesn't have quite enough power to do the analytics you need.
In this hands-on tutorial, you will learn the fundamentals of analyzing massive datasets with real-world examples on actual powerful machines on a cloud provided by the presenter – starting from how the data is stored and read, to how it is processed and visualized.
In this tutorial we will explore the Palmer Penguins dataset, starting with exploratory charts, then creating a linked animation, and concluding with converting the animation into a simple interactive visualization. In doing so, this tutorial will unpack some of the fundamental concepts that underlie the architecture of Matplotlib, hopefully providing attendees with the foundation for creating effective visualizations using Matplotlib.
In this practical talk about how to extend JupyterLab, we focus on understanding the underlying extension support infrastructure. As we walk through a step-by-step example of creating an app in JupyterLab, we'll learn, among other things, how to launch that app from different places within JupyterLab, how to style our app, and how to pass parameters to our app to modify its behavior. This talk is for anyone who finds themselves doing complex or repetitive tasks and thinks that they, and others, may benefit from integrating those tasks into JupyterLab.
People are hard to understand, developers doubly so! In this tutorial, we will explore how communities form in organizations to develop a better solution than "The Org Chart". We will walk through using a few key Python libraries in the space, develop a toolkit for Clustering Attributed Graphs (more on that later) and build out an extensible interactive dashboard application that promises to take your legacy HR reporting structure to the next level.
Data scientists strive to bridge the gap between raw data and actionable insights. Yet, the actual value of data lies in its accessibility to non-data experts who can unlock its potential independently. Join us in this hands-on tutorial hosted by experts from Vizzu and Streamlit to discover how to transform data analysis into a dynamic, interactive experience.
Streamlit, celebrated for its user-friendly data app development platform, has recently integrated with Vizzu's ipyvizzu - an innovative open-source data visualization tool that emphasizes animation and storytelling. This collaboration empowers you to craft and share interactive, animated reports and dashboards that transcend traditional static presentations.
To maximize our learning time, please come prepared by following the setup steps listed at the end of the tutorial description, allowing us to focus solely on skill-building and progress.
In this tutorial, you'll discover how to leverage the MERCURY framework to effortlessly transform your computational notebooks, such as those created in Jupyter Notebook, into interactive web apps, insightful reports, and dynamic dashboards. With no need for frontend expertise, you can effectively communicate your work to non-technical team members. Dive into the world of Python and enhance your notebooks to make your insights accessible and engaging.
Opening remarks by Martha Norrick, the Chief Analytics Officer of New York City.
Keynote by Soumith Chintala
Have questions after a keynote? Excited to learn more from one of our keynote speakers? Fireside chats are a great opportunity to meet our speakers in a more intimate setting.
This talk will focus on applying Test Driven Development principles and practices to data analysis and basic ML Models.
We will talk about the benefits of unit testing your Python scripts and ML engines. We will live-code examples of how to build unit tests during data analysis and how we can use fixtures to do basic validation on our ML engines, and once we create a script with the process end-to-end we will look at options for integration testing. Finally, we will set up a GitHub pipeline that will run our tests in different phases in order to give us visibility into the accuracy of our scripts/engines.
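To give a feel for the style of tests we'll build (the column names, data, and thresholds here are invented), a unit test for a cleaning step and a fixture-backed sanity check on a model can be run with plain `pytest`:

```python
import pandas as pd
import pytest
from sklearn.linear_model import LogisticRegression

def drop_missing_ages(df: pd.DataFrame) -> pd.DataFrame:
    return df.dropna(subset=["age"])

def test_drop_missing_ages():
    df = pd.DataFrame({"age": [25, None, 40], "churned": [0, 1, 0]})
    assert len(drop_missing_ages(df)) == 2

@pytest.fixture
def trained_model():
    df = pd.DataFrame({"age": [25, 30, 40, 50], "churned": [0, 0, 1, 1]})
    return LogisticRegression().fit(df[["age"]], df["churned"])

def test_model_outputs_valid_probabilities(trained_model):
    proba = trained_model.predict_proba(pd.DataFrame({"age": [35]}))
    assert proba.shape == (1, 2)
    assert abs(proba.sum() - 1) < 1e-6
```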
The great majority (more than 2/3) of power outages are caused by contact between vegetation and an active power line, with added risks to fire and public safety. The goal for utility companies is to manage the vegetation near their above-ground infrastructure in order to reduce these types of contact during inclement weather. This session will discuss an AI-driven vegetation management solution that uses LiDAR and satellite imagery to determine the highest-risk areas of a utility's service area. Our approach is able to give insights across a service territory, by line, down to the foot level, on how close a branch may be to a power line. This work will enable utilities to shift from cycle-based to condition-based trimming. We will go over the data used, the various technical challenges, and our approach to scaling the solution.
In the world of machine learning, more data and diverse data sets usually lead to better training, particularly with human-centered products such as self-driving cars, IoT devices, and medical applications. However, privacy and ethical concerns can make it difficult to effectively leverage many different datasets, particularly in medical and legal services. How can a data scientist or machine learning engineer leverage multiple data sources to train a model without centralizing the data in one place? How can one benefit from multiple datasets without the risk of breaching data privacy and security?
As a community organizer, I have had the privilege of being a part of the Python community for the past ten years.
In that time, I have seen the community grow and evolve in countless ways. Python went from a top-10 language to the number-one language. Many new people joined the community, and new topics such as data & AI became part of the Python world. I'm super proud that Python is now the preferred language to learn.
I have also learned a great deal about what it takes to be an effective organizer and how to build and sustain a healthy community. I have also experienced how not to do it, and what it takes to pull through in hard times.
I learned a lot about leadership. I learned a lot about myself, my strengths and my weaknesses.
This has also helped me grow professionally and has had a very positive impact on how I work and lead in my day job as a partner in a data and AI consultancy.
Reading documentation: what if you could flip through a comic book in your hands, instead of scrolling online? What if you read a story instead of a bullet-point list? What if comics were documentation...?
Is that possible? It sure is! In this talk, we'll dive into the NumPy Contributor Comics, which were completed for Google Season of Docs 2023 (GSOD). Let's get more comics in open source!
Economics research has widespread policy and industrial applications. It is rapidly changing due to open source tools. Deep learning techniques have widened the range of models that it is feasible to solve, simulate, and estimate. This talk highlights recent contributions to computational economics and their Python packages. It is aimed at quantitative analysts and economists working in finance and public policy.
Sciris aims to streamline the development of scientific software by making it easier to perform common tasks. Sciris provides classes and functions that simplify access to core libraries of the scientific Python ecosystem (such as NumPy), as well as low-level libraries (such as pickle). Some of Sciris' key features include: ensuring consistent list/array types; allowing ordered dictionaries to be accessed by index; and simplifying parallelization, datetime arithmetic, and the saving and loading of complex objects. With Sciris, users can achieve the same functionality with fewer lines of code, reducing the need to copy-paste recipes from Stack Overflow or follow dubious advice from ChatGPT.
For online publications and media sites, recommendation systems play an important role in engaging readers. At Chartbeat, we are actively developing a recommendation system that caters to billions of daily pageviews across thousands of global websites. While conventional discussions frequently highlight the data science and machine learning facets of the system, the cornerstone of a successful application is its system architecture. In this presentation, we will dissect our architectural decisions designed to meet high-performance requirements and share insights gleaned from our journey in scaling up the recommendation system.
pandas has long been known for its robust I/O offerings, which include commonly used functions for interacting with SQL databases. With the new Apache ADBC drivers, users can expect even better performance and data type management.
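As a rough sketch of what this looks like (assuming pandas 2.2+ and the `adbc-driver-sqlite` package; the table is invented), the ADBC connection plugs straight into the existing `to_sql`/`read_sql` functions and moves data as Arrow rather than row by row:

```python
import pandas as pd
import adbc_driver_sqlite.dbapi as sqlite_adbc

df = pd.DataFrame({"id": [1, 2, 3], "amount": [10.5, 20.0, 7.25]})

with sqlite_adbc.connect() as conn:          # in-memory SQLite via ADBC
    df.to_sql("payments", conn, index=False)
    roundtrip = pd.read_sql("SELECT id, amount FROM payments", conn)

print(roundtrip.dtypes)
```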
How big should a machine learning deployment be? What is a reasonable size for a microservice container that performs logistic regression? Although Python is the most widely used ecosystem for data science work, it doesn't provide satisfactory answers to the size question. Size matters: if deploying to the edge or to low-resource devices, there's a strict upper bound on how large the deployment can be; if deploying to the cloud, size (and compute overhead) correspond directly to cost. In this talk, I will show how typical data science pipelines can be rewritten directly in C++ in a straightforward way, and that this can provide both performance improvements as well as massive size improvements, with total size of deployments sometimes in the single-digit MBs.
Polars is an OLAP query engine that focuses on the DataFrame use case. Machines have changed a lot in the last decade, and Polars is a query engine written from scratch in Rust to benefit from modern hardware.
Effective parallelism and cache-efficient data structures and algorithms are ingrained in its design. Thanks to those efforts, Polars is among the fastest single-node OSS query engines out there. Another goal of Polars is rethinking the way DataFrames should be interacted with. Polars comes with a very declarative and versatile API that enables users to write readable code. This talk will focus on how Polars can be used and what you gain from using it idiomatically.
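A small example of that declarative, expression-based style (the data is invented): the lazy query below is optimized as a whole before being executed in parallel across cores:

```python
import polars as pl

df = pl.DataFrame({
    "store": ["a", "a", "b", "b"],
    "amount": [10.0, 20.0, 5.0, 7.5],
    "returned": [False, True, False, False],
})

result = (
    df.lazy()
      .filter(~pl.col("returned"))
      .group_by("store")
      .agg(pl.col("amount").sum().alias("net_sales"))
      .sort("net_sales", descending=True)
      .collect()
)
print(result)
```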
A key challenge in training AI models is the lack of labeled or ground truth data. This is especially true in the remote sensing field, where seasonal changes and differences between label characteristics make it difficult to create a common labeled dataset. With the emergence of self-supervised learning, the requirements on the amount and quality of labeled data can be relaxed, but model performance is still of paramount importance. This is especially true for quantifying the sources and sinks of greenhouse gases that drive climate change. In this talk, we present how state-of-the-art AI technologies such as generative AI and Foundation Models can be used to estimate Above Ground Biomass (AGBD) changes due to extraction of CO2 from the atmosphere by vegetation. We demonstrate how these tools can be used by companies with a NetZero pledge to quantify, monitor, validate, and report their offsetting methodologies and sustainability practices.
Join us for the traditional PyData NYC Pub Quiz during lunch on Thursday. Grab some food and bring it to Winter Garden.
Lightning Talks will be in Central Park East/West after lunch. Sign up for a talk Thursday morning at the NumFOCUS table on the 5th floor.
Electricity is unique in that its storage is prohibitively costly; this makes it essential for supply to, at least, meet demand at every second of every day. In times of crisis when demand exceeds its projected amount, system operators need to fall back on different methods to lower demand. One such method is dynamic pricing, which incentivizes customers to lower their consumption by increasing electricity prices during those times. There are trials that test this pricing model against a flat-rate pricing model. This talk uses mathematical and statistical techniques applied in Python to assess the impact of dynamic pricing on consumption.
Unleash your data analytics superpowers with Python in Excel! Come learn from the Microsoft Excel product team as we share how you can do more with Python in Excel!
As web developers, we create large quantities of data (e.g., from test reports and analytics) that require insights for iterative optimization. But how does a JavaScript developer begin their journey into data science? Enter GitHub Codespaces, GitHub Copilot, and OpenAI. In this talk, I'll share my journey into creating a consistent development and runtime environment with GitHub Codespaces and Jupyter Notebooks, then activating it with OpenAI to support an interactive "learn by exploring" process that helped me (as a JavaScript developer) skill up on Python and data analysis techniques in actionable ways. I'll walk through a couple of motivating use cases, and demonstrate some projects related to competitive programming and data visualization to showcase these insights in action.
With an average of 3.2 new papers published on arXiv every day in 2022, causal inference has exploded in popularity, attracting a large amount of talent and interest from top researchers and institutions, including industry giants like Amazon and Microsoft.
There’s a very good reason for this upsurge in popularity. In our contemporary data culture, we have become accustomed to thinking that traditional machine learning methods can provide us with answers to any interesting business or scientific question.
This view turns out to be incorrect. Many interesting business and scientific questions are causal in their nature and traditional machine learning methods are not suitable to address them.
In this talk, aimed at data scientists and machine learning engineers with at least 3 years of experience, we’ll show why this is the case, introduce the fundamental tools for causal thinking, and show how to translate them into code.
We’ll discuss a popular use case of churn prevention and demonstrate why only causal models should be used to solve it.
All in Python, repo included!
In this talk, we will explore the world of Media Mix Modeling (MMM) and examine the application of Bayesian and Frequentist regression methods to derive valuable marketing insights. MMM plays a pivotal role in marketing strategy, enabling businesses to measure marketing effectiveness and optimize budget allocation across various channels. The presentation will offer an overview of both Bayesian and Frequentist approaches, comparing their principles and applications within the MMM context. We will present a real-world case, showcasing how these two models were used to deconstruct sales driven by several factors and generate actionable insights. Finally, we will share reflections and recommendations for marketers and data scientists regarding further exploration and adoption of different techniques in MMM, empowering businesses to make informed decisions and effectively allocate their marketing resources.
Numerous tech professionals are transitioning to full-time YouTube careers, signifying the platform’s viability beyond a side gig. In this talk, you will get a glimpse of YouTube's impact on data science education and learn about a 5 step framework for becoming a successful tech (or data science) YouTuber.
"Should I just run it in production to see if it works!" is a common starting point for many python data engineers. Don't let it be your end point. Any software product deteriorates rapidly without disciplined testing. However, testing data pipelines is a hellish experience for new data developers. Unfortunately, there are somethings that are only learnt on the job. It is a crude reality that data pipeline testing is one of those fundamental skills that gets glossed over during the training of new data engineers. It can be so much more fun to learn about the latest and greatest python data processing library. But how can a new python data engineer transform a patchwork of scripts, into a well engineered data product? Testing is the cornerstone of any iterative development to reach acceptable confidence in the outputs of any data pipeline. This talk will help with an overview of modern data pipelines testing techniques in a visual and coherent game plan.
Why bother testing data pipelines? Billions of budget dollars regularly rely on the excellence of the data scientists, data engineers, and machine learning engineers behind the countless software data pipelines that inform critical business decisions.
Numerous packages exist within the Python open-source ecosystem for algorithm building and data visualization. However, a significant challenge persists, with over 85% of Data Science Pilots failing to transition to the production stage.
This talk introduces Taipy, an open-source Python library for building web applications’ front-end and back-end. It enables Data Scientists and Python Developers to create production-ready applications for end-users.
Its simple syntax facilitates the creation of interactive, customizable, and multi-page dashboards without having to know about HTML or CSS. Taipy generates highly interactive interfaces, including charts and all sorts of widely used controls.
Additionally, Taipy is engineered to construct robust and tailored data-driven back-end applications. It models dataflows and orchestrates pipelines. Each pipeline execution is referred to as a scenario. Scenarios are stored, recorded, and actionable, enabling what-if analysis or KPI comparison.
Finally, Taipy also provides a cloud tool to host, deploy, and share your Taipy applications easily. In addition, this platform provides the ability to manage, store, and maintain the various states of your backend.
As data scientists, we are often unaware of what's happening under the hood when our models are put in the corporate cloud environment. If I had to deploy a model to the cloud provider from scratch, what are the options I have? This talk gives an introduction to basic cloud computing concepts and services so next time you will be confident in evaluating the different alternatives for taking your models to the cloud.
This talk explores the intuitive algorithm behind Shiny for Python and shows how it allows you to scale your apps from prototype to product. Shiny infers a reactive computation graph from your application and uses this graph to efficiently re-render components. This eliminates the need for data caching, state management, or callback functions, which lets you build scalable applications quickly.
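A minimal sketch of what that looks like in practice (the app content is invented): the text output reads `input.n()`, so Shiny's reactive graph knows to re-render only that output when the slider changes:

```python
from shiny import App, render, ui

app_ui = ui.page_fluid(
    ui.input_slider("n", "Number of points", min=10, max=1000, value=100),
    ui.output_text("summary"),
)

def server(input, output, session):
    @output
    @render.text
    def summary():
        return f"You asked for {input.n()} points."

app = App(app_ui, server)   # run with: shiny run app.py
```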
Have you experienced the frustration of installing a new Python package, only to discover that your existing code and Jupyter notebooks no longer function as expected? Have you found yourself uttering the phrase, "It works on my computer," to a colleague? Have you ever looked at Randall Munroe's five-year-old XKCD comic on Python environments (https://xkcd.com/1987/) and felt torn between laughter and despair?
Well, there's hope on the horizon. In this presentation, we will delve into an open-source tool that enables the creation and management of a set of stable, reproducible, and version-controlled environments right on your laptop, all through an intuitive graphical user interface (GUI).
It's common for machine learning practitioners to train a supervised learning model, generate feature importance metrics, and then attempt to use these values to tell a data story that suggests what interventions should be taken to drive the outcome variable in a favorable direction (e.g., "X was an important feature in our churn prediction model, so we should consider doing more X to reduce churn"). This simply does not work, and the idea that standard feature importance measures can be interpreted causally is one of data science's more enduring myths. In this session we'll talk through why this isn't the case, what feature importance is actually good for, and we'll give a brief overview of a simple causal feature importance approach: Meta Learners. This talk should be relevant to machine learning practitioners of any skill level who want to gain actionable, causal insights from their predictive models.
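As a hedged preview of one meta-learner, the T-learner (the data below is simulated and the column names invented): fit separate outcome models for treated and control units, then estimate the treatment effect as the difference in their predictions, which is the causal quantity feature importance alone cannot give you:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "tenure": rng.uniform(0, 5, 1000),
    "treated": rng.integers(0, 2, 1000),
})
df["spend"] = 50 + 10 * df["tenure"] + 5 * df["treated"] + rng.normal(0, 1, 1000)

X_cols = ["tenure"]
treated, control = df[df.treated == 1], df[df.treated == 0]
model_t = GradientBoostingRegressor().fit(treated[X_cols], treated["spend"])
model_c = GradientBoostingRegressor().fit(control[X_cols], control["spend"])

# Per-customer conditional average treatment effect (CATE).
cate = model_t.predict(df[X_cols]) - model_c.predict(df[X_cols])
print(cate.mean())   # close to the simulated effect of 5
```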
Everyone is invited to join us at Printer’s Alley (215 W 40 St, New York, NY 10018) on Thursday after the conference, 5-7pm. You will find the party area upstairs at the venue. Please bring your conference pass to claim your free drink tickets.
Keynote by Michelle Gill
Developing reliable code without writing tests may be a far off dream, but Hypothesis' ghostwriter function will generate tests from type hints. The resulting tests are powerful and often appropriate for data analysis. In this talk, I'll discuss how to add tests to your data analysis code that cover a wide range of inputs -- all while using just a small amount of code.
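As a hedged example of the workflow (the `normalize` function is invented), you ask the ghostwriter for a test for a type-annotated function and paste the generated module into your test suite; the same thing is available on the command line as `hypothesis write yourmodule.normalize`:

```python
from hypothesis.extra import ghostwriter

def normalize(values: list[float]) -> list[float]:
    total = sum(values)
    return [v / total for v in values] if total else values

# Prints a ready-to-run pytest module that exercises normalize with generated inputs.
print(ghostwriter.fuzz(normalize))
```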
Finding good datasets or web assets to build data products or websites with, respectively, can be time-consuming. For instance, data professionals might require data from heavily regulated industries like healthcare and finance, and, in contrast, software developers might want to skip the tedious task of collecting images, text, and videos for a website. Luckily, both scenarios can now benefit from the same solution, Synthetic Data.
Synthetic Data is artificially generated data created with machine learning models, algorithms, and simulations, and this workshop is designed to show you how to enter that synthetic world by teaching you how to create a full-stack tech product with five interrelated projects. These projects include reproducible data pipelines, a dashboard, machine learning models, a web interface, and a documentation site. So, if you want to enhance your data projects or find great assets to build a website with, come and spend 3 fun and knowledge-rich hours in this workshop.
Building ML systems in business settings isn't just about algorithms; it's also about navigating unpredictable data, fostering efficient collaboration, and promoting constant experimentation. This talk delves into Metaflow's solutions to these challenges, spotlighting its event-driven patterns that streamline model training while leveraging infrastructure shared by multiple data-scientists/engineers. Through a hands-on demonstration, we'll showcase how Metaflow facilitates asynchronous training upon new data arrivals for finetuning a state-of-the-art LLM like LLaMAv2 (although the framework is general and widely applicable). Additionally, we'll highlight how multiple developers can seamlessly experiment with various models as new data becomes available.
Have questions after a keynote? Excited to learn more from one of our keynote speakers? Fireside chats are a great opportunity to meet our speakers in a more intimate setting.
The pandas library for data manipulation and data analysis is the most widely used open source data science software library. Dask is the natural extension for scaling pandas workloads to more than a single machine. The continuing integration and adoption of Apache Arrow accelerates historical bottlenecks in both libraries.
Code Duels returns to PyData NYC for the second year in a row! Show up and check it out! This year, we have a friendly competition with three game modes:
Many data science initiatives fail because of the unavailability of good data.
In this talk, I go over examples of bad data I’ve encountered in real life projects. I present hypotheses about the reasons that lead to bad data; tools, techniques and patterns to “debug and correct” bad data; simple workflows for validation, verification and automation to identify bad data using commonly available tools from the PyData and Python ecosystem. Finally, I’ll go over some prescriptive techniques for how you can approach data projects with non-ideal-data to improve odds of success.
Traditional time series models such as ARIMA and exponential smoothing have typically been used to forecast time series data, but machine learning methods have been able to set new benchmarks for accuracy in high-profile forecasting competitions such as M4 and M5.
However, the use of machine learning models can easily lead to inferior results under common conditions. This talk discusses how each of these families of methods can be used to model time series data, and demonstrates how sktime provides a unified framework for implementing both families of techniques.
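As a hedged illustration of that unified interface (using sktime's bundled airline dataset), a classical baseline and an ML-based reduction share the same fit/predict API:

```python
from sklearn.ensemble import RandomForestRegressor
from sktime.datasets import load_airline
from sktime.forecasting.compose import make_reduction
from sktime.forecasting.naive import NaiveForecaster

y = load_airline()
fh = [1, 2, 3]  # forecast horizon: the next three periods

classical = NaiveForecaster(strategy="last").fit(y)
ml = make_reduction(RandomForestRegressor(), window_length=12).fit(y)

print(classical.predict(fh))
print(ml.predict(fh))
```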
Explore the intricacies of designing, implementing, and maintaining a production-ready LLM-based self-serve analytics platform. Learn about common pitfalls, essential design considerations, performance evaluation, and robust security measures like protection against prompt injection attacks.
Python promises productivity, GPUs promise performance, but if you ever try to fire up a program on a GPU you will find that it is often slower than a CPU. Over the last decade, the Python ecosystem has embraced GPUs in numerous libraries and techniques. We survey what works with GPUs and some of the libraries that one can use to accelerate the Python workflow on a GPU.
From music composition with code to style transfer and music generation with machine learning (ML), the intersection of music, programming, and ML is opening up exciting avenues for innovation and creativity. This session will take you through a whirlwind tour of the latest techniques and applications that use code and ML to compose music, generate novel rhythms, and evolve the way we approach audio production. If you want to get started writing code and algorithms that create or enhance music, this session will give you the right tools to begin your journey and continue learning.
Moving data in and out of a warehouse is both tedious and time-consuming. In this talk, we will demonstrate a new approach using the Snowpark Python library. Snowpark for Python is a new interface for Snowflake warehouses with Pythonic access that enables querying DataFrames without having to use SQL strings, using open-source packages, and running your model without moving your data out of the warehouse. We will discuss the framework and showcase how data scientists can design and train a model end-to-end, upload it to a warehouse and append new predictions using notebooks.
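A hedged sketch of the style (connection parameters, table, and column names are placeholders): the query below is translated to SQL and executed inside the warehouse, so the data never leaves Snowflake:

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import avg, col

session = Session.builder.configs({
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<db>", "schema": "<schema>",
}).create()

orders = session.table("ORDERS")
summary = (
    orders.filter(col("STATUS") == "SHIPPED")
          .group_by("REGION")
          .agg(avg(col("AMOUNT")).alias("AVG_AMOUNT"))
)
summary.show()   # runs in the warehouse; .to_pandas() pulls the result locally
```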
Scikit-learn was traditionally built and designed to run machine learning algorithms on CPUs using NumPy arrays. In version 1.3, scikit-learn can now run a subset of its functionality with other array libraries, such as CuPy and PyTorch, using the Array API standard. By adopting the standard, scikit-learn can now operate on accelerators such as GPUs. In this talk, we learn about scikit-learn's API for enabling the feature and the benefits and challenges of adopting the Array API standard.
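A hedged example of the configuration involved (assuming scikit-learn 1.3+, `array-api-compat`, and PyTorch are installed): with dispatch enabled, an estimator that supports the standard keeps the computation in the input array's library:

```python
import sklearn
import torch
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = torch.randn(100, 4)
y = (torch.rand(100) > 0.5).float()

with sklearn.config_context(array_api_dispatch=True):
    lda = LinearDiscriminantAnalysis().fit(X, y)
    preds = lda.predict(X)   # stays a torch tensor, on the same device as X

print(type(preds))
```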
This session will provide a case study of using Llama2-70b to tackle a data transformation friction point in reinsurance underwriting. The final approach of the solution is industry agnostic. We will walk through our thought framework for breaking down a business problem into LLM-able chunks, lay out the explored solutions and best-performing method, compare local vs. at-scale inference, and explain how we evaluated the unstructured LLM responses to prevent hallucination and ambiguity in getting a structured response.
This talk explores how we create smarter, more reliable economic policies by using a technique called “Robust Control”. Robust control is a technique from control systems engineering focused on driving desired outcomes even when there is a lot of uncertainty about the circumstances and behaviors of the system these policies endeavor to regulate. This methodology assumes we've developed an incomplete structural model of a system: parts of the system have well-defined and enforceable rules, whereas others are models of phenomena outside of our control, such as user behavior. This talk reviews the application of a Robust Control informed workflow to select the parameters of a pricing algorithm. Code and data will be shared from the design and pre-launch tuning work, and we will also use data to demonstrate how the real-life deployed system did and did not match our models. The goal of this talk is to demonstrate how practices from control engineering can be applied in data science applications, especially those where algorithms make decisions with intent to influence human behavior at the user population level.
As we descend from the peak of the hype cycle around Large Language Models (LLMs), chat-based document interrogation systems have emerged as a high value practical use case. The ability to ask natural language questions and get relevant answers from a large corpus of documents has the potential to fundamentally transform organizations and make institutional knowledge accessible.
Retrieval-augmented generation (RAG) is a technique to make foundational LLMs more powerful and accurate, and a leading way to implement a personal or company-level chat-based document interrogation system. In this talk, we’ll understand RAG by creating a personal chat application. We’ll use a new OSS project called Ragna that provides a friendly Python and REST API, designed for this particular case. We’ll also demonstrate a web application that leverages the REST API built with Panel–a powerful OSS Python application development framework.
By the end of this talk, you will have an understanding of the fundamental components that form a RAG model as well as exposure to open source tools that can help you or your organization explore and build on your own applications.
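To make those components concrete, here is a deliberately stripped-down illustration (this is not Ragna's API; `embed` is a toy stand-in for a real embedding model, and the final prompt would be sent to an LLM):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy embedding: hash words into a small unit-length vector.
    vec = np.zeros(64)
    for word in text.lower().split():
        vec[hash(word) % 64] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

documents = [
    "Ragna provides a friendly Python and REST API for RAG applications.",
    "Panel is a Python framework for building data apps.",
]
doc_vectors = np.stack([embed(d) for d in documents])

def build_prompt(question: str) -> str:
    scores = doc_vectors @ embed(question)        # similarity of the question to each document
    context = documents[int(np.argmax(scores))]   # retrieval step
    return f"Answer using this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("What API does Ragna provide?"))  # in a real system, send this to the LLM
```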
We're always told that early optimization is bad, not just for our code but our designs and processes. Yet very few of us are able to find that magical balance between getting things done and doing more background research and design. We're always lured by the siren's call of getting things "right" on the first try. This talk is about the process of finding the balance between doing the work and thinking about the work.
Spark, Dask, DuckDB, and Polars are popular dataframe tools used at large scale. We benchmark these across a variety of scales (10 GiB to 10 TiB) on both local and cloud architectures with the standard TPC-H benchmark. No project emerges unscathed.
The Customer Lifetime Value (CLV) model is one of the major techniques of customer analytics, and it helps companies identify who their valuable customers are. A high CLV indicates customers who deserve more marketing resources. If a company overlooks CLV, it might invest more in short-term customers who buy just once.
To predict future CLV, we encounter sub-problems like forecasting the time until a customer's next purchase (“Buy Till You Die” modeling) and the probability of a customer's churn (Survival Analysis).
Recently, PyMC-Marketing was released, and it's becoming more feasible to implement these models with the Bayesian approach.
In this talk, I will show the key concepts of CLV prediction, a demonstration using PyMC-Marketing, and practical tips.
Have questions after a keynote? Excited to learn more from one of our keynote speakers? Fireside chats are a great opportunity to meet our speakers in a more intimate setting.
We take a deep dive into the world of generative AI models and their transformative applications in e-commerce, specifically focusing on the implementation of these models for tasks such as text generation, dynamic and personalized content creation and AI assistants.
Conduit is a Python package and terminal user interface (TUI) designed to streamline the process of building, deploying, and managing machine learning (ML) pipelines on GCP’s Vertex AI platform. Its primary goal is to improve efficiency for data science teams by reducing the time it takes to deploy ML models while observing MLOps best practices.
Our talk aims to uncover the Impact of Image Synthetic Data in Disease Diagnosis. Delve into the domain of Generative AI in medical imaging, discover the potential of synthetic data to revolutionize rare disease diagnosis, and explore the ethical and practical considerations surrounding this groundbreaking technology.
Keynote by JJ Allaire