11-01, 09:00–10:30 (America/New_York), Winter Garden (Room 5412)
Large Language Models (LLMs) offer a lot of value for modern NLP and can typically achieve surprisingly good accuracy on predictive NLP tasks with a well-structured prompt and few to no labelled examples. But can we do even better than that? It's much more effective to use LLMs to create classifiers than to use them as classifiers. By using LLMs to assist with annotation, we can quickly create labelled data and build systems that are much faster and more accurate than LLM prompts alone. In this workshop, we'll show you how to use LLMs at development time to create high-quality datasets and train smaller, task-specific, private and more accurate fine-tuned models for your business problems.
This workshop includes a hands-on experiment comparing how an LLM performs on a classical NLP task against the wisdom of an annotator crowd. To run the experiment, we'll generate labels with LLM prompts and then review and correct those labels on a collaborative data annotation platform, where participants will label a corpus in real time. After completing the annotations, we'll train a lightweight NLP model on the corrected labels and compare its performance to the LLM's, both quantitatively and qualitatively. We'll highlight where LLMs perform well and where they struggle, along with how annotators can enhance, or sometimes hurt, such LLM-assisted annotation workflows.
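To make the workflow concrete, here's a minimal sketch in Python, assuming an OpenAI-style chat API and a scikit-learn pipeline standing in for the lightweight model. The model name, label scheme and two-document corpus are placeholders, and in the workshop the review step happens in a collaborative annotation tool rather than in code:

```python
# Sketch of the LLM-assisted annotation loop: LLM pre-labels -> human review -> small model.
from openai import OpenAI
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
LABELS = ["positive", "negative"]  # hypothetical label scheme

def llm_label(text: str) -> str:
    """Ask the LLM for a zero-shot label; fall back to the first label."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{
            "role": "user",
            "content": f"Classify as one of {LABELS}. Reply with the label only.\n\n{text}",
        }],
    )
    answer = response.choices[0].message.content.strip().lower()
    return answer if answer in LABELS else LABELS[0]

texts = ["Great service!", "Never coming back."]  # placeholder corpus
draft_labels = [llm_label(t) for t in texts]      # step 1: LLM pre-labels the corpus
gold_labels = draft_labels                        # step 2: annotators review/correct here

# Step 3: train a small, fast, private model on the corrected labels.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, gold_labels)
```

The point of the design is that the LLM is only needed once, at development time; the trained pipeline is what ships, and it can be evaluated head-to-head against the raw LLM labels.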
The workshop is open to all levels of experience, from beginners who have never created annotated data before, to experienced developers who have previously worked on large-scale annotation pipelines.
Reasons to attend:
- Experience what it’s like to annotate training data. Data quality ain’t trivial!
- Explore tricks that might make it easier to annotate data for ML (active learning, LLM helpers and friends).
- See the effect of the group’s annotations on the performance of a model.
- Learn practical tips and strategies you can apply to your own business problems to build more accurate models faster.
- Participate in the development of a large annotated dataset that will be publicly released.
No previous knowledge expected
Ines Montani is a developer specializing in tools for AI and NLP technology. She’s the co-founder and CEO of Explosion and a core developer of spaCy, a popular open-source library for Natural Language Processing in Python, and Prodigy, a modern annotation tool for creating training data for machine learning models.
Ryan is an ML Engineer and Prodigy Customer Lead at Explosion. Previously, he led the NLP team in Bank of America's Chief Data Scientist Organization and worked in various roles in credit risk management, analytics, and modeling. His academic work spans fields such as human-computer interaction, cognitive science, information visualization, and computational social science.