Why is SQL AI such a hard problem?

SQL is usually how you work with your data in Deepnote and Deepnote AI is now able to help with that by generating SQL based on input in natural language. It sounds simple, and it is simple to use! But it wasn’t easy to implement. Why is generating SQL with AI such a hard problem to solve?

How to effectively prompt Deepnote AI

It’s like generating Python, right?

Writing Python is like using a language you know but writing SQL is like using a map to find your way through a landscape—the "map" being the database schema, and the "landscape" being the database itself.

You need to know the names of the tables and columns to get to the data you want. Compared to generating Python, you also need to provide a “map” 🗺️.

Database schemas are big

Database schemas can be big. Think hundreds of thousands of tables 🤯. The full schema might not fit into the context window of current language models. But there’s no need to cram it all onto your “map”. Your “map” just needs to include tables and columns which are used by the SQL query that leads to your data. But you need to decide which ones are those.

Finding relevant tables

Apart from being large, database schemas can also be quite messy. Table names may use various naming conventions and column names may include symbols or numbers which lose their significance when taken out of their specific context. What if there are multiple tables sharing the same name 😕? Finding only the relevant tables and columns for your “map” is challenging.

Who is a “monthly active user”?

But messy names aren't the only reason why navigating the schema is challenging. To do it well, you need the semantic context about what the data represents and the links in between. Just as new team members need orientation 🧑‍🎓, the AI must learn to identify, for example, a 'monthly active user' based on the specific business logic and conventions embedded in the database schema.

There isn’t just one SQL

Unlike many programming languages, SQL has dialects 🗣️, similar to how English has variants, for example. While most of the syntax and functions stay the same, most databases and data warehouse have their own variant. SQL code written for Snowflake might not work against a Postgres database, because some of the syntax is different. Your solution needs to be fluent in every dialect.

Putting it together

In order to generate a valid SQL which answers the user’s question, you have to find the relevant parts of the data landscape and draw them together on a map. To do that, you need to be a skilled cartographer of data maps, an excellent navigator through the data landscape and a SQL polyglot 💪. Oh, and you want to do it all pretty quickly! 🏃‍♂️💨