March 1, 2023

12:55 pm - 1:25 pm PST

PRQL: a modern language for transforming data

PRQL is a modern language for transforming data, and a simple, powerful, pipelined SQL replacement. This talk will introduce PRQL to your community of software engineers, showcasing how it makes it easier to work with your data. PRQL transpiles to SQL so you can use it wherever you currently use SQL, including relational databases such as PostgreSQL, MariaDB, SQL Server, SQLite; modern data warehouses such as BigQuery and Redshift; and even DataFrame libraries like Pandas and Polars. PRQL has many integrations, including JavaScript, Python, R, CLI, etc. that make it easy to start using PRQL right away.

This talk will demonstrate some of these use cases with interactive examples that you can try for yourself afterwards. PRQL is completely open source (Apache 2.0 licensed), volunteer driven, and committed to staying that way forever to vendor lock-in and remain free to use for everyone. To learn more about PRQL, go to our website(https://prql-lang.org/), GitHub Repo (https://github.com/prql/prql), or try it out in our online playground (https://prql-lang.org/playground/).

Topics Covered

Open Source

Sign up to watch all Subsurface 2023 sessions

Transcript

Note: This transcript was created using speech recognition software. It may contain errors.

Tobias Brandt:

Hi, everyone. Yes. Hi. I’m Tobias a con contributor and developer advocate. And with us today is a core compiler developer. Now, our agenda is very simple, make querying data better, and for that, we are going to introduce you to sql modern language for transforming data. But you might say, well, I already know sql. Why do I need to find out about a another language? So for that reason, we’re gonna start off with what is great about sql. There’s a lot we like about SQL and we’ll, we’ll start with that then. We’ll, Aash will take us through some things where we believe SQL could be improved. Then we’ll look at introducing a pipeline relational query language, i e prequel. And finally we’ll look at where today, right now you can use prequel and make your life easier in working with data.

So be sure to stick around until then. All right. So what is great about sql and SQL is the lingua franca of data. I it is spoken anywhere that you work with data. So from traditional relational database management systems to cloud data warehouses and data frame libraries like pandits and Polars have SQL interfaces and also like data lakehouse engines like Apache Iceberg or Dremeo, et cetera. So yeah, so, and what is it that we like about prequel it? Well, first of all, it is relational. It deals with tabular data where columns have specific types and the tables are related to each other. And there’s a whole like relational algebra that’s well studied that works with that. And Aash will say more about that. And second of all, it’s declarative. So by that we mean that you declare what data you want to get out out at the end.

 and the query LA optimizer, the query engine has the freedom to optimize the query and decide how to get that data to you so you don’t have to specify which loops to, to write or et cetera. So that gives the yeah, the optimize a lot of freedom to come up with, with creative plans that work better and make it efficient to retrieve the data. However, we think that’s not the end of the story, and, and SQL is, is quite old at this stage. So we think there can be some improvements. So at this point, I’ll hand over to Ali to talk about what we think could be better about sql.

Aljaž M. Eržen:

Yeah, thank you. so the core thing that already was mentioned is the relational part. esque was built a long time ago with, with even some noble prices. and the core thing here is that it uses relational algebra. It is based on a well-defined mathematical construct relations. And here on the slide, you can see on at the bottom some equations from that mathematical base of, of the relation of algebra. You can see that we have tracks and albums, both of them relations, and you do a Cartesian product of them. And then you give this relation resulting relation to the pie function or sigma function which is basically filtering. And you do filter over tracks, albums and album albums. And this is a basic join, you know, from as well. And then you pass that back up, up into a, the pie function, which is projection.

And this takes only track Id name and titled out of the code relation. And on the top you can see the corresponding query in as esque. And you can see that it’s matching a lot. But it, it’s the same thing, semantically, but not synt tactically. And you, you, you probably know that if you search the, the, the internet for formal definitions of esque, you will find relational algebra because most people just, I mean, most papers just define the relational algebra as the semantics of esque and then slap over, okay, and this, this can be done with where keyword, this can be done with, with from keyword, but you never have defined the name resolution rules and stuff like this. So there’s a disparity between actual relational algebra and as well and, but we actually equate these things in a lot of ways to the point where we named the move from non-relational move to non-relational databases.

No, as well. Yeah. So how did we come to here? it’s an old story. esque was evolved from structured query language which looked like cobalt and still does. And this is a good indication that this is a lang old language, but it’s also visible in some other constructs. Like here on the top you can see a query for window functions that has basic syntex of an aggregation function, sum total, but then slapped over that is a keyboard over that now signifies that this is a window function. It does not do aggregation. And yeah, it may be the best looking way to represent this as in a sentence, sentence, but this is not really, really nice design of, of the APIs. We can do much better than that. Also the distinct keyboard, for example, is a very important keyboard that if you know a little bit about as well is applied at the end of your query.

But because we people speak, the way we do it is inserted under in on the front of in esque Syntex. So there is some disparity of, of semantics and syntax here. And then esque is not really composable. It does not feel like a programming language. And with this, I mean, look at this query here. I have query from tracks and I’m grouping by album id, and then I want to do join by this album id that’s that I just grouped by. And this is not possible to do with just one query. I have to nest queries here, and I could be using common table expressions here, but this is not really, it, it’s not easy to do. I mean, you have to move things around. the floor is not from top to bottom. And I don’t feel like this is composable and, and I feel that like we can do better.

And now a little bit more about name resolution as well here have a query that starts from album albums and it defines a new name for the name column. new alias named album name. And this album name is now available to use in group by an orbi, but not in where, because if you know esque, you know that where happens before the projection. So the, the alias is not available in where, and how do I know that? Well, you have to memorize it for each close. There are their own name resolution rules. And autobi can even contain just just number lit lits that refer to po positional positional column in your relation. And this is, I mean, it’s documented somewhere, but it, this is not consistent across the, across databases. And this is the last problem we have with, with, with as well.

We used to work on just one database in your day job. But it’s in the days of data science and interacting with multiple data warehouses, you have to know a few dialects just for your day job. And you know that queries are not just, does not just work if you move from move them from one database to another. Like here I have example for a simple thing like specifying the limit of your query and you can see that postgre syntax is not the same as this Microsoft desk. Well, so you basically have to keep track of all this. And this is not a good, good approach and we think that we can do better. And the main question now is how do we fix that? it’s a tough question. And at prequel we just said, well, you don’t fix it.

Let’s start a new, let’s build a new language and keep, keep the good parts of as well. And with this, I mean mainly relations, relational algebra, that that is well defined and works. And we built this with consistent seman semantics, define what, what’s the exact name resolution and use that name resolution across all Europe closes design a syntax with new features that we acquired over the last 40 years of language development. And if you want to be really really the cherry on the top is that you want to compile this language to as well, so it can execute on any database that that supports as well, which is basically annual relational database. Okay, so we named the the thing prequel, which stands for pipeline relational query language. I already talked about relations and query language is self-explanatory, but let’s focus on pipeline here because in prequel we don’t have queries.

The base unit here is pipeline. And this is basically a series of functions that transform your relations. Here on the left side, you can see that I start with from tracks and then select a few columns, title, unit, price, and album id. This here is a list because the whole, the whole list of, of columns is just one argument. Then I derive a new column, gross price, which is unit price times sum factor, and then filter the row by this just defined column, gross price. Then I do grouping and don’t mind the, the whole syntax here. This is aggregation by we have this defined in depth, but it’s, I don’t have enough time here. And then I do filtering again. And if you know, esque, the first filter would translate to where, and the second one would translate to having, because I’m filtering on number of tracks, which was just aggregated.

And then I apply some sorting and take 10 elements. And the first thing you can see is that the flow is top to bottom. It’s al always happening one step at a time and actually to the point where you could actually cut the pipeline at any point. And the first part would be a valid relation. So you can think of, okay, I have my pipeline relation, it has this and this columns, and then just apply a new function on that, on that. And you can just keep going and applying new, new functions and transforming your relation. And this is really nice to bug because it’s easy to, to just pull some pipe so some, some prefix of the pipeline out and do something on that or define it’s in the variable and, and so on.

Okay, so to get deeper into benefits of, of this approach is that I have to tell you that we don’t have clause if you paid attention, no clauses, but functions, each of these keyboards then that you see on the screen, these are functions that take arguments. And this is important because we designed them to be orthogonal to to each other with this. I mean that they don’t share much much functionality between each other and they try to be as different as possible. And this allows us to define basically nine different functions that transform your, your relations. And this is all that you need to learn to, to basically get started with, with, with preco. it also makes them really composable. So we can do things like inferring that, okay filter and sort don’t interact with each other so you can switch their, their, their place as as you want it to.

And now to go a step further, we strive to be a functional language. So obviously we apply functions, but here I’m actually defining a user function. let’s, let’s read it together. So function, take cheapest takes a number and a relation, and then defines a new pipeline that takes that relation, applies some sorting over unit price and then takes an elements. And this is now a function like any other function that we used in the pipeline before, and we can use it as from tracks take cheapest 10. And now you probably get the feeling that this is be becoming composable. This feels like a programming language, well, at least I do. And on top of that, we can now put aliases in front, which is way more readable than having them in the back have a separate syntax for dates, for daily turtles. Like we are database, which we should have that the FS strings for for like in Python for for basic ation. We have operator for new ability checks. it’s easy to comment outlines because lines are usually just one transformer per line. So this is way more easier. And we have trailing commerce, we have underscores in numbers to make them more readable, just the nitpicks that that we acquired over the last 40 years of language development.

Okay? And also one thing that we have in our principles is that as a developer, when when you are querying databases, you are thinking about your, your data, you are writing this down and then executing your query and waiting for the result. And you know that if you want to be productive to be fast, you shouldn’t be waiting for a database. if you have to wait five seconds, a minute, 10 minutes for CI or something, this is, this blows your, your productivity away and you want this cycle to be fast. And that’s why we never try to connect, we infer from your queries. So when, when you’re building a pipeline and you say, from this table, select these two columns, the compiler remembers, oh, there are these two columns in your table. And it does not try to connect and, and, and, and infer and, and try to inspect your database.

 but we are planning to add support so you, your language server could pull in the definitions to your actual file to, to even have help you even more. And this allows us to fail early and give you information like, Hey, you missed a comma here, or Hey the relation here does not contain this, this column you, you, you did something wrong. And this really helps you, you, your productivity and also the fact that we can provide the good errors. here on the, on the slide, I have an example of actual error from the compiler that’s saying, okay, you specified column name before the list, but it should be, should go inside the list. And this is something that modern languages can do and it really helps you your development. Okay, so that’s all I have to say about prequel. Come, come and check it out. And Tobias, you tell us how.

Tobias Brandt:

Okay, thank you Ali. So yeah, I mean, I hope there’s some things you’ve seen that you like it, like ask yourself how can I bring those ergonomics and that composability to my workflow? And yeah, luckily you can get started right now. So if you open up a new browser tab and you go to prequel-lang.com/playground, we’ve got our prequel playground where we’ve got the compiler com running, right, a new browser. And, and in the left hand, on the left hand side, you can enter your example prequel queries and on every keystroke they’ll be compiled. And when, if it’s a valid query, it’ll produce skl on the right hand side. So you can try it out right here, right, right now for other work in your day-to-day job. Next we’ve got prequel query PQ for short, which a tool like JQ for working with Jason is a prequel tool for working with relational tabular data in the command line right in your terminal.

And the example query there shows you how you can run a simple prequel query over a CSV file. You can also use query multiple CSV files or parque files join them together and run any kind of query on, on those. So, and that’s all made puzzled by the power of DTB and Data Fusion and you can switch between those backend. So that’s something I’ve used in my day job and yeah, I’ve really enjoyed it. Next we have our VS code extension and the screenshot is already bit out of date. Cause this has seen a tremendous amount of work in the last few work weeks and it integrates very nicely with other SQL tools extensions. And similar to the playground, you can write your Precal Curry preview, the, the S sql, and then execute it right against your database backend from within VS code.

So be sure to check that out after that talk. Next Python, which is another universal well loved language in, in data science and there is as simple as pip install pipe prequel. And then that gives you access to prequel accessor function on Pandas data frames, and you can just run prequel queries right on your Pandas data frames. And second of all we’ve also have a Jupyter Magic, which is the bottom half of the screen. And there with a a percent percent prequel magic, you can write your prequel query you write in the Jupyter Cell and then execute that either against data frames in memory or other database backends that you got access to. Finally we also have r and there’s a prequel R package that we can that plugs right into the tidyverse and you can run your prequel queries in r and just if you look at the bottom example using the tidy query show dli command, you can even generate dli queries back from prequel.

And if you just compare the two, you’ll see how similar prequel is to Dly and because Dly was one of the inspirations for prequel that is just a, a test of some of the extensions. And there’s also JavaScript ones like Node or, and you’ve seen a lot of recent work on jdbc. So you can even use prequel and tools like DBE or Data Grip to interact with your databases. So there’s a lot happening. go and check it out, prequel-lang.org or on GitHub at prequel slash prequel or come talk to us on Discord. We are very friendly. Open community prs are very welcome and get merged quickly and yeah, with as a community will, the language will grow and from go for strength to strength. So come check it out and thank you. That’s all from us.

header-bg