Any discussion of database design would be remiss not to include SQL. SQL is so popular that other databases are identified as being “NoSQL”. We are going to break this discussion into two parts: first the “S” - Structured, and second the “QL” part - a query language.
Structured Data
Structured data is the schema system for dealing with the content and meaning of data. Content is what can be stored in a data field, basically its type. Meaning is how that data is signified or represented; typically two things help indicate the meaning of structured data: the field name and the hierarchy within a table.
Structured data in essence serves as a treasure map for where things are located in your data. It allows you to know what is there, and where to find it. This is important for the organization, planning, and auditing of an application. It is invaluable in testing, debugging, and changing your code.
With unstructured data, individual items can still be stored meaningfully, but there is no overarching design or pattern for what is stored and how the pieces relate to one another.
Basically, any serious application, which involves a significant degree of complexity or future changes, should have a system for structuring the data. This does not necessarily need to be handled by the database, as is the case with SQL tables, but it needs to be handled one way or the other. For example, you could have a code file or script that tracks all of the data fields in your application, and use it to verify that the correct contents are stored in each.
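The code-level approach could look something like the following sketch: a hand-maintained schema dictionary (the field names and types here are invented for illustration) used to validate every record the application stores.

```python
# A minimal sketch of enforcing structure in application code rather than
# in the database. SCHEMA is the single place where fields are tracked;
# the names and types are hypothetical examples.
SCHEMA = {
    "name": str,
    "birthday": str,  # ISO date string, e.g. "1990-04-12"
    "grade": int,
}

def validate(record):
    """Check that a record has exactly the schema's fields, each of the right type."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"unexpected or missing fields: {set(record) ^ set(SCHEMA)}")
    for field, expected in SCHEMA.items():
        if not isinstance(record[field], expected):
            raise TypeError(f"{field} should be {expected.__name__}")
    return record

validate({"name": "Ada", "birthday": "1990-04-12", "grade": 5})  # passes
```

Because the schema lives in one code file, it doubles as documentation of what is stored, which is exactly the “treasure map” benefit described above.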
Data Access and Querying
Apart from the structure of your data is the issue of finding a particular datum of interest, and this is what querying is for. For example, you may want to find all the birthdays this week if you are printing out a weekly schedule for a school class.
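The birthday example could be expressed as a query like the following, shown here with Python's built-in SQLite driver; the table and column names are invented for illustration.

```python
import sqlite3

# A sketch of the "birthdays this week" query, using SQLite's strftime
# to match month-day regardless of birth year.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, birthday TEXT)")  # ISO dates
conn.executemany("INSERT INTO students VALUES (?, ?)",
                 [("Ada", "2015-03-14"), ("Ben", "2015-09-02")])

# Build the set of month-day strings for the week of interest, e.g. March 10-16.
week = [f"03-{d:02d}" for d in range(10, 17)]
placeholders = ",".join("?" for _ in week)
rows = conn.execute(
    f"SELECT name FROM students WHERE strftime('%m-%d', birthday) IN ({placeholders})",
    week,
).fetchall()
print(rows)  # [('Ada',)]
```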
But whereas queries may be a semantic mechanism for filtering and selecting data, the primary technical issue is one of data access, and this is where SQL systems may or may not be a good choice for a given application.
SQL is premised on the idea that an administrator or script can execute arbitrary queries on the data at any time, such that this data can be retrieved in realtime. This realtime querying ability is the true magic of SQL, and while structuring data with schemas is one of the key benefits of SQL databases, realtime ad-hoc queries are the essential technical reason that SQL exists.
SQL tends to shine with “medium sized” datasets. This could range from hundreds of megabytes, where memory or file access might be resource intensive or cumbersome, to hundreds of gigabytes, up to the point that maintaining indexes on your data becomes too expensive, and having a single point of access to your data may not be preferred.
For a good design and performance balance, SQL remains a very attractive technology today. It has a good deal of design flexibility, while maintaining high performance under a variety of usage patterns. My experience with very large SQL databases is limited, so some of this is gleaned second hand. But I would contend that the key technical benefit of SQL, the ad-hoc dynamically indexed query access, is no longer so universally important. What is the principal alternative to ad-hoc queries?
Batched Data Processing and Analysis
Batched data processing is an attractive alternative to ad-hoc queries for many applications. Basically, with batched data processing you process all the information you want from your data at one time, at regular intervals, instead of in realtime.
Batched data processing has two significant benefits: flexible tooling options, and much more forgiving performance constraints. Batched data processing might involve something like exporting your data to CSV files and then using a numpy script to analyze statistics on that data. Because you do not require realtime access, data can be transformed between formats, moved across systems, filtered through privacy walls, and much more. You can use any desired programming language to run a script over that data and analyze it. This makes it attractive to technical people well outside the typical web programmer or IT manager. Batched data processing makes automated data analysis more accessible to engineers, statisticians, researchers, scientists, and more.
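The export-then-analyze pattern could be as simple as the following sketch: data is dumped to CSV on a schedule, and any script, in any language, processes the file later. The field names and data here are invented; an in-memory buffer stands in for the exported file.

```python
import csv
import io
import statistics

# The "export" step: dump records to CSV (a StringIO stands in for a file on disk).
export = io.StringIO()
writer = csv.DictWriter(export, fieldnames=["user", "minutes"])
writer.writeheader()
for row in [{"user": "a", "minutes": 30}, {"user": "b", "minutes": 50}]:
    writer.writerow(row)

# The "batch job" step: read the whole export and compute summary statistics.
export.seek(0)
minutes = [int(r["minutes"]) for r in csv.DictReader(export)]
print(statistics.mean(minutes))  # 40
```

Note that the analysis script needs no database driver at all, which is what makes this pattern accessible to statisticians and researchers outside the web-programming world.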
In addition, batched data performance is judged on the ability to process your entire dataset, and it typically involves much more relaxed time constraints. If you process data on a daily basis, taking an hour or two may not be a significant concern, whereas with a realtime query, that may involve a user or administrator waiting patiently. Because all the data is processed at once, you only need to consider the total amortized time and resource usage of your data processing. With realtime queries one slow query may be obstructive, but for batched processing, the average time across your entire process is typically more important, meaning that individually slow queries are less of a concern.
The significant drawback of batched data processing is that it does not reflect the most recent state of the database; where the latest data is required, you may need to fall back on realtime database access. This is often an acceptable tradeoff.
Historical Changes Have Made SQL Alternatives Attractive, Despite Often Ad-Hoc Designs
In addition to its technical performance characteristics, SQL can be compelling because there is so much theory on relational database design, and it is something nearly everyone with database experience is familiar with.
For reasons of flexibility, NoSQL databases often do not have any system for structuring your database. Many are flat key-value or document stores, and do not have tables or schemas. I do not believe this is due to implementation difficulty, it would be easy to add structured data and schemas to these systems, but rather for the benefit of programmer flexibility. These databases are primarily created for their technical performance characteristics, often differing significantly from the SQL databases in that regard. By using a simple interface like key-value storage, this makes the database useable to the broadest set of programmers without imposing any design patterns or structure on them. Schemas and structure can always be implemented on top of a key-value store, so that is often left out as an explicit part of these databases.
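The point about layering schemas on top of a key-value store can be sketched as follows; a plain dictionary stands in for the database, and the field names are illustrative.

```python
import json

# The store itself only sees opaque strings; structure is enforced by the
# application layer before anything is written.
store = {}  # stand-in for any key-value database

USER_FIELDS = {"name", "email"}

def put_user(user_id, data):
    if set(data) != USER_FIELDS:
        raise ValueError("record does not match the user schema")
    store[f"user:{user_id}"] = json.dumps(data)  # store stays schema-unaware

def get_user(user_id):
    return json.loads(store[f"user:{user_id}"])

put_user(1, {"name": "Ada", "email": "ada@example.com"})
print(get_user(1)["name"])  # Ada
```

The key-value store never learns what a “user” is; the schema lives entirely in application code, which is why database vendors can leave it out without losing anything.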
Historically, SQL’s technical performance characteristics were appealing to a much wider array of data applications. This is for two primary reasons: use cases and resource limitations.
Earlier computers had much less memory in all forms, so it was much more likely that an application would need the performance of a database. Computers were expensive and less ubiquitous, so batch processing, which could occupy systems for a long time, might have been less appealing. Today it is easy to transfer the data to a second machine and have it run the batched processing. So while the “mid-sized” database was almost universal in database applications for decades, that ground has been slowly eaten away from above and below, so that the performance characteristics of SQL databases are no longer necessary for most applications. Many organizations have datasets so large they need custom-built or massively parallel systems, which require much more consideration than a typical MySQL database offers.
So for applications large and small, SQL is no longer the automatic choice it once was.
SQL’s Query Language Is Awkward and Dated
When SQL was first invented in the 1970s, programming practice was very different than it is today. It was expected that most users would be interacting with a text terminal through textual commands. This was the era of FORTRAN and COBOL, of BASIC and FORTH. And in many ways, the SQL language still reflects that era.
Today, SQL is often used behind an object relational mapper. My theory is that organizational momentum has led to SQL needing to be very stable over time, with uniform design patterns preferred over language innovations. While modern SQL databases have been heavily optimized under the hood, the language itself has changed minimally over the decades. This works for SQL because it is a simple language, but it has not been without problems or limitations. Consider SQL injection attacks: while they can be avoided by properly using cursors and parameterized queries, the simplicity of SQL meant that queries were traditionally constructed on the fly, without data type specifications on the parts of the query, simply by concatenating strings. Programs that “modify themselves” are a highly discouraged programming practice because of their unpredictability, and the same can be said of this kind of ad-hoc, second-level embedded scripting facility.
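The injection problem can be demonstrated in a few lines. In the sketch below (table and data invented), the same malicious input either rewrites the query's logic or is treated as inert data, depending solely on how the query is built.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

# Dangerous: building the query by string concatenation. The input becomes
# part of the SQL program itself, so the WHERE clause is rewritten.
evil = "x' OR '1'='1"
unsafe = conn.execute(
    "SELECT name FROM users WHERE name = '" + evil + "'"
).fetchall()
print(unsafe)  # [('alice',)] - every row matches

# Safe: a parameterized query keeps the input as data, never as code.
safe = conn.execute("SELECT name FROM users WHERE name = ?", (evil,)).fetchall()
print(safe)  # [] - no user is literally named "x' OR '1'='1"
```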
Given that SQL directly accesses the most important part of the system, your stored data, this is highly concerning.
Again, batched data processing, in addition to explicitly indexed fields, offers a much more modern and simplified approach to database design.
But many times, SQL can still be attractive because it is often the perfect compromise.
Alternatives to a Database
One question we may want to ask is whether a database is in fact necessary at all. Why not load your data into memory, and store it in flat files like CSV or JSON?
For many uses, this may not be a bad idea, but the extra resource usage may be undesirable. If you are completely rewriting your database on disk every time it is saved, even if you do this intelligently on intervals, it might be too resource intensive. It can also make it difficult or impossible to scale your application.
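The flat-file approach, and its rewrite cost, can be sketched as follows: the whole dataset lives in memory, and every save rewrites the complete file, which is exactly the expense that grows with the data.

```python
import json
import os
import tempfile

# A toy flat-file "database": one JSON file holding everything.
path = os.path.join(tempfile.mkdtemp(), "data.json")

def save(data):
    with open(path, "w") as f:
        json.dump(data, f)  # rewrites the complete dataset every time

def load():
    with open(path) as f:
        return json.load(f)

db = {"users": [{"name": "Ada"}]}
save(db)
db["users"].append({"name": "Ben"})
save(db)  # the entire file is rewritten for one appended record
print(len(load()["users"]))  # 2
```

For a few megabytes this is perfectly fine; the trouble starts when one appended record forces gigabytes to be rewritten, or when two processes need to save at once.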
Object relational mappers are a common way to use SQL databases these days, their primary benefit being more regular handling of database migrations. Even if these migrations are relatively small, such as adding new fields or changing the name of a data field, if you have many of these changes at once it can be difficult to script the migrations, whether you are using SQL or not.
Object relational mappers are one of the rare technologies that seem to be easier to use in practice than to understand in theory, which is why I am often frustrated by them. Relational data can be complex, and understanding how relational aspects of data like foreign keys, joins, and other constraints are converted to and from the object representation can be difficult to wrap one’s head around. But often, if you just try to use these ORMs, the worrying seems unjustified and they simply work as intended.
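The core of what an ORM does, stripped of joins, foreign keys, and migrations, is just the mapping between objects and rows. A toy version of that step (class and table names invented) might look like:

```python
import sqlite3
from dataclasses import astuple, dataclass

# A dataclass plays the role of the ORM's model class.
@dataclass
class Student:
    name: str
    grade: int

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, grade INTEGER)")

def save(s: Student):
    # Object -> row: the dataclass fields become column values.
    conn.execute("INSERT INTO students VALUES (?, ?)", astuple(s))

def load_all():
    # Row -> object: each fetched tuple is rebuilt into a Student.
    return [Student(*row) for row in conn.execute("SELECT name, grade FROM students")]

save(Student("Ada", 5))
print(load_all())  # [Student(name='Ada', grade=5)]
```

Real ORMs layer relationship handling, lazy loading, and migration tooling on top of this mapping, which is where the conceptual difficulty described above comes in.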
While this may be nerve wracking, it is often the reality of programming practice. Programmers may have limited tenure at a job or with a system, and frequently get thrown into the middle of a script and are expected to play their part. Happy Coding!