Databases – Brock at Cliff Cottage

Murder victim Mahsa Amini (2000-09-20 – 2022-09-16)

In a recent weblog post, Classification, I wrote about Geoscheme, my approach to dividing up the geography of the world. Eleven days before this, 2022-09-16, Mahsa Amini (2000-2022), died in a hospital in Tehran, Iran, with a broken skull after being arrested by the Guidance Patrol = the Iranian government’s religious morality police, for not wearing a hijab. A photograph of Amini illustrates this weblog post.

Since then, Iranian women and men, girls and boys have protested daily. They have been joined by others throughout the world. Writer, comedian, and former president of Humanist UK, Shaparak Khorsandi (1971 – ), who fled Iran to Britain with her family following the 1979 revolution, wrote: The Iranian regime kills women for trying to live freely. This is not just Iran’s problem, it is the world’s problem. Do not look away. This denial of basic human rights is an affront to human dignity. Mahsa Amini cannot speak up any more. The world should act in solidarity and amplify her voice and the voices of all Iranian women who dare to speak up for choice and democracy.

On 2022-11-30, I decided that I could amplify Amini’s voice by using some of my working hours on a personal Women, Life, Freedom project to: 1) transform Geoscheme into a database, and 2) use that database as an example to teach the basics of database development. The example will use an open-source database, and the teaching materials will be freely available under a copy-left licence. A third step has yet to be worked out, which involves the translation of the materials into Persian, with the help of a local Iranian refugee, preferably one with little or no understanding of the workings of databases.

Later, while I was wondering what else could be built into a database as dinner approached, my housekeeper, Trish, was telling me that she was re-organizing her recipes. It then struck me, that many Iranian women probably knew a lot about cooking, and that combining this existing knowledge with databases, could help improve an understanding of them.

Since food is generally made from ingredients, such a database could also be expanded to include inventories for raw materials, goods in process, and finished products. It also could allow some accounting basics to be explored. In other words, this example offered greater scope.

Another point that could be made is that there is an opportunity for small-scale software-as-a-service (sssaas). No one says a cloud has to be huge and multi-national. In a previous post, titled Clouds & Puddles, I used the Xample family with mother Ada, father Bob, cat Cat and daughter Deb, as an example to explain cloud-based computing. I am sure more Persian sounding names could be found.

Geoscheme data will be discussed in another post, to be published 2022-12-11. The remainder of this post will discuss database choices. It provides a short and incomplete history of the evolution of relational databases.

Rather than using the overworked term, cloud, to describe sssaas, I propose to call it Mist computing. This topic will be explored in yet another post to be published 2022-12-18.

My own interest in databases has nothing to do with either Geoscheme or cooking, but digital asset management (DAM), a system to store and access different types of content, such as texts, photos, videos, audio (including music), so that they can be edited and otherwise combined to produce new content. While such a framework is most often called a DAM, at other times it is referred to as a content management system.

Data structuring

In the beginning there was a bit, the most basic unit of information that expresses either 0 or 1. It first appeared in Claude Shannon’s (1916 – 2001) paper A Mathematical Theory of Communication (1948), but he attributed its origin to John Tukey (1915 – 2000), who in a Bell Labs memo of 1947-01-09 contracted binary information digit to bit. Eight bits form a byte, which can take one of 264 values. The term byte was coined by Werner Buchholz (1922 – 2019) in 1956-06.

The total amount of data created, captured, copied and consumed globally was about 64 zettabytes in 2020. By 2025, this could increase to 180 zettabytes. 1 zetabyte = 10²¹ = 1 000 000 000 000 000 000 000 bytes = 1 sextillion bytes.

Data storage has always been problematic. In 1951 magnetic disks as well as magnetic tape was developed for data storage. To begin with, there were flat files, often they came with programming languages, with data stored within the program. Then, in the 1960s, hierarchical and network database models appeared and flourished. These early database implementations were imperfect. Some would call them messy, overly large and difficult to use.

Structuring data begins with Edgar Frank “Ted” Codd (1923 – 2003) and his A Relational Model of Data for Large Shared Data Banks (1970). Codd’s model structures data in tables, with columns and rows, where one column holds unique keys. models were unable to use this model, as computer hardware only became powerful enough to deploy it in the mid-1980s. Within a few years (early 1990s), relational databases dominated, and continue to dominate in 2021.

Codd worked for IBM, but it was a mismatch. IBM failed to understand the importance of Codd’s work, and the need for data storage to have a theoretical basis, that ultimately resulted in a need for relational databases. Codd’s twelve rules, from 0 to 12 making 13 altogether, are important for anyone wanting to understand databases. The most important concept is that relational databases store information without unnecessary redundancy.

Initially, tables formed a basic unit for structuring data. These need storage space for data, and programming capabilities to create (insert), modify (update) and remove (delete) content.

Data is useless unless it can be retrieved = accessed in a form that is directly usable or can be further processing by other applications, that make it usable. The retrieved data may be made available in a form basically the same as it is stored in the database or in a new form obtained by altering or combining existing data from the database.

Administrative tasks include registering and monitoring users, enforcing data security, monitoring performance, maintaining data integrity, dealing with concurrency control, and recovering information that has been corrupted by some event such as an unexpected system failure.

The Alpha language was defined in Codd’s A Data Base Sublanguage Founded on the Relational Calculus (1971). It was never implemented, but influenced the design of QUEL, that was implemented by Michael Stonebraker (1943 – ) in 1976. This was part of the University of California, Berkeley project Ingres (Interactive Graphics and Retrieval System), used to demonstrate a practical and efficient implementation of the relational model. QUEL was supplanted by Structured Query Language (SQL, pronounced sequel, by some), based on Codd’s relational algebra defined in Relational Completeness of Data Base Sublanguages (1970).

SQL was (and is) available in several different flavours. It can be housed in the cloud (someone else’s server), one’s own server, a desktop machine, or a laptop. Handheld devices, such as smartphones, are less suitable.

SQL was implemented by Donald Chamberlin (1944 – ) and Raymond Boyce (1946 – 1974), based on an earlier language SQUARE (Specifying Queries As Relational Expressions).

By 1980, relational databases had became mature enough to enter the world of business. A number of startups such as Britton Lee, Relational Technology and Sybase were founded by people actively involved in the Ingres project at Berkeley. These were often implemented on cheap mini-computers, such as VAX machines made by Digital Equipment, rather than on larger, more expensive mainframes, such as System/ 370 machines made by IBM.

Postgres, was a successor project to Ingres, at Berkeley. Its goals were to make evolutionary improvements, including support for more complex data types, and improved performance. Most people exposed to Postgres, praised its extensible nature, allowing it to add new features, as required. This allowed databases to be used in new areas, such as support for geographical data as in geographical information systems (GIS) using a geolocation engine, and/ or time-series. Currently Timescale DB, launched in 2018 as an open-source product, offers the best possibilities for extending PostgreSQL in this area.

In the 1990s, the project drifted closer to SQL, so that by 1996 it was renamed PostgreSQL to reflect this support for SQL. There are many database practitioners who regard PostgreSQL as the standard database system to use, unless something dictates otherwise.

For me, no discussion of databases is complete without a mention of the closed-source Microsoft Office and its Access relational database system. It was promoted as a database for everyone, that provided a graphical user interface and the Access Connectivity Engine (ACE) that I originally knew as Joint Engine Technology (JET). The Norwegian Department of Education, in its wisdom, decided that computer science at the senior secondary school level would be taught with the help of this software. This approach to teaching, was a mistake.

Fortunately, not all Norwegian school programs were using Windows software. The Media and Communications program, that I transitioned to, used Apple equipment, notably iMac G4 machines with flatscreens, and software that included ProTools, to create a digital audio workstation, along with Adobe Creative Suite, and Claris FileMaker database system.

FileMaker is superior, and Access inferior everywhere one looks. FileMaker had more extensive capabilities but, more importantly, it just works. My experience with Access was that it was impossible to make coding changes with it. It was more time-efficient to just scrap an existing database, and to write a new one.

Thus, when it comes to other databases designed for everybody, I take this experience with me. Base, the LibreOffice/ OpenOffice equivalent of Access, performs equally badly. One reviewer noted: It [Base] doesn’t support Insert, Delete, and Update statements [all fundamental to any database] through its query designer, and the SQL tool [is not standard, because it] doesn’t parse [= analyse symbols that are written in] ANSI [American National Standards Institute] compliant SQL. There are no easy solutions.

The next database problem child is MySQL, originally developed by MySQL AB, a Swedish company founded in 1995 by Michael Widenius (1962 – ) who programmed the database system – and named developed products after his children: My, Max and Maria; David Axmark (1962 – ); and Allan Larsson (? – ). It offered dual licence database software. A Gnu General Public Licence (GPL) version was offered free of charge, but the database system was also sold under other licences. This closed-source version included additional components that improved the operation of the database. In 2008, MySQL AB was acquired by Sun Microsystems. In 2010, Sun Microsystems was acquired by Oracle Corporation.

MariaDB is a forked = branched version of MySQL, also developed by the same Michael Widenius, starting in 2009. MariaDB maintains high compatibility with MySQL, allowing it to function as a drop-in replacement for MySQL. However, new features diverged. MariaDB includes new storage engines. While MariaDB was originally entirely open source and non-commercial, a merger with SkySQL in 2013, resulted in a more commercial profile, especially the emergence of software-as-a-service (SAAS) activities on Google Cloud Platform.

Avoiding redundancy is not just a matter of saving space. It is a means of avoiding data conflicts. If, for example, a person’s address is only saved in one place, if that address changes, that change only has to be recorded once. It should be noted that redundancy has nothing to do with backup which, at a minimum should follow the 3-2-1 rule: 3) Create one primary backup and two copies of data. 2) Save backups on two different types of media. 1) Keep at least one backup file offsite.

Relational databases use an ACID test to monitor the suitability of a database. ACID involves atomicity, consistency, isolation and durability properties that guarantee data validity despite errors, power outages, and more. This has to operate at the transaction level. Other types of databases find this test impractical to implement, and accept data loss to a varying degree.

DB-Engines ranks the most popular databases. In 2022-12, the top five ranked databases were, in order: Oracle, MySQL, Microsoft SQL server, PostgreSQL and Mongo DB. Microsoft Access is #9, Maria DB is #13, FileMaker is #22. Base uses the FirebirdSQL DB-engine, which is ranked #32.

So far, one would get the impression that all modern databases are relational, and in some way related to SQL. This is not the case, and is exemplified by Mongo DB in the above list. It is a document database, often described as NoSQL. There are a large and increasing number of NoSQL databases, a 21st century term variously translated as not SQL, not only SQL and not a relational database.

Document databases typically store archived records, then use their database engines to extract metadata. It is a more complex arrangement, and is not the place to begin when teaching database techniques.

It is my intention to encourage everyone to use open-source databases, where these are appropriate. This includes, especially, oppressed women wanting software solutions in a country experiencing sanctions, such as Iran. Even profitable corporations, such as Google, have found it appropriate to develop and support open-source software.

The conclusion of this post is that I will begin to use PostgreSQL to implement a Geoscheme database, with the aim of making it a example to be used for teaching basic database management techniques. After that, everything is open.