8. Data – tukoblog

What is a program without data? and what is data?

It was noted at the beginning of this guide that machine code is all command and data. With this definition, data is almost everything in computing. On the other hand, data can be merely ‘what is stored in a database’. This first definition above is too far-reaching to be useful here and the second far too narrow.

To advance our enquiry, we can compare data to memory. Without memory, we humans would move from moment to moment entirely unable to remember anything that had just happened. Just so with a program. We run it, use it, shut it down, and then what? What if even a simple text editor had no memory? We would write a text, close the editor, lose our work.

This gives us a key to a definition of data that we can use here. We can agree that the document in both the running program and the saved file are in a sense the same document, but if we keep with the analogy of data as memory, we observe it has two states: active in a running program and static in a stored file. Although we can speak of ‘data’ in either state or indeed both, this chapter defines data purely in its stored form. Data here is the memory that persists after a program has been shut down.

The main purpose of this chapter is to explore beyond ‘active’ programming languages like C# and give a brief overview of the equally-important ‘static’ languages that deal with data, languages quite unlike C#. First, as databases are most obviously associated with with data, we turn to SQL (‘Structured Query Language’).

SQL is usually pronounced ‘sequel’ but ‘ess-cue-ell’ is also in use and won’t raise eyebrows.

SQL

Alternatives to SQL do exist nowadays (NOSQL for example, would you believe?), but for a long time SQL ruled the database roost.

SQL has a mathematical basis in relational algebra / calculus. For this reason, a SQL database is often called a relational database. To acquire a basic understanding of SQL, you have to imagine a grid of data. This grid is called a table. The grid consists of columns and rows and is where data is stored. Each item of data is contained in a row. It is the job of SQL to fetch data from the grid by querying its columns and rows.

Here a table definition in SQL :

CREATE TABLE Animal(
     /*
          A NULL column does not have to have a value,
          but a NOT NULL must have a value when a new
          row is inserted into the database.
     */
     Everyday_Name varchar(255) NOT NULL, -- string equivalent
     Scientific_Name varchar(255) NOT NULL,
     Is_Predatory char(1) NOT NULL, -- bool equivalent
     Number_Of_Legs int NOT NULL
)
GO -- run the SQL

This creates a table and its grid columns. The table is like a class without any methods. There is no active data in SQL and there are no methods.

There are no methods in SQL, but there are ‘stored procedures’, which are functions that run SQL commands on a SQL server.

The rows in the grid are the data and they are retrieved by a SQL query :

-- the asterisk means 'all columns'
SELECT * FROM Animal

This will select all the rows in the ‘Animal’ table. A query can, however, select only particular rows and columns :

SELECT Everyday_Name -- specify columns here
FROM Animal
WHERE Everyday_Name = 'Lion' -- specify rows here

This will select the column ‘Everyday_Name’ and all the rows from the ‘Animal’ table where the row has the exact value of ‘Lion’ for the ‘Everyday_Name’ column.

Now take another look at the table definition :

Everyday_Name varchar(255) NOT NULL,
Scientific_Name varchar(255) NOT NULL,
Is_Predatory char(1) NOT NULL,
Number_Of_Legs int NOT NULL

There are two issues here that usefully illustrate what SQL is and does.

First, the NOT NULL declarations. These say that all of the columns must have a value. This might be what is wanted. It is difficult to see why any of these should be left blank unless there was some data-input error. However, this is only an option if you have full control over the data. In many cases, a database imports its data from an external source. Here, making any but the ‘Everyday_Name’ column NOT NULL would be a very bad design because, with no control over the data, there can be no guarantee there will not be bad data. Only the key column, the everyday name, could be NOT NULL, for without the key value a record is meaningless. However, even then it might be better to allow bad data to be imported and then fixed rather than refused entry into the database.

The second issue lies with the key, the ‘everyday name’ column. If there were two records with the same name (for example ‘Bear’ was entered twice), the database would effectively be corrupted. This can be fixed :

Everyday_Name varchar(255) NOT NULL UNIQUE,
Scientific_Name varchar(255) NULL,
Is_Predatory char(1) NULL,
Number_Of_Legs int NULL

This would probably be acceptable here, as after all animal names are unique. But if you were storing data about your record collection, which is very extensive, there would likely be a problem :

Album_Title varchar(255) NOT NULL UNIQUE, -- this is not a good key!
Artist_Name varchar(255) NOT NULL

You look through the thousand or so items in your catalogue and notice you have ‘Greatest Hits’ by The Chancellors and also ‘Greatest Hits’ by The Shaggs. You simply cannot have a UNIQUE name here. Here, primary keys come running to your rescue :


Album_Id int NOT NULL PRIMARY KEY, -- now we have a proper key!
Album_Title varchar(255) NOT NULL,
Artist_Name varchar(255) NOT NULL

With a primary key in place, the database will generate a sequence of numbers automatically each time a new record is created. A primary key is therefore guaranteed to be unique and the duplicate album titles can now be added to your database without any problems.

But there is still a problem with the ‘Artist’ column. As you have every Bob Dylan record ever released and also a number of bootlegs, you will have a great many ‘Bob Dylan’ (database) records. Such a thing is not the relational database way of doing things. Instead, to be relationally-correct you must create a second ‘Artist’ table with its own primary key and then give the ‘Album’ table a foreign key.

-- 'Album' table
Album_Id int NOT NULL PRIMARY KEY,
Album_Title varchar(255) NOT NULL,
FOREIGN KEY (Artist_Id) REFERENCES Artist(Artist_Id)

-- 'Artist' table
Artist_Id int NOT NULL PRIMARY KEY,
Artist_Name varchar(255) NOT NULL

You can now fetch back data using a join in which the ‘Album’ and ‘Artist’ tables are ‘joined’ via the ‘Artist_Id’ key.

SELECT alb.Album_Title, art.Artist_Name
FROM Album alb 
INNER JOIN Artist art on alb.Artist_Id = art.Artist_Id

Now all your catalogue items are unique and there is only one Bob Dylan. As it should be.

SQL quickly gets very complex, but it always works within the simple concepts illustrated above, so I think that is just enough to enable us to understand what SQL is and does.

As far as deployment into ‘active’ languages such a C# goes, a longstanding issue with databases has been the disconnect between C# and SQL and how to fetch SQL data into a program. There are now sophisticated means of doing this but nevertheless, SQL is not a lightweight option for storing data. The database must be designed, configured and deployed and of course a SQL server needs to be installed and running. A database is a good choice for a large program, but the smaller a program is the less this is so and a text-file based form of storing data begins to look a better bet.

XML

XML stands for ‘Extensible Markup Language’ and it is, coincidentally, a markup language. What is a markup language? Markup structures plain text. In XML terms, markup is achieved using elements and tags. An element consists of a start tag and an end tag :

<!-- this is an element -->
<!-- <tag> is the opening tag -->
<!-- </tag> is the closing tag
<tag>DATA</tag>

A tag is enclosed within angular brackets. The name of the tag sits within the brackets. In the closing tag, the name is prefixed with a forward slash (/).

An XML equivalent of the SQL table we defined earlier might look like this :

<animals>
     <animal>
          <everyday_name>Tiger</everyday_name>
          <scientific_name>Panthera tigris</scientific_name>
          <is_predatory>Y</is_predatory>
          <number_of_legs>4</number_of_legs>
     </animal>
     <!-- other animals follow -->
     <!-- equivalent to SQL rows -->
</animals>

XML is usually stored in a text file (it can be stored in a database). XML file data can be read by an XML parser and most ‘active’ programming languages have one of these. XML is very easy to read in C#, for example.

It should be noted that there is a family of XML technologies. XSLT (‘Extensible Stylesheet Language Transformations’, if you will) transforms one XML document structure into another XML document structure. So,

<person>
     <name>Lord Lucan</name>
     <location>
          <place>Missing</place>
     </location>
<person>

might be transformed to the conciser :

<person name="Lord Lucan" whereabouts="Missing" />

<!--
    The values in the quotes are known as ATTRIBUTES.
    
    A lot of XML data can be stored as either tag
    content or in an attribute. It is a fine art
    to decide which is better. In this case, where
    concision is clearly a priority and the data
    is always going to be a short bit of text,
    attributes appear to be good to go.
-->

Another important member of the family is XQuery (featuring in the star role XPath). This allows you to ‘query’ (as per SQL) an XML document.

However, the basic fact is that ‘active’ languages like C# pretty much exclusively interact with XML itself (though their parsing technologies are likely to use XPath).

To conclude, then, as a simple text file XML is far easier to deploy than a SQL database and its data is easier to read into a program. However, it is clearly less suited for larger datasets.

JSON

JSON (‘JavaScript Object Notation’) is a subset of JavaScript, a C-like language that is used ubiquitously for web development. JSON is, basically, the data bits of JavaScript and a JSON file can be thought of one large flub of declarations.

Our data would look like this in JSON :

{
  "animal": {
    "everydayName": "Tiger", // a string value
    "scientificName": "Panthera tigris",
    "isPredatory": true, // a boolean value
    "numberOfLegs": 4 // an integer value
  }
}

/*
// for comparison purposes, here is the JSON
// written in 'proper' JavaScript
const animal = {
    "everydayName": "Tiger",
    "scientificName": "Panthera tigris",
    "isPredatory": true,
    "numberOfLegs": 4
  };
*/

JSON can be read into any programming language as serialised data. To achieve this in C#, you would create a class that matches the JSON data values, the target for the serialisation :

// C# class to match the JSON data
public class Animal
{
     public string everydayName { get; set; }
     public string scientificName{ get; set; }
     public bool isPredatory{ get; set; }
     public int numberOfLegs{ get; set; }
}

The JSON data can now be deserialised into the C# class. Assuming a JSON file containing a list (array) of animals, the code to deserialise is simple :

// the reader method reads the text file of JSON data
string jsonAnimalsData = MyAnimalsReader.ReadJson(jsonFilePath);
// the JSON data is fed into the deserialiser, which converts
// each item into an object of the Animal class and adds
// it to a Generic list of Animal objects
List<Animal> animals = JsonSerializer.Deserialize(jsonAnimalsData,
     typeof(List<Animal>)) as List<Animal>;
     
/*
     NOTES.
     
     1) typeof(). This is a special C# operator that takes in an
     object and returns its type. C# has a class called Type that
     expresses type information for an object and the Deserialize
     method expects an object of class Type as the second parameter.
     
     2) as. Classes can be 'cast' with the 'as' operator. The
     Deserialize method has a return type of 'object', which means
     it can return any class type. The return value therefore
     needs to be converted (which is what 'cast' means) into
     the type we want. The deserialiser, note, will create
     a list of animals. We must cast to this type or an error
     will be thrown.
*/

As for serialisation and deserialisation, this refers to the process of creating an object from its class definition (serialisation) or a class definition from the object (deserialisation). So long as language A (here JSON) and language B (here C#) have identical class definitions, an object can be passed from one language to the other.

JSON makes it programmatically very easy to store and read data stored in an object. The downside is that it is less easy for humans to read than XML.

YAML

Another way of storing data is YAML (‘Yet Another Markup Language’). Our data looks like this in YAML :

animal:
  - everydayName: Tiger
  - scientificName: Panthera tigris
  - isPredatory: Y
  - numberOfLegs: 4

YAML was intended as an easy-to-read format, but nowadays has a bad reputation for a complexity, making it highly error-prone and hard to maintain. However, if the YAML is kept simple it does I think keep its original promise. The YAML is easier to read and write than the JSON, less verbose than the XML and far simpler to deploy than a SQL database.

As with JSON, YAML can easily be serialised and deserialised in C#.

HTMLF

There are edge cases where the best option might be to roll your own data format. Here is an example from my personal website.

The site contains a set of overview texts that summarise their page content. To make maintaining the overview content easier, the text is stored in an external file that needs above all to be easily editable. Evaluating the existing options, XML is simultaneously overkill and fiddly to work with, JSON is not designed for editing data and YAML is quite unsuitable because yamelised stored HTML needs to be carefully indented making editing difficult. So, the HTMLF (HTML Fragment) format was born :

@begin/
Tiger
<p>Panthera tigris</p>
/end@
@begin/
Tortoise
<p>Testudinidae</p>
/end@

HTMLF simply stores markup fragments and their identifiers between @begin/ and /end@ markers. It is an example of a data format that is little, limited and local and it is easy to parse and use :

HtmlFragments fragments = new(htmlfPathString);
HtmlFragment fragment = fragments.Find("Tiger");
string tigerFragment = fragment.Html;

Summary

All computing and every computer program is a mix and match of action and data. Data can be defined in many ways. One way is to distinguish between its active state (when it is in a running program) and its passive state (when it is stored after the program is shut down). Without this world-outside-the-program, computers would have no conceivable use. Stored data is the long-term memory of a program and thus all programming is much-concerned with stored data.

Outside of C# and comparable ‘active’ languages, languages that deal with data — SQL, XML, JSON, YAML, dare I say HTMLF, etc — are equally part of any programmer’s remit.

In a nutshell, coding is as much about loading and storing data (in XML, say) as running data (in C#, say).