Handling data

Monday, 17 July 2017

Today we are going to discuss the creation of data and learn how to manipulate data structures.

We will learn some things about using pipes to redirect output and learn some commands for working with data.

Data

So, data?

What is data?

Rather, we should ask: "What are data?"

datum (plural data), n. - something given (from the past participle of the Latin verb dare, "to give").

Where does it come from? What do we use it for? What does it all mean?

The major question that we are going to be asking ourselves here is "How are we going to get data into and out of different formats?"

We will start with lists of similar data and then move to structured and ordered sets of lists (tables).

Eventually we will consider linked sets of data in the form of databases.

Raw data

"Raw" data is sort of an oxymoron. There is very little data available that is actually really raw in the sense that it has not been touched, manipulated, massaged, curated, or cleaned by some human intervention.

Remember, even data that is available on the web is not raw, it is text that we have marked up and structured in specific ways. However, web data can stand in as an analog for raw data.

The process through which we might gather data via the web is referred to as "scraping." A "scraper" is a program that reaches out into the web and grabs all of the text (including markup) available at a URL and saves it in some meaningfully structured way.

We're not going to dig into web-scraping too much, but I want you to be aware of how data can be gathered on the web.

One tool that can be used for web scraping is our friend, wget. We've used it to download remote files, but it can also be used to get whole websites and all of the data linked from them.

This can be useful for mirroring a website. It can also be useful in aggregating unstructured data so that it might be manipulated into structured data.
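For example, a minimal sketch of a mirroring command might look like this (the flags below are common choices rather than the only ones, and example.com is just a placeholder URL):

wget --mirror --convert-links --page-requisites --no-parent --wait=1 https://example.com/

Here --mirror turns on recursive downloading, --convert-links rewrites links so the local copy works offline, --page-requisites also grabs images and stylesheets, --no-parent keeps wget from wandering up the directory tree, and --wait=1 pauses between requests to be polite to the server.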

Structured data

One simple format for structured data is a table.

Rows in the table represent individual cases or instances of something.

Columns represent variables.

What is the difference?

In the data that we are going to create in class, our rows will represent individual people. The information contained in these rows will be given ("datum, a thing given") to us by every member of this class. The columns will represent a specifically defined aspect of data that we gather about every individual person.

We will start with making our own individual lists and then aggregate them.

The humble and mighty CSV

Lists

We'll start with a list of data.

Open a new file and name it with your GitHub user account and the extension .list.

Mine will be jdmar3.list.

Inside the file, I want you to give one-word or numerical answers to the following questions (as specified), in this order, each on its own line:

What is your GitHub username?
How tall are you (in centimeters)?
What time did you wake up this morning (in 24-hour/military time: e.g., 06:30)?
How many semesters do you have left in your degree program?
Approximately how far is your home city/town from UNC/Chapel Hill (in km)?

If any answer doesn't apply to you, type NA ("not applicable").

My file will look like this:

jdmar3
175.26
06:45
2
1129.3

Very simple.

Comma Separated Values (CSV)

Now that we have listed some information about ourselves, let's try to aggregate our data.

If we want to put all of our data together as it is, we will just end up with a super long list that is difficult for us to use in any meaningful way. If we take our list and flip it, so that we have a single line instead, we can then stack all of our data up together. We can separate the elements in the list with commas (or tabs, semicolons, pipe separators, or some other marker) and then we will have a row of what will become a Comma Separated Values file: structured data.

We can do this by hand, but that is boring.

Let's learn a command to do this:

paste -d, -s example.list

paste sequentially reads the lines from a file and then writes them out in the same sequence, separated by something (tabs, by default). In this case we are asking it to read every line in our file and write the lines out separated by commas (-d,). The -s tells paste to work serially, joining all of the lines of a single file into one line, instead of pasting corresponding lines from multiple files in parallel.

So the standard output (STDOUT) from the above command will be a single comma-separated row, with the fields in this order:

gh-username,height,wakeup,semesters-left,hometown-distance
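For my jdmar3.list above, that row looks like this:

jdmar3,175.26,06:45,2,1129.3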

Output redirection

To get this into a file, we will use one of several forms of output redirection.

Output redirection is simple. It merely allows the output of one command to be written into a file. We can use other programs on top of this to manipulate that output.

For example:

paste -d, -s example.list > example.csv

This will take the output from the first part of the command and overwrite the CSV file specified in the second part.

This command will append the output to the file instead of overwriting it:

paste -d, -s example.list >> example.csv
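To sketch where this is headed (the file name class-data.csv is just a placeholder for whatever we decide to call the aggregated file), we could write a header row once with > and then append each person's row with >>:

echo "gh-username,height,wakeup,semesters-left,hometown-distance" > class-data.csv
paste -d, -s jdmar3.list >> class-data.csv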

Pipes

A "pipe" is an operator that tells a program to take output from another program. You'll find it on your keyboard as SHIFT+.

Pipes translate the output of one program (STDOUT) into being input for another program (STDIN).

For example, if we wanted to count how many lines were in our csv file, we could run:

cat example.csv | wc -l
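Pipes can also be chained through more than two programs. As a rough example, assuming the heights end up in the second column of the aggregated class-data.csv sketched above, we could pull that column out and sort it numerically:

cut -d, -f2 class-data.csv | sort -n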

For next time

Tomorrow, we are going to work in groups to learn to create and aggregate data using scripts. In your groups, you will write a script that asks the above questions of the user and then appends their answers to a CSV file. This will be the basis of the next assignment, which will be a group assignment.
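As a starting point, a minimal sketch of such a script might look like the following (the prompts and the output file name class-data.csv are placeholders; your group will make its own choices):

#!/bin/bash
# Sketch: ask each question, then append one comma-separated row to the shared file.
read -p "GitHub username: " username
read -p "Height in cm: " height
read -p "Wake-up time (24-hour, e.g. 06:30): " wakeup
read -p "Semesters left in your degree program: " semesters
read -p "Distance from your home city/town to Chapel Hill in km: " distance
echo "$username,$height,$wakeup,$semesters,$distance" >> class-data.csv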

I would like you to review some commands for working with a CSV file, including how pipes work: Connelly, Brian. "Working with CSVs on the Command Line." bconnelly.net. Last modified September 23, 2013. http://bconnelly.net/working-with-csvs-on-the-command-line/.

I would also like you to watch the following video on working with CSV files. I think that it might be very helpful. Try watching it once and then following along a second time.


Video: Handling data - July 17, 2017