Text files with tabulated data

After the introduction, the data files we will deal with will always be text files whose data is separated by spaces or by another special character such as a comma. Each line of a data file contains one complete data set. For an example, have a look at the file
http://www.tbi.univie.ac.at/~ronny/Leere/sb1/urbanareas.tsv
which contains the populations of large urban areas around the world at different points in history.

In order to show only certain columns of data or to swap columns, e.g. to show only the names of the cities listed in the data file above, you can use the awk program. The program call looks slightly more complicated than the ones you have seen previously, but the general usage in our cases will mostly stay the same. Use the command

$ awk '{print $1}' urbanareas.tsv
to show the first column of data, i.e. the city names. Use
$ awk '{print $1,$3}' urbanareas.tsv
to print the city names followed by the country they are located in. As you can see, the number after each $ following print denotes the number of the column.
$ awk '{print NF}' urbanareas.tsv
will print, for each line, the number of columns (NF, the Number of Fields) in your data file.
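
By default awk splits each line at any run of spaces or tabs. If a data file is tab-separated and a field may itself contain spaces, the field separator can be set explicitly with the -F option. The following is only a small sketch: the sample file and its values are made up for illustration and are not taken from urbanareas.tsv.

```shell
# Hypothetical sample file with one tab-separated line (made-up values).
sample=$(mktemp)
printf 'Sample City\t100\tCountryX\n' > "$sample"

# Default splitting on whitespace: the first field ends at the first space.
default_first=$(awk '{print $1}' "$sample")
echo "$default_first"

# Splitting on tabs only: the first field is the complete name.
tab_first=$(awk -F'\t' '{print $1}' "$sample")
echo "$tab_first"

# Swapping columns works the same way: third column first, then the first.
swapped=$(awk -F'\t' '{print $3,$1}' "$sample")
echo "$swapped"

rm -f "$sample"
```

Whether you need -F depends on how the fields in your own file are separated, so it is worth checking a few lines of the file first.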

Find out the number of the column that contains the population of the cities in the year 1980 and print a list of city names, each followed by its population at that time, the country and the geographic position (latitude/longitude).
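
One possible way to find a column number is to let awk print every field of the first line together with its index. This is just a hint, not the solution: the sample line below is made up, and NR (the number of the line currently being processed) is used here only to stop after the first line.

```shell
# Hypothetical one-line sample file (made-up values).
sample=$(mktemp)
printf 'CityA 100 CountryX\n' > "$sample"

# For the first line only, print each field number followed by the field itself.
numbered=$(awk 'NR == 1 { for (i = 1; i <= NF; i++) print i, $i; exit }' "$sample")
echo "$numbered"

rm -f "$sample"
```

Applied to your own data file, the number printed in front of the value you are looking for is the column number to use after the $ sign.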

With such rather simple invocations of the awk program we have touched only the tip of the iceberg. awk is much more powerful, but going into further detail would exceed the time available (check the man pages of awk for more information if you like).

Nevertheless, one last example of awk usage will show how to print only a certain range of data sets from our data file. First, find out the number of lines in the data file using the wc command

$ wc -l datafile

Then, print the city name followed by the population in the year 1950 for the last twenty cities in our data file

$ awk 'BEGIN{i=1}{if(i>=x) print $1,$2; i++}' urbanareas.tsv

where x has to be replaced by the line number at which the block with the final 20 cities begins. In detail, the statement above tells awk to set a variable i to the value 1 at the beginning. Then, for each row of processed data, this variable is incremented (i++). Additionally, awk will only print data columns 1 and 2 if the variable i is greater than or equal to the specified value x. This behavior can also be achieved using the special variable NR, which represents the current row number. So a call of awk that does the same would look like this:

$ awk 'NR>=x {print $1,$2}' urbanareas.tsv

As you might guess, the awk command is very useful for slicing out blocks of data from a data file for further usage.
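
Instead of typing the start line into the awk program by hand, it can be computed from the output of wc and handed to awk with the -v option. The following is a sketch under assumptions: the sample file is generated with made-up rows, and the names sample, total, n and start are chosen only for this illustration.

```shell
# Hypothetical sample file with 8 made-up rows: "City1 10" ... "City8 80".
sample=$(mktemp)
awk 'BEGIN { for (i = 1; i <= 8; i++) print "City" i, i * 10 }' > "$sample"

# Reading from standard input makes wc print only the number, without the file name.
total=$(wc -l < "$sample")

# First line of the block containing the last n rows.
n=3
start=$((total - n + 1))

# -v assigns the shell value to the awk variable x before the program runs.
last=$(awk -v x="$start" 'NR >= x {print $1,$2}' "$sample")
echo "$last"

rm -f "$sample"
```

For the exercise above, n would be 20 and the sample file would be replaced by urbanareas.tsv.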

Two other very useful programs are head and tail. With them, you are able to print the first n lines (head) or the last n lines (tail) of a file. You can invoke these programs like this:

$ head -n 20 datafile
and
$ tail -n 20 datafile
For further options that these programs provide, check their manual pages or the help parameter again.
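
head and tail can also be chained with each other and with awk by using a pipe. The sketch below cuts a block out of the middle of a file and then prints one column of the last rows. The sample file contains made-up rows and the variable names are chosen only for this illustration.

```shell
# Hypothetical sample file with 10 made-up rows: "City1 10" ... "City10 100".
sample=$(mktemp)
awk 'BEGIN { for (i = 1; i <= 10; i++) print "City" i, i * 10 }' > "$sample"

# Lines 4 to 6: keep the first 6 lines, then the last 3 of those.
middle=$(head -n 6 "$sample" | tail -n 3)
echo "$middle"

# First column of the last 2 rows: tail selects the rows, awk the column.
names=$(tail -n 2 "$sample" | awk '{print $1}')
echo "$names"

rm -f "$sample"
```

The second pipeline gives the same kind of result as the NR-based awk call, without having to work out the start line first.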

This should be enough for a first, very basic introduction to dealing with text files under Linux. You should now be able to do a lot of 'magic' stuff with the data you will be provided with. ;)

Ronny Lorenz 2010-04-06