To show only certain columns of data or to swap columns, e.g. to list only the names of the cities in the data file above, you can use the awk program. The program call looks slightly more complicated than the ones you have seen so far, but the general usage in our cases will mostly stay the same. Use the command
$ awk '{print $1}' urbanareas.tsv
to show the first column of data, i.e. the city names. Use
$ awk '{print $1,$3}' urbanareas.tsv
to print the city names followed by the country they are located in. As you can see, the $x after print denotes the number of the column.
$ awk '{print NF}' urbanareas.tsv
will print, for each line, the number of columns (NF, Number of Fields) in your data file.
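Swapping columns is mentioned above but not shown, so here is a minimal sketch of it. The temporary file and its three rows are invented for illustration and are not the real urbanareas.tsv. By default awk splits fields on any whitespace, so a name that contains a space is counted as two fields. The -F option restricts splitting to tabs, and the OFS variable sets the separator used between printed fields.

```shell
# Hypothetical three-column sample (name, population, country); all values are invented.
sample=$(mktemp)
printf 'Alpha City\t100\tLandA\nBeta\t200\tLandB\nGamma\t300\tLandC\n' > "$sample"

# Split on tabs only and print the country first, then the city name (columns swapped).
swapped=$(awk -F'\t' 'BEGIN{OFS="\t"} {print $3,$1}' "$sample")
printf '%s\n' "$swapped"

# Field count of the first line: default whitespace splitting vs. tab splitting.
nf_default=$(awk 'NR==1 {print NF}' "$sample")
nf_tab=$(awk -F'\t' 'NR==1 {print NF}' "$sample")
printf 'default: %s fields, tab-separated: %s fields\n' "$nf_default" "$nf_tab"

rm -f "$sample"
```

If the field counts differ between the two calls on your own data file, some entries contain spaces and the -F'\t' form is probably the safer choice.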
Find out the number of the column that contains the population of the cities in the year 1980 and print a list of city names, each followed by its population at that time, the country and the geographic position (latitude/longitude).
With such rather simple executions of the awk program we have touched only the tip of the iceberg. awk is much more powerful, but going into further detail would exceed the time available (check the man pages of awk for more information if you like).
Nevertheless, the last example usage of awk shows how to print only a certain range of lines from our data file. First find out the number of lines in the data file using the wc command
$ wc -l datafile
Then print the city name followed by the population in the year 1950 for the last twenty cities in our data file
$ awk 'BEGIN{i=1}{if(i>=x) print $1,$2; i++}' urbanareas.tsv
where x has to be replaced by the line number at which the block with the final 20 cities begins.
In detail, the statement above tells awk to set a variable i to the value 1 in the beginning. Then, for each row of processed data, this variable is incremented (i++). Additionally, awk will only print the data columns 1 and 2 if the variable i is greater than or equal to the specified value x.
This behavior can also be achieved using the special variable NR, which represents the current row number. So a call of awk that does the same would look like this:
$ awk 'NR>=x {print $1,$2}' urbanareas.tsv
As you might guess, the awk command is very useful for slicing out blocks of data from a data file for further usage.
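The two steps above (counting the lines with wc, then filtering by row number) can be combined so that x does not have to be worked out by hand. The following is only a sketch: the data file is a generated stand-in with 50 invented lines, and the variable names total, x and last20 are chosen for illustration. The -v option of awk passes a shell value into the awk program.

```shell
# Stand-in data file with 50 invented lines of the form "city<N> <N>".
datafile=$(mktemp)
seq 1 50 | awk '{print "city" $1, $1}' > "$datafile"

# Count the lines; reading from standard input makes wc print the number only.
total=$(wc -l < "$datafile")

# Line number at which the final block of 20 lines begins.
x=$((total - 19))

# Hand x to awk with -v and print columns 1 and 2 from line x onwards.
last20=$(awk -v x="$x" 'NR>=x {print $1,$2}' "$datafile")
printf '%s\n' "$last20"

rm -f "$datafile"
```

For a file with fewer than 20 lines x becomes zero or negative, so the condition holds for every row and the whole file is printed.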
Two other very useful programs are head and tail. With them, you are able to print the first n lines (head) or the last n lines (tail) of a file. You can invoke these programs like this:
$ head -n 20 datafile
and
$ tail -n 20 datafile
For further options these programs provide, check the manual pages or the help parameter again.
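head and tail can also be chained with a pipe to cut a block out of the middle of a file. The sketch below assumes a generated stand-in file of 30 numbered lines (not the real data file) and extracts lines 11 to 15 in two ways: once with head and tail, once with an awk condition on NR.

```shell
# Stand-in file: 30 lines containing the numbers 1 to 30.
datafile=$(mktemp)
seq 1 30 > "$datafile"

# Keep the first 15 lines, then the last 5 of those, i.e. lines 11 to 15.
middle=$(head -n 15 "$datafile" | tail -n 5)
printf '%s\n' "$middle"

# The same range selected by row number; an awk condition without an action prints the line.
same=$(awk 'NR>=11 && NR<=15' "$datafile")

rm -f "$datafile"
```

Both variables hold the same five lines, so which form to use is mostly a matter of taste; the awk form makes it easy to pick columns at the same time.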
This should be enough for a first, very basic introduction to dealing with text files in the Linux OS. You should now be able to do a lot of 'magic' stuff with the data you will be provided with. ;)
Ronny Lorenz 2010-04-06