awk


Assigning auto-increment IDs to empty fields in a KML/XML file

Recently we were processing some KML files using OpenLayers and at some point we realised that some placemarks were not appearing on the map. After inspecting the debug console and the files more carefully, we understood that OpenLayers did not like empty placemark IDs.

To mitigate the problem we wrote the following AWK script, which goes over all lines in the KML/XML file, finds the empty id fields (id="") and assigns each of them an auto-incremented value. A note here: initially we just replaced all empty IDs with the same value, but it turns out that OpenLayers does not handle conflicting IDs kindly either, so we had to go with an auto-increment solution.

# Assigning auto-increment IDs to the placemarks as OpenLayers does not show elements with conflicting IDs.
awk -i inplace '{
  for(x=1;x<=NF;x++) {
    if($x~/id=""/) {
      sub(/id=""/,"id=\"" (++i) "\"")
    }
  }
}1' "$output_path/$file_name";
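
As a quick illustration, the hypothetical placemark elements below (they are sample input only, not taken from a real file) get sequential IDs after passing through the same substitution:

printf '<Placemark id=""></Placemark>\n<Placemark id=""></Placemark>\n' |
  awk '{ for(x=1;x<=NF;x++) if($x~/id=""/) sub(/id=""/,"id=\"" (++i) "\"") }1';

# Will produce:
# <Placemark id="1"></Placemark>
# <Placemark id="2"></Placemark>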

Side notes

In case you already have some IDs defined, you would have to make your code a bit more complex… You would first need to find all the IDs that are already filled in, and then you would have two options:

  • empty them and execute the above script
  • or register them and make sure the script does not create conflicting IDs, either by starting the variable i from a number greater than the biggest registered ID, or, to make it even harder, by filling in the gaps between the already registered IDs. A rough sketch of the first of these two variants is shown right after this list.
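
A minimal sketch of that first variant (starting i above the biggest registered ID), assuming GNU grep and sort are available and that all existing IDs are plain integers:

# Find the largest numeric ID already present in the file (0 if there is none).
max=$(grep -o 'id="[0-9]\+"' "$output_path/$file_name" | grep -o '[0-9]\+' | sort -n | tail -1);

# Same script as above, but the counter starts from the largest registered ID.
awk -i inplace -v i="${max:-0}" '{
  for(x=1;x<=NF;x++) {
    if($x~/id=""/) {
      sub(/id=""/,"id=\"" (++i) "\"")
    }
  }
}1' "$output_path/$file_name";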

Add a new line whenever the first column changes

Recently we were processing some results from an SQL query on the command line. We were grouping the results based on the first column of the query output and we wanted to add an empty line between each group of data.

Instead of messing with SQL specifics, we decided to use awk to finish the task, which resulted in the following code:

awk -F '|' -v column=1 'NR>1 && $column != previous { print "" } { previous = $column } { print $0 }'
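
For example, assuming the results come from an sqlite3 database (the database file, table and column names below are just hypothetical placeholders), the command would be used like this:

sqlite3 results.db 'SELECT grp, idx, val FROM samples ORDER BY grp, idx;' | awk -F '|' -v column=1 'NR>1 && $column != previous { print "" } { previous = $column } { print $0 }'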

Explanation:

  • -F fs or --field-separator fs: This option allowed us to choose the input field separator, in other words the character that should be used to split our text into columns. We used the value | because that character is the default column delimiter in the output of the sqlite3 command-line tool.
  • -v var=val or --assign var=val: We used the -v option to define a variable to be used later on in the script. The value 1 was assigned to the variable column before execution of the program began, and it was available even to the BEGIN rule of the AWK program. We did this to make the code a bit more modular; we could have just hardcoded the number instead.
  • NR>1 && $column != previous { print "" }: Here we defined a rule that checks two conditions. First, we make sure that we are not on the first line of the input by using NR>1 (and thus avoid creating an empty line as the first line of the output).
    Second, we check that the value of the column of interest has changed since the previous line. (We have not assigned previous yet; that happens in the next step.) When both conditions are true (we are not on the first row and the value of the column in the current row is different from the value of the column in the previous row), the action prints an empty line.
  • { previous = $column }: This part is executed on ALL lines (even the first one) no matter what the values are. $column uses the value of the column variable (the index of the column we are interested in) to look up the actual value that this column has on the current line. That value is then copied to the previous variable, which allows us to perform the check from the previous point once we move to the next line.
  • { print $0 }: Finally, this part is also executed on all lines and it instructs awk to print the input row whole and as is. This whole part could be replaced by a true value, such as 1. In awk, as you can see in this example, you define a series of rules. Each rule consists of a pattern to be matched and an action. Each pattern is evaluated against every input line, and whenever the pattern matches, the action is executed. You can omit either the pattern or the action of any rule. When the pattern is omitted, the action is executed on every line. When the action is omitted, awk executes { print $0 }. So, a true value on its own is interpreted as "on each line execute { print $0 }", which prints the whole row as is.
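
In other words, the same command can be shortened by replacing the final action with a bare 1, which behaves identically:

awk -F '|' -v column=1 'NR>1 && $column != previous { print "" } { previous = $column } 1'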

Example

1|1|0.0564904019731175
1|2|0.103176086258974
1|3|0.12910406904073
1|4|0.188592489201024
1|5|0.169676224898487
1|6|0.164690820027741
1|7|0.128458728519047
1|8|0.18549773544014
1|9|0.155677575617836
1|10|0.153941343314285
2|1|0.217221158956016
2|2|0.23390973064067
2|3|0.180231657220626
2|4|0.257673927303071
2|5|0.261393785194329
2|6|0.273441488895552
2|7|0.242815632929545
2|8|0.262269697286057
2|9|0.256054399760891
2|10|0.262613705138411
3|1|0.378589461360716
3|2|0.33008177312116
3|3|0.380973166776554
3|4|0.340431190160728
3|5|0.38189416214207
3|6|0.364842933594872
3|7|0.372958396398964
3|8|0.350010176652464
3|9|0.355815612501188
3|10|0.380553180349294

Will become

1|1|0.0564904019731175
1|2|0.103176086258974
1|3|0.12910406904073
1|4|0.188592489201024
1|5|0.169676224898487
1|6|0.164690820027741
1|7|0.128458728519047
1|8|0.18549773544014
1|9|0.155677575617836
1|10|0.153941343314285

2|1|0.217221158956016
2|2|0.23390973064067
2|3|0.180231657220626
2|4|0.257673927303071
2|5|0.261393785194329
2|6|0.273441488895552
2|7|0.242815632929545
2|8|0.262269697286057
2|9|0.256054399760891
2|10|0.262613705138411

3|1|0.378589461360716
3|2|0.33008177312116
3|3|0.380973166776554
3|4|0.340431190160728
3|5|0.38189416214207
3|6|0.364842933594872
3|7|0.372958396398964
3|8|0.350010176652464
3|9|0.355815612501188
3|10|0.380553180349294

Bash: Swap the characters in odd positions with the characters in even positions

The following awk script allowed us to swap every character placed in an odd-numbered position with its neighbouring even-numbered character.
In detail, it uses a for loop that advances two characters at a time and prints each pair in reverse order (it prints the second character first, then the first, then the fourth, then the third, and so on).


echo "123456789" | awk -vFS= '{for (i = 1; i <= NF; i+=2) {printf "%s%s", $(i+1), $i} printf "\n"}';

# Will produce 214365879

echo "1234567890" | awk -vFS= '{for (i = 1; i <= NF; i+=2) {printf "%s%s", $(i+1), $i} printf "\n"}';

# Will produce 2143658709

Please note that we set the built-in variable FS (the input field separator, which is a single space by default) to the empty string so that each character is treated as a separate field, and NF (the number of fields in the current input record) therefore counts the characters of the line.
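
A quick way to see the effect of the empty FS (setting FS to the null string is a gawk extension, so this assumes gawk):

echo "abc" | awk -vFS= '{print NF, $1, $2, $3}';

# Will produce 3 a b c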



Bash: Print time stamp in front of every line in a pipe

Recently, we received a binary that collected data from a web service and printed them on the screen.
The binary did not print a time stamp in front of each line, so we had to improvise a way to add the time stamp to the logs without modifying the binary.

The solution we came to was to use awk to prepend the time stamp in front of every line using a pipe.
Specifically, our solution was the following:


server_application 2>&1 | awk '{ print strftime("%Y-%m-%d %H:%M:%S"), $0; fflush(); }'

What we did there was to start our binary server_application and redirect stderr to stdout (using 2>&1) so that we would have only one stream; then we read the lines one by one using awk and printed the time stamp right before each line ($0) using strftime.
The strftime() function formats the current time according to the supplied format specification.
fflush() forces awk to write out any buffered output. We call it on each line to make sure that buffering does not add any additional delay in presenting the data.

Example


$ echo -e "hi\nHI" 2>&1 | awk '{ print strftime("%Y-%m-%d %H:%M:%S"), $0; fflush(); }'
2017-06-21 20:33:41 hi
2017-06-21 20:33:41 HI
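
If you also want to keep the time-stamped output in a file (server.log below is just a placeholder name), the same pipe can be extended with tee:

server_application 2>&1 | awk '{ print strftime("%Y-%m-%d %H:%M:%S"), $0; fflush(); }' | tee -a server.log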