sort


Analyzing Data with Python: Counting how many times each value in a CSV occurs

Data analysis is an essential aspect of many fields, from business and research to sports and education. Python, with its versatile libraries, is a popular choice for data analysis tasks. In this blog post, we’ll explore a Python script that reads data from a CSV file and counts the occurrences of each value in 3 columns. The script can be a valuable tool for gaining insights into educational information datasets.

#!/bin/python

import csv

# Create empty dictionaries
d_university = dict()
d_country = dict()
d_region = dict()

with open('XtremeScores.csv') as f:
    reader = csv.reader(f, delimiter=',', quoting=csv.QUOTE_NONE)
    # Loop through each line of the file
    for line in reader:
        word = line[6]
        # Check if the word is already in dictionary
        if word in d_region:
            # Increment count of word by 1
            d_region[word] = d_region[word] + 1
        else:
            # Add the word to dictionary with count 1
            d_region[word] = 1

        word = line[5]
        # Check if the word is already in dictionary
        if word in d_country:
            # Increment count of word by 1
            d_country[word] = d_country[word] + 1
        else:
            # Add the word to dictionary with count 1
            d_country[word] = 1

        word = line[4]
        # Check if the word is already in dictionary
        if word in d_university:
            # Increment count of word by 1
            d_university[word] = d_university[word] + 1
        else:
            # Add the word to dictionary with count 1
            d_university[word] = 1

sorted_university = sorted(d_university.items(), key=lambda x:x[1], reverse=True)
print(sorted_university[:10])
sorted_country = sorted(d_country.items(), key=lambda x:x[1], reverse=True)
print(sorted_country[:10])
sorted_region = sorted(d_region.items(), key=lambda x:x[1], reverse=True)
print(sorted_region[:10])

Let’s break down the code step by step:

#!/bin/python

import csv

The script starts by importing the csv module, which is essential for handling comma-separated value (CSV) files.

# Create empty dictionaries
d_university = dict()
d_country = dict()
d_region = dict()

Three empty dictionaries, d_university, d_country, and d_region, are created. These dictionaries will be used to store the counts of universities, countries, and regions, respectively.

with open('XtremeScores.csv') as f:
    reader = csv.reader(f, delimiter=',', quoting=csv.QUOTE_NONE)

The script opens a CSV file named ‘XtremeScores.csv’ using a with statement. The csv.reader object is used to read the contents of the file. We specify that the delimiter is a comma (,), and we don’t want to perform any special quoting.

    # Loop through each line of the file
    for line in reader:

A for loop iterates through each line in the CSV file. The variable line contains the data for each row in the file.

        word = line[6]

The code extracts the word at index 6 (zero-based index) from the current line. In this context, it typically represents a region.

        # Check if the word is already in the dictionary
        if word in d_region:
            # Increment count of word by 1
            d_region[word] = d_region[word] + 1
        else:
            # Add the word to the dictionary with count 1
            d_region[word] = 1

The code checks if the extracted word (representing a region) is already present in the d_region dictionary. If it is, it increments the count by 1. If not, it adds the word to the dictionary with a count of 1. This process counts the occurrences of each region in the dataset.

The code repeats the same process for words representing countries (at index 5) and universities (at index 4), using the d_country and d_university dictionaries, respectively.

sorted_university = sorted(d_university.items(), key=lambda x:x[1], reverse=True)
print(sorted_university[:10])

After counting the universities, the code sorts them in descending order of frequency and prints the top 10 universities based on their occurrence.

sorted_country = sorted(d_country.items(), key=lambda x:x[1], reverse=True)
print(sorted_country[:10])

Similarly, it does the same for countries and prints the top 10 countries.

sorted_region = sorted(d_region.items(), key=lambda x:x[1], reverse=True)
print(sorted_region[:10])

Finally, it sorts and prints the top 10 regions based on their occurrence.

In summary, this Python script provides a simple but effective way to analyze data in a CSV file, specifically counting universities, countries, and regions. It leverages dictionaries to maintain counts and uses the csv module for reading data from the file. This can be a useful starting point for more advanced data analysis tasks and visualization in Python.


Bash: Extract data from files both filtering filename, the path and doing internal processing

The following code will find all files that match the pattern 2016_*_*.log (all the log files for the year 2016).

To avoid finding log files from other services than the Web API service, we filter only the files that their path contains the folder webapi. Specifically, we used "/ServerLogs/*/webapi/*" with the following command to match all files that are under the folder /ServerLogs/ and somewhere in the path there is another folder named webapi, we do that to match files that are like /ServerLogs/Production/01/webapi/* only. The way we coded our regular expression, it will not match if there is a folder called webapi directly under the /ServerLogs/ (e.g. /ServerLogs/webapi/*).

For each result, we execute an awk script that will split the lines using the comma (FS=",";) character, then check if the line contains exactly 4 tokens (if (NF == 4) {). Later, we get the 4th token and check if it contains the substring "MASTER=" (if (match($4,"MASTER=")) {), if it does contain it we split it using the space character and assign the result to the variable named tokens. From tokens, we get the first token and use substr to remove the first character. Finally, we use the formatted result to create an array where the keys are the values we just created and it is used as a hashmap to keep record of all unique strings. In the end clause, we print all the elements of our hash map.

Finally, we sort all the results from all the awk executions and remove duplicates using sort --unique.


find /ServerLogs/ \
    -iname "2016_*_*.log" \
    -ipath "/ServerLogs/*/webapi/*" \
    -exec awk '
        BEGIN {
            FS=",";
        }
        {
            if (NF == 4) {
                if (match($4,"MASTER=")) {
                    split($4, tokens, " ");
                    instances[substr(tokens[1], 2)];
                }
            }
        }
        END {
            for (element in instances) {
                print element;
            }
        }
    ' \
    '{}' \; | sort --unique;

Following is the same code in one line.

 find /ServerLogs/ -iname "2016_*_*.log" -ipath "/ServerLogs/*/webapi/*" -exec awk 'BEGIN {FS=",";} {if (NF == 4) {if (match($4,"MASTER=")){split($4, tokens, " "); instances[substr(tokens[1], 2)];}}} END {for (element in instances) {print element;}}' '{}' \; | sort --unique 

Another way

Another way to do similar functionality would be the following


find /ServerLogs/ \
    -iname "2016_*_*.log" \
    -ipath "/ServerLogs/*/webapi/*" \
    -exec sh -c '
        grep "MASTER=" -s "$0" | awk "BEGIN {FS=\",\";} NF==4" | cut -d "," -f4 | cut -c 3- | cut -d " " -f1 | sort --unique
    ' \
    '{}' \; | sort --unique;

What we changed is the -exec part. Instead of calling a awk script, we create a new sub-shell using sh -c, then we define the source to be executed inside the single codes and we pass as the first parameter of the shell the filename that matched.

Inside the shell, we find all lines that contain the string MASTER= using the grep command. Later we filter out all lines that do not have four columns when we tokenize using the comma character using awk. Then, we get the 4th column using cut and delimiter the comma. We remove the first two characters of the input string using cut -c 3- and later we get only the first column by reusing cut and changing the delimiter to be the space character. With those results we perform a sort that eliminates duplicates and we pass the results to the parent process to perform other operations.

Following is the same code in one line


find /ServerLogs/ -iname "2016_*_*.log" -ipath "/ServerLogs/*/webapi/*" -exec sh -c 'grep "MASTER=" -s "$0" | awk "BEGIN {FS=\",\";} NF==4" | cut -d "," -f4 | cut -c 3- | cut -d " " -f1 | sort --unique' '{}' \; | sort --unique;


How to: Extract all usernames that are logged in from who

who | cut -d ' ' -f 1 | sort -u

who: will show who is logged on

cut  -d ‘ ‘ -f 1: will remove all sections from each line except for column 1. It will use the space character as the delimiter for the columns

sort -u: it will sort the usernames and remove duplicate lines. So if a user is logged in multiple times you will get that username only once.
In case you want to filter out root user from this list you can do it as follows:

who | cut -d ' ' -f 1 | sort -u | grep -v 'root'