Analyzing Data with Python: Counting how many times each value in a CSV occurs

Data analysis is an essential aspect of many fields, from business and research to sports and education. Python, with its versatile libraries, is a popular choice for data analysis tasks. In this blog post, we’ll explore a Python script that reads data from a CSV file and counts the occurrences of each value in three of its columns. The script can be a valuable tool for gaining insights into datasets with educational information, such as universities, countries, and regions.

#!/usr/bin/env python3

import csv

# Create empty dictionaries
d_university = dict()
d_country = dict()
d_region = dict()

with open('XtremeScores.csv') as f:
    reader = csv.reader(f, delimiter=',', quoting=csv.QUOTE_NONE)
    # Loop through each line of the file
    for line in reader:
        word = line[6]
        # Check if the word is already in dictionary
        if word in d_region:
            # Increment count of word by 1
            d_region[word] = d_region[word] + 1
        else:
            # Add the word to dictionary with count 1
            d_region[word] = 1

        word = line[5]
        # Check if the word is already in dictionary
        if word in d_country:
            # Increment count of word by 1
            d_country[word] = d_country[word] + 1
        else:
            # Add the word to dictionary with count 1
            d_country[word] = 1

        word = line[4]
        # Check if the word is already in dictionary
        if word in d_university:
            # Increment count of word by 1
            d_university[word] = d_university[word] + 1
        else:
            # Add the word to dictionary with count 1
            d_university[word] = 1

sorted_university = sorted(d_university.items(), key=lambda x:x[1], reverse=True)
print(sorted_university[:10])
sorted_country = sorted(d_country.items(), key=lambda x:x[1], reverse=True)
print(sorted_country[:10])
sorted_region = sorted(d_region.items(), key=lambda x:x[1], reverse=True)
print(sorted_region[:10])

Let’s break down the code step by step:

#!/usr/bin/env python3

import csv

The script starts with a shebang line so that it can be executed directly from the shell, followed by an import of the csv module, which is essential for handling comma-separated values (CSV) files.

# Create empty dictionaries
d_university = dict()
d_country = dict()
d_region = dict()

Three empty dictionaries, d_university, d_country, and d_region, are created. These dictionaries will be used to store the counts of universities, countries, and regions, respectively.

with open('XtremeScores.csv') as f:
    reader = csv.reader(f, delimiter=',', quoting=csv.QUOTE_NONE)

The script opens a CSV file named ‘XtremeScores.csv’ using a with statement, and a csv.reader object is used to read its contents. We specify that the delimiter is a comma (,), and csv.QUOTE_NONE tells the reader not to give quote characters any special treatment.
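
If you are not sure which column holds which field, a quick way to check is to print the first row before relying on hard-coded indexes. This is just a helper for exploration; the column layout mentioned in the comment is an assumption about XtremeScores.csv, not something the file guarantees.

import csv

# Print the first row so we can see which index holds which field.
# The layout hinted at below (university at index 4, country at 5,
# region at 6) is an assumption based on the rest of the script.
with open('XtremeScores.csv') as f:
    reader = csv.reader(f, delimiter=',', quoting=csv.QUOTE_NONE)
    first_row = next(reader)
    print(first_row)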

    # Loop through each line of the file
    for line in reader:

A for loop iterates through each line of the CSV file. On every iteration, the variable line holds the current row as a list of strings.

        word = line[6]

The code extracts the value at index 6 (the seventh column, since indexing is zero-based) from the current line and stores it in word. In this dataset it represents a region.

        # Check if the word is already in the dictionary
        if word in d_region:
            # Increment count of word by 1
            d_region[word] = d_region[word] + 1
        else:
            # Add the word to the dictionary with count 1
            d_region[word] = 1

The code checks if the extracted word (representing a region) is already present in the d_region dictionary. If it is, it increments the count by 1. If not, it adds the word to the dictionary with a count of 1. This process counts the occurrences of each region in the dataset.
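
As a side note, the same check-and-increment logic can be written in one line with dict.get(), which returns a default value (0 here) when the key is not present yet:

        # Equivalent to the if/else above: start from 0 when the word is new
        d_region[word] = d_region.get(word, 0) + 1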

The code repeats the same process for words representing countries (at index 5) and universities (at index 4), using the d_country and d_university dictionaries, respectively.

sorted_university = sorted(d_university.items(), key=lambda x:x[1], reverse=True)
print(sorted_university[:10])

After counting the universities, the code sorts them in descending order of frequency and prints the top 10 universities based on their occurrence.
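
The slice sorted_university[:10] is a list of (name, count) tuples, so it prints as a raw Python list. If you prefer one entry per line, a small loop does the trick; this is purely a formatting choice, not part of the original script:

for university, count in sorted_university[:10]:
    print(f'{university}: {count}')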

sorted_country = sorted(d_country.items(), key=lambda x:x[1], reverse=True)
print(sorted_country[:10])

Similarly, it does the same for countries and prints the top 10 countries.

sorted_region = sorted(d_region.items(), key=lambda x:x[1], reverse=True)
print(sorted_region[:10])

Finally, it sorts and prints the top 10 regions based on their occurrence.

In summary, this Python script provides a simple but effective way to analyze data in a CSV file, specifically counting universities, countries, and regions. It leverages dictionaries to maintain counts and uses the csv module for reading data from the file. This can be a useful starting point for more advanced data analysis tasks and visualization in Python.
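
For completeness, the same counting can be expressed more compactly with collections.Counter from the standard library, whose most_common() method also replaces the manual sorting. This is an alternative sketch of the same idea, not the approach the original script takes:

#!/usr/bin/env python3

import csv
from collections import Counter

# One Counter per column we want to summarize
d_university = Counter()
d_country = Counter()
d_region = Counter()

with open('XtremeScores.csv') as f:
    reader = csv.reader(f, delimiter=',', quoting=csv.QUOTE_NONE)
    for line in reader:
        d_university[line[4]] += 1
        d_country[line[5]] += 1
        d_region[line[6]] += 1

# most_common(10) returns the ten (value, count) pairs with the highest counts
print(d_university.most_common(10))
print(d_country.most_common(10))
print(d_region.most_common(10))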


Count how many submissions per score

Recently, we had access to a database containing the scores of a programming competition system.
The database held several contests, each contest contained several challenges, and any competitor could make multiple submissions.
We wanted to extract a couple of charts showing

  • how many submissions we had per score, and
  • how many submissions we had per score while keeping only the best submission (max score) per contestant per challenge per contest.

The following query returns the number of submissions per score per challenge per contest. TRUNCATE(score, 1) truncates each score to one decimal place, so the scores are grouped into buckets of width 0.1.


SELECT contest_id, challenge_id, TRUNCATE(score, 1), COUNT(*)
FROM submissions
GROUP BY contest_id, challenge_id, TRUNCATE(score, 1)
ORDER BY contest_id, challenge_id, TRUNCATE(score, 1);

The next one returns the number of submissions per score per challenge per contest while keeping only the best submission (max score) per contestant per challenge per contest. The inner query picks each competitor’s maximum score per challenge per contest, and the outer query buckets and counts those best scores:


SELECT contest_id, challenge_id, TRUNCATE(max_score, 1), COUNT(*)
FROM
(
  SELECT contest_id, challenge_id, competitor_id, MAX(score) AS max_score
  FROM submissions
  GROUP BY contest_id, challenge_id, competitor_id
) AS max_scores
GROUP BY contest_id, challenge_id, TRUNCATE(max_score, 1)
ORDER BY contest_id, challenge_id, TRUNCATE(max_score, 1);
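
If you want to turn the first query’s output into a chart, one possible route is to export the result to a CSV file and plot it with matplotlib. The sketch below is our own illustration rather than part of the original workflow: the file name, the column order, and the absence of a header row are all assumptions, and the counts are aggregated across contests and challenges for simplicity.

import csv
from collections import defaultdict

import matplotlib.pyplot as plt

# Hypothetical export of the first query: contest_id, challenge_id, score, submissions
totals = defaultdict(int)
with open('submissions_per_score.csv') as f:
    for contest_id, challenge_id, score, submissions in csv.reader(f):
        totals[float(score)] += int(submissions)

scores = sorted(totals)
plt.bar(scores, [totals[s] for s in scores], width=0.08)
plt.xlabel('score (truncated to one decimal place)')
plt.ylabel('number of submissions')
plt.show()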


Bash: Read a file, line by line

The following script reads a filename from the keyboard, then opens that file and processes it line by line.
In this example we simply number the lines, print them, and count the total number of lines in the file.

#!/bin/sh
echo "Enter the filename to process"
read -r filename
# Redirect the file to the script's standard input so the loop can read from it
exec < "$filename"
lineNumber=0
# IFS= and -r keep leading whitespace and backslashes intact in each line
while IFS= read -r line
do
	lineNumber=$((lineNumber + 1))
	echo "$lineNumber - $line"
done
echo "Total number of lines: $lineNumber"