GNU Linux/Bash: A function that splits a word in half

The following function takes one argument – a text file.
The text file should contain one word on each line.
The function reads the text file (argument) line by line.
Then it checks if the line has one word; if this is true, it splits the word in half.
Finally, it prints the two new words with a space between them.

#!/bin/bash

splitWordsInHalf () {
  # Takes one argument: a text file with one word on each line.
  # Reads the file line by line; when a line contains exactly one word,
  # prints the two halves of that word with a space between them.
  while read -r line
  do
    # Left unquoted on purpose: word splitting turns the line into an array.
    words=( $line )
    if [ "${#words[@]}" -eq 1 ]
    then
      echo "${line:0:${#line}/2} ${line:${#line}/2}"
    fi
  done < "$1"
}

splitWordsInHalf input.txt

Example

Using the following input file:

banana
apple
ball
car
door

We will get the following output when we execute splitWordsInHalf input.txt:

ban ana
ap ple
ba ll
c ar
do or

Notes

The following parts of the code loop over the contents of the input file. The file name passed to the function is available as the positional parameter $1. On each iteration, the while loop reads one line of text and assigns it to the variable named line. Any other variable name would work just as well.

splitWordsInHalf () {
  while read -r line
  do
    ...
  done < "$1"
}
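As a minimal sketch of the same loop pattern on its own, the function below prints every line of a file with its line number. The name number_lines is hypothetical. The IFS= prefix and the || [ -n "$line" ] guard are additions that the original function does not use: the first keeps leading and trailing whitespace, the second also processes a last line that has no trailing newline.

```bash
#!/bin/bash

# number_lines is a hypothetical helper, used only for illustration.
number_lines () {
  local line n=0
  # IFS= keeps surrounding whitespace, -r keeps backslashes literal.
  # The [ -n "$line" ] guard handles a final line without a newline.
  while IFS= read -r line || [ -n "$line" ]
  do
    n=$((n + 1))
    printf '%d: %s\n' "$n" "$line"
  done < "$1"
}

# Demo on a throwaway file whose last line has no trailing newline.
demo_file=$(mktemp)
printf 'banana\napple' > "$demo_file"
number_lines "$demo_file"
rm -f "$demo_file"
```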

The next part of the code (words=( $line )) splits the string held in the line variable into an array of words and assigns that array to the variable named words. Then ${#words[@]} gives the number of elements in the array (the number of words in the line), and the test checks that there is exactly one.

words=( $line )
if [ "${#words[@]}" -eq 1 ]
then
  ...
fi
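The word count can also be tried in isolation. The sketch below is one possible variant, not the code from the function above: it fills the array with read -r -a instead of an unquoted expansion, which avoids accidental filename expansion when a line contains characters such as *. The name count_words is hypothetical.

```bash
#!/bin/bash

# count_words is a hypothetical helper, used only for illustration.
count_words () {
  local -a words
  # read -a splits its input on whitespace into the array "words".
  read -r -a words <<< "$1"
  echo "${#words[@]}"
}

count_words "banana"      # prints 1
count_words "two words"   # prints 2
count_words "   "         # prints 0
```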

The following line prints two strings. The first is the substring of line that holds the first half of the value; the second is the substring that holds the second half.

echo "${line:0:${#line}/2} ${line:${#line}/2}"

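${#line} expands to the length of the string contained in the variable.

The structure ${VARIABLE:OFFSET:LENGTH} defines the slice of the string that we want returned: it starts at position OFFSET (counting from 0) and is LENGTH characters long. When LENGTH is omitted, the slice runs to the end of the string.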
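A short sketch of the substring expansion on a single word may help. The function name split_word and the variable half are hypothetical; the third field of the expansion is a length, not an end position.

```bash
#!/bin/bash

# split_word is a hypothetical helper, used only for illustration.
split_word () {
  local word=$1
  local half=$(( ${#word} / 2 ))   # integer division, rounds down
  # First slice: offset 0, length half. Second slice: offset half, to the end.
  echo "${word:0:half} ${word:half}"
}

split_word banana   # prints "ban ana"
split_word car      # prints "c ar"

word=banana
echo "${word:2:3}"  # offset 2, length 3: prints "nan"
```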


Audacity – Automatically split an audio file into multiple files at the quiet/silent parts

This video demonstrates how we were able to automatically split a large audio file into multiple smaller files at the quiet parts of the audio using Audacity.

The steps to follow after you open your audio file are:

  1. Select the part of the audio that you want to split automatically into multiple parts, or press ctrl + A to select the whole track.
  2. Go to the Analyze menu and select the option Label Sounds....
  3. Adjust the settings to suit your audio, for example the noise level or the minimum duration of silence that should indicate a new part.
  4. Press OK and give it some time to process the file and add labels around the new parts.
  5. You will see a new row appearing that will demonstrate in ranges the new parts that were created. If the file was not split as you expected, press ctrl + Z to undo the operation, then go to step 2 again and try with different settings.
  6. Once you are happy with the results, go to the menu File then select the category Export and finally the option Export Multiple....
  7. Unless you need specific settings, select the folder where you want the new file parts to be created and hit the Export button.
  8. In the following pop-up windows, which will be one per audio track segment, if you do not need to make changes just hit the OK button enough times to get the export process going.

A note on using Audacity on large audio files (which we assume applies to many other serious audio processing applications): when you open the audio file, Audacity pre-processes it and uses several GB of disk space for its metadata. It deletes that data as soon as you close the project, but keep it in mind so that you do not run out of disk space partway through your work and fail to perform the export.


Download Large Jupyter Workspace files

Recently, we were working on a Jupyter Workspace at anyscale-training.com/jupyter/lab. As there was no option to download all the files of the workspace, nor was there a way to create an archive from the GUI, we followed the procedure below (which we also use on Coursera.org, where it works like a charm):

First, we clicked on the blue button with the + sign in it.
That opened the Launcher tab that is visible on the image above.
From there, we clicked on the Terminal button under the Other category.

In the terminal, we executed the following command to create a compressed archive of all the files we needed to download:

tar -czf Ray-RLLib-Tutorials.tar.gz ray_tutorial/ Ray-Tutorial/ rllib_tutorials/;

After the command completed, we could see our archive in the file list on the left. By right-clicking it, we were able to initiate its download. Unfortunately, the download would always crash after the first 20MB! To fix this issue, we split the archive into multiple parts of 10MB each, downloaded them individually, and finally merged them back together on our PC. The command to split the compressed archive into multiple smaller parts of fixed size was the following:

tar -czf - ray_tutorial/ Ray-Tutorial/ rllib_tutorials/ | split --bytes=10MB - Ray-RLLib-Tutorials.tar.gz.;

After downloading those files one by one (right-clicking each of them and selecting the Download option), we recreated the original structure on our PC using the following command:

cat Ray-RLLib-Tutorials.tar.gz.* | tar xzvf -;
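Before deleting the parts, it can be worth checking that they really reassemble into the original archive. The following is a minimal sketch of such a round trip on throwaway data. Every name in it (roundtrip_demo, demo/, demo.tar.gz, the 1024-byte part size) is a placeholder chosen for the example, not something from the workspace above.

```bash
#!/bin/bash

# roundtrip_demo is a hypothetical helper: archive -> parts -> rebuilt archive.
roundtrip_demo () {
  local workdir
  workdir=$(mktemp -d) || return 1
  (
    cd "$workdir" || exit 1
    mkdir demo
    head -c 5000 /dev/urandom > demo/data.bin      # incompressible sample data
    tar -czf demo.tar.gz demo/
    split -b 1024 demo.tar.gz demo.tar.gz.part.    # parts end in aa, ab, ac, ...
    cat demo.tar.gz.part.* > rebuilt.tar.gz        # the glob expands in sorted order
    cmp demo.tar.gz rebuilt.tar.gz && echo "parts reassemble to an identical archive"
  )
  local status=$?
  rm -rf "$workdir"
  return "$status"
}

roundtrip_demo
```

On a real download, comparing a checksum of the archive computed on the server with one computed on the reassembled file would serve the same purpose.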

To clean up both the remote Server and our Local PC, we issued the following command:

rm Ray-RLLib-Tutorials.tar.gz.*;

This is a guide on how to download a very big Jupyter workspace by splitting it to multiple smaller files using the console.


Bash: Extract data from files, filtering by filename and path and processing the contents

The following code will find all files that match the pattern 2016_*_*.log (all the log files for the year 2016).

To avoid picking up log files from services other than the Web API service, we keep only the files whose path contains the folder webapi. Specifically, we pass -ipath "/ServerLogs/*/webapi/*" to find, which matches every file under /ServerLogs/ that has a folder named webapi somewhere deeper in its path, for example /ServerLogs/Production/01/webapi/*. Note that this is a shell-style pattern, not a regular expression, and that the * in -ipath also matches the / character. As written, the pattern will not match a folder called webapi directly under /ServerLogs/ (e.g. /ServerLogs/webapi/*).
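The behaviour of the path filter can be checked on a small throwaway tree. The sketch below is only an illustration: the function name demo_ipath and the folder names Production/01/mail are made up, and the search is rooted at . instead of /ServerLogs/ so that it can run anywhere.

```bash
#!/bin/bash

# demo_ipath is a hypothetical helper that builds a tiny directory tree
# and applies the same kind of -iname / -ipath filters to it.
demo_ipath () {
  local root
  root=$(mktemp -d) || return 1
  mkdir -p "$root/Production/01/webapi" "$root/Production/01/mail" "$root/webapi"
  touch "$root/Production/01/webapi/2016_01_01.log" \
        "$root/Production/01/mail/2016_01_01.log" \
        "$root/webapi/2016_01_01.log"
  # Only the file below a nested webapi folder satisfies both tests.
  ( cd "$root" && find . -iname "2016_*_*.log" -ipath "./*/webapi/*" )
  rm -rf "$root"
}

demo_ipath   # prints ./Production/01/webapi/2016_01_01.log
```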

For each result, we execute an awk script that splits each line on the comma character (FS=",";) and checks that the line contains exactly 4 tokens (if (NF == 4) {). It then checks whether the 4th token contains the substring "MASTER=" (if (match($4,"MASTER=")) {). If it does, we split that token on whitespace and assign the result to the array named tokens. From tokens, we take the first element and use substr to drop its first character. Finally, we use the resulting string as a key in the instances array, which acts as a hash map that records every unique string. In the END clause, we print all the keys of the hash map.

Finally, we sort all the results from all the awk executions and remove duplicates using sort --unique.

find /ServerLogs/ \
    -iname "2016_*_*.log" \
    -ipath "/ServerLogs/*/webapi/*" \
    -exec awk '
        BEGIN {
            FS=",";
        }
        {
            if (NF == 4) {
                if (match($4,"MASTER=")) {
                    split($4, tokens, " ");
                    instances[substr(tokens[1], 2)];
                }
            }
        }
        END {
            for (element in instances) {
                print element;
            }
        }
    ' \
    '{}' \; | sort --unique;

Following is the same code in one line.

 find /ServerLogs/ -iname "2016_*_*.log" -ipath "/ServerLogs/*/webapi/*" -exec awk 'BEGIN {FS=",";} {if (NF == 4) {if (match($4,"MASTER=")){split($4, tokens, " "); instances[substr(tokens[1], 2)];}}} END {for (element in instances) {print element;}}' '{}' \; | sort --unique 
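To try the awk part without access to the real logs, it can be run on a few made-up lines. The log format below is an assumption (the post does not show the real one): four comma-separated fields, with the fourth starting with a space and one extra character before MASTER=. The function name extract_instances is hypothetical as well.

```bash
#!/bin/bash

# extract_instances is a hypothetical wrapper around the awk logic above.
# It reads the files given as arguments, or standard input when there are none.
extract_instances () {
  awk '
    BEGIN { FS = "," }
    NF == 4 && match($4, "MASTER=") {
      split($4, tokens, " ")
      instances[substr(tokens[1], 2)]   # referencing the key creates it
    }
    END { for (element in instances) print element }
  ' "$@" | sort -u
}

# Assumed sample data, for illustration only.
sample=$(mktemp)
printf '%s\n' \
  '2016-01-05,INFO,webapi, [MASTER=node-a ready' \
  '2016-01-05,INFO,webapi, [MASTER=node-b ready' \
  '2016-01-06,INFO,webapi, [MASTER=node-a ready' \
  '2016-01-06,WARN,too,many,fields MASTER=node-c' \
  '2016-01-06,INFO,webapi, no master here' > "$sample"
extract_instances "$sample"
rm -f "$sample"
```

With this sample the function prints MASTER=node-a and MASTER=node-b: the line with five fields and the line without MASTER= are both skipped.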

Another way

Another way to achieve similar functionality is the following:

find /ServerLogs/ \
    -iname "2016_*_*.log" \
    -ipath "/ServerLogs/*/webapi/*" \
    -exec sh -c '
        grep "MASTER=" -s "$0" | awk "BEGIN {FS=\",\";} NF==4" | cut -d "," -f4 | cut -c 3- | cut -d " " -f1 | sort --unique
    ' \
    '{}' \; | sort --unique;

What we changed is the -exec part. Instead of calling an awk script, we create a new sub-shell using sh -c, define the commands to be executed inside the single quotes, and pass the matched filename as the first argument after the script, which the sub-shell sees as $0.

Inside the shell, we find all lines that contain the string MASTER= using grep. Next, awk filters out every line that does not have exactly four columns when split on the comma character. Then we take the 4th column using cut with the comma as delimiter. We remove the first two characters of that column using cut -c 3-, and then keep only the first space-separated column by calling cut again with the space character as delimiter. Finally, we sort the results, eliminating duplicates, and pass them to the parent process, which performs the final sort --unique across all files.

Following is the same code in one line

find /ServerLogs/ -iname "2016_*_*.log" -ipath "/ServerLogs/*/webapi/*" -exec sh -c 'grep "MASTER=" -s "$0" | awk "BEGIN {FS=\",\";} NF==4" | cut -d "," -f4 | cut -c 3- | cut -d " " -f1 | sort --unique' '{}' \; | sort --unique;
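The pipeline that runs inside sh -c can also be tested on its own. The sketch below wraps it in a function and feeds it assumed sample lines (four comma-separated fields, the fourth starting with a space and one extra character before MASTER=). The function name extract_with_cut and the sample data are hypothetical.

```bash
#!/bin/bash

# extract_with_cut is a hypothetical wrapper around the grep/awk/cut pipeline.
extract_with_cut () {
  grep -s "MASTER=" "$1" \
    | awk -F, 'NF == 4' \
    | cut -d "," -f4 \
    | cut -c 3- \
    | cut -d " " -f1 \
    | sort -u
}

# Assumed sample data, for illustration only.
sample=$(mktemp)
printf '%s\n' \
  '2016-01-05,INFO,webapi, [MASTER=node-a ready' \
  '2016-01-06,INFO,webapi, [MASTER=node-a ready' \
  '2016-01-06,INFO,webapi, no master here' > "$sample"
extract_with_cut "$sample"   # prints MASTER=node-a
rm -f "$sample"
```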


C: Split a buffer to a list of segments of a specific size in bits

Full project and source code to segment buffer to parts of specific size in bits (386 downloads)

The following code will split a buffer in C into a list of segments.
The size of the segments does not have to be a multiple of a byte.
The caller defines the size of the segments in bits when calling node_t *segment(const unsigned char buffer[], const unsigned int buffer_bytes_size, const unsigned int segment_bit_size, const unsigned int first_segment_bit_size);.

Each segment is an instance of element_t structure as follows:

struct element_t {
  unsigned char *segment;
  unsigned int unused_bits;
  unsigned int size;
};

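The unused_bits field holds the number of bits at the end of the last byte of the segment that carry no data and should be ignored in future operations. The size field holds the size of the segment in bytes.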
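The relation between a segment size in bits, its size in bytes and its unused bits is plain integer arithmetic, so it can be checked quickly from the shell. The sketch below assumes 8-bit bytes; the function names bytes_needed and unused_bits are hypothetical shell helpers, not part of the C project.

```bash
#!/bin/bash

CHAR_BIT=8   # assumption: 8-bit bytes

# Bytes needed to hold a given number of bits (rounded up).
bytes_needed () { echo $(( ($1 + CHAR_BIT - 1) / CHAR_BIT )); }

# Bits left over at the end of the last byte.
unused_bits () { echo $(( (CHAR_BIT - ($1 % CHAR_BIT)) % CHAR_BIT )); }

for bits in 11 16 222
do
  echo "$bits bits -> $(bytes_needed "$bits") bytes, $(unused_bits "$bits") unused bits"
done
# prints:
# 11 bits -> 2 bytes, 5 unused bits
# 16 bits -> 2 bytes, 0 unused bits
# 222 bits -> 28 bytes, 2 unused bits
```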


Following is the code that performs the segmentation:

#include "segmentation.h"

#include <math.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

//This method will create a string made of 0s and 1s representing the bits in an object.
//It will skip printing the last n bits as per the input
char *create_bit_representation_string(const void *object, const unsigned int size,
                                       const unsigned int skip_last_bits)
{
    unsigned int i = 0;
    const unsigned char *byte;
    unsigned int temp_size = size;
    const double mask_filter = pow(2, skip_last_bits);
    const unsigned int skip_last_bytes = skip_last_bits / CHAR_BIT;
    char *result = malloc(sizeof(char) * size * CHAR_BIT - skip_last_bits + 1);

    for (byte = object; temp_size--; ++byte)
    {
        unsigned char mask;
        for (mask = 1 << (CHAR_BIT - 1); mask; mask >>= 1)
        {
            //We do not want to print the last n bits of the last byte as they should always be 0
            if ((temp_size < skip_last_bytes) || (temp_size == 0 && mask < mask_filter))
            {
                break;
            }
            result[i++] = (char) (mask & *byte ? '1' : '0');
        }
    }

    result[i] = '\0';
    return result;
}

//Creating a mask where the first n bits are 1s and the rest are 0s to zero the unused bits of the segment
unsigned char create_left_mask(const unsigned int bits)
{

    unsigned char left_mask = 0;
    unsigned int i;
    for (i = 0; i < bits; i++)
    {
        left_mask |= (1 << (CHAR_BIT - 1 - i));
    }
    return left_mask;
}

//This function will shift to the left a char array for up to 7 bits.
//It will update the object and return the number of bits shifted
unsigned int
shift_left_char_array_n_bits(void *object, const unsigned int size, const unsigned int bits)
{
    if (bits == 0)
    {
        return 0;
    }

    if (bits < 1 || bits > CHAR_BIT - 1)
    {
        fprintf(stderr, "%s: Bad value %u for 'bits', it should be [1,7]"
                "\n\tIgnoring operation\n", __FUNCTION__, bits);
        return 0;
    }

    //Creating a mask where the first n bits are 1s and the rest are 0s.
    const unsigned char left_mask = create_left_mask(bits);

    unsigned char *byte;
    unsigned int temp_size = size;
    //We use temp_size as a counter (until it reaches 0) and we move the byte pointer at each loop
    for (byte = object; temp_size--; ++byte)
    {
        unsigned char carry = 0;
        if (temp_size)
        {
            //We get the bits we want to carry using the mask
            carry = byte[1] & left_mask;
            //Then shift them to the right, as this is where they will be in the new byte.
            carry >>= (CHAR_BIT - bits);
        }
        //Shifting the new byte to make space for the carry
        *byte <<= bits;
        //Applying carry
        *byte |= carry;
    }
    return bits;
}

const unsigned int calculate_unused_bits(const unsigned int segment_bit_size)
{
    return (CHAR_BIT - (segment_bit_size % CHAR_BIT)) % CHAR_BIT;
}

element_t *create_element(const unsigned char buffer[], const unsigned int byte_size,
                          const unsigned int unused_bits, const unsigned int bytes_skipped,
                          const unsigned char left_mask)
{
    element_t *element = (element_t *) malloc(sizeof(element_t));
    element->segment = malloc(byte_size);
    element->size = byte_size;
    element->unused_bits = unused_bits;
    memcpy(element->segment, &(buffer[bytes_skipped]), byte_size);
    //Zeroing the unused bits at the end of the segment
    element->segment[byte_size - 1] &= left_mask;
    return element;
}

//This method will split a buffer to segments of specific size in bits and it will return them as a list
//(each element contains the segment data, its size in bytes and the number of bits that are not used from the last byte)
//If the input buffer is less than the segment size, it will return one segment with all the data.
//The user can set the bit size of the first segment to be different than the rest using first_segment_bit_size > 0
node_t *segment(const unsigned char buffer[], const unsigned int buffer_bytes_size,
                const unsigned int segment_bit_size, const unsigned int first_segment_bit_size)
{
    if (buffer_bytes_size == 0)
    {
        fprintf(stderr, "%s: Bad value %u for 'buffer_bytes_size', it should be greater than 0"
                "\n\tIgnoring operation\n", __FUNCTION__, buffer_bytes_size);
        return NULL;
    }
    if (segment_bit_size == 0)
    {
        fprintf(stderr, "%s: Bad value %u for 'segment_bit_size', it should be greater than 0"
                "\n\tIgnoring operation\n", __FUNCTION__, segment_bit_size);
        return NULL;
    }

    node_t *head = NULL;

    const double char_bit = CHAR_BIT;
    const unsigned int first_segment_byte_size = (unsigned int) ceil(
            first_segment_bit_size / char_bit);
    if (first_segment_byte_size > buffer_bytes_size)
    {
        append(&head, create_element(buffer, buffer_bytes_size, 0, 0, UCHAR_MAX));
        return head;
    }

    unsigned char *temp_buffer = malloc(buffer_bytes_size);
    memcpy(temp_buffer, buffer, buffer_bytes_size);

    unsigned int bits_shifted = 0;
    unsigned int bytes_skipped = 0;

    if (first_segment_bit_size > 0)
    {
        const unsigned int first_segment_unused_bits = calculate_unused_bits(
                first_segment_bit_size);
        const unsigned int first_segment_byte_size_without_incomplete_byte =
                first_segment_bit_size / CHAR_BIT;

        const unsigned int first_segment_bits = CHAR_BIT - first_segment_unused_bits;
        const unsigned char left_mask = create_left_mask(first_segment_bits);

        append(&head, create_element(temp_buffer, first_segment_byte_size,
                                     first_segment_unused_bits, bytes_skipped, left_mask));

        bytes_skipped += first_segment_byte_size_without_incomplete_byte;

        if (bytes_skipped == buffer_bytes_size)
        {
            free(temp_buffer);
            return head;
        }
        if (first_segment_bits > 0 && first_segment_bits < CHAR_BIT)
        {
            bits_shifted += shift_left_char_array_n_bits(&(temp_buffer[bytes_skipped]),
                                                         buffer_bytes_size - bytes_skipped -
                                                         (bits_shifted / CHAR_BIT),
                                                         first_segment_bits);
        }
    }

    const unsigned int segment_byte_size = (unsigned int) ceil(segment_bit_size / char_bit);
    const unsigned int buffer_bits_size =
            (buffer_bytes_size - bytes_skipped) * CHAR_BIT - bits_shifted;
    const unsigned int segments_count = buffer_bits_size / segment_bit_size;

    if (segments_count == 0)
    {
        append(&head, create_element(temp_buffer, buffer_bytes_size - bytes_skipped, bits_shifted, bytes_skipped, UCHAR_MAX));
        free(temp_buffer);
        return head;
    }

    //Creating a mask where first n bits are 1s and the rest are 0s to zero the unused bits of the segment
    const unsigned int segment_unused_bits = calculate_unused_bits(segment_bit_size);
    const unsigned int last_segment_bits = CHAR_BIT - segment_unused_bits;
    const unsigned char left_mask = create_left_mask(last_segment_bits);
    const unsigned int segment_byte_size_without_incomplete_byte = segment_bit_size / CHAR_BIT;
    const unsigned int extra_bits = buffer_bits_size % segment_bit_size;

    unsigned int i;
    for (i = 0; i < segments_count; i++)
    {
        append(&head,
               create_element(temp_buffer, segment_byte_size, segment_unused_bits, bytes_skipped,
                              left_mask));
        bytes_skipped += segment_byte_size_without_incomplete_byte;

        if ((segments_count > 1 || extra_bits > 0) &&
            (last_segment_bits > 0 && last_segment_bits < CHAR_BIT))
        {
            bits_shifted += shift_left_char_array_n_bits(&(temp_buffer[bytes_skipped]),
                                                         buffer_bytes_size - bytes_skipped -
                                                         (bits_shifted / CHAR_BIT),
                                                         last_segment_bits);
        }
    }

    if (extra_bits)
    {
        const unsigned int last_segment_bytes_size =
                buffer_bytes_size - bytes_skipped - (bits_shifted / CHAR_BIT);
        const unsigned int unused_bytes_for_last_segment =
                segment_byte_size - last_segment_bytes_size;
        const unsigned int last_segment_unused_bits =
                segment_bit_size - (buffer_bits_size % segment_bit_size) + segment_unused_bits -
                (unused_bytes_for_last_segment * CHAR_BIT);
        append(&head, create_element(temp_buffer, last_segment_bytes_size,
                                     last_segment_unused_bits, bytes_skipped, UCHAR_MAX));
    }

    free(temp_buffer);
    return head;
}

Sample code that uses the function:

#include <stdio.h>
#include <string.h>
#include <limits.h>
#include <stdlib.h>
#include <time.h>

#include "libs/segmentation/segmentation.h"


// This application will create a char array of size BUFFER_BYTE_SIZE that contains random values
// and later it will split it in segments of size SEGMENT_BIT_SIZE.
// The first segment will be of size FIRST_SEGMENT_BIT_SIZE.

#define BUFFER_BYTE_SIZE 420
#define SEGMENT_BIT_SIZE 222
#define FIRST_SEGMENT_BIT_SIZE 11
#define POSSIBLE_VALUES 256

int main()
{
    srand(time(NULL));
    const unsigned int buffer_byte_size = BUFFER_BYTE_SIZE;
    fprintf(stdout, "Buffer Size: %uB\n", buffer_byte_size);
    const unsigned int segment_bit_size = SEGMENT_BIT_SIZE;
    fprintf(stdout, "Segment Size: %ub\n", segment_bit_size);
    const unsigned int first_segment_bit_size = FIRST_SEGMENT_BIT_SIZE;
    fprintf(stdout, "First Segment Size: %ub\n", first_segment_bit_size);
    unsigned char buffer[buffer_byte_size];
    unsigned int i;
    for (i = 0; i < buffer_byte_size; i++)
    {
        buffer[i] = (unsigned char) (rand() % POSSIBLE_VALUES);
    }
    char *buffer_bits = create_bit_representation_string(buffer, buffer_byte_size, 0);
    const size_t buffer_length = strlen(buffer_bits);
    fprintf(stdout, "\tBuffer: '%s'\n", buffer_bits);
    node_t *head = segment(buffer, buffer_byte_size, segment_bit_size, first_segment_bit_size);

    element_t *element = pop(&head);
    unsigned int bytes_skipped = 0;
    unsigned int segment_count = 0;
    unsigned int total_segment_bit_size = 0;
    while (element != NULL)
    {

        char *segment_bits = create_bit_representation_string(element->segment,
                                                              element->size,
                                                              element->unused_bits);
        const size_t segment_length = strlen(segment_bits);
        fprintf(stdout,
               "\t\tSegment %04u: Size in bytes %02u - Unused bits %04u - '%.*s'\n",
               ++segment_count,
               element->size, element->unused_bits,
               (int) (element->size * CHAR_BIT - element->unused_bits), segment_bits);
        if (segment_length == 0)
        {
            fprintf(stderr,
                    "Data validation failed."
                            "\n\tBuffer size in bytes %u"
                            "\n\tSegment size in bits %u"
                            "\n\tFirst Segment size in bits %u"
                            "\n\tFound empty segment\n",
                    buffer_byte_size, segment_bit_size, first_segment_bit_size);
            clear(&head);
            free(segment_bits);
            free(element->segment);
            free(element);
            free(buffer_bits);
            return EXIT_FAILURE;
        }
        for (i = 0; i < segment_length && bytes_skipped + i < buffer_length; i++)
        {
            if (segment_bits[i] != buffer_bits[bytes_skipped + i])
            {
                fprintf(stderr,
                        "Data validation failed."
                                "\n\tBuffer size in bytes %u"
                                "\n\tSegment size in bits %u"
                                "\n\tFirst Segment size in bits %u"
                                "\n\tPosition %u of the buffer"
                                "\n\tPosition %u of the segment\n",
                        buffer_byte_size, segment_bit_size, first_segment_bit_size, bytes_skipped + i, i);
                clear(&head);
                free(segment_bits);
                free(element->segment);
                free(element);
                free(buffer_bits);
                return EXIT_FAILURE;
            }
        }
        free(segment_bits);
        bytes_skipped += segment_length;

        const unsigned int current_segment_bit_size = ((element->size - 1) * CHAR_BIT) + CHAR_BIT - element->unused_bits;
        if (segment_length != current_segment_bit_size)
        {
            fprintf(stderr,
                    "Data validation failed."
                            "\n\tBuffer size in bytes %u"
                            "\n\tSegment size in bits %u"
                            "\n\tFirst Segment size in bits %u"
                            "\n\tCurrent Segment bit size (%u) not equal to its string representation (%zu)\n",
                    buffer_byte_size, segment_bit_size, first_segment_bit_size, current_segment_bit_size, segment_length);
            clear(&head);
            //segment_bits was already freed above, freeing it again would be a double free
            free(element->segment);
            free(element);
            free(buffer_bits);
            return EXIT_FAILURE;
        }
        total_segment_bit_size += current_segment_bit_size;

        free(element->segment);
        free(element);
        element = pop(&head);
    }

    free(buffer_bits);

    if (buffer_length != total_segment_bit_size) {
        fprintf(stderr,
                "Data validation failed."
                        "\n\tBuffer size in bytes %u"
                        "\n\tSegment size in bits %u"
                        "\n\tFirst Segment size in bits %u"
                        "\n\tTotal Segment bit size (%u) not equal to full string representation (%zu)\n",
                buffer_byte_size, segment_bit_size, first_segment_bit_size, total_segment_bit_size, buffer_length);
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}
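As a rough cross-check of what the sample program should print, the expected segment layout can be worked out with shell arithmetic. This is a sketch under stated assumptions: 8-bit bytes, and a first segment that is not larger than the buffer. The function name segment_layout is hypothetical and is not part of the C project.

```bash
#!/bin/bash

# segment_layout BUFFER_BYTES SEGMENT_BITS FIRST_SEGMENT_BITS
# Hypothetical helper: prints how many segments to expect and how big they are.
# Assumes 8-bit bytes and FIRST_SEGMENT_BITS <= BUFFER_BYTES * 8.
segment_layout () {
  local buffer_bytes=$1 segment_bits=$2 first_bits=$3
  local remaining=$(( buffer_bytes * 8 - first_bits ))   # bits left after the first segment
  local full=$(( remaining / segment_bits ))             # full-size segments
  local tail=$(( remaining % segment_bits ))             # bits in the shorter last segment
  local first=$(( first_bits > 0 ? 1 : 0 ))
  local total=$(( first + full + (tail > 0 ? 1 : 0) ))
  echo "first=${first_bits}b full=${full}x${segment_bits}b tail=${tail}b segments=${total}"
}

# The constants of the sample program: 420 bytes, 222-bit segments, 11-bit first segment.
segment_layout 420 222 11   # prints first=11b full=15x222b tail=19b segments=17
```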
