find


How to make `find` ignore .git directories

Trying to find a source code file by its content using find and -exec grep can sometimes return results from the repository's .git folders as well.

Not only does this behavior produce results you do not need, it also makes your search slower.
Below, we propose a couple of solutions for making your search more efficient.

Example 1: Ignore all .git folders no matter where they are in the search path

For find to ignore all .git folders, no matter how deep in the directory tree they appear, add -not -path '*/\.git/*' to your command as in the example below.
This parameter will instruct find to filter out any file that has the folder .git anywhere in its path. This is very helpful in case a project has dependencies on other projects (repositories) that are part of its internal structure.


find . -type f -not -path '*/\.git/*';
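
Putting the pieces together with the content search from the introduction, the command below prints the names of the files that contain a given string while skipping the .git folders; the string needle is a placeholder for whatever you are searching for.


find . -type f -not -path '*/\.git/*' -exec grep -l 'needle' '{}' +;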

Note: if you are using svn, use the following instead:


find . -type f -not -path '*/\.svn/*';

Example 2: Ignore all hidden files and folders

To ignore all hidden files and folders in your find results, add -not -path '*/\.*' to your command.


find . -not -path '*/\.*';

This parameter instructs find to ignore any file that has the string /. anywhere in its path, which covers every hidden file and folder in the search path.
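
As usual, you can combine this with other filters; for example, to keep only regular files in the results:


find . -type f -not -path '*/\.*';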


Replace a character in all filenames

The following command will search for files in the current directory (.) that have the colon character : in their name.
The files that match will then be renamed, with all instances of the colon character : replaced by the full stop character ..

find . -name "*:*" -execdir bash -c 'mv "$1" "${1//:/.}"' _ {} \;
  • -execdir command '{}' \; is like -exec, but the specified command is run from the subdirectory containing the matched file, which is not normally the directory in which you started find. We call bash explicitly (instead of sh) because the ${1//:/.} substitution is a bash feature that a plain POSIX sh may not support.

Example: if you have a file named 2017-03-15 14:34:44.116002523.png then it will be renamed to 2017-03-15 14.34.44.116002523.png.
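
If you want to preview the renames before applying them, a safe variation is to prefix mv with echo, so that the commands are printed instead of executed:


find . -name "*:*" -execdir bash -c 'echo mv "$1" "${1//:/.}"' _ {} \;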


Back Up Jenkins instance except for workspace and build logs

Our Jenkins setup has a lot of cool features and configuration.
It has ‘project-based security’, parameterized projects, multiple source code management blocks per project, and fairly extensive tests implemented with several build steps.
Of course, we do not want to lose them, so we make backups often.
The commands we use for the backup are the following.


jenkins_folder="/var/lib/jenkins/";
backup_folder="$HOME/jenkins/`date +%F`";
mkdir -p "$backup_folder";
(cd "$jenkins_folder"/jobs/; find . -mindepth 3 -type d -regex '.*/[0-9]*$' -print) | sed 's|./|jobs/|' | sudo rsync --archive --exclude 'workspace/*' --exclude-from=- "$jenkins_folder" "$backup_folder";

Explanation of commands:

  • In backup_folder="$HOME/jenkins/`date +%F`"; we used the $HOME variable instead of the tilde ~ because the tilde is not expanded inside double quotes; using it would create a folder literally named ~ in the current directory instead of creating the folder jenkins in the home directory.
  • mkdir -p "$backup_folder"; instructs mkdir to create all parent folders needed to create our destination folder.
  • (cd "$jenkins_folder"/jobs/; find . -mindepth 3 -type d -regex '.*/[0-9]*$' -print) navigates to the Jenkins directory before performing the search; this way the resulting file names are relative to the installation location, which is what we need to pass to rsync later.
    Then we search for all directories whose names are numeric and which are at least at depth 3. We filter by depth as well to avoid matching folders that are directly inside the jobs folder.
  • sed 's|./|jobs/|' replaces the prefix ./ with jobs/ so that the paths are relative to the directory rsync will work from.
  • sudo rsync --archive --exclude 'workspace/*' --exclude-from=- "$jenkins_folder" "$backup_folder"; copies everything from $jenkins_folder to the folder $backup_folder, while excluding the data in workspace and the folders matched by find (the job build folders).
    --exclude-from=- instructs rsync to read the list of files to exclude from stdin; you can preview that list as shown below.
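
To preview the exclude list before running the backup, you can run the first part of the pipeline on its own; a quick sanity check, assuming the same Jenkins layout as above:


(cd "$jenkins_folder"/jobs/; find . -mindepth 3 -type d -regex '.*/[0-9]*$' -print) | sed 's|./|jobs/|' | head;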

Find files that were created, modified or accessed in the last N minutes

Find all files in $my_folder whose status changed in the last 60 minutes

find "$my_folder" -cmin -60

Find all files in $my_folder whose data was modified in the last 60 minutes

find "$my_folder" -mmin -60

Find all files in $my_folder that were accessed in the last 60 minutes

find "$my_folder" -amin -60

Please remember to prefix the minutes with a minus sign, e.g. use -60 and not 60: -n means "less than n minutes ago", a plain n means "exactly n minutes ago", and +n means "more than n minutes ago".
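
Conversely, to match files whose data was modified more than 60 minutes ago, use a plus sign:


find "$my_folder" -mmin +60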

More examples

Find all files in $my_folder whose status changed in the last 60 minutes AND that were accessed in the last 10 minutes

find "$my_folder" -cmin -60 -amin -10

Find all files in $my_folder whose status changed in the last 60 minutes OR that were accessed in the last 10 minutes

find "$my_folder" \( -cmin -60 -o -amin -10 \)

Notes on find command

  • -cmin n matches files whose status was last changed n minutes ago.
  • -mmin n matches files whose data was last modified n minutes ago.
  • -amin n matches files that were last accessed n minutes ago.
  • -o is the logical OR operator. The second expression is not evaluated if the first expression is true.
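
Note that the parentheses around the OR expression matter once you add an explicit action such as -print, because -o has lower precedence than the implied AND between expressions; without them, the action binds only to the last expression. For example:


# Prints only files accessed in the last 10 minutes.
find "$my_folder" -cmin -60 -o -amin -10 -print

# Prints files matching either condition.
find "$my_folder" \( -cmin -60 -o -amin -10 \) -print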

How to search for specific filenames in .tar archives

The following commands will search the .tar archives found in the specified folder and print on screen all files whose paths or filenames match our search token. We provide multiple solutions, one for each type of .tar archive, depending on the compression used.

For .tar archives

find /media/repository/packages/ -type f -iname "*.tar" -exec tar -t -f '{}' \; | grep "configurations/arm-cortexa9";

For .tar.bz2 archives

find /media/repository/packages/ -type f -iname "*.tar.bz2" -exec tar -t -j -f '{}' \; | grep "configurations/arm-cortexa9";

For .tar.xz archives

find /media/repository/packages/ -type f -iname "*.tar.xz" -exec tar -t -J -f '{}' \; | grep "configurations/arm-cortexa9";

For .tar.gz and .tgz archives

Please note that this command uses the -o parameter (the logical OR) on find to search for multiple filename extensions.

find /media/repository/packages/ -type f \( -iname "*.tar.gz" -o -iname "*.tgz" \) -exec tar -t -z -f '{}' \; | grep "configurations/arm-cortexa9";

find Parameters Legend

  • -type f filters out any result which is not a regular file
  • -exec command '{}' \; runs the specified command on the results of find. The string '{}' is replaced by the current file name being processed.
  • -o is the logical OR operator. The second expression is not evaluated if the first expression is true.

tar Parameters Legend

  • -z or --gzip instructs tar to filter the archive through gzip
  • -j or --bzip2 filters the archive through bzip2
  • -J or --xz filters the archive through xz
  • -t or --list lists the contents of an archive
  • -f or --file=INPUT uses the archive file or device named INPUT
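
As a side note, recent versions of GNU tar can detect the compression format on their own when listing or extracting, so if you are using GNU tar a single command should cover all of the archive types above:


find /media/repository/packages/ -type f \( -iname "*.tar*" -o -iname "*.tgz" \) -exec tar -t -f '{}' \; | grep "configurations/arm-cortexa9";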

How to suppress binary files from matching results

When you try to find all files that contain a certain string value, it can be very costly for grep to scan binary files that you are not interested in.
To prevent your search from testing whether the binary files contain the needle, add the parameter -I (capital i) to grep.
With -I, grep will process a binary file as if it did not contain matching data; this is equivalent to the --binary-files=without-match option.

Example

find . -type f -exec grep 'string' '{}' -s -l -I \;

The above command breaks down as follows:

  • find . -type f Find all files in current directory.
  • -exec For each match execute the following.
  • grep 'string' '{}' Search the matched file '{}' if it contains the value ‘string’.
  • -s Suppress error messages about nonexistent or unreadable files.
  • -l (lowercase L) or --files-with-matches Suppress normal output; instead print the name of each input file from which output would normally have been printed. Scanning stops on the first match.
  • -I (capital i) or --binary-files=without-match Process a binary file as if it did not contain matching data.
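
For this particular task, GNU grep can also walk the directory tree on its own using the -r (recursive) option, which gives a roughly equivalent result without find:


grep -r -s -l -I 'string' .;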

Delete all empty directories and all directories containing empty directories

Assume you have a complex filesystem from which you want to recursively delete all empty directories, as well as all directories that contain only empty directories, while leaving the rest intact.
You can use the following command.

find . -type d -empty -delete;

The configuration we used is the following:

  • -type d restricts the results to directories only
  • -empty restricts to empty directories only
  • -delete removes the directories that matched

The above command will delete all directories in the current directory that are either empty or contain only empty directories.
It works even for deeply nested structures, because -delete implies -depth: find processes the contents of each directory before the directory itself, so directories that become empty once their empty children are removed are deleted in the same run.
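
You can verify the behavior on a disposable tree; in the sketch below, demoTree is a throw-away example directory, and the whole tree disappears in a single pass:


mkdir -p demoTree/a/b/c;
find demoTree -type d -empty -delete;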


GNU/Linux find: Get results relative to the directory being searched, instead of the directory the shell is in

Recently we wanted to create a list of files that could be found in a specific folder.
For that list we wanted the paths of the files to be relative to the folder we were searching in, instead of them being relative to the folder our shell was currently in.

To achieve that, we used cd to navigate into that folder and performed the search from there.
We wrapped the commands in a sub-shell; the sub-shell is not required for the search itself, but it prevents the cd from changing the current directory of our shell.

The command was as follows:

(cd toThe/Path/WeAre/Interested/In && find .)

instead of:

find toThe/Path/WeAre/Interested/In

Since we were interested in getting all files, we did not put any filters on find.
Of course you can use find normally and modify it as you please.

Finally, since we wanted the list of files saved in a text file, we redirected the output of the above command to a file in the current working directory:

(cd toThe/Path/WeAre/Interested/In && find .) > interestingFiles.txt
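
As a side note, if you are using GNU find you can get a similar relative listing without the sub-shell by using the -printf directive %P, which prints each file's name with the starting-point prefix removed (the output differs slightly, e.g. there is no leading ./):


find toThe/Path/WeAre/Interested/In -printf '%P\n' > interestingFiles.txt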

Delete all files and keep the directory structure

Scenario

You have a complex folder structure and you want to remove all files and at the same time keep all folders intact.

We will present one method, with two variations of it, that can achieve the above.
The method uses the GNU find command to find all files and delete them one by one.

Variation A

find . ! -type d -exec rm '{}' \;

The above command will search the current directory and its sub-directories for anything that is not a folder, and then delete it.

  • find . – searches in this folder; since we did not limit the depth, it will search all sub-folders as well
  • ! -type d-type d instructs find to match all directories; by adding the ! in front of the expression we negate it, instructing find to match anything but directories
  • -exec rm '{}' \; – for every result, the command after -exec is executed. The file name replaces '{}', so the results get deleted one by one.

Variation B

find . ! -type d -delete

In this variation, we replaced -exec rm '{}' \; with the simpler to remember -delete action.
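
With either variation, it is prudent to preview what will be removed by first running the same expression with -print in place of the deleting action:


find . ! -type d -print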


Bash: Extract data from files, filtering both by filename and path, and doing internal processing

The following code will find all files that match the pattern 2016_*_*.log (all the log files for the year 2016).

To avoid finding log files from services other than the Web API service, we only keep files whose path contains the folder webapi. Specifically, we used "/ServerLogs/*/webapi/*" with the following command to match all files that are under the folder /ServerLogs/ and have another folder named webapi somewhere in their path; we do that to match only files such as /ServerLogs/Production/01/webapi/*. The way we wrote our pattern, it will not match if the folder webapi is directly under /ServerLogs/ (e.g. /ServerLogs/webapi/*).

For each result, we execute an awk script that splits the line using the comma character (FS=",";) and then checks whether the line contains exactly 4 tokens (if (NF == 4) {). It then takes the 4th token and checks whether it contains the substring "MASTER=" (if (match($4,"MASTER=")) {); if it does, it splits the token using the space character and assigns the result to the array named tokens. From tokens, we take the first element and use substr to remove its first character. Finally, we use the formatted result as a key in the array instances, which serves as a hash map keeping a record of all unique strings. In the END clause, we print all the elements of our hash map.

Finally, we sort all the results from all the awk executions and remove duplicates using sort --unique.


find /ServerLogs/ \
    -iname "2016_*_*.log" \
    -ipath "/ServerLogs/*/webapi/*" \
    -exec awk '
        BEGIN {
            FS=",";
        }
        {
            if (NF == 4) {
                if (match($4,"MASTER=")) {
                    split($4, tokens, " ");
                    instances[substr(tokens[1], 2)];
                }
            }
        }
        END {
            for (element in instances) {
                print element;
            }
        }
    ' \
    '{}' \; | sort --unique;

Following is the same code in one line.

find /ServerLogs/ -iname "2016_*_*.log" -ipath "/ServerLogs/*/webapi/*" -exec awk 'BEGIN {FS=",";} {if (NF == 4) {if (match($4,"MASTER=")){split($4, tokens, " "); instances[substr(tokens[1], 2)];}}} END {for (element in instances) {print element;}}' '{}' \; | sort --unique;

Another way

Another way to achieve similar functionality is the following:


find /ServerLogs/ \
    -iname "2016_*_*.log" \
    -ipath "/ServerLogs/*/webapi/*" \
    -exec sh -c '
        grep "MASTER=" -s "$0" | awk "BEGIN {FS=\",\";} NF==4" | cut -d "," -f4 | cut -c 3- | cut -d " " -f1 | sort --unique
    ' \
    '{}' \; | sort --unique;

What we changed is the -exec part. Instead of calling an awk script, we create a new sub-shell using sh -c, define the code to be executed inside the single quotes, and pass the name of the matched file to the shell as its $0 parameter (which is why the script references it as "$0").

Inside the shell, we find all lines that contain the string MASTER= using grep. Then we filter out all lines that do not have exactly four columns when tokenized on the comma character, using awk. Next, we take the 4th column using cut with the comma as the delimiter. We remove the first two characters of that string using cut -c 3- and then keep only the first column by reusing cut with the space character as the delimiter. Those results go through a sort that eliminates duplicates, and the output is passed to the parent process for further processing.

Following is the same code in one line


find /ServerLogs/ -iname "2016_*_*.log" -ipath "/ServerLogs/*/webapi/*" -exec sh -c 'grep "MASTER=" -s "$0" | awk "BEGIN {FS=\",\";} NF==4" | cut -d "," -f4 | cut -c 3- | cut -d " " -f1 | sort --unique' '{}' \; | sort --unique;