I did a website scrape for a conversion project. I’d like to do some statistics on the types of files in there, for instance 400 .html files, 100 .gif, etc. What’s an easy way to do this? It has to be recursive.
Edit: With the script that maxschlepzig posted, I’m having some problems due to the architecture of the site I’ve scraped. Some of the files have names like *.php?blah=blah&foo=bar with various arguments, so it counts them all as unique. So the solution needs to consider *.php* to be all of the same type, so to speak.
You could use uniq for this, e.g.:
$ find . -type f | sed 's/.*\.//' | sort | uniq -c
     16 avi
     29 jpg
    136 mp3
      3 mp4
find recursively prints all filenames
sed deletes from every filename the prefix until the file extension
uniq assumes sorted input
-c does the counting (like a histogram).
Correct answer by maxschlepzig on September 13, 2020
I know this thread is old, but it is one of the top results when searching for "bash count file extensions".
I encountered the same problem as you and created a script similar to maxschlepzig's.
Here is the command I made. It counts the extensions of all files in the working directory recursively, merging upper- and lower-case variants, removing false positives, and counting the occurrences.
find . -type f | tr '[:upper:]' '[:lower:]' | grep -E '.*\.[a-zA-Z0-9]*$' | sed -e 's/.*\(\.[a-zA-Z0-9]*\)$/\1/' | sort | uniq -c | sort -n
Here is the GitHub link if you'd like to see more documentation.
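To see what the case-folding step contributes, here is a small sketch that runs a reduced version of the pipeline on a few made-up file names. The function name fold_and_count and the sample names are assumptions for illustration only; they are not part of the command above.

```shell
#!/bin/bash
# Hypothetical helper: reads file names on stdin, lowercases them,
# keeps only the trailing ".ext" part, and counts each extension.
fold_and_count() {
    tr '[:upper:]' '[:lower:]' |
        sed -n 's/.*\(\.[a-z0-9]*\)$/\1/p' |   # names without a dot are dropped
        sort | uniq -c | sort -n
}

# Made-up names: three spellings of "jpg" and one upper-case "TXT".
printf '%s\n' photo.JPG scan.jpg logo.Jpg notes.TXT | fold_and_count
```

With these names the three JPG spellings end up in a single .jpg row with a count of 3, and .txt gets a count of 1. Without the tr step they would show up as three separate rows.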
Answered by Andrew Hopkins on September 13, 2020
I've put a bash script into my ~/bin folder called exhist with this content:
#!/bin/bash

for d in */ ; do
    echo "$d"
    find "$d" -type f | sed -r 's/.*\/([^\/]+)/\1/' | sed 's/^[^.]*$//' | sed -r 's/.*(\.[^.]+)$/\1/' | sort | uniq -c | sort -nr
    # files only | keep filename only | no ext -> '' ext | keep part after . (i.e. ext) | count | sort by count desc
done
Whichever directory I'm in, I just type 'exh', tab auto-completes it, and I see something like this:
$ exhist
src/
      7 .java
      1 .txt
target/
     42 .html
     10 .class
      4 .jar
      3 .lst
      2
      1 .xml
      1 .txt
      1 .properties
      1 .js
      1 .css
P.S. Trimming the part after the question mark should be simple to do with another sed command, probably after the last one (I haven't tried it).
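One possible sketch of that trimming step, assuming GNU or POSIX find and sed. The function name ext_hist_noquery is hypothetical. Note that it strips the query part before extracting the extension rather than after: a query value can itself contain a dot (for example ?v=2.5), and that dot would otherwise be taken as the start of the extension.

```shell
#!/bin/bash
# Hypothetical helper (not from the original answer): extension histogram
# for all regular files below a directory, ignoring everything from the
# first '?' in the file name onwards.
ext_hist_noquery() {
    find "${1:-.}" -type f |
        sed 's|.*/||' |                      # keep the file name only
        sed 's/[?].*$//' |                   # drop '?' and whatever follows it
        sed -n 's/.*\(\.[^.]*\)$/\1/p' |     # keep the last ".ext"; skip names without a dot
        sort | uniq -c | sort -nr
}

# Usage: ext_hist_noquery path/to/scrape
```

With this, index.php?blah=blah&foo=bar and index.php?foo=baz are both counted under .php. File names containing newlines are not handled.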
Answered by Zsolt Katona on September 13, 2020
This one-liner seems to be a fairly robust method:
find . -type f -printf '%f\n' | sed -r -n 's/.+(\..*)$/\1/p' | sort | uniq -c
find . -type f -printf '%f\n' prints the basename of every regular file in the tree, with no directories. That eliminates having to worry about directories which may have .'s in them in your sed regex.
sed -r -n 's/.+(\..*)$/\1/p' replaces the incoming filename with only its extension. E.g., somefile.ext becomes .ext. Note the initial .+ in the regex; this results in any match needing at least one character before the extension's dot. This prevents filenames like .gitignore from being treated as having no name at all and the extension '.gitignore', which is probably what you want. If not, replace the .+ with a .*.
The rest of the line is from the accepted answer.
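The difference between the two regex variants can be seen on a handful of sample names. This is only an illustrative sketch: the helper names (names, ext_strict, ext_loose) and the file names are made up, and it uses sed -E, which GNU sed accepts as an equivalent spelling of -r.

```shell
#!/bin/bash
# Hypothetical sample names, fed through both variants of the regex.
names() { printf '%s\n' .gitignore main.c archive.tar.gz; }

ext_strict() { sed -E -n 's/.+(\..*)$/\1/p'; }   # needs at least one character before the dot
ext_loose()  { sed -E -n 's/.*(\..*)$/\1/p'; }   # also accepts a name that starts with the dot

names | ext_strict   # prints .c and .gz; .gitignore is skipped
names | ext_loose    # prints .gitignore, .c and .gz
```

In both variants the leading part of the regex is greedy, so archive.tar.gz is reduced to .gz, not .tar.gz.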
Edit: If you want a nicely sorted histogram in Pareto chart format, just add another sort to the end:
find . -type f -printf '%f\n' | sed -r -n 's/.+(\..*)$/\1/p' | sort | uniq -c | sort -bn
Sample output from a built Linux source tree:
      1 .1992-1997
      1 .1994-2004
      1 .1995-2002
      1 .1996-2002
      1 .ac
      1 .act2000
      1 .AddingFirmware
      1 .AdvancedTopics
[...]
   1445 .S
   2826 .o
   2919 .cmd
   3531 .txt
  19290 .h
  23480 .c
Answered by Gary R. Van Sickle on September 13, 2020
With zsh:
print -rl -- **/?*.*(D.:e) | uniq -c | sort -n
**/?*.* matches all files that have an extension, in the current directory and its subdirectories recursively. The glob qualifier D makes zsh traverse even hidden directories and consider hidden files, and . selects only regular files. The history modifier :e retains only the file extension.
print -rl prints one match per line.
uniq -c counts consecutive identical items (the glob result is already sorted). The final call to sort sorts the extensions by use count.
Answered by Gilles 'SO- stop being evil' on September 13, 2020