Fast line count

The following is a simple bash script that estimates the number of lines in ginormous text files (like a 20 GB CSV).

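#!/bin/bash

# estimate the line count of a file without reading the entire file
# accuracy can be adjusted by changing the $linenum parameter

path=$1        # file to measure
linenum=$2     # number of lines to sample from each end of the file
# bytes in the first and in the last $linenum lines
head=$(head -n "$linenum" "$path" | wc -c)
tail=$(tail -n "$linenum" "$path" | wc -c)
# total size in bytes; reading from stdin keeps the filename out of the output
totalBytes=$(wc -c < "$path")

# average bytes per line at each end, then the mean of the two
headAvg=$((head / linenum))
tailAvg=$((tail / linenum))
totalAvg=$(((headAvg + tailAvg) / 2))

# total size divided by the average line length approximates the line count
estimatedlines=$((totalBytes / totalAvg))
echo "$estimatedlines"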
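
The same idea can also be wrapped in a function that guards a couple of edge cases. This is only a sketch, not part of the original script: the name `estimate_lines` is hypothetical, it assumes the file is a regular file, and it swaps the three integer divisions for a single one so less precision is lost to rounding. When the sample already covers the whole file, it falls back to an exact count.

```bash
# Hypothetical helper: estimate_lines FILE SAMPLE_LINES
estimate_lines() {
  local path=$1 sample=$2
  local total head_bytes tail_bytes

  # $(( ... )) also strips the padding some wc implementations add
  total=$(( $(wc -c < "$path") ))
  head_bytes=$(( $(head -n "$sample" "$path" | wc -c) ))

  # Small (or empty) file: the sample is the whole file, so count exactly
  if [ "$head_bytes" -ge "$total" ]; then
    echo $(( $(wc -l < "$path") ))
    return
  fi

  tail_bytes=$(( $(tail -n "$sample" "$path" | wc -c) ))

  # average line length = (head_bytes + tail_bytes) / (2 * sample),
  # so lines ~= total / average, done as one integer division
  echo $(( total * 2 * sample / (head_bytes + tail_bytes) ))
}
```

Calling `estimate_lines ./somefile.txt 100000` would print the estimate, in the same way as the script.
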
Example of usage and benchmark

time ./lcount.sh ./somefile.txt 100000  
2234932

real    0m2.707s  
user    0m0.974s  
sys     0m1.509s

time wc -l ./somefile.txt  
2248443 ./somefile.txt

real    1m30.088s  
user    0m0.794s  
sys     0m6.408s

The estimate, in this case, is about 99.4% accurate (2234932 estimated lines against 2248443 actual lines), and the accuracy can be adjusted using the linenum argument.
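
For reference, the accuracy figure can be reproduced from the two counts in the benchmark. The snippet below is a small sketch; the `accuracy` function is a hypothetical helper, and it assumes accuracy is defined as 100 times one minus the relative error. With the numbers above it comes out at roughly 99.4%.

```bash
# Hypothetical helper: accuracy ESTIMATED ACTUAL
# Prints the accuracy as a percentage with two decimals.
accuracy() {
  awk -v est="$1" -v act="$2" 'BEGIN {
    d = est - act
    if (d < 0) d = -d            # absolute error
    printf "%.2f\n", 100 * (1 - d / act)
  }'
}

# Estimated and actual line counts from the benchmark above
accuracy 2234932 2248443
```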

Davide Andreazzini
