csvkit: command-line tools to convert and process multiple Excel files

Summary

How to deal with many Excel files

If I had had the information I was looking for within a single Excel spreadsheet, I would never have found csvkit, because I would simply have copied the column containing the data I wanted, in that case a list of emails. But the data was spread across 26 different .xlsx Excel files, and there had to be a better way. (The mundane details of why/how/what/etc. can be found in a blog I wrote elsewhere: in2csv: the “Excel killer” is part of csvkit …)

in2csv and csvkit to the rescue

in2csv is a csvkit tool that easily converts .xlsx files into .csv files, which are plain text and universal.

Once the files are converted to .csv, other csvkit commands can be used to process the data, such as extracting a single column, sorting rows, or creating output in various formats.

Docker implementation

Installing csvkit is probably not very hard if one follows the instructions in the documentation (installing-csvkit). The standard command given, sudo pip install csvkit, should work in most cases, but it illustrates what I personally find difficult with Python: packages get installed willy-nilly on a system without caution (advanced users use virtual environments, but casual users usually don't know about them). Therefore I now prefer to use dockerized (containerized) versions of such software to avoid creating conflicts on my (Mac) system. Docker is a great way to use a piece of software without installing it directly on one's computer. (See my Docker tutorials for Biologists to learn and practice.)

I found a well-documented Docker option to run in2csv: https://hub.docker.com/r/thomasleplus/csv

On a Mac or Linux, the command to launch the Docker container while sharing the current directory (pwd) and making it the default directory (-w) within the container would be:

docker run --rm -it -v $(pwd):/data -w /data thomasleplus/csv

Note: on Windows the command would be:

For cmd:

docker run --rm -it -v "%cd%:/data" -w /data thomasleplus/csv

For PowerShell:

docker run --rm -it -v "${PWD}:/data" -w /data thomasleplus/csv

Note: Docker has to be installed on your system. The Docker image will be pulled automatically from the repository unless you pull it ahead of time.

I was interested in a solution that could create a “loop” on the command line to operate on all the files. Since I work on *nix systems (Mac/Linux/Unix), the default is the bash shell, but there are probably similar solutions for DOS (cmd) and PowerShell as well.

I found a working loop solution in a Stack Overflow article: convert-xlsx-file-to-csv-using-batch. I only had to modify it to remove the first 2 lines of each file, as they were blank. The first line of the remaining data was then the column header. The bash command to convert all .xlsx files into .csv files was then:

for x in *.xlsx; do x1=${x%.xlsx}; in2csv "$x" | sed 1,2d > "$x1.csv"; echo "$x1.csv done."; done
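The key trick in that loop is bash's % parameter expansion, which strips a trailing suffix from a variable to build the matching .csv name. A minimal standalone illustration (the file name below is hypothetical):

```shell
# Strip the ".xlsx" suffix to derive the output file name.
x="Registrations01.xlsx"
x1=${x%.xlsx}      # removes the trailing ".xlsx" match
echo "$x1.csv"     # prints: Registrations01.csv
```

Note that ${x%pattern} removes the shortest trailing match, so a file without the .xlsx suffix would pass through unchanged.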

After that, I could use another tool, csvcut, to extract just the column I wanted, after identifying its number by listing the columns of one of the files: csvcut -n Registrations01.csv, which identified the email column as column 17:

The output is a numbered list of column headers

 17: Email
 18: Title
 19: Classification

Then a simple loop could extract the 17th column for all files and redirect the result into a new text file:

for f in Registrations*.csv; do csvcut -c 17 "$f" >> x; done
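The same append-as-you-loop pattern can be sketched with plain coreutils when csvkit is not installed locally. The sketch below uses cut as a stand-in, with made-up file names and data; note that cut splits naively on commas, so csvcut remains the right tool for real CSV files with quoted fields:

```shell
# Create two toy CSV files (hypothetical data).
printf 'name,email\nann,ann@example.com\n' > Demo01.csv
printf 'name,email\nbob,bob@example.com\n' > Demo02.csv

# Append column 2 of every matching file into one output file,
# mimicking the csvcut loop above.
: > emails.txt
for f in Demo*.csv; do cut -d, -f2 "$f" >> emails.txt; done
# emails.txt now holds the header and address from each file;
# the repeated "email" headers are cleaned up in the next step.
```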

Now the file x contains all the emails, which can be deduplicated and sorted with standard bash commands, for example:

fgrep @ x | sort -u > unique_emails.txt

This kit has many more commands in the categories Input, Processing, Output, and Analysis (see the Reference page).

This is certainly a very nice tool!


Image credits: Smiley, Excel icons