(Last edit: April 11, 2023)
Summary
csvkit
is a suite of command-line tools for converting to and working with CSV, the king of tabular file formats. (csvkit
is written in python.)csvkit
Documentation: csvkit.rtfd.orgcsvkit
Repository: github.com/wireservice/csvkit
- A good docker container for
csvkit
: new name : leplusorg/csv (old name: thomasleplus/csv) - Inspiration: stack overflow article: convert-xlsx-file-to-csv-using-batch
How to deal with many Excel files
If I had had the information I was looking for within a single Excel spreadsheet I would not have found csvkit
because I would simply have copied the column containing the data I wanted, in that case a list of email. But the data was spread among 26 different .xlsx
Excel files and there had to be a better way. (The mundane details of why/how/what/etc. can be found in a blog I wrote elsewhere: in2csv: the “Excel killer” is part of csvkit …)
in2csv and csvkit to the rescue
in2csv is a csvkit tool to easily convert the .xlsx
file into a .csv
file which is plain text and universal.
Once converted to .csv
files other commands can be used to process the data such as extracting just one column, sorting data, or creating an output in various formats.
Docker implementation
Implementing csvkit is probably not very hard if one follows the instructions from the documentation (installing-csvkit.) However, the standard command given: sudo pip install csvkit
should work in most cases, but does not take into account what I personally find difficult with python: installing willy-nilly on a system without caution (fancy users use virtual environments but casual users usually don’t know about that.) Therefore I now prefer to use dockerized (containerized) versions of such software to avoid creating conflicts on my (Mac) system. Docker is a great method to implement a software without installing it directly on one’s computer. (See my Docker tutorials for Biologists to learn and practice.)
I found a well documented docker option to run in2csv
: https://hub.docker.com/r/thomasleplus/csv
On a Mac or Linux the command to launch the docker container while sharing the current directory (pwd
) an making it the default directory (-w
) within the container would be:
docker run --rm -it -v $(pwd):/data -w /data leplusorg/csv
Note: on Windows the command would be:
For
cmd
:
docker run --rm -it -v "%cd%:/data" -w /data leplusorg/csv
For PowerShell:
docker run --rm -it -v "${PWD}:/data" -w /data leplusorg/csv
Note: docker has to be installed on your system. The docker image will be automatically pulled from the repository unless you pull it ahead of time.)
I was interested in a solution that can create a “loop” in the command line to operate on all files. Since I work on *Nix systems (Mac/Linux/Unix) the default would be the bash
shell, but there are probably solutions for DOS (cmd
) and PowerShell as well that would work in a similar way.
I found the loop solution within a stack overflow article: convert-xlsx-file-to-csv-using-batch that works. I only had to modify it to remove the first 2 lines of the files as they were blank. The first line on the reamining data was then the column header. The bash
command then was like this to convert all .xlsx
files into .csv
files:
for x in $(ls *.xlsx); do x1=${x%".xlsx"}; in2csv $x | sed 1,2d > $x1.csv; echo "$x1.csv done."; done
After that, I could use another too to extract just the column I wanted after identifying which number it was on a list using one of the files: csvcut -n Registrations01.csv
which identified the email column as column 17
:
The output is a numbered list of column headers
17: Email
18: Title
19: Classification
Then a simple loop could extract the 17th column for all files and redirect the result into a new text file:
for f in Registrations*.csv; do csvcut -c 17 $f >> x; done
Now file x
contains all emails and can be sorted as unique with bash
command, for example:
fgrep @ x | sort -u > unique_emails.txt
This kit has many more commands in the categories: Input, Processing, Output and Analysis (see Reference page).
This is certainly a very nice tool!
EDITS: (April 9, 2021)
- Add
.csv
at the end for theRegistrations*
files. - Note: remove
x
withrm x
to re-run the command or previous data will remain - Note: to remove cancelled registrations:
rm x; for f in Registrations*.csv; do fgrep -v -i Cancelled $f | csvcut -c 17 >> x; done
Image credits: Smiley, Excel- icons