Python offers a great environment and a rich set of libraries for developers working with data. There are plenty of useful libraries that help novice and experienced developers and analysts alike with processing or visualizing datasets. Some of them are very popular and widely used, for example pandas, NumPy, scikit-learn and NLTK. Others are less well known but have turned out to be handy in my experience. This article introduces six such Python libraries for working with data. Readers may already be familiar with some of them, but I hope the article still proves useful.
mrjob is a useful Python library that lets you write MapReduce jobs in Python. You write your own mapper and reducer, run and test the job in your local environment, and then deploy it on EMR or your own Hadoop cluster. It can be easily installed using
pip install mrjob . mrjob is developed by Yelp and receives thousands of downloads every day. The GitHub page and the project page have plenty of documentation to help users set up their applications quickly.
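To make the mapper/reducer split concrete, here is a minimal word-count sketch in plain Python. It does not use mrjob itself: run_job is a hypothetical helper that imitates, in memory, the grouping a MapReduce framework performs between the two steps, and the sample lines are made up. With mrjob, the same two pieces of logic would live as mapper and reducer methods on a subclass of mrjob.job.MRJob.

```python
from collections import defaultdict


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Reduce step: sum all the counts emitted for one word."""
    return word, sum(counts)


def run_job(lines):
    """Hypothetical in-memory stand-in for the framework's shuffle phase."""
    grouped = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            grouped[word].append(count)
    return dict(reducer(word, counts) for word, counts in grouped.items())


# Illustrative input only.
sample = ["the quick brown fox", "the lazy dog", "The end"]
word_counts = run_job(sample)
print(word_counts)
```

The useful property is that mapper and reducer never see the whole dataset at once, which is what lets a framework spread them across machines.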
Working with and manipulating datetimes in Python is a pain. If you have worked with Python’s default datetime library and multiple time zones in your application, you have probably been frustrated at times. delorean makes life easier by providing nicer abstractions over datetime and pytz. It has useful features for working with multiple time zones, normalizing time zones and shifting from one time zone to another. The package is maintained by Mahdi Yusuf and has great documentation on the project page.
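For context, this is roughly what a time zone shift looks like with only the standard library. It is a sketch of the bookkeeping that delorean wraps, not delorean's own API. The date and the fixed-offset IST zone are my own illustrative choices; named zones with daylight saving rules are the part pytz supplies.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset zone used for illustration. India has no daylight saving time,
# so a constant +05:30 offset is accurate here.
IST = timezone(timedelta(hours=5, minutes=30), "IST")

# Attach a time zone explicitly; naive datetimes cannot be shifted reliably.
meeting_utc = datetime(2015, 3, 1, 12, 0, tzinfo=timezone.utc)

# Shift the same instant into another time zone.
meeting_ist = meeting_utc.astimezone(IST)

print(meeting_utc.isoformat())  # 2015-03-01T12:00:00+00:00
print(meeting_ist.isoformat())  # 2015-03-01T17:30:00+05:30
```

Both objects describe the same instant, so comparing them with == gives True even though the wall-clock values differ.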
The built-in sorted() function is efficient and will serve most of your sorting needs. But if you ever have to sort a list like
['a2', 'a9', 'a1', 'a4', 'a10'] , you will either have to roll your own solution or look for an external library. Thankfully, natsort comes to the rescue. It helps you sort lists of strings that also contain numbers. Normal Python sorting orders the list lexicographically, so 'a10' lands before 'a2', which is usually not what you want. natsort provides a function
natsorted() , that sorts lists like this in natural order. You can also mix and match integers, floats and strings when you sort. The official project page has more details on usage and documentation. You may only need this in special cases, but it is definitely handy when you do.
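The difference is easy to see with the list from above. The sketch below uses only the standard library: natural_key is a hypothetical roll-your-own helper, shown to illustrate what natsort saves you from writing. It handles only plain non-negative integers embedded in strings.

```python
import re


def natural_key(text):
    """Split text into alternating string and integer chunks.

    re.split with a capturing group always yields string chunks at even
    positions and digit runs at odd positions, so the keys stay comparable.
    """
    parts = re.split(r"([0-9]+)", text)
    return [int(part) if index % 2 else part
            for index, part in enumerate(parts)]


items = ['a2', 'a9', 'a1', 'a4', 'a10']

lexicographic = sorted(items)
natural = sorted(items, key=natural_key)

print(lexicographic)  # ['a1', 'a10', 'a2', 'a4', 'a9']
print(natural)        # ['a1', 'a2', 'a4', 'a9', 'a10']
```

With natsort installed, natsorted(items) is the one-line replacement for the hand-written key, and it is where the support for floats and mixed types comes from.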
You do not always need a huge multi-node database with running daemons for your application. TinyDB is a small document-oriented database that lets you insert JSON documents into a local file and query them. It has about 1200 lines of code including documentation, with a simple and clean API. Although it lacks features such as multithreading support and data indexing, it should suffice if you are looking for a small, hassle-free database for small projects without the overhead of setting it up or configuring it. It can be installed using
pip install tinydb . Refer to the project documentation for usage information.
PrettyTable lets you draw beautiful ASCII tables on the console from multiple data sources. This is extremely useful when you are pretty-printing tabular data in a terminal. It has options to select specific columns to display, sort by a column, align each column to the left or right, and print the table in an MS Word friendly format or as an HTML table. PrettyTable can read from existing data sources such as a CSV file or a database cursor. I have used this package regularly for the past few years and liked it so much that I ported it to Node as a module. The original source code is hosted on Google Code, and the project readme is also available on a mirror GitHub repository.
Vincent is a cool visualization tool that takes Python data structures and translates them into the Vega visualization grammar, which works on top of d3.js. This lets you create beautiful d3-based visualizations right from your Python scripts. Vincent uses pandas DataFrames under the hood and currently supports a wide range of visualizations: bar, line, scatter, area, stacked bar, grouped bar, pie/donut, map and so on. The API is simple and, coupled with other data analysis tools, Vincent allows you to make beautiful IPython notebooks. The GitHub page has some examples, and the project documentation is also available online.