神刀安全网

Feather: A Fast On-Disk Format for Data Frames for R and Python, powered by Apache Arrow

This past January, we (Hadley and Wes) met and discussed some of the systems challenges facing the Python and R open source communities. In particular, we wanted to explore opportunities to collaborate on tools for improving interoperability between Python, R, and external compute and storage systems.

One thing that struck us was that, while R’s data frames and Python’s pandas data frames utilize different internal memory representations, the semantics of their user data types are mostly the same. In both R and pandas, data frames contain lists of named, equal-length columns, which can be numeric, boolean, and date-and-time, categorical ( factors), or string. Additionally, these columns must support missing (null) values.

Around this time, the open source community had just started the newApache Arrow project, designed to improve data interoperability for systems dealing with columnar tabular data. In discussing Arrow in the context of Python and R, we wanted to see if we could design a very fast file format for storing data frames that could be used by both languages. Thus, the Feather format was born.

What is Feather?

Feather is a fast, lightweight, and easy-to-use binary file format for storing data frames. It has a few specific design goals:

  • Lightweight, minimal API: make pushing data frames in and out of memory as simple as possible
  • Language agnostic: Feather files are the same whether written by Python or R code. Other languages can read and write Feather files, too.
  • High read and write performance. When possible, Feather operations should be bound by local disk performance.

Code Examples

The Feather API is designed to make reading and writing data frames as easy as possible. In R, the code might look like:

library(feather)   path <- "my_data.feather" write_feather(df, path)   df <- read_feather(path) 

Analogously, in Python, we have:

importfeather   path = 'my_data.feather'   feather.write_dataframe(df, path) df = feather.read_dataframe(path) 

How Fast is Feather?

Feather is extremely fast. Since Feather does not currently use any compression internally, it works best when used with solid-state drives as come with most of today’s laptop computers. For this first release, we prioritized a simple implementation and are thus writing unmodified Arrow memory to disk.

To give you an idea of Feather’s speed, here is a Python benchmark writing an approximately 800MB pandas DataFrame to disk:

importfeather importpandasas pd importnumpyas np arr = np.random.randn(10000000) # 10% nulls arr[::10] = np.nan df = pd.DataFrame({'column_{0}'.format(i): arrfor i in range(10)}) feather.write_dataframe(df, 'test.feather') 

On Wes’s laptop (latest-gen Intel processor with SSD), this takes:

In [9]: %timedf = feather.read_dataframe('test.feather') CPUtimes: user 316 ms, sys: 944 ms, total: 1.26 s Walltime: 1.26 s   In [11]: 800 / 1.26 Out[11]: 634.9206349206349 

This is effective performance of over 600 MB/s. Of course, the performance you see will depend on your hardware configuration.

And in R (on Hadley’s laptop, which is very similar):

library(feather)   x <- runif(1e7) x[sample(1e7, 1e6)] <- NA # 10% NAs df <- as.data.frame(replicate(10, x)) write_feather(df, 'test.feather')   system.time(read_feather('test.feather')) #>   user  system elapsed #>  0.731   0.287   1.020 

How Can I Get Feather?

The Feather source code is hosted at http://github.com/wesm/feather .

Installing Feather for R

Feather is currently available from github, and you can install with:

devtools::install_github(“wesm/feather/R”) 

Feather uses C++11, so if you’re on Windows, you’ll need the new gcc 4.93 toolchain . (All going well, this toolchain will be included in R 3.3.0, which is scheduled for release within weeks. We’ll aim for a CRAN release soon after that.)

Installing Feather for Python

For Python, you can install Feather from PyPI like so:

$ pipinstallfeather-format 

Feather has only been tested on OS X and Linux, and requires a C++11 compiler (gcc 4.8 and higher or XCode 6 and up). We will look into providing more installation options, such as conda builds, in the future.

When Should You Not Use Feather?

Feather is not designed for long-term data storage. At this time, we do not guarantee that the file format will be stable across versions. Instead, use Feather for quickly exchanging data between Python and R code, or for short-term storage of data frames as part of some analysis.

Feather, Apache Arrow, and the Community

One of the great parts of Feather is that the file format is language agnostic. Other languages, such as Julia or Scala (for Spark users), can read and write the format without knowledge of details of Python or R.

Feather is one of the first projects to bring the tangible benefits of the Arrow spec to users in the form of an efficient, language-agnostic representation of tabular data on disk. Since Arrow does not provide for a file format, we are using Google’sFlatbuffers library to serialize column types and related metadata in a language-independent way in the file.

The Python interface uses Cython to expose Feather’s C++11 core to users, while the R interface uses Rcpp for the same task.

We’re interested in evolving the Feather format to support the needs of more Python and R users, as well as other seeing bindings for Feather built in other data analysis languages. There will also be plenty of opportunities to get involved in improving Feather’s performance or storage efficiency through compression or other techniques.

Wes McKinney is a Software Engineer at Cloudera, the founder of the pandas and Ibis projects, and an Apache Arrow committer. 

Hadley Wickham is Chief Scientist at RStudio and Adjunct Professor of Statistics at Rice University.

转载本站任何文章请注明:转载至神刀安全网,谢谢神刀安全网 » Feather: A Fast On-Disk Format for Data Frames for R and Python, powered by Apache Arrow

分享到:更多 ()

评论 抢沙发

  • 昵称 (必填)
  • 邮箱 (必填)
  • 网址
分享按钮