Spache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Spark uses DAG Engine (directed acyclic graph) to optimize workflows and is built around one main concept: the Resilient Distributed Dataset (RDD).

Table of Subtopics

Spark Eco-system

大数据 - Spark_html_ad03d35bdeecaadd.png

  • Spark: 其核心是Spark Core,它是Spark应用的基础,包含RDD的API、RDD操作和其它动作(Action)。下面其它Spark的库都是构建在Spark Core之上的
  • Spark SQL:提供通过Apache Hive的SQL变体Hive查询语言(HiveQL)与Spark进行交互的API。每个数据库表被当做一个RDD,Spark SQL查询被转换为Spark操作。
  • Spark Streaming:对实时数据流进行处理和控制。Spark Streaming允许程序能够像普通RDD一样处理实时数据
  • MLlib:一个常用机器学习算法库,算法被实现为对RDD的Spark操作。这个库包含可扩展的学习算法,比如分类、回归等需要对大量数据集进行迭代的操作。
  • GraphX:控制图、并行图操作和计算的一组算法和工具的集合。GraphX扩展了RDD API,包含控制图、创建子图、访问路径上所有顶点的操作

Reference

Tags:
Created by Bin Chen on 2019/11/16 08:03
    

Need help?

If you need help with XWiki you can contact:

京ICP备19054609号-1

京公网安备 11010502039855号