Published by Addison-Wesley Professional (June 18, 2018) © 2018

Jeffrey Aven
    VitalSource eTextbook (Lifetime access)
    €32,99
    ISBN-13: 9780134844879

    Data Analytics with Spark Using Python, 1st edition

    Language: English

    Product Information

    Solve Data Analytics Problems with Spark, PySpark, and Related Open Source Tools

    Spark is at the heart of today’s Big Data revolution, helping data professionals supercharge efficiency and performance in a wide range of data processing and analytics tasks. In this guide, Big Data expert Jeffrey Aven covers all you need to know to leverage Spark, together with its extensions, subprojects, and wider ecosystem.

    Aven combines a language-agnostic introduction to foundational Spark concepts with extensive programming examples utilizing the popular and intuitive PySpark development environment. This guide’s focus on Python makes it widely accessible to large audiences of data professionals, analysts, and developers—even those with little Hadoop or Spark experience.

    Aven’s broad coverage ranges from basic to advanced Spark programming, and Spark SQL to machine learning. You’ll learn how to efficiently manage all forms of data with Spark: streaming, structured, semi-structured, and unstructured. Throughout, concise topic overviews quickly get you up to speed, and extensive hands-on exercises prepare you to solve real problems.

    You will learn how to:
    • Understand Spark’s evolving role in the Big Data and Hadoop ecosystems
    • Create Spark clusters using various deployment modes
    • Control and optimize the operation of Spark clusters and applications
    • Master Spark Core RDD API programming techniques
    • Extend, accelerate, and optimize Spark routines with advanced API platform constructs, including shared variables, RDD storage, and partitioning
    • Efficiently integrate Spark with both SQL and nonrelational data stores
    • Perform stream processing and messaging with Spark Streaming and Apache Kafka
    • Implement predictive modeling with SparkR and Spark MLlib

    Preface
    Introduction

    PART I:  SPARK FOUNDATIONS
    Chapter 1  Introducing Big Data, Hadoop, and Spark

    Introduction to Big Data, Distributed Computing, and Hadoop
         A Brief History of Big Data and Hadoop
         Hadoop Explained
    Introduction to Apache Spark
         Apache Spark Background
         Uses for Spark
         Programming Interfaces to Spark
         Submission Types for Spark Programs
         Input/Output Types for Spark Applications
         The Spark RDD
         Spark and Hadoop
    Functional Programming Using Python
         Data Structures Used in Functional Python Programming
         Python Object Serialization
         Python Functional Programming Basics
    Summary
    Chapter 2  Deploying Spark
    Spark Deployment Modes
         Local Mode
         Spark Standalone
         Spark on YARN
         Spark on Mesos
    Preparing to Install Spark
    Getting Spark
    Installing Spark on Linux or Mac OS X
    Installing Spark on Windows
    Exploring the Spark Installation
    Deploying a Multi-Node Spark Standalone Cluster
    Deploying Spark in the Cloud
         Amazon Web Services (AWS)
         Google Cloud Platform (GCP)
         Databricks
    Summary
    Chapter 3  Understanding the Spark Cluster Architecture
    Anatomy of a Spark Application
         Spark Driver
         Spark Workers and Executors
         The Spark Master and Cluster Manager
    Spark Applications Using the Standalone Scheduler
         Spark Applications Running on YARN
    Deployment Modes for Spark Applications Running on YARN
         Client Mode
         Cluster Mode
         Local Mode Revisited
    Summary
    Chapter 4  Learning Spark Programming Basics
    Introduction to RDDs
    Loading Data into RDDs
         Creating an RDD from a File or Files
         Methods for Creating RDDs from a Text File or Files
         Creating an RDD from an Object File
         Creating an RDD from a Data Source
         Creating RDDs from JSON Files
         Creating an RDD Programmatically
    Operations on RDDs
         Key RDD Concepts
         Basic RDD Transformations
         Basic RDD Actions
         Transformations on PairRDDs
         MapReduce and Word Count Exercise
         Join Transformations
         Joining Datasets in Spark
         Transformations on Sets
         Transformations on Numeric RDDs
    Summary

    PART II:  BEYOND THE BASICS
    Chapter 5  Advanced Programming Using the Spark Core API

    Shared Variables in Spark
         Broadcast Variables
         Accumulators
         Exercise: Using Broadcast Variables and Accumulators
    Partitioning Data in Spark
         Partitioning Overview
         Controlling Partitions
         Repartitioning Functions
         Partition-Specific or Partition-Aware API Methods
    RDD Storage Options
         RDD Lineage Revisited
         RDD Storage Options
         RDD Caching
         Persisting RDDs
         Choosing When to Persist or Cache RDDs
         Checkpointing RDDs
         Exercise: Checkpointing RDDs
    Processing RDDs with External Programs
    Data Sampling with Spark
    Understanding Spark Application and Cluster Configuration
         Spark Environment Variables
         Spark Configuration Properties
    Optimizing Spark
         Filter Early, Filter Often
         Optimizing Associative Operations
         Understanding the Impact of Functions and Closures
         Considerations for Collecting Data
         Configuration Parameters for Tuning and Optimizing Applications
         Avoiding Inefficient Partitioning
         Diagnosing Application Performance Issues
    Summary
    Chapter 6  SQL and NoSQL Programming with Spark
    Introduction to Spark SQL
         Introduction to Hive
         Spark SQL Architecture
         Getting Started with DataFrames
         Using DataFrames
         Caching, Persisting, and Repartitioning DataFrames
         Saving DataFrame Output
         Accessing Spark SQL
         Exercise: Using Spark SQL
    Using Spark with NoSQL Systems
         Introduction to NoSQL
         Using Spark with HBase
         Exercise: Using Spark with HBase
         Using Spark with Cassandra
         Using Spark with DynamoDB
         Other NoSQL Platforms
    Summary
    Chapter 7  Stream Processing and Messaging Using Spark
    Introducing Spark Streaming
         Spark Streaming Architecture
         Introduction to DStreams
         Exercise: Getting Started with Spark Streaming
         State Operations
         Sliding Window Operations
    Structured Streaming
         Structured Streaming Data Sources
         Structured Streaming Data Sinks
         Output Modes
         Structured Streaming Operations
    Using Spark with Messaging Platforms
         Apache Kafka
         Exercise: Using Spark with Kafka
         Amazon Kinesis
    Summary
    Chapter 8  Introduction to Data Science and Machine Learning Using Spark
    Spark and R
         Introduction to R
         Using Spark with R
         Exercise: Using RStudio with SparkR
    Machine Learning with Spark
         Machine Learning Primer
         Machine Learning Using Spark MLlib
         Exercise: Implementing a Recommender Using Spark MLlib
         Machine Learning Using Spark ML
    Using Notebooks with Spark
         Using Jupyter (IPython) Notebooks with Spark
         Using Apache Zeppelin Notebooks with Spark
    Summary
    Index