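Using Spark, Part 1

1. Environment Setup

Install the following software before you start:

1.1 JDK: JDK 8 or JDK 11 is recommended. Spark 3.x supports JDK 11 well, but JDK 8 is the safest choice.
1.2 IntelliJ IDEA: either the Community or the Ultimate edition works.
1.3 Maven: used for dependency management and for packaging the project.
1.4 Scala plugin: in IDEA, open Settings > Plugins, search for "Scala", install it, and restart. This step can also wait until you actually start writing Scala.

2. Create a Maven Project

2.1 In IDEA, open File > New > Project from the top-left menu.
2.2 Create a New Project: enter a name, choose Java as the language, choose Maven as the build system (this makes packaging dependencies easier later), and pick JDK 8 as the JDK.
2.3 Right-click the main directory and choose New > Directory; name it scala.
2.4 Right-click the scala directory and choose Mark Directory as > Sources Root.
2.5 Configure pom.xml (the core step). The pom.xml sets the Scala version, the Spark version, and the packaging plugins:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>sparkmysqltohive</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <!-- The Scala version is split into two properties -->
        <!-- Binary version: used as the suffix of the Spark artifact names -->
        <scala.binary.version>2.12</scala.binary.version>
        <!-- Exact Scala library version (2.12.17 is recommended for Spark 3.4.x) -->
        <scala.version>2.12.17</scala.version>
        <!--<spark.version>3.3.1</spark.version> -->
        <spark.version>3.4.2</spark.version>
        <hive.version>3.1.3</hive.version>
        <hadoop.version>3.2.2</hadoop.version>
        <slf4j.version>1.7.29</slf4j.version>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
        <!-- <scope>compile</scope> -->
        <scope>provided</scope>
    </properties>

    <dependencies>
        <!-- spark -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.binary.version}</artifactId>
            <version>${spark.version}</version>
            <scope>${scope}</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.binary.version}</artifactId>
            <version>${spark.version}</version>
            <scope>${scope}</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_${scala.binary.version}</artifactId>
            <version>${spark.version}</version>
            <scope>${scope}</scope>
        </dependency>

        <!-- fastjson -->
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.75</version>
        </dependency>

        <!-- log4j / slf4j -->
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>${slf4j.version}</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>${slf4j.version}</version>
        </dependency>
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>1.2.17</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <!-- Scala compiler plugin: without it, .scala files are not compiled -->
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>4.8.1</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>2.4</version>
                <configuration>
                    <appendAssemblyId>false</appendAssemblyId>
                    <descriptorRefs>
                        <!-- Bundle the class files of the dependency jars into the generated jar -->
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <!-- "single" is the goal intended for binding to a lifecycle phase -->
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>

        <resources>
            <resource>
                <directory>src/main/resources</directory>
                <filtering>false</filtering>
                <includes>
                    <!-- Package only the files you really need, such as SQL scripts or properties files -->
                    <include>**/*.properties</include>
                </includes>
                <excludes>
                    <!-- Important: exclude the Hadoop/Hive/Kerberos configuration used for local debugging -->
                    <exclude>**/*.xml</exclude>
                    <exclude>**/*.keytab</exclude>
                    <exclude>**/krb5.conf</exclude>
                    <exclude>**/log4j*.properties</exclude>
                </excludes>
            </resource>
        </resources>
    </build>
</project>
```

When the configuration is done, click Load Maven Changes (the refresh icon at the top right) to download the dependencies.

2.6 Add Scala framework support. Because the project was created as a Maven project, IDEA does not know by default that you intend to write Scala code.

2.6.1 In the project tree on the left, right-click the project root.
2.6.2 Choose Add Framework Support…
2.6.3 Check Scala and select the Scala version configured above (for example 2.12.17). If it is not listed, create one, then click OK.
2.6.4 IDEA will then create the src/main/scala and src/test/scala directories if they do not already exist.

2.7 Create a Scala object:

```scala
import org.apache.spark.sql.SparkSession

object sparkdemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkSQLDemo")
      .master("local[*]") // for local testing only
      .getOrCreate()

    // Brings toDF and the other implicit conversions into scope
    import spark.implicits._

    val df = Seq((1, "张三", 25), (2, "李四", 30)).toDF("id", "name", "age")
    df.createOrReplaceTempView("user")

    // Keep only the rows whose age is greater than 26
    spark.sql("SELECT * FROM user WHERE age > 26").show()

    spark.stop()
  }
}
```

Fixing the "provided" pitfall when running locally (⚠️ very important)

In pom.xml, the scope of every Spark dependency is set to provided (the ${scope} property resolves to provided). A provided dependency is available at compile time but is not placed on the runtime classpath and is not packaged into the final JAR. This is the standard way to package a Spark project: the cluster already ships these JARs, so leaving them out keeps the artifact small and avoids version conflicts.

The downside is that clicking Run directly in IDEA fails with a ClassNotFoundException (often reported as a NoClassDefFoundError for a Spark class). The same happens when you launch the artifact with java -jar your.jar: the JVM only loads the classes inside the JAR, and the Spark classes are not among them.

Solution:

2.7.1 Click Run/Debug Configurations at the top right (or click the green triangle to the left of the main method and choose Modify Run Configuration).
2.7.2 Find Modify options (or press Alt+M).
2.7.3 Check Add dependencies with "provided" scope to classpath.
2.7.4 Click Apply, then OK.
2.7.5 Run the program. The IDEA console should now show the correct output:

```
+---+----+---+
| id|name|age|
+---+----+---+
|  2|李四| 30|
+---+----+---+
```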
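
To check whether the provided-scope dependencies actually reached the runtime classpath, a small diagnostic can help. The following is only a minimal sketch and is not part of the original project: the object name ClasspathCheck and the helper isOnClasspath are hypothetical names chosen for illustration. It uses only the standard JVM call Class.forName, so it runs with or without Spark present.

```scala
import scala.util.Try

// Hypothetical diagnostic helper (not from the original project).
object ClasspathCheck {

  /** Returns true if the named class can be located on the runtime classpath.
    * The class is looked up without being initialized.
    */
  def isOnClasspath(className: String): Boolean =
    Try(Class.forName(className, false, getClass.getClassLoader)).isSuccess

  def main(args: Array[String]): Unit = {
    val sparkClass = "org.apache.spark.sql.SparkSession"
    if (isOnClasspath(sparkClass)) {
      println(sparkClass + " found: Spark is on the runtime classpath")
    } else {
      println(sparkClass + " missing: the provided-scope dependencies were not added to the classpath")
    }
  }
}
```

Run it from IDEA once before and once after enabling the "provided" option in step 2.7.3: the first run should report the class as missing, the second as found.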