On Tuesday, March 15, Syncfusion hosted a webinar demonstrating how the Syncfusion Big Data Platform makes developing and deploying a big data solution much easier. Marissa Keller Outten, director of business development at Syncfusion, moderated the webinar, and Daniel Jebaraj, vice president of product development, demonstrated nearly every component in the platform to give attendees a well-rounded, high-level summary.
A recap of the webinar follows, including the question-and-answer session, which may address any concerns you have about the platform. If you still have questions, you can email us at firstname.lastname@example.org or call us at 1 888-9DOTNET. You can always download a free trial of the platform and see for yourself just what it can do.
The Syncfusion Big Data Platform is a complete distribution of Apache Hadoop for use in both production and development environments. It is an end-to-end solution that covers data storage, processing, visualization, and predictive modeling.
Big Data Studio
The Big Data Studio provides a developer edition of Apache Hadoop for offline, local development. It includes a local service manager that controls the Hadoop services running on your machine during development, so you can better manage the resources they consume. Services can be started and stopped individually or by category, such as core Hadoop services or HBase services.
The second part of the Big Data Studio is the development environment. It can connect to any Hadoop cluster, local or remote, and provides views for browsing the Hadoop Distributed File System, interfacing with the Hadoop system, submitting jobs with Pig and Hive, using any of the higher-level functional primitives supported in Apache Spark, working with tables in the NoSQL database HBase, and transferring data between relational databases and Hadoop with Sqoop.
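To give a sense of what "functional primitives" means here, the following is a minimal sketch in plain Python, not Spark code and not part of the platform. It applies map, filter, and reduce steps to a few made-up web log lines; the sample data and the add_pair helper are hypothetical. Spark exposes operations with similar names (such as map, filter, and reduceByKey) that run across a cluster rather than over a local list.

```python
from functools import reduce

# Hypothetical sample of web server log lines: method, path, status.
log_lines = [
    "GET /index.html 200",
    "GET /missing 404",
    "POST /login 200",
    "GET /index.html 200",
]

# map: split each line into a (method, path, status) tuple.
parsed = map(lambda line: tuple(line.split()), log_lines)

# filter: keep only the successful requests.
successful = filter(lambda record: record[2] == "200", parsed)

# map: turn each record into a (path, 1) pair so hits can be summed.
pairs = map(lambda record: (record[1], 1), successful)


def add_pair(counts, pair):
    """Fold one (path, count) pair into a new dictionary of totals."""
    path, count = pair
    merged = dict(counts)
    merged[path] = merged.get(path, 0) + count
    return merged


# reduce: combine all pairs into a single hits-per-path summary.
hits_per_path = reduce(add_pair, pairs, {})
print(hits_per_path)
```

Each step takes a function and a collection and returns a new collection, which is what lets a system like Spark distribute the same pipeline over many nodes.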
The Cluster Manager is an easy-to-use interface for managing nodes and clusters in a production system. The primary view lists every node and cluster, whether running, stopped, or in another state, along with the services running on each. From the management page, clusters can be backed up and restored, and data nodes can be added and removed. Clusters can also be created, either manually or automatically. Either way, it takes only a few minutes to get a new cluster up and running. The monitoring page provides several views for examining the jobs running on a cluster, the status of different nodes, and more. The Cluster Manager also includes a full Oozie implementation for scheduling workflows, and IPython support for running visualizations.
What sets the Syncfusion Big Data Platform apart from other solutions is that it is the only one designed specifically for Windows. Most Windows developers are comfortable writing C#, and now they can use it to write Hadoop jobs. In addition to C#, the platform supports Java, Python, Scala, and other languages. In particular, Hive and HBase can be accessed directly using C#, making them much easier to work with if you’re in a .NET environment. Numerous samples ship with the Big Data Platform to demonstrate the full extent of its C# support.
Another exciting component of the Big Data Platform is its predictive analytics support and PMML library. After you perform a statistical analysis in Spark, such as a linear regression, the model can be persisted to a PMML file. By using the PMML library, the persisted model can be executed at runtime inside existing C# applications, whether they’re on the desktop or the web. The library implements many machine learning algorithms that can be used for plotting predictions, classifications, and regressions.
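To illustrate the idea of executing a persisted model at runtime, here is a minimal sketch in Python using only the standard library. It is not the Syncfusion PMML library or its API. It assumes a small, hand-written PMML document describing the hypothetical linear model y = 1.5 + 2.0*x1 - 0.5*x2, and the function names load_linear_model and predict are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical PMML file contents for y = 1.5 + 2.0*x1 - 0.5*x2.
PMML_TEXT = """<PMML xmlns="http://www.dmg.org/PMML-4_2" version="4.2">
  <DataDictionary numberOfFields="3">
    <DataField name="x1" optype="continuous" dataType="double"/>
    <DataField name="x2" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="x1"/>
      <MiningField name="x2"/>
      <MiningField name="y" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="1.5">
      <NumericPredictor name="x1" coefficient="2.0"/>
      <NumericPredictor name="x2" coefficient="-0.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""


def load_linear_model(pmml_text):
    """Read the intercept and coefficients from a PMML regression table."""
    root = ET.fromstring(pmml_text)
    # Drop the XML namespace so element lookups do not depend on the PMML version.
    for element in root.iter():
        element.tag = element.tag.split("}", 1)[-1]
    table = root.find("./RegressionModel/RegressionTable")
    if table is None:
        raise ValueError("no RegressionTable found in the PMML document")
    intercept = float(table.get("intercept"))
    # This sketch ignores the optional exponent attribute and treats it as 1.
    coefficients = {
        predictor.get("name"): float(predictor.get("coefficient"))
        for predictor in table.findall("NumericPredictor")
    }
    return intercept, coefficients


def predict(model, row):
    """Score one input row with the loaded linear model."""
    intercept, coefficients = model
    return intercept + sum(c * row[name] for name, c in coefficients.items())


model = load_linear_model(PMML_TEXT)
prediction = predict(model, {"x1": 3.0, "x2": 4.0})
print(prediction)
```

The point of the format is that the application scoring the data needs only the PMML file, not Spark or the original training data.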
Questions and Answers
Edited for brevity. Responses from Daniel Jebaraj, vice president of product development.
Where is the data actually stored? Is the Syncfusion product sitting on Azure or Amazon data centers?
You have two options: on-premises or cloud data stores. Everything presented in the webinar was completely on-premises. It’s on my laptop, and I have a cluster that’s sitting near me. When you use the Syncfusion Big Data Platform with Azure, you get to control everything. Syncfusion is just providing the software, which is unique to our offering. It’s not a cloud offering. If you want to host it on data centers in Europe, you can certainly do that. You can use Amazon also, but you have to configure it manually.
Can any of the nodes in the cluster run on Windows Nano Server once it comes out?
It should be able to run on Nano Server. This is basically a use case for us, because we know there is a trend toward containerization.
Which database is best for big data?
Big data is complementary to relational data stores. If you have use cases currently that are satisfied by relational data stores, I don’t expect that to change at all. But consider our own example: we are collecting gigabytes of log data from our website. We are bringing all of that information into our own cluster, and then we are able to draw conclusions and run analyses, and that’s very powerful. That’s something that we haven’t been able to do in the past. If you are really looking at a traditional use case with MVC and SQL Server, that will continue to be the same as before. But you can do so much more by taking that data and bringing it into Hadoop. That’s a parallel use case to consider. If you’re doing ETL workloads, if you’re doing machine learning, if you have SAS or Teradata or Tableau in-house, with the Syncfusion Big Data Platform, you can bring all of those into one place. You get everything in one chart.
On Azure, how does the Syncfusion Big Data Platform work with HDInsight? Usually an HDInsight cluster also costs money, so how is it all priced together?
HDInsight is Microsoft’s implementation of Hadoop. They run it on their own. What we do is fully compatible with HDInsight, but that’s a Microsoft offering. When the Syncfusion Big Data Platform is used on Azure, it is a parallel offering to HDInsight. The Big Data Platform has numerous advantages. You control the resources directly, you have everything in your own account, and so on.
Regarding pricing, HDInsight is priced per hour. With the Syncfusion Big Data Platform, you’re running a plain virtual machine, and if you have a global license you’re not paying us anything more. It’s all included. You’re not paying a markup to use the services included in the platform.
Is Eclipse or NetBeans better for connecting to the database?
Different IDEs are suitable for different needs, so I don’t have a preference. I use Eclipse for the most part, mainly because I work more with Scala. I use the Scala IDE, which is a packaged version of Eclipse. I also use IntelliJ quite often.
Do you have a list of neural networks you support?
Right now, Spark doesn’t actually support neural networks to my knowledge. Microsoft has the Computational Network Toolkit (CNTK), which supports deep learning, so that may be something that would be suitable. At some point, I expect there will be machine learning support with deep learning on Spark as well. We do, as part of our PMML engine, support neural networks.