
Scheduling, Ingesting Data & Hardware

Hi,

We are currently evaluating the different distributions of Hadoop (Cloudera, Hortonworks). The documentation on the Syncfusion distribution is quite sparse. We are looking at replacing a group of 4-5 workstations that do data processing using Base SAS and custom scripts.

All of the processing is automated including:

1. Collecting the raw files via FTP
2. Processing in SAS
3. Analyzing in SAS
4. Pushing into a DB
5. Generating reports


I have a few questions:

1. Is there a command line interface to get data into Hadoop that can be scheduled?
2. Is there a way to schedule tasks on a regular basis?
3. In the hardware FAQ you mention 'new' hardware. Is this a requirement, or can we use our existing workstations? They are quite high spec, with about 8 TB on each with RAID.
4. If you have multiple drives on a workstation without RAID configured, how would you configure Hadoop to make use of all of the drives? What we want is to have the OS drive on an SSD and the data drives be normal large drives.
5. Would it work with an external drive bay?

2 Replies

Issy, August 27, 2015 08:07 PM UTC

Also, would we be able to access Hadoop from non-Windows clients? I see the Studio is Windows-only. We have a mix of Mac, Windows, and Linux clients.


Daniel Jebaraj (Syncfusion Team), August 28, 2015 09:29 PM UTC

Hi Issy,

Thank you for your interest in the Syncfusion Big Data Platform.

Query 1: Is there a command line interface to get data into Hadoop that can be scheduled?

Response: We have several options to achieve this:


- By using Java, we can transfer data from FTP to HDFS directly, and the Java program can then be scheduled on a regular basis using Oozie (a sketch follows below).

- If files are accumulated through streams of activity (such as logging), Flume will be a good choice. We have a special implementation of Flume that we can provide.

- Alternatively, if files from FTP are collected and stored on the local system, the Hadoop command below can be used to copy data from the local file system to HDFS via the command line interface.


Hadoop command line interface directory: C:\Syncfusion\BigDataSDK\<version>\SDK\Hadoop\bin


hdfs dfs -copyFromLocal <local_file_location> <target_hdfs_location>
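As a rough sketch of the Java option, the program below pulls a single file from an FTP server and streams it into HDFS. This is an illustration only: it assumes the Apache Commons Net and Hadoop client libraries are on the classpath, and the host names, credentials, and paths are all placeholders.

import java.io.InputStream;
import java.io.OutputStream;
import org.apache.commons.net.ftp.FTP;
import org.apache.commons.net.ftp.FTPClient;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FtpToHdfs {
    public static void main(String[] args) throws Exception {
        // Connect to the FTP server (host and credentials are placeholders).
        FTPClient ftp = new FTPClient();
        ftp.connect("ftp.example.com");
        ftp.login("user", "password");
        ftp.enterLocalPassiveMode();
        ftp.setFileType(FTP.BINARY_FILE_TYPE);

        // Open the remote file as a stream.
        InputStream in = ftp.retrieveFileStream("/incoming/data.csv");

        // Stream it into HDFS (the name node address is a placeholder).
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
        FileSystem fs = FileSystem.get(conf);
        OutputStream out = fs.create(new Path("/data/raw/data.csv"));
        IOUtils.copyBytes(in, out, 4096, true); // closes both streams

        // Finish the FTP transfer and clean up.
        ftp.completePendingCommand();
        ftp.logout();
        ftp.disconnect();
    }
}

Once compiled into a jar, a step like this is exactly the kind of task that can be wrapped in an Oozie action and run on a schedule.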

Query 2: Is there a way to schedule tasks on a regular basis?

Response: Yes. Hadoop tasks can be scheduled using Oozie on a regular basis. Please refer to the link below to learn about Oozie in detail.

http://oozie.apache.org/

 

We have provided support for Oozie in our platform.

http://helpbdp.syncfusion.com/bigdata/big-data-studio/oozie
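As an illustration of scheduling programmatically, here is a minimal sketch that submits a recurring job through the Oozie Java client API. It assumes an Oozie server URL and a coordinator application already deployed to HDFS (both placeholders); the coordinator.xml stored at that HDFS path is what defines the actual frequency.

import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class SubmitCoordinator {
    public static void main(String[] args) throws Exception {
        // Point the client at the Oozie server (URL is a placeholder).
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        // The coordinator app on HDFS defines the workflow to run
        // and how often it recurs (e.g. daily).
        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.COORDINATOR_APP_PATH,
                "hdfs://namenode-host:9000/apps/daily-ingest");
        conf.setProperty("user.name", "hadoop");

        // Submit and start the recurring job.
        String jobId = oozie.run(conf);
        System.out.println("Submitted coordinator job: " + jobId);
    }
}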

Query 3: In the hardware FAQ you mention 'new' hardware. Is this a requirement, or can we use our existing workstations? They are quite high spec, with about 8 TB on each with RAID.

Response: You can use your existing workstations. HDFS clusters do not benefit from using RAID for data node storage; HDFS handles replication between nodes by itself.

 

Hence, RAID is not recommended on any of the data nodes or client machines when forming a Hadoop cluster. RAID can, however, be used for name nodes.

 

Please refer to the following User Guide link on forming a cluster:

http://helpbdp.syncfusion.com/bigdata/cluster-manager/cluster-creation 

Query 4: How would you configure Hadoop to make use of all of the drives? What we want is to have the OS drive on an SSD and the data drives be normal large drives.

Response: By default, with a Syncfusion cluster, the data node will make use of all fixed-type drives on a machine.


Hadoop data nodes can be configured to use only specific drives by changing the dfs.datanode.data.dir property of the hdfs-site.xml file, in the advanced settings provided in our cluster manager application when creating a cluster.
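For illustration, a restricted configuration might look like the following in hdfs-site.xml, with the OS on the C: drive and data confined to two dedicated drives. The drive letters, directory names, and exact path syntax here are placeholders.

<property>
  <name>dfs.datanode.data.dir</name>
  <!-- Comma-separated list of directories, one per data drive.
       The OS/SSD drive (C:) is simply left out of the list. -->
  <value>/D:/hadoop/data,/E:/hadoop/data</value>
</property>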

Query 5: Would it work with an external drive bay?

Response: Yes. With the Syncfusion cluster manager, data nodes can detect all fixed-type external drives, and they can be used for Hadoop HDFS storage. The only requirement is that the volume be a fixed volume (and not a transient volume).

Query 6: Would we be able to access Hadoop from non-Windows clients?

Response: The Syncfusion Big Data Studio is a Windows-only tool.


However, accessing a Syncfusion Hadoop distribution running on Windows through the native command line interface from non-Windows clients is supported, just as with any other cluster. We can assist with this.

 

Accessing Thrift services such as the Spark and Hive Thrift servers with our solution is platform independent. They can be accessed using the Java Thrift API from non-Windows clients as well.

https://cwiki.apache.org/confluence/display/Hive/HiveClient#HiveClient-Thrift
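For example, here is a minimal sketch of querying a HiveServer2 Thrift endpoint from any platform using the standard Hive JDBC driver (hive-jdbc on the classpath); the host name, port, credentials, and table are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 exposes a Thrift service; the JDBC driver speaks it.
        // Host, port, database, and credentials are placeholders.
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hive-host:10000/default", "hadoop", "");
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT * FROM sample_table LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
        conn.close();
    }
}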

 
The data processing workflow you described can be implemented using the Syncfusion Big Data Platform. The following is a simple prototype for the requirement:
- FTP -> Hadoop: use a Java program (see the sketch under Query 1 above).
- Processing and analyzing data: use Pig, Hive, or Spark scripts in our Big Data Studio to process data in Hadoop.
- Pushing into a DB: using Sqoop, we can import and export processed data with SQL Server, MySQL, or Oracle (a sample command follows this list).
- Scheduling with Oozie: all of these tasks can be scheduled using Oozie on a regular basis.
- Spark also comes with a full machine learning library that can be used to build models. Once a model is built, it can be persisted as PMML. Syncfusion offers a PMML execution engine that can be used within your .NET applications; for other platforms, alternate PMML engines are available.
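To give a feel for the Sqoop step, here is a hedged example of exporting processed results from an HDFS directory into a SQL Server table; the connection string, credentials, table name, and directory are all placeholders.

sqoop export \
  --connect "jdbc:sqlserver://db-host:1433;databaseName=reports" \
  --username dbuser --password dbpass \
  --table daily_results \
  --export-dir /user/hadoop/output/daily_results

A command like this can itself be wrapped in an Oozie action, so the export runs as part of the same scheduled pipeline.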

Please refer to the following User Guide documentation link for more detail about Pig, Hive, Oozie, and Sqoop with our Syncfusion Big Data Platform:
http://helpbdp.syncfusion.com/bigdata/overview

We will be happy to provide a custom demo or assist you with the work to be performed.

Please let us know if you have any further queries on this. We look forward to working with you.
 
Best Regards,
Daniel

