The following is an in-depth interview with Succinctly series author Dave Vickers, whose book Hadoop for Windows Succinctly was published recently. You can download the book from our ebook portal.
What should people know about Hadoop for Windows Succinctly? Why is it important?
The book lets people know Hadoop runs perfectly well on both Windows Server and Windows desktop operating systems. Installation on both platforms is covered, including setting up a multi-node cluster on Windows Server. I’ve always felt too many books only cover Hadoop installation and usage on a single machine—real Hadoop must include cluster creation. You cannot ignore the networking component of Hadoop.
I can’t think of another book on Hadoop for Windows—that’s why it’s important. It may be the first book of its kind. You don’t need to move your whole operation to Linux to access Hadoop anymore. You can simply run Hadoop in your Windows environment. With this book, you now have a viable choice.
The choices of Hadoop distributions for Windows are also covered, as well as the Hadoop ecosystem, performance considerations, and Hadoop maintenance.
When did you first become interested in Hadoop for Windows?
I was working for Cooper Gay Swett & Crawford (now Ed Broking), the world’s fifth-largest reinsurance brokerage and the United States’ third-largest wholesaler. Based in the financial district of the City of London, I was responsible for producing worldwide business intelligence. While our transaction data spanned most currencies and languages in the world, we needed a unified environment for our disparate data sources, one coherent view of our data. We didn’t need the headache of a separate Linux-based data environment, especially as our IT infrastructure was outsourced to Navisite.
Hortonworks had perfected Hadoop virtual machines for Windows, and their Hadoop distribution ran via a browser. They’d given Hadoop a colorful and interactive look and feel. It looked more like Windows than a Linux environment. I showed it to the group systems analyst and his reaction was positive, a change from the reaction people give to the black screen of Apache Hadoop. Running some of our data in the Hortonworks environment proved very positive. Our Windows business intelligence servers were not just in the UK, however. Connecting the USA servers to Linux would create added complexity. In addition, the USA servers were not hosted by Navisite. It was then I realized that if Hortonworks could work in Windows, things would be far easier to manage. The seed was sown in my mind.
Luckily, or so I thought, Hortonworks had released the beta of the Hortonworks Data Platform (HDP) for Windows. Because Windows Server was so dominant, it seemed Hadoop would be available to a wider group of people. The positive part was that Microsoft and Hortonworks had worked together on the project, and also it was an enterprise platform and worked on-premises.
Unfortunately, on HDP for Windows, the interactive environment of the Hortonworks Linux distribution wasn’t available. I found that confusing. Why run it on Windows if it doesn’t take advantage of the interactivity and presentation of Windows? Quite simply I preferred the Hortonworks Linux manifestation by far. If you look at the number of people using HDP for Windows, then the number of people using Hortonworks in Linux, you have your answer.
Over time I’ve seen no books by Microsoft or Hortonworks on HDP for Windows. Not a single authoritative text. Rather odd, I thought. They were needed to correct the legacy of some rather unimpressive emulators that ran Hadoop on Windows. The emulators did more harm than good. They reinforced the impression that Hadoop couldn’t run properly on Windows. Over time I think the perception stuck: not for production environments, a bit of a novelty perhaps.
I certainly believed in Hadoop for Windows, but I thought a better manifestation of it was required. One that actually worked in, and interacted with, Microsoft Windows. One that now exists is the Syncfusion Big Data Platform.
By writing this ebook, did you learn anything new yourself?
Yes. The vast amount of time and resources Microsoft put into adapting Windows to run Hadoop. Because of this, you can install the freely available Apache Hadoop onto systems as old as Windows 8, and you can also install it on Windows 10. I demonstrate this in the book. There’s no extra cost to using Hadoop on Windows desktops either. If you own a PC or laptop, Windows is probably already on it. There’s no separate outlay there.
I learned that Azure Data Studio can access Hadoop in real time via a desktop or tablet. The 80-MB installation file masks the power of this light and agile application. In a purely experimental move, I managed to get the Syncfusion Big Data Platform working with Azure Data Studio and Microsoft big data clusters. It highlighted the untapped potential of Hadoop on Windows. To realize this potential, an active Hadoop for Windows user community is imperative.
To encourage user communities to get involved, Microsoft has released tools like SQL Server on Linux. It’s their most successful SQL Server ever. Linux has provided a more active user community even for Microsoft products. It’s for this reason that you can now only create Linux clusters on Microsoft HDInsight; Windows clusters are no longer available. In light of this, Microsoft now releases tools like Azure Data Studio for Linux, Mac, and Windows, and has made them available for free.
It wouldn’t surprise me if Microsoft, having seen the success of SQL Server for Linux (7,000,000 downloads), goes one step further. Could Microsoft release an open source operating system to complement but not replace Windows? Please note that the Windows Subsystem for Linux (WSL) cannot run native Hadoop distributions.
How is Hadoop on Windows likely to change over the next few years?
Marquee applications will be built for Windows. Some use cases will suit the platform more than others. Industries that don’t have a fixation on Hadoop being exclusively for Linux will be at the forefront of this. This includes creative industries. You’ll also see more applications released that include Hadoop support for Windows. You may not see that support in applications for certain sectors such as financial services. In those industries your local applications often have to interact with external exchanges or financial transaction systems. You often hear the phrase “we’re aligning with industry best practice” which can sometimes mean you don’t have a choice. It must be remembered though that Hadoop on Windows is still Hadoop—not an alien system, just a different environment.
The first marquee applications will probably be in video on demand (VOD) and associated broadcast industries. You’ll be able to access masses of subscription and user data within Windows. You’ll no longer need to connect to Hadoop in Linux from Windows. You also won’t be limited to working with data extracts. You’ll be able to connect live. This means you don’t have to run, manage, and pay for two separate computing environments. This is part of the same equation that saw Tableau released on Linux. It’s the same concept but in the other direction. The Linux release of Tableau meant people working with Hadoop could use it in a unified Linux environment. There are also some benefits to be gained on the actual content delivery side. Again, they relate to being able to work in a single unified environment.
As I write this, EE in the UK has already gone live with 5G in certain areas of London. This means Windows-based competitors of tools like NGINX have a lifeline. 5G speeds make streaming unattractive. Why stream a movie for two hours when it could be downloaded in two minutes? Netflix has the advantage while streaming—not downloading—is the default.
With 5G speeds, digital rights management-protected (DRM) videos can be downloaded on an industrial scale from Windows environments. You’ll have both the download system and associated big data system running within Windows. This unified approach will provide seamless real-time information on what users are watching. You’ll also be able to see failed downloads or customer problems instantly.
Hadoop deployments will increase as people realize they can run Hadoop on Windows or Linux. Hadoop will be established as a multi-platform environment. It will no longer be a Linux exclusive. Those who have not embraced Hadoop because of Linux now have an alternative.
Newer products specifically designed to analyze big data will emerge, products like Arcadia Data that can automatically connect to Hadoop in Windows, no drivers required. When software vendors build this functionality into their products, you know things are changing. The presumption that Linux is the only platform for Hadoop is beginning to disappear.
Do you see Hadoop as part of a larger trend in software development?
Yes, but the Hadoop ecosystem may not always be part of development going forward. Data virtualization with Microsoft big data clusters may often be used instead of Sqoop for example. This allows live connectivity to Hadoop, not just data imports or exports. External tools with far greater query speed than Hive also have a part to play. More and more tools will be able to access HDFS directly and connect to Hadoop at a deeper level than Hive.
The base function of storing vast amounts of structured and unstructured data will continue. Better alternatives to the Hadoop ecosystem will leave parts of Hadoop underutilized. This will increase as connectors for Hadoop in Windows are included in programming environments.
What other books or resources on Hadoop do you recommend?
I don’t know of any books dealing with Hadoop for Windows in particular. On a wider note, try and find resources that teach you about Hadoop in a holistic fashion. If you can, spend more time building networks and developing big data solutions than you do reading about them. When you’ve built a network that’s blisteringly quick, you’ll feel very confident about creating Hadoop clusters. A resource I’d recommend is Cumulus Networks’ validated design guide for open networking big data solutions. It’s available at cumulusnetworks.com. You will have to join Cumulus Networks to get the guide. Cumulus Networks was founded by veteran networking engineers from Cisco and VMware. Their approach to Hadoop and networking is a perfect example of a holistic approach, one that includes the economics of Hadoop.
I’m aware many Hadoop users may not have heard of Cumulus Linux for Hadoop. Cumulus Linux runs best on bare metal hardware-accelerated switches. By switching frames and routing packets, they can process network traffic of hundreds of gigabits or terabits per second. As Cumulus Linux is based on open networking, you’re not locked into expensive vendor solutions. You can choose hardware suited to your budget from a large hardware compatibility list. The Facebook-designed Voyager and Minipack switches are Cumulus Linux compatible, and the latter came from work with Edgecore Networks. At this point, we’re talking about 12.8 Tbps commonly presented as 32×400GbE interfaces. As Hadoop is used more for real-time and streaming data than batch processing, full bandwidth availability between any pair of servers is beneficial (from an east-west traffic standpoint). This is one of the many benefits that Cumulus Linux brings to Hadoop.
Good books on networks and Active Directory are resources you’ll also find helpful. Mix that with books on the software/development side of Hadoop. If you can’t find books that deal with the particular type of Hadoop architecture you wish to build, contact companies that build those solutions (e.g., Cumulus Networks). They’ll be far more helpful than you think, especially if you’re a student. If you’re at university and only have a couple of pure programming modules a year, try to compensate for that with far more exposure to coding.
It’s then about putting the different components together to attain a balanced Hadoop knowledge. There isn’t really any single definitive resource on Hadoop that does this. It’s partly why Hadoop certification is refracted. Why be an administrator or a developer? Be both. Why be a networking specialist in isolation from Hadoop? The certification is often tied to a particular Hadoop distribution which has complications of its own. It’s partly why pay on Hadoop contracts is inflated. It’s not easy to find the right staff. To speak plainly, Hadoop in parts is too slow. Often companies want people to speed up operations and make Hadoop work for them. There’s now almost a cottage industry of products to produce faster results from Hadoop.
I would recommend you visit syncfusion.com for resources about Hadoop on Windows.
What are three key pieces of info that developers can take away from Hadoop for Windows Succinctly?
If the network and hardware environment for Hadoop is inadequate, there is little point proceeding with a Hadoop project. The book explains how Hadoop stores data and where the law of diminishing returns sets in: the point at which adding more nodes provides little improvement in processing and is very expensive.
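The diminishing-returns point can be illustrated with a simple scaling model. This is an Amdahl’s-law-style sketch with an invented serial fraction, not a formula from the book:

```python
# Illustrative only: an Amdahl's-law-style model of diminishing returns when
# adding nodes to a cluster. The 5% serial fraction (coordination, shuffle,
# network overhead) is a hypothetical figure, not a measured one.

def speedup(nodes, serial_fraction=0.05):
    """Ideal speedup for a job where 5% of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / nodes)

# Per-node efficiency falls steadily as the cluster grows, while hardware,
# energy, and cooling costs rise linearly with node count.
for n in (4, 8, 16, 32, 64):
    print(f"{n:3d} nodes: speedup {speedup(n):5.2f}, "
          f"efficiency per node {speedup(n) / n:.2f}")
```

With a 5% serial fraction, speedup can never exceed 20x no matter how many nodes you add, which is exactly the cliff the book warns about.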
Both the strengths and weaknesses of Hadoop for Windows are highlighted. There is no sugar coating. The shortcomings of Hadoop itself, regardless of platform, are highlighted too, including the costs associated with deploying Hadoop in a production environment.
When working with third-party tools or programming languages, ensure they’re built specifically to work with big data platforms. Prefer them to tools that were originally built for relational databases and later adapted to big data. The performance difference is very noticeable. Arcadia Data, for example, can query Hadoop at speeds I haven’t seen before.
Do you have any tips for developers trying to learn Hadoop?
Ensure you understand the networking and scaling side of Hadoop. If you can, learn how to build your own network and clusters yourself. You’ll gain an understanding of the costs involved in setting up Hadoop and its ongoing maintenance. Often you can do things much better and far cheaper yourself. This is why I get frustrated at books covering only single-machine Hadoop installation and usage. You’re not learning much about Hadoop! If you can quickly build a test network, one that shows a client what kind of performance is possible, you’re on the right track. In the book I highlight certain servers, switches, network cards, and cabling for this reason. Also, make yourself aware of the energy and cooling costs of your prospective projects.
Hadoop vendors can’t predict your exact network setup. This means their instructions for installing clusters can seem a bit cloudy. Try to understand the capabilities of your switches and the myriad configuration settings. Your software should let you see all nodes in your network at a bare-metal level. If nodes can’t “see” each other, you may need to reinstall and reconfigure network cards, flush DNS settings, and so on. Learn the TCP/IP commands for testing your servers’ ability to connect with other servers. These commands are not Hadoop dependent and can be run from the command prompt on your servers.
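Those connectivity checks can also be scripted. A minimal Python sketch of the same two tests you would otherwise run with nslookup and telnet from the command prompt; the hostnames below are hypothetical examples, and 8020 is a commonly used HDFS NameNode RPC port:

```python
# A sketch of pre-flight connectivity checks between cluster nodes.
# Hostnames are hypothetical; substitute your own machines.
import socket

def check_node(host, port, timeout=2.0):
    """Return whether a hostname resolves and whether a TCP port answers."""
    result = {"host": host, "resolved": False, "reachable": False}
    try:
        addr = socket.gethostbyname(host)   # same job as nslookup
        result["resolved"] = True
    except socket.gaierror:
        return result                       # DNS problem: stop here
    try:
        with socket.create_connection((addr, port), timeout=timeout):
            result["reachable"] = True      # same job as telnet host port
    except OSError:
        pass                                # resolved, but port not answering
    return result

# Example: probe a (hypothetical) NameNode and two DataNodes.
for node in ("namenode1", "datanode1", "datanode2"):
    print(check_node(node, 8020))
```

If `resolved` is false, fix DNS or hosts files before touching Hadoop; if only `reachable` is false, look at firewalls and whether the service is actually listening.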
Armed with your network and installation knowledge, you can tackle the software and development side from an informed perspective. For example, you’ll know from the outset if you have the computing power to ingest large amounts of data. When someone gives you a task, you’ll be ahead of the game. You’ll know the task size your installation can handle and the best route to adding nodes and computing power.
Use Hadoop in ways that are relevant to your sector or industry. If you’re in financial services, you can track share prices right back to their initial public offering (IPO). This could reveal that many brokers are recommending shares trading at up to 10 times less than their IPO value years ago. You’ll get the full picture about equities, not the minimal price rise picture of recent weeks a broker wants you to see. This can provide information leading to better portfolio acquisition choices. Hadoop lets you carry and investigate all that extra data right back to IPO. Use it!
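That IPO-comparison idea can be sketched in a few lines. The tickers and prices below are invented for illustration; a real analysis would query the full price history held in your Hadoop store:

```python
# Hypothetical illustration of comparing current share prices to IPO prices.
# All tickers and prices are invented for the example.

shares = [
    {"ticker": "AAA", "ipo_price": 50.0, "current_price": 4.5},
    {"ticker": "BBB", "ipo_price": 12.0, "current_price": 30.0},
    {"ticker": "CCC", "ipo_price": 80.0, "current_price": 8.0},
]

def below_ipo_multiple(shares, multiple=10.0):
    """Flag shares trading at `multiple` times less than their IPO price."""
    return [s["ticker"] for s in shares
            if s["current_price"] * multiple <= s["ipo_price"]]

print(below_ipo_multiple(shares))  # → ['AAA', 'CCC']
```

The point isn’t the arithmetic, which is trivial; it’s that Hadoop lets you keep the entire price history needed to run it, rather than the few weeks of data a broker might show you.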
If you need information in near-real time, you need to know the right tools to use with Hadoop. You may need to learn Spark alongside Hadoop, and it’s important to remember distributed in-memory tools have additional requirements for RAM. In all cases you must know how much data you want to store and for how long.
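The sizing questions above (how much data, for how long, how much RAM) lend themselves to a back-of-envelope calculation. A minimal sketch, with all figures invented for illustration:

```python
# Back-of-envelope capacity sketch. All figures are invented assumptions
# for illustration, not recommendations from the book.
import math

def storage_needed_tb(daily_ingest_gb, retention_days, replication=3):
    """Raw HDFS capacity needed, allowing for block replication."""
    return daily_ingest_gb * retention_days * replication / 1024

def nodes_for_working_set(working_set_gb, ram_per_node_gb, usable_fraction=0.5):
    """Nodes needed to hold a working set in memory for a tool like Spark,
    assuming only part of each node's RAM is usable for cached data."""
    usable_gb = ram_per_node_gb * usable_fraction
    return math.ceil(working_set_gb / usable_gb)

# 100 GB/day kept for a year, 3x replicated; a 2 TB in-memory working set.
print(storage_needed_tb(100, 365))      # ~107 TB of raw HDFS capacity
print(nodes_for_working_set(2000, 256)) # 16 nodes at 256 GB RAM each
```

Rough numbers like these, worked out before a project starts, are exactly what lets you justify (or rule out) the extra RAM that in-memory tools demand.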
Try not to go straight to Azure or Amazon solutions before learning how to set up Hadoop yourself. Familiarize yourself with Apache Hadoop and various Hadoop distributions. Then you’ll know if Amazon or Azure solutions are right for your requirements. It’s important as these services have a cost, one you’ll have to be able to justify to your directors.
How long have you been working with Hadoop?
Since version 0.19.2, which was released about 10 years ago. Hadoop was far from perfect then. You could lose data from the file system and wonder if you were imagining it.
The McKesson Electronic Staff Record (ESR) system was thought to be the world’s largest HR and payroll system. I was asked to see if we could transfer data from ESR to a medical consultants’ system for which I was the IT consultant. The issue was the size of the ESR system. We’d been told it couldn’t be done. I shared that view the first time I downloaded Hadoop—I couldn’t imagine a package well under 100 MB being of any use. Nevertheless, the ability of Hadoop to store unstructured data was an eye opener. It became clear Hadoop was the missing link. It could handle fairly hefty data extracts with ease. I was able to talk to McKesson knowing the job could be done. I had no doubts anymore. That gives you a head start in any consultation or negotiation. McKesson agreed it could be done. The “no” previously given to some very expensive consultants was turned into a “yes.”
Sadly, it wasn’t a total happy ending. The McKesson ESR ran on the Oracle E-Business Suite, and my lack of familiarity with Hadoop meant that I was learning its capabilities as I worked. The performance of Hadoop compared to Oracle for complex queries was a disappointment. It meant Hadoop was a non-starter. I also couldn’t give any guarantees about data loss from the file system. It was frustrating as I could see the brilliant storage ability of Hadoop and its acceptable speed for simple queries. In the end a custom-built system was developed that transferred extracts from ESR to the medical consultant system at high speed.
That started my journey of looking for tools to get faster results from Hadoop. I feel I’m still on that journey today.
Who are your role models in the developer world?
People who perhaps reflect the way I see computing. I don’t see hardware and networks as separate from software—I see them as harmonious elements.
In driving development of the PC and Mac while developing their respective operating systems, Bill Gates and Steve Jobs are unsurpassed.
The philanthropy of Bill Gates, not just in the USA but in the developing world, is also inspirational.
What were the biggest challenges you faced when learning Hadoop?
Too many books focusing on installing and using Hadoop on a single machine. A shortage of material on enterprise Hadoop installation and cluster creation. Refracted education and certification material, particularly on the networking side.
The shortcomings with Hadoop itself—the slowness of joins, for example. I cover this in the book and suggest alternatives for Hadoop on Windows (big data clusters, etc.) and on Linux (Impala).
I came to Hadoop from a networking and hardware viewpoint first. Yet it seemed to be taught the other way around, which led to installations on unsuited hardware and painfully slow networks.
I suppose you could say putting the hardware/networking side and the software/development side together, that was the challenge.
It’s perhaps the reason that Amazon and Azure are popular. For a cost you can scale your application properly. That said, when those costs get too high people do scale back, and employees notice the difference in the cloud. To quote one infrastructure manager, “You have to convince the board of directors that it’s a strategic investment. It’s not easy as they know it’s a cost that’s always going to increase.”
Costs were another area that was cloudy. You had to do deep-dive investigations yourself. Some small companies with cloud providers had their accounts and data purged after failing to pay hosting costs. Some ISPs and data centers have also gone bust—what happens to your data then? Perhaps having worked for a major insurer it’s the skeptic in me, but the market for insurance products covering service disruption, data loss, and hacking has exploded. In essence, some sound business, insurance, and accounting principles should be applied before ramping up your big data real estate.
Some cloud providers seem a bit pricey but understand these kinds of business issues. They’re often the only credible ones you can pick. The trouble is you have to work this out for yourself. Large multi-national companies live or die by their reputation. They won’t risk it to save a few hundred thousand dollars in hosting each year. You buy cheap, you buy twice.
How quickly can you learn a new language?
Depends on how cool or exciting I think a language is. If something really excites me I’ll stay up all night. It will be the first thing I think of in the morning and the last thing I think of at night.
With so many languages nowadays, some will fall by the wayside. For this reason I choose carefully. I’d say to anyone, learn Scala and Spark. They’re so wide ranging and flexible. Why manage separate live-streaming and query tools when Spark does both brilliantly? It also scales beautifully and can work with big data independent of Hadoop. You will, however, have to know Scala to exploit it fully.
Learning a new language is like learning a musical instrument: three to six months of frustration, then something clicks. You start to identify errors in code. You start thinking, “I wouldn’t do it like that; it’s quicker like this.” At this point you’re perhaps only approaching intermediate level; further time investment depends on the product or language road map.
You can of course take shortcuts. There are people who do know a remarkable number of languages. When I’ve looked at their work closely, I’ve often observed they’re syntax learners. By this I mean they learn the syntax of a new language, then use the same coding techniques they used in other, older languages.
No matter your experience, if you’ve never coded a video streaming application before, you’d probably struggle. So you then have a threshold for what defines learning a language. Is it 60% knowledge? 70%? Then what types of applications have you coded? In the book, I created tables from compressed files. Why? So people would know you don’t have to unzip data files in Hadoop to work with them. I wanted the reader to work with a Hadoop mindset, not use Hadoop the same way they would a relational database.
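That compressed-file point can be illustrated outside Hadoop too. A minimal Python sketch using the standard gzip module (the file name is invented); tools like Hive can read gzipped text files directly in much the same way, decompressing the stream on the fly rather than unzipping to disk first:

```python
# Working with compressed data in place, without a separate unzip step.
# The file name and contents are invented for the example.
import csv
import gzip

# Write a gzipped CSV to simulate a compressed file landing in storage.
with gzip.open("trades.csv.gz", "wt", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["ticker", "price"])
    writer.writerows([["AAA", "4.50"], ["BBB", "30.00"]])

# Query it directly: decompression happens on the fly, in the stream.
with gzip.open("trades.csv.gz", "rt", newline="") as f:
    rows = list(csv.DictReader(f))

print(len(rows), rows[0]["ticker"])  # → 2 AAA
```

The habit to build is the same one the book encourages: treat compressed files as first-class data, not as archives that need unpacking before work can start.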
How do you stay up-to-date on industry news?
Industry, for me, means international drama content delivery and associated technologies. Luckily, the broadcast content and broadcast technology worlds have excellent media publications. There’s a holistic approach whereby content and technology are presented hand in hand when appropriate.
I like to read international publications and Turkey is a very exciting place at the moment. For example, after the USA, Turkey is the second largest exporter of scripted drama in the world. Without great content, you’ll struggle to fund the technology. Turkey got the balance between the two just right.
For industry news you can rely on, you need direct contact with industry players. This can vary from directors at large content distributors, to product managers at telecommunications companies. They’re aware of products long before they’re released, so you should keep in contact long after you’ve worked for them.
The best open-source communities can be fountains of information. You get perspectives from people creating new technologies hands-on.